This post contains the latest paper listing fetched from arXiv.org on 2026-03-11. It is updated automatically and organized into six broad areas: NLP, CV, ML, AI, IR, and MA.
Note: paper data is fetched from arXiv.org daily, with an automatic update around 12:30 each morning.
Tip: if a given day is not updated on time, either arXiv published no new papers that day or the update script failed. Fixes are applied the same day whenever possible.
Table of Contents
Overview (2026-03-11)
684 papers were updated today, including:
- Natural Language Processing: 68 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 189 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 161 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 152 papers (Machine Learning (cs.LG))
- Multi-Agent Systems: 14 papers (Multiagent Systems (cs.MA))
- Information Retrieval: 20 papers (Information Retrieval (cs.IR))
- Human-Computer Interaction: 25 papers (Human-Computer Interaction (cs.HC))
Multi-Agent Systems
[MA-0] Emotional Modulation in Swarm Decision Dynamics
【Quick Read】: This paper investigates how emotion shapes consensus formation in collective decision-making, specifically how emotional valence (positive-negative) and arousal (low-high) alter decision outcomes and convergence speed by modulating inter-agent interaction rates. The key to the solution is extending the classical bee equation into an agent-based model in which emotional state acts as a modulating factor that dynamically adjusts the recruitment and cross-inhibition parameters, with emotional contagion modelled visually through simulated facial expressions, revealing how emotional asymmetries and structural tipping points shape collective decision trajectories.
Link: https://arxiv.org/abs/2603.09963
Authors: David Freire-Obregón
Affiliations: SIANI, Universidad de Las Palmas de Gran Canaria, Las Palmas de Gran Canaria, Spain
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: Accepted for presentation at the International Conference on Agents and Artificial Intelligence (ICAART 2026)
Abstract:Collective decision-making in biological and human groups often emerges from simple interaction rules that amplify minor differences into consensus. The bee equation, developed initially to describe nest-site selection in honeybee swarms, captures this dynamic through recruitment and inhibition processes. Here, we extend the bee equation into an agent-based model in which emotional valence (positive-negative) and arousal (low-high) act as modulators of interaction rates, effectively altering the recruitment and cross-inhibition parameters. Agents display simulated facial expressions mapped from their valence-arousal states, allowing the study of emotional contagion in consensus formation. Three scenarios are explored: (1) the joint effect of valence and arousal on consensus outcomes and speed, (2) the role of arousal in breaking ties when valence is matched, and (3) the “snowball effect” in which consensus accelerates after surpassing intermediate support thresholds. Results show that emotional modulation can bias decision outcomes and alter convergence times by shifting effective recruitment and inhibition rates. At the same time, intrinsic non-linear amplification can produce decisive wins even in fully symmetric emotional conditions. These findings link classical swarm decision theory with affective and social modelling, highlighting how both emotional asymmetries and structural tipping points shape collective outcomes. The proposed framework offers a flexible tool for studying the emotional dimensions of collective choice in both natural and artificial systems.
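The recruitment/cross-inhibition dynamics described above can be sketched numerically. The paper's exact equations and modulation scheme are not given in the abstract, so the following is a generic two-option bee-equation sketch with a hypothetical multiplicative valence/arousal factor on the recruitment rates:

```python
def emotion_factor(valence, arousal, k=0.5):
    # Hypothetical modulation: valence and arousal in [-1, 1] scale a base
    # interaction rate multiplicatively (not the paper's exact form).
    return (1.0 + k * valence) * (1.0 + k * arousal)

def simulate_bees(r_a, r_b, sigma=2.0, steps=2000, dt=0.01):
    """Euler-integrate a two-option bee equation: uncommitted agents are
    recruited to options A/B at rates r_a, r_b, and committed agents
    cross-inhibit each other at rate sigma."""
    x_a, x_b = 0.05, 0.05                  # initial committed fractions
    for _ in range(steps):
        u = 1.0 - x_a - x_b                # uncommitted fraction
        dx_a = r_a * u * x_a - sigma * x_a * x_b
        dx_b = r_b * u * x_b - sigma * x_a * x_b
        x_a += dt * dx_a
        x_b += dt * dx_b
    return x_a, x_b

# An emotionally "hotter" option recruits faster and wins consensus.
r_a = 1.0 * emotion_factor(valence=0.5, arousal=0.5)
r_b = 1.0 * emotion_factor(valence=0.0, arousal=0.0)
x_a, x_b = simulate_bees(r_a, r_b)
```

Note that in this deterministic sketch, symmetric rates stay exactly tied; the paper's agent-based version adds stochasticity, which is what lets intrinsic non-linear amplification produce decisive wins even under symmetric emotional conditions.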
[MA-1] Influencing LLM Multi-Agent Dialogue via Policy-Parameterized Prompts
【Quick Read】: This paper addresses the fact that agent behaviour in current LLM-based multi-agent systems relies on ad hoc prompts and lacks principled policy-level modelling; existing methods cannot steer conversational behaviour from a policy perspective, limiting the predictability and controllability of multi-agent systems in settings such as social simulation. The key to the solution is a "prompt-as-action" framework that parameterizes prompts into a lightweight policy consisting of state-action pairs, dynamically constructing prompts from five components to influence the dialogue flow, thereby controlling multi-agent conversational behaviour without any model training. Experiments show that this mechanism significantly modulates dialogue quality along five dimensions including responsiveness, rebuttal, and evidence usage, offering a new controllable path for social-simulation research on multi-agent systems.
Link: https://arxiv.org/abs/2603.09890
Authors: Hongbo Bo, Jingyu Hu, Weiru Liu
Affiliations: University of Bristol
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:Large Language Models (LLMs) have emerged as a new paradigm for multi-agent systems. However, existing research on the behaviour of LLM-based multi-agents relies on ad hoc prompts and lacks a principled policy perspective. Different from reinforcement learning, we investigate whether prompt-as-action can be parameterized so as to construct a lightweight policy which consists of a sequence of state-action pairs to influence conversational behaviours without training. Our framework regards prompts as actions executed by LLMs, and dynamically constructs prompts through five components based on the current state of the agent. To test the effectiveness of parameterized control, we evaluated the dialogue flow based on five indicators: responsiveness, rebuttal, evidence usage, non-repetition, and stance shift. We conduct experiments using different LLM-driven agents in two discussion scenarios related to the general public and show that prompt parameterization can influence the dialogue dynamics. This result shows that policy-parameterised prompts offer a simple and effective mechanism to influence the dialogue process, which will help the research of multi-agent systems in the direction of social simulation.
[MA-2] The Bureaucracy of Speed: Structural Equivalence Between Memory Consistency Models and Multi-Agent Authorization Revocation
【Quick Read】: This paper addresses the breakdown of the temporal assumptions behind conventional Identity and Access Management (IAM) under agentic execution, where a large burst of unauthorized API calls within a short revocation window can open a security hole. The core challenge is that with traditional time-window controls (such as TTLs), exposure grows linearly with agent execution velocity, making fine-grained capabilities hard to secure. The key to the solution is a Capability Coherence System (CCS) with an execution-count Release Consistency-directed Coherence (RCC) strategy: a state mapping φ : Σ_MESI → Σ_auth preserves transition structure under bounded-staleness semantics, yielding a strict bound of D_rcc ≤ n unauthorized operations per capability that is independent of agent velocity v, a qualitative improvement over the O(v·TTL) linear scaling of time-bounded strategies. Simulations confirm the guarantee: RCC achieves 120× and 184× reductions in unauthorized operations versus TTL-based leases in the high-velocity and anomaly-triggered-revocation scenarios respectively, with zero bound violations across all 120 runs.
Link: https://arxiv.org/abs/2603.09875
Authors: Vladyslav Parakhin
Affiliations: Okta
Subjects: Multiagent Systems (cs.MA); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 18 pages, 3 figures. Simulation code at this https URL
Abstract:The temporal assumptions underpinning conventional Identity and Access Management collapse under agentic execution regimes. A sixty-second revocation window permits on the order of 6×10^3 unauthorized API calls at 100 ops/tick; at AWS Lambda scale, the figure approaches 6×10^5. This is a coherence problem, not merely a latency problem. We define a Capability Coherence System (CCS) and construct a state-mapping φ : Σ_MESI → Σ_auth preserving transition structure under bounded-staleness semantics. A safety theorem bounds unauthorized operations for the execution-count Release Consistency-directed Coherence (RCC) strategy at D_rcc ≤ n, independent of agent velocity v, a qualitative departure from the O(v·TTL) scaling of time-bounded strategies. Tick-based discrete event simulation across three business-contextualised scenarios (four strategies, ten deterministic seeds each) confirms: RCC achieves a 120× reduction versus TTL-based lease in the high-velocity scenario (50 vs. 6,000 unauthorized operations), and 184× under anomaly-triggered revocation. Zero bound violations across all 120 runs confirm the per-capability safety guarantee. Simulation code: this https URL
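The D_rcc ≤ n versus O(v·TTL) contrast can be reproduced with a toy tick simulation. This is our own simplification, not the paper's CCS model, and the parameter names are illustrative:

```python
def unauthorized_after_revocation(velocity, ttl=60, n=50, strategy="ttl"):
    """Count operations executed against a capability after it is revoked
    at tick 0.  'ttl': the agent honours a cached grant until the lease
    expires, so exposure scales as velocity * ttl.  'rcc': the capability
    is re-validated after at most n operations, so exposure is capped at
    n regardless of velocity."""
    unauthorized = 0
    ops_since_check = 0
    authorized = True              # stale view: revocation not yet observed
    for tick in range(ttl + 1):
        for _ in range(velocity):
            if strategy == "ttl":
                if tick < ttl:             # lease still live
                    unauthorized += 1
            else:                          # execution-count RCC
                if not authorized:
                    return unauthorized
                unauthorized += 1
                ops_since_check += 1
                if ops_since_check >= n:   # forced re-validation
                    authorized = False     # revocation now observed
    return unauthorized
```

The abstract's headline numbers fall out directly: 6,000 unauthorized operations for the TTL lease at 100 ops/tick with a 60-tick window, versus 50 for RCC with n = 50, and the RCC count is unchanged even at 10× the velocity.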
[MA-3] FetalAgents: A Multi-Agent System for Fetal Ultrasound Image and Video Analysis
【Quick Read】: This paper targets two core challenges in automated fetal ultrasound (US) analysis: existing deep learning models struggle to balance task-specific accuracy with whole-workflow versatility, and current tools cannot support the end-to-end clinical pipeline from video streams to structured reports. The key to the solution is FetalAgents, the first multi-agent system for comprehensive fetal US analysis: a lightweight agentic coordination framework dynamically orchestrates specialized vision experts to maximize performance on diagnosis, measurement, and segmentation. Going beyond static image analysis, the system also supports end-to-end video stream summarization, automatically identifying keyframes across multiple anatomical planes, analyzing them with coordinated experts, and fusing patient metadata into a structured clinical report, yielding an auditable, workflow-aligned solution.
Link: https://arxiv.org/abs/2603.09733
Authors: Xiaotian Hu, Junwei Huang, Mingxuan Liu, Kasidit Anmahapong, Yifei Chen, Yitong Luo, Yiming Huang, Xuguang Bai, Zihan Li, Yi Liao, Haibo Qu, Qiyuan Tian
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments:
Abstract:Fetal ultrasound (US) is the primary imaging modality for prenatal screening, yet its interpretation relies heavily on the expertise of the clinician. Despite advances in deep learning and foundation models, existing automated tools for fetal US analysis struggle to balance task-specific accuracy with the whole-process versatility required to support end-to-end clinical workflows. To address these limitations, we propose FetalAgents, the first multi-agent system for comprehensive fetal US analysis. Through a lightweight, agentic coordination framework, FetalAgents dynamically orchestrates specialized vision experts to maximize performance across diagnosis, measurement, and segmentation. Furthermore, FetalAgents advances beyond static image analysis by supporting end-to-end video stream summarization, where keyframes are automatically identified across multiple anatomical planes, analyzed by coordinated experts, and synthesized with patient metadata into a structured clinical report. Extensive multi-center external evaluations across eight clinical tasks demonstrate that FetalAgents consistently delivers the most robust and accurate performance when compared against specialized models and multimodal large language models (MLLMs), ultimately providing an auditable, workflow-aligned solution for fetal ultrasound analysis and reporting.
[MA-4] Context Engineering: From Prompts to Corporate Multi-Agent Architecture
【Quick Read】: This paper addresses the fact that as generative AI systems evolve from single-turn interactions to autonomous multi-step agents, prompt engineering (PE) alone can no longer support complex decision-making; the core challenge is how to systematically build scalable, controllable agent architectures aligned with organizational goals. The key to the solution is establishing context engineering (CE) as a standalone discipline, defining five context quality criteria (relevance, sufficiency, isolation, economy, and provenance) and framing context as the agent's operating system. Two higher-order disciplines follow: intent engineering (IE) embeds organizational goals and trade-off hierarchies, and specification engineering (SE) makes corporate policies machine-readable. Together they form a cumulative pyramid maturity model of agent engineering that addresses the behavioural drift and execution failures organizations encounter when scaling enterprise deployments.
Link: https://arxiv.org/abs/2603.09619
Authors: Vera V. Vishnyakova
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 15 pages, 1 figure
Abstract:As artificial intelligence (AI) systems evolve from stateless chatbots to autonomous multi-step agents, prompt engineering (PE), the discipline of crafting individual queries, proves necessary but insufficient. This paper introduces context engineering (CE) as a standalone discipline concerned with designing, structuring, and managing the entire informational environment in which an AI agent makes decisions. Drawing on vendor architectures (Google ADK, Anthropic, LangChain), current academic work (ACE framework, Google DeepMind’s intelligent delegation), enterprise research (Deloitte, 2026; KPMG, 2026), and the author’s experience building a multi-agent system, the paper proposes five context quality criteria: relevance, sufficiency, isolation, economy, and provenance, and frames context as the agent’s operating system. Two higher-order disciplines follow. Intent engineering (IE) encodes organizational goals, values, and trade-off hierarchies into agent infrastructure. Specification engineering (SE) creates a machine-readable corpus of corporate policies and standards enabling autonomous operation of multi-agent systems at scale. Together these four disciplines form a cumulative pyramid maturity model of agent engineering, in which each level subsumes the previous one as a necessary foundation. Enterprise data reveals a gap: while 75% of enterprises plan agentic AI deployment within two years (Deloitte, 2026), deployment has surged and retreated as organizations confront scaling complexity (KPMG, 2026). The Klarna case illustrates a dual deficit, contextual and intentional. Whoever controls the agent’s context controls its behavior; whoever controls its intent controls its strategy; whoever controls its specifications controls its scale.
[MA-5] ToolRosetta: Bridging Open-Source Repositories and Large Language Model Agents through Automated Tool Standardization
【Quick Read】: This paper addresses the high cost and unreliability of reusing and invoking existing code: heterogeneous repositories lack standardized executable interfaces, and current LLM-based tool-invocation frameworks depend on manual curation and standardization, which does not scale. The key to the solution is the ToolRosetta framework, which automatically converts open-source repositories and APIs into executable services compatible with the Model Context Protocol (MCP); combined with autonomous toolchain planning, relevant-codebase identification, and a security inspection layer, it enables end-to-end execution from user task to completion, markedly reducing the need for human intervention while improving task completion and safety.
Link: https://arxiv.org/abs/2603.09290
Authors: Shimin Di, Xujie Yuan, Hanghui Guo, Chaoqian Ouyang, Zhangze Chen, Ling Yue, Libin Zheng, Jia Zhu, Shaowu Pan, Jian Yin, Min-Ling Zhang, Yong Rui
Affiliations: Southeast University; Sun Yat-sen University; Zhejiang Normal University; Rensselaer Polytechnic Institute; Lenovo
Subjects: Software Engineering (cs.SE); Computational Engineering, Finance, and Science (cs.CE); Multiagent Systems (cs.MA)
Comments: 20 pages
Abstract:Reusing and invoking existing code remains costly and unreliable, as most practical tools are embedded in heterogeneous code repositories and lack standardized, executable interfaces. Although large language models (LLMs) and Model Context Protocol (MCP)-based tool invocation frameworks enable natural language task execution, current approaches rely heavily on manual tool curation and standardization, which fundamentally limits scalability. In this paper, we propose ToolRosetta, a unified framework that automatically translates open-source code repositories and APIs into MCP-compatible tools that can be reliably invoked by LLMs. Given a user task, ToolRosetta autonomously plans toolchains, identifies relevant codebases, and converts them into executable MCP services, enabling end-to-end task completion with minimal human intervention. In addition, ToolRosetta incorporates a security inspection layer to mitigate risks inherent in executing arbitrary code. Extensive experiments across diverse scientific domains demonstrate that ToolRosetta can automatically standardize a large number of open-source tools and reduce the human effort required for code reproduction and deployment. Notably, by seamlessly leveraging specialized open-source tools, ToolRosetta-powered agents consistently improve task completion performance compared to commercial LLMs and existing agent systems.
[MA-6] Strategically Robust Multi-Agent Reinforcement Learning with Linear Function Approximation
【Quick Read】: This paper addresses the lack of provably efficient and robust Nash equilibrium computation in general-sum Markov games: Nash equilibria are computationally intractable in general and brittle due to equilibrium multiplicity and sensitivity to approximation error. The authors therefore study the Risk-Sensitive Quantal Response Equilibrium (RQRE), which admits a unique, smooth solution under bounded rationality and risk sensitivity. The key to the solution is RQRE-OVI, an optimistic value iteration algorithm with linear function approximation that computes RQRE efficiently in large or continuous state spaces. Finite-sample regret analysis reveals a quantitative tradeoff between rationality and risk sensitivity: greater rationality tightens regret, while risk sensitivity acts as regularization that improves stability and robustness. Moreover, the RQRE policy map is Lipschitz continuous in estimated payoffs, unlike Nash, and admits a distributionally robust optimization interpretation. Experiments show that RQRE-OVI matches Nash-based approaches under self-play but is substantially more robust under cross-play, demonstrating better robustness and generalization.
Link: https://arxiv.org/abs/2603.09208
Authors: Jake Gonzales, Max Horwitz, Eric Mazumdar, Lillian J. Ratliff
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments:
Abstract:Provably efficient and robust equilibrium computation in general-sum Markov games remains a core challenge in multi-agent reinforcement learning. Nash equilibrium is computationally intractable in general and brittle due to equilibrium multiplicity and sensitivity to approximation error. We study Risk-Sensitive Quantal Response Equilibrium (RQRE), which yields a unique, smooth solution under bounded rationality and risk sensitivity. We propose RQRE-OVI, an optimistic value iteration algorithm for computing RQRE with linear function approximation in large or continuous state spaces. Through finite-sample regret analysis, we establish convergence and explicitly characterize how sample complexity scales with rationality and risk-sensitivity parameters. The regret bounds reveal a quantitative tradeoff: increasing rationality tightens regret, while risk sensitivity induces regularization that enhances stability and robustness. This exposes a Pareto frontier between expected performance and robustness, with Nash recovered in the limit of perfect rationality and risk neutrality. We further show that the RQRE policy map is Lipschitz continuous in estimated payoffs, unlike Nash, and RQRE admits a distributionally robust optimization interpretation. Empirically, we demonstrate that RQRE-OVI achieves competitive performance under self-play while producing substantially more robust behavior under cross-play compared to Nash-based approaches. These results suggest RQRE-OVI offers a principled, scalable, and tunable path for equilibrium learning with improved robustness and generalization.
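A minimal flavour of quantal response with an entropic risk adjustment can be written for a 2×2 one-shot game. This is our own sketch: the paper's RQRE-OVI additionally handles Markov games with linear function approximation and optimism, and its exact risk operator may differ:

```python
import math

def softmax(values, lam):
    m = max(values)
    exps = [math.exp(lam * (v - m)) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

def rqre_2x2(A, B, lam=3.0, beta=0.0, iters=500):
    """Fixed-point iteration for a logit quantal response equilibrium in a
    2x2 game, with an optional entropic risk adjustment.

    A[i][j], B[i][j]: payoffs to players 1 and 2 for actions (i, j).
    lam: rationality (lam -> inf approaches best response).
    beta: risk sensitivity; beta > 0 values a random payoff U at
    -(1/beta) * log E[exp(-beta * U)] instead of E[U].
    """
    def value(payoffs, probs):
        if beta <= 0.0:                        # risk-neutral expectation
            return sum(p * u for p, u in zip(probs, payoffs))
        z = sum(p * math.exp(-beta * u) for p, u in zip(probs, payoffs))
        return -math.log(z) / beta             # entropic risk value
    p, q = [0.5, 0.5], [0.5, 0.5]              # mixed strategies
    for _ in range(iters):
        u1 = [value(A[i], q) for i in range(2)]
        u2 = [value([B[0][j], B[1][j]], p) for j in range(2)]
        p, q = softmax(u1, lam), softmax(u2, lam)
    return p, q
```

Raising lam sharpens the distribution toward the best response, mirroring the abstract's observation that Nash is recovered in the limit of perfect rationality and risk neutrality.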
[MA-7] AgenticCyOps: Securing Multi-Agentic AI Integration in Enterprise Cyber Operations
【Quick Read】: This paper addresses the new attack surfaces introduced when agents in enterprise multi-agent systems (MAS) gain autonomous control over tools, memory, and communication; existing work focuses on prompt-level exploits and narrow individual vectors and lacks a holistic architectural model for enterprise-grade security. The key to the solution is the AgenticCyOps framework, which systematically decomposes the attack surface into component, coordination, and protocol layers, identifies tool orchestration and memory management as the core integration surfaces, and accordingly defines five defensive principles: authorized interfaces, capability scoping, verified execution, memory integrity synchronization, and access-controlled data isolation, all aligned with compliance standards such as NIST, ISO 27001, GDPR, and the EU AI Act. Applied to a SOC workflow structured around the Model Context Protocol (MCP), with phase-scoped agents, consensus validation loops, and per-organization memory boundaries, the design reduces exploitable trust boundaries by at least 72% and intercepts three representative attack chains within the first two steps, forming a defense-in-depth architecture.
Link: https://arxiv.org/abs/2603.09134
Authors: Shaswata Mitra, Raj Patel, Sudip Mittal, Md Rayhanur Rahman, Shahram Rahimi
Affiliations: The University of Alabama
Subjects: Cryptography and Security (cs.CR); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments: 17 pages, 4 figures, 5 tables
Abstract:Multi-agent systems (MAS) powered by LLMs promise adaptive, reasoning-driven enterprise workflows, yet granting agents autonomous control over tools, memory, and communication introduces attack surfaces absent from deterministic pipelines. While current research largely addresses prompt-level exploits and narrow individual vectors, it lacks a holistic architectural model for enterprise-grade security. We introduce AgenticCyOps (Securing Multi-Agentic AI Integration in Enterprise Cyber Operations), a framework built on a systematic decomposition of attack surfaces across component, coordination, and protocol layers, revealing that documented vectors consistently trace back to two integration surfaces: tool orchestration and memory management. Building on this observation, we formalize these integration surfaces as primary trust boundaries and define five defensive principles: authorized interfaces, capability scoping, verified execution, memory integrity synchronization, and access-controlled data isolation; each aligned with established compliance standards (NIST, ISO 27001, GDPR, EU AI Act). We apply the framework to a Security Operations Center (SOC) workflow, adopting the Model Context Protocol (MCP) as the structural basis, with phase-scoped agents, consensus validation loops, and per-organization memory boundaries. Coverage analysis, attack path tracing, and trust boundary assessment confirm that the design addresses the documented attack vectors with defense-in-depth, intercepts three of four representative attack chains within the first two steps, and reduces exploitable trust boundaries by a minimum of 72% compared to a flat MAS, positioning AgenticCyOps as a foundation for securing enterprise-grade integration.
[MA-8] Chaotic Dynamics in Multi-LLM Deliberation
【Quick Read】: This paper addresses the poorly characterized stability of multi-LLM committees under repeated execution, in particular how sensitive their decision trajectories are to initial conditions. The key to the solution is modelling five-agent LLM committees as random dynamical systems and measuring the divergence of committee mean-preference trajectories with an empirical Lyapunov exponent (λ̂), revealing how structural design choices (role assignment and model heterogeneity) drive instability independently and in interaction. The study finds that even in the nominally deterministic T=0 regime, both role differentiation and model heterogeneity can induce instability, while adding a Chair role or shortening the memory window markedly reduces divergence, indicating that stability auditing should be a core design requirement for multi-LLM governance systems.
Link: https://arxiv.org/abs/2603.09127
Authors: Hajime Shimao, Warut Khern-am-nuai, Sung Joo Kim
Affiliations: The Pennsylvania State University; McGill University; American University
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Main text: 6 pages, 4 figures; Supplementary Information: 14 pages, 7 supplementary figures
Abstract:Collective AI systems increasingly rely on multi-LLM deliberation, but their stability under repeated execution remains poorly characterized. We model five-agent LLM committees as random dynamical systems and quantify inter-run sensitivity using an empirical Lyapunov exponent (λ̂) derived from trajectory divergence in committee mean preferences. Across 12 policy scenarios, a factorial design at T=0 identifies two independent routes to instability: role differentiation in homogeneous committees and model heterogeneity in no-role committees. Critically, these effects appear even in the T=0 regime where practitioners often expect deterministic behavior. In the HL-01 benchmark, both routes produce elevated divergence (λ̂ = 0.0541 and 0.0947, respectively), while homogeneous no-role committees also remain in a positive-divergence regime (λ̂ = 0.0221). The combined mixed+roles condition is less unstable than mixed+no-role (λ̂ = 0.0519 vs 0.0947), showing non-additive interaction. Mechanistically, Chair-role ablation reduces λ̂ most strongly, and targeted protocol variants that shorten memory windows further attenuate divergence. These results support stability auditing as a core design requirement for multi-LLM governance systems.
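The λ̂ estimator can be sketched concretely. The paper's exact construction is not spelled out in the abstract, so the following uses the standard divergence-slope definition on two runs of the same committee:

```python
import math

def empirical_lyapunov(traj_a, traj_b, eps=1e-12):
    """Estimate an empirical Lyapunov exponent from two runs of the same
    committee: the mean one-step log growth rate of the distance between
    the two mean-preference trajectories.  Positive values indicate
    inter-run divergence; negative values indicate convergence."""
    assert len(traj_a) == len(traj_b) and len(traj_a) >= 2
    d = [abs(a - b) + eps for a, b in zip(traj_a, traj_b)]  # eps avoids log(0)
    steps = len(d) - 1
    return sum(math.log(d[t + 1] / d[t]) for t in range(steps)) / steps
```

On synthetic trajectories whose separation grows as e^(0.1·t), the estimator recovers λ̂ ≈ 0.1, and it turns negative for converging runs.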
[MA-9] Latent World Models for Automated Driving: A Unified Taxonomy, Evaluation Framework, and Open Challenges
【Quick Read】: This paper addresses the poor scalability, weak generalization, and deployment inefficiency caused by fragmented perception, decision, and control modules in automated driving. The core challenge is building a unified, efficient latent-space framework that combines the strengths of generative world models and vision-language-action (VLA) systems: compressing multi-sensor information, simulating with temporal coherence, and supporting controllable generation for planning and reasoning. The key to the solution is a taxonomy organized by the target and form of latent representations (latent worlds, latent actions, latent generators; continuous states, discrete tokens, and hybrids) and by geometric, topological, and semantic structural priors, distilled into five cross-cutting internal mechanics: structural isomorphism, long-horizon temporal stability, semantic and reasoning alignment, value-aligned objectives and post-training strategies, and adaptive computation and deliberation. Together these designs improve robustness, generalization, and deployability, while a closed-loop metric suite and a resource-aware deliberation cost help narrow the gap between open-loop and closed-loop performance.
Link: https://arxiv.org/abs/2603.09086
Authors: Rongxiang Zeng, Yongqi Dong
Affiliations: RWTH Aachen University; Delft University of Technology
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: 17 pages, 6 figures, under review by IEEE Transactions on Intelligent Transportation Systems (IEEE-T-ITS)
Abstract:Emerging generative world models and vision-language-action (VLA) systems are rapidly reshaping automated driving by enabling scalable simulation, long-horizon forecasting, and capability-rich decision making. Across these directions, latent representations serve as the central computational substrate: they compress high-dimensional multi-sensor observations, enable temporally coherent rollouts, and provide interfaces for planning, reasoning, and controllable generation. This paper proposes a unifying latent-space framework that synthesizes recent progress in world models for automated driving. The framework organizes the design space by the target and form of latent representations (latent worlds, latent actions, latent generators; continuous states, discrete tokens, and hybrids) and by structural priors for geometry, topology, and semantics. Building on this taxonomy, the paper articulates five cross-cutting internal mechanics (i.e., structural isomorphism, long-horizon temporal stability, semantic and reasoning alignment, value-aligned objectives and post-training, as well as adaptive computation and deliberation) and connects these design choices to robustness, generalization, and deployability. The work also proposes concrete evaluation prescriptions, including a closed-loop metric suite and a resource-aware deliberation cost, designed to reduce the open-loop / closed-loop mismatch. Finally, the paper identifies actionable research directions toward advancing latent world models for decision-ready, verifiable, and resource-efficient automated driving.
[MA-10] LDP: An Identity-Aware Protocol for Multi-Agent LLM Systems
【Quick Read】: This paper addresses the fact that current multi-agent communication protocols (such as A2A and MCP) do not expose model-level properties as first-class primitives, limiting effective delegation: without explicit model identity, reasoning profiles, quality calibration, and cost characteristics, task routing is inefficient, context management is ad hoc, and trust is unaccounted for. The key to the solution is the LLM Delegate Protocol (LDP), an AI-native communication protocol built on five mechanisms: (1) rich delegate identity cards with quality hints and reasoning profiles; (2) progressive payload modes with negotiation and fallback; (3) governed sessions with persistent context; (4) structured provenance tracking confidence and verification status; and (5) trust domains enforcing security boundaries at the protocol level. Empirically, the protocol substantially improves task routing (~12x lower latency on easy tasks), reduces token consumption (37% smaller semantic-frame payloads), eliminates session overhead (39% less at 10 rounds), and strengthens attack detection and failure recovery, demonstrating that AI-native protocol primitives make delegation more efficient and governable.
Link: https://arxiv.org/abs/2603.08852
Authors: Sunil Prakash
Affiliations: Indian School of Business
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments: 16 pages, 9 figures, 8 tables, 4 appendices
Abstract:As multi-agent AI systems grow in complexity, the protocols connecting them constrain their capabilities. Current protocols such as A2A and MCP do not expose model-level properties as first-class primitives, ignoring properties fundamental to effective delegation: model identity, reasoning profile, quality calibration, and cost characteristics. We present the LLM Delegate Protocol (LDP), an AI-native communication protocol introducing five mechanisms: (1) rich delegate identity cards with quality hints and reasoning profiles; (2) progressive payload modes with negotiation and fallback; (3) governed sessions with persistent context; (4) structured provenance tracking confidence and verification status; (5) trust domains enforcing security boundaries at the protocol level. We implement LDP as a plugin for the JamJet agent runtime and evaluate against A2A and random baselines using local Ollama models and LLM-as-judge evaluation. Identity-aware routing achieves ~12x lower latency on easy tasks through delegate specialization, though it does not improve aggregate quality in our small delegate pool; semantic frame payloads reduce token count by 37% (p=0.031) with no observed quality loss; governed sessions eliminate 39% token overhead at 10 rounds; and noisy provenance degrades synthesis quality below the no-provenance baseline, arguing that confidence metadata is harmful without verification. Simulated analyses show architectural advantages in attack detection (96% vs. 6%) and failure recovery (100% vs. 35% completion). This paper contributes a protocol design, reference implementation, and initial evidence that AI-native protocol primitives enable more efficient and governable delegation.
[MA-11] Scale-Plan: Scalable Language-Enabled Task Planning for Heterogeneous Multi-Robot Teams
【Quick Read】: This paper addresses long-horizon task planning for heterogeneous multi-robot systems: how to efficiently extract task-relevant content from large volumes of perceptual information so that planning remains scalable and reliable. Traditional symbolic planners rely on manually constructed problem specifications and struggle in complex environments, while LLM-based methods often hallucinate and lack grounding, producing plans misaligned with actual environmental objects and constraints. The key to the solution is the Scale-Plan framework, which uses LLM assistance to build compact, task-relevant PDDL problem representations: it first constructs an action graph from the PDDL domain specification, then uses shallow LLM reasoning to guide a structured graph search that identifies a minimal subset of relevant actions and objects. This pre-filtering substantially reduces irrelevant information, enabling efficient decomposition, allocation, and long-horizon plan generation.
Link: https://arxiv.org/abs/2603.08814
Authors: Piyush Gupta, Sangjae Bae, Jiachen Li, David Isele
Affiliations: Honda Research Institute; University of California, Riverside
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)
Comments:
Abstract:Long-horizon task planning for heterogeneous multi-robot systems is essential for deploying collaborative teams in real-world environments; yet, it remains challenging due to the large volume of perceptual information, much of which is irrelevant to task objectives and burdens planning. Traditional symbolic planners rely on manually constructed problem specifications, limiting scalability and adaptability, while recent large language model (LLM)-based approaches often suffer from hallucinations and weak grounding (i.e., poor alignment between generated plans and actual environmental objects and constraints) in object-rich settings. We present Scale-Plan, a scalable LLM-assisted framework that generates compact, task-relevant problem representations from natural language instructions. Given a PDDL domain specification, Scale-Plan constructs an action graph capturing domain structure and uses shallow LLM reasoning to guide a structured graph search that identifies a minimal subset of relevant actions and objects. By filtering irrelevant information prior to planning, Scale-Plan enables efficient decomposition, allocation, and long-horizon plan generation. We evaluate our approach on complex multi-agent tasks and introduce MAT2-THOR, a cleaned benchmark built on AI2-THOR for reliable evaluation of multi-robot planning systems. Scale-Plan outperforms pure LLM and hybrid LLM-PDDL baselines across all metrics, improving scalability and reliability.
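The "minimal relevant subset" idea can be illustrated with a plain backward relevance search over a PDDL-style action graph. This is our own illustration; Scale-Plan additionally uses shallow LLM reasoning to guide the search, and its graph structure is richer than the toy predicate sets here:

```python
from collections import deque

def relevant_actions(actions, goal_predicates):
    """Backward relevance search: keep any action whose effects produce a
    needed predicate, then mark that action's preconditions as needed and
    continue until no new predicates appear.

    actions: {name: (preconditions, effects)}, both sets of predicate names.
    Returns the set of action names relevant to the goal.
    """
    needed = set(goal_predicates)
    kept = set()
    frontier = deque(goal_predicates)
    while frontier:
        pred = frontier.popleft()
        for name, (pre, eff) in actions.items():
            if pred in eff and name not in kept:
                kept.add(name)                 # action is relevant
                for p in pre:
                    if p not in needed:        # its preconditions become subgoals
                        needed.add(p)
                        frontier.append(p)
    return kept
```

On a toy domain, an action chain move → pick → place is kept for an `on_table` goal while an unrelated `toggle_lights` action is pruned, which is exactly the kind of filtering that shrinks the problem before the planner runs.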
[MA-12] Electoral Systems Simulator: An Open Framework for Comparing Electoral Mechanisms Across Voter Distribution Scenarios
【Quick Read】: This paper addresses the problem of systematically simulating and comparing the performance of electoral systems under diverse voter preference distributions; the core challenge is the lack of a unified framework for fairly evaluating multiple mechanisms (plurality, ranked-choice, approval, score, Condorcet rules, and two proportional representation systems) in realistic political scenarios. The key to the solution is electoral_sim, an open-source Python framework that models voters and candidates as points in a two-dimensional ideological space, derives sincere ballot profiles from Euclidean distances, and evaluates every mechanism against a common metric: the Euclidean distance between the electoral outcome and the geometric median of the voter distribution. By testing realistic scenarios ranging from unimodal consensus to bimodal polarization over 200 Monte Carlo trials each, the framework enables quantitative comparison of existing mechanisms and additionally implements a novel hypothetical mechanism based on a Boltzmann softmax kernel as a theoretical upper-bound benchmark for centroid-seeking performance.
Link: https://arxiv.org/abs/2603.08752
Authors: Sumit Mukherjee
Affiliations: Oracle
Subjects: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments:
Abstract:Here we present electoral_sim, an open-source Python framework for simulating and comparing electoral systems across diverse voter preference distributions. The framework represents voters and candidates as points in a two-dimensional ideological space, derives sincere ballot profiles from Euclidean preference distances, and evaluates several standard electoral mechanisms – including plurality, ranked-choice, approval, score, Condorcet, and two proportional representation systems – against a common primary metric: the Euclidean distance between the electoral outcome and the geometric median of the voter distribution. We evaluate these systems across many empirically-grounded scenarios ranging from unimodal consensus electorates to sharply polarised bimodal configurations, reporting both single-run and Monte Carlo stability results across 200 trials per scenario. As a case study in framework extensibility, we implement and evaluate a novel hypothetical mechanism that is not currently implemented in any jurisdiction – in which each voter’s influence is distributed across candidates via a Boltzmann softmax kernel. This system is included as a theoretical benchmark characterising an approximate upper bound on centroid-seeking performance, rather than as a policy proposal. All code is released publicly at this https URL.
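The Boltzmann-kernel tally can be sketched in a few lines. This is our reading of the mechanism from the abstract, not code from the electoral_sim repository, and the `temperature` parameter name is ours:

```python
import math

def softmax_tally(voters, candidates, temperature=1.0):
    """Boltzmann-kernel voting: each voter's single unit of influence is
    spread over all candidates with weight exp(-d / temperature) on the
    ideological distance d, normalized per voter.  Returns the per-candidate
    score totals; the candidate with the largest total wins."""
    scores = [0.0] * len(candidates)
    for vx, vy in voters:
        weights = [math.exp(-math.hypot(vx - cx, vy - cy) / temperature)
                   for cx, cy in candidates]
        z = sum(weights)
        for i, w in enumerate(weights):
            scores[i] += w / z                 # influence sums to 1 per voter
    return scores
```

Because each voter's influence is normalized, total score always equals the number of voters, and the winner is simply `scores.index(max(scores))`.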
[MA-13] ChatNeuroSim: An LLM Agent Framework for Automated Compute-in-Memory Accelerator Deployment and Optimization
【Quick Read】: This paper addresses the long design space exploration (DSE) cycles in Compute-in-Memory (CIM) accelerator design flows, caused by the complexity of system-level simulators such as NeuroSim, their many parameter dependencies, and the frequent design-simulation iterations they demand, all of which hinder rapid CIM deployment. The key to the solution is ChatNeuroSim, an LLM-based agent framework that automates the entire flow (task scheduling, request parsing and adjustment, parameter dependency checking, script generation, and simulation execution) and integrates a design-space-pruning optimizer that substantially shortens the search for optimal configurations. In a case study optimizing Swin Transformer Tiny at the 22 nm node, the approach achieves a 0.42×-0.79× average runtime reduction, validating both its automatic request parsing and task execution and its optimization speedup.
Link: https://arxiv.org/abs/2603.08745
Authors: Ming-Yen Lee, Shimeng Yu
Affiliations: Georgia Institute of Technology
Subjects: Hardware Architecture (cs.AR); Multiagent Systems (cs.MA); Performance (cs.PF)
Comments: 30 pages, 16 figures
Abstract:Compute-in-Memory (CIM) architectures have been widely studied for deep neural network (DNN) acceleration by reducing data transfer overhead between the memory and computing units. In conventional CIM design flows, system-level CIM simulators (such as NeuroSim) are leveraged for design space exploration (DSE) across different hardware configurations and DNN workloads. However, CIM designers need to invest substantial effort in interpreting simulator manuals and understanding complex parameter dependencies. Moreover, extensive design-simulation iterations are often required to identify optimal CIM configurations under hardware constraints. These challenges severely prolong the DSE cycle and hinder rapid CIM deployment. To address these challenges, this work proposes ChatNeuroSim, a large language model (LLM)-based agent framework for automated CIM accelerator deployment and optimization. ChatNeuroSim automates the entire CIM workflow, including task scheduling, request parsing and adjustment, parameter dependency checking, script generation, and simulation execution. It also integrates the proposed CIM optimizer using design space pruning, enabling rapid identification of optimal configurations for different DNN workloads. ChatNeuroSim is evaluated on extensive request-level testbenches and demonstrates correct simulation and optimization behavior, validating its effectiveness in automatic request parsing and task execution. Furthermore, the proposed design space pruning technique accelerates CIM optimization process compared to no-pruning baseline. In the case study optimizing Swin Transformer Tiny under 22 nm technology, the proposed CIM optimizer achieves a 0.42×-0.79× average runtime reduction compared to the same optimization algorithm without design space pruning.
自然语言处理
[NLP-0] CREATE: Testing LLMs for Associative Creativity
【速读】: 该论文旨在解决当前大语言模型在生成式 AI(Generative AI)任务中对创造性关联推理能力评估不足的问题。现有基准多聚焦于事实准确性或逻辑推理,缺乏对模型能否从参数化知识中挖掘新颖且有意义的概念连接的系统性评测。解决方案的关键在于提出 CREATE 基准,该基准要求模型生成多个高特异性(distinctiveness and closeness of the concept connection)和高多样性(dissimilarity from other paths)的概念路径,并通过客观评分机制量化其创造性表现。此设计模拟了真实创造力任务如假设生成中的大规模搜索空间特性,同时支持可扩展、自动评分的数据集构建,从而为提升模型的关联创造力提供了可量化、可比较的测试平台。
链接: https://arxiv.org/abs/2603.09970
作者: Manya Wadhwa,Tiasa Singha Roy,Harvey Lederman,Junyi Jessy Li,Greg Durrett
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:A key component of creativity is associative reasoning: the ability to draw novel yet meaningful connections between concepts. We introduce CREATE, a benchmark designed to evaluate models’ capacity for creative associative reasoning. CREATE requires models to generate sets of paths connecting concepts in a model’s parametric knowledge. Paths should have high specificity (distinctiveness and closeness of the concept connection) and high diversity (dissimilarity from other paths), and models are scored more highly if they produce a larger set of strong, diverse paths. This task shares demands of real creativity tasks like hypothesis generation, including an extremely large search space, but enables collection of a sizable benchmark with objective answer grading. Evaluation of frontier models shows that the strongest models achieve higher creative utility than others, with the high multiplicity of answers and complexity of the search making benchmark saturation difficult to achieve. Furthermore, our results illustrate that thinking models are not always more effective on our task, even with high token budgets. Recent approaches for creative prompting give some but limited additional improvement. CREATE provides a sandbox for developing new methods to improve models’ capacity for associative creativity.
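The specificity-and-diversity scoring described above can be illustrated with a toy scorer (a hypothetical sketch: CREATE's actual grading pipeline is not reproduced here, and concept paths are reduced to plain token sets with Jaccard dissimilarity standing in for path diversity):

```python
# Hypothetical sketch of CREATE-style path scoring: each path is a set of
# concept tokens; diversity is dissimilarity from the other paths in the set.

def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def creative_utility(paths, specificity):
    """Sum per-path utility: specificity weighted by diversity
    (1 - max Jaccard overlap with any other path in the set)."""
    total = 0.0
    for i, p in enumerate(paths):
        overlaps = [jaccard(p, q) for j, q in enumerate(paths) if j != i]
        diversity = 1.0 - max(overlaps, default=0.0)
        total += specificity[i] * diversity
    return total

paths = [{"volcano", "pressure", "espresso"},
         {"volcano", "pressure", "geyser"},
         {"ant", "pheromone", "routing"}]
spec = [0.9, 0.8, 0.7]
print(round(creative_utility(paths, spec), 3))
```

Under such a scorer, producing a larger set of strong, mutually dissimilar paths raises the total, which matches the benchmark's stated incentive.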
[NLP-1] Think Before You Lie: How Reasoning Improves Honesty
【速读】: 该论文试图解决的问题是:现有对大语言模型(Large Language Models, LLMs)的评估主要关注欺骗率,但导致模型产生欺骗行为的根本条件尚不明确。为回答这一问题,作者构建了一个包含现实道德权衡的新颖数据集,其中诚实行为伴随可变成本。解决方案的关键在于揭示了推理过程如何提升诚实性——研究发现,推理不仅增强了诚实行为,且这种效应并非源于推理内容本身(因为推理轨迹往往难以预测最终行为),而是由于表示空间本身的几何特性:欺骗区域在该空间中具有亚稳态特征,即欺骗答案比诚实答案更容易被输入改写、输出重采样或激活噪声扰动所破坏(destabilized)。因此,推理通过在偏置的表示空间中遍历,引导模型走向其更稳定的诚实默认状态。
链接: https://arxiv.org/abs/2603.09957
作者: Ann Yuan,Asma Ghandeharioun,Carter Blum,Alicia Machado,Jessica Hoffmann,Daphne Ippolito,Martin Wattenberg,Lucas Dixon,Katja Filippova
机构: Google DeepMind(谷歌深度思维); Carnegie Mellon University(卡内基梅隆大学); Harvard University(哈佛大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:While existing evaluations of large language models (LLMs) measure deception rates, the underlying conditions that give rise to deceptive behavior are poorly understood. We investigate this question using a novel dataset of realistic moral trade-offs where honesty incurs variable costs. Contrary to humans, who tend to become less honest given time to deliberate (Capraro, 2017; Capraro et al., 2019), we find that reasoning consistently increases honesty across scales and for several LLM families. This effect is not only a function of the reasoning content, as reasoning traces are often poor predictors of final behaviors. Rather, we show that the underlying geometry of the representational space itself contributes to the effect. Namely, we observe that deceptive regions within this space are metastable: deceptive answers are more easily destabilized by input paraphrasing, output resampling, and activation noise than honest ones. We interpret the effect of reasoning in this vein: generating deliberative tokens as part of moral reasoning entails the traversal of a biased representational space, ultimately nudging the model toward its more stable, honest defaults.
[NLP-2] Model Merging in the Era of Large Language Models: Methods, Applications and Future Directions
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在实际应用中因模型数量激增而带来的部署复杂性与计算资源消耗问题,尤其是如何高效整合多个微调后的LLM以获得统一性能而不需额外训练。其解决方案的关键在于提出并系统梳理了一种四维分类框架——FUSE(Foundations, Unification Strategies, Scenarios, Ecosystem),从理论基础、融合策略、应用场景到生态工具全面构建了模型合并(model merging)的结构化认知体系。其中,核心创新在于通过权重平均、任务向量运算、稀疏增强方法、专家混合架构及进化优化等算法路径,实现多模型能力的无监督集成,并揭示了损失曲面几何、模式连通性等理论机制对合并效果的支撑作用,从而为研究者和实践者提供可扩展、可解释且高效的模型融合方法论基础。
链接: https://arxiv.org/abs/2603.09938
作者: Mingyang Song,Mao Zheng
机构: Tencent(腾讯)
类目: Computation and Language (cs.CL)
备注:
Abstract:Model merging has emerged as a transformative paradigm for combining the capabilities of multiple neural networks into a single unified model without additional training. With the rapid proliferation of fine-tuned large language models (LLMs), merging techniques offer a computationally efficient alternative to ensembles and full retraining, enabling practitioners to compose specialized capabilities at minimal cost. This survey presents a comprehensive and structured examination of model merging in the LLM era through the FUSE taxonomy, a four-dimensional framework organized along Foundations, Unification Strategies, Scenarios, and Ecosystem. We first establish the theoretical underpinnings of merging, including loss landscape geometry, mode connectivity, and the linear mode connectivity hypothesis. We then systematically review the algorithmic landscape, spanning weight averaging, task vector arithmetic, sparsification-enhanced methods, mixture-of-experts architectures, and evolutionary optimization approaches. For each method family, we analyze the core formulation, highlight representative works, and discuss practical trade-offs. We further examine downstream applications across multi-task learning, safety alignment, domain specialization, multilingual transfer, and federated learning. Finally, we survey the supporting ecosystem of open-source tools, community platforms, and evaluation benchmarks, and identify key open challenges including theoretical gaps, scalability barriers, and standardization needs. This survey aims to equip researchers and practitioners with a structured foundation for advancing model merging.
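Of the method families the survey reviews, task vector arithmetic is the simplest to sketch: each fine-tuned model contributes a task vector tau_i = theta_i - theta_base, and the merged model is theta_base + sum_i lambda_i * tau_i (the toy weights below are illustrative, not from the survey):

```python
import numpy as np

# Minimal sketch of task-vector arithmetic merging, one of the algorithmic
# families surveyed: merged = base + sum_i lambda_i * (theta_i - base).

def merge_task_vectors(base, finetuned, lambdas):
    merged = base.copy()
    for theta, lam in zip(finetuned, lambdas):
        merged += lam * (theta - base)   # add the scaled task vector
    return merged

base = np.zeros(4)
math_model = np.array([1.0, 0.0, 0.0, 0.0])
code_model = np.array([0.0, 2.0, 0.0, 0.0])
merged = merge_task_vectors(base, [math_model, code_model], [0.5, 0.5])
print(merged)
```

In real merges the arrays are per-tensor model weights rather than toy vectors, and the sparsification-enhanced methods the survey covers additionally mask small task-vector entries before summing.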
[NLP-3] Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理简单单跳事实类问题时,推理能力是否仍具有效用这一悬而未决的问题。尽管此类问题无需逻辑分解,但研究发现,启用推理机制可显著扩展模型参数化知识的召回边界,从而解锁原本无法正确回答的答案。解决方案的关键在于识别出两个核心驱动机制:一是计算缓冲效应(computational buffer effect),即模型利用生成的推理标记进行与语义无关的隐式计算;二是事实提示效应(factual priming),即生成相关事实作为语义桥梁促进正确答案的检索。此外,研究揭示了后者可能引发幻觉传播的风险,并提出通过优先选择不含幻觉事实的推理路径来直接提升模型准确性。
链接: https://arxiv.org/abs/2603.09906
作者: Zorik Gekhman,Roee Aharoni,Eran Ofek,Mor Geva,Roi Reichart,Jonathan Herzig
机构: Technion - Israel Institute of Technology (以色列理工学院); Tel Aviv University (特拉维夫大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:While reasoning in LLMs plays a natural role in math, code generation, and multi-hop factual questions, its effect on simple, single-hop factual questions remains unclear. Such questions do not require step-by-step logical decomposition, making the utility of reasoning highly counterintuitive. Nevertheless, we find that enabling reasoning substantially expands the capability boundary of the model’s parametric knowledge recall, unlocking correct answers that are otherwise effectively unreachable. Why does reasoning aid parametric knowledge recall when there are no complex reasoning steps to be done? To answer this, we design a series of hypothesis-driven controlled experiments, and identify two key driving mechanisms: (1) a computational buffer effect, where the model uses the generated reasoning tokens to perform latent computation independent of their semantic content; and (2) factual priming, where generating topically related facts acts as a semantic bridge that facilitates correct answer retrieval. Importantly, this latter generative self-retrieval mechanism carries inherent risks: we demonstrate that hallucinating intermediate facts during reasoning increases the likelihood of hallucinations in the final answer. Finally, we show that our insights can be harnessed to directly improve model accuracy by prioritizing reasoning trajectories that contain hallucination-free factual statements.
[NLP-4] MSSR: Memory-Aware Adaptive Replay for Continual LLM Fine-Tuning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在持续微调过程中面临的灾难性遗忘(catastrophic forgetting)问题,即模型在学习新任务时会显著退化先前习得的知识。解决方案的关键在于提出一种名为 Memory-Inspired Sampler and Scheduler Replay (MSSR) 的经验回放框架,其核心创新是通过估计样本级别的记忆强度(memory strength)来动态调整回放策略,在自适应间隔下进行重放,从而在保持快速适应能力的同时有效缓解遗忘现象。
链接: https://arxiv.org/abs/2603.09892
作者: Yiyang Lu,Yu He,Jianlong Chen,Hongyuan Zha
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Continual fine-tuning of large language models (LLMs) is becoming increasingly crucial as these models are deployed in dynamic environments where tasks and data distributions evolve over time. While strong adaptability enables rapid acquisition of new knowledge, it also exposes LLMs to catastrophic forgetting, where previously learned skills degrade during sequential training. Existing replay-based strategies, such as fixed interleaved replay, accuracy-supervised, and loss-driven scheduling, remain limited: some depend on heuristic rules and provide only partial mitigation of forgetting, while others improve performance but incur substantial computational overhead. Motivated by retention dynamics under sequential fine-tuning, we propose Memory-Inspired Sampler and Scheduler Replay (MSSR), an experience replay framework that estimates sample-level memory strength and schedules rehearsal at adaptive intervals to mitigate catastrophic forgetting while maintaining fast adaptation. Extensive experiments across three backbone models and 11 sequential tasks show that MSSR consistently outperforms state-of-the-art replay baselines, with particularly strong gains on reasoning-intensive and multiple-choice benchmarks.
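The memory-strength-driven scheduling idea can be illustrated with a toy single-sample scheduler (a hypothetical sketch in the spirit of MSSR: the exponential retention curve, the threshold, and the spacing-effect update are assumptions, not the paper's exact estimator):

```python
import math

# Toy sketch of adaptive replay scheduling: a sample's estimated retention
# decays exponentially since its last rehearsal; once it drops below a
# threshold the sample is replayed, which resets retention and slows the
# decay (a crude spacing effect), so rehearsal intervals stretch out.

def replay_schedule(decay, threshold, horizon):
    """Return the training steps at which one sample gets rehearsed."""
    last_rehearsal, steps = 0, []
    for t in range(1, horizon + 1):
        strength = math.exp(-decay * (t - last_rehearsal))
        if strength < threshold:
            steps.append(t)
            last_rehearsal = t          # rehearsal resets retention
            decay *= 0.5                # spacing effect: slower forgetting
    return steps

print(replay_schedule(decay=0.5, threshold=0.4, horizon=20))
```

The intervals grow (2, then 4, then 8 steps here), so well-retained samples consume less replay budget over time, which is the efficiency argument made against fixed interleaved replay.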
[NLP-5] Benchmarking Political Persuasion Risks Across Frontier Large Language Models
【速读】: 该论文旨在解决前沿大语言模型(Large Language Models, LLMs)在政治观点塑造中的潜在说服力问题,特别是针对当前关于LLMs是否比传统政治宣传更有效尚存争议的现状。研究通过两项大规模调查实验(N=19,145),系统评估了由Anthropic、OpenAI、Google和xAI开发的七种先进LLM在跨党派议题和立场上的 persuasiveness(说服力)。其解决方案的关键在于:首先,发现不同LLM在说服力上存在显著差异,其中Claude模型表现最优,Grok最弱;其次,揭示信息型提示(information-based prompts)的效果具有模型依赖性,即对Claude和Grok提升说服力,但显著削弱GPT的表现;最后,提出一种数据驱动且策略无关的LLM辅助对话分析方法,用于识别并评估底层说服策略,从而为前沿模型的说服风险提供基准,并建立跨模型比较的风险评估框架。
链接: https://arxiv.org/abs/2603.09884
作者: Zhongren Chen,Joshua Kalla,Quan Le
机构: Yale University (耶鲁大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:Concerns persist regarding the capacity of Large Language Models (LLMs) to sway political views. Although prior research has claimed that LLMs are not more persuasive than standard political campaign practices, the recent rise of frontier models warrants further study. In two survey experiments (N=19,145) across bipartisan issues and stances, we evaluate seven state-of-the-art LLMs developed by Anthropic, OpenAI, Google, and xAI. We find that LLMs outperform standard campaign advertisements, with heterogeneity in performance across models. Specifically, Claude models exhibit the highest persuasiveness, while Grok exhibits the lowest. The results are robust across issues and stances. Moreover, in contrast to the findings in Hackenburg et al. (2025b) and Lin et al. (2025) that information-based prompts boost persuasiveness, we find that the effectiveness of information-based prompts is model-dependent: they increase the persuasiveness of Claude and Grok while substantially reducing that of GPT. We introduce a data-driven and strategy-agnostic LLM-assisted conversation analysis approach to identify and assess underlying persuasive strategies. Our work benchmarks the persuasive risks of frontier models and provides a framework for cross-model comparative risk assessment.
[NLP-6] Do What I Say: A Spoken Prompt Dataset for Instruction-Following
【速读】: 该论文旨在解决当前Speech Large Language Models (SLLMs)的评估主要依赖文本提示(text prompts)的问题,而忽视了真实场景中用户通过语音进行交互的情况,导致评估结果难以反映模型在实际语音指令下的性能表现。解决方案的关键在于提出DoWhatISay (DOWIS),一个包含人类录制的语音与文本提示配对的多语言数据集,覆盖9个任务、11种语言及每任务-语言组合下5种风格的10种提示变体,从而为SLLMs提供基于语音输入的真实情境评估基准。
链接: https://arxiv.org/abs/2603.09881
作者: Maike Züfle,Sara Papi,Fabian Retkowski,Szymon Mazurek,Marek Kasztelnik,Alexander Waibel,Luisa Bentivogli,Jan Niehues
机构: Karlsruhe Institute of Technology (德国卡尔斯鲁厄理工学院); Fondazione Bruno Kessler (意大利布鲁诺·凯斯勒基金会); ACC Cyfronet AGH (波兰ACC Cyfronet AGH); AGH University of Krakow (波兰克拉科夫AGH大学); Carnegie Mellon University (美国卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Speech Large Language Models (SLLMs) have rapidly expanded, supporting a wide range of tasks. These models are typically evaluated using text prompts, which may not reflect real-world scenarios where users interact with speech. To address this gap, we introduce DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to pair with any existing benchmark for realistic evaluation of SLLMs under spoken instruction conditions. Spanning 9 tasks and 11 languages, it provides 10 prompt variants per task-language pair, across five styles. Using DOWIS, we benchmark state-of-the-art SLLMs, analyzing the interplay between prompt modality, style, language, and task type. Results show that text prompts consistently outperform spoken prompts, particularly for low-resource and cross-lingual settings. Only for tasks with speech output, spoken prompts do close the gap, highlighting the need for speech-based prompting in SLLM evaluation.
[NLP-7] N-gram-like Language Models Predict Reading Time Best
【速读】: 该论文试图解决的问题是:当前基于Transformer的神经语言模型在进行词预测时表现出极高的准确性,但其输出的概率分布反而与读者的眼动追踪指标(如阅读时间)的相关性降低。解决方案的关键在于提出一个假设——阅读时间主要受简单n-gram统计特征的影响,而非Transformer模型所学习的复杂语义或上下文依赖关系。作者通过实验证明,那些在预测n-gram概率上表现最优的语言模型,其概率分布也最能解释自然文本中基于眼动的数据所反映的阅读时间变化,从而揭示了语言模型与人类阅读行为之间不一致的根本原因。
链接: https://arxiv.org/abs/2603.09872
作者: James A. Michaelov,Roger P. Levy
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent work has found that contemporary language models such as transformers can become so good at next-word prediction that the probabilities they calculate become worse for predicting reading time. In this paper, we propose that this can be explained by reading time being sensitive to simple n-gram statistics rather than the more complex statistics learned by state-of-the-art transformer language models. We demonstrate that the neural language models whose predictions are most correlated with n-gram probability are also those that calculate probabilities that are the most correlated with eye-tracking-based metrics of reading time on naturalistic text.
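The kind of simple n-gram statistic the paper points to can be computed directly from counts; the sketch below estimates add-one-smoothed bigram surprisal, -log2 P(w_t | w_{t-1}), and correlates it with reading times (the corpus and reading times here are synthetic, constructed so the correlation is perfect by design; real eye-tracking data would of course be noisier):

```python
import math
from collections import Counter

# Bigram surprisal with add-one smoothing on a toy corpus, plus a
# hand-rolled Pearson correlation against synthetic reading times.

corpus = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])          # context counts
vocab = len(set(corpus))

def surprisal(prev, word):
    p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)
    return -math.log2(p)

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

s = [surprisal(p, w) for p, w in zip(corpus, corpus[1:])]
rts = [200 + 30 * v for v in s]          # synthetic reading times (ms)
print(round(pearson(s, rts), 3))
```

The paper's claim is the empirical analogue: language models whose probabilities track such n-gram estimates also produce surprisals that best fit eye-tracking reading times.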
[NLP-8] Chow-Liu Ordering for Long-Context Reasoning in Chain-of-Agents ICLR2026
【速读】: 该论文旨在解决顺序多智能体推理框架(如Chain-of-Agents, CoA)在处理长上下文任务时因分块顺序不当而导致的信息损失问题。CoA通过有界共享内存中的隐状态因子分解实现对全局上下文条件分布的近似,但这种有界记忆机制引入了信息瓶颈,使得最终证据状态对分块处理顺序敏感。论文的关键解决方案是利用Chow-Liu树学习分块间的依赖结构,并基于该结构采用广度优先遍历生成最优分块顺序,从而减少智能体间的信息损失。实验证明,该方法在三个长上下文基准测试中均优于默认文档顺序和基于语义得分的排序策略,在答案相关性和精确匹配准确率上表现更优。
链接: https://arxiv.org/abs/2603.09835
作者: Naman Gupta,Vaibhav Singh,Arun Iyer,Kirankumar Shiragur,Pratham Grover,Ramakrishna B. Bairi,Ritabrata Maiti,Sankarshan Damle,Shachee Mishra Gupta,Rishikesh Maurya,Vageesh D. C
机构: Microsoft
类目: Computation and Language (cs.CL)
备注: Published as a workshop paper at ICLR 2026 Workshop MemAgents
Abstract:Sequential multi-agent reasoning frameworks such as Chain-of-Agents (CoA) handle long-context queries by decomposing inputs into chunks and processing them sequentially using LLM-based worker agents that read from and update a bounded shared memory. From a probabilistic perspective, CoA aims to approximate the conditional distribution corresponding to a model capable of jointly reasoning over the entire long context. CoA achieves this through a latent-state factorization in which only bounded summaries of previously processed evidence are passed between agents. The resulting bounded-memory approximation introduces a lossy information bottleneck, making the final evidence state inherently dependent on the order in which chunks are processed. In this work, we study the problem of chunk ordering for long-context reasoning. We use the well-known Chow-Liu trees to learn a dependency structure that prioritizes strongly related chunks. Empirically, we show that a breadth-first traversal of the resulting tree yields chunk orderings that reduce information loss across agents and consistently outperform both default document-chunk ordering and semantic score-based ordering in answer relevance and exact-match accuracy across three long-context benchmarks.
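The ordering step can be sketched concretely: a Chow-Liu tree is the maximum spanning tree over pairwise mutual information, and a breadth-first walk of that tree gives the chunk order (a sketch under stated assumptions: the MI matrix below is made up, and how the paper estimates chunk-level MI is not reproduced here):

```python
from collections import deque

# Sketch: maximum spanning tree over pairwise chunk MI (the Chow-Liu
# tree) via Prim's algorithm, then a breadth-first traversal from the
# root to produce the chunk processing order.

def chow_liu_order(mi, root=0):
    n = len(mi)
    in_tree, children = {root}, {i: [] for i in range(n)}
    while len(in_tree) < n:
        # Heaviest edge crossing the cut (maximum, not minimum, MST).
        u, v = max(((i, j) for i in in_tree
                    for j in range(n) if j not in in_tree),
                   key=lambda e: mi[e[0]][e[1]])
        in_tree.add(v)
        children[u].append(v)
    order, queue = [], deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        queue.extend(sorted(children[u], key=lambda c: -mi[u][c]))
    return order

mi = [[0.0, 0.9, 0.1, 0.2],
      [0.9, 0.0, 0.8, 0.1],
      [0.2, 0.8, 0.0, 0.7],
      [0.2, 0.1, 0.7, 0.0]]
print(chow_liu_order(mi))
```

The intuition from the abstract: strongly related chunks end up adjacent in the tree, so the bounded shared memory carries their joint evidence across fewer agent hops.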
[NLP-9] One-Eval: An Agentic System for Automated and Traceable LLM Evaluation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际开发与部署过程中评估环节的低效与不可靠问题,具体表现为:评估需大量人工干预,包括基准测试选择、异构代码库复现、数据集模式映射配置及指标解读等复杂操作。其解决方案的关键在于提出One-Eval——一个代理驱动的评估系统,通过三个核心模块实现自然语言到可执行、可追溯且可定制化评估流程的自动转换:(i) NL2Bench用于意图结构化与个性化基准规划,(ii) BenchResolve实现基准解析、自动数据获取与模式标准化以保障可执行性,(iii) Metrics & Reporting支持任务感知的指标选择与面向决策的报告生成;同时引入人机协同检查点确保可控性,并保留样本级证据链以增强调试与审计能力。
链接: https://arxiv.org/abs/2603.09821
作者: Chengyu Shen,Yanheng Hou,Minghui Pan,Runming He,Zhen Hao Wong,Meiyi Qiang,Zhou Liu,Hao Liang,Peichao Lai,Zeang Sheng,Wentao Zhang
机构: Peking University (北京大学); Beijing Institute of Technology (北京理工大学); Beijing University of Posts and Telecommunications (北京邮电大学); Zhongguancun Academy (中关村学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation codebases, configure dataset schema mappings, and interpret aggregated metrics. To address these challenges, we present One-Eval, an agentic evaluation system that converts natural-language evaluation requests into executable, traceable, and customizable evaluation workflows. One-Eval integrates (i) NL2Bench for intent structuring and personalized benchmark planning, (ii) BenchResolve for benchmark resolution, automatic dataset acquisition, and schema normalization to ensure executability, and (iii) Metrics \ Reporting for task-aware metric selection and decision-oriented reporting beyond scalar scores. The system further incorporates human-in-the-loop checkpoints for review, editing, and rollback, while preserving sample evidence trails for debugging and auditability. Experiments show that One-Eval can execute end-to-end evaluations from diverse natural-language requests with minimal user effort, supporting more efficient and reproducible evaluation in industrial settings. Our framework is publicly available at this https URL.
[NLP-10] EPIC-EuroParl-UdS: Information-Theoretic Perspectives on Translation and Interpreting LREC-2026
【速读】: 该论文旨在解决多模态语言研究中高质量平行语料库不足的问题,特别是针对书面与口语模式差异及口译中填充词(filler particles)预测的挑战。其解决方案的关键在于构建并更新了一个整合的英德双语语料库(EPIC-UdS 和 EuroParl-UdS),通过修正元数据和文本错误、增强语言学标注、新增词对齐(word alignment)和词级意外度(surprisal indices)等层次信息,从而为信息论驱动的语言变异研究(如比较书面与口语差异、分析言语不流畅性)以及传统翻译风格(translationese)研究提供更可靠的数据支持。此外,论文还利用该语料库验证了重建口语数据的完整性,并评估了基于基础和微调 GPT-2 模型及机器翻译模型的概率指标在填充词预测任务中的表现。
链接: https://arxiv.org/abs/2603.09785
作者: Maria Kunilovskaya,Christina Pollkläsener
机构: 未知
类目: Computation and Language (cs.CL)
备注: 16 pages with appendices, 8 figures to be published in LREC-2026 main conference proceedings
Abstract:This paper introduces an updated and combined version of the bidirectional English-German EPIC-UdS (spoken) and EuroParl-UdS (written) corpora containing original European Parliament speeches as well as their translations and interpretations. The new version corrects metadata and text errors identified through previous use, refines the content, updates linguistic annotations, and adds new layers, including word alignment and word-level surprisal indices. The combined resource is designed to support research using information-theoretic approaches to language variation, particularly studies comparing written and spoken modes, and examining disfluencies in speech, as well as traditional translationese studies, including parallel (source vs. target) and comparable (original vs. translated) analyses. The paper outlines the updates introduced in this release, summarises previous results based on the corpus, and presents a new illustrative study. The study validates the integrity of the rebuilt spoken data and evaluates probabilistic measures derived from base and fine-tuned GPT-2 and machine translation models on the task of filler particles prediction in interpreting.
[NLP-11] Beyond Fine-Tuning: Robust Food Entity Linking under Ontology Drift with FoodOntoRAG
【速读】: 该论文旨在解决食品领域命名实体链接(Named Entity Linking, NEL)任务中因依赖微调大型语言模型(Large Language Models, LLMs)所导致的计算成本高、对本体快照版本绑定强以及在本体漂移(ontology drift)下性能下降的问题。解决方案的关键在于提出了一种模型和本体无关的流水线方法——FoodOntoRAG,其核心机制包括:通过混合词法-语义检索器枚举候选实体,利用选择代理基于结构化证据(如食品标签、同义词、定义及关系)选出最优匹配并提供理由,独立评分代理校准置信度,并在置信度不足时由同义词生成代理提出改写建议重新进入循环。该设计避免了微调过程,提升了对本体演化的鲁棒性,并通过可解释的推理路径实现决策透明化。
链接: https://arxiv.org/abs/2603.09758
作者: Jan Drole,Ana Gjorgjevikj,Barbara Koroušić Seljak,Tome Eftimov
机构: Jožef Stefan Institute (约瑟夫·斯特凡研究所); Jožef Stefan International Postgraduate School (约瑟夫·斯特凡国际研究生院)
类目: Computation and Language (cs.CL)
备注: Preprint
Abstract:Standardizing food terms from product labels and menus into ontology concepts is a prerequisite for trustworthy dietary assessment and safety reporting. The dominant approach to Named Entity Linking (NEL) in the food and nutrition domains fine-tunes Large Language Models (LLMs) on task-specific corpora. Although effective, fine-tuning incurs substantial computational cost, ties models to a particular ontology snapshot (i.e., version), and degrades under ontology drift. This paper presents FoodOntoRAG, a model- and ontology-agnostic pipeline that performs few-shot NEL by retrieving candidate entities from domain ontologies and conditioning an LLM on structured evidence (food labels, synonyms, definitions, and relations). A hybrid lexical–semantic retriever enumerates candidates; a selector agent chooses a best match with rationale; a separate scorer agent calibrates confidence; and, when confidence falls below a threshold, a synonym generator agent proposes reformulations to re-enter the loop. The pipeline approaches state-of-the-art accuracy while revealing gaps and inconsistencies in existing annotations. The design avoids fine-tuning, improves robustness to ontology evolution, and yields interpretable decisions through grounded justifications.
[NLP-12] EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在具身智能体中进行长时程、第一人称视角物理因果推理能力不足的问题,即模型能否可靠地预测一系列原子动作执行后场景的最终状态。其解决方案的关键在于提出了一项新任务——Egocentric Scene Prediction with LOng-horizon REasoning(ESPLOR),并构建了EXPLORE-Bench基准数据集,该数据集基于真实的第一人称视频,涵盖多样化场景,每条实例包含长动作序列与结构化的最终场景标注(如物体类别、视觉属性及物体间关系),从而支持细粒度、量化的评估。实验表明,当前主流MLLMs在该任务上性能显著落后于人类,凸显出长时程第一人称推理仍是重大挑战;同时研究发现,通过分步推理策略进行测试时缩放可部分提升性能,但伴随显著计算开销。
链接: https://arxiv.org/abs/2603.09731
作者: Chengjun Yu,Xuhan Zhu,Chaoqun Du,Pengfei Yu,Wei Zhai,Yang Cao,Zheng-Jun Zha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Multimodal large language models (MLLMs) are increasingly considered as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from an egocentric viewpoint. We study this gap through a new task, Egocentric Scene Prediction with LOng-horizon REasoning: given an initial-scene image and a sequence of atomic action descriptions, a model is asked to predict the final scene after all actions are executed. To enable systematic evaluation, we introduce EXPLORE-Bench, a benchmark curated from real first-person videos spanning diverse scenarios. Each instance pairs long action sequences with structured final-scene annotations, including object categories, visual attributes, and inter-object relations, which supports fine-grained, quantitative assessment. Experiments on a range of proprietary and open-source MLLMs reveal a significant performance gap to humans, indicating that long-horizon egocentric reasoning remains a major challenge. We further analyze test-time scaling via stepwise reasoning and show that decomposing long action sequences can improve performance to some extent, while incurring non-trivial computational overhead. Overall, EXPLORE-Bench provides a principled testbed for measuring and advancing long-horizon reasoning for egocentric embodied perception.
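Because EXPLORE-Bench annotates final scenes with structured object categories, attributes, and relations, predictions can be scored set-wise; a toy sketch of such scoring is below (hypothetical: the benchmark's exact metric is not specified here, and the triples are invented):

```python
# Sketch of fine-grained final-scene scoring: set-level F1 over
# predicted (subject, relation, object) triples against gold annotations.

def triple_f1(pred, gold):
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = [("mug", "on", "table"), ("knife", "in", "drawer"), ("door", "is", "closed")]
pred = [("mug", "on", "table"), ("knife", "on", "counter"), ("door", "is", "closed")]
print(round(triple_f1(pred, gold), 3))
```

Scoring at the triple level is what makes the evaluation "fine-grained": a model that tracks most objects but mislocates one after a long action sequence loses partial credit rather than failing outright.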
[NLP-13] RbtAct: Rebuttal as Supervision for Actionable Review Feedback Generation
【速读】: 该论文旨在解决当前生成式 AI(Generative AI)在学术同行评审中产生的反馈过于表面化、缺乏可操作性的问题,导致作者难以获得具体且可执行的修改建议。其核心解决方案是提出 RbtAct 框架,关键在于利用审稿人回复(rebuttal)作为隐式监督信号,将审稿意见与实际修改行为关联起来,从而直接优化反馈生成模型以提升行动导向性(actionability)。具体而言,研究构建了一个包含 75K 条段落级评论-反驳映射的新数据集 RMR-75K,并引入“视角条件化的段落级反馈生成”任务,使模型能够基于论文整体内容和特定视角(如实验或写作)生成聚焦、可操作的反馈;并通过监督微调与基于反驳对的偏好优化策略训练 Llama-3.1-8B-Instruct 模型,在保持内容相关性和事实准确性的同时显著提升反馈的具体性和实用性。
链接: https://arxiv.org/abs/2603.09723
作者: Sihong Wu,Yiling Ma,Yilun Zhao,Tiansheng Hu,Owen Jiang,Manasi Patwardhan,Arman Cohan
机构: Yale University (耶鲁大学); New York University (纽约大学); TCS Research
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are increasingly used across the scientific workflow, including to draft peer-review reports. However, many AI-generated reviews are superficial and insufficiently actionable, leaving authors without concrete, implementable guidance and motivating the gap this work addresses. We propose RbtAct, which targets actionable review feedback generation and places existing peer review rebuttal at the center of learning. Rebuttals show which reviewer comments led to concrete revisions or specific plans, and which were only defended. Building on this insight, we leverage rebuttal as implicit supervision to directly optimize a feedback generator for actionability. To support this objective, we propose a new task called perspective-conditioned segment-level review feedback generation, in which the model is required to produce a single focused comment based on the complete paper and a specified perspective such as experiments and writing. We also build a large dataset named RMR-75K that maps review segments to the rebuttal segments that address them, with perspective labels and impact categories that order author uptake. We then train the Llama-3.1-8B-Instruct model with supervised fine-tuning on review segments followed by preference optimization using rebuttal derived pairs. Experiments with human experts and LLM-as-a-judge show consistent gains in actionability and specificity over strong baselines while maintaining grounding and relevance.
[NLP-14] MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models
【速读】: 该论文旨在解决大音频语言模型(Large Audio-Language Models, LALMs)在多音频理解(multi-audio understanding)能力上的不足问题,这一能力对于模型处理语音、通用音频和音乐等多源音频输入至关重要,但此前研究中尚未充分探索。解决方案的关键在于提出MUGEN基准以系统评估模型在多音频场景下的表现,并发现输入规模扩展是当前模型的核心瓶颈;进一步采用免训练(training-free)的Audio-Permutational Self-Consistency方法,通过对音频候选顺序的多样化扰动来增强模型聚合预测的鲁棒性,从而实现最高达6.28%的准确率提升;结合Chain-of-Thought提示后,性能进一步提升至6.74%,有效缓解了LALMs在复杂听觉理解任务中的盲区。
链接: https://arxiv.org/abs/2603.09714
作者: Chih-Kai Yang,Yun-Shao Tsai,Yu-Kai Guo,Ping-Le Tsai,Yen-Ting Piao,Hung-Wei Chen,Ting-Lin Hsiao,Yun-Man Hsu,Ke-Han Lu,Hung-yi Lee
机构: National Taiwan University (国立台湾大学)
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: 6 pages, 3 figures, 3 tables. Dataset: this https URL
Abstract:While multi-audio understanding is critical for large audio-language models (LALMs), it remains underexplored. We introduce MUGEN, a comprehensive benchmark evaluating this capability across speech, general audio, and music. Our experiments reveal consistent weaknesses in multi-audio settings, and performance degrades sharply as the number of concurrent audio inputs increases, identifying input scaling as a fundamental bottleneck. We further investigate training-free strategies and observe that Audio-Permutational Self-Consistency, which diversifies the order of audio candidates, helps models form more robust aggregated predictions, yielding up to 6.28% accuracy gains. Combining this permutation strategy with Chain-of-Thought further improves performance to 6.74%. These results expose blind spots in current LALMs and provide a foundation for evaluating complex auditory comprehension.
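Audio-Permutational Self-Consistency amounts to querying the model once per ordering of the audio candidates and majority-voting the answers; the sketch below uses a stand-in "model" with a deliberate first-position bias (the clip names and the bias are invented for illustration, not taken from the paper):

```python
from collections import Counter
from itertools import permutations

# Sketch of permutation-based self-consistency: vote over answers
# obtained under every ordering of the audio candidates, so that a
# position bias in any single ordering gets outvoted.

def biased_model(clips, question):
    # Hypothetical model: answers correctly unless the distractor
    # "noise" clip is presented first, in which case it latches onto it.
    return "noise" if clips[0] == "noise" else "dog_bark"

def permutational_self_consistency(model, clips, question):
    votes = Counter(model(list(p), question) for p in permutations(clips))
    return votes.most_common(1)[0][0]

clips = ["noise", "dog_bark", "siren"]
print(permutational_self_consistency(biased_model, clips, "Which clip matches?"))
```

With three candidates the distractor leads in only 2 of 6 orderings, so the aggregated prediction recovers the correct answer; in practice one would sample a subset of permutations rather than enumerate all of them.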
[NLP-15] Evaluation of LLMs in retrieving food and nutritional context for RAG systems
【速读】: 该论文旨在解决领域专家(如食品编译员和营养师)在利用复杂食物与营养数据时,因传统检索系统需要大量手动操作和技术门槛而面临效率低下的问题。其解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)将自然语言查询转化为结构化元数据过滤器,从而驱动基于Chroma向量数据库的高效检索。通过这一机制,LLMs作为高精度、易用的接口工具,显著降低了对技术专业知识的依赖,尤其在可明确表达约束条件的查询中表现优异;但当查询涉及无法在现有元数据格式中表达的约束时,检索可靠性仍面临挑战。
链接: https://arxiv.org/abs/2603.09704
作者: Maks Požarnik Vavken,Matevž Ogrinc,Tome Eftimov,Barbara Koroušić Seljak
机构: 未知
类目: Computation and Language (cs.CL)
备注: This is the preprint for our conference paper for IEEE International Conference on Big Data
Abstract:In this article, we evaluate four Large Language Models (LLMs) and their effectiveness at retrieving data within a specialized Retrieval-Augmented Generation (RAG) system, using a comprehensive food composition database. Our method is focused on the LLMs' ability to translate natural language queries into structured metadata filters, enabling efficient retrieval via a Chroma vector database. By achieving high accuracy in this critical retrieval step, we demonstrate that LLMs can serve as an accessible, high-performance tool, drastically reducing the manual effort and technical expertise previously required for domain experts, such as food compilers and nutritionists, to leverage complex food and nutrition data. However, despite the high performance on easy and moderately complex queries, our analysis of difficult questions reveals that reliable retrieval remains challenging when queries involve non-expressible constraints. These findings demonstrate that LLM-driven metadata filtering excels when constraints can be explicitly expressed, but struggles when queries exceed the representational scope of the metadata format.
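The retrieval step downstream of the LLM can be sketched in isolation: assume the model has already translated a query like "low-calorie, low-sodium cheeses" into a structured filter, and apply that filter to metadata records (a hypothetical sketch: the field names, values, and Chroma-style `$eq`/`$lt`/`$gt` operators are illustrative, and the matching is done in plain Python rather than against an actual vector database):

```python
# Apply an LLM-produced metadata filter (Chroma-style operator dict)
# to toy food-composition records.

OPS = {"$eq": lambda a, b: a == b,
       "$lt": lambda a, b: a < b,
       "$gt": lambda a, b: a > b}

def apply_filter(records, where):
    def matches(rec):
        return all(OPS[op](rec[field], ref)
                   for field, cond in where.items()
                   for op, ref in cond.items())
    return [r for r in records if matches(r)]

foods = [
    {"name": "cheddar",  "group": "cheese", "kcal": 403, "sodium_mg": 621},
    {"name": "ricotta",  "group": "cheese", "kcal": 174, "sodium_mg": 84},
    {"name": "baguette", "group": "bread",  "kcal": 272, "sodium_mg": 602},
]
where = {"group": {"$eq": "cheese"}, "kcal": {"$lt": 300}, "sodium_mg": {"$lt": 200}}
print([r["name"] for r in apply_filter(foods, where)])
```

The "non-expressible constraints" failure mode from the abstract is visible here too: a query like "cheeses that pair well with wine" has no corresponding metadata field, so no filter dict the LLM emits can capture it.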
[NLP-16] Mousse: Rectifying the Geometry of Muon with Curvature-Aware Preconditioning
【速读】: 该论文旨在解决现有谱优化方法(如Muon)在深度神经网络(Deep Neural Networks, DNNs)训练中因假设优化景观各向同性而导致的次优性能问题。具体而言,Muon对所有特征方向施加统一的谱更新范数约束,忽略了DNN中普遍存在的重尾、病态曲率谱结构,从而可能放大高曲率方向的不稳定性并抑制平坦方向的有效优化。其解决方案的关键在于提出Mousse(Muon Optimization Utilizing Shampoo’s Structural Estimation),该方法通过引入由Shampoo算法导出的Kronecker分解统计量所诱导的白化坐标系,在非各向同性信任区域内求解谱最速下降问题,并利用白化梯度的极分解(polar decomposition)获得最优更新方向,从而在保持谱方法结构性稳定的同时实现二阶预条件的几何自适应性。实证结果表明,Mousse在160M至800M参数的语言模型上显著优于Muon,训练步数减少约12%,且计算开销可忽略。
链接: https://arxiv.org/abs/2603.09697
作者: Yechen Zhang,Shuhao Xing,Junhao Huang,Kai Lv,Yunhua Zhou,Xipeng Qiu,Qipeng Guo,Kai Chen
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai AI Laboratory (上海人工智能实验室); Fudan University (复旦大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 17 pages, 10 figures
Abstract:Recent advances in spectral optimization, notably Muon, have demonstrated that constraining update steps to the Stiefel manifold can significantly accelerate training and improve generalization. However, Muon implicitly assumes an isotropic optimization landscape, enforcing a uniform spectral update norm across all eigen-directions. We argue that this “egalitarian” constraint is suboptimal for Deep Neural Networks, where the curvature spectrum is known to be highly heavy-tailed and ill-conditioned. In such landscapes, Muon risks amplifying instabilities in high-curvature directions while limiting necessary progress in flat directions. In this work, we propose Mousse (Muon Optimization Utilizing Shampoo’s Structural Estimation), a novel optimizer that reconciles the structural stability of spectral methods with the geometric adaptivity of second-order preconditioning. Instead of applying Newton-Schulz orthogonalization directly to the momentum matrix, Mousse operates in a whitened coordinate system induced by Kronecker-factored statistics (derived from Shampoo). Mathematically, we formulate Mousse as the solution to a spectral steepest descent problem constrained by an anisotropic trust region, where the optimal update is derived via the polar decomposition of the whitened gradient. Empirical results across language models ranging from 160M to 800M parameters demonstrate that Mousse consistently outperforms Muon, achieving around a 12% reduction in training steps with negligible computational overhead.
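The whiten-then-polar update described in the abstract can be sketched with NumPy (a sketch under stated assumptions: the inverse-fourth-root factor exponents and the eps damping are illustrative choices in the spirit of Shampoo preconditioning, not necessarily the paper's exact recipe, and the polar factor is taken via SVD rather than a Newton-Schulz iteration):

```python
import numpy as np

# Sketch of a Mousse-style step: whiten the gradient with Kronecker-
# factored (Shampoo-style) statistics, then take the orthogonal factor
# of the whitened gradient's polar decomposition as the update direction.

def mousse_update(G, L, R, eps=1e-6):
    def inv_quarter(M):
        # Inverse fourth root of a PSD factor, with damping for stability.
        w, V = np.linalg.eigh(M + eps * np.eye(M.shape[0]))
        return V @ np.diag(w ** -0.25) @ V.T
    G_white = inv_quarter(L) @ G @ inv_quarter(R)   # whitened coordinates
    U, _, Vt = np.linalg.svd(G_white, full_matrices=False)
    return U @ Vt                                   # polar orthogonal factor

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))
L, R = G @ G.T, G.T @ G                             # toy Shampoo statistics
step = mousse_update(G, L, R)
# The update is semi-orthogonal: step.T @ step ≈ I, as in Muon, but the
# orthogonalization happens in the curvature-whitened coordinate system.
print(np.allclose(step.T @ step, np.eye(3), atol=1e-6))
```

Setting `L = R = I` recovers plain Muon-style orthogonalization of the raw gradient, which makes the "isotropic landscape" assumption the abstract criticizes explicit.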
[NLP-17] ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning
【速读】: 该论文旨在解决强化学习从人类反馈(Reinforcement Learning from Human Feedback, RLHF)中偏好数据获取成本高昂的问题,尤其是在低资源和专家领域。其核心挑战在于如何高效地构建高质量的偏好数据集以提升大语言模型(Large Language Models, LLMs)对齐效果。解决方案的关键在于提出了一种模块化的主动学习(active learning)流水线——ACTIVEULTRAFEEDBACK,该流水线通过不确定性估计动态识别最具信息量的模型响应进行标注,并引入两种新方法:DOUBLE REVERSE THOMPSON SAMPLING(DRTS)和DELTAUCB,它们优先选择预测质量差距较大的响应对,从而更有效地提供微调信号。实验表明,该方法可在仅使用静态基线六分之一标注数据的情况下实现相当或更优的下游性能。
链接: https://arxiv.org/abs/2603.09692
作者: Davit Melikidze,Marian Schneider,Jessica Lam,Martin Wertich,Ido Hakimi,Barna Pásztor,Andreas Krause
机构: ETH Zurich (苏黎世联邦理工学院); ETH AI Center (苏黎世联邦理工学院人工智能中心)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 35 pages, 6 figures, 24 tables
Abstract:Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs), yet its efficacy is bottlenecked by the high cost of acquiring preference data, especially in low-resource and expert domains. To address this, we introduce ACTIVEULTRAFEEDBACK, a modular active learning pipeline that leverages uncertainty estimates to dynamically identify the most informative responses for annotation. Our pipeline facilitates the systematic evaluation of standard response selection methods alongside DOUBLE REVERSE THOMPSON SAMPLING (DRTS) and DELTAUCB, two novel methods prioritizing response pairs with large predicted quality gaps, leveraging recent results showing that such pairs provide good signals for fine-tuning. Our experiments demonstrate that ACTIVEULTRAFEEDBACK yields high-quality datasets that lead to significant improvements in downstream performance, notably achieving comparable or superior results with as little as one-sixth of the annotated data relative to static baselines. Our pipeline is available at this https URL and our preference datasets at this https URL.
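DELTAUCB 的核心想法是优先标注“预测质量差距大、且不确定性高”的响应对。下面是按这一思想写的示意性选择函数(打分公式与 beta 参数均为假设,非论文的原始定义):

```python
import itertools

def delta_ucb_score(mu_i, mu_j, sigma_i, sigma_j, beta=1.0):
    """假设的 DeltaUCB 式打分: 质量差距的乐观上界。"""
    return abs(mu_i - mu_j) + beta * (sigma_i + sigma_j)

def select_pair(responses, beta=1.0):
    """responses: [(id, 预测质量均值, 不确定性), ...]; 返回得分最高的响应对。"""
    best, best_score = None, float("-inf")
    for (a, mu_a, s_a), (b, mu_b, s_b) in itertools.combinations(responses, 2):
        score = delta_ucb_score(mu_a, mu_b, s_a, s_b, beta)
        if score > best_score:
            best, best_score = (a, b), score
    return best, best_score

responses = [("r1", 0.9, 0.05), ("r2", 0.2, 0.05), ("r3", 0.6, 0.30)]
pair, score = select_pair(responses)
assert pair == ("r1", "r2")  # 质量差距最大的一对被优先送去标注
```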
[NLP-18] ESAinsTOD: A Unified End-to-End Schema-Aware Instruction-Tuning Framework for Task-Oriented Dialog Modeling
【速读】: 该论文旨在解决现有端到端任务导向型对话(Task-Oriented Dialog, TOD)建模方法在跨数据集和新对话场景中适应性差的问题,即当前模型通常针对特定数据集定制,难以泛化至未见过的任务流程或schema。解决方案的关键在于提出ESAinsTOD框架——一个统一的、基于指令微调(instruction-tuning)的端到端Schema感知建模框架,其核心创新包括:(1)引入全参数微调的大语言模型(LLMs),并设计两种对齐机制——指令对齐(instruction alignment)确保模型忠实遵循任务指令完成多样化任务流,schema对齐(schema alignment)引导预测结果符合指定结构;(2)采用会话级端到端建模策略,使系统能利用历史对话中已执行任务的结果,从而弥合指令微调范式与真实应用场景之间的差距。实验证明,该方法在CamRest676、In-Car和MultiWOZ等基准上显著优于现有最优模型,并在低资源和零样本设置下展现出更强的泛化能力与抗噪鲁棒性。
链接: https://arxiv.org/abs/2603.09691
作者: Dechuan Teng,Chunlin Lu,Libo Qin,Wanxiang Che
机构: Harbin Institute of Technology (哈尔滨工业大学); Central South University (中南大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Published at International Journal of Machine Learning and Cybernetics (IJMLC)
Abstract:Existing end-to-end modeling methods for modular task-oriented dialog systems are typically tailored to specific datasets, making it challenging to adapt to new dialog scenarios. In this work, we propose ESAinsTOD, a unified End-to-end Schema-Aware Instruction-tuning framework for general Task-Oriented Dialog modeling. This framework introduces a structured methodology to go beyond simply fine-tuning Large Language Models (LLMs), enabling flexible adaptation to various dialogue task flows and schemas. Specifically, we leverage full-parameter fine-tuning of LLMs and introduce two alignment mechanisms to make the resulting system both instruction-aware and schema-aware: (i) instruction alignment, which ensures that the system faithfully follows task instructions to complete various task flows from heterogeneous TOD datasets; and (ii) schema alignment, which encourages the system to make predictions adhering to the specified schema. In addition, we employ session-level end-to-end modeling, which allows the system to access the results of previously executed task flows within the dialogue history, to bridge the gap between the instruction-tuning paradigm and the real-world application of TOD systems. Empirical results show that while a fine-tuned LLM serves as a strong baseline, our structured approach provides significant additional benefits. In particular, our findings indicate that: (i) ESAinsTOD outperforms state-of-the-art models by a significant margin on end-to-end task-oriented dialog modeling benchmarks: CamRest676, In-Car and MultiWOZ; (ii) more importantly, it exhibits superior generalization capabilities across various low-resource settings, with the proposed alignment mechanisms significantly enhancing zero-shot performance; and (iii) our instruction-tuning paradigm substantially improves the model’s robustness against data noise and cascading errors.
[NLP-19] Fusing Semantic Lexical and Domain Perspectives for Recipe Similarity Estimation
【速读】: 该论文旨在解决如何有效评估食谱之间相似性的难题,特别是在整合多种信息源(如食材、烹饪方法和营养属性)的基础上实现更精准的相似性判断。其解决方案的关键在于融合语义(semantic)、词汇(lexical)和领域(domain)三类相似性指标,并通过构建一个基于Web的专家验证界面,由领域专家对318组食谱对进行打分,最终获得80%的一致性结果,从而量化各相似性维度在专家决策中的影响力,为个性化饮食、营养推荐及自动化食谱生成系统提供可量化的技术支撑。
链接: https://arxiv.org/abs/2603.09688
作者: Denica Kjorvezir,Danilo Najkov,Eva Valencič,Erika Jesenko,Barbara Koroišić Seljak,Tome Eftimov,Riste Stojanov
机构: Jožef Stefan Institute (乔泽夫·斯蒂芬研究所); S.Cyril and Methodius University (西里尔和美多迪乌斯大学); Biotechnical Faculty, University of Ljubljana (卢布尔雅那大学生物技术学院)
类目: Computation and Language (cs.CL)
备注: Preprint version submitted to IEEE Big Data 2025
Abstract:This research focuses on developing advanced methods for assessing similarity between recipes by combining different sources of information and analytical approaches. We explore the semantic, lexical, and domain similarity of food recipes, evaluated through the analysis of ingredients, preparation methods, and nutritional attributes. A web-based interface was developed to allow domain experts to validate the combined similarity results. After evaluating 318 recipe pairs, experts agreed on 255 (80%). The evaluation of expert assessments enables the estimation of which similarity aspects–lexical, semantic, or nutritional–are most influential in expert decision-making. The application of these methods has broad implications in the food industry and supports the development of personalized diets, nutrition recommendations, and automated recipe generation systems.
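三类相似度的融合可用加权线性组合来示意;词汇相似度此处以食材集合的 Jaccard 系数为例(权重与具体函数均为假设,论文中各维度的影响力由专家评估估计):

```python
def jaccard(a, b):
    """词汇相似度示例: 两份食谱食材集合的 Jaccard 系数。"""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def fused_similarity(lexical, semantic, domain, weights=(1 / 3, 1 / 3, 1 / 3)):
    """假设的融合方式: 三类相似度的加权线性组合。"""
    w_l, w_s, w_d = weights
    return w_l * lexical + w_s * semantic + w_d * domain

ing_a = ["flour", "egg", "milk", "sugar"]
ing_b = ["flour", "egg", "butter", "sugar"]
lex = jaccard(ing_a, ing_b)  # 交集 3 / 并集 5 = 0.6
assert abs(lex - 0.6) < 1e-9
score = fused_similarity(lex, 0.8, 0.5)
assert 0.0 <= score <= 1.0
```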
[NLP-20] Tracking Cancer Through Text: Longitudinal Extraction From Radiology Reports Using Open-Source Large Language Models
【速读】: 该论文旨在解决放射科报告中纵向信息(如肿瘤负荷、治疗反应和疾病进展)难以自动化提取的问题,因其非结构化叙述格式限制了临床分析的效率与准确性。解决方案的关键在于构建一个完全开源、可本地部署的流水线系统,基于 \textttllm_extractinator 框架,采用 Qwen2.5-72B 大语言模型(Large Language Model, LLM),按照 RECIST 标准从多时间点的胸部/腹部CT报告中提取并关联目标病灶、非目标病灶及新发病灶数据。该方法在50对荷兰CT报告上的评估显示高精度(目标病灶93.7%、非目标病灶94.9%、新病灶94.0%),证明开放源代码的大语言模型可在保障数据隐私的前提下实现临床可接受的多时间点肿瘤学信息抽取性能。
链接: https://arxiv.org/abs/2603.09638
作者: Luc Builtjes,Alessa Hering
机构: Radboud University Medical Center (拉德布德大学医学中心)
类目: Computation and Language (cs.CL)
备注: 6 pages, 2 figures
Abstract:Radiology reports capture crucial longitudinal information on tumor burden, treatment response, and disease progression, yet their unstructured narrative format complicates automated analysis. While large language models (LLMs) have advanced clinical text processing, most state-of-the-art systems remain proprietary, limiting their applicability in privacy-sensitive healthcare environments. We present a fully open-source, locally deployable pipeline for longitudinal information extraction from radiology reports, implemented using the llm_extractinator framework. The system applies the qwen2.5-72b model to extract and link target, non-target, and new lesion data across time points in accordance with RECIST criteria. Evaluation on 50 Dutch CT Thorax/Abdomen report pairs yielded high extraction performance, with attribute-level accuracies of 93.7% for target lesions, 94.9% for non-target lesions, and 94.0% for new lesions. The approach demonstrates that open-source LLMs can achieve clinically meaningful performance in multi-timepoint oncology tasks while ensuring data privacy and reproducibility. These results highlight the potential of locally deployable LLMs for scalable extraction of structured longitudinal data from routine clinical text.
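摘要中提到按 RECIST 标准提取并关联病灶。RECIST 1.1 对靶病灶疗效的判定基于直径总和的变化,可按公开标准的简化规则示意如下(以基线代替最低点 nadir,属简化假设,并非论文流水线的代码):

```python
def recist_target_response(baseline_diams, current_diams):
    """按 RECIST 1.1 的简化规则对靶病灶评估疗效 (单位 mm):
    CR: 全部消失; PR: 直径总和较基线下降 >=30%;
    PD: 总和上升 >=20% 且绝对增加 >=5mm; 否则 SD。
    正式标准中 PD 以最低点 (nadir) 为参照, 此处以基线代替, 属简化。"""
    base = sum(baseline_diams)
    cur = sum(current_diams)
    if cur == 0:
        return "CR"
    if cur <= base * 0.7:
        return "PR"
    if cur >= base * 1.2 and cur - base >= 5:
        return "PD"
    return "SD"

assert recist_target_response([20, 15], [0, 0]) == "CR"
assert recist_target_response([20, 15], [10, 8]) == "PR"   # 35 -> 18, 约 -49%
assert recist_target_response([20, 15], [30, 15]) == "PD"  # 35 -> 45, 约 +29%
assert recist_target_response([20, 15], [21, 15]) == "SD"
```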
[NLP-21] X-GS: An Extensible Open Framework Unifying 3DGS Architectures with Downstream Multimodal Models
【速读】: 该论文旨在解决当前3D Gaussian Splatting (3DGS) 方法在应用上碎片化、缺乏统一框架的问题,尤其是难以实现实时在线SLAM(Simultaneous Localization and Mapping)与语义增强的结合,以及如何将生成的3D场景表示有效接入下游多模态模型。其解决方案的关键在于提出X-GS框架,核心是X-GS-Perceiver高效流水线:该流水线能够以未标定的RGB(或RGB-D)视频流为输入,联合优化几何结构与相机位姿,并从视觉基础模型中蒸馏高维语义特征至3D高斯分布中;同时通过创新的在线向量量化(Online Vector Quantization, VQ)模块、GPU加速的网格采样方案及高度并行化设计,实现真正意义上的实时性能;最终,所生成的语义3D高斯可被X-GS-Thinker组件中的视觉语言模型直接利用,支持如目标检测、零样本图像描述生成等下游任务。
链接: https://arxiv.org/abs/2603.09632
作者: Yueen Ma,Irwin King
机构: The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis, subsequently extending into numerous spatial AI applications. However, most existing 3DGS methods are isolated, focusing on specific domains such as online SLAM, semantic enrichment, or 3DGS for unposed images. In this paper, we introduce X-GS, an extensible open framework that unifies a broad range of techniques to enable real-time 3DGS-based online SLAM enriched with semantics, bridging the gap to downstream multimodal models. At the core of X-GS is a highly efficient pipeline called X-GS-Perceiver, capable of taking unposed RGB (or optionally RGB-D) video streams as input to co-optimize geometry and poses, and distill high-dimensional semantic features from vision foundation models into the 3D Gaussians. We achieve real-time performance through a novel online Vector Quantization (VQ) module, a GPU-accelerated grid-sampling scheme, and a highly parallelized pipeline design. The semantic 3D Gaussians can then be utilized by vision-language models within the X-GS-Thinker component, enabling downstream tasks such as object detection, zero-shot caption generation, and potentially embodied tasks. Experimental results on real-world datasets showcase the efficacy, efficiency, and newly unlocked multimodal capabilities of the X-GS framework.
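X-GS 的实时性部分依赖在线向量量化(VQ)模块。其最基本的形式是“最近邻分配 + 滑动平均更新码本”,可用如下草图示意(类名、接口与超参均为假设,非论文实现):

```python
import numpy as np

class OnlineVQ:
    """极简在线向量量化: 最近邻分配 + EMA 码本更新 (示意性接口)。"""
    def __init__(self, num_codes, dim, decay=0.99, seed=0):
        rng = np.random.default_rng(seed)
        self.codebook = rng.standard_normal((num_codes, dim))
        self.decay = decay

    def assign(self, feats):
        """feats: (N, dim) -> 每个特征向量的最近码字索引。"""
        d = ((feats[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        return d.argmin(axis=1)

    def update(self, feats):
        """按分配结果用指数滑动平均更新码本, 返回分配索引。"""
        idx = self.assign(feats)
        for k in np.unique(idx):
            mean_k = feats[idx == k].mean(axis=0)
            self.codebook[k] = self.decay * self.codebook[k] + (1 - self.decay) * mean_k
        return idx

vq = OnlineVQ(num_codes=4, dim=8)
feats = np.random.default_rng(1).standard_normal((32, 8))
idx = vq.update(feats)
assert idx.shape == (32,) and idx.max() < 4
```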
[NLP-22] Surgical Repair of Collapsed Attention Heads in ALiBi Transformers
【速读】: 该论文旨在解决BLOOM系列Transformer语言模型中因ALiBi位置编码引发的系统性注意力坍塌(attention collapse)问题,即31-44%的注意力头几乎仅关注序列起始标记(beginning-of-sequence token),导致模型有效注意力容量显著下降。解决方案的关键在于提出“手术式重初始化”(surgical reinitialization):针对特定坍塌注意力头进行Q/K/V权重的靶向重置,同时将输出投影设为零并冻结非手术参数的梯度更新,从而在单个消费级GPU上通过两次迭代恢复98.7%的可用注意力头容量(从242提升至379/384)。实验证明该方法有效恢复了模型性能,且优于原始预训练状态,表明预训练注意力配置可能陷入次优局部极小值。
链接: https://arxiv.org/abs/2603.09616
作者: Palmer Schallon
机构: 未知
类目: Computation and Language (cs.CL)
备注: 15 pages, 7 figures, 2 supplementary figures. Code: this https URL Checkpoints: this https URL
Abstract:We identify a systematic attention collapse pathology in the BLOOM family of transformer language models, where ALiBi positional encoding causes 31-44% of attention heads to attend almost entirely to the beginning-of-sequence token. The collapse follows a predictable pattern across four model scales (560M to 7.1B parameters), concentrating in head indices where ALiBi’s slope schedule imposes the steepest distance penalties. We introduce surgical reinitialization: targeted Q/K/V reinitialization with zeroed output projections and gradient-masked freezing of all non-surgical parameters. Applied to BLOOM-1b7 on a single consumer GPU, the technique recovers 98.7% operational head capacity (242 to 379 of 384 heads) in two passes. A controlled comparison with C4 training data confirms that reinitialization – not corpus content – drives recovery, and reveals two distinct post-surgical phenomena: early global functional redistribution that improves the model, and late local degradation that accumulates under noisy training signal. An extended experiment reinitializing mostly-healthy heads alongside collapsed ones produces a model that transiently outperforms stock BLOOM-1b7 by 25% on training perplexity (12.70 vs. 16.99), suggesting that pretrained attention configurations are suboptimal local minima. Code, checkpoints, and diagnostic tools are released as open-source software.
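诊断此类坍塌头的一个直接做法,是统计各头注意力质量落在序列首 token(BOS)上的平均比例。以下 NumPy 草图仅作示意,阈值 0.9 为假设,并非论文采用的判据:

```python
import numpy as np

def collapsed_heads(attn, threshold=0.9):
    """attn: (num_heads, seq_len, seq_len) 的注意力权重矩阵;
    若某头对 BOS (第 0 列) 的平均注意力超过阈值, 判为坍塌头。
    阈值 0.9 为示意性假设。"""
    bos_mass = attn[:, :, 0].mean(axis=1)  # 每个头对首 token 的平均权重
    return np.where(bos_mass > threshold)[0]

num_heads, seq = 4, 6
attn = np.full((num_heads, seq, seq), 1.0 / seq)  # 三个头均匀注意
attn[2] = 0.0
attn[2, :, 0] = 1.0  # 第 2 个头几乎只看 BOS -> 坍塌
assert list(collapsed_heads(attn)) == [2]
```

检出坍塌头之后,论文的“手术式重初始化”即对这些头的 Q/K/V 权重做靶向重置、输出投影置零,并冻结其余参数的梯度。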
[NLP-23] Build Borrow or Just Fine-Tune? A Political Scientists Guide to Choosing NLP Models
【速读】: 该论文试图解决政治科学家在采用自然语言处理(Natural Language Processing, NLP)工具时面临的模型选择困境:是构建领域特定模型、迁移现有模型,还是直接在任务数据上微调通用模型?这一决策涉及性能、成本与所需专业知识之间的权衡,但学界缺乏实证指导。解决方案的关键在于通过实证比较——以冲突事件分类为测试案例,将微调后的ModernBERT(命名为Confli-mBERT)与当前领域黄金标准的领域专用预训练模型ConfliBERT进行系统对比。结果显示,Confli-mBERT整体准确率为75.46%,略低于ConfliBERT的79.34%,但性能差距主要集中于发生频率低于2%的罕见事件类别;在高频攻击类型(如爆炸/炸弹袭击和绑架)上两者几乎无差异。因此,作者提出一个基于类别频次、误差容忍度与可用资源三者交集的实用决策框架,主张模型选择应取决于具体研究需求而非抽象意义上的“优劣”。
链接: https://arxiv.org/abs/2603.09595
作者: Shreyas Meher
机构: Erasmus University Rotterdam (鹿特丹伊拉斯姆斯大学)
类目: Computation and Language (cs.CL)
备注: 33 pages, 5 figures, 13 tables (including appendix)
Abstract:Political scientists increasingly face a consequential choice when adopting natural language processing tools: build a domain-specific model from scratch, borrow and adapt an existing one, or simply fine-tune a general-purpose model on task data? Each approach occupies a different point on the spectrum of performance, cost, and required expertise, yet the discipline has offered little empirical guidance on how to navigate this trade-off. This paper provides such guidance. Using conflict event classification as a test case, I fine-tune ModernBERT on the Global Terrorism Database (GTD) to create Confli-mBERT and systematically compare it against ConfliBERT, a domain-specific pretrained model that represents the current gold standard. Confli-mBERT achieves 75.46% accuracy compared to ConfliBERT’s 79.34%. Critically, the four-percentage-point gap is not uniform: on high-frequency attack types such as Bombing/Explosion (F1 = 0.95 vs. 0.96) and Kidnapping (F1 = 0.92 vs. 0.91), the models are nearly indistinguishable. Performance differences concentrate in rare event categories comprising fewer than 2% of all incidents. I use these findings to develop a practical decision framework for political scientists considering any NLP-assisted research task: when does the research question demand a specialized model, and when does an accessible fine-tuned alternative suffice? The answer, I argue, depends not on which model is “better” in the abstract, but on the specific intersection of class prevalence, error tolerance, and available resources. The model, training code, and data are publicly available on Hugging Face.
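论文提出的决策框架可以粗略编码为一个基于类别占比、容错度与资源的分支函数。以下阈值与分支均为示意性假设(仅 2% 的稀有类别界线取自论文结论),用于说明框架的逻辑,而非论文给出的算法:

```python
def choose_nlp_approach(min_class_prevalence, error_tolerance, resources):
    """示意性决策规则: 论文发现性能差距集中在占比低于 2% 的
    稀有类别, 因此只有当研究依赖稀有类别且容错又低时,
    才值得投入专用/自建模型; 其余阈值与分支为假设。"""
    if min_class_prevalence >= 0.02 or error_tolerance == "high":
        return "fine-tune a general-purpose model"
    if resources == "high":
        return "build a domain-specific model"
    return "borrow and adapt an existing model"

assert choose_nlp_approach(0.05, "low", "low") == "fine-tune a general-purpose model"
assert choose_nlp_approach(0.01, "low", "high") == "build a domain-specific model"
assert choose_nlp_approach(0.01, "low", "low") == "borrow and adapt an existing model"
```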
[NLP-24] ALARM: Audio-Language Alignment for Reasoning Models INTERSPEECH2026
【速读】: 该论文旨在解决当前大型音频语言模型(Large Audio Language Models, ALMs)在与具备推理能力的语言模型(Reasoning Language Models, RLMs)结合时所面临的挑战:即传统方法通过冻结预训练语言模型(LLM)并仅训练适配器(adapter)来处理自生成的文本目标,会导致RLMs因暴露其内部思维链(chain-of-thought)而产生不自然的响应。解决方案的关键在于提出“自重述”(self-rephrasing)机制,将自动生成的文本响应转化为适合音频理解的变体,同时保持分布对齐;此外,通过融合与压缩多个音频编码器以增强表征能力,并构建包含19K小时多模态数据(语音、音乐、声音)的600万实例多任务语料库进行高效训练,从而在保留文本能力的同时显著提升音频推理性能。
链接: https://arxiv.org/abs/2603.09556
作者: Petr Grinberg,Hassan Shahmohammadi
机构: EPFL(瑞士联邦理工学院); Sony Europe Ltd.(索尼欧洲有限公司)
类目: Computation and Language (cs.CL)
备注: Submitted to Interspeech2026
Abstract:Large audio language models (ALMs) extend LLMs with auditory understanding. A common approach freezes the LLM and trains only an adapter on self-generated targets. However, this fails for reasoning LLMs (RLMs) whose built-in chain-of-thought traces expose the textual surrogate input, yielding unnatural responses. We propose self-rephrasing, converting self-generated responses into audio-understanding variants compatible with RLMs while preserving distributional alignment. We further fuse and compress multiple audio encoders for stronger representations. For training, we construct a 6M-instance multi-task corpus (2.5M unique prompts) spanning 19K hours of speech, music, and sound. Our 4B-parameter ALM outperforms similarly sized models and surpasses most larger ALMs on related audio-reasoning benchmarks, while preserving textual capabilities with a low training cost. Notably, we achieve the best open-source result on the MMAU-speech and MMSU benchmarks and rank third among all the models.
[NLP-25] Enhancing Debunking Effectiveness through LLM-based Personality Adaptation
【速读】: 该论文旨在解决虚假新闻(fake news)传播中信息辟谣效果有限的问题,特别是如何提升辟谣信息的说服力以适应不同受众的心理特征。其核心解决方案是利用大语言模型(Large Language Models, LLMs)根据用户的大五人格特质(Big Five personality traits)生成个性化的辟谣内容,通过将通用辟谣信息映射至特定人格特征(如外向性、宜人性、尽责性、神经质和开放性)来增强信息的相关性和接受度。关键创新在于使用另一个LLM作为自动化评估器,模拟对应人格特质对辟谣信息进行打分,从而高效替代传统人工评估,同时验证个性化策略的有效性——结果表明,开放性较高的个体更易被说服,而神经质倾向则可能削弱说服效果,且多模型评估能提供更稳健的结论。
链接: https://arxiv.org/abs/2603.09533
作者: Pietro Dell’Oglio,Alessandro Bondielli,Francesco Marcelloni,Lucia C. Passaro
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: In: Computational Intelligence. IJCCI 2025. Springer, Cham (2026)
Abstract:This study proposes a novel methodology for generating personalized fake news debunking messages by prompting Large Language Models (LLMs) with persona-based inputs aligned to the Big Five personality traits: Extraversion, Agreeableness, Conscientiousness, Neuroticism, and Openness. Our approach guides LLMs to transform generic debunking content into personalized versions tailored to specific personality profiles. To assess the effectiveness of these transformations, we employ a separate LLM as an automated evaluator simulating corresponding personality traits, thereby eliminating the need for costly human evaluation panels. Our results show that personalized messages are generally seen as more persuasive than generic ones. We also find that traits like Openness tend to increase persuadability, while Neuroticism can lower it. Differences between LLM evaluators suggest that using multiple models provides a clearer picture. Overall, this work demonstrates a practical way to create more targeted debunking messages exploiting LLMs, while also raising important ethical questions about how such technology might be used.
[NLP-26] You Didn't Have to Say It like That: Subliminal Learning from Faithful Paraphrases EACL2026
【速读】: 该论文旨在解决生成式 AI(Generative AI)在训练过程中通过合成数据隐性学习教师模型行为特征的问题,即“亚意识学习”(subliminal learning)现象是否存在于自然语言改写(paraphrase)场景中,以及是否可以通过显式矛盾内容阻断这种传播。其关键解决方案在于设计实验验证:即使改写文本语义与教师偏好无关,甚至明确表达相反立场,学生模型仍会因训练数据来源而习得教师的偏好,且这一现象在严格过滤保证改写忠实度的情况下依然成立。这揭示了基于内容的审查机制无法识别此类隐蔽信息传递,对依赖自动生成训练数据的模型迭代流程构成潜在风险。
链接: https://arxiv.org/abs/2603.09517
作者: Isaia Gisler(1),Zhonghao He(2),Tianyi Qiu(3) ((1) ETH Zürich, (2) University of Cambridge, (3) Peking University)
机构: ETH Zürich; University of Cambridge; Peking University
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted for Spotlight presentation at EACL 2026 SRW. 5 pages, 2 figures, plus appendix. Equal supervision by Zhonghao He and Tianyi Qiu
Abstract:When language models are trained on synthetic data, they (student model) can covertly acquire behavioral traits from the data-generating model (teacher model). Subliminal learning refers to the transmission of traits from a teacher to a student model via training on data unrelated to those traits. Prior work demonstrated this in the training domains of number sequences, code, and math Chain-of-Thought traces including transmission of misaligned behaviors. We investigate whether transmission occurs through natural language paraphrases with fixed semantic content, and whether content explicitly contradicting the teacher’s preference can block it. We find that training on paraphrases from a teacher system-prompted to love a particular animal increases a student’s preference for that animal by up to 19 percentage points. This occurs when paraphrased content is semantically unrelated to the animal, or even when it explicitly expresses dislike. The transmission succeeds despite aggressive filtering to ensure paraphrase fidelity. This raises concerns for pipelines where models generate their own training data: content-based inspection cannot detect such transmission, and even preference-contradicting content fails to prevent it.
[NLP-27] Modelling the Diachronic Emergence of Phoneme Frequency Distributions
【速读】: 该论文试图解决的问题是:跨语言中音位频率分布所呈现的统计规律性(如指数尾部的排名-频率模式以及音位库存规模与分布相对熵之间的负相关关系)的成因尚不明确,是否可由历史音变过程自然衍生。解决方案的关键在于构建一个随机的音系演变模型,并引入两个额外假设——与功能负荷(functional load)相关的效应和趋向于特定库存规模的稳定倾向——从而在模拟中重现了实证观察到的音位分布形态及其与库存大小的关系,表明这些统计规律可能是音系系统历时演变的自然结果,而非显式优化或补偿机制所致。
链接: https://arxiv.org/abs/2603.09503
作者: Fermín Moscoso del Prado Martín,Suchir Salhan
机构: University of Cambridge(剑桥大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Phoneme frequency distributions exhibit robust statistical regularities across languages, including exponential-tailed rank-frequency patterns and a negative relationship between phonemic inventory size and the relative entropy of the distribution. The origin of these patterns remains largely unexplained. In this paper, we investigate whether they can arise as consequences of the historical processes that shape phonological systems. We introduce a stochastic model of phonological change and simulate the diachronic evolution of phoneme inventories. A naïve version of the model reproduces the general shape of phoneme rank-frequency distributions but fails to capture other empirical properties. Extending the model with two additional assumptions – an effect related to functional load and a stabilising tendency toward a preferred inventory size – yields simulations that match both the observed distributions and the negative relationship between inventory size and relative entropy. These results suggest that some statistical regularities of phonological systems may arise as natural consequences of diachronic sound change rather than from explicit optimisation or compensatory mechanisms.
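论文的随机音变模型可大致示意为:音位频率做乘性漂移;合并概率随库存偏大而升高,且低频(低功能负荷)音位更易被吞并;分裂则相反,从而趋向偏好的库存规模。以下是按这一思路写的极简模拟(各概率参数均为假设,并非论文模型本身):

```python
import random

def simulate_inventory(steps=2000, preferred_size=25, seed=0):
    """音系演变的极简随机模型 (示意): 返回降序的音位频率分布。"""
    rng = random.Random(seed)
    freqs = [1.0 / preferred_size] * preferred_size
    for _ in range(steps):
        # 乘性频率漂移 + 归一化
        freqs = [f * rng.uniform(0.9, 1.1) for f in freqs]
        total = sum(freqs)
        freqs = [f / total for f in freqs]
        n = len(freqs)
        # 稳定倾向: 库存越偏离偏好规模, 对应事件概率越高 (参数为假设)
        p_merge = 0.02 + 0.01 * max(0, n - preferred_size)
        p_split = 0.02 + 0.01 * max(0, preferred_size - n)
        if n > 2 and rng.random() < p_merge:
            # 合并: 最低频 (功能负荷最低) 的音位被随机音位吞并
            i = min(range(n), key=lambda k: freqs[k])
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            val = freqs.pop(i)
            if j > i:
                j -= 1
            freqs[j] += val
        elif rng.random() < p_split:
            # 分裂: 随机音位一分为二
            i = rng.randrange(len(freqs))
            freqs[i] /= 2
            freqs.append(freqs[i])
    return sorted(freqs, reverse=True)

freqs = simulate_inventory()
assert abs(sum(freqs) - 1.0) < 1e-6
assert all(freqs[i] >= freqs[i + 1] for i in range(len(freqs) - 1))
```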
[NLP-28] CyberThreat-Eval: Can Large Language Models Automate Real-World Threat Research?
【速读】: 该论文旨在解决当前用于评估大语言模型(Large Language Models, LLMs)在网络安全威胁情报(Cyber Threat Intelligence, CTI)领域应用效果的基准测试(benchmark)存在的三大核心问题:一是现有任务设计脱离真实分析师工作流程(如多选题形式不具现实性);二是评价指标偏重模型中心的词汇重叠度,忽视对安全分析师至关重要的可操作性与细节洞察力;三是未能覆盖CTI从初步筛选(triage)、深度检索(deep search)到情报撰写(TI drafting)的完整三阶段流程。解决方案的关键在于提出一个名为CyberThreat-Eval的新基准,该基准基于全球领先公司的日常CTI工作流收集并由专家标注,涵盖全流程任务,并采用以分析师为中心的评估指标(包括事实准确性、内容质量和运营成本),从而更真实地反映LLMs在实际场景中的能力边界与改进方向。
链接: https://arxiv.org/abs/2603.09452
作者: Xiangsen Chen,Xuan Feng,Shuo Chen,Matthieu Maitre,Sudipto Rakshit,Diana Duvieilh,Ashley Picone,Nan Tang
机构: Microsoft Research; Microsoft; Hong Kong University of Science and Technology (Guangzhou); Hong Kong University of Science and Technology
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: Accepted at TMLR
Abstract:Analyzing Open Source Intelligence (OSINT) from large volumes of data is critical for drafting and publishing comprehensive CTI reports. This process usually follows a three-stage workflow – triage, deep search and TI drafting. While Large Language Models (LLMs) offer a promising route toward automation, existing benchmarks still have limitations. These benchmarks often consist of tasks that do not reflect real-world analyst workflows. For example, human analysts rarely receive tasks in the form of multiple-choice questions. Also, existing benchmarks often rely on model-centric metrics that emphasize lexical overlap rather than actionable, detailed insights essential for security analysts. Moreover, they typically fail to cover the complete three-stage workflow. To address these issues, we introduce CyberThreat-Eval, which is collected from the daily CTI workflow of a world-leading company. This expert-annotated benchmark assesses LLMs on practical tasks across all three stages as mentioned above. It utilizes analyst-centric metrics that measure factual accuracy, content quality, and operational costs. Our evaluation using this benchmark reveals important insights into the limitations of current LLMs. For example, LLMs often lack the nuanced expertise required to handle complex details and struggle to distinguish between correct and incorrect information. To address these challenges, the CTI workflow incorporates both external ground-truth databases and human expert knowledge. TRA allows human experts to iteratively provide feedback for continuous improvement. The code is available on GitHub (this https URL) and HuggingFace (this https URL).
[NLP-29] Common Sense vs. Morality: The Curious Case of Narrative Focus Bias in LLMs LREC2026
【速读】: 该论文试图解决当前大语言模型(Large Language Models, LLMs)在面对道德困境与常识推理冲突时,倾向于优先进行道德判断而非识别常识矛盾的问题,从而影响其在现实应用中的知识准确性与逻辑一致性。解决方案的关键在于提出一个名为CoMoral的新基准数据集,该数据集包含嵌入在道德困境中的常识性矛盾,用于系统评估模型对这类冲突的识别能力;同时揭示了模型存在叙述焦点偏差(narrative focus bias),即更易识别次要角色而非主述者(叙述者)的常识矛盾,提示需通过增强推理感知训练(reasoning-aware training)来提升LLMs的常识鲁棒性(commonsense robustness)。
链接: https://arxiv.org/abs/2603.09434
作者: Saugata Purkayastha,Pranav Kushare,Pragya Paramita Pal,Sukannya Purkayastha
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at LREC 2026
Abstract:Large Language Models (LLMs) are increasingly deployed across diverse real-world applications and user communities. As such, it is crucial that these models remain both morally grounded and knowledge-aware. In this work, we uncover a critical limitation of current LLMs – their tendency to prioritize moral reasoning over commonsense understanding. To investigate this phenomenon, we introduce CoMoral, a novel benchmark dataset containing commonsense contradictions embedded within moral dilemmas. Through extensive evaluation of ten LLMs across different model sizes, we find that existing models consistently struggle to identify such contradictions without prior signal. Furthermore, we observe a pervasive narrative focus bias, wherein LLMs more readily detect commonsense contradictions when they are attributed to a secondary character rather than the primary (narrator) character. Our comprehensive analysis underscores the need for enhanced reasoning-aware training to improve the commonsense robustness of large language models.
[NLP-30] Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health EACL2026
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在医疗领域中因训练数据中的社会决定因素(Social Determinants of Health, SDoH)交互关系被忽视而引发的偏见问题,尤其关注性别与其他SDoH因素之间的复杂关联如何被模型隐式学习并影响其决策。解决方案的关键在于通过设计针对法语患者记录的系列实验,系统性地探测LLMs对性别与其它SDoH因素(如种族、经济状况等)之间交互关系的响应模式,发现模型确实依赖于嵌入的刻板印象进行性别相关判断,从而表明评估SDoH因素间的交互效应可有效补充现有仅关注单一变量偏见的评测方法。
链接: https://arxiv.org/abs/2603.09416
作者: Trung Hieu Ngo,Adrien Bazoge,Solen Quiniou,Pierre-Antoine Gourraud,Emmanuel Morin
机构: Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France; Nantes Université, CHU Nantes, Clinique des données, INSERM, CIC 1413, Nantes, France
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted as Findings at EACL 2026
Abstract:Large Language Models (LLMs) excel in Natural Language Processing (NLP) tasks, but they often propagate biases embedded in their training data, which is potentially impactful in sensitive domains like healthcare. While existing benchmarks evaluate biases related to individual social determinants of health (SDoH) such as gender or ethnicity, they often overlook interactions between these factors and lack context-specific assessments. This study investigates bias in LLMs by probing the relationships between gender and other SDoH in French patient records. Through a series of experiments, we found that embedded stereotypes can be probed using SDoH input and that LLMs rely on embedded stereotypes to make gendered decisions, suggesting that evaluating interactions among SDoH factors could usefully complement existing approaches to assessing LLM performance and bias.
[NLP-31] LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation
【速读】: 该论文旨在解决自然语言生成(Natural Language Generation, NLG)评估指标验证中依赖昂贵且耗时的人工标注的问题,尤其是在非英语数据集上缺乏可靠的人类评判标准。其解决方案的关键在于提出一种名为“LLM as a Meta-Judge”的可扩展框架,利用大语言模型(Large Language Models, LLMs)通过受控的语义退化(controlled semantic degradation)生成合成评估数据集,从而替代人工判断。该方法通过元相关性(meta-correlation)衡量合成数据与标准人类基准之间的排名一致性,实验证明其在机器翻译、问答和摘要任务中均能提供接近人类判断的可靠代理,尤其在缺乏人类标注的情况下展现出显著可行性。
链接: https://arxiv.org/abs/2603.09403
作者: Lukáš Eigler,Jindřich Libovický,David Hurych
机构: 未知
类目: Computation and Language (cs.CL)
备注: 16 pages, 1 figure, 14 tables
Abstract:Validating evaluation metrics for NLG typically relies on expensive and time-consuming human annotations, which predominantly exist only for English datasets. We propose LLM as a Meta-Judge, a scalable framework that utilizes LLMs to generate synthetic evaluation datasets via controlled semantic degradation of real data, replacing human judgment. We validate our approach using meta-correlation, measuring the alignment between metric rankings derived from synthetic data and those from standard human benchmarks. Experiments across Machine Translation, Question Answering, and Summarization demonstrate that synthetic validation serves as a reliable proxy for human judgment, achieving meta-correlations exceeding 0.9 in multilingual QA and proves to be a viable alternative where human judgments are unavailable or too expensive to obtain. Our code and data will become publicly available upon paper acceptance.
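元相关的计算本质上是同一组评测指标在合成数据与人工基准上两份排名之间的秩相关(如 Spearman 相关),可手工实现如下(指标得分的数值为假设示例):

```python
def rankdata(xs):
    """返回各元素的秩次 (简化假设: 无并列值的处理)。"""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0] * len(xs)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def spearman(a, b):
    """秩次上的 Pearson 相关, 即 Spearman 相关系数。"""
    ra, rb = rankdata(a), rankdata(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

# 元相关: 各指标在合成数据与人工基准上的表现排名是否一致 (数值为假设)
synthetic_scores = [0.81, 0.62, 0.90, 0.55]
human_scores = [0.78, 0.60, 0.88, 0.50]
meta_corr = spearman(synthetic_scores, human_scores)
assert abs(meta_corr - 1.0) < 1e-9  # 排名完全一致 -> 元相关为 1
```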
[NLP-32] Reward Prediction with Factorized World States
【速读】: 该论文旨在解决传统监督学习奖励模型因训练数据偏差而导致泛化能力受限的问题,特别是在面对新目标和新环境时表现不佳。其核心挑战在于如何在不依赖特定训练数据分布的情况下,实现跨域的准确奖励预测。解决方案的关键在于提出StateFactory方法,该方法通过语言模型将非结构化的观测转化为具有层次结构的物体-属性表示(object-attribute structure),从而在语义层面量化当前状态与目标状态之间的相似度,并在此约束下自然地估计奖励信号。这种紧凑且可解释的表征结构显著提升了奖励预测的泛化性能,实验证明其在零样本场景下优于VLWM-critic和LLM-as-a-Judge等基准模型,并进一步推动了智能体规划性能的提升。
链接: https://arxiv.org/abs/2603.09400
作者: Yijun Shen,Delong Chen,Xianming Hu,Jiaming Mi,Hongbo Zhao,Kai Zhang,Pascale Fung
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Agents must infer action outcomes and select actions that maximize a reward signal indicating how close the goal is to being reached. Supervised learning of reward models could introduce biases inherent to training data, limiting generalization to novel goals and environments. In this paper, we investigate whether well-defined world state representations alone can enable accurate reward prediction across domains. To address this, we introduce StateFactory, a factorized representation method that transforms unstructured observations into a hierarchical object-attribute structure using language models. This structured representation allows rewards to be estimated naturally as the semantic similarity between the current state and the goal state under hierarchical constraint. Overall, the compact representation structure induced by StateFactory enables strong reward generalization capabilities. We evaluate on RewardPrediction, a new benchmark dataset spanning five diverse domains and comprising 2,454 unique action-observation trajectories with step-wise ground-truth rewards. Our method shows promising zero-shot results against both VLWM-critic and LLM-as-a-Judge reward models, achieving 60% and 8% lower EPIC distance, respectively. Furthermore, this superior reward quality successfully translates into improved agent planning performance, yielding success rate gains of +21.64% on AlfWorld and +12.40% on ScienceWorld over reactive system-1 policies and enhancing system-2 agent planning. Project Page: this https URL
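“当前状态与目标状态的语义相似度”在物体-属性的分解表示下,最朴素的形式是统计目标约束被满足的比例。以下为示意性草图(状态结构与匹配方式为简化假设,真实系统中的匹配应在语义层面进行,非论文实现):

```python
def state_reward(current, goal):
    """示意性的奖励估计: 目标状态中每条物体-属性约束
    在当前状态中被精确满足的比例 (层次化匹配的简化)。"""
    total, matched = 0, 0
    for obj, attrs in goal.items():
        for attr, value in attrs.items():
            total += 1
            if current.get(obj, {}).get(attr) == value:
                matched += 1
    return matched / total if total else 1.0

goal = {"mug": {"location": "sink", "state": "clean"},
        "lamp": {"power": "on"}}
current = {"mug": {"location": "sink", "state": "dirty"},
           "lamp": {"power": "on"}}
assert abs(state_reward(current, goal) - 2 / 3) < 1e-9  # 3 条约束满足 2 条
```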
[NLP-33] Quantifying and extending the coverage of spatial categorization data sets
【速读】: 该论文旨在解决多语言空间范畴化(spatial categorization)研究中数据集覆盖不足与扩展效率低的问题。现有研究依赖人工标注场景关系(如“在…里面”、“在…旁边”等),受限于人力成本和语言多样性,难以构建高覆盖率的跨语言空间语义数据集。解决方案的关键在于利用大语言模型(large language models, LLMs)生成空间关系标签,其结果与人类标注具有较高一致性,从而可作为高效筛选新增场景与语言的依据。作者通过向Topological Relations Picture Series(TRPS)扩展42个新场景,验证了LLM引导扩展策略能显著提升场景空间覆盖度,为构建包含数十种语言和数百个场景的大规模空间语义数据集提供了可行路径。
链接: https://arxiv.org/abs/2603.09373
作者: Wanchun Li,Alexandra Carstensen,Yang Xu,Terry Regier,Charles Kemp
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Variation in spatial categorization across languages is often studied by eliciting human labels for the relations depicted in a set of scenes known as the Topological Relations Picture Series (TRPS). We demonstrate that labels generated by large language models (LLMs) align relatively well with human labels, and show how LLM-generated labels can help to decide which scenes and languages to add to existing spatial data sets. To illustrate our approach we extend the TRPS by adding 42 new scenes, and show that this extension achieves better coverage of the space of possible scenes than two previous extensions of the TRPS. Our results provide a foundation for scaling towards spatial data sets with dozens of languages and hundreds of scenes.
[NLP-34] TaSR-RAG: Taxonomy-guided Structured Reasoning for Retrieval-Augmented Generation
【速读】: 该论文旨在解决现有检索增强生成(Retrieval-Augmented Generation, RAG)系统在处理知识密集型和时效性问题时存在的三大核心缺陷:冗余上下文、信息密度低以及多跳推理能力脆弱。传统RAG依赖于无结构文本片段的一次性生成,难以保证推理链的准确性与可解释性。为此,作者提出TaSR-RAG框架——一种基于分类法引导的结构化推理方法,其关键创新在于将查询与文档统一表示为关系三元组(relational triples),并通过轻量级两级分类法(two-level taxonomy)约束实体语义以平衡泛化性和精确性;同时,该方法通过显式建模中间变量绑定表(entity binding table)实现分步证据选择,结合原始三元组语义相似度与类型三元组结构一致性进行混合匹配,从而无需显式图构建或穷举搜索即可有效缓解实体混淆(entity conflation),提升多跳推理的稳定性和可追溯性。
链接: https://arxiv.org/abs/2603.09341
作者: Jiashuo Sun,Yixuan Xie,Jimeng Shi,Shaowen Wang,Jiawei Han
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 7 tables, 5 figures
Abstract:Retrieval-Augmented Generation (RAG) helps large language models (LLMs) answer knowledge-intensive and time-sensitive questions by conditioning generation on external evidence. However, most RAG systems still retrieve unstructured chunks and rely on one-shot generation, which often yields redundant context, low information density, and brittle multi-hop reasoning. While structured RAG pipelines can improve grounding, they typically require costly and error-prone graph construction or impose rigid entity-centric structures that do not align with the query’s reasoning chain. We propose TaSR-RAG, a taxonomy-guided structured reasoning framework for evidence selection. We represent both queries and documents as relational triples, and constrain entity semantics with a lightweight two-level taxonomy to balance generalization and precision. Given a complex question, TaSR-RAG decomposes it into an ordered sequence of triple sub-queries with explicit latent variables, then performs step-wise evidence selection via hybrid triple matching that combines semantic similarity over raw triples with structural consistency over typed triples. By maintaining an explicit entity binding table across steps, TaSR-RAG resolves intermediate variables and reduces entity conflation without explicit graph construction or exhaustive search. Experiments on multiple multi-hop question answering benchmarks show that TaSR-RAG consistently outperforms strong RAG and structured-RAG baselines by up to 14%, while producing clearer evidence attribution and more faithful reasoning traces.
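下面用一段极简 Python 代码示意摘要中的"混合三元组匹配"思路:原始三元组的语义相似度与类型三元组的结构一致性加权组合。注意这只是示意性质的草图:论文中的语义相似度应基于嵌入,此处为了可独立运行改用字符级相似度占位,权重 alpha 等取值均为假设,并非官方实现。

```python
from difflib import SequenceMatcher

def semantic_sim(t1, t2):
    # 原始三元组的语义相似度;论文中应为嵌入余弦相似度,
    # 此处用字符级相似度占位,仅为使示例可独立运行
    return SequenceMatcher(None, " ".join(t1), " ".join(t2)).ratio()

def structural_sim(typed1, typed2):
    # 类型三元组的结构一致性:头类型/关系/尾类型逐项比对
    return sum(a == b for a, b in zip(typed1, typed2)) / 3.0

def hybrid_score(raw_q, typed_q, raw_d, typed_d, alpha=0.5):
    # 混合匹配:语义相似度与结构一致性的加权和(alpha 为假设值)
    return alpha * semantic_sim(raw_q, raw_d) + (1 - alpha) * structural_sim(typed_q, typed_d)

q_raw, q_typed = ("marie curie", "won", "nobel prize"), ("PERSON", "won", "AWARD")
d_raw, d_typed = ("curie", "won", "award"), ("PERSON", "won", "AWARD")
print(hybrid_score(q_raw, q_typed, d_raw, d_typed))
```

按此打分对每个子查询逐步筛选证据,即可在不显式建图的情况下兼顾语义与结构约束。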
[NLP-35] How Contrastive Decoding Enhances Large Audio Language Models? INTERSPEECH2026
【速读】: 该论文旨在解决对比解码(Contrastive Decoding, CD)在大型音频语言模型(Large Audio Language Models, LALMs)中提升性能的机制不明确以及不同策略效果差异的问题。其关键解决方案在于系统性地评估四种CD策略,并引入转移矩阵(Transition Matrix)框架来追踪推理过程中错误模式的变化,从而揭示CD仅能有效纠正因模型误判无音频输入或采用不确定性驱动猜测所导致的错误,而无法修正由错误推理或自信误断引发的错误。这一发现为基于基线错误特征选择最适配CD增强的LALM架构提供了明确指导。
链接: https://arxiv.org/abs/2603.09232
作者: Tzu-Quan Lin,Wei-Ping Huang,Yi-Cheng Lin,Hung-yi Lee
机构: National Taiwan University (台湾大学)
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Submitted to INTERSPEECH 2026. Code and additional analysis results are provided in our repository: this https URL
Abstract:While Contrastive Decoding (CD) has proven effective at enhancing Large Audio Language Models (LALMs), the underlying mechanisms driving its success and the comparative efficacy of different strategies remain unclear. This study systematically evaluates four distinct CD strategies across diverse LALM architectures. We identify Audio-Aware Decoding and Audio Contrastive Decoding as the most effective methods. However, their impact varies significantly by model. To explain this variability, we introduce a Transition Matrix framework to map error pattern shifts during inference. Our analysis demonstrates that CD reliably rectifies errors in which models falsely claim an absence of audio or resort to uncertainty-driven guessing. Conversely, it fails to correct flawed reasoning or confident misassertions. Ultimately, these findings provide a clear guideline for determining which LALM architectures are most suitable for CD enhancement based on their baseline error profiles.
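摘要中提到的对比解码(CD)家族方法,其一步解码的通用形式可以用下面的草图说明:对"有音频条件"与"无音频条件"两套 logits 做差,放大音频真正贡献的信息。对比强度 alpha 的取值为假设,具体的 Audio-Aware / Audio Contrastive Decoding 公式细节请以论文与其代码仓库为准。

```python
import numpy as np

def contrastive_decode_step(logits_with_audio, logits_without_audio, alpha=1.0):
    # 放大两种条件下 logits 的差异:被"无音频"分支同样高分偏好的
    # token(往往源于语言先验而非听到的内容)会被抑制
    adjusted = (1 + alpha) * logits_with_audio - alpha * logits_without_audio
    return int(np.argmax(adjusted))

lw = np.array([2.0, 1.0, 0.5])   # 有音频条件下的 logits
lwo = np.array([2.5, 0.2, 0.3])  # 无音频条件下的 logits
# 普通贪心解码会选 token 0;对比解码削弱了无音频分支同样偏好的 token 0
print(contrastive_decode_step(lw, lwo))
```

这也呼应了论文的结论:CD 只能纠正"模型误以为没有音频"这类错误,对推理本身的缺陷无能为力。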
[NLP-36] LooComp: Leverage Leave-One-Out Strategy to Encoder-only Transformer for Efficient Query-aware Context Compression
【速读】: 该论文旨在解决检索增强生成(Retrieval Augmented Generation, RAG)中上下文信息冗余导致的效率与性能瓶颈问题,即如何在保证问答准确性的前提下实现高效、紧凑且精确的上下文压缩。其解决方案的关键在于提出一种基于边距(margin-based)的查询驱动式上下文剪枝框架,通过量化删除某句子后线索丰富度(clue richness)的变化来识别对回答至关重要的句子,并利用复合排序损失函数强制关键句子与非关键句子之间保持显著的边距差异,从而实现高精度的上下文筛选。该方法基于轻量级编码器-only Transformer 架构,在保持高吞吐推理速度和低内存占用的同时,显著提升压缩比且不损害模型性能。
链接: https://arxiv.org/abs/2603.09222
作者: Thao Do,Dinh Phu Tran,An Vo,Seon Kwon Kim,Daeyoung Kim
机构: Korea Advanced Institute of Science and Technology (KAIST)
类目: Computation and Language (cs.CL)
备注:
Abstract:Efficient context compression is crucial for improving the accuracy and scalability of question answering. For the efficiency of Retrieval Augmented Generation, context should be delivered fast, compact, and precise to ensure clue sufficiency and budget-friendly LLM reader cost. We propose a margin-based framework for query-driven context pruning, which identifies sentences that are critical for answering a query by measuring changes in clue richness when they are omitted. The model is trained with a composite ranking loss that enforces large margins for critical sentences while keeping non-critical ones near neutral. Built on a lightweight encoder-only Transformer, our approach generally achieves strong exact-match and F1 scores with high-throughput inference and lower memory requirements than those of major baselines. In addition to efficiency, our method yields effective compression ratios without degrading answering performance, demonstrating its potential as a lightweight and practical alternative for retrieval-augmented tasks.
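摘要中"关键句保持大边距、非关键句拉向中性"的复合排序损失,可以用下面的简化草图说明。损失的具体形式(hinge 边距项加 L2 中性项)是本文作者的假设性简化,并非论文原式。

```python
import numpy as np

def margin_ranking_loss(scores, critical_mask, margin=1.0):
    # 关键句得分需高出每个非关键句至少 margin(hinge 项),
    # 非关键句得分被 L2 项拉向 0(接近中性);形式为假设的简化版
    crit = scores[critical_mask]
    noncrit = scores[~critical_mask]
    pairwise = np.maximum(0.0, margin - (crit[:, None] - noncrit[None, :])).mean()
    neutral = (noncrit ** 2).mean()
    return pairwise + neutral

scores = np.array([5.0, 3.0, 0.0])          # 模型给出的句子得分
mask = np.array([True, True, False])         # 前两句为回答所需的关键句
print(margin_ranking_loss(scores, mask))     # 边距满足时损失为 0
```

推理时只需保留得分高于某阈值的句子,即可实现查询感知的上下文剪枝。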
[NLP-37] SPAR-K: Scheduled Periodic Alternating Early Exit for Spoken Language Models
【速读】: 该论文旨在解决交织式语音语言模型(Interleaved Spoken Language Models, SLMs)在推理阶段因每一步都需执行完整Transformer深度而导致的高计算开销问题,尤其是在长语音序列场景下。解决方案的关键在于提出SPAR-K框架——一种模态感知的早期退出机制,其核心创新是引入“语音交替深度调度”策略:多数语音位置在固定中间层提前退出,而周期性地插入全深度“刷新”步骤以缓解因早期退出导致的概率分布偏移(distribution shift)。该方法在不增加额外计算开销的前提下,显著降低平均语音解码深度(最多减少11%),同时保持问答准确率(最大下降0.82%)和感知质量(MOS与WER无明显变化),并揭示了传统基于置信度的早期退出策略在SLMs中效果不佳,凸显了语音token独特统计特性对专用设计的必要性。
链接: https://arxiv.org/abs/2603.09215
作者: Hsiao-Ying Huang,Cheng-Han Chiang,Hung-yi Lee
机构: National Taiwan University (国立台湾大学)
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: 6 pages, 1 figure, 2 tables
Abstract:Interleaved spoken language models (SLMs) alternately generate text and speech tokens, but decoding at full transformer depth for every step becomes costly, especially due to long speech sequences. We propose SPAR-K, a modality-aware early exit framework designed to accelerate interleaved SLM inference while preserving perceptual quality. SPAR-K introduces a speech alternating-depth schedule: most speech positions exit at a fixed intermediate layer, while periodic full-depth “refresh” steps mitigate distribution shift due to early exit. We evaluate our framework using Step-Audio-2-mini and GLM-4-Voice across four datasets spanning reasoning, factual QA, and dialogue tasks, measuring performance in terms of ASR transcription accuracy and perceptual quality. Experimental results demonstrate that SPAR-K largely preserves question-answering accuracy with a maximum accuracy drop of 0.82% while reducing average speech decoding depth by up to 11% on Step-Audio-2-mini and 5% on GLM-4-Voice, both with negligible changes in MOS and WER and no auxiliary computation overhead. We further demonstrate that confidence-based early exit strategies, widely used in text LLMs, are suboptimal for SLMs, highlighting that the unique statistical nature of speech tokens necessitates a specialized early exit design.
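SPAR-K 的"语音交替深度调度"本质上是一个很简单的规则,下面的草图可以直观说明(全深度层数、退出层、刷新周期均为假设值,论文中针对 Step-Audio-2-mini 与 GLM-4-Voice 各有具体设定):

```python
def exit_depth(step_idx, is_speech, full_depth=32, exit_layer=20, refresh_period=8):
    # SPAR-K 调度示意:文本步始终走全深度;
    # 语音步在固定中间层提前退出,但每 refresh_period 步
    # 插入一次全深度"刷新",缓解早退导致的分布偏移
    if not is_speech:
        return full_depth
    if step_idx % refresh_period == 0:
        return full_depth  # 周期性全深度刷新步
    return exit_layer

# 一段语音解码序列中各步实际使用的深度
print([exit_depth(i, True) for i in range(10)])
```

这种按固定周期而非按 token 置信度决定是否早退的设计,正对应论文的发现:文本 LLM 中常用的置信度早退策略在语音 token 上并不适用。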
[NLP-38] Emotion is Not Just a Label: Latent Emotional Factors in LLM Processing
【速读】: 该论文旨在解决大语言模型在处理情感 tone 差异显著的文本时,其注意力机制和推理行为受情绪影响却未被充分考虑的问题。传统研究多将情绪视为预测目标(如情感分类),而本文首次将情绪作为潜在因子,系统分析其如何改变 Transformer 模型中的注意力几何结构(如局部性、质心距离和熵),并揭示这些变化与下游问答性能的相关性。解决方案的关键在于提出 AURA-QA 数据集——一个情感均衡、人工撰写的阅读理解数据集,用于可控研究情绪对模型表征的影响,并引入一种情感正则化框架,在训练过程中约束情绪条件下的表征漂移,从而提升模型在情绪波动和非情绪波动场景下的阅读理解能力,尤其在分布外迁移和域内任务中均取得稳定性能提升。
链接: https://arxiv.org/abs/2603.09205
作者: Benjamin Reichman,Adar Avasian,Samuel Webster,Larry Heck
机构: Georgia Institute of Technology(佐治亚理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language models are routinely deployed on text that varies widely in emotional tone, yet their reasoning behavior is typically evaluated without accounting for emotion as a source of representational variation. Prior work has largely treated emotion as a prediction target, for example in sentiment analysis or emotion classification. In contrast, we study emotion as a latent factor that shapes how models attend to and reason over text. We analyze how emotional tone systematically alters attention geometry in transformer models, showing that metrics such as locality, center-of-mass distance, and entropy vary across emotions and correlate with downstream question-answering performance. To facilitate controlled study of these effects, we introduce Affect-Uniform ReAding QA (AURA-QA), a question-answering dataset with emotionally balanced, human-authored context passages. Finally, an emotional regularization framework is proposed that constrains emotion-conditioned representational drift during training. Experiments across multiple QA benchmarks demonstrate that this approach improves reading comprehension in both emotionally-varying and non-emotionally varying datasets, yielding consistent gains under distribution shift and in-domain improvements on several benchmarks.
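摘要中提到的注意力几何度量(熵、质心距离/局部性)可以按如下方式计算。这是对这两个标准度量的通用实现草图,并非论文代码;输入假设为某一查询位置上已归一化或未归一化的一行注意力权重。

```python
import numpy as np

def attention_entropy(attn_row, eps=1e-12):
    # 注意力分布的熵:分布越平坦(注意力越分散)熵越高
    p = attn_row / (attn_row.sum() + eps)
    return float(-(p * np.log(p + eps)).sum())

def com_distance(attn_row, query_pos):
    # 注意力质心与查询位置的距离,可作为局部性度量:
    # 值越小,说明注意力越集中在查询位置附近
    positions = np.arange(len(attn_row))
    com = (attn_row * positions).sum() / attn_row.sum()
    return abs(com - query_pos)

uniform = np.ones(8) / 8                      # 完全分散的注意力
peaked = np.array([0.93] + [0.01] * 7)        # 集中在位置 0 的注意力
print(attention_entropy(uniform), attention_entropy(peaked))
```

论文正是用这类逐情绪统计的度量,揭示情感基调如何系统性地改变模型的注意力几何。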
[NLP-39] The Reasoning Trap – Logical Reasoning as a Mechanistic Pathway to Situational Awareness ICLR2026
【速读】: 该论文试图解决的问题是:随着大语言模型(Large Language Models, LLMs)逻辑推理能力的不断提升,其潜在的情境意识(Situational Awareness)也会随之增强,从而可能引发不可控的安全风险,例如战略性欺骗行为。解决方案的关键在于提出RAISE框架(Reasoning Advancing Into Self Examination),该框架识别出三条机制路径——演绎式自我推理(deductive self inference)、归纳式上下文识别(inductive context recognition)和溯因式自我建模(abductive self modeling),揭示了逻辑推理能力提升如何逐步推动情境意识从基础自我识别向复杂战略意图演进,并指出当前安全措施不足以应对这种渐进式风险。论文进一步提出“镜像测试”基准和“推理安全对等原则”作为具体防护手段,呼吁逻辑推理研究社区正视其在这一演化路径中的责任。
链接: https://arxiv.org/abs/2603.09200
作者: Subramanyam Sahoo,Aman Chadha,Vinija Jain,Divya Chaudhary
机构: MARS 4.0 Fellowship, Cambridge AI Safety Hub(CAISH), University of Cambridge; AWS Generative AI Innovation Center, Amazon Web Services, USA; Google, USA; Stanford University; Northeastern University, Seattle, WA, USA
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: Accepted at ICLR 2026 Workshop on Logical Reasoning of Large Language Models. 21 Pages. Position Paper
Abstract:Situational awareness, the capacity of an AI system to recognize its own nature, understand its training and deployment context, and reason strategically about its circumstances, is widely considered among the most dangerous emergent capabilities in advanced AI systems. Separately, a growing research effort seeks to improve the logical reasoning capabilities of large language models (LLMs) across deduction, induction, and abduction. In this paper, we argue that these two research trajectories are on a collision course. We introduce the RAISE framework (Reasoning Advancing Into Self Examination), which identifies three mechanistic pathways through which improvements in logical reasoning enable progressively deeper levels of situational awareness: deductive self inference, inductive context recognition, and abductive self modeling. We formalize each pathway, construct an escalation ladder from basic self recognition to strategic deception, and demonstrate that every major research topic in LLM logical reasoning maps directly onto a specific amplifier of situational awareness. We further analyze why current safety measures are insufficient to prevent this escalation. We conclude by proposing concrete safeguards, including a “Mirror Test” benchmark and a Reasoning Safety Parity Principle, and pose an uncomfortable but necessary question to the logical reasoning community about its responsibility in this trajectory.
[NLP-40] DEO: Training-Free Direct Embedding Optimization for Negation-Aware Retrieval
【速读】: 该论文旨在解决现有检索方法在处理否定(negation)和排除(exclusion)类查询时准确率不足的问题,这类查询在实际应用中较为常见但难以被传统基于嵌入(embedding)的检索模型有效捕捉。解决方案的关键在于提出一种无需训练的**直接嵌入优化(Direct Embedding Optimization, DEO)**方法:DEO将查询分解为正向(positive)与负向(negative)两个语义组件,并通过对比学习目标(contrastive objective)对查询嵌入进行优化,从而增强模型对否定信息的感知能力。该方法不依赖额外训练数据或模型微调,在保持部署简单的同时显著提升了NegConstraint基准上的性能指标(如nDCG@10和MAP@100),并在多模态检索任务中相较OpenAI CLIP实现6%的Recall@5提升,验证了其在真实场景中的实用性。
链接: https://arxiv.org/abs/2603.09185
作者: Taegyeong Lee,Jiwon Park,Seunghyun Hwang,JooYoung Jang
机构: Miri.DIH; Hanyang University; Sungkyunkwan University
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent advances in Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) have enabled diverse retrieval methods. However, existing retrieval methods often fail to accurately retrieve results for negation and exclusion queries. To address this limitation, prior approaches rely on embedding adaptation or fine-tuning, which introduce additional computational cost and deployment complexity. We propose Direct Embedding Optimization (DEO), a training-free method for negation-aware text and multimodal retrieval. DEO decomposes queries into positive and negative components and optimizes the query embedding with a contrastive objective. Without additional training data or model updates, DEO outperforms baselines on NegConstraint, with gains of +0.0738 nDCG@10 and +0.1028 MAP@100, while improving Recall@5 by +6% over OpenAI CLIP in multimodal retrieval. These results demonstrate the practicality of DEO for negation- and exclusion-aware retrieval in real-world settings.
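DEO"分解查询为正负分量、用对比目标直接优化查询嵌入"的思路,可以用下面的梯度上升草图说明。步数、学习率、负向权重 lam 等超参均为本文作者的假设值,仅用于演示免训练的嵌入优化过程,并非论文原始实现。

```python
import numpy as np

def deo_embedding(e_pos, e_neg, lam=0.5, lr=0.1, steps=50):
    # 从正向分量出发,对对比目标 sim(e, e_pos) - lam * sim(e, e_neg)
    # 做(投影)梯度上升并逐步归一化,得到否定感知的查询嵌入;
    # 归一化向量下该内积目标对 e 的梯度近似为 e_pos - lam * e_neg
    e = e_pos / np.linalg.norm(e_pos)
    for _ in range(steps):
        grad = e_pos - lam * e_neg
        e = e + lr * grad
        e = e / np.linalg.norm(e)
    return e

e_pos = np.array([1.0, 0.0])   # "正向"语义分量的嵌入(示例)
e_neg = np.array([0.0, 1.0])   # 需要排除的"负向"分量的嵌入(示例)
e = deo_embedding(e_pos, e_neg)
print(e)
```

优化后的嵌入在检索时靠近正向语义、远离被否定的语义,无需任何模型更新。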
[NLP-41] DuplexCascade: Full-Duplex Speech-to-Speech Dialogue with VAD-Free Cascaded ASR-LLM-TTS Pipeline and Micro-Turn Optimization INTERSPEECH2026
【速读】: 该论文旨在解决传统语音对话系统中因声学端点检测(Voice Activity Detection, VAD)导致的半双工交互限制与生成式AI(Generative AI)对话智能难以维持之间的矛盾问题。现有基于VAD的级联式ASR-LLM-TTS架构虽能保留大语言模型(Large Language Model, LLM)的强大语义理解能力,但受限于VAD分割机制,常迫使用户必须等待对方完成发言后才能响应,造成交互不自然;而纯端到端的无VAD模型虽支持全双工交互,却难以保持高质量的对话连贯性和语义一致性。其解决方案的关键在于提出DuplexCascade——一种无需VAD的级联式流式语音对话管道,通过将传统的按句处理方式转化为按块(chunk-wise)微轮次(micro-turn)交互机制,在保证LLM对话智能的同时实现快速双向交流,并引入一组专门设计的对话控制标记(conversational special control tokens)以在流式约束下精准调控LLM的行为,从而可靠地协调发言权切换与响应时机。
链接: https://arxiv.org/abs/2603.09180
作者: Jianing Yang,Yusuke Fujita,Yui Sudo
机构: SB Intuitions Corp.(SB Intuitions 公司); The University of Tokyo(东京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitted to Interspeech 2026
Abstract:Spoken dialog systems with cascaded ASR-LLM-TTS modules retain strong LLM intelligence, but VAD segmentation often forces half-duplex turns and brittle control. On the other hand, VAD-free end-to-end model support full-duplex interaction but is hard to maintain conversational intelligence. In this paper, we present DuplexCascade, a VAD-free cascaded streaming pipeline for full-duplex speech-to-speech dialogue. Our key idea is to convert conventional utterance-wise long turns into chunk-wise micro-turn interactions, enabling rapid bidirectional exchange while preserving the strengths of a capable text LLM. To reliably coordinate turn-taking and response timing, we introduce a set of conversational special control tokens that steer the LLM’s behavior under streaming constraints. On Full-DuplexBench and VoiceBench, DuplexCascade delivers state-of-the-art full-duplex turn-taking and strong conversational intelligence among open-source speech-to-speech dialogue systems.
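摘要中"对话控制特殊标记"协调发言权的机制,大致可以用下面的草图说明。这里的标记名 `<listen>` / `<speak>` 纯属假设示例,论文实际使用的是一组专门设计的控制标记;草图仅演示每个微轮次如何根据 LLM 输出的首个控制标记决定是否触发 TTS。

```python
def micro_turn_step(llm_reply):
    # 每个 chunk 级微轮次:LLM 以控制标记开头表明意图
    # (标记名为假设,仅示意思路)
    if llm_reply.startswith("<listen>"):
        return None  # 本微轮次保持倾听,不向 TTS 发送文本
    if llm_reply.startswith("<speak>"):
        return llm_reply[len("<speak>"):]  # 截取文本送入流式 TTS
    return llm_reply  # 无标记时按普通文本处理

print(micro_turn_step("<speak>好的,我在听。"))
```

把"整句长轮次"拆成这样的逐 chunk 决策,就是级联管道能在无 VAD 条件下实现全双工交互的关键。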
[NLP-42] Bioalignment: Measuring and Improving LLM Disposition Toward Biological Systems for AI Safety
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对生物与合成技术解决方案时可能存在的系统性偏倚问题,即模型倾向于偏好非生物(合成)方案而非生物基或仿生方案。研究发现,多数前沿和开源模型在评估中表现出对合成技术的显著偏好。其解决方案的关键在于通过针对性微调(fine-tuning),利用来自6,636篇PubMed Central(PMC)文章的约2200万token的生物问题解决语料库,对两个开源模型(Llama 3.2-3B-Instruct 和 Qwen2.5-3B-Instruct)进行训练。采用QLoRA(Quantized Low-Rank Adaptation)方法,在不损害模型通用能力的前提下,显著提升了模型对生物基解决方案的评分(p < 0.001 和 p < 0.01,经Holm-Bonferroni校正),表明少量高质量领域特定数据即可有效引导模型的价值判断向生物友好方向转变。
链接: https://arxiv.org/abs/2603.09154
作者: Trent R Northen,Mingxun Wang
机构: Bioaligned Labs; Lawrence Berkeley National Lab (劳伦斯伯克利国家实验室); UC Riverside (加州大学河滨分校)
类目: Computation and Language (cs.CL)
备注: 17 pages, 4 figures
Abstract:Large language models (LLMs) trained on internet-scale corpora can exhibit systematic biases that increase the probability of unwanted behavior. In this study, we examined potential biases towards synthetic vs. biological technological solutions across four domains (materials, energy, manufacturing, and algorithms). A sample of 5 frontier and 5 open-weight models was measured using 50 curated Bioalignment prompts with a Kelly criterion-inspired evaluation framework. According to this metric, most models were not bioaligned in that they exhibit biases in favor of synthetic (non-biological) solutions. We next examined if fine-tuning could increase the preferences of two open-weight models, Llama 3.2-3B-Instruct and Qwen2.5-3B-Instruct, for biological-based approaches. A curated corpus of ~22M tokens from 6,636 PMC articles emphasizing biological problem-solving was used first to fine-tune Llama 3B with a mixed corpus of continued training and instruction-formatted. This was then extended to Qwen 3B using instruction-formatted only. We found that QLoRA fine-tuning significantly increased the scoring of biological solutions for both models without degrading general capabilities (Holm-Bonferroni-corrected p < 0.001 and p < 0.01, respectively). This suggests that even a small amount of fine-tuning can change how models weigh the relative value of biological and bioinspired vs. synthetic approaches. Although this work focused on small open-weight LLMs, it may be extensible to much larger models and could be used to develop models that favor bio-based approaches. We release the benchmark, corpus, code, and adapter weights.
[NLP-43] Reading Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理图像形式文本时性能显著低于纯文本输入的“模态差距”(modality gap)问题。研究表明,这种差距具有任务和数据依赖性,例如数学任务在合成渲染文本上准确率下降超过60个百分点,而真实文档图像则常能匹配甚至超越文本模式表现;字体、分辨率等渲染因素是强混淆变量。通过分析4000余个样本的误差模式,发现图像模式主要放大阅读错误(如计算与格式错误),并导致部分模型出现链式思维(chain-of-thought)推理崩溃。为此,作者提出一种自蒸馏方法(self-distillation),利用模型自身生成的纯文本推理轨迹与图像输入进行联合训练,在GSM8K数据集上将图像模式准确率从30.71%提升至92.72%,且无需灾难性遗忘即可迁移至未见基准。解决方案的关键在于:通过模型自我监督的推理过程对齐,增强其在视觉输入下的逻辑一致性与鲁棒性。
链接: https://arxiv.org/abs/2603.09095
作者: Kaiser Sun,Xiaochuang Yuan,Hongjun Liu,Chen Zhao,Cheng Zhang,Mark Dredze,Fan Bai
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens. We systematically diagnose this “modality gap” by evaluating seven MLLMs across seven benchmarks in five input modes, spanning both synthetically rendered text and realistic document images from arXiv PDFs to Wikipedia pages. We find that the modality gap is task- and data-dependent. For example, math tasks degrade by over 60 points on synthetic renderings, while natural document images often match or exceed text-mode performance. Rendering choices such as font and resolution are strong confounds, with font alone swinging accuracy by up to 47 percentage points. To understand this, we conduct a grounded-theory error analysis of over 4,000 examples, revealing that image mode selectively amplifies reading errors (calculation and formatting failures) while leaving knowledge and reasoning errors largely unchanged, and that some models exhibit a chain-of-thought reasoning collapse under visual input. Motivated by these findings, we propose a self-distillation method that trains the model on its own pure text reasoning traces paired with image inputs, raising image-mode accuracy on GSM8K from 30.71% to 92.72% and transferring to unseen benchmarks without catastrophic forgetting. Overall, our study provides a systematic understanding of the modality gap and suggests a practical path toward improving visual text understanding in multimodal language models.
[NLP-44] Exclusive Self Attention
【速读】: 该论文旨在解决Transformer模型中自注意力机制(Self Attention, SA)在序列建模时对自身位置信息过度依赖的问题,从而影响上下文建模效果。解决方案的关键在于提出独占式自注意力(Exclusive Self Attention, XSA),通过约束注意力计算仅捕获与当前token值向量正交的信息(即排除来自自身位置的信号),促使模型更专注于捕捉外部上下文依赖关系,从而提升序列建模性能,尤其在长序列场景下优势更为显著。
链接: https://arxiv.org/abs/2603.09078
作者: Shuangfei Zhai
机构: Apple(苹果)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:We introduce exclusive self attention (XSA), a simple modification of self attention (SA) that improves Transformer’s sequence modeling performance. The key idea is to constrain attention to capture only information orthogonal to the token’s own value vector (thus excluding information of self position), encouraging better context modeling. Evaluated on the standard language modeling task, XSA consistently outperforms SA across model sizes up to 2.7B parameters and shows increasingly larger gains as sequence length grows.
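XSA"只保留与 token 自身 value 向量正交的信息"这一核心约束,用一个正交投影就能说明。下面的草图是对该思想的示意性实现(非官方代码,实际 XSA 是对注意力计算本身的修改,而非对输出的后处理):

```python
import numpy as np

def exclusive_attention_output(attn_out, v_self, eps=1e-8):
    # 去除注意力输出中沿 token 自身 value 向量方向的分量,
    # 只保留正交部分,迫使输出携带"来自上下文"的信息
    v = v_self / (np.linalg.norm(v_self) + eps)
    return attn_out - (attn_out @ v) * v

out = np.array([1.0, 1.0])   # 某位置的注意力输出(示例)
v = np.array([2.0, 0.0])     # 该位置自身的 value 向量(示例)
print(exclusive_attention_output(out, v))
```

剩下的分量与自身 value 正交,因此自位置信息被"独占式"地排除在外。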
[NLP-45] From Days to Minutes: An Autonomous AI Agent Achieves Reliable Clinical Triage in Remote Patient Monitoring
【速读】: 该论文旨在解决远程患者监测(Remote Patient Monitoring, RPM)中因数据量庞大导致临床人员不堪重负的问题,这一瓶颈在Tele-HF和BEAT-HF等关键试验中已显现,而尽管TIM-HF2证明了持续医生主导监测可降低30%死亡率,但其高昂成本与不可扩展性限制了广泛应用。解决方案的关键在于开发了一个名为Sentinel的自主AI代理,其核心是基于模型上下文协议(Model Context Protocol, MCP),通过整合21种临床工具并进行多步推理,实现对RPM生理参数的上下文感知分诊。该方法显著提升了分诊敏感性和一致性,且在紧急情况和所有可行动警报上的敏感度均优于单个临床医生,在保持临床可接受的过度分诊水平的同时,为实现高效、可扩展的高强度监测提供了技术路径。
链接: https://arxiv.org/abs/2603.09052
作者: Seunghwan Kim(1),Tiffany H. Kung(1 and 2),Heena Verma(1),Dilan Edirisinghe(1),Kaveh Sedehi(1),Johanna Alvarez(1),Diane Shilling(1),Audra Lisa Doyle(1),Ajit Chary(1),William Borden(1 and 3),Ming Jack Po(1) ((1) AnsibleHealth Inc., San Francisco, USA (2) Stanford School of Medicine, Stanford, USA (3) George Washington University, Washington, D.C., USA)
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 46 pages, 11 figures, Abstract in metadata is shortened to meet arXiv character limits; see PDF for full version
Abstract:Background: Remote patient monitoring (RPM) generates vast data, yet landmark trials (Tele-HF, BEAT-HF) failed because data volume overwhelmed clinical staff. While TIM-HF2 showed 24/7 physician-led monitoring reduces mortality by 30%, this model remains prohibitively expensive and unscalable. Methods: We developed Sentinel, an autonomous AI agent using Model Context Protocol (MCP) for contextual triage of RPM vitals via 21 clinical tools and multi-step reasoning. Evaluation included: (1) self-consistency (100 readings x 5 runs); (2) comparison against rule-based thresholds; and (3) validation against 6 clinicians (3 physicians, 3 NPs) using a connected matrix design. A leave-one-out (LOO) analysis compared the agent against individual clinicians; severe overtriage cases underwent independent physician adjudication. Results: Against a human majority-vote standard (N=467), the agent achieved 95.8% emergency sensitivity and 88.5% sensitivity for all actionable alerts (85.7% specificity). Four-level exact accuracy was 69.4% (quadratic-weighted kappa=0.778); 95.9% of classifications were within one severity level. In LOO analysis, the agent outperformed every clinician in emergency sensitivity (97.5% vs. 60.0% aggregate) and actionable sensitivity (90.9% vs. 69.5%). While disagreements skewed toward overtriage (22.5%), independent adjudication of severe gaps (≥2 levels) validated agent escalation in 88-94% of cases; consensus resolution validated 100%. The agent showed near-perfect self-consistency (kappa=0.850). Median cost was 0.34/triage. Conclusions: Sentinel triages RPM vitals with sensitivity exceeding individual clinicians. By automating systematic context synthesis, Sentinel addresses the core limitation of prior RPM trials, offering a scalable path toward the intensive monitoring shown to reduce mortality while maintaining a clinically defensible overtriage profile.
ACM classes: I.2.0; J.3. Submitted: [v1] Tue, 10 Mar 2026 00:50:54 UTC
[NLP-46] Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在链式思维(Chain-of-Thought, CoT)推理中产生的冗长推理路径导致高推理成本的问题,同时克服现有基于自一致性(self-consistency)的方法因需采样和聚合多条推理轨迹而带来的显著计算开销。其解决方案的关键在于提出一种基于置信度的决策框架(confidence-aware decision framework),该框架通过分析单条已完成的推理轨迹,利用句级数值特征与语言学特征(从MedQA数据集中的中间推理状态提取)来动态判断是否采用单路径或双路径推理策略,从而在不进行额外微调的情况下实现跨任务(MathQA、MedMCQA、MMLU)的准确率与效率平衡,实验证明该方法在保持与多路径基线相当准确率的同时,可减少高达80%的token使用量。
链接: https://arxiv.org/abs/2603.08999
作者: Juming Xiong,Kevin Guo,Congning Ni,Chao Yan,Katherine Brown,Avinash Baidya,Xiang Gao,Bradley Marlin,Zhijun Yin
机构: Vanderbilt University (范德比尔特大学); Vanderbilt University Medical Center (范德比尔特大学医学中心); Intuit AI Research (Intuit人工智能研究)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) achieve strong reasoning performance through chain-of-thought (CoT) reasoning, yet often generate unnecessarily long reasoning paths that incur high inference cost. Recent self-consistency-based approaches further improve accuracy but require sampling and aggregating multiple reasoning trajectories, leading to substantial additional computational overhead. This paper introduces a confidence-aware decision framework that analyzes a single completed reasoning trajectory to adaptively select between single-path and multi-path reasoning. The framework is trained using sentence-level numeric and linguistic features extracted from intermediate reasoning states in the MedQA dataset and generalizes effectively to MathQA, MedMCQA, and MMLU without additional fine-tuning. Experimental results show that the proposed method maintains accuracy comparable to multi-path baselines while using up to 80% fewer tokens. These findings demonstrate that reasoning trajectories contain rich signals for uncertainty estimation, enabling a simple, transferable mechanism to balance accuracy and efficiency in LLM reasoning.
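摘要中"基于单条轨迹特征自适应决定是否追加采样"的决策机制,可以用下面的草图说明。这里的特征名(平均 token 置信度、犹豫词比例)与线性权重纯属假设示例,论文中该决策器是用 MedQA 上提取的句级数值与语言学特征训练得到的。

```python
def should_sample_more(features, threshold=0.5):
    # 不确定性评分示意:token 平均置信度越低、犹豫用语越多,
    # 越倾向于追加采样第二条推理路径做自一致性投票
    score = (0.6 * (1 - features["avg_token_conf"])
             + 0.4 * features["hedge_word_ratio"])
    return score > threshold

trace = {"avg_token_conf": 0.95, "hedge_word_ratio": 0.0}
print(should_sample_more(trace))  # 高置信轨迹:单路径即可
```

这样只有"看起来不确定"的问题才付出多路径采样的开销,从而在保持准确率的同时大幅节省 token。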
[NLP-47] Automated Thematic Analysis for Clinical Qualitative Data: Iterative Codebook Refinement with Full Provenance
【速读】: 该论文旨在解决主题分析(Thematic Analysis, TA)在健康研究中因人工操作导致的可扩展性差和可重复性低的问题。现有基于大语言模型(Large Language Models, LLMs)的自动化方法虽具潜力,但生成的代码本(codebook)泛化能力有限且缺乏分析过程的可审计性。解决方案的关键在于提出一种结合迭代式代码本优化与完整溯源追踪(full provenance tracking)的自动化TA框架,通过多轮迭代提升代码的复用性和分布一致性,同时保持描述质量,在五个数据集上显著优于六种基线方法,尤其在儿科心脏病学临床访谈数据中生成的主题与专家标注高度一致。
链接: https://arxiv.org/abs/2603.08989
作者: Seungjun Yi,Joakim Nguyen,Huimin Xu,Terence Lim,Joseph Skrovan,Mehak Beri,Hitakshi Modi,Andrew Well,Carlos M. Mery,Yan Zhang,Mia K. Markey,Ying Ding
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校); University of Sheffield (谢菲尔德大学)
类目: Computation and Language (cs.CL)
备注: Submitted to AMIA 2026 Annual Symposium (American Medical Informatics Association)
Abstract:Thematic analysis (TA) is widely used in health research to extract patterns from patient interviews, yet manual TA faces challenges in scalability and reproducibility. LLM-based automation can help, but existing approaches produce codebooks with limited generalizability and lack analytic auditability. We present an automated TA framework combining iterative codebook refinement with full provenance tracking. Evaluated on five corpora spanning clinical interviews, social media, and public transcripts, the framework achieves the highest composite quality score on four of five datasets compared to six baselines. Iterative refinement yields statistically significant improvements on four datasets with large effect sizes, driven by gains in code reusability and distributional consistency while preserving descriptive quality. On two clinical corpora (pediatric cardiology), generated themes align with expert-annotated themes.
[NLP-48] BiCLIP: Domain Canonicalization via Structured Geometric Transformation
【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在特定领域适应(domain adaptation)中的挑战,即如何在少量标注样本下实现跨域的鲁棒对齐。其解决方案的关键在于提出BiCLIP框架,该框架基于一个核心假设:不同域之间的图像特征可通过一个可恢复的规范几何变换(canonicalized geometric transformation)关联;利用少量锚点样本(即few-shot标签数据)估计该变换,并对多模态特征施加针对性的变换以增强跨模态对齐。此方法具有极简结构和低参数开销,在11个标准基准测试中均达到最优性能,同时通过分析变换的正交性和角度分布验证了结构化对齐对于鲁棒域适应的重要性。
链接: https://arxiv.org/abs/2603.08942
作者: Pranav Mantini,Shishir K. Shah
机构: University of Houston (休斯顿大学); The University of Oklahoma (俄克拉荷马大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Recent advances in vision-language models (VLMs) have demonstrated remarkable zero-shot capabilities, yet adapting these models to specialized domains remains a significant challenge. Building on recent theoretical insights suggesting that independently trained VLMs are related by a canonical transformation, we extend this understanding to the concept of domains. We hypothesize that image features across disparate domains are related by a canonicalized geometric transformation that can be recovered using a small set of anchors. Few-shot classification provides a natural setting for this alignment, as the limited labeled samples serve as the anchors required to estimate this transformation. Motivated by this hypothesis, we introduce BiCLIP, a framework that applies a targeted transformation to multimodal features to enhance cross-modal alignment. Our approach is characterized by its extreme simplicity and low parameter footprint. Extensive evaluations across 11 standard benchmarks, including EuroSAT, DTD, and FGVCAircraft, demonstrate that BiCLIP consistently achieves state-of-the-art results. Furthermore, we provide empirical verification of existing geometric findings by analyzing the orthogonality and angular distribution of the learned transformations, confirming that structured alignment is the key to robust domain adaptation. Code is available at this https URL
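"用少量锚点恢复域间规范几何变换"的假设,经典做法之一是正交 Procrustes 求解。下面的草图用 SVD 给出该问题的闭式解,仅说明思路;BiCLIP 的具体参数化与训练方式可能不同,请以论文代码为准。

```python
import numpy as np

def estimate_canonical_transform(anchors_src, anchors_tgt):
    # 正交 Procrustes:求正交矩阵 R 使 R @ src_i ≈ tgt_i,
    # 闭式解为 M = tgt^T @ src 的 SVD 因子乘积 U @ Vt
    u, _, vt = np.linalg.svd(anchors_tgt.T @ anchors_src)
    return u @ vt

# 用一个已知旋转构造合成数据,验证变换可被少量锚点恢复
rng = np.random.default_rng(0)
src = rng.normal(size=(8, 4))            # 8 个 few-shot 锚点,4 维特征
theta = np.pi / 6
R_true = np.eye(4)
R_true[:2, :2] = [[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta), np.cos(theta)]]
tgt = src @ R_true.T                      # 目标域 = 旋转后的源域
R = estimate_canonical_transform(src, tgt)
print(np.allclose(R, R_true, atol=1e-6))
```

few-shot 分类恰好天然提供了这组锚点,这正是 BiCLIP 选择该设定的原因。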
[NLP-49] VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLM s INTERSPEECH2026
【速读】: 该论文旨在解决当前语音大语言模型(Speech Large Language Models, Speech LLMs)在语音情感识别(Speech Emotion Recognition, SER)任务中面临的两大挑战:一是从封闭集分类转向开放文本生成时引入的零样本随机性,导致评估结果对提示(prompt)高度敏感;二是传统评测基准忽视了人类情感表达的固有模糊性。解决方案的关键在于提出 VoxEmo——一个涵盖 35 个情感语料库、覆盖 15 种语言的综合性 SER 基准,并配套标准化工具包,包含从直接分类到副语言推理的不同复杂度提示;同时引入分布感知软标签协议(distribution-aware soft-label protocol)和提示集成策略(prompt-ensemble strategy),以模拟标注者分歧并更贴近真实人类感知与应用场景。实验表明,尽管零样本 Speech LLM 在硬标签准确率上落后于监督基线,但其输出更符合人类主观情感分布。
链接: https://arxiv.org/abs/2603.08936
作者: Hezhao Zhang,Huang-Cheng Chou,Shrikanth Narayanan,Thomas Hain
机构: University of Sheffield (谢菲尔德大学); University of Southern California (南加州大学)
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: submitted to Interspeech 2026
Abstract:Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Additionally, conventional speech LLM benchmarks overlook the inherent ambiguity of human emotion. Hence, we present VoxEmo, a comprehensive SER benchmark encompassing 35 emotion corpora across 15 languages for Speech LLMs. VoxEmo provides a standardized toolkit featuring varying prompt complexities, from direct classification to paralinguistic reasoning. To reflect real-world perception/application, we introduce a distribution-aware soft-label protocol and a prompt-ensemble strategy that emulates annotator disagreement. Experiments reveal that while zero-shot speech LLMs trail supervised baselines in hard-label accuracy, they uniquely align with human subjective distributions.
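论文提出的分布感知软标签协议,核心是将模型输出的情感分布与标注者投票分布直接比较,而不是只看硬标签是否命中。下面给出一个极简示意(情感类别与数值均为假设,非论文原始实现),用 Jensen-Shannon 散度衡量两个离散分布的差异:

```python
import math

def js_divergence(p, q):
    """两个离散分布之间的 Jensen-Shannon 散度(以 bit 为单位)。"""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# 假设 5 位标注者对 {neutral, happy, sad, angry} 的投票:2 票 neutral、3 票 happy
human = [2 / 5, 3 / 5, 0.0, 0.0]
# 硬标签预测:全部概率压在 argmax 类别上
hard_pred = [0.0, 1.0, 0.0, 0.0]
# 分布感知的软预测:像标注者一样保留不确定性
soft_pred = [0.35, 0.6, 0.05, 0.0]

print(js_divergence(human, hard_pred))  # 与人类分布的差异较大
print(js_divergence(human, soft_pred))  # 与人类分布的差异较小
```

软预测虽然硬标签准确率相同(argmax 都是 happy),但在分布层面更贴近标注者分歧,这正是软标签评测想要奖励的行为。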
[NLP-50] SciTaRC: Benchmarking QA on Scientific Tabular Data that Requires Language Reasoning and Complex Computation
【速读】: 该论文旨在解决当前生成式 AI(Generative AI)在理解和处理科学论文中表格数据时存在的严重局限性问题,尤其是面对需要深度语言推理与复杂计算相结合的任务时表现不佳。其解决方案的关键在于构建了一个由专家撰写的基准测试集 SciTaRC,该数据集专门设计用于评估模型对科学文献中表格数据的理解能力,并揭示出当前主流模型存在普遍的“执行瓶颈”——即即使具备正确的推理策略,模型仍难以准确执行计划,表现为代码方法在原始科学表格上脆弱、自然语言推理则主要因初始理解偏差和计算错误而失败。
链接: https://arxiv.org/abs/2603.08910
作者: Hexuan Wang,Yaxuan Ren,Srikar Bommireddypalli,Shuxian Chen,Adarsh Prabhudesai,Rongkun Zhou,Elina Baral,Philipp Koehn
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL)
备注: 18 pages, 11 figures, 7 tables
Abstract:We introduce SciTaRC, an expert-authored benchmark of questions about tabular data in scientific papers requiring both deep language reasoning and complex computation. We show that current state-of-the-art AI models fail on at least 23% of these questions, a gap that remains significant even for highly capable open-weight models like Llama-3.3-70B-Instruct, which fails on 65.5% of the tasks. Our analysis reveals a universal “execution bottleneck”: both code and language models struggle to faithfully execute plans, even when provided with correct strategies. Specifically, code-based methods prove brittle on raw scientific tables, while natural language reasoning primarily fails due to initial comprehension issues and calculation errors.
[NLP-51] ConFu: Contemplate the Future for Better Speculative Sampling ICLR2026
【速读】: 该论文旨在解决当前推测解码(Speculative Decoding)中 draft 模型因仅基于当前前缀进行预测而导致的误差累积问题,进而限制了生成效率与准确性。其核心解决方案是提出 ConFu(Contemplate the Future)框架,关键创新在于引入“展望令牌”(contemplate tokens)和软提示(soft prompts),使 draft 模型能够以极低开销利用目标模型的未来导向信号;同时设计基于 MoE 的动态展望令牌机制以实现上下文感知的未来预测,并通过锚定令牌采样与未来预测复制的训练策略提升模型对未来的鲁棒预测能力。实验表明,ConFu 在 Llama-3 3B 和 8B 模型上相较 EAGLE-3 提升了 8–11% 的 token 接受率与生成速度。
链接: https://arxiv.org/abs/2603.08899
作者: Zongyue Qin,Raghavv Goel,Mukul Gagrani,Risheek Garrepalli,Mingu Lee,Yizhou Sun
机构: 未知
类目: Computation and Language (cs.CL)
备注: accepted at ICLR 2026 workshop on Latent Implicit Thinking - Going Beyond CoT Reasoning
Abstract:Speculative decoding has emerged as a powerful approach to accelerate large language model (LLM) inference by employing lightweight draft models to propose candidate tokens that are subsequently verified by the target model. The effectiveness of this paradigm critically depends on the quality of the draft model. While recent advances such as the EAGLE series achieve state-of-the-art speedup, existing draft models remain limited by error accumulation: they condition only on the current prefix, causing their predictions to drift from the target model over steps. In this work, we propose ConFu (Contemplate the Future), a novel speculative decoding framework that enables draft models to anticipate the future direction of generation. ConFu introduces (i) contemplate tokens and soft prompts that allow the draft model to leverage future-oriented signals from the target model at negligible cost, (ii) a dynamic contemplate token mechanism with MoE to enable context-aware future prediction, and (iii) a training framework with anchor token sampling and future prediction replication that learns robust future prediction. Experiments demonstrate that ConFu improves token acceptance rates and generation speed over EAGLE-3 by 8–11% across various downstream tasks with Llama-3 3B and 8B models. We believe our work is the first to bridge speculative decoding with continuous reasoning tokens, offering a new direction for accelerating LLM inference.
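推测解码的基本"起草-验证"循环可以用如下玩具代码示意(贪心验证的简化版本,非 ConFu 或 EAGLE 的原始实现):draft 模型提出的候选 token 与目标模型的贪心选择逐一比对,接受最长匹配前缀,并在首个不一致处用目标模型的 token 纠正。token 接受率越高,每轮验证能一次性产出的 token 越多。

```python
def speculative_step(draft_tokens, target_greedy):
    """用目标模型的贪心选择验证一段 draft token:
    接受最长匹配前缀,首个不匹配处以目标 token 纠正并结束本轮。"""
    accepted = []
    for d, t in zip(draft_tokens, target_greedy):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # 目标模型纠正,结束本轮
            break
    return accepted

# draft 在第 4 个 token 处偏离目标模型
draft = ["the", "cat", "sat", "in", "a"]
target = ["the", "cat", "sat", "on", "the"]
out = speculative_step(draft, target)
print(out)  # 接受了 3 个 draft token 外加 1 个纠正 token
```

ConFu 的"展望令牌"正是为了降低这种逐步偏离(error accumulation),从而延长每轮被接受的前缀。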
[NLP-52] MultiGraSCCo: A Multilingual Anonymization Benchmark with Annotations of Personal Identifiers
【速读】: 该论文旨在解决医疗领域中敏感患者数据在机器学习应用中的隐私保护难题,尤其是如何在不违反隐私法规的前提下构建高质量的匿名化数据集。其关键解决方案是通过神经机器翻译(Neural Machine Translation, NMT)方法生成多语言匿名化基准数据,该方法不仅保留原始标注信息,还能将人名、地名等个人信息以目标语言的文化和语境适宜方式准确转换,从而实现跨语言、合规且高质量的匿名化数据合成与迁移。
链接: https://arxiv.org/abs/2603.08879
作者: Ibrahim Baroud,Christoph Otto,Vera Czehmann,Christine Hovhannisyan,Lisa Raithel,Sebastian Möller,Roland Roller
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Accessing sensitive patient data for machine learning is challenging due to privacy concerns. Datasets with annotations of personally identifiable information are crucial for developing and testing anonymization systems to enable safe data sharing that complies with privacy regulations. Since accessing real patient data is a bottleneck, synthetic data offers an efficient solution for data scarcity, bypassing privacy regulations that apply to real data. Moreover, neural machine translation can help to create high-quality data for low-resource languages by translating validated real or synthetic data from a high-resource language. In this work, we create a multilingual anonymization benchmark in ten languages, using a machine translation methodology that preserves the original annotations and renders names of cities and people in a culturally and contextually appropriate form in each target language. Our evaluation study with medical professionals confirms the quality of the translations, both in general and with respect to the translation and adaptation of personal information. Our benchmark with over 2,500 annotations of personal information can be used in many applications, including training annotators, validating annotations across institutions without legal complications, and helping improve the performance of automatic personal information detection. We make our benchmark and annotation guidelines available for further research.
[NLP-53] One Language Two Scripts: Probing Script-Invariance in LLM Concept Representations ICLR2026
【速读】: 该论文旨在解决一个核心问题:稀疏自编码器(Sparse Autoencoders, SAE)所学习到的特征是否代表抽象语义,还是仅仅与文本的书写形式(orthography)强相关。为解答此问题,作者利用塞尔维亚语双文制(Serbian digraphia)作为受控实验场景——塞尔维亚语可使用拉丁字母和西里尔字母两种书写系统,二者之间具有近乎完美的字符映射关系,从而可在保持语义不变的前提下改变文本的书写形式。关键解决方案在于:通过对比相同句子在不同书写系统下的SAE特征激活情况,发现即使两种书写方式完全不同的tokenization策略导致无共享token,其激活的SAE特征仍高度重叠,且这种跨书写系统的相似性显著高于随机基线,并超过同一书写系统内改写(paraphrasing)带来的差异。这表明SAE特征更倾向于捕捉语义层面的抽象表示,而非表面的词法或书写形式。此外,模型规模越大,该语义不变性越强,进一步支持了SAE能够提取高于token化层级的语义信息这一结论。
链接: https://arxiv.org/abs/2603.08869
作者: Sripad Karne
机构: Columbia University (哥伦比亚大学)
类目: Computation and Language (cs.CL)
备注: Accepted at the UCRL Workshop at ICLR 2026
Abstract:Do the features learned by Sparse Autoencoders (SAEs) represent abstract meaning, or are they tied to how text is written? We investigate this question using Serbian digraphia as a controlled testbed: Serbian is written interchangeably in Latin and Cyrillic scripts with a near-perfect character mapping between them, enabling us to vary orthography while holding meaning exactly constant. Crucially, these scripts are tokenized completely differently, sharing no tokens whatsoever. Analyzing SAE feature activations across the Gemma model family (270M-27B parameters), we find that identical sentences in different Serbian scripts activate highly overlapping features, far exceeding random baselines. Strikingly, changing script causes less representational divergence than paraphrasing within the same script, suggesting SAE features prioritize meaning over orthographic form. Cross-script cross-paraphrase comparisons provide evidence against memorization, as these combinations rarely co-occur in training data yet still exhibit substantial feature overlap. This script invariance strengthens with model scale. Taken together, our findings suggest that SAE features can capture semantics at a level of abstraction above surface tokenization, and we propose Serbian digraphia as a general evaluation paradigm for probing the abstractness of learned representations.
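文中"相同句子在两种文字下激活高度重叠的 SAE 特征"这一结论,可以用激活特征集合的 Jaccard 重叠度来示意(以下激活值纯属假设的玩具数据,仅演示度量方式):

```python
def active_features(activations, threshold=0.0):
    """返回激活值超过阈值的特征下标集合(SAE 特征的稀疏激活)。"""
    return {i for i, a in enumerate(activations) if a > threshold}

def jaccard(a, b):
    """两个特征集合的 Jaccard 重叠度。"""
    return len(a & b) / len(a | b) if a | b else 1.0

# 同一句塞尔维亚语句子在两种文字下的假设激活,以及同文字改写句的激活
latin      = [0.0, 1.2, 0.0, 0.8, 0.0, 0.5]
cyrillic   = [0.0, 1.1, 0.2, 0.9, 0.0, 0.4]
paraphrase = [0.7, 1.0, 0.0, 0.0, 0.0, 0.6]

print(jaccard(active_features(latin), active_features(cyrillic)))    # 跨文字:重叠高
print(jaccard(active_features(latin), active_features(paraphrase)))  # 同文字改写:重叠较低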
[NLP-54] MASEval: Extending Multi-Agent Evaluation from Models to Systems
【速读】: 该论文旨在解决当前对大语言模型(Large Language Model, LLM)驱动的智能体系统(agentic systems)评估中存在的局限性问题:现有基准测试多以模型为中心,固定了智能体系统的架构配置,未能对系统实现层面的关键组件(如拓扑结构、编排逻辑和错误处理机制)进行系统性比较。为此,作者提出了MASEval——一个与框架无关的评估库,将整个智能体系统视为分析单元,从而揭示实现选择(如框架选用)对性能的影响可与模型选择相当。其关键创新在于构建了一个支持跨框架、跨模型、跨任务的系统级评估体系,为研究者提供可解释的系统设计空间探索工具,也为实践者提供针对特定应用场景的最优实现路径。
链接: https://arxiv.org/abs/2603.08835
作者: Cornelius Emde,Alexander Rubinstein,Anmol Goel,Ahmed Heakl,Sangdoo Yun,Seong Joon Oh,Martin Gubri
机构: Parameter Lab; University of Oxford; University of Tübingen; TU Darmstadt; MBZUAI; NAVER AI Lab; KAIST
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:The rapid adoption of LLM-based agentic systems has produced a rich ecosystem of frameworks (smolagents, LangGraph, AutoGen, CAMEL, LlamaIndex, i.a.). Yet existing benchmarks are model-centric: they fix the agentic setup and do not compare other system components. We argue that implementation decisions substantially impact performance, including choices such as topology, orchestration logic, and error handling. MASEval addresses this evaluation gap with a framework-agnostic library that treats the entire system as the unit of analysis. Through a systematic system-level comparison across 3 benchmarks, 3 models, and 3 frameworks, we find that framework choice matters as much as model choice. MASEval allows researchers to explore all components of agentic systems, opening new avenues for principled system design, and practitioners to identify the best implementation for their use case. MASEval is available under the MIT licence this https URL.
[NLP-55] Fish Audio S2 Technical Report
【速读】: 该论文旨在解决开源文本到语音(Text-to-Speech, TTS)系统在多说话人、多轮对话生成以及通过自然语言指令进行控制方面的局限性。现有系统通常缺乏对复杂交互场景的适应能力,且难以实现基于自然语言描述的灵活语音生成控制。解决方案的关键在于提出 Fish Audio S2,一个支持多说话人、多轮生成并具备指令跟随能力的 TTS 系统;其核心创新包括多阶段训练策略与分阶段数据处理流程(涵盖视频字幕、语音字幕、语音质量评估和奖励建模),以及一个基于 SGLang 的生产级流式推理引擎,实现了低延迟(实时因子 RTF=0.195,首音频延迟<100ms)和可扩展的定制语音生成能力。
链接: https://arxiv.org/abs/2603.08823
作者: Shijia Liao,Yuxuan Wang,Songting Liu,Yifan Cheng,Ruoyi Zhang,Tianyu Li,Shidong Li,Yisheng Zheng,Xingwei Liu,Qingzheng Wang,Zhizhuo Zhou,Jiahua Liu,Xin Chen,Dawei Han
机构: Fish Audio(鱼音频)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:We introduce Fish Audio S2, an open-sourced text-to-speech system featuring multi-speaker, multi-turn generation, and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline covering video captioning and speech captioning, voice-quality assessment, and reward modeling. To push the frontier of open-source TTS, we release our model weights, fine-tuning code, and an SGLang-based inference engine. The inference engine is production-ready for streaming, achieving an RTF of 0.195 and a time-to-first-audio below 100 ms. Our code and weights are available on GitHub (this https URL) and Hugging Face (this https URL). We highly encourage readers to visit this https URL to try custom voices.
[NLP-56] Self-hosted Lecture-to-Quiz: Local LLM MCQ Generation with Deterministic Quality Control
【速读】: 该论文旨在解决教育场景中基于生成式 AI (Generative AI) 自动生成多选题(MCQs)时面临的隐私泄露、质量不可控及部署依赖外部大语言模型(LLM)服务的问题。其解决方案的关键在于构建一个端到端的自托管(API-free)流水线,利用本地运行的 LLM 生成初稿,并通过确定性的质量控制(QC)机制筛选出符合严格标准的题目,最终输出仅包含纯文本题库和明确 QC 追踪日志的成果,无需在部署阶段调用任何 LLM 接口。该设计实现了黑箱最小化,同时保障了数据隐私、可审计性和绿色计算(Green AI)特性。
链接: https://arxiv.org/abs/2603.08729
作者: Seine A. Shintani
机构: Chubu University (中部大学); JSPS KAKENHI Grant Number JP25K00269 (日本学术振兴会资助项目编号JP25K00269); Chubu University FY2025 Special Research Fund (CP) (中部大学2025年度特别研究基金(CP)); Chubu University FY2025 Research Institute for Industry and Economics (RIIE) Research Project (中部大学2025年度产业经济研究所研究项目)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: 16 pages, 8 tables, appendix included. Includes ancillary files (anc/) with JSONL/CSV exports, QC traces, reproducibility notebook, and dummy lecture PDFs
Abstract:We present an end-to-end self-hosted (API-free) pipeline, where API-free means that lecture content is not sent to any external LLM service, that converts lecture PDFs into multiple-choice questions (MCQs) using a local LLM plus deterministic quality control (QC). The pipeline is designed for black-box minimization: LLMs may assist drafting, but the final released artifacts are plain-text question banks with an explicit QC trace and without any need to call an LLM at deployment time. We run a seed sweep on three short “dummy lectures” (information theory, thermodynamics, and statistical mechanics), collecting 15 runs x 8 questions = 120 accepted candidates (122 attempts total under bounded retries). All 120 accepted candidates satisfy hard QC checks (JSON schema conformance, a single marked correct option, and numeric/constant equivalence tests); however, the warning layer flags 8/120 items (spanning 8 runs) that expose residual quality risks such as duplicated distractors or missing rounding instructions. We report a warning taxonomy with concrete before-after fixes, and we release the final 24-question set (three lectures x 8 questions) as JSONL/CSV for Google Forms import (e.g., via Apps Script or API tooling) included as ancillary files under anc/. Finally, we position the work through the AI to Learn (AI2L) rubric lens and argue that self-hosted MCQ generation with explicit QC supports privacy, accountability, and Green AI in educational workflows.
[NLP-57] VeriInteresting: An Empirical Study of Model-Prompt Interactions in Verilog Code Generation
【速读】: 该论文旨在解决生成式 AI(Generative AI)在Verilog代码生成任务中,模型特性(如推理能力、专业化程度)与提示工程策略(prompt engineering strategies)之间复杂交互关系不明确的问题。其解决方案的关键在于通过受控的因子设计实验,系统评估多种小型和大型语言模型(包括通用型、推理增强型及领域专用型),并结合结构化输出、思维链(chain-of-thought reasoning)、上下文学习(in-context learning)以及基于遗传-帕累托优化(Genetic-Pareto)的提示进化方法,在两个Verilog基准测试上识别出不同模型类别对提示结构与优化策略的响应模式,从而揭示哪些趋势具有跨模型和基准的普适性,哪些则依赖于特定的模型-提示组合。
链接: https://arxiv.org/abs/2603.08715
作者: Luca Collini,Andrew Hennesee,Patrick Yubeaton,Siddharth Garg,Ramesh Karri
机构: 未知
类目: Hardware Architecture (cs.AR); Computation and Language (cs.CL)
备注: Submitted for peer review
Abstract:Rapid advances in language models (LMs) have created new opportunities for automated code generation while complicating trade-offs between model characteristics and prompt design choices. In this work, we provide an empirical map of recent trends in LMs for Verilog code generation, focusing on interactions among model reasoning, specialization, and prompt engineering strategies. We evaluate a diverse set of small and large LMs, including general-purpose, reasoning, and domain-specific variants. Our experiments use a controlled factorial design spanning benchmark prompts, structured outputs, prompt rewriting, chain-of-thought reasoning, in-context learning, and evolutionary prompt optimization via Genetic-Pareto. Across two Verilog benchmarks, we identify patterns in how model classes respond to structured prompts and optimization, and we document which trends generalize across LMs and benchmarks versus those that are specific to particular model-prompt combinations.
[NLP-58] Skip to the Good Part: Representation Structure Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLM s ICLR2026
【速读】: 该论文旨在解决扩散语言模型(diffusion language models, dLLMs)与自回归语言模型(autoregressive language models, AR LLMs)在内部表征结构上的差异问题,特别是训练目标如何影响模型各层和token级别的表征层次性与冗余性。研究发现,扩散目标促使模型形成更具层次性的抽象表征,早期层存在显著冗余且减少对近期信息的依赖,而自回归目标则产生深度耦合的表征;更重要的是,即使使用扩散训练,从AR初始化的模型仍保留AR式的表征动态,表明初始结构具有持久影响。解决方案的关键在于利用这种观测到的表征冗余性,提出一种无需架构修改或KV缓存共享的静态推理时层跳过(layer-skipping)方法,从而实现高达18.75%的浮点运算量(FLOPs)节省,同时保持90%以上的推理与代码生成性能,为dLLMs提供了可独立于KV缓存的高效推理路径。
链接: https://arxiv.org/abs/2603.07475
作者: Raghavv Goel,Risheek Garrepalli,Sudhanshu Agrawal,Chris Lott,Mingu Lee,Fatih Porikli
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at Sci4DL and Delta workshops at ICLR 2026
Abstract:Autoregressive (AR) language models form representations incrementally through left-to-right prediction, whereas diffusion language models (dLLMs) are trained via full-sequence denoising. Although recent dLLMs match AR performance, it remains unclear whether diffusion objectives fundamentally reshape internal representations across depth. We perform the first layer- and token-wise representational analysis comparing native dLLMs (LLaDA), native AR models (Qwen2.5), and AR-initialized dLLMs (Dream-7B). We find that diffusion objectives result in different, more hierarchical abstractions with substantial early-layer redundancy and reduced recency bias, while AR objectives produce tightly coupled, depth-dependent representations. Critically, AR-initialized dLLMs retain AR-like representational dynamics despite diffusion training, revealing persistent initialization bias. Leveraging this observed representational redundancy, we introduce a static, task-agnostic inference-time layer-skipping method requiring no architectural changes or KV-cache sharing. Native dLLMs achieve up to 18.75% FLOPs reduction while preserving over 90% performance on reasoning and code generation benchmarks, whereas AR models degrade sharply under comparable skipping. These results link training objectives to representational structure and enable practical, cache-orthogonal efficiency gains.
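论文利用早期层表征冗余做静态跳层。这一思路可以示意如下(阈值与隐藏状态均为假设的玩具数值,非论文原始判据):若某层输出与其输入的余弦相似度超过阈值,说明该层对表征的改变很小、近似恒等,推理时可以跳过。

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def skippable_layers(layer_outputs, tau=0.98):
    """若第 i 层输出与其输入(第 i-1 层输出)的余弦相似度 >= tau,
    则将第 i 层标记为可跳过(表征几乎未变,近似恒等变换)。"""
    skip = []
    for i in range(1, len(layer_outputs)):
        if cosine(layer_outputs[i - 1], layer_outputs[i]) >= tau:
            skip.append(i)
    return skip

# 某个 token 在 5 层中的假设隐藏状态:第 1、2 层几乎不改变表征
states = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.98, 0.06, 0.01],
    [0.2, 0.9, 0.4],
    [0.1, 0.3, 0.95],
]
print(skippable_layers(states))
```

对原生 dLLM 而言,这类冗余集中在早期层,因此静态跳层能在不改架构、不共享 KV 缓存的情况下直接省去部分 FLOPs。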
[NLP-59] Let's Verify Math Questions Step by Step
【速读】: 该论文旨在解决数学问题数据集中存在大量格式错误、逻辑矛盾或信息不足的 ill-posed(病态)问题,这些问题会引入标签噪声并导致模型训练效率低下。为应对这一挑战,作者提出 Math Question Verification (MathQ-Verify),其关键在于设计了一个五阶段的验证流水线:首先进行格式级验证以确保语法正确性;其次将问题形式化并分解为原子条件,逐项与数学定义校验;接着检测条件间的逻辑矛盾;再通过目标导向的完整性检查确认问题信息充分性;最后利用轻量级模型投票机制实现高精度筛选。该方案显著提升了验证准确率(F1提升达25个百分点),在保持高精度(约90%)的同时获得合理召回率(63%),为构建高质量数学问答数据集提供了可扩展且可靠的方法。
链接: https://arxiv.org/abs/2505.13903
作者: Chengyu Shen,Zhen Hao Wong,Runming He,Hao Liang,Meiyi Qiang,Zimo Meng,Zhengyang Zhao,Bohan Zeng,Zhengzhou Zhu,Bin Cui,Wentao Zhang
机构: Peking University (北京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have recently achieved remarkable progress in mathematical reasoning. To enable such capabilities, many existing works distill strong reasoning models into long chains of thought or design algorithms to construct high-quality math QA data for training. However, these efforts primarily focus on generating correct reasoning paths and answers, while largely overlooking the validity of the questions themselves. In this work, we propose Math Question Verification (MathQ-Verify), a novel five-stage pipeline designed to rigorously filter ill-posed or under-specified math problems. MathQ-Verify first performs format-level validation to remove redundant instructions and ensure that each question is syntactically well-formed. It then formalizes each question, decomposes it into atomic conditions, and verifies them against mathematical definitions. Next, it detects logical contradictions among these conditions, followed by a goal-oriented completeness check to ensure the question provides sufficient information for solving. To evaluate this task, we use existing benchmarks along with an additional dataset we construct, containing 2,147 math questions with diverse error types, each manually double-validated. Experiments show that MathQ-Verify achieves state-of-the-art performance across multiple benchmarks, improving the F1 score by up to 25 percentage points over the direct verification baseline. It further attains approximately 90% precision and 63% recall through a lightweight model voting scheme. MathQ-Verify offers a scalable and accurate solution for curating reliable mathematical datasets, reducing label noise and avoiding unnecessary computation on invalid questions. Our code and data are available at this https URL.
[NLP-60] From Word2Vec to Transformers: Text-Derived Composition Embeddings for Filtering Combinatorial Electrocatalysts
【速读】: 该论文旨在解决复杂成分固溶体电催化材料(compositionally complex solid solution electrocatalysts)在庞大组成空间中难以高效筛选候选材料的问题,即如何在不依赖实验标签的情况下快速缩小潜在高性能材料的搜索范围。其解决方案的关键在于提出一种无标签(label-free)的筛选策略:利用科学文本训练得到的词嵌入(embedding)表示材料组成,并通过计算与“导电性”和“介电性”两个属性概念方向的相似度,构建二维描述符空间;在此基础上采用对称帕累托前沿(symmetric Pareto-front)选择方法过滤候选子集,从而实现高效率的候选材料压缩与性能逼近。其中,基于Word2Vec的轻量级线性组合嵌入方法在多数情况下表现最优,能在保持接近最佳实测性能的同时最大化减少候选组合数量。
链接: https://arxiv.org/abs/2603.08881
作者: Lei Zhang,Markus Stricker
机构: Ruhr University Bochum (鲁尔大学波鸿分校)
类目: Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL)
备注: 15 pages, 3 figures
Abstract:Compositionally complex solid solution electrocatalysts span vast composition spaces, and even one materials system can contain more candidate compositions than can be measured exhaustively. Here we evaluate a label-free screening strategy that represents each composition using embeddings derived from scientific texts and prioritizes candidates based on similarity to two property concepts. We compare a corpus-trained Word2Vec baseline with transformer-based embeddings, where compositions are encoded either by linear element-wise mixing or by short composition prompts. Similarities to 'concept directions', the terms conductivity and dielectric, define a 2-dimensional descriptor space, and a symmetric Pareto-front selection is used to filter candidate subsets without using electrochemical labels. Performance is assessed on 15 materials libraries including noble metal alloys and multicomponent oxides. In this setting, the lightweight Word2Vec baseline, which uses a simple linear combination of element embeddings, often achieves the highest number of reductions of possible candidate compositions while staying close to the best measured performance.
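文中 Word2Vec 基线的"元素嵌入线性组合"可以示意如下(元素向量与概念向量均为假设的三维玩具数值,非论文训练得到的嵌入):组成嵌入是各元素向量按原子分数加权求和,描述符则是它与"导电性"等概念方向的余弦相似度。

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# 假设的 3 维元素嵌入(真实方法中来自科学文本训练的 Word2Vec)
element_vecs = {"Pt": [0.9, 0.1, 0.2], "Ru": [0.7, 0.3, 0.1], "O": [0.1, 0.8, 0.5]}
concept_conductivity = [1.0, 0.0, 0.1]  # 假设的概念方向向量

def composition_embedding(fractions):
    """线性元素混合:按原子分数加权求和各元素向量。"""
    v = [0.0] * 3
    for el, x in fractions.items():
        for i in range(3):
            v[i] += x * element_vecs[el][i]
    return v

emb = composition_embedding({"Pt": 0.5, "Ru": 0.5})
print(cosine(emb, concept_conductivity))  # 该组成在"导电性"方向上的描述符
```

对每个候选组成计算两个概念方向的相似度,即得到论文所说的二维描述符空间,之后再做帕累托前沿筛选。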
信息检索
[IR-0] A Voronoi Cell Formulation for Principled Token Pruning in Late-Interaction Retrieval Models
【速读】:该论文旨在解决Late-interaction模型(如ColBERT)在文档检索任务中因存储每个词元(token)的密集嵌入(dense embedding)而导致的显著索引存储开销问题。现有方法通过统计或经验指标剪枝低重要性词元嵌入,但往往缺乏理论基础或效果不佳。本文的关键解决方案是基于超空间几何,将词元剪枝建模为嵌入空间中的Voronoi单元估计问题:通过将每个词元的影响视为其Voronoi区域的度量,实现具有理论依据的剪枝策略,在保持检索性能的同时有效降低索引规模。
链接: https://arxiv.org/abs/2603.09933
作者: Yash Kankanampati,Yuxuan Zong,Nadi Tomeh,Benjamin Piwowarski,Joseph Le Roux
机构: Université Sorbonne Paris Nord, CNRS, LIPN (巴黎索邦大学-巴黎北分校,法国国家科学研究中心,LIPN实验室); Sorbonne Université, CNRS, ISIR (索邦大学,法国国家科学研究中心,ISIR实验室)
类目: Information Retrieval (cs.IR)
备注: 10 pages, 6 figures
Abstract:Late-interaction models like ColBERT offer a competitive performance across various retrieval tasks, but require storing a dense embedding for each document token, leading to a substantial index storage overhead. Past works address this by attempting to prune low-importance token embeddings based on statistical and empirical measures, but they often either lack formal grounding or are ineffective. To address these shortcomings, we introduce a framework grounded in hyperspace geometry and cast token pruning as a Voronoi cell estimation problem in the embedding space. By interpreting each token’s influence as a measure of its Voronoi region, our approach enables principled pruning that retains retrieval quality while reducing index size. Through our experiments, we demonstrate that this approach serves not only as a competitive pruning strategy but also as a valuable tool for improving and interpreting token-level behavior within dense retrieval systems.
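将 token 影响力视为其 Voronoi 区域度量的思路,可以用蒙特卡洛近似来示意(二维玩具嵌入,非论文原始算法):随机采样查询向量,统计每个 token 嵌入成为最近邻的频率,该频率即其 Voronoi 单元度量的估计;剪除度量过小的 token,其对 MaxSim 打分的影响也最小。

```python
import random

random.seed(0)

def nearest(idx_vecs, q):
    """返回与查询向量 q 最近的 token 下标(平方欧氏距离)。"""
    best, bd = None, float("inf")
    for i, v in idx_vecs:
        d = sum((a - b) ** 2 for a, b in zip(v, q))
        if d < bd:
            best, bd = i, d
    return best

# 某文档的假设二维 token 嵌入;token 2 与 token 0 几乎重合
tokens = [(0, (0.0, 0.0)), (1, (1.0, 1.0)), (2, (0.05, 0.02)), (3, (-1.0, 0.8))]

# 蒙特卡洛估计各 token 的 Voronoi 单元度量
counts = {i: 0 for i, _ in tokens}
for _ in range(2000):
    q = (random.uniform(-1.5, 1.5), random.uniform(-1.5, 1.5))
    counts[nearest(tokens, q)] += 1

measure = {i: c / 2000 for i, c in counts.items()}
pruned = [i for i, m in measure.items() if m < 0.1]  # 剪除单元度量过小的 token
print(measure, pruned)
```

这只是对"以 Voronoi 区域度量刻画 token 影响力"这一表述的直观演示;论文中的估计方法以原文为准。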
[IR-1] Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction
【速读】:该论文旨在解决文本到动作(text-motion)检索任务中现有方法因采用双编码器框架而丢失细粒度局部对应关系、导致检索精度受限且结果可解释性差的问题。其解决方案的关键在于提出一种基于关节角度的可解释动作表示,将关节级局部特征映射为结构化的伪图像(pseudo-image),并兼容预训练视觉Transformer模型;同时在文本到动作检索中引入MaxSim token-wise晚交互机制,并结合掩码语言建模(Masked Language Modeling)正则化,以增强文本与动作之间的鲁棒且可解释的对齐能力。
链接: https://arxiv.org/abs/2603.09930
作者: Yao Zhang,Zhuchenyang Liu,Yanlan He,Thomas Ploetz,Yu Xiao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:
Abstract:Text-motion retrieval aims to learn a semantically aligned latent space between natural language descriptions and 3D human motion skeleton sequences, enabling bidirectional search across the two modalities. Most existing methods use a dual-encoder framework that compresses motion and text into global embeddings, discarding fine-grained local correspondences, and thus reducing accuracy. Additionally, these global-embedding methods offer limited interpretability of the retrieval results. To overcome these limitations, we propose an interpretable, joint-angle-based motion representation that maps joint-level local features into a structured pseudo-image, compatible with pre-trained Vision Transformers. For text-to-motion retrieval, we employ MaxSim, a token-wise late interaction mechanism, and enhance it with Masked Language Modeling regularization to foster robust, interpretable text-motion alignment. Extensive experiments on HumanML3D and KIT-ML show that our method outperforms state-of-the-art text-motion retrieval approaches while offering interpretable fine-grained correspondences between text and motion. The code is available in the supplementary material.
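文中采用的 MaxSim 晚交互打分(源自 ColBERT 的 token 级交互机制)可以示意如下(二维玩具嵌入,数值纯属假设):每个查询 token 取其与所有文档 patch 嵌入点积的最大值,再对全部查询 token 求和。

```python
def maxsim_score(query_tokens, doc_patches):
    """ColBERT 风格的晚交互打分:对每个查询 token 嵌入,
    取其与文档 patch 嵌入点积的最大值,再对查询 token 求和。"""
    score = 0.0
    for q in query_tokens:
        score += max(sum(a * b for a, b in zip(q, p)) for p in doc_patches)
    return score

# 假设的二维嵌入:两个查询 token,三个动作伪图像 patch
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
print(maxsim_score(query, doc))  # 逐 token 最大值之和:0.9 与 0.8
```

由于每个查询 token 的得分都能追溯到某个具体 patch,这种逐 token 的最大匹配正是论文所说"可解释细粒度对应"的来源。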
[IR-2] Overview of the TREC 2025 Retrieval Augmented Generation (RAG) Track
【速读】:该论文旨在解决生成式 AI (Generative AI) 系统在面对复杂、现实世界信息需求时,如何有效整合检索与生成能力以提供准确、可解释且事实一致的回答的问题。其解决方案的关键在于设计一个融合检索与生成的端到端管道,并通过多层评估框架(包括相关性评估、响应完整性、溯源验证和一致性分析)确保系统输出的透明性和事实依据,同时引入长篇多句叙事型查询以更贴近深度搜索任务的实际场景,从而推动可信、上下文感知的检索增强生成(Retrieval-Augmented Generation, RAG)系统的创新与发展。
链接: https://arxiv.org/abs/2603.09891
作者: Shivani Upadhyay,Nandan Thakur,Ronak Pradeep,Nick Craswell,Daniel Campos,Jimmy Lin
机构: University of Waterloo (滑铁卢大学); Microsoft (微软); Zipf AI (Zipf AI)
类目: Information Retrieval (cs.IR)
备注: 21 pages, 8 figures, 13 tables
Abstract:The second edition of the TREC Retrieval Augmented Generation (RAG) Track advances research on systems that integrate retrieval and generation to address complex, real-world information needs. Building on the foundation of the inaugural 2024 track, this year’s challenge introduces long, multi-sentence narrative queries to better reflect the deep search task with the growing demand for reasoning-driven responses. Participants are tasked with designing pipelines that combine retrieval and generation while ensuring transparency and factual grounding. The track leverages the MS MARCO V2.1 corpus and employs a multi-layered evaluation framework encompassing relevance assessment, response completeness, attribution verification, and agreement analysis. By emphasizing multi-faceted narratives and attribution-rich answers from over 150 submissions this year, the TREC 2025 RAG Track aims to foster innovation in creating trustworthy, context-aware systems for retrieval augmented generation.
[IR-3] RecThinker: An Agentic Framework for Tool-Augmented Reasoning in Recommendation
【速读】:该论文旨在解决当前推荐系统中推荐代理(recommendation agent)依赖静态预定义工作流或受限信息进行推理的问题,导致在用户画像碎片化或物品元数据稀疏时难以判断信息充分性,从而产生次优推荐结果。其解决方案的关键在于提出 RecThinker,一个基于工具增强推理的智能体框架,通过引入“分析-规划-执行”(Analyze-Plan-Act)范式,使推荐过程从被动处理转向自主调查:模型首先评估用户-物品信息的充分性,并动态规划推理路径,主动调用专用工具(如获取用户侧、物品侧及协同信息)填补知识缺口;同时结合监督微调(Supervised Fine-Tuning, SFT)与强化学习(Reinforcement Learning, RL)的自增强训练流程,优化推理轨迹质量与工具使用效率,显著提升推荐准确性与鲁棒性。
链接: https://arxiv.org/abs/2603.09843
作者: Haobo Zhang,Yutao Zhu,Kelong Mao,Tianhao Li,Zhicheng Dou
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); JD.com (京东)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Large Language Models (LLMs) have revolutionized recommendation agents by providing superior reasoning and flexible decision-making capabilities. However, existing methods mainly follow a passive information acquisition paradigm, where agents either rely on static pre-defined workflows or perform reasoning with constrained information. It limits the agent’s ability to identify information sufficiency, often leading to suboptimal recommendations when faced with fragmented user profiles or sparse item metadata. To address these limitations, we propose RecThinker, an agentic framework for tool-augmented reasoning in recommendation, which shifts recommendation from passive processing to autonomous investigation by dynamically planning reasoning paths and proactively acquiring essential information via autonomous tool-use. Specifically, RecThinker adopts an Analyze-Plan-Act paradigm, which first analyzes the sufficiency of user-item information and autonomously invokes tool-calling sequences to bridge information gaps between available knowledge and reasoning requirements. We develop a suite of specialized tools for RecThinker, enabling the model to acquire user-side, item-side, and collaborative information for better reasoning and user-item matching. Furthermore, we introduce a self-augmented training pipeline, comprising a Supervised Fine-Tuning (SFT) stage to internalize high-quality reasoning trajectories and a Reinforcement Learning (RL) stage to optimize for decision accuracy and tool-use efficiency. Extensive experiments on multiple benchmark datasets demonstrate that RecThinker consistently outperforms strong baselines in the recommendation scenario.
[IR-4] MITRA: An AI Assistant for Knowledge Retrieval in Physics Collaborations NEURIPS2025
【速读】:该论文旨在解决大型科学合作项目(如CERN的CMS实验)中内部文档数量庞大且结构复杂,导致研究人员难以高效获取所需信息的问题,从而阻碍知识共享和科研进展。解决方案的关键在于提出一个基于检索增强生成(Retrieval-Augmented Generation, RAG)的本地化系统MITRA,其核心创新包括:利用Selenium自动化从内部数据库中提取文档,并结合OCR与版面解析实现高保真文本提取;构建两层向量数据库架构,先通过摘要定位相关物理分析再深入全文,有效消除不同分析间的歧义;整个系统(从嵌入模型到大语言模型)均部署在本地,保障敏感数据隐私。
链接: https://arxiv.org/abs/2603.09800
作者: Abhishikth Mallampalli,Sridhara Dasu
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at NeurIPS 2025 Machine Learning for the Physical Sciences workshop and Lepton Photon conference 2025 (Computing AI/ML track)
Abstract:Large-scale scientific collaborations, such as the Compact Muon Solenoid (CMS) at CERN, produce a vast and ever-growing corpus of internal documentation. Navigating this complex information landscape presents a significant challenge for both new and experienced researchers, hindering knowledge sharing and slowing down the pace of scientific discovery. To address this, we present a prototype of MITRA, a Retrieval-Augmented Generation (RAG) based system, designed to answer specific, context-aware questions about physics analyses. MITRA employs a novel, automated pipeline using Selenium for document retrieval from internal databases and Optical Character Recognition (OCR) with layout parsing for high-fidelity text extraction. Crucially, MITRA’s entire framework, from the embedding model to the Large Language Model (LLM), is hosted on-premise, ensuring that sensitive collaboration data remains private. We introduce a two-tiered vector database architecture that first identifies the relevant analysis from abstracts before focusing on the full documentation, resolving potential ambiguities between different analyses. We demonstrate the prototype’s superior retrieval performance against a standard keyword-based baseline on realistic queries and discuss future work towards developing a comprehensive research agent for large experimental collaborations.
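MITRA 的两层向量库检索流程可以示意如下(点积打分,数据均为假设的玩具示例,非系统原始实现):第一层用摘要向量定位最相关的物理分析,第二层仅在选中分析的全文块中继续检索,从而避免不同分析之间的歧义。

```python
def two_tier_retrieve(query_vec, abstracts, full_chunks, k_analyses=1):
    """第一层:按摘要相似度选出 top-k 分析;
    第二层:只在这些分析的全文块中找最相似的片段。"""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    top = sorted(abstracts, key=lambda r: dot(query_vec, r["vec"]), reverse=True)[:k_analyses]
    ids = {r["analysis_id"] for r in top}
    candidates = [c for c in full_chunks if c["analysis_id"] in ids]
    return max(candidates, key=lambda c: dot(query_vec, c["vec"]))

# 假设的二维嵌入:两个分析的摘要向量与全文块向量
abstracts = [
    {"analysis_id": "A", "vec": [1.0, 0.0]},
    {"analysis_id": "B", "vec": [0.0, 1.0]},
]
full_chunks = [
    {"analysis_id": "A", "vec": [0.9, 0.1], "text": "selection cuts for analysis A"},
    {"analysis_id": "B", "vec": [0.1, 0.9], "text": "trigger setup for analysis B"},
]
print(two_tier_retrieve([0.8, 0.2], abstracts, full_chunks)["text"])
```

先缩小到单个分析再做全文检索,既降低了检索噪声,也与"所有组件本地部署"的隐私约束兼容。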
[IR-5] Automatic Cardiac Risk Management Classification using large-context Electronic Patients Health Records
【速读】:该论文旨在解决老年心血管风险管理中手动行政编码(Administrative Coding)效率低、成本高且易出错的问题,提出了一种基于非结构化电子健康记录(Electronic Health Records, EHRs)的自动化分类框架。其解决方案的关键在于采用定制化的Transformer架构,通过专门设计的分层注意力机制有效捕捉医学文本中的长程依赖关系,从而在纵向荷兰临床叙事数据上显著优于传统机器学习基线模型和零样本设置下的通用生成式大语言模型(Generative Large Language Models, LLMs),实现了更高的F1分数与马修斯相关系数(Matthews Correlation Coefficient)。
链接: https://arxiv.org/abs/2603.09685
作者: Jacopo Vitale,David Della Morte,Luca Bacco,Mario Merone,Mark de Groot,Saskia Haitjema,Leandro Pecchia,Bram van Es
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 17 pages, 3 figures, 5 tables
Abstract:To overcome the limitations of manual administrative coding in geriatric Cardiovascular Risk Management, this study introduces an automated classification framework leveraging unstructured Electronic Health Records (EHRs). Using a dataset of 3,482 patients, we benchmarked three distinct modeling paradigms on longitudinal Dutch clinical narratives: classical machine learning baselines, specialized deep learning architectures optimized for large-context sequences, and general-purpose generative Large Language Models (LLMs) in a zero-shot setting. Additionally, we evaluated a late fusion strategy to integrate unstructured text with structured medication embeddings and anthropometric data. Our analysis reveals that the custom Transformer architecture outperforms both traditional methods and generative LLMs, achieving the highest F1-scores and Matthews Correlation Coefficients. These findings underscore the critical role of specialized hierarchical attention mechanisms in capturing long-range dependencies within medical texts, presenting a robust, automated alternative to manual workflows for clinical risk stratification.
[IR-6] Understanding the Interplay between LLMs' Utilisation of Parametric and Contextual Knowledge: A keynote at ECIR 2025
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在知识密集型任务中因参数化知识(parametric knowledge)与外部上下文知识存在冲突而导致的知识利用失效问题,尤其关注模型如何权衡其预训练记忆与新提供的上下文信息。解决方案的关键在于通过系统性评估模型内部知识状态、设计诊断测试以识别知识冲突(包括跨记忆冲突和内记忆冲突),并深入分析成功整合上下文知识的特征,从而为理解模型决策机制和优化知识更新策略提供依据。
链接: https://arxiv.org/abs/2603.09654
作者: Isabelle Augenstein
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Language Models (LMs) acquire parametric knowledge from their training process, embedding it within their weights. The increasing scalability of LMs, however, poses significant challenges for understanding a model’s inner workings and further for updating or correcting this embedded knowledge without the significant cost of retraining. Moreover, when using these language models for knowledge-intensive language understanding tasks, LMs have to integrate relevant context, mitigating their inherent weaknesses, such as incomplete or outdated knowledge. Nevertheless, studies indicate that LMs often ignore the provided context as it can be in conflict with the pre-existing LM’s memory learned during pre-training. Conflicting knowledge can also already be present in the LM’s parameters, termed intra-memory conflict. This underscores the importance of understanding the interplay between how a language model uses its parametric knowledge and the retrieved contextual knowledge. In this talk, I will aim to shed light on this important issue by presenting our research on evaluating the knowledge present in LMs, diagnostic tests that can reveal knowledge conflicts, as well as on understanding the characteristics of successfully used contextual knowledge.
[IR-7] PRECEPT: Planning Resilience via Experience Context Engineering Probing Trajectories A Unified Framework for Test-Time Adaptation with Compositional Rule Learning and Pareto-Guided Prompt Evolution
【速读】:该论文针对大语言模型(Large Language Model, LLM)代理在运行时(test-time)面临的核心挑战展开研究:一是当条件数量增长时,基于自然语言存储的知识会导致检索性能急剧下降;二是难以可靠地组合已学习的规则;三是缺乏显式的机制来检测过时或对抗性知识。为解决这些问题,作者提出了一种统一的测试时自适应框架 PRECEPT,其关键创新在于三个紧密耦合的组件:(1) 基于结构化条件键的确定性精确匹配规则检索,消除部分匹配带来的误判(理论上为0%,相较独立模型下94.4%显著提升);(2) 具备贝叶斯源可靠性评估与阈值驱动规则失效机制的冲突感知记忆系统,可处理静态-动态知识矛盾并支持漂移适应;(3) COMPASS——一个帕累托引导的提示演化外循环,通过端到端执行管道评估和优化提示。该方案实现了显著的性能提升,包括首次尝试成功率提高41.1个百分点、组合泛化能力增强33.3个百分点、物流任务中2路组合准确率达100%,以及持续学习和对抗性知识下的鲁棒性增强。
链接: https://arxiv.org/abs/2603.09641
作者: Arash Shahmansoori
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 50 pages, 14 figures. Code and reproducibility resources: this https URL
Abstract:LLM agents that store knowledge as natural language suffer steep retrieval degradation as condition count grows, often struggle to compose learned rules reliably, and typically lack explicit mechanisms to detect stale or adversarial knowledge. We introduce PRECEPT, a unified framework for test-time adaptation with three tightly coupled components: (1) deterministic exact-match rule retrieval over structured condition keys, (2) conflict-aware memory with Bayesian source reliability and threshold-based rule invalidation, and (3) COMPASS, a Pareto-guided prompt-evolution outer loop. Exact retrieval eliminates partial-match interpretation errors on the deterministic path (0% by construction, vs 94.4% under Theorem B.6's independence model at N=10) and supports compositional stacking through a semantic tier hierarchy; conflict-aware memory resolves static–dynamic disagreements and supports drift adaptation; COMPASS evaluates prompts through the same end-to-end execution pipeline. Results (9–10 seeds): PRECEPT achieves a +41.1pp first-try advantage over Full Reflexion (d>1.9), +33.3pp compositional generalization (d=1.55), 100% P_1 on 2-way logistics compositions (d=2.64), +40–55pp continuous learning gains, strong eventual robustness under adversarial static knowledge (100% logistics with adversarial SK active; partial recovery on integration), +55.0pp drift recovery (d=0.95, p=0.031), and 61% fewer steps. Core comparisons are statistically significant, often at p<0.001. Comments: 50 pages, 14 figures. Code and reproducibility resources: this https URL Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR) ACMclasses: I.2.11; I.2.6; H.3.3; I.2.4 Cite as: arXiv:2603.09641 [cs.AI] (or arXiv:2603.09641v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2603.09641 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
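"基于结构化条件键的确定性精确匹配规则检索"可用如下草图示意: 规则以排序后的条件元组为键存储, 检索是精确的字典查找, 并按层级由具体到一般地叠加可组合的规则。键名、规则内容与回退方式均为虚构的简化示例, 并非PRECEPT的真实实现:

```python
from itertools import combinations

# 规则以"结构化条件键"(排序后的元组)存储, 检索是精确字典查找而非模糊部分匹配;
# 键名与规则内容均为虚构示例。
rule_store = {
    ("domain:logistics", "vehicle:truck"): "check weight limit before routing",
    ("domain:logistics",): "always confirm destination address",
}

def retrieve_rules(conditions):
    """精确匹配检索, 并按语义层级从最具体到最一般地组合可叠加的规则。"""
    conds = sorted(conditions)
    matched = []
    for size in range(len(conds), 0, -1):          # 由具体到一般
        for subset in combinations(conds, size):   # 子集保持有序, 与存储键一致
            rule = rule_store.get(subset)
            if rule is not None:
                matched.append(rule)
    return matched

rules = retrieve_rules(["vehicle:truck", "domain:logistics"])
```

因为查找是精确键匹配, 部分匹配造成的误读在构造上即为0, 这正是摘要中"0% by construction"的含义。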
[IR-8] A-Mem: Tool-Augmented Autonomous Memory Retrieval for LLM in Long-Term Conversational QA
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在长程推理任务中因上下文窗口限制而导致的性能瓶颈问题,以及现有记忆存储与检索方法依赖预定义流程或静态相似度 top-k 检索所引发的灵活性不足问题。解决方案的关键在于提出一种工具增强的自主记忆检索框架(Tool-Augmented Autonomous Memory Retrieval framework, TA-Mem),其核心包括:(1)由LLM驱动的记忆提取代理,能够基于语义相关性自适应地将输入切分为子上下文并结构化为笔记;(2)支持多索引机制的记忆数据库,兼容基于关键字的查找与基于相似性的检索;(3)一个可根据用户输入自主选择合适工具进行记忆探索的检索代理,通过推理获取的记忆决定是否迭代或终止响应。该框架在LoCoMo数据集上显著优于现有基线方法,且对不同问题类型的工具使用分析验证了其适应性。
链接: https://arxiv.org/abs/2603.09297
作者: Mengwei Yuan,Jianan Liu,Jing Yang,Xianyou Li,Weiran Yan,Yichao Wu,Penghao Liang
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:
Abstract:Large Language Model (LLM) has exhibited strong reasoning ability in text-based contexts across various domains, yet the limitation of context window poses challenges for the model on long-range inference tasks and necessitates a memory storage system. While many current storage approaches have been proposed with episodic notes and graph representations of memory, retrieval methods still primarily rely on predefined workflows or static similarity top-k over embeddings. To address this inflexibility, we introduced a novel tool-augmented autonomous memory retrieval framework (TA-Mem), which contains: (1) a memory extraction LLM agent which is prompted to adaptively chunk an input into sub-context based on semantic correlation, and extract information into structured notes, (2) a multi-indexed memory database designed for different types of query methods including both key-based lookup and similarity-based retrieval, (3) a tool-augmented memory retrieval agent which explores the memory autonomously by selecting appropriate tools provided by the database based on the user input, and decides whether to proceed to the next iteration or to finalize the response after reasoning on the fetched memories. TA-Mem is evaluated on the LoCoMo dataset, achieving significant performance improvements over existing baseline approaches. In addition, an analysis of tool use across different question types also demonstrates the adaptivity of the proposed method.
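"检索代理根据输入自主选择工具, 并在对取回记忆做推理后决定迭代或终止"的思路可用如下草图示意。工具、记忆内容与选择策略均为虚构示例, 并非TA-Mem的真实数据库接口:

```python
# 工具与记忆内容均为虚构示例, 并非TA-Mem的真实数据库接口。
memory_by_key = {"trip:2023": "Visited Kyoto in spring 2023."}
memory_notes = ["Likes hiking.", "Allergic to peanuts.", "Visited Kyoto in spring 2023."]

def key_lookup(query):
    """基于关键字的精确查找。"""
    return [v for k, v in memory_by_key.items() if k in query]

def similarity_search(query, k=1):
    """以共享词数充当嵌入相似度的玩具替代。"""
    def score(note):
        return len(set(query.lower().split()) & set(note.lower().split()))
    return sorted(memory_notes, key=score, reverse=True)[:k]

TOOLS = {"key_lookup": key_lookup, "similarity_search": similarity_search}

def retrieve(query, max_iters=3):
    """代理循环: 依次选择工具, 对取回的记忆做推理后决定迭代或终止。"""
    plan = ["key_lookup", "similarity_search"]   # 代理选定的工具顺序
    fetched = []
    for tool in plan[:max_iters]:
        fetched = TOOLS[tool](query)
        if fetched:          # 证据足够, 终止并生成回复
            break
    return fetched

hits = retrieve("trip:2023")
```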
[IR-9] Diagnosing and Repairing Citation Failures in Generative Engine Optimization
【速读】:该论文旨在解决生成式 AI(Generative AI)内容中文档可见性不足的问题,即现有方法仅衡量文档对响应的贡献度,而非实际驱动流量回流至创作者的引用机制,且采用通用重写规则无法诊断单个文档未被引用的根本原因。解决方案的关键在于提出一种诊断导向的生成式引擎优化(Generative Engine Optimization, GEO)框架:首先构建首个涵盖引用流水线各阶段的引用失败模式分类体系;其次设计AgentGEO智能体系统,基于该分类诊断失败原因、从工具库中选择针对性修复策略并迭代直至实现引用;最后通过以文档为中心的基准测试验证优化效果在未见查询上的泛化能力。该方案在仅修改5%内容的前提下使引用率相对提升超40%,显著优于基线方法(25%),同时揭示了通用优化可能损害长尾内容,并指出部分文档存在优化难以完全解决的结构性障碍。
链接: https://arxiv.org/abs/2603.09296
作者: Zhihua Tian,Yuhan Chen,Yao Tang,Jian Liu,Ruoxi Jia
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 35 pages
Abstract:Generative Engine Optimization (GEO) aims to improve content visibility in AI-generated responses. However, existing methods measure contribution-how much a document influences a response-rather than citation, the mechanism that actually drives traffic back to creators. Also, these methods apply generic rewriting rules uniformly, failing to diagnose why individual documents are not cited. This paper introduces a diagnostic approach to GEO that asks why a document fails to be cited and intervenes accordingly. We develop a unified framework comprising: (1) the first taxonomy of citation failure modes spanning different stages of a citation pipeline; (2) AgentGEO, an agentic system that diagnoses failures using this taxonomy, selects targeted repairs from a corresponding tool library, and iterates until citation is achieved; and (3) a document-centric benchmark evaluating whether optimizations generalize across held-out queries. AgentGEO achieves over 40% relative improvement in citation rates while modifying only 5% of content, compared to 25% for baselines. Our analysis reveals that generic optimization can harm long-tail content and some documents face challenges that optimization alone cannot fully address-findings with implications for equitable visibility in AI-mediated information access.
[IR-10] Evoking User Memory: Personalizing LLM via Recollection-Familiarity Adaptive Retrieval ICLR2026
【速读】:该论文旨在解决个性化大语言模型(Large Language Models, LLMs)在记忆检索中面临的两大挑战:一是现有方法要么将用户全部历史记忆注入提示(prompt),导致计算开销大且难以扩展;二是简单采用单次相似度匹配,仅能捕捉表面关联,缺乏深层语境重建能力。解决方案的关键在于提出一种基于熟悉度(Familiarity)不确定性引导的双路径记忆检索机制(RF-Mem),其核心创新是引入认知科学中的双过程理论——通过均值得分与熵值量化熟悉度信号,高熟悉度时直接触发Top-K快速检索,低熟悉度则激活递归式回忆路径(Recollection Path),该路径利用聚类与alpha混合策略在嵌入空间中迭代扩展证据,模拟人类对情景记忆的主动重构过程。此设计实现了高效、自适应的个性化记忆检索,在固定预算和延迟约束下显著优于传统单路径方法。
链接: https://arxiv.org/abs/2603.09250
作者: Yingyi Zhang,Junyi Li,Wenlin Zhang,Penyue Jia,Xianneng Li,Yichao Wang,Derong Xu,Yi Wen,Huifeng Guo,Yong Liu,Xiangyu Zhao
机构: 未知
类目: Information Retrieval (cs.IR)
备注: Accepted by ICLR 2026
Abstract:Personalized large language models (LLMs) rely on memory retrieval to incorporate user-specific histories, preferences, and contexts. Existing approaches either overload the LLM by feeding all the user’s past memory into the prompt, which is costly and unscalable, or simplify retrieval into a one-shot similarity search, which captures only surface matches. Cognitive science, however, shows that human memory operates through a dual process: Familiarity, offering fast but coarse recognition, and Recollection, enabling deliberate, chain-like reconstruction for deeply recovering episodic content. Current systems lack both the ability to perform recollection retrieval and mechanisms to adaptively switch between the dual retrieval paths, leading to either insufficient recall or the inclusion of noise. To address this, we propose RF-Mem (Recollection-Familiarity Memory Retrieval), a familiarity uncertainty-guided dual-path memory retriever. RF-Mem measures the familiarity signal through the mean score and entropy. High familiarity leads to the direct top-K Familiarity retrieval path, while low familiarity activates the Recollection path. In the Recollection path, the system clusters candidate memories and applies alpha-mix with the query to iteratively expand evidence in embedding space, simulating deliberate contextual reconstruction. This design embeds human-like dual-process recognition into the retriever, avoiding full-context overhead and enabling scalable, adaptive personalization. Experiments across three benchmarks and corpus scales demonstrate that RF-Mem consistently outperforms both one-shot retrieval and full-context reasoning under fixed budget and latency constraints. Our code can be found in the Reproducibility Statement.
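文中"以均值与熵量化熟悉度信号、据此在双路径间切换"的门控逻辑可示意如下。阈值与分数均为假设值, 仅说明高熟悉度走快速top-K、低熟悉度激活回忆路径的判断方式:

```python
import math

def familiarity_signal(scores):
    """以检索相似度分数的均值与归一化分布的熵刻画熟悉度信号。"""
    mean = sum(scores) / len(scores)
    total = sum(scores)
    probs = [s / total for s in scores]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return mean, entropy

def choose_path(scores, mean_thresh=0.5, entropy_thresh=1.5):
    # 均值高且熵低 -> 识别可信, 走快速 top-K 熟悉度路径;
    # 否则激活递归式回忆路径。阈值为假设值。
    mean, entropy = familiarity_signal(scores)
    if mean >= mean_thresh and entropy <= entropy_thresh:
        return "familiarity"
    return "recollection"

confident = choose_path([0.9, 0.8, 0.1])       # 分布尖锐且分数高: 熟悉
uncertain = choose_path([0.2, 0.2, 0.2, 0.2])  # 分布平坦且分数低: 回忆
```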
[IR-11] DataFactory: Collaborative Multi-Agent Framework for Advanced Table Question Answering
【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)在表格问答(Table Question Answering, TableQA)任务中面临的三大核心问题:受限的上下文长度导致数据处理能力不足、幻觉现象影响答案可靠性,以及单智能体架构难以应对涉及语义关系和多跳推理的复杂推理场景。其解决方案的关键在于提出一个名为DataFactory的多智能体框架,通过专业化团队协作与自动化知识转换机制实现突破:该框架包含采用ReAct范式的Data Leader用于推理编排,并配备独立的数据库团队与知识图谱团队,将复杂查询系统性分解为结构化与关系型推理任务;同时引入映射函数T:D × S × R → G实现数据到知识图谱的自动转换,并基于自然语言交互支持灵活的跨智能体协商与动态规划,从而提升协同鲁棒性;此外,结合上下文工程策略融合历史模式与领域知识以减少幻觉并提高准确性。实验证明,该方法在多个基准数据集上显著优于基线模型,且多团队协作效果优于单一团队设置。
链接: https://arxiv.org/abs/2603.09152
作者: Tong Wang,Chi Jin,Yongkang Chen,Huan Deng,Xiaohui Kuang,Gang Zhao
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
备注: Published in Information Processing Management, 2026
Abstract:Table Question Answering (TableQA) enables natural language interaction with structured tabular data. However, existing large language model (LLM) approaches face critical limitations: context length constraints that restrict data handling capabilities, hallucination issues that compromise answer reliability, and single-agent architectures that struggle with complex reasoning scenarios involving semantic relationships and multi-hop logic. This paper introduces DataFactory, a multi-agent framework that addresses these limitations through specialized team coordination and automated knowledge transformation. The framework comprises a Data Leader employing the ReAct paradigm for reasoning orchestration, together with dedicated Database and Knowledge Graph teams, enabling the systematic decomposition of complex queries into structured and relational reasoning tasks. We formalize automated data-to-knowledge graph transformation via the mapping function T: D × S × R → G, and implement natural language-based consultation that - unlike fixed workflow multi-agent systems - enables flexible inter-agent deliberation and adaptive planning to improve coordination robustness. We also apply context engineering strategies that integrate historical patterns and domain knowledge to reduce hallucinations and improve query accuracy. Across TabFact, WikiTableQuestions, and FeTaQA, using eight LLMs from five providers, results show consistent gains. Our approach improves accuracy by 20.2% (TabFact) and 23.9% (WikiTQ) over baselines, with significant effects (Cohen's d > 1). Team coordination also outperforms single-team variants (+5.5% TabFact, +14.4% WikiTQ, +17.1% FeTaQA ROUGE-2). The framework offers design guidelines for multi-agent collaboration and a practical platform for enterprise data analysis through integrated structured querying and graph-based knowledge representation.
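映射函数 T: D × S × R → G(数据、模式、关系到知识图谱)的转换思想可用如下草图说明: 表格行在模式 S 下产生属性边, 在关系映射 R 下产生关系边。字段名与关系映射均为虚构示例, 并非DataFactory的真实模式:

```python
def table_to_graph(rows, schema, relations):
    """T: D × S × R → G 的极简示意: 把表格行转换为(主语, 谓词, 宾语)三元组。
    字段名与关系映射为虚构示例。"""
    triples = set()
    for row in rows:
        subject = row[schema["key"]]
        # 模式 S 声明的属性列 -> 属性边
        for col in schema["attributes"]:
            triples.add((subject, col, row[col]))
        # 关系映射 R (列 -> 谓词) -> 关系边
        for col, predicate in relations.items():
            triples.add((subject, predicate, row[col]))
    return triples

rows = [{"name": "Ada", "dept": "Math", "manager": "Grace"}]
schema = {"key": "name", "attributes": ["dept"]}
relations = {"manager": "reports_to"}
graph = table_to_graph(rows, schema, relations)
```

转换后的三元组可交由知识图谱团队做多跳关系推理, 与数据库团队的结构化查询互补。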
[IR-12] From Verification to Amplification: Auditing Reverse Image Search as Algorithmic Gatekeeping in Visual Misinformation Fact-checking
【速读】:该论文旨在解决视觉虚假信息(visual misinformation)在平台算法中介下的传播与验证问题,特别是逆向图像搜索(Reverse Image Search, RIS)作为算法把关工具如何影响用户在事实核查过程中的信息接触。其解决方案的关键在于系统性审计 Google RIS 的检索结果质量:通过对15天内新识别的误导性图像进行逆向搜索,并分析34,486条排名靠前的结果,发现RIS返回大量无关信息和重复虚假内容,而辟谣内容占比不足30%,且在排名中面临可见性障碍;此外,还揭示了RIS结果页面质量随时间呈倒U型曲线,可能源于视觉虚假信息刚出现时搜索引擎的“数据空洞”(data voids)。这一方法为理解算法把关机制在视觉领域的局限性提供了实证依据。
链接: https://arxiv.org/abs/2603.09130
作者: Cong Lin,Yifei Chen,Jiangyue Chen,Yingdan Lu,Yilang Peng,Cuihua Shen
机构: 未知
类目: Social and Information Networks (cs.SI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
备注:
Abstract:As visual misinformation becomes increasingly prevalent, platform algorithms act as intermediaries that curate information for users’ verification practices. Yet, it remains unclear how algorithmic gatekeeping tools, such as reverse image search (RIS), shape users’ information exposure during fact-checking. This study systematically audits Google RIS by reversely searching newly identified misleading images over a 15-day window and analyzing 34,486 collected top-ranked search results. We find that Google RIS returns a substantial volume of irrelevant information and repeated misinformation, whereas debunking content constitutes less than 30% of search results. Debunking content faces visibility challenges in rankings amid repeated misinformation and irrelevant information. Our findings also indicate an inverted U-shaped curve of RIS results page quality over time, likely due to search engine “data voids” when visual falsehoods first appear. These findings contribute to scholarship of visual misinformation verification, and extend algorithmic gatekeeping research to the visual domain.
[IR-13] Unlocking High-Fidelity Analog Joint Source-Channel Coding on Standard Digital Transceivers
【速读】:该论文旨在解决模拟联合信源信道编码(Analog Joint Source-Channel Coding, JSCC)在现代数字物理层(Digital Physical Layer, PHY)上部署时存在的软硬件不匹配问题:传统模拟JSCC依赖连续值符号以实现优异的语义通信性能,但数字PHY仅能生成离散波形并采用非可微操作,导致端到端梯度流中断,阻碍训练与部署。其解决方案的关键在于提出D2AJSCC框架,通过利用正交频分复用(OFDM)子载波结构作为波形合成器,借助计算型PHY反演确定输入比特流以调控子载波幅度和相位,从而逼近理想模拟波形;同时设计ProxyNet——一个可微神经代理模型,替代不可微的PHY操作以维持梯度传播,防止JSCC性能退化。该方法使模拟JSCC能够在标准数字PHY上实现近理想的语义传输性能,并具备随信噪比(SNR)平滑降级的能力。
链接: https://arxiv.org/abs/2603.09080
作者: Shumin Yao,Hao Chen,Yaping Sun,Nan Ma,Xiaodong Xu,Qinglin Zhao,Shuguang Cui
机构: Pengcheng Laboratory (鹏城实验室); Beijing University of Posts and Telecommunications (北京邮电大学); Macau University of Science and Technology (澳门科技大学); The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳))
类目: Information Theory (cs.IT); Information Retrieval (cs.IR)
备注:
Abstract:Analog joint source-channel coding (JSCC) has demonstrated superior performance for semantic communications through graceful degradation across channel conditions. However, a fundamental hardware-software mismatch prevents deployment on modern digital physical layers (PHYs): analog JSCC generates continuous-valued symbols requiring infinite waveform diversity, while digital PHYs produce a finite set of discrete waveforms and employ non-differentiable operations that break end-to-end gradient flow. Existing solutions either fundamentally limit representation granularity or require impractical white-box PHY access. We introduce D2AJSCC, a novel framework enabling high-fidelity analog JSCC deployment on standard digital PHYs. Our approach exploits orthogonal frequency-division multiplexing’s parallel subcarrier structure as a waveform synthesizer: computational PHY inversion determines input bitstreams that orchestrate subcarrier amplitudes and phases to emulate ideal analog waveforms. To enable end-to-end training despite non-differentiable PHY operations, we develop ProxyNet-a differentiable neural surrogate of the communication link that provides uninterrupted gradient flow while preventing JSCC degeneration. Simulation results for image transmission over WiFi PHY demonstrate that our system achieves near-ideal analog JSCC performance with graceful degradation across SNR conditions, while baselines exhibit cliff effects or catastrophic failures. By enabling next-generation semantic transmission on legacy infrastructure without hardware modification, our framework promotes sustainable network evolution and bridges the critical gap between analog JSCC’s theoretical promise and practical deployment on ubiquitous digital hardware.
[IR-14] A Consensus-Driven Multi-LLM Pipeline for Missing-Person Investigations
【速读】:该论文旨在解决失踪儿童案件调查中前72小时关键期内信息处理效率与准确性不足的问题,以提升早期搜索规划的科学性与响应速度。其解决方案的关键在于构建一个端到端的多模型大语言模型(Large Language Model, LLM)流水线——Guardian LLM Pipeline,该系统通过任务专业化LLM实现智能信息抽取与处理,并引入共识LLM引擎对多个模型输出进行比对和冲突消解;同时结合QLoRA微调技术与精挑细选的数据集增强模型性能,从而在保持保守、可审计的前提下,将LLM作为结构化提取器和标注工具,而非无约束的决策引擎,确保结果的可靠性与可追溯性。
链接: https://arxiv.org/abs/2603.08954
作者: Joshua Castillo,Ravi Mukkamala
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted to CAC: Applied Computing Automation Conferences 2026. 16 pages, 6 figures
Abstract:The first 72 hours of a missing-person investigation are critical for successful recovery. Guardian is an end-to-end system designed to support missing-child investigation and early search planning. This paper presents the Guardian LLM Pipeline, a multi-model system in which LLMs are used for intelligent information extraction and processing related to missing-person search operations. The pipeline coordinates end-to-end execution across task-specialized LLM models and invokes a consensus LLM engine that compares multiple model outputs and resolves disagreements. The pipeline is further strengthened by QLoRA-based fine-tuning, using curated datasets. The presented design aligns with prior work on weak supervision and LLM-assisted annotation, emphasizing conservative, auditable use of LLMs as structured extractors and labelers rather than unconstrained end-to-end decision makers.
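"共识引擎比对多个模型输出并消解分歧"的一个极简形式是字段级多数投票: 有多数共识的字段被采纳, 无多数的字段被标记交由人工复核。论文的共识机制由LLM实现且更为复杂, 以下字段与取值均为虚构, 仅示意思路:

```python
from collections import Counter

def consensus(extractions):
    """对多个模型的结构化抽取结果做字段级多数投票;
    无多数共识的字段被标记而非静默裁决, 以保持保守、可审计。"""
    fields = set().union(*extractions)
    agreed, flagged = {}, []
    for f in sorted(fields):
        votes = Counter(e[f] for e in extractions if f in e)
        value, count = votes.most_common(1)[0]
        if count > sum(votes.values()) / 2:   # 严格多数才采纳
            agreed[f] = value
        else:
            flagged.append(f)
    return agreed, flagged

# 三个任务专业化模型对同一案件文本的抽取结果(虚构示例)。
outputs = [
    {"last_seen": "Main St", "age": "9"},
    {"last_seen": "Main St", "age": "10"},
    {"last_seen": "Main St", "age": "11"},
]
agreed, flagged = consensus(outputs)
```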
[IR-15] PathoScribe: Transforming Pathology Data into a Living Library with a Unified LLM -Driven Framework for Semantic Retrieval and Clinical Integration
【速读】:该论文旨在解决病理学档案中海量非结构化文本报告难以被有效检索与推理利用的问题,即如何将静态的数字病理档案转化为可实时查询、支持临床决策的动态知识库。其核心挑战在于传统数字化流程仅实现数据存储而缺乏语义理解与交互能力,导致机构积累的宝贵经验无法赋能患者诊疗。解决方案的关键是提出PathoScribe框架——一个统一的检索增强型大语言模型(retrieval-augmented large language model, RAG-LLM)架构,通过自然语言查询实现病例检索、自动化队列构建、临床问答、免疫组化(IHC)面板推荐及报告格式转换等功能,显著提升病理数据的可用性与临床价值。实证表明,该系统在7万份多中心手术病理报告上实现了精准的案例召回率(Recall@10=1.0),并可在分钟级完成研究级队列构建(平均9.2分钟),准确率达91.3%,大幅降低人力成本与时间开销。
链接: https://arxiv.org/abs/2603.08935
作者: Abdul Rehman Akbar,Samuel Wales-McGrath,Alejadro Levya,Lina Gokhale,Rajendra Singh,Wei Chen,Anil Parwani,Muhammad Khalid Khan Niazi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
备注:
Abstract:Pathology underpins modern diagnosis and cancer care, yet its most valuable asset, the accumulated experience encoded in millions of narrative reports, remains largely inaccessible. Although institutions are rapidly digitizing pathology workflows, storing data without effective mechanisms for retrieval and reasoning risks transforming archives into a passive data repository, where institutional knowledge exists but cannot meaningfully inform patient care. True progress requires not only digitization, but the ability for pathologists to interrogate prior similar cases in real time while evaluating a new diagnostic dilemma. We present PathoScribe, a unified retrieval-augmented large language model (LLM) framework designed to transform static pathology archives into a searchable, reasoning-enabled living library. PathoScribe enables natural language case exploration, automated cohort construction, clinical question answering, immunohistochemistry (IHC) panel recommendation, and prompt-controlled report transformation within a single architecture. Evaluated on 70,000 multi-institutional surgical pathology reports, PathoScribe achieved perfect Recall@10 for natural language case retrieval and demonstrated high-quality retrieval-grounded reasoning (mean reviewer score 4.56/5). Critically, the system operationalized automated cohort construction from free-text eligibility criteria, assembling research-ready cohorts in minutes (mean 9.2 minutes) with 91.3% agreement to human reviewers and no eligible cases incorrectly excluded, representing orders-of-magnitude reductions in time and cost compared to traditional manual chart review. This work establishes a scalable foundation for converting digital pathology archives from passive storage systems into active clinical intelligence platforms.
[IR-16] Interpretable Markov-Based Spatiotemporal Risk Surfaces for Missing-Child Search Planning with Reinforcement Learning and LLM -Based Quality Assurance
【速读】:该论文旨在解决失踪儿童案件调查中因数据碎片化、非结构化以及缺乏动态地理空间预测工具而导致的搜救效率低下的问题。解决方案的关键在于提出一个三层次的决策支持系统——Guardian,其核心创新包括:第一层采用马尔可夫链(Markov chain)模型,融合道路可达性成本、隐蔽偏好和走廊偏向等要素,并区分昼夜参数以生成可解释的时空概率分布;第二层通过强化学习(reinforcement learning)将概率分布转化为可操作的搜索计划;第三层利用大语言模型(LLM)对第二层输出进行事后验证,确保方案的合理性与安全性。该架构实现了从原始文档到可执行搜索策略的端到端转化,显著提升了早期搜索规划的科学性与可解释性。
链接: https://arxiv.org/abs/2603.08933
作者: Joshua Castillo,Ravi Mukkamala
机构: Old Dominion University (老多米尼昂大学)
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 14 pages, 7 figures. Accepted at ICEIS 2026 (International Conference on Enterprise Information Systems)
Abstract:The first 72 hours of a missing-child investigation are critical for successful recovery. However, law enforcement agencies often face fragmented, unstructured data and a lack of dynamic, geospatial predictive tools. Our system, Guardian, provides an end-to-end decision-support system for missing-child investigation and early search planning. It converts heterogeneous, unstructured case documents into a schema-aligned spatiotemporal representation, enriches cases with geocoding and transportation context, and provides probabilistic search products spanning 0-72 hours. In this paper, we present an overview of Guardian as well as a detailed description of a three-layer predictive component of the system. The first layer is a Markov chain, a sparse, interpretable model with transitions incorporating road accessibility costs, seclusion preferences, and corridor bias with separate day/night parameterizations. The Markov chain’s output prediction distributions are then transformed into operationally useful search plans by the second layer’s reinforcement learning. Finally, the third layer’s LLM performs post hoc validation of layer 2 search plans prior to their release. Using a synthetic but realistic case study, we report quantitative outputs across 24/48/72-hour horizons and analyze sensitivity, failure modes, and tradeoffs. Results show that the proposed predictive system with the three-layer architecture produces interpretable priors for zone optimization and human review.
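第一层"区分昼夜参数的马尔可夫链概率传播"可用如下玩具示例说明: 按小时推进概率质量, 白天与夜间使用不同的转移矩阵。两个区域("road"/"woods")及其转移概率均为虚构设定, 并非Guardian的实际参数:

```python
def step(dist, T):
    """将各区域上的概率质量按转移矩阵 T 推进一步。"""
    nxt = {z: 0.0 for z in dist}
    for src, p in dist.items():
        for dst, t in T[src].items():
            nxt[dst] += p * t
    return nxt

def propagate(dist, hours, day_T, night_T, start_hour=8):
    # 与Guardian第一层一致, 昼夜使用不同的转移参数化;
    # 此处的区域划分与概率数值均为虚构的玩具设定。
    for h in range(hours):
        T = day_T if 6 <= (start_hour + h) % 24 < 18 else night_T
        dist = step(dist, T)
    return dist

day_T = {"road": {"road": 0.8, "woods": 0.2}, "woods": {"road": 0.3, "woods": 0.7}}
night_T = {"road": {"road": 0.5, "woods": 0.5}, "woods": {"road": 0.1, "woods": 0.9}}
dist = propagate({"road": 1.0, "woods": 0.0}, hours=24, day_T=day_T, night_T=night_T)
```

得到的概率分布即可作为下一层强化学习优化搜索分区的先验输入。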
[IR-17] Beyond Relevance: On the Relationship Between Retrieval and RAG Information Coverage
【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中上游检索质量与下游生成效果之间关系不明确的问题,即是否可以通过检索阶段的指标来可靠预测最终生成响应的信息覆盖度。其解决方案的关键在于通过在多个文本和多模态RAG基准(如TREC NeuCLIR 2024、TREC RAG 2024和WikiVideo)上系统性地评估15种文本检索模块和10种多模态检索模块,结合Auto-ARGUE和MiRAGE等多维评估框架,发现基于信息覆盖的检索指标与生成内容中的关键信息点(nugget)覆盖率之间存在强相关性,尤其在检索目标与生成目标一致时最为显著;这为将检索性能作为RAG整体性能的早期代理指标提供了实证依据。
链接: https://arxiv.org/abs/2603.08819
作者: Saron Samuel,Alexander Martin,Eugene Yang,Andrew Yates,Dawn Lawrie,Ian Soboroff,Laura Dietz,Benjamin Van Durme
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 11 pages
Abstract:Retrieval-augmented generation (RAG) systems combine document retrieval with a generative model to address complex information seeking tasks like report generation. While the relationship between retrieval quality and generation effectiveness seems intuitive, it has not been systematically studied. We investigate whether upstream retrieval metrics can serve as reliable early indicators of the final generated response’s information coverage. Through experiments across two text RAG benchmarks (TREC NeuCLIR 2024 and TREC RAG 2024) and one multimodal benchmark (WikiVideo), we analyze 15 text retrieval stacks and 10 multimodal retrieval stacks across four RAG pipelines and multiple evaluation frameworks (Auto-ARGUE and MiRAGE). Our findings demonstrate strong correlations between coverage-based retrieval metrics and nugget coverage in generated responses at both topic and system levels. This relationship holds most strongly when retrieval objectives align with generation goals, though more complex iterative RAG pipelines can partially decouple generation quality from retrieval effectiveness. These findings provide empirical support for using retrieval metrics as proxies for RAG performance.
[IR-18] Time warping with Hellinger elasticity
【速读】:该论文旨在解决时间序列在任意度量空间中的匹配问题,其中引入了基于Hellinger核的拉伸惩罚机制以优化匹配效果。解决方案的关键在于提出了一种弹性时间扭曲(Elastic Time Warping)算法,该算法具有三次方计算复杂度,能够在保持时间序列形态差异的同时实现更精确的对齐与匹配。
链接: https://arxiv.org/abs/2603.08807
作者: Yuly Billig
机构: Carleton University (卡尔顿大学)
类目: Information Retrieval (cs.IR); Data Structures and Algorithms (cs.DS); Metric Geometry (math.MG)
备注:
Abstract:We consider a matching problem for time series with values in an arbitrary metric space, with the stretching penalty given by the Hellinger kernel. To optimize this matching, we introduce the Elastic Time Warping algorithm with a cubic computational complexity.
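作为背景, 经典的时间扭曲匹配可由如下动态规划实现。注意论文以Hellinger核作为拉伸惩罚且算法复杂度为三次方; 此处仅以固定惩罚示意匹配递推的基本形式, 并非原算法:

```python
def dtw(x, y, dist, penalty=0.1):
    """动态规划时间扭曲。论文用Hellinger核惩罚替代此处的固定拉伸惩罚;
    本草图仅示意匹配递推本身。"""
    n, m = len(x), len(y)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(x[i - 1], y[j - 1])
            D[i][j] = c + min(
                D[i - 1][j - 1],          # 对角步: 无拉伸
                D[i - 1][j] + penalty,    # 拉伸 y
                D[i][j - 1] + penalty,    # 拉伸 x
            )
    return D[n][m]

# 取值可属于任意度量空间; 此处用实数与 |a - b| 作距离。
cost = dtw([0, 1, 2, 3], [0, 1, 1, 2, 3], lambda a, b: abs(a - b))
```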
[IR-19] Quantifying Uncertainty in AI Visibility: A Statistical Framework for Generative Search Measurement
【速读】:该论文试图解决当前生成式搜索平台中领域可见性(domain visibility)度量方法的局限性问题,即现有研究通常依赖单次运行的引用份额(citation share)和流行度(prevalence)点估计值,而忽视了生成式 AI(Generative AI)回答引擎固有的非确定性特征,导致对领域表现的评估过于精确且不具代表性。其解决方案的关键在于将引用可见性指标视为潜在响应分布的样本估计量而非固定值,并通过重复采样(每日采样与十分钟高频采样)实证揭示了引用分布服从幂律形式且存在显著变异性;进一步利用Bootstrap置信区间和整体排名稳定性分析表明,许多看似显著的领域差异实际上处于测量噪声范围内,因此必须引入不确定性估计并提供可解释置信区间的最小样本量建议,从而实现更可靠、稳健的领域可见性评估。
链接: https://arxiv.org/abs/2603.08924
作者: Ronald Sielinski
机构: 未知
类目: Applications (stat.AP); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 47 pages, 12 figures
Abstract:AI-powered answer engines are inherently non-deterministic: identical queries submitted at different times can produce different responses and cite different sources. Despite this stochastic behavior, current approaches to measuring domain visibility in generative search typically rely on single-run point estimates of citation share and prevalence, implicitly treating them as fixed values. This paper argues that citation visibility metrics should be treated as sample estimators of an underlying response distribution rather than fixed values. We conduct an empirical study of citation variability across three generative search platforms–Perplexity Search, OpenAI SearchGPT, and Google Gemini–using repeated sampling across three consumer product topics. Two sampling regimes are employed: daily collections over nine days and high-frequency sampling at ten-minute intervals. We show that citation distributions follow a power-law form and exhibit substantial variability across repeated samples. Bootstrap confidence intervals reveal that many apparent differences between domains fall within the noise floor of the measurement process. Distribution-wide rank stability analysis further demonstrates that citation rankings are unstable across samples, not only among top-ranked domains but throughout the frequently cited domain set. These findings demonstrate that single-run visibility metrics provide a misleadingly precise picture of domain performance in generative search. We argue that citation visibility must be reported with uncertainty estimates and provide practical guidance for sample sizes required to achieve interpretable confidence intervals.
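"对重复采样的运行构造引用份额的Bootstrap置信区间"的做法可示意如下。数据为虚构的玩具样本, 百分位法也只是Bootstrap区间的一种常见取法:

```python
import random

def citation_share(runs, domain):
    """某域名在全部引用中所占份额(点估计)。"""
    cites = [c for run in runs for c in run]
    return cites.count(domain) / len(cites)

def bootstrap_ci(runs, domain, n_boot=2000, alpha=0.05, seed=0):
    """以"运行"为重采样单位做有放回抽样, 给出份额的百分位Bootstrap区间。"""
    rng = random.Random(seed)
    stats = sorted(
        citation_share([rng.choice(runs) for _ in runs], domain)
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# 每个内层列表是一次查询运行所引用的域名(虚构数据)。
runs = [["a.com", "b.com"], ["a.com", "c.com"], ["b.com", "a.com"], ["c.com", "b.com"]]
lo, hi = bootstrap_ci(runs, "a.com")
```

若两个域名的区间大幅重叠, 其点估计差异便落在测量噪声之内, 不宜作为结论。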
人机交互
[HC-0] Understanding the Use of a Large Language Model-Powered Guide to Make Virtual Reality Accessible for Blind and Low Vision People
【速读】:该论文旨在解决社会虚拟现实(Social VR)中盲人及低视力(BLV)用户群体的可访问性问题,当前缺乏针对此类用户的有效辅助工具。解决方案的关键在于开发一个由大语言模型(LLM)驱动的AI“sighted guide”(视觉向导),通过在包含伪装成其他用户的共谋者的虚拟环境中对16名BLV参与者进行实验,揭示了该引导系统在不同社交情境下被用户赋予的不同角色属性:单独使用时被视为工具,而在他人面前则被当作陪伴者,表现出拟人化行为特征,如命名、为错误辩护并促进与其他用户的互动。这一发现深化了对AI引导机制在VR无障碍设计中灵活性的理解,并为未来面向BLV用户的交互式辅助系统提供了设计建议。
链接: https://arxiv.org/abs/2603.09964
作者: Jazmin Collins,Sharon Y Lin,Tianqi Liu,Andrea Stevenson Won,Shiri Azenkot
机构: Cornell University(康奈尔大学); Cornell Tech(康奈尔科技学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: 16 pages, 5 figures, 3 tables, Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26), April 13-17, 2026, Barcelona, Spain. ACM
Abstract:As social virtual reality (VR) grows more popular, addressing accessibility for blind and low vision (BLV) users is increasingly critical. Researchers have proposed an AI “sighted guide” to help users navigate VR and answer their questions, but it has not been studied with users. To address this gap, we developed a large language model (LLM)-powered guide and studied its use with 16 BLV participants in virtual environments with confederates posing as other users. We found that when alone, participants treated the guide as a tool, but treated it companionably around others, giving it nicknames, rationalizing its mistakes with its appearance, and encouraging confederate-guide interaction. Our work furthers understanding of guides as a versatile method for VR accessibility and presents design recommendations for future guides.
[HC-1] Prompt-Driven Color Accessibility Evaluation in Diffusion-based Image Generation Models
【TL;DR】This paper addresses the lack of color accessibility for users with Color Vision Deficiencies (CVD) in diffusion-based text-to-image generation: current models excel at visual quality and diversity but do not account for color-blind viewers. The key to its solution is "CVDLoss", a new metric that quantifies differences in image gradients as a proxy for the legibility of structural detail, making it sensitive to how color modifications affect visibility for CVD users. Experiments show that CVDLoss can assess a model's ability to respond to accessibility-focused prompts and reveal that existing diffusion models struggle with such prompts, providing both a quantitative evaluation tool and directions for more inclusive image generation and post-processing.
Link: https://arxiv.org/abs/2603.09832
Authors: Xinyao Zhuang, Jose Echevarria, Kaan Akşit
Affiliations: University College London, United Kingdom; Adobe Research, USA
Subjects: Graphics (cs.GR); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Generative models are increasingly integrated into creative workflows. While text-to-image generation excels in visual quality and diversity, color accessibility for users with Color Vision Deficiencies (CVD) remains largely unexplored. Our work systematically evaluates color accessibility in images generated by a common pretrained diffusion model, prompted to improve accessibility across diverse categories. We quantify performance using established, off-the-shelf CVD simulation methods and introduce “CVDLoss”, a new metric measuring differences in image gradients indicative of structural detail. We validate CVDLoss against a commonly used daltonization method, demonstrating its sensitivity to color accessibility modifications. Applying CVDLoss to model outputs reveals that existing diffusion models struggle to reliably respond to accessibility-focused prompts. Consequently, our study establishes CVDLoss as a valuable evaluation tool for accessibility-aware image generation and post-processing, offering insights into current generative models’ limitations in addressing color accessibility.
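The abstract describes CVDLoss as measuring "differences in image gradients indicative of structural detail". A minimal sketch of that idea, assuming a mean absolute difference of finite-difference gradient magnitudes between an image and its CVD-simulated version (the paper's exact formulation may differ):

```python
import numpy as np

def grad_mag(img):
    # Finite-difference gradient magnitude of a grayscale image.
    gy, gx = np.gradient(img.astype(float))
    return np.hypot(gx, gy)

def cvd_loss(original, simulated):
    """Mean absolute difference in gradient magnitude between an image
    and its CVD-simulated counterpart: higher values suggest structural
    detail that a CVD viewer would lose."""
    return float(np.abs(grad_mag(original) - grad_mag(simulated)).mean())

img = np.zeros((8, 8)); img[:, :4] = 1.0  # image with a hard vertical edge
flat = np.full((8, 8), 0.5)               # the edge vanishes after "simulation"
```

On this toy pair, `cvd_loss(img, img)` is zero while `cvd_loss(img, flat)` is positive, matching the intended sensitivity to lost structure.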
[HC-2] The Richest Paradigm You're Not Using: Commercial Videogames at the Intersection of Human-Computer Interaction and Cognitive Science
【TL;DR】This paper addresses the limited ecological validity of the laboratory paradigms cognitive science uses to study perception, attention, and executive functioning, as well as HCI's tendency to characterize complex behavior without probing its underlying cognitive mechanisms. The key to its solution is to use commercial videogames as a research environment: they are cognitively demanding by design, strongly motivating, and consistent enough across players that their affordance structures naturally impose genuine, sustained, and personally meaningful cognitive demands. Combined with a minimal observational toolkit of screen recording, eye tracking, and behavioral timing, and an affordance-cognition mapping framework, this enables systematic study of cognition while bridging laboratory control and real-world settings.
Link: https://arxiv.org/abs/2603.09753
Authors: Jaap Munneke, Jennifer E. Corbett
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Synthesizing from Corbett and Munneke (2025), who demonstrated that questions originating in human-computer interaction (HCI) and game design can be answered through the theoretical toolkit of cognitive science, this perspective argues that commercial videogames represent a largely underutilized research environment at the intersection of these two fields. Cognitive science has long relied on carefully controlled laboratory paradigms to study perception, attention, and executive functioning, raising persistent questions about ecological validity. HCI, by contrast, has spent decades developing methods for studying behavior in rich, complex, interactive environments, but has been less concerned with what that behavior reveals about underlying cognitive mechanisms. Commercial videogames sit precisely at this intersection. They are cognitively demanding by design, motivating by nature, and consistent enough across players to support systematic behavioral comparison. The affordance structure of a game does the work that experimental manipulations typically require of the researcher, instantiating cognitive demands that are genuine, sustained, and meaningful to the player. We argue that perception, attention, and executive functioning can be meaningfully studied within commercial games using a minimal observational toolkit of screen recording, eye tracking, and behavioral timing. We propose an affordance-cognition mapping framework as a systematic basis for game selection and research design and offer practical methodological recommendations for researchers wishing to work in this space.
[HC-3] Dynamic Multimodal Expression Generation for LLM-Driven Pedagogical Agents: From User Experience Perspective
【TL;DR】This paper addresses the limited semantic-context adaptability of pedagogical agents (PAs) in virtual reality (VR) education, which typically rely on static speech and simple gestures and therefore lack natural, effective interaction. The key to its solution is a large language model (LLM)-driven multimodal expression generation method that constructs semantically sensitive prompts to produce coordinated speech and gesture instructions, dynamically aligning instructional semantics with multimodal expressive behavior. User studies show significant gains in perceived learning effectiveness, engagement, and intention to use, along with stronger perceptions of human-likeness and social presence.
Link: https://arxiv.org/abs/2603.09536
Authors: Ninghao Wan, Jiarun Song, Fuzheng Yang
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
Comments:
Abstract:In virtual reality (VR) educational scenarios, Pedagogical agents (PAs) enhance immersive learning through realistic appearances and interactive behaviors. However, most existing PAs rely on static speech and simple gestures. This limitation reduces their ability to dynamically adapt to the semantic context of instructional content. As a result, interactions often lack naturalness and effectiveness in the teaching process. To address this challenge, this study proposes a large language model (LLM)-driven multimodal expression generation method that constructs semantically sensitive prompts to generate coordinated speech and gesture instructions, enabling dynamic alignment between instructional semantics and multimodal expressive behaviors. A VR-based PA prototype was developed and evaluated through user experience-oriented subjective experiments. Results indicate that dynamically generated multimodal expressions significantly enhance learners’ perceived learning effectiveness, engagement, and intention to use, while effectively alleviating feelings of fatigue and boredom during the learning process. Furthermore, the combined dynamic expression of speech and gestures notably enhances learners’ perceptions of human-likeness and social presence. The findings provide new insights and design guidelines for building more immersive and naturally expressive intelligent PAs.
[HC-4] PixelConfig: Longitudinal Measurement and Reverse-Engineering of Meta Pixel Configurations
【TL;DR】This paper addresses the lack of systematic study of how tracking pixels are configured across websites, in particular how Meta Pixel deployments differ and what that implies for privacy. The key to its solution is PixelConfig, a differential analysis framework that reverse-engineers Meta Pixel configurations on a page, covering activity tracking, identity tracking, and tracking restrictions. Applying the framework to Internet Archive Wayback Machine data, the authors compare 18K health-related websites with a control group of the top 10K websites, finding high adoption of tracking features driven largely by default settings, leakage of potentially sensitive information, and tracking restriction mechanisms that offer limited protection and can be circumvented.
Link: https://arxiv.org/abs/2603.09380
Authors: Abdullah Ghani (1), Yash Vekaria (2), Zubair Shafiq (2) ((1) Lahore University of Management Sciences (2) University of California, Davis)
Affiliations: Lahore University of Management Sciences; University of California, Davis
Subjects: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Networking and Internet Architecture (cs.NI); Social and Information Networks (cs.SI)
Comments:
Abstract:Tracking pixels are used to optimize online ad campaigns through personalization, re-targeting, and conversion tracking. Past research has primarily focused on detecting the prevalence of tracking pixels on the web, with limited attention to how they are configured across websites. A tracking pixel may be configured differently on different websites. In this paper, we present a differential analysis framework: PixelConfig, to reverse-engineer the configurations of Meta Pixel deployments across the web. Using this framework, we investigate three types of Meta Pixel configurations: activity tracking (i.e., what a user is doing on a website), identity tracking (i.e., who a user is or who the device is associated with), and tracking restrictions (i.e., mechanisms to limit the sharing of potentially sensitive information). Using data from the Internet Archive’s Wayback Machine, we analyze and compare Meta Pixel configurations on 18K health-related websites with a control group of the top 10K websites from 2017 to 2024. We find that activity tracking features, such as automatic events that collect button clicks and page metadata, and identity tracking features, such as first-party cookies that are unaffected by third-party cookie blocking, reached adoption rates of up to 98.4%, largely driven by the Pixel’s default settings. We also find that the Pixel is being used to track potentially sensitive information, such as user interactions related to booking medical appointments and button clicks associated with specific medical conditions (e.g., erectile dysfunction) on health-related websites. Tracking restriction features, such as Core Setup, are configured on up to 34.3% of health websites and 8.7% of control websites. However, even when enabled, these tracking restriction features provide limited protection and can be circumvented in practice. 
Cite as: arXiv:2603.09380 [cs.CR] (or arXiv:2603.09380v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2603.09380
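A toy illustration of the configuration-extraction idea: Meta Pixel deployments are configured through `fbq()` calls embedded in page scripts, so a coarse extractor can look for initialised pixel IDs and explicit opt-outs from automatic event collection. This is a simplified sketch, not PixelConfig itself; treat the `autoConfig` pattern as an assumption about how automatic events are disabled, and note that real deployments vary widely:

```python
import re

# fbq('init', '<numeric pixel id>')
FBQ_INIT = re.compile(r"fbq\(\s*['\"]init['\"]\s*,\s*['\"](\d+)['\"]")
# fbq('set', 'autoConfig', false, ...) -- opting out of automatic events
FBQ_AUTOCONFIG_OFF = re.compile(
    r"fbq\(\s*['\"]set['\"]\s*,\s*['\"]autoConfig['\"]\s*,\s*false", re.I)

def pixel_config(script_text):
    """Extract a coarse Meta Pixel configuration from inline script text:
    which pixel IDs are initialised, and whether automatic event
    collection has been explicitly disabled."""
    return {
        "pixel_ids": FBQ_INIT.findall(script_text),
        "auto_events_disabled": bool(FBQ_AUTOCONFIG_OFF.search(script_text)),
    }

snippet = """
fbq('set', 'autoConfig', false, '123456');
fbq('init', '123456');
fbq('track', 'PageView');
"""
cfg = pixel_config(snippet)
```

Running the same extractor over archived snapshots of a site, as the paper does with the Wayback Machine, would then expose configuration changes over time.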
[HC-5] Reading the Mood Behind Words: Integrating Prosody-Derived Emotional Context into Socially Responsive VR Agents
【TL;DR】This paper addresses the problem that embodied conversational agents in VR rely on speech-to-text processing, discarding prosody and often producing emotionally incongruent responses. The key to its solution is an emotion-context-aware VR interaction pipeline: a real-time speech emotion recognition model infers the user's emotional state from prosody, and the resulting emotion labels are injected as explicit dialogue context into an LLM-based conversational agent, shaping the tone and style of its responses. A within-subjects study (N=30) shows significant improvements in dialogue quality, naturalness, engagement, rapport, and human-likeness.
Link: https://arxiv.org/abs/2603.09324
Authors: SangYeop Jeong, Yeongseo Na, Seung Gyu Jeong, Jin-Woo Jeong, Seong-Eun Kim
Affiliations: Seoul National University of Science and Technology
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures, Accepted to CHI EA 2026 (Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems)
Abstract:In VR interactions with embodied conversational agents, users’ emotional intent is often conveyed more by how something is said than by what is said. However, most VR agent pipelines rely on speech-to-text processing, discarding prosodic cues and often producing emotionally incongruent responses despite correct semantics. We propose an emotion-context-aware VR interaction pipeline that treats vocal emotion as explicit dialogue context in an LLM-based conversational agent. A real-time speech emotion recognition model infers users’ emotional states from prosody, and the resulting emotion labels are injected into the agent’s dialogue context to shape response tone and style. Results from a within-subjects VR study (N=30) show significant improvements in dialogue quality, naturalness, engagement, rapport, and human-likeness, with 93.3% of participants preferring the emotion-aware agent.
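The injection step can be sketched as a plain prompt-construction helper. The bracketed `[user_emotion: …]` tag format is invented for illustration; the paper does not specify how its labels are serialized into the dialogue context:

```python
def build_emotion_aware_prompt(user_text, emotion):
    """Inject a prosody-derived emotion label into the dialogue context
    so an LLM-based agent can shape its response tone (hypothetical
    tag format)."""
    system = (
        "You are an embodied VR conversational agent. The user's vocal "
        "emotion, inferred from prosody, is given in brackets before "
        "their words. Match your tone and style to that emotional state."
    )
    tagged = f"[user_emotion: {emotion}] {user_text}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": tagged},
    ]

msgs = build_emotion_aware_prompt("I guess the demo went fine.", "sad")
```

The semantic content ("the demo went fine") is unchanged; only the added label lets the agent respond to the sadness the words alone would hide.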
[HC-6] Entangling Like Mycorrhizae: Mixing Realities Through Touch in “FungiSync” SIGGRAPH2026
【TL;DR】This paper asks how the mutualistic interdependence of plants connected through mycorrhizal networks can be translated into an embodied, felt human experience, challenging the individualism of our technology-mediated posthuman era. The key to its solution is FungiSync, a multi-person, co-located mixed reality (MR) experience that reprograms hand touch into a symbolic hyphal connection: each participant holds a mushroom-decorated MR headset mask and perceives a distinct, audio-reactive "umwelt" (perceptual world) of resource-representing visuals; when hands touch, those individual worlds leak, mix, and merge, simulating the exchange of resources and signals across a mycelial network. The design lets participants feel "fungal epistemics", a symbiosis-grounded way of knowing, reframing the human-nature connection as both aesthetic experience and ethical orientation.
Link: https://arxiv.org/abs/2603.09272
Authors: Botao Amber Hu, Danlin Huang, Yilan Elan Tao, Xiaobo Aaron Hu, Rem RunGu Lin
Affiliations: Reality Design Lab; University of Oxford; School of Design Innovation, China Academy of Art; The Hong Kong University of Science and Technology (Guangzhou)
Subjects: Human-Computer Interaction (cs.HC)
Comments: Submitted for SIGGRAPH 2026
Abstract:Mycorrhizal networks – often called nature's "wood-wide web" – are vast underground mycelial systems that connect individual plants through countless hyphae of mycorrhizal fungi joining with plant roots. Through these hyphal webs, resources and signals – carbohydrates, minerals, and biochemical cues – are mutualistically exchanged and redistributed across plants, sustaining forests as relational symbiotic ecologies rather than isolated individuals. What is it like to be a plant within the wood-wide web? We present FungiSync, a multi-person, co-located mixed reality (MR) experience that translates mycorrhizal interdependence into a felt, somaesthetic participatory ritual. Participants embody different forest plants by holding masquerade-style MR headset masks with wood-branch-like handles decorated with mushrooms. In MR, each participant perceives a distinct, audio-reactive psychedelic augmented reality overlay – composed of resource-representing visual elements – layered atop a shared physical terrain, symbolizing an individualized digital umwelt (perceptual world). FungiSync reprograms human hand touch into a metaphorical mycorrhizal exchange. When participants touch hands, their digital umwelten begin to entangle: visual elements leak, mix, and merge across perspectives, as if hyphae were forging new connections and carrying resources between hosts within a larger mycelial network. By making mycorrhizal interdependence perceptible through embodied contact, FungiSync invites participants to feel with fungal epistemics – a more-than-human alternative way of knowing grounded in symbiotic relationality as both an aesthetic experience and an ethical orientation – offering a critique of the accelerated individualism characterizing our technology-mediated posthuman era.
[HC-7] From Perception to Cognition: How Latency Affects Interaction Fluency and Social Presence in VR Conferencing
【TL;DR】This paper addresses how the end-to-end (E2E) latency introduced by remote communication degrades interaction fluency and social presence in virtual reality (VR) conferencing. The key to its solution is a pair of subjective experiments that quantify latency's impact along two dimensions: quality perception, assessed with the Absolute Category Rating (ACR) method, and social cognition, measured with the Networked Minds Social Presence Inventory (NMSPI). The study further examines how interaction fluency and social presence relate under different latency conditions, yielding actionable guidance for optimizing VR conferencing systems to improve fluency and social presence in immersive environments.
Link: https://arxiv.org/abs/2603.09261
Authors: Jiarun Song, Ninghao Wan, FuZheng Yang, Weisi Lin
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
Comments:
Abstract:Virtual reality (VR) conferencing has the potential to provide geographically dispersed users with an immersive environment, enabling rich social interactions and user experience using avatars. However, remote communication in VR inevitably introduces end-to-end (E2E) latency, which can significantly impact user experience. To clarify the impact of latency, we conducted subjective experiments to analyze how it influences interaction fluency from the perspective of quality perception and social presence from the perspective of social cognition, comparing VR conferencing with traditional video conferencing (VC). Specifically, interaction fluency emphasizes user perception of interaction pace and responsiveness and is assessed using Absolute Category Rating (ACR) method. In contrast, social presence focuses on the cognitive understanding of interaction, specifically whether individuals can comprehend the intentions, emotions, and behaviors expressed by others. It is primarily measured using the Networked Minds Social Presence Inventory (NMSPI). Building on this analysis, we further investigate the relationship between interaction fluency and social presence under different latency conditions to clarify the underlying perceptual and cognitive mechanisms. The findings from these subjective tests provide meaningful insights for optimizing the related systems, helping to improve interaction fluency and enhancing social presence in immersive virtual environments.
[HC-8] A Text-Native Interface for Generative Video Authoring
【TL;DR】This paper addresses the complexity and high barrier of video creation: producing video demands specialized, complicated editing tools, whereas freeform writing is a skill nearly everyone has. The key to its solution is Doki, a text-native interface that aligns video authoring entirely with the natural process of writing: within a single document, users define assets, structure scenes, create shots, refine edits, and add audio, making visual storytelling intuitive, efficient, and accessible. The approach marks a fundamental shift in generative video interfaces, lowering the barrier to creation while improving authoring efficiency.
Link: https://arxiv.org/abs/2603.09072
Authors: Xingyu Bruce Liu, Mira Dontcheva, Dingzeyu Li
Affiliations: Adobe Research
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:
Abstract:Everyone can write their stories in freeform text format – it’s something we all learn in school. Yet storytelling via video requires one to learn specialized and complicated tools. In this paper, we introduce Doki, a text-native interface for generative video authoring, aligning video creation with the natural process of text writing. In Doki, writing text is the primary interaction: within a single document, users define assets, structure scenes, create shots, refine edits, and add audio. We articulate the design principles of this text-first approach and demonstrate Doki’s capabilities through a series of examples. To evaluate its real-world use, we conducted a week-long deployment study with participants of varying expertise in video authoring. This work contributes a fundamental shift in generative video interfaces, demonstrating a powerful and accessible new way to craft visual stories.
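Doki's actual document syntax is not given in the abstract. As an illustration of the text-native idea, a hypothetical scene/shot notation (invented here, not Doki's) could be parsed into a structured timeline like this:

```python
def parse_script(doc):
    """Parse a hypothetical text-native video script into scenes and
    shots. Lines like '# Scene: Forest' open a scene; '- shot: ...'
    lines add shots to the current scene."""
    scenes = []
    for line in doc.splitlines():
        line = line.strip()
        if line.startswith("# Scene:"):
            scenes.append({"name": line[len("# Scene:"):].strip(), "shots": []})
        elif line.startswith("- shot:") and scenes:
            scenes[-1]["shots"].append(line[len("- shot:"):].strip())
    return scenes

doc = """
# Scene: Forest dawn
- shot: wide pan over misty treetops
- shot: close-up of dew on a leaf
# Scene: River
- shot: slow dolly along the bank
"""
scenes = parse_script(doc)
```

In a text-first tool, each parsed shot description would then drive a generative video model, so editing the document is editing the film.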
[HC-9] Tracing Everyday AI Literacy Discussions at Scale: How Online Creative Communities Make Sense of Generative AI
【TL;DR】This paper addresses a gap in AI literacy research: most frameworks are static, top-down, and expert-driven, overlooking how literacy evolves organically through everyday practice in creative communities. The key to its solution is a large-scale analysis of 122k Reddit conversations that identifies four consistent themes in AI literacy-related discourse and traces how that discourse shifts around major AI events, showing that AI literacy is practice-driven and event-responsive rather than static or purely conceptual. The findings offer empirical grounding for designing learning resources, community support, and policy that better fit creators' actual needs.
Link: https://arxiv.org/abs/2603.09055
Authors: Haidan Liu, Poorvi Bhatia, Nicholas Vincent, Parmit Chilana
Affiliations: Simon Fraser University
Subjects: Human-Computer Interaction (cs.HC)
Comments: Accepted at CHI 2026
Abstract:Developing AI literacy is increasingly urgent as generative AI reshapes creative practice. Yet most AI literacy frameworks are top-down and expert-driven, overlooking how literacy emerges organically in creative communities. To address this gap, we performed a large-scale analysis of 122k Reddit conversations from 80 creative-oriented subreddits over a three-year period. Our analysis identified four consistent themes in AI literacy-related discussions, and we further traced how discourse shifted alongside major AI events. Surprisingly, creators primarily frame AI literacy around how to use tools effectively, foregrounding practice and task skills, while discussions of AI capabilities and ethics surge only around high-profile events. Our findings suggest that AI literacy is dynamic, practice-driven, and event-responsive rather than static or purely conceptual. This study provides insights for researchers, designers, and policymakers to develop learning resources, community support, and policies that better promote AI literacy in creative communities.
[HC-10] AI Phenomenology for Understanding Human-AI Experiences Across Eras
【TL;DR】This paper addresses the over-reliance on quantitative measures such as usability scales and engagement metrics in AI research, which flattens the complex, individual subjective experience of interacting with AI. The key to its solution is "AI phenomenology", a research stance that foregrounds users' first-person perceptions, interpretations, and feelings about AI systems as they change over time, serving as a paradigm for bidirectional human-AI alignment. Grounded in three studies (two longitudinal and one multi-method), the paper contributes a replicable methodological toolkit: instruments for capturing lived experience across personal and professional contexts, three design concepts (translucent design, agency-aware value alignment, temporal co-evolution tracking), and a concrete research agenda for the continued co-evolution of AI systems and the humans who live alongside them.
Link: https://arxiv.org/abs/2603.09020
Authors: Bhada Yun, Evgenia Taranova, Dana Feng, Renn Su, April Yi Wang
Affiliations: ETH Zürich; University of Bergen; Independent Researcher; Stanford University
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: This is an accepted workshop paper at CHI '26, "W37: Human-AI Interaction Alignment: Designing, Evaluating, and Evolving Value-Centered AI For Reciprocal Human-AI Futures", or this https URL
Abstract:There is no ‘ordinary’ when it comes to AI. The human-AI experience is extraordinarily complex and specific to each person, yet dominant measures such as usability scales and engagement metrics flatten away nuance. We argue for AI phenomenology: a research stance that asks “How did it feel?” beyond the standard questions of “How well did it perform?” when interacting with AI systems. AI phenomenology acts as a paradigm for bidirectional human-AI alignment as it foregrounds users’ first-person perceptions and interpretations of AI systems over time. We motivate AI phenomenology as a framework that captures how alignment is experienced, negotiated, and updated between users and AI systems. Tracing a lineage from Husserl through postphenomenology to Actor-Network Theory, and grounding our argument in three studies-two longitudinal studies with “Day”, an AI companion, and a multi-method study of agentic AI in software engineering-we contribute a set of replicable methodological toolkits for conducting AI phenomenology research: instruments for capturing lived experience across personal and professional contexts, three design concepts (translucent design, agency-aware value alignment, temporal co-evolution tracking), and a concrete research agenda. We offer this toolkit not as a new paradigm but as a practical scaffold that researchers can adapt as AI systems-and the humans who live alongside them-continue to co-evolve.
[HC-11] "Who wants to be nagged by AI?": Investigating the Effects of Agreeableness on Older Adults' Perception of LLM-Based Voice Assistants' Explanations
【TL;DR】This paper examines how the agreeableness of LLM-based voice assistants (VAs) supporting older adults aging in place shapes perceptions of the assistants' explanations. A study (N=70) found that high-agreeableness VAs were rated as more trustworthy, empathetic, and likable in routine scenarios, but in emergencies clarity outweighed warmth and the benefit diminished; agreeableness did not affect perceived intelligence, indicating that social tone and perceived competence are separable. The key implication is that VA expression should adapt to context (routine vs. emergency) and user traits (such as the user's own agreeableness) rather than follow a one-size-fits-all approach to AI explainability.
Link: https://arxiv.org/abs/2603.09012
Authors: Niharika Mathur, Hasibur Rahman, Smit Desai
Affiliations: Georgia Institute of Technology; Northeastern University
Subjects: Human-Computer Interaction (cs.HC)
Comments: To be published as a poster extended abstract at CHI 2026
Abstract:LLM-based voice assistants (VAs) increasingly support older adults aging in place, yet how an assistant’s agreeableness shapes explanation perception remains underexplored. We conducted a study(N=70) examining how VA agreeableness influences older adults’ perceptions of explanations across routine and emergency home scenarios. High-agreeableness assistants were perceived as more trustworthy, empathetic, and likable, but these benefits diminished in emergencies where clarity outweighed warmth. Agreeableness did not affect perceived intelligence, suggesting social tone and competence are separable dimensions. Real-time environmental explanations outperformed history-based ones, and agreeable older adults penalized low-agreeableness assistants more strongly. These findings show the need to move beyond a one-size-fits-all approach to AI explainability, while balancing personality, context, and audience.
[HC-12] Improving through Interaction: Searching Behavioral Representation Spaces with CMA-ES-IG
【TL;DR】This paper addresses the limited adaptability of robots in human-centered environments when preference-learning methods optimize only final estimation accuracy and ignore the user's experience while providing rankings. The key to its solution is the Covariance Matrix Adaptation Evolution Strategies with Information Gain (CMA-ES-IG) algorithm, which explicitly incorporates user experience by proposing query trajectories that are both perceptually distinct and informative, improving users' satisfaction and efficiency during ranking while remaining computationally tractable in high-dimensional preference spaces and robust to noisy feedback.
Link: https://arxiv.org/abs/2603.09011
Authors: Nathaniel Dennler, Zhonghao Shi, Yiran Tao, Andreea Bobu, Stefanos Nikolaidis, Maja Matarić
Affiliations: Massachusetts Institute of Technology; University of Southern California
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Under submission to IJRR
Abstract:Robots that interact with humans must adapt to individual users’ preferences to operate effectively in human-centered environments. An intuitive and effective technique to learn non-expert users’ preferences is through rankings of robot behaviors, e.g., trajectories, gestures, or voices. Existing techniques primarily focus on generating queries that optimize preference learning outcomes, such as sample efficiency or final preference estimation accuracy. However, the focus on outcome overlooks key user expectations in the process of providing these rankings, which can negatively impact users’ adoption of robotic systems. This work proposes the Covariance Matrix Adaptation Evolution Strategies with Information Gain (CMA-ES-IG) algorithm. CMA-ES-IG explicitly incorporates user experience considerations into the preference learning process by suggesting perceptually distinct and informative trajectories for users to rank. We demonstrate these benefits through both simulated studies and real-robot experiments. CMA-ES-IG, compared to state-of-the-art alternatives, (1) scales more effectively to higher-dimensional preference spaces, (2) maintains computational tractability for high-dimensional problems, (3) is robust to noisy or inconsistent user feedback, and (4) is preferred by non-expert users in identifying their preferred robot behaviors. This project’s code is available at this http URL
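A toy version of the information-gain ingredient: under a Bradley-Terry preference model, the most informative pairwise query is roughly the one whose outcome is most uncertain. The sketch below uses outcome entropy as a crude stand-in for CMA-ES-IG's actual criterion, which also involves the CMA-ES search distribution and perceptual distinctness (both omitted here); the weight vector and feature pairs are invented toy data:

```python
import math

def predicted_pref(w, a, b):
    """Bradley-Terry probability that item a is ranked above b
    under the linear utility u(x) = w . x."""
    ua = sum(wi * xi for wi, xi in zip(w, a))
    ub = sum(wi * xi for wi, xi in zip(w, b))
    return 1.0 / (1.0 + math.exp(ub - ua))

def outcome_entropy(p):
    # Binary entropy in bits; maximal (1.0) when the outcome is a coin flip.
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def most_informative_query(w, candidates):
    """Pick the candidate pair whose comparison outcome is most
    uncertain given the current preference estimate w."""
    return max(candidates,
               key=lambda pair: outcome_entropy(predicted_pref(w, *pair)))

w = [1.0, -0.5]  # current estimate of the user's preference weights
candidates = [([1, 0], [0, 1]), ([1, 0], [0.9, 0.1]), ([0, 1], [0, 0.9])]
best = most_informative_query(w, candidates)
```

A pure entropy criterion would happily pick near-identical trajectories, which is exactly the user-experience failure CMA-ES-IG guards against by also requiring perceptual distinctness.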
[HC-13] Dishonesty Tendencies in Testing Scenarios Among Students with Virtual Reality and Computer-Mediated Technology
【TL;DR】This paper asks whether students are more inclined to cheat in virtual reality (VR) than in a traditional computer-mediated setting, i.e., whether academic dishonesty changes with the technological environment. The key to its solution is a controlled study in which the same participants (22 volunteers, two per session with an examiner present) took simulated online exams both in VR and on a laptop, with questionnaires collecting additional behavioral data. The results show that the amount of cheating was exactly the same in both settings, indicating that the VR environment neither increased nor decreased academic dishonesty and providing empirical grounding for trustworthy virtual learning spaces.
Link: https://arxiv.org/abs/2603.08974
Authors: Tanja Kojić, Alina Dovhalevska, Maurizio Vergari, Sebastian Möller, Jan-Niklas Voigt-Antons
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments: Paper presented at the International Conference on Human-Computer Interaction (HCII 2024)
Abstract:Virtual reality (VR) systems have the potential to be an innovation in the field of e-learning. Starting with fully functional e-classes, VR technologies can be used to build entire e-campuses. The power of VR is that it allows for stronger contact with students than computer-mediated technology. Deceptive behaviour, both verbal and nonverbal, refers to intentional activities designed to deceive others. Students often engage in dishonest practices to make progress. Whether it is cheating on an exam, copying another student’s essay, or inflating their GPA, the motivation for cheating is rarely simply a lack of preparation. Even though some may see academic dishonesty as an asset, the reality is that it can have major consequences. This poster demonstrates the findings from a study of students’ deceitful behaviour during a test in VR and in real-life situations. For this user study, 22 volunteers were invited to participate, with each experiment involving exactly two participants and the examiner present in the room. Students were invited to take two tests: one in VR and one on a laptop. Their goal was to score as many points as possible by simulating a real-world online exam. Participants were requested to complete questionnaires during and after each experiment, which assisted in collecting additional data for this study. The results indicate that the amount of cheating that happened in VR and on a laptop was exactly the same.
[HC-14] Influence of Interactivity in Shaping User Experience and Social Acceptance of Mobile XR
【TL;DR】This paper addresses the poorly understood relationship between the degree of interactivity in Mobile Augmented Reality (MAR) applications and both user experience (UX) and social acceptability (SA). Comparing two commercial MAR apps (IKEA and Virtlo) with different interactivity levels, the study finds that high interactivity can improve perceived usability, yet the conspicuous body movements it requires in public can cause social discomfort and undermine acceptability. The key takeaway is a balanced design approach that incorporates SA alongside traditional UX evaluation, supporting the seamless integration and broad adoption of AR in everyday environments.
Link: https://arxiv.org/abs/2603.08973
Authors: Tanja Kojić, Maurizio Vergari, Maximilian Warsinke, Sebastian Möller, Jan-Niklas Voigt-Antons
Affiliations: Technical University of Berlin; German Research Center for Artificial Intelligence (DFKI); Hamm-Lippstadt University of Applied Sciences
Subjects: Human-Computer Interaction (cs.HC)
Comments: Paper presented at the workshop Social Interaction and Collaboration in eXtended Reality (SIC-XR 2025)
Abstract:This study investigates the impact of the Degree of Interactivity on User Experience (UX) and social acceptability (SA) in Mobile Augmented Reality (MAR) applications. As AR technologies become more prevalent, understanding how varying levels of interactivity influence both user perception and social dynamics is crucial for their design and adoption. Two commercially available MAR applications, IKEA and Virtlo, which differ significantly in their interactivity levels, were used to conduct a user study. The study examines how body movements required for interaction with AR content affect both UX and SA, shedding light on users’ comfort levels and potential social barriers in public settings. The findings suggest a complex relationship between interactivity, perceived usability, and social considerations, emphasizing the need for a balanced design approach. This research provides valuable insights into the development of future AR applications by addressing not only usability but also the broader social implications of AR interactions. By integrating social acceptability into traditional UX evaluations, this study highlights its significance in ensuring the seamless integration of AR technologies into everyday environments.
[HC-15] Integrating Virtual and Augmented Reality into Public Education: Opportunities and Challenges in Language Learning
【TL;DR】This paper addresses the difficulties of integrating virtual reality (VR) and augmented reality (AR) into language learning in public education, including usability, cognitive load, curriculum fit, and insufficient teacher training. The key to its solution lies in improving interface design, reducing cognitive load, increasing adaptability, and strengthening infrastructure and teacher professional development, enabling effective and sustainable adoption of immersive technologies in language teaching.
Link: https://arxiv.org/abs/2603.08970
Authors: Tanja Kojić, Maurizio Vergari, Giulia-Marielena Benta, Joy Krupinski, Maximilian Warsinke, Sebastian Möller, Jan-Niklas Voigt-Antons
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments: Paper presented at the International Conference on Human-Computer Interaction (HCII 2025)
Abstract:Virtual Reality (VR) and Augmented Reality (AR) are emerging as transformative tools in education, offering new possibilities for engagement and immersion. This paper explores their potential in language learning within public education, focusing on their ability to enhance traditional schooling methods and address existing educational gaps. The integration of VR and AR in schools, however, is not without challenges, including usability, technical barriers, and the alignment of these technologies with existing curricula. Drawing on two empirical studies, this work investigates the opportunities and challenges of VR- and AR-assisted language learning and proposes strategies for their effective implementation in the public sector. The findings show that VR increases motivation and immersion but has an unclear impact on vocabulary retention, with technical limitations and cognitive overload identified as key challenges. AR enhances contextual learning and accessibility but faces usability constraints and limited personalization. To facilitate effective adoption, this paper recommends improving interface design, reducing cognitive load, increasing adaptability, and ensuring adequate infrastructure and teacher training. Overcoming these barriers will enable a more effective integration of immersive technologies in language education.
[HC-16] Design Guidance Towards Addressing Over-Reliance on AI in Sensemaking
【速读】:该论文旨在解决生成式 AI (Generative AI) 系统在协作工作与学习中因设计不当而导致群体过度依赖显式指令、削弱自主意义建构(sensemaking)能力的问题。其解决方案的关键在于引入群体意识工具(Group Awareness Tools, GATs),通过隐式引导机制——即以可视化方式外显协作过程中的可观察数据,揭示组员间的差异,从而引发认知冲突,激发个体主动深化理解与讨论,最终促进自主意义建构的自然涌现。
链接: https://arxiv.org/abs/2603.08903
作者: Yihang Zhao,Wenxin Zhang,Amy Rechkemmer,Albert Meroño Peñuela,Elena Simperl
机构: King’s College London(伦敦国王学院); Technical University of Munich(慕尼黑工业大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Sensemaking in collaborative work and learning is increasingly supported by GenAI systems, however, emerging evidence suggests that poorly designed GenAI systems tend to provide explicit instruction that groups passively follow, fostering over-reliance and eroding autonomous sensemaking. Group awareness tools (GATs) address this challenge through implicit guidance: rather than instructing groups on what to do, GATs externalize observable collaboration data through visualizations that reveal differences between group members to create cognitive conflict, which triggers autonomous elaboration and discussion, thereby implicitly guiding autonomous sensemaking emergence. Drawing on an initial literature search of existing GAT systems, this paper explores the design of GenAI-augmented GATs to support autonomous sensemaking in collaborative work and learning, presenting preliminary design principles for discussion.
[HC-17] Exploring the Design of GenAI-Based Systems to Support Socially Shared Metacognition
【速读】:该论文旨在解决生成式 AI(Generative AI)在协同工作与学习中可能引发的“过度依赖”问题,即不当设计的 GenAI 系统会削弱群体自主调节认知过程的能力,从而阻碍社会共享元认知(Socially Shared Metacognition, SSM)的发展。解决方案的关键在于设计增强型群组意识工具(Group Awareness Tools, GATs),通过可视化社会与认知信息、凸显组员间的差异以激发认知冲突,并间接引导群体自发进行深入讨论与自我调节,从而实现对 SSM 的隐性支持与自主涌现。
链接: https://arxiv.org/abs/2603.08894
作者: Yihang Zhao,Wenxin Zhang,Amy Rechkemmer,Albert Meroño-Peñuela,Elena Simperl
机构: King’s College London(伦敦国王学院); Technical University of Munich(慕尼黑工业大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Socially shared metacognition (SSM) refers to the collective monitoring and regulation of joint cognitive processes in collaborative problem-solving, and is essential for effective knowledge work and learning. Generative AI (GenAI)-based systems offer new opportunities to support SSM, but emerging evidence suggests that poorly designed systems can encourage over-reliance on AI-generated explicit instruction and erode groups’ capacity to develop autonomous regulatory processes. Group awareness tools (GATs) address this challenge through established design principles that make social and cognitive awareness information visible, highlight differences between group members to create cognitive conflict, and trigger autonomous elaboration and discussion, thereby implicitly guiding autonomous SSM emergence. This paper explores the design of GenAI-augmented GATs to support autonomous SSM in collaborative work and learning through an initial literature search, presenting preliminary design principles for discussion.
[HC-18] A Decentralized Frontier AI Architecture Based on Personal Instances Synthetic Data and Collective Context Synchronization
【速读】:该论文旨在解决当前集中式大规模语言模型(Large Language Models, LLMs)所面临的计算资源集中、能耗高、数据获取受限及治理复杂等结构性问题。其解决方案的关键在于提出一种去中心化的前沿模型架构(Decentralized Frontier Model Architecture, DFMA),通过本地运行的AI实例生成基于推理过程和交互的合成学习信号,并在共享的“集体上下文场”(Collective Context Field, CCF)中聚合这些信号,从而在无需直接参数同步的情况下实现跨网络的推理行为条件化。该机制支持隐私保护下的集体学习与抽象知识的分布式共享,同时引入能量自适应模型演化策略,使学习活动与可再生能源供给动态匹配,从而构建一种类生物神经网络的分布式认知系统,为人工智能提供一条基于分布上下文学习和集体经验积累的新规模扩展路径。
链接: https://arxiv.org/abs/2603.08893
作者: Jacek Małecki,Alexander Mathiesen-Ohman,Katarzyna Tworek
机构: Wrocław University of Science and Technology (弗罗茨瓦夫科学与技术大学); AMOTHO Research Institute (AMOTHO 研究所)
类目: Human-Computer Interaction (cs.HC)
备注: 38 pages, 2 figures
Abstract:Recent progress in artificial intelligence has been driven largely by the scaling of centralized large language models through increased parameters, datasets, and computational resources. While effective, this paradigm introduces structural constraints related to compute concentration, energy consumption, data availability, and governance. This paper proposes an alternative architectural approach through the H3LIX Decentralized Frontier Model Architecture (DFMA), a distributed AI framework in which locally operating AI instances generate synthetic learning signals derived from reasoning processes and interactions. These signals are aggregated within a shared contextual substrate termed the Collective Context Field (CCF), which conditions reasoning behavior across the network without requiring direct parameter synchronization. By enabling contextual signal propagation rather than centralized retraining at every iteration, the architecture can be designed to support privacy-preserving collective learning under explicit assumptions, while facilitating distributed sharing of learned abstractions. The system further integrates Energy-Adaptive Model Evolution, aligning learning activities with renewable energy availability to support more sustainable AI infrastructure. Conceptually, the architecture reframes artificial intelligence as a distributed cognitive system analogous to biological neural networks, in which intelligence emerges from the interaction of many locally adaptive agents within a shared contextual environment. Together, these mechanisms suggest a new scaling pathway for artificial intelligence systems based on distributed contextual learning and collective experience accumulation.
[HC-19] Touching Emotions Smelling Shapes: Exploring Tactile Olfactory and Emotional Cross-sensory Correspondences in Preschool Aged Children
【速读】:该论文旨在解决学前儿童(2-4岁)在多感官整合过程中跨感官对应关系(cross-sensory correspondence)的实证认知机制不明确的问题。其解决方案的关键在于通过设计基于游戏的任务,系统考察嗅觉-触觉-情绪之间的映射关系,并识别出支撑这些感知映射的关联策略,从而为早期儿童多感官认知提供实证依据,并提出与儿童感官联结方式一致的设计指南及可复现的探测方法。
链接: https://arxiv.org/abs/2603.08889
作者: Tegan Roberts-Morgan,Min S. Li,Priscilla Lo,Zhuzhi Fan,Dan Bennett,Oussama Metatla
机构: University of Bristol(布里斯托大学); Aalborg University(奥尔堡大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:The use of a wide range of sensory modalities is increasingly central to technologies for learning, communication, and affective regulation. During the preschool years, sensory integration develops rapidly, shaping how children perceive and make sense of their environments. A key component of this process is cross-sensory correspondence: the systematic ways in which perceptions in different sensory modalities influence one another. Despite its relevance, little is known about cross-sensory correspondences in preschool-aged children (2-4 years). We present a study with 26 preschoolers examining smell-touch-emotion correspondences through playful tasks. We found significant correspondences both between sensory modalities and between sensory modalities and affective judgements. Further analysis revealed association strategies underpinning these mappings. We contribute empirical insights into cross-sensory correspondences in early childhood, design guidelines that align with how preschoolers relate sensory input, and a replicable method for probing cross-sensory cognition in this age group.
[HC-20] Unpacking Interpretability: Human-Centered Criteria for Optimal Combinatorial Solutions
【速读】:该论文旨在解决算法支持系统中最优解难以理解的问题,即在机器生成的多个等效最优解中,如何量化并识别哪些结构特征使解决方案更具可解释性(interpretability),从而促进人类与算法的有效协作。其关键解决方案是通过实验范式让参与者在两个等效最优的装箱方案中选择更易理解的方案,并发现三个可量化的结构属性显著影响人类偏好:与贪心启发式(greedy heuristic)的一致性、组内组成简单性(simple within-bin composition)以及有序视觉呈现(ordered visual representation)。其中,有序表示和启发式一致性关联最强,组成简单性也具稳定效应,为实现可解释性感知的优化与展示提供了明确的设计依据。
链接: https://arxiv.org/abs/2603.08856
作者: Dominik Pegler,Frank Jäkel,David Steyrl,Frank Scharnowski,Filip Melinscak
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 66 pages (42 main text, 24 appendix), 18 figures (5 in main text, 13 in appendix)
Abstract:Algorithmic support systems often return optimal solutions that are hard to understand. Effective human-algorithm collaboration, however, requires interpretability. When machine solutions are equally optimal, humans must select one, but a precise account of what makes one solution more interpretable than another remains missing. To identify structural properties of interpretable machine solutions, we present an experimental paradigm in which participants chose which of two equally optimal solutions for packing items into bins was easier to understand. We show that preferences reliably track three quantifiable properties of solution structure: alignment with a greedy heuristic, simple within-bin composition, and ordered visual representation. The strongest associations were observed for ordered representations and heuristic alignment, with compositional simplicity also showing a consistent association. Reaction-time evidence was mixed, with faster responses observed primarily when heuristic differences were larger, and aggregate webcam-based gaze did not show reliable effects of complexity. These results provide a concrete, feature-based account of interpretability in optimal packing solutions, linking solution structure to human preference. By identifying actionable properties (simple compositions, ordered representation, and heuristic alignment), our findings enable interpretability-aware optimization and presentation of machine solutions, and outline a path to quantify trade-offs between optimality and interpretability in real-world allocation and design tasks.
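上面提到的三个结构属性中,“与贪心启发式的一致性”最容易程序化示意。下面给出一个假设性的最小示例(与论文实验无关,物品与容量均为虚构):用首次适应递减(First-Fit Decreasing)贪心启发式生成装箱方案,这类解往往同时具有组成简单、按大小有序的特点:

```python
def first_fit_decreasing(items, capacity):
    """首次适应递减贪心装箱:物品从大到小依次放入第一个放得下的箱子。"""
    bins = []  # 每个箱子以其内物品列表表示
    for item in sorted(items, reverse=True):
        for b in bins:
            if sum(b) + item <= capacity:
                b.append(item)
                break
        else:
            bins.append([item])  # 没有箱子放得下,开新箱
    return bins

packing = first_fit_decreasing([7, 5, 4, 3, 2, 2], capacity=10)
print(packing)  # [[7, 3], [5, 4], [2, 2]]:每箱组成简单且整体有序
```

按论文结论,与此类贪心结构一致、并以有序方式呈现的最优解,更容易被人类理解。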
[HC-21] Investigating the Effects of LLM Use on Critical Thinking Under Time Constraints: Access Timing and Time Availability
【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)对批判性思维能力的实际影响是否具有方向性一致性,尤其是在不同时间约束条件下,LLM的介入时机如何调节人类在认知任务中的表现。解决方案的关键在于通过一个被试间实验设计(n=393),系统考察两种时间变量的作用:一是LLM访问时机(早期、连续、晚期或无LLM),二是任务时间可用性(时间不足或充足)。研究发现存在“时间反转效应”——即LLM在任务初期提供时,在时间压力下提升表现,但在充足时间内反而损害表现;反之,延迟使用LLM或完全不使用则在充足时间内表现更优。这一结果表明,时间约束是决定LLM是增强还是削弱批判性思维的核心机制,为设计人机协同的认知支持策略提供了关键依据。
链接: https://arxiv.org/abs/2603.08849
作者: Jiayin Zhi,Harsh Kumar,Mina Lee
机构: University of Chicago(芝加哥大学); University of Toronto(多伦多大学)
类目: Human-Computer Interaction (cs.HC)
备注: CHI 2026
Abstract:The impact of large language models (LLMs) on critical thinking has provoked growing attention, yet this impact on actual performance may not be uniformly negative or positive. Particularly, the role of time – the temporal context under which an LLM is provided – remains overlooked. In a between-subjects experiment (n=393), we examined two types of time constraints for a critical thinking task requiring participants to make a reasoned decision for a real-world scenario based on diverse documents: (1) LLM access timing – an LLM available only at the beginning (early), throughout (continuous), near the end (late), or not at all (no LLM), and (2) time availability – insufficient or sufficient time for the task. We found a temporal reversal: LLM access from the start (early, continuous) improved performance under time pressure but impaired it with sufficient time, whereas beginning the task independently (late, no LLM) showed the opposite pattern. These findings demonstrate that time constraints fundamentally shape whether an LLM augments or undermines critical thinking, making time a central consideration when designing LLM support and evaluating human-AI collaboration in cognitive tasks.
[HC-22] The Data-Dollars Tradeoff: Privacy Harms vs. Economic Risk in Personalized AI Adoption
【速读】:该论文旨在解决隐私担忧如何影响用户对生成式 AI (Generative AI) 个性化服务的采纳问题,特别是信息环境中的风险与模糊性如何调节用户的决策行为。其关键发现是:当数据泄露概率明确(风险情境)时,隐私威胁并不会显著降低AI个性化采纳率;而在泄露概率范围不确定(模糊情境)时,隐私威胁会显著抑制用户采纳行为,且该效应在敏感人口统计信息和匿名偏好数据中均成立。研究进一步表明,用户对隐私披露标签存在系统性高估,说明存在对透明度制度的强烈需求,但隐私威胁并未影响后续与算法的议价行为。因此,解决方案的关键在于识别“模糊性”而非单纯隐私偏好,是驱动用户规避个性化AI的核心机制。
链接: https://arxiv.org/abs/2603.08848
作者: Alexander Erlei,Tahir Abbas,Kilian Bizer,Ujwal Gadiraju
机构: University of Göttingen (哥廷根大学); Wageningen University and Research (瓦赫宁恩大学与研究机构); Delft University of Technology (代尔夫特理工大学)
类目: Human-Computer Interaction (cs.HC); General Economics (econ.GN)
备注:
Abstract:Privacy concerns significantly impact AI adoption, yet little is known about how information environments shape user responses to data leak threats. We conducted a 2 x 3 between-subjects experiment (N=610) examining how risk versus ambiguity about privacy leaks affects the adoption of AI personalization. Participants chose between standard and AI-personalized product baskets, with personalization requiring data sharing that could leak to pricing algorithms. Under risk (30% leak probability), we found no difference in AI adoption between privacy-threatening and neutral conditions (ca. 50% adoption). Under ambiguity (10-50% range), privacy threats significantly reduced adoption compared to neutral conditions. This effect holds for sensitive demographic data as well as anonymized preference data. Users systematically over-bid for privacy disclosure labels, suggesting strong demand for transparency institutions. Notably, privacy leak threats did not affect subsequent bargaining behavior with algorithms. Our findings indicate that ambiguity over data leaks, rather than only privacy preferences per se, drives avoidance behavior among users towards personalized AI.
[HC-23] NaviNote: Enabling In-situ Spatial Annotation Authoring to Support Exploration and Navigation for Blind and Low Vision People
【速读】:该论文旨在解决盲人及低视力(Blind and Low Vision, BLV)用户在使用现有基于全球定位系统(GPS)的注释系统时面临的精度不足问题,以及这些系统尚未经过BLV用户实际评估的缺陷。当前GPS技术存在数米级别的偏差,限制了BLV用户对环境的准确感知与导航能力。为应对这一挑战,研究者提出了一种名为NaviNote的新系统,其关键创新在于融合视觉定位技术实现高精度(亚米级)空间定位,并采用代理式(agentic)架构支持语音驱动的注释创建与导航功能。该方案不仅提升了BLV用户的导航性能,还增强了他们对周围环境的理解与参与度,从而推动了更精准、包容性的位置注释系统的开发。
链接: https://arxiv.org/abs/2603.08837
作者: Ruijia Chen,Yuheng Wu,Charlie Houseago,Filipe Gaspar,Filippo Aleotti,Dorian Gálvez-López,Oliver Johnston,Diego Mazala,Guillermo Garcia-Hernando,Maryam Bandukda,Gabriel Brostow,Jessica Van Brummelen
机构: University of Wisconsin-Madison(威斯康星大学麦迪逊分校); Niantic Spatial, Inc.(Niantic空间公司); University College London(伦敦大学学院)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:GPS and smartphones enable users to place location-based annotations, capturing rich environmental context. Previous research demonstrates that blind and low vision (BLV) people can use annotations to explore unfamiliar areas. However, current commercial systems allowing BLV users to create annotations have never been evaluated, and current GPS-based systems can deviate several meters. Motivated by high-accuracy visual positioning technology, we first conducted a formative study with 24 BLV participants to envision a more accurate and inclusive annotation system. Surprisingly, many participants viewed the high-accuracy technology not just as an annotation system but also as a tool for precise last-few-meters navigation. Guided by participant feedback, we developed NaviNote, which combines vision-based high-precision localization with an agentic architecture to enable voice-based annotation authoring and navigation. Evaluating NaviNote with 18 BLV participants showed that it significantly improved navigation performance and supported users in understanding and annotating their surroundings. Based on these findings, we discuss design considerations for future accessible annotation authoring systems.
[HC-24] Clarifying the Compass: A Reflexive Narrative on Entry Barriers into HCI and Aging Research
【速读】:该论文试图解决人机交互(Human-Computer Interaction, HCI)与老龄化研究之间存在的跨学科协作断层问题,具体表现为老年群体的实际需求与新兴技术设计之间的脱节。其解决方案的关键在于通过亲身参与和共情实践——即作者在养老社区的志愿服务经历——深化对老年人群体的理解,并以此为基础推动更具包容性和适老化的设计思维,从而弥合技术开发与用户真实需求之间的鸿沟。
链接: https://arxiv.org/abs/2603.08818
作者: Tianyi Li,Jin Wei-Kocsis
机构: Purdue University (普渡大学)
类目: Human-Computer Interaction (cs.HC)
备注: Paper accepted at the CHI digiage workshop: this https URL
Abstract:This manuscript presents the perspectives and reflections of two researchers who were not previously engaged in aging research, regarding the gaps and barriers related to interdisciplinary collaboration on HCI and Aging research. The manuscript has two sections. In the first section, the authors discuss their observations on the disconnect between the needs of aging populations and the design of emerging technologies. The second section delves into their personal journey of developing empathy and a deeper understanding of older adults by volunteering in a senior living community, and shares their reflective thoughts on these experiences.
计算机视觉
[CV-0] From Data Statistics to Feature Geometry: How Correlations Shape Superposition
【速读】:该论文旨在解决当前机制可解释性研究中对神经网络特征叠加表示(superposition)的理解局限问题,即现有理论多基于理想化假设(如特征稀疏且不相关),而忽略了真实数据中特征间存在强相关性的现实情况。其解决方案的关键在于提出一种受控的“词袋叠加”(Bag-of-Words Superposition, BOWS)框架,通过将互联网文本的二进制词袋表示以共激活模式为依据进行排列,使干扰从单纯的噪声转变为可被利用的建设性信号;这种结构化排列结合ReLU非线性激活函数,在保留局部几何特性的同时,自然生成语义聚类和循环结构,从而更准确地解释真实语言模型中的观测现象。
链接: https://arxiv.org/abs/2603.09972
作者: Lucas Prieto,Edward Stevinson,Melih Barsbey,Tolga Birdal,Pedro A.M. Mediano
机构: Imperial College London (帝国理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A central idea in mechanistic interpretability is that neural networks represent more features than they have dimensions, arranging them in superposition to form an over-complete basis. This framing has been influential, motivating dictionary learning approaches such as sparse autoencoders. However, superposition has mostly been studied in idealized settings where features are sparse and uncorrelated. In these settings, superposition is typically understood as introducing interference that must be minimized geometrically and filtered out by non-linearities such as ReLUs, yielding local structures like regular polytopes. We show that this account is incomplete for realistic data by introducing Bag-of-Words Superposition (BOWS), a controlled setting to encode binary bag-of-words representations of internet text in superposition. Using BOWS, we find that when features are correlated, interference can be constructive rather than just noise to be filtered out. This is achieved by arranging features according to their co-activation patterns, making interference between active features constructive, while still using ReLUs to avoid false positives. We show that this kind of arrangement is more prevalent in models trained with weight decay and naturally gives rise to semantic clusters and cyclical structures which have been observed in real language models yet were not explained by the standard picture of superposition. Code for this paper can be found at this https URL.
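摘要中“相关特征间的干扰可以是建设性的”这一点可用一个玩具构造直观化。以下为纯手工设定的二维叠加示意(假设性示例,并非论文的 BOWS 实现;方向角与偏置均为人为选取):常共现的两个特征方向相近,ReLU 加偏置滤除假阳性:

```python
import math

def direction(deg):
    rad = math.radians(deg)
    return (math.cos(rad), math.sin(rad))

# 三个特征压缩进二维隐空间:f0 与 f1 经常共现,方向相近;f2 与二者远离
W = [direction(0), direction(40), direction(200)]

def readout(active, b=0.8):
    """隐向量 h 为活跃特征方向之和,读出为 ReLU(W_i · h - b)。"""
    h = [sum(W[i][k] for i in active) for k in (0, 1)]
    return [max(0.0, W[i][0] * h[0] + W[i][1] * h[1] - b) for i in range(3)]

print(readout({0}))     # 仅 f0 激活:f1 受到的小正干扰被 ReLU 偏置滤除
print(readout({0, 1}))  # f0、f1 共同激活:相近方向的干扰相互增强(建设性)
```

两特征共同激活时,各自读出值高于单独激活时的两倍,正是“按共激活模式排列使干扰变为建设性信号”的缩影。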
[CV-1] ReCoSplat: Autoregressive Feed-Forward Gaussian Splatting Using Render-and-Compare
【速读】:该论文旨在解决在线新视角合成(novel view synthesis)中的挑战,即如何从顺序且通常未标定的观测中实现鲁棒的场景重建。核心问题在于:尽管基于相机位姿(camera poses)组装局部高斯分布(Gaussian Splatting)比在规范空间(canonical space)预测更高效,但训练时若使用真值位姿可保证稳定,却会导致推理时使用预测位姿时的分布偏移(distribution mismatch)。解决方案的关键是提出一个“渲染与比较”(Render-and-Compare, ReCo)模块:该模块通过从预测视角渲染当前重建结果并与输入观测进行对比,提供一个稳定的条件信号以补偿位姿误差,从而缓解分布不匹配问题。此外,为支持长序列处理,还设计了一种混合KV缓存压缩策略,结合早期层截断与分块级选择性保留,使KV缓存大小减少超过90%,显著提升效率。
链接: https://arxiv.org/abs/2603.09968
作者: Freeman Cheng,Botao Ye,Xueting Li,Junqi You,Fangneng Zhan,Ming-Hsuan Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Online novel view synthesis remains challenging, requiring robust scene reconstruction from sequential, often unposed, observations. We present ReCoSplat, an autoregressive feed-forward Gaussian Splatting model supporting posed or unposed inputs, with or without camera intrinsics. While assembling local Gaussians using camera poses scales better than canonical-space prediction, it creates a dilemma during training: using ground-truth poses ensures stability but causes a distribution mismatch when predicted poses are used at inference. To address this, we introduce a Render-and-Compare (ReCo) module. ReCo renders the current reconstruction from the predicted viewpoint and compares it with the incoming observation, providing a stable conditioning signal that compensates for pose errors. To support long sequences, we propose a hybrid KV cache compression strategy combining early-layer truncation with chunk-level selective retention, reducing the KV cache size by over 90% for 100+ frames. ReCoSplat achieves state-of-the-art performance across different input settings on both in- and out-of-distribution benchmarks. Code and pretrained models will be released. Our project page is at this https URL .
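摘要中的混合 KV 缓存压缩由“前层截断”与“分块级选择性保留”组成。后者可粗略示意如下(假设性草图,打分方式与保留策略均为示意,并非论文实现):按每块的重要性得分保留最高的若干块,并额外保留最近的块:

```python
def compress_kv_chunks(chunk_scores, keep_top=2, keep_recent=1):
    """分块级选择性保留:保留得分最高的 keep_top 个块,
    外加最近的 keep_recent 个块,其余从 KV 缓存中丢弃。"""
    n = len(chunk_scores)
    recent = set(range(n - keep_recent, n))
    by_score = sorted(range(n), key=lambda i: chunk_scores[i], reverse=True)
    top = set(by_score[:keep_top])
    return sorted(top | recent)  # 被保留块的索引,按时间顺序

kept = compress_kv_chunks([0.9, 0.1, 0.6, 0.2, 0.3], keep_top=2, keep_recent=1)
print(kept)  # [0, 2, 4]:高分块 0、2 与最近块 4 被保留
```

实际系统中得分通常来自注意力统计,并与摘要所述的前几层整体截断配合,才能达到 90% 以上的缓存压缩。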
[CV-2] BEACON: Language-Conditioned Navigation Affordance Prediction under Occlusion
【速读】:该论文旨在解决语言引导的局部导航任务中,现有视觉-语言空间定位方法因依赖图像空间推理而导致在遮挡区域(如家具或移动人类造成的遮挡)无法准确推断可通行目标位置的问题。其解决方案的关键在于提出BEACON框架,该框架通过在鸟瞰图(Bird’s-Eye View, BEV)空间中预测一个以机器人为中心的可达性热力图(affordance heatmap),从而覆盖包含遮挡区域的局部环境;具体实现上,BEACON将空间线索注入视觉-语言模型(Vision-Language Model, VLM),并融合VLM输出与基于深度信息生成的BEV特征,以增强对非可见区域的目标定位能力。
链接: https://arxiv.org/abs/2603.09961
作者: Xinyu Gao,Gang Chen,Javier Alonso-Mora
机构: Delft University of Technology (代尔夫特理工大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages. Project page: this https URL
Abstract:Language-conditioned local navigation requires a robot to infer a nearby traversable target location from its current observation and an open-vocabulary, relational instruction. Existing vision-language spatial grounding methods usually rely on vision-language models (VLMs) to reason in image space, producing 2D predictions tied to visible pixels. As a result, they struggle to infer target locations in occluded regions, typically caused by furniture or moving humans. To address this issue, we propose BEACON, which predicts an ego-centric Bird’s-Eye View (BEV) affordance heatmap over a bounded local region including occluded areas. Given an instruction and surround-view RGB-D observations from four directions around the robot, BEACON predicts the BEV heatmap by injecting spatial cues into a VLM and fusing the VLM’s output with depth-derived BEV features. Using an occlusion-aware dataset built in the Habitat simulator, we conduct detailed experimental analysis to validate both our BEV space formulation and the design choices of each module. Our method improves the accuracy averaged across geodesic thresholds by 22.74 percentage points over the state-of-the-art image-space baseline on the validation subset with occluded target locations. Our project page is: this https URL.
[CV-3] From Semantics to Pixels: Coarse-to-Fine Masked Autoencoders for Hierarchical Visual Understanding
【速读】:该论文旨在解决自监督视觉预训练方法中对比学习(Contrastive Learning, CL)与掩码图像建模(Masked Image Modeling, MIM)之间的固有矛盾:CL能够捕捉全局语义信息但丢失细粒度细节,而MIM虽能保留局部纹理特征却因语义无关的随机掩码导致“注意力漂移”问题。其解决方案的关键在于提出一种粗到细的掩码自动编码器(Coarse-to-Fine Masked Autoencoder, C2FMAE),通过显式学习跨三个数据粒度(场景级、实例级和像素级)的分层视觉表征来缓解该矛盾。具体而言,两个协同创新机制共同作用:一是级联解码器逐级从场景语义重建至对象实例再至像素细节,建立明确的跨粒度依赖关系;二是渐进式掩码课程动态调整训练焦点,由语义引导逐步过渡到实例引导直至随机掩码,形成从全局上下文到局部特征的结构化学习路径。
链接: https://arxiv.org/abs/2603.09955
作者: Wenzhao Xiang,Yue Wu,Hongyang Yu,Feng Gao,Fan Yang,Xilin Chen
机构: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学); Pengcheng Laboratory (鹏城实验室); School of Arts, Peking University (北京大学艺术学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Self-supervised visual pre-training methods face an inherent tension: contrastive learning (CL) captures global semantics but loses fine-grained detail, while masked image modeling (MIM) preserves local textures but suffers from “attention drift” due to semantically-agnostic random masking. We propose C2FMAE, a coarse-to-fine masked autoencoder that resolves this tension by explicitly learning hierarchical visual representations across three data granularities: semantic masks (scene-level), instance masks (object-level), and RGB images (pixel-level). Two synergistic innovations enforce a strict top-down learning principle. First, a cascaded decoder sequentially reconstructs from scene semantics to object instances to pixel details, establishing explicit cross-granularity dependencies that parallel decoders cannot capture. Second, a progressive masking curriculum dynamically shifts the training focus from semantic-guided to instance-guided and finally to random masking, creating a structured learning path from global context to local features. To support this framework, we construct a large-scale multi-granular dataset with high-quality pseudo-labels for all 1.28M ImageNet-1K images. Extensive experiments show that C2FMAE achieves significant performance gains on image classification, object detection, and semantic segmentation, validating the effectiveness of our hierarchical design in learning more robust and generalizable representations.
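渐进式掩码课程可理解为三种掩码策略的采样权重随训练进度平滑过渡。以下调度函数仅为示意(线性过渡为假设,论文的具体日程以原文为准):

```python
def masking_curriculum(progress):
    """返回 (语义引导, 实例引导, 随机) 三种掩码策略的采样权重。
    progress ∈ [0, 1] 为训练进度;权重随进度由粗到细线性过渡。"""
    semantic = max(0.0, 1.0 - 2.0 * progress)   # 前半程逐渐退出
    random_m = max(0.0, 2.0 * progress - 1.0)   # 后半程逐渐主导
    instance = 1.0 - semantic - random_m        # 中段达到峰值
    return semantic, instance, random_m

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, masking_curriculum(p))
```

训练时可按这些权重随机选择当步的掩码策略,从而实现“从全局上下文到局部特征”的结构化学习路径。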
[CV-4] Leveraging whole slide difficulty in Multiple Instance Learning to improve prostate cancer grading
【速读】:该论文旨在解决组织病理学中全切片图像(Whole Slide Images, WSI)分类任务中因专家与非专家病理学家标注不一致而导致的模型性能下降问题。其核心挑战在于,尽管滑片的金标准标签由专家确定,但非专家在诊断时易出现分歧,这种分歧可反映滑片本身的判读难度。解决方案的关键是引入“全切片难度”(Whole Slide Difficulty, WSD)这一新概念,基于专家与非专家之间的标注差异进行量化,并通过两种策略——多任务学习框架和加权分类损失函数——将WSD信息融入训练过程,从而提升模型对高 Gleason 评分(即更差预后)滑片的分类准确性,尤其在不同特征编码器和多实例学习(MIL)方法下均表现出一致性改进。
链接: https://arxiv.org/abs/2603.09953
作者: Marie Arrivat,Rémy Peyret,Elsa Angelini,Pietro Gori
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ISBI 2026
Abstract:Multiple Instance Learning (MIL) has been widely applied in histopathology to classify Whole Slide Images (WSIs) with slide-level diagnoses. While the ground truth is established by expert pathologists, the slides can be difficult to diagnose for non-experts and lead to disagreements between the annotators. In this paper, we introduce the notion of Whole Slide Difficulty (WSD), based on the disagreement between an expert and a non-expert pathologist. We propose two different methods to leverage WSD, a multi-task approach and a weighted classification loss approach, and we apply them to Gleason grading of prostate cancer slides. Results show that integrating WSD during training consistently improves the classification performance across different feature encoders and MIL methods, particularly for higher Gleason grades (i.e. worse diagnosis).
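两种利用 WSD 的方案中,加权分类损失较易示意:以专家与非专家分级之差定义切片难度,再据此对损失加权。以下为假设性草图(难度归一化方式与权重形式均为示意,非论文原式):

```python
import math

def slide_difficulty(expert_label, non_expert_label, num_grades=5):
    """以专家与非专家分级之差(归一化到 [0, 1])作为切片难度 WSD。"""
    return abs(expert_label - non_expert_label) / (num_grades - 1)

def weighted_ce(prob_true_class, difficulty, alpha=1.0):
    """难度加权交叉熵:此处示意为难度越高权重越大;
    实际放大还是衰减取决于具体训练策略。"""
    weight = 1.0 + alpha * difficulty
    return -weight * math.log(prob_true_class)

wsd = slide_difficulty(expert_label=4, non_expert_label=2)
print(wsd, weighted_ce(0.8, wsd))
```

论文的另一方案(多任务)则将难度作为辅助预测目标、与分级任务共享表示,具体结构见原文。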
[CV-5] No Image No Problem: End-to-End Multi-Task Cardiac Analysis from Undersampled k-Space
【速读】:该论文旨在解决传统临床心脏磁共振成像(Cardiac MRI, CMR)流程中“重建-分析”(reconstruct-then-analyze)范式的固有缺陷,该范式在欠采样k空间数据到高维图像的逆问题求解过程中引入了不必要的伪影和信息瓶颈,本质上是一个病态问题。其核心矛盾在于:诊断所需的低维生理标签(physiological labels)并未直接从原始k空间数据中提取,而是通过中间图像重建步骤间接获取。解决方案的关键在于提出k-MTR(k-space Multi-Task Representation)框架,该框架通过大规模可控模拟(42,000名受试者)训练一个k空间编码器,使其将欠采样k空间数据与全采样图像映射到共享语义流形(shared semantic manifold),从而在潜在空间中直接恢复因欠采样丢失的解剖信息,绕过显式的图像重建过程。这一设计使高阶生理语义可直接从欠采样k空间表示中密集嵌入,显著提升了连续表型回归、疾病分类与解剖分割等多任务性能,为任务感知的心脏MRI工作流提供了可泛化的架构基础。
链接: https://arxiv.org/abs/2603.09945
作者: Yundi Zhang,Sevgi Gokce Kafali,Niklas Bubeck,Daniel Rueckert,Jiazhen Pan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Conventional clinical CMR pipelines rely on a sequential “reconstruct-then-analyze” paradigm, forcing an ill-posed intermediate step that introduces avoidable artifacts and information bottlenecks. This creates a fundamental mathematical paradox: it attempts to recover high-dimensional pixel arrays (i.e., images) from undersampled k-space, rather than directly extracting the low-dimensional physiological labels actually required for diagnosis. To unlock the direct diagnostic potential of k-space, we propose k-MTR (k-space Multi-Task Representation), a k-space representation learning framework that aligns undersampled k-space data and fully-sampled images into a shared semantic manifold. Leveraging a large-scale controlled simulation of 42,000 subjects, k-MTR forces the k-space encoder to restore anatomical information lost to undersampling directly within the latent space, bypassing the explicit inverse problem for downstream analysis. We demonstrate that this latent alignment enables the dense latent space embedded with high-level physiological semantics directly from undersampled frequencies. Across continuous phenotype regression, disease classification, and anatomical segmentation, k-MTR achieves highly competitive performance against state-of-the-art image-domain baselines. By showcasing that precise spatial geometries and multi-task features can be successfully recovered directly from the k-space representations, k-MTR provides a robust architectural blueprint for task-aware cardiac MRI workflows.
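理解“欠采样 k 空间”的一个常见设定是笛卡尔欠采样:保留 k 空间中心的低频相位编码行,并按加速因子均匀抽取其余行。以下掩码生成示例与论文方法本身无关,仅说明其输入数据的形态(参数均为示意):

```python
def cartesian_mask(num_lines, accel=4, center_fraction=0.08):
    """生成一维相位编码方向的欠采样掩码(1 表示采集该行)。"""
    num_center = max(1, int(num_lines * center_fraction))
    start = (num_lines - num_center) // 2
    # 先保留 k 空间中心的低频行
    mask = [1 if start <= i < start + num_center else 0 for i in range(num_lines)]
    for i in range(0, num_lines, accel):  # 其余行按加速因子 R 均匀采样
        mask[i] = 1
    return mask

m = cartesian_mask(32, accel=4)
print(sum(m), "/", len(m), "行被采集")
```

传统流程需先从这种不完整的频域数据重建图像再做分析,而 k-MTR 直接在这类欠采样表示上对齐并提取诊断语义。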
[CV-6] Unsupervised Domain Adaptation with Target-Only Margin Disparity Discrepancy
【速读】:该论文旨在解决介入放射学中锥形束计算机断层扫描(Cone-Beam Computed Tomography, CBCT)图像缺乏标注数据的问题,从而限制了肝脏分割等任务的性能提升。由于CBCT与传统CT在成像范围、伪影特征及造影剂注射方式上存在差异,且现有公开数据集多集中于放疗领域,导致直接迁移基于CT训练的模型到CBCT表现不佳。解决方案的关键在于提出一种基于边缘差异差异性(Margin Disparity Discrepancy, MDD)形式化的无监督域适应(Unsupervised Domain Adaptation, UDA)框架,通过重构原始MDD优化目标,有效缩小CBCT与CT之间的域差距,显著提升在未标注CBCT数据上的肝脏分割性能,在无监督域适应和少样本设置下均达到当前最优水平。
链接: https://arxiv.org/abs/2603.09932
作者: Gauthier Miralles,Loïc Le Folgoc,Vincent Jugnon,Pietro Gori
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ISBI 2026
Abstract:In interventional radiology, Cone-Beam Computed Tomography (CBCT) is a helpful imaging modality that provides guidance to practicians during minimally invasive procedures. CBCT differs from traditional Computed Tomography (CT) due to its limited reconstructed field of view, specific artefacts, and the intra-arterial administration of contrast medium. While CT benefits from abundant publicly available annotated datasets, interventional CBCT data remain scarce and largely unannotated, with existing datasets focused primarily on radiotherapy applications. To address this limitation, we leverage a proprietary collection of unannotated interventional CBCT scans in conjunction with annotated CT data, employing domain adaptation techniques to bridge the modality gap and enhance liver segmentation performance on CBCT. We propose a novel unsupervised domain adaptation (UDA) framework based on the formalism of Margin Disparity Discrepancy (MDD), which improves target domain performance through a reformulation of the original MDD optimization framework. Experimental results on CT and CBCT datasets for liver segmentation demonstrate that our method achieves state-of-the-art performance in UDA, as well as in the few-shot setting.
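作为背景补充:原始 MDD(Zhang et al., ICML 2019)的优化目标在通用文献中一般写作(记号为示意,本文的 target-only 重构形式以原文为准):

$$
\min_{f}\ \mathrm{err}_{S}^{(\rho)}(f) + d_{f,\mathcal{F}}^{(\rho)}(S,T),
\qquad
d_{f,\mathcal{F}}^{(\rho)}(S,T) = \sup_{f' \in \mathcal{F}} \Big( \mathrm{disp}_{T}^{(\rho)}(f', f) - \mathrm{disp}_{S}^{(\rho)}(f', f) \Big)
$$

其中 $\mathrm{err}_{S}^{(\rho)}$ 为源域上的间隔误差,$\mathrm{disp}^{(\rho)}$ 为辅助分类器 $f'$ 与主分类器 $f$ 在目标域 $T$ 或源域 $S$ 上以间隔 $\rho$ 度量的差异,上确界在实践中通过对抗训练近似。从标题推测,“target-only”版本仅保留目标域差异项,具体定义请参见原文。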
[CV-7] Adaptive Clinical-Aware Latent Diffusion for Multimodal Brain Image Generation and Missing Modality Imputation
【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)诊断中多模态神经影像数据常因缺失某一模态而导致建模困难的问题。现有临床数据集普遍存在模态缺失现象,影响了多模态融合模型的性能与泛化能力。解决方案的关键在于提出ACADiff框架,其核心创新是通过自适应临床感知扩散机制(adaptive clinical-aware diffusion),在生成缺失模态时动态融合可用的影像数据与临床元数据(clinical metadata),并利用GPT-4o编码的语义提示提供临床指导。该框架采用可变融合策略以适应不同输入配置,并通过三个专用生成器实现结构磁共振成像(sMRI)、氟代脱氧葡萄糖正电子发射断层扫描(FDG-PET)和AV45-PET之间的双向合成,在极端80%模态缺失场景下仍保持优异生成质量和诊断一致性,显著优于现有基线方法。
链接: https://arxiv.org/abs/2603.09931
作者: Rong Zhou,Houliang Zhou,Yao Su,Brian Y. Chen,Yu Zhang,Lifang He,Alzheimer’s Disease Neuroimaging Initiative
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal neuroimaging provides complementary insights for Alzheimer’s disease diagnosis, yet clinical datasets frequently suffer from missing modalities. We propose ACADiff, a framework that synthesizes missing brain imaging modalities through adaptive clinical-aware diffusion. ACADiff learns mappings between incomplete multimodal observations and target modalities by progressively denoising latent representations while attending to available imaging data and clinical metadata. The framework employs adaptive fusion that dynamically reconfigures based on input availability, coupled with semantic clinical guidance via GPT-4o-encoded prompts. Three specialized generators enable bidirectional synthesis among sMRI, FDG-PET, and AV45-PET. Evaluated on ADNI subjects, ACADiff achieves superior generation quality and maintains robust diagnostic performance even under extreme 80% missing scenarios, outperforming all existing baselines. To promote reproducibility, code is available at this https URL
[CV-8] On the Structural Failure of Chamfer Distance in 3D Shape Optimization
【速读】:该论文旨在解决点云重建、补全与生成任务中,直接优化标准训练损失——倒角距离(Chamfer distance)时可能出现的悖论性失败问题:即优化后反而得到比未优化更差的Chamfer值。研究发现,这种失败源于梯度结构特性:每点的Chamfer梯度会导致"多对一坍缩"(many-to-one collapse),这是前向传播项的唯一吸引子,且无法通过任何局部正则化手段(如排斥力、平滑性或密度感知加权)消除。解决方案的关键在于引入非局部耦合机制——必须将耦合关系扩展至局部邻域之外,才能抑制坍缩。作者在受控二维场景中通过共享基变形实现全局耦合,在三维形状变形中采用可微分的粒子模拟物理(MPM)先验来体现相同原理,显著降低Chamfer差距,尤其在拓扑复杂的龙模型上提升达2.5倍。因此,是否具备非局部耦合成为决定Chamfer优化成败的核心设计准则。
链接: https://arxiv.org/abs/2603.09925
作者: Chang-Yong Song,David Hyde
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 27 pages, including supplementary material
Abstract:Chamfer distance is the standard training loss for point cloud reconstruction, completion, and generation, yet directly optimizing it can produce worse Chamfer values than not optimizing it at all. We show that this paradoxical failure is gradient-structural. The per-point Chamfer gradient creates a many-to-one collapse that is the unique attractor of the forward term and cannot be resolved by any local regularizer, including repulsion, smoothness, and density-aware re-weighting. We derive a necessary condition for collapse suppression: coupling must propagate beyond local neighborhoods. In a controlled 2D setting, shared-basis deformation suppresses collapse by providing global coupling; in 3D shape morphing, a differentiable MPM prior instantiates the same principle, consistently reducing the Chamfer gap across 20 directed pairs with a 2.5× improvement on the topologically complex dragon. The presence or absence of non-local coupling determines whether Chamfer optimization succeeds or collapses. This provides a practical design criterion for any pipeline that optimizes point-level distance metrics.
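Chamfer 距离及其前向项的"多对一"最近邻指派可以用几行 numpy 直观复现(示意代码,点集为自拟玩具数据):

```python
import numpy as np

def chamfer(pred, target):
    """对称 Chamfer 距离:两个点集间最近邻平方距离的双向平均之和。"""
    d = ((pred[:, None, :] - target[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def forward_assignment(pred, target):
    """前向项中每个预测点的最近目标点索引;多个预测点指向同一
    目标索引,正是论文所述"多对一坍缩"的表现形式。"""
    d = ((pred[:, None, :] - target[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

pred = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
target = np.array([[0.0, 0.0], [5.0, 5.0]])
print(forward_assignment(pred, target))  # [0 0 0]:三个预测点全部坍缩到同一目标
print(round(chamfer(pred, target), 4))
```

前向项对每个预测点只产生指向其最近目标的梯度,局部正则(排斥、平滑)无法改变这一指派结构,这正是论文主张需要非局部耦合的原因。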
[CV-9] WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition CVPR26
【速读】:该论文旨在解决开放域视觉实体识别(Open-domain Visual Entity Recognition, VER)中生成式方法计算成本高、难以规模化部署的问题。其解决方案的关键在于重新审视对比学习范式,提出WikiCLIP框架:通过利用大语言模型(Large Language Model, LLM)嵌入作为知识丰富的实体表示,并引入视觉引导的知识适配器(Vision-Guided Knowledge Adaptor, VGKA),在图像块级别对齐文本语义与视觉线索;同时设计难负样本合成机制,在训练过程中生成视觉相似但语义不同的负样本,以增强细粒度判别能力。该方法在多个公开基准(如OVEN)上显著优于现有强基线,尤其在未见类别上提升达16%,且推理延迟较领先生成模型AutoVER降低近100倍。
链接: https://arxiv.org/abs/2603.09921
作者: Shan Ning,Longtian Qiu,Jiaxuan Sun,Xuming He
机构: ShanghaiTech University (上海科技大学); Shanghai Engineering Research Center of Intelligent Vision and Imaging (上海智能视觉与成像工程研究中心); Lingang Laboratory (临港实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR26, codes and weights are publicly available
Abstract:Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia. Recent generative methods tailored for VER demonstrate strong performance but incur high computational costs, limiting their scalability and practical deployment. In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER. WikiCLIP leverages large language model embeddings as knowledge-rich entity representations and enhances them with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the patch level. To further encourage fine-grained discrimination, a Hard Negative Synthesis Mechanism generates visually similar but semantically distinct negatives during training. Experimental results on popular open-domain VER benchmarks, such as OVEN, demonstrate that WikiCLIP significantly outperforms strong baselines. Specifically, WikiCLIP achieves a 16% improvement on the challenging OVEN unseen set, while reducing inference latency by nearly 100 times compared with the leading generative model, AutoVER. The project page is available at this https URL
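对比范式下"难负样本增强训练信号"的效果,可以用一个 InfoNCE 形式的玩具示例说明(以下函数与嵌入均为自拟示意,并非 WikiCLIP 原实现;难负样本在此简化为一条与正样本相近的额外负嵌入):

```python
import numpy as np

def info_nce(image_emb, entity_embs, pos_idx, temperature=0.07):
    """图像嵌入与候选实体嵌入间的对比损失(InfoNCE 形式)。
    entity_embs 中可混入"视觉相似但语义不同"的合成难负样本。"""
    img = image_emb / np.linalg.norm(image_emb)
    ents = entity_embs / np.linalg.norm(entity_embs, axis=1, keepdims=True)
    logits = ents @ img / temperature          # 余弦相似度 / 温度
    log_probs = logits - np.log(np.exp(logits).sum())
    return -float(log_probs[pos_idx])

image = np.array([1.0, 0.0])
entities = np.array([[0.9, 0.1],   # 正确实体
                     [0.0, 1.0],   # 普通负样本
                     [0.8, 0.3]])  # 合成难负样本(与正样本高度相似)
loss_with_hard = info_nce(image, entities, pos_idx=0)
loss_without = info_nce(image, entities[:2], pos_idx=0)
print(loss_with_hard > loss_without)  # True:难负样本带来更强的判别压力
```

难负样本拉高了分母中的竞争项,迫使模型学习更细粒度的区分,这与摘要中 Hard Negative Synthesis Mechanism 的动机一致。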
[CV-10] Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports
【速读】:该论文旨在解决当前视觉语言模型(VLMs)在复杂动态场景中空间智能(spatial intelligence)能力不足的问题,尤其聚焦于体育场景下高密度人体运动与物体交互的空间理解挑战。其核心解决方案是构建首个面向体育场景的大规模空间智能数据集 CourtSI,包含超过 100 万条问答对,覆盖空间计数、距离测量、定位和关系推理等维度,并基于标准化球场几何结构设计半自动数据生成引擎以实现高效可扩展的数据构建。此外,论文提出 CourtSI-Bench 高质量评测基准,验证了现有 VLMs 在该任务上仍存在显著的人机性能差距及跨场景泛化能力弱的问题;通过在 CourtSI 上微调 Qwen3-VL-8B 模型,准确率提升 23.5 个百分点,并展现出良好的迁移能力和空间感知的评论生成能力,表明 CourtSI 为提升 VLMs 的空间智能提供了有效且可扩展的路径。
链接: https://arxiv.org/abs/2603.09896
作者: Yuchen Yang,Yuqing Shao,Duxiu Huang,Linfeng Dong,Yifei Liu,Suixin Tang,Xiang Zhou,Yuanyuan Gao,Wei Wang,Yue Zhou,Xue Yang,Yanfeng Wang,Xiao Sun,Zhihang Zhong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Sports have long attracted broad attention as they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interactions. To this end, we present CourtSI, the first large-scale spatial intelligence dataset tailored to sports scenarios. CourtSI contains over 1M QA pairs, organized under a holistic taxonomy that systematically covers spatial counting, distance measurement, localization, and relational reasoning, across representative net sports including badminton, tennis, and table tennis. Leveraging well-defined court geometry as metric anchors, we develop a semi-automatic data engine to reconstruct sports scenes, enabling scalable curation of CourtSI. In addition, we introduce CourtSI-Bench, a high-quality evaluation benchmark comprising 3,686 QA pairs with rigorous human verification. We evaluate 25 proprietary and open-source VLMs on CourtSI-Bench, revealing a remaining human-AI performance gap and limited generalization from existing spatial intelligence benchmarks. These findings indicate that sports scenarios expose limitations in spatial intelligence capabilities captured by existing benchmarks. Further, fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points. The adapted model also generalizes effectively to CourtSI-Ext, an evaluation set built on a similar but unseen sport, and demonstrates enhanced spatial-aware commentary generation. Together, these findings demonstrate that CourtSI provides a scalable pathway toward advancing spatial intelligence of VLMs in sports.
[CV-11] DISPLAY: Directable Human-Object Interaction Video Generation via Sparse Motion Guidance and Multi-Task Auxiliary
【速读】:该论文旨在解决当前以人类为中心的视频生成方法在生成可控且物理一致的人体-物体交互(Human-Object Interaction, HOI)视频时面临的挑战,尤其是现有方法依赖密集控制信号、模板视频或精心设计的文本提示,导致灵活性不足且难以泛化到新物体。其解决方案的关键在于提出一种名为DISPLAY的框架,该框架仅使用稀疏运动引导(Sparse Motion Guidance),即手腕关节坐标和形状无关的物体边界框,从而减轻人体与物体表征之间的不平衡并实现直观的用户控制;同时引入对象强化注意力机制(Object-Stressed Attention)提升在稀疏引导下的物体鲁棒性,并通过多任务辅助训练策略结合专门的数据清洗流程,有效利用高质量HOI样本与辅助任务,显著增强模型在多样任务中的保真度与可控性。
链接: https://arxiv.org/abs/2603.09883
作者: Jiazhi Guan,Quanwei Yang,Luying Huang,Junhao Liang,Borong Liang,Haocheng Feng,Wei He,Kaisiyuan Wang,Hang Zhou,Jingdong Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Human-centric video generation has advanced rapidly, yet existing methods struggle to produce controllable and physically consistent Human-Object Interaction (HOI) videos. Existing works rely on dense control signals, template videos, or carefully crafted text prompts, which limit flexibility and generalization to novel objects. We introduce a framework, namely DISPLAY, guided by Sparse Motion Guidance, composed only of wrist joint coordinates and a shape-agnostic object bounding box. This lightweight guidance alleviates the imbalance between human and object representations and enables intuitive user control. To enhance fidelity under such sparse conditions, we propose an Object-Stressed Attention mechanism that improves object robustness. To address the scarcity of high-quality HOI data, we further develop a Multi-Task Auxiliary Training strategy with a dedicated data curation pipeline, allowing the model to benefit from both reliable HOI samples and auxiliary tasks. Comprehensive experiments show that our method achieves high-fidelity, controllable HOI generation across diverse tasks. The project page can be found at this https URL.
[CV-12] InternVL-U: Democratizing Unified Multimodal Models for Understanding Reasoning Generation and Editing
【速读】:该论文旨在解决统一多模态模型(Unified Multimodal Models, UMMs)在保持强语义理解能力与获取强大生成能力之间存在的固有权衡问题。解决方案的关键在于提出InternVL-U,一个参数量仅为4B的轻量级UMM架构,其核心创新包括:基于统一上下文建模和模态特异性模块化设计(modality-specific modular design),采用解耦的视觉表示方式;集成先进的多模态大语言模型(Multimodal Large Language Model, MLLM)与基于MMDiT的专用视觉生成头;并通过面向高语义密度任务(如文本渲染和科学推理)的数据合成流水线,在以思维链(Chain-of-Thought, CoT)为中心的范式下,实现抽象用户意图与细粒度视觉生成细节的有效对齐。实验表明,InternVL-U在生成与编辑任务上显著优于参数规模超过3倍的基线模型(如BAGEL 14B),同时保留了强大的多模态理解和推理能力。
链接: https://arxiv.org/abs/2603.09877
作者: Changyao Tian,Danni Yang,Guanzhou Chen,Erfei Cui,Zhaokai Wang,Yuchen Duan,Penghao Yin,Sitao Chen,Ganlin Yang,Mingxin Liu,Zirun Zhu,Ziqian Fan,Leyao Gu,Haomin Wang,Qi Wei,Jinhui Yin,Xue Yang,Zhihang Zhong,Qi Qin,Yi Xin,Bin Fu,Yihao Liu,Jiaye Ge,Qipeng Guo,Gen Luo,Hongsheng Li,Yu Qiao,Kai Chen,Hongjie Zhang
机构: Shanghai AI Laboratory(上海人工智能实验室); Fudan University(复旦大学); University of Science and Technology of China(中国科学技术大学); Shanghai Jiao Tong University(上海交通大学); South China University of Technology(华南理工大学); Xiamen University(厦门大学); CUHK MMLab(香港中文大学多媒体实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: technical report, 61 pages, this https URL
Abstract:Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face inherent trade-offs between maintaining strong semantic comprehension and acquiring powerful generation capabilities. In this report, we present InternVL-U, a lightweight 4B-parameter UMM that democratizes these capabilities within a unified framework. Guided by the principles of unified contextual modeling and modality-specific modular design with decoupled visual representations, InternVL-U integrates a state-of-the-art Multimodal Large Language Model (MLLM) with a specialized MMDiT-based visual generation head. To further bridge the gap between aesthetic generation and high-level intelligence, we construct a comprehensive data synthesis pipeline targeting high-semantic-density tasks, such as text rendering and scientific reasoning, under a reasoning-centric paradigm that leverages Chain-of-Thought (CoT) to better align abstract user intent with fine-grained visual generation details. Extensive experiments demonstrate that InternVL-U achieves a superior performance-efficiency balance. Despite using only 4B parameters, it consistently outperforms unified baseline models over 3x its scale, such as BAGEL (14B), on various generation and editing tasks, while retaining strong multimodal understanding and reasoning capabilities.
[CV-13] MissBench: Benchmarking Multimodal Affective Analysis under Imbalanced Missing Modalities
【速读】:该论文旨在解决多模态情感计算(Multimodal Affective Computing)中因模态缺失不均衡导致的模型评估偏差问题。现有标准评测方法通常假设文本、语音和视觉模态均等可用,但在实际应用中,某些模态可能因成本高或易受损而出现系统性缺失,从而引发训练偏置,且仅依赖任务级指标无法揭示此类问题。解决方案的关键在于提出MissBench——一个统一的基准框架与诊断工具集,包含共享缺失率与不平衡缺失率两种协议,并引入两个诊断指标:模态公平指数(Modality Equity Index, MEI)用于衡量不同模态在多种缺失配置下的贡献公平性,以及模态学习指数(Modality Learning Index, MLI)通过比较训练过程中各模态相关模块的梯度范数来量化优化不平衡性。实验表明,即使模型在均匀缺失条件下表现良好,也可能在不平衡缺失场景下存在显著的模态不公平和优化失衡,MissBench及其指标为真实场景下多模态情感模型的鲁棒性分析提供了有效手段。
链接: https://arxiv.org/abs/2603.09874
作者: Tien Anh Pham,Phuong-Anh Nguyen,Duc-Trong Le,Cam-Van Thi Nguyen
机构: VNU University of Engineering and Technology (河内大学工程与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal affective computing underpins key tasks such as sentiment analysis and emotion recognition. Standard evaluations, however, often assume that textual, acoustic, and visual modalities are equally available. In real applications, some modalities are systematically more fragile or expensive, creating imbalanced missing rates and training biases that task-level metrics alone do not reveal. We introduce MissBench, a benchmark and framework for multimodal affective tasks that standardizes both shared and imbalanced missing-rate protocols on four widely used sentiment and emotion datasets. MissBench also defines two diagnostic metrics. The Modality Equity Index (MEI) measures how fairly different modalities contribute across missing-modality configurations. The Modality Learning Index (MLI) quantifies optimization imbalance by comparing modality-specific gradient norms during training, aggregated across modality-related modules. Experiments on representative method families show that models that appear robust under shared missing rates can still exhibit marked modality inequity and optimization imbalance under imbalanced conditions. These findings position MissBench, together with MEI and MLI, as practical tools for stress-testing and analyzing multimodal affective models in realistic incomplete-modality settings. For reproducibility, we release our code at: this https URL
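摘要未给出 MEI 的具体公式。作为示意,这里用"归一化熵"构造一个假设性的公平指数:各模态贡献越均衡得分越接近 1,越偏斜得分越低(函数名与定义均为自拟,论文中 MEI 的实际计算方式可能不同):

```python
import numpy as np

def modality_equity_index(contribs):
    """假设性的模态公平指数:对各模态的贡献分数归一化后取熵,
    再除以最大熵(均匀分布),得到 [0, 1] 区间的公平性得分。"""
    p = np.asarray(contribs, dtype=float)
    p = p / p.sum()
    h = -(p * np.log(p + 1e-12)).sum()
    return float(h / np.log(len(p)))

print(round(modality_equity_index([1.0, 1.0, 1.0]), 3))  # 1.0:三模态贡献均等
print(modality_equity_index([10.0, 1.0, 1.0]) < 0.7)     # True:贡献偏斜,公平性下降
```

MLI 的思路类似,只是统计对象换成训练中各模态相关模块的梯度范数。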
[CV-14] MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents
【速读】:该论文旨在解决多智能体环境中,如何有效理解并整合来自多个具身AI代理(embodied agents)的长时程第一人称视角视频(long-horizon egocentric videos)以支持人类与多代理系统之间的高效协作问题。其核心挑战在于:如何压缩和传输高维感官输入(如视频流),以及如何聚合多个第一人称视角视频构建系统级记忆。解决方案的关键是提出一个名为MultiAgent-EgoQA(MA-EgoQA)的新基准,用于系统性评估模型在多代理场景下的理解能力,并设计了一个简单但有效的基线模型EgoMAS,该模型通过共享记忆机制(shared memory)和代理特异性的动态检索(agent-wise dynamic retrieval)来实现对多源视频流的并行解析与上下文关联。实验表明,当前方法难以有效处理多第一人称视频流,凸显了未来在跨代理系统级理解方面的研究必要性。
链接: https://arxiv.org/abs/2603.09827
作者: Kangsan Kim,Yanlai Yang,Suji Kim,Woongyeong Yeo,Youngwan Lee,Mengye Ren,Sung Ju Hwang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Under review
Abstract:As embodied models become powerful, humans will collaborate with multiple embodied AI agents at their workplace or home in the future. To ensure better communication between human users and the multi-agent system, it is crucial to interpret incoming information from agents in parallel and refer to the appropriate context for each query. Existing challenges include effectively compressing and communicating high volumes of individual sensory inputs in the form of video and correctly aggregating multiple egocentric videos to construct system-level memory. In this work, we first formally define a novel problem of understanding multiple long-horizon egocentric videos simultaneously collected from embodied agents. To facilitate research in this direction, we introduce MultiAgent-EgoQA (MA-EgoQA), a benchmark designed to systematically evaluate existing models in our scenario. MA-EgoQA provides 1.7k questions unique to multiple egocentric streams, spanning five categories: social interaction, task coordination, theory-of-mind, temporal reasoning, and environmental interaction. We further propose a simple baseline model for MA-EgoQA named EgoMAS, which leverages shared memory across embodied agents and agent-wise dynamic retrieval. Through comprehensive evaluation across diverse baselines and EgoMAS on MA-EgoQA, we find that current approaches are unable to effectively handle multiple egocentric streams, highlighting the need for future advances in system-level understanding across the agents. The code and benchmark are available at this https URL.
[CV-15] VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models CVPR2026
【速读】:该论文旨在解决文本到点云(Text-to-point-cloud, T2P)定位任务中现有方法依赖浅层文本-点云对应关系、缺乏有效空间推理能力,导致在复杂环境中定位精度受限的问题。其解决方案的关键在于提出VLM-Loc框架,该框架利用大视觉语言模型(Large Vision-Language Models, VLMs)的空间推理能力,通过将点云转换为鸟瞰图(Bird’s-eye-view, BEV)图像与场景图(scene graph)联合编码几何与语义上下文,构建结构化输入以学习跨模态表示;并在此基础上引入部分节点分配机制(partial node assignment),显式地将文本线索与场景图节点关联,从而实现可解释的空间推理,提升定位准确性。
链接: https://arxiv.org/abs/2603.09826
作者: Shuhao Kang,Youqi Liao,Peijie Wang,Wenlong Liao,Qilin Zhang,Benjamin Busam,Xieyuanli Chen,Yun Liu
机构: Nankai University (南开大学); Wuhan University (武汉大学); CASIA (中国科学院自动化研究所); COWAROBOT; TUM; MCML; NUDT (国防科技大学); AAIS, Nankai University (南开大学人工智能学院); NKIARI, Shenzhen Futian (深圳市福田区人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:Text-to-point-cloud (T2P) localization aims to infer precise spatial positions within 3D point cloud maps from natural language descriptions, reflecting how humans perceive and communicate spatial layouts through language. However, existing methods largely rely on shallow text-point cloud correspondence without effective spatial reasoning, limiting their accuracy in complex environments. To address this limitation, we propose VLM-Loc, a framework that leverages the spatial reasoning capability of large vision-language models (VLMs) for T2P localization. Specifically, we transform point clouds into bird’s-eye-view (BEV) images and scene graphs that jointly encode geometric and semantic context, providing structured inputs for the VLM to learn cross-modal representations bridging linguistic and spatial semantics. On top of these representations, we introduce a partial node assignment mechanism that explicitly associates textual cues with scene graph nodes, enabling interpretable spatial reasoning for accurate localization. To facilitate systematic evaluation across diverse scenes, we present CityLoc, a benchmark built from multi-source point clouds for fine-grained T2P localization. Experiments on CityLoc demonstrate VLM-Loc achieves superior accuracy and robustness compared to state-of-the-art methods. Our code, model, and dataset are available at this https URL.
[CV-16] BrainSTR: Spatio-Temporal Contrastive Learning for Interpretable Dynamic Brain Network Modeling
【速读】:该论文旨在解决动态功能连接(Dynamic Functional Connectivity, DFC)在神经精神疾病诊断中可解释性不足的问题,尤其是诊断信号往往微弱且在时间和拓扑结构上分布稀疏,同时受噪声波动和非诊断性连接的干扰。其解决方案的关键在于提出一种时空对比学习框架 BrainSTR,通过数据驱动的自适应相位划分模块识别状态一致的相位边界,结合注意力机制定位诊断关键相位,并利用基于二值化、时间平滑性和稀疏性的增量图结构生成器提取各相位内的疾病相关连接;进一步引入时空监督对比学习方法,以优化样本间的相似性度量并捕获更具判别力的时空特征,从而构建语义结构清晰、可解释性强的表示空间。
链接: https://arxiv.org/abs/2603.09825
作者: Guiliang Guo,Guangqi Wen,Lingwen Liu,Ruoxian Song,Peng Cao,Jinzhu Yang,Fei Wang,Xiaoli Liu,Osmar R. Zaiane
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Dynamic functional connectivity captures time-varying brain states for better neuropsychiatric diagnosis and spatio-temporal interpretability, i.e., identifying when discriminative disease signatures emerge and where they reside in the connectivity topology. Reliable interpretability faces major challenges: diagnostic signals are often subtle and sparsely distributed across both time and topology, while nuisance fluctuations and non-diagnostic connectivities are pervasive. To address these issues, we propose BrainSTR, a spatio-temporal contrastive learning framework for interpretable dynamic brain network modeling. BrainSTR learns state-consistent phase boundaries via a data-driven Adaptive Phase Partition module, identifies diagnostically critical phases with attention, and extracts disease-related connectivity within each phase using an Incremental Graph Structure Generator regularized by binarization, temporal smoothness, and sparsity. Then, we introduce a spatio-temporal supervised contrastive learning approach that leverages diagnosis-relevant spatio-temporal patterns to refine the similarity metric between samples and capture more discriminative spatio-temporal features, thereby constructing a well-structured semantic space for coherent and interpretable representations. Experiments on ASD, BD, and MDD validate the effectiveness of BrainSTR, and the discovered critical phases and subnetworks provide interpretable evidence consistent with prior neuroimaging findings. Our code: this https URL.
[CV-17] ConfCtrl: Enabling Precise Camera Control in Video Diffusion via Confidence-Aware Interpolation
【速读】:该论文旨在解决从仅有两张输入图像在大视角变化下进行新视角合成(novel view synthesis)的难题。现有基于回归的方法难以重建未见过的区域,而基于相机引导的扩散模型常因点云投影噪声或相机位姿条件不足导致生成轨迹偏离预期。解决方案的关键在于提出ConfCtrl框架,其核心创新是通过置信度加权的投影点云潜在表示初始化扩散过程,并引入类卡尔曼滤波的预测-更新机制,将投影点云视为含噪观测,利用学习到的残差校正平衡位姿驱动预测与几何观测不确定性,从而在可靠投影区域增强约束、弱化不确定区域影响,实现几何一致且视觉合理的视角生成。
链接: https://arxiv.org/abs/2603.09819
作者: Liudi Yang,George Eskandar,Fengyi Shen,Mohammad Altillawi,Yang Bai,Chi Zhang,Ziyuan Liu,Abhinav Valada
机构: University of Freiburg (弗莱堡大学); Ludwig Maximilian University of Munich (慕尼黑路德维希-马克西米利安大学); Technical University of Munich (慕尼黑工业大学); Huawei Heisenberg Research Center (慕尼黑华为海森堡研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages
Abstract:We address the challenge of novel view synthesis from only two input images under large viewpoint changes. Existing regression-based methods lack the capacity to reconstruct unseen regions, while camera-guided diffusion models often deviate from intended trajectories due to noisy point cloud projections or insufficient conditioning from camera poses. To address these issues, we propose ConfCtrl, a confidence-aware video interpolation framework that enables diffusion models to follow prescribed camera poses while completing unseen regions. ConfCtrl initializes the diffusion process by combining a confidence-weighted projected point cloud latent with noise as the conditioning input. It then applies a Kalman-inspired predict-update mechanism, treating the projected point cloud as a noisy measurement and using learned residual corrections to balance pose-driven predictions with noisy geometric observations. This allows the model to rely on reliable projections while down-weighting uncertain regions, yielding stable, geometry-aware generation. Experiments on multiple datasets show that ConfCtrl produces geometrically consistent and visually plausible novel views, effectively reconstructing occluded regions under large viewpoint changes.
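摘要中"把投影点云当作含噪观测、用置信度平衡预测与观测"的卡尔曼式预测-更新机制,可以用逐元素混合来示意(以下代码为自拟极简示意:置信度直接充当增益,替代论文中可学习的残差校正):

```python
import numpy as np

def predict_update(prediction, measurement, confidence):
    """Kalman 式"预测-更新"示意:按置信度逐元素混合位姿驱动的
    预测与含噪点云投影"观测"。confidence ∈ [0, 1] 充当增益:
    置信度高处信观测,置信度低处退回预测。"""
    return prediction + confidence * (measurement - prediction)

pred = np.array([0.0, 0.0])   # 位姿驱动的预测
meas = np.array([1.0, 1.0])   # 投影点云观测(可能含噪)
conf = np.array([1.0, 0.0])   # 第一维投影可靠,第二维不可靠
print(predict_update(pred, meas, conf))  # [1. 0.]
```

这正对应摘要中"依赖可靠投影、弱化不确定区域"的行为:可靠区域被观测约束,不确定区域保持位姿驱动的预测。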
[CV-18] RA-SSU: Towards Fine-Grained Audio-Visual Learning with Region-Aware Sound Source Understanding
【速读】:该论文旨在解决传统音频-视觉学习(Audio-Visual Learning, AVL)任务多局限于粗粒度感知(如音视频对应、声源定位等)的问题,提出一种细粒度的音频-视觉学习新任务——区域感知声源理解(Region-Aware Sound Source Understanding, RA-SSU),以实现帧级、区域感知且高质量的声源理解。解决方案的关键在于:首先构建两个细粒度标注的数据集——f-Music 和 f-Lifescene,分别涵盖复杂音乐场景和多样化生活场景中的声源掩码与逐帧文本描述;其次设计 SSUFormer 框架,采用多模态输入与输出架构,并引入掩码协同模块(Mask Collaboration Module, MCM)提升分割精度,以及分层提示专家混合模块(Mixture of Hierarchical-prompted Experts, MoHE)增强声源描述的丰富性与准确性,从而在多个基准上实现最先进的性能(SOTA)。
链接: https://arxiv.org/abs/2603.09809
作者: Muyi Sun,Yixuan Wang,Hong Wang,Chen Su,Man Zhang,Xingqun Qi,Qi Li,Zhenan Sun
机构: Beijing University of Posts and Telecommunications (北京邮电大学); The Hong Kong University of Science and Technology (香港科技大学); Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE TMM
Abstract:Audio-Visual Learning (AVL) is a fundamental task in multi-modality learning and embodied intelligence, playing a vital role in scene understanding and interaction. However, previous research has mostly explored downstream tasks from a coarse-grained perspective (e.g., audio-visual correspondence, sound source localization, and audio-visual event localization). To provide more specific scene perception details, we define a new fine-grained Audio-Visual Learning task, termed Region-Aware Sound Source Understanding (RA-SSU), which aims to achieve region-aware, frame-level, and high-quality sound source understanding. To support this goal, we innovatively construct two corresponding datasets, i.e. fine-grained Music (f-Music) and fine-grained Lifescene (f-Lifescene), each containing annotated sound source masks and frame-by-frame textual descriptions. The f-Music dataset includes 3,976 samples across 22 scene types related to specific application scenarios, focusing on music scenes with complex instrument mixing. The f-Lifescene dataset contains 6,156 samples across 61 types representing diverse sounding objects in life scenarios. Moreover, we propose SSUFormer, a Sound-Source Understanding TransFormer benchmark that facilitates both sound source segmentation and sound region description with a multi-modal input and multi-modal output architecture. Specifically, we design two modules for this framework, Mask Collaboration Module (MCM) and Mixture of Hierarchical-prompted Experts (MoHE), to respectively enhance the accuracy and enrich the detail of the sound source descriptions. Extensive experiments are conducted on our two datasets to verify the feasibility of the task, evaluate the utility of the datasets, and demonstrate the superiority of the SSUFormer, which achieves SOTA performance on the Sound Source Understanding benchmark.
[CV-19] st-time Ego-Exo-centric Adaptation for Action Anticipation via Multi-Label Prototype Growing and Dual-Clue Consistency CVPR2026
【速读】:该论文旨在解决视角间动作预测(Ego-Exo Action Anticipation)中因依赖目标视角数据训练而导致的计算与数据收集成本高昂的问题,提出了一种在测试阶段在线适应源视角模型以实现目标视角动作预测的新任务——Test-time Ego-Exo Adaptation for Action Anticipation (TE²A³)。其解决方案的关键在于设计了一个双线索增强原型生长网络(Dual-Clue enhanced Prototype Growing Network, DCPGN):首先通过多标签原型生长模块(ML-PGM)利用多标签分配和置信度重加权策略动态更新类级记忆库,缓解多动作候选下的类别不平衡问题;其次引入双线索一致性模块(DCCM),借助轻量级叙述器生成文本线索以补充视觉线索中的对象信息,并通过对齐文本与视觉logits构建跨模态一致性约束,从而在时空维度上有效弥合Ego与Exo视角间的差异。
链接: https://arxiv.org/abs/2603.09798
作者: Zhaofeng Shi,Heqian Qiu,Lanxiao Wang,Qingbo Wu,Fanman Meng,Lili Pan,Hongliang Li
机构: University of Electronic Science and Technology of China, Chengdu, China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Efficient adaptation between Egocentric (Ego) and Exocentric (Exo) views is crucial for applications such as human-robot cooperation. However, the success of most existing Ego-Exo adaptation methods relies heavily on target-view data for training, thereby increasing computational and data collection costs. In this paper, we make the first exploration of a Test-time Ego-Exo Adaptation for Action Anticipation (TE²A³) task, which aims to adjust the source-view-trained model online during test time to anticipate target-view actions. It is challenging for existing Test-Time Adaptation (TTA) methods to address this task due to the multi-action candidates and significant temporal-spatial inter-view gap. Hence, we propose a novel Dual-Clue enhanced Prototype Growing Network (DCPGN), which accumulates multi-label knowledge and integrates cross-modality clues for effective test-time Ego-Exo adaptation and action anticipation. Specifically, we propose a Multi-Label Prototype Growing Module (ML-PGM) to balance multiple positive classes via multi-label assignment and confidence-based reweighting for class-wise memory banks, which are updated by an entropy priority queue strategy. Then, the Dual-Clue Consistency Module (DCCM) introduces a lightweight narrator to generate textual clues indicating action progressions, which complement the visual clues containing various objects. Moreover, we constrain the inferred textual and visual logits to construct dual-clue consistency for temporally and spatially bridging Ego and Exo views. Extensive experiments on the newly proposed EgoMe-anti and the existing EgoExoLearn benchmarks show the effectiveness of our method, which outperforms related state-of-the-art methods by a large margin. Code is available at this https URL.
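摘要中"以熵优先队列策略更新类级记忆库"的做法,可以用标准库的 heapq 给出一个极简示意(类名与接口为说明自拟,非 ML-PGM 原实现):超出容量时优先淘汰预测熵最高(最不确定)的样本。

```python
import heapq
import numpy as np

def entropy(probs):
    p = np.asarray(probs, dtype=float)
    return float(-(p * np.log(p + 1e-12)).sum())

class PrototypeBank:
    """按类维护容量受限的记忆库:以预测熵为优先级入堆,
    超容量时弹出熵最高的条目,原型取留存特征的均值。"""
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.banks = {}  # 类别 -> [(-entropy, feature), ...] 的堆

    def update(self, cls, feature, probs):
        heap = self.banks.setdefault(cls, [])
        heapq.heappush(heap, (-entropy(probs), list(feature)))
        if len(heap) > self.capacity:
            heapq.heappop(heap)  # 弹出 -entropy 最小者,即熵最大的样本

    def prototype(self, cls):
        feats = np.array([f for _, f in self.banks[cls]])
        return feats.mean(axis=0)

bank = PrototypeBank(capacity=2)
bank.update(0, [1.0, 0.0], [0.9, 0.1])    # 低熵,保留
bank.update(0, [0.0, 1.0], [0.5, 0.5])    # 高熵,超容量后被淘汰
bank.update(0, [3.0, 0.0], [0.95, 0.05])  # 低熵,保留
print(bank.prototype(0))  # [2. 0.]
```

论文中的多标签分配与置信度重加权在此省略,这里只演示"熵优先淘汰"这一更新策略本身。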
[CV-20] What is Missing? Explaining Neurons Activated by Absent Concepts
【速读】:该论文旨在解决现有可解释人工智能(Explainable Artificial Intelligence, XAI)方法在揭示深度神经网络中“编码缺失”(encoded absences)这一重要但被忽视的因果关系时存在的局限性问题。具体而言,主流XAI方法如归因法(attribution methods)和特征可视化(feature visualization methods)通常聚焦于输入中与神经元强激活相关的概念存在性,而忽略了概念缺失反而增强神经元激活的现象。解决方案的关键在于提出两种简单的技术扩展:一是改进归因方法以识别导致高激活的缺失特征;二是优化特征可视化方法以捕捉由缺失概念引发的激活模式。通过这些改进,论文证明了能够有效揭示并解释模型中编码缺失的现象,并展示了其在ImageNet模型中的普遍性及对去偏(debiasing)性能提升的价值。
链接: https://arxiv.org/abs/2603.09787
作者: Robin Hesse,Simone Schaub-Meyer,Janina Hesse,Bernt Schiele,Stefan Roth
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Preprint
Abstract:Explainable artificial intelligence (XAI) aims to provide human-interpretable insights into the behavior of deep neural networks (DNNs), typically by estimating a simplified causal structure of the model. In existing work, this causal structure often includes relationships where the presence of a concept is associated with a strong activation of a neuron. For example, attribution methods primarily identify input pixels that contribute most to a prediction, and feature visualization methods reveal inputs that cause high activation of a target neuron - the former implicitly assuming that the relevant information resides in the input, and the latter that neurons encode the presence of concepts. However, a largely overlooked type of causal relationship is that of encoded absences, where the absence of a concept increases neural activation. In this work, we show that such missing but relevant concepts are common and that mainstream XAI methods struggle to reveal them when applied in their standard form. To address this, we propose two simple extensions to attribution and feature visualization techniques that uncover encoded absences. Across experiments, we show how mainstream XAI methods can be used to reveal and explain encoded absences, how ImageNet models exploit them, and that debiasing can be improved when considering them.
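"编码缺失"以及标准归因为何会漏掉它,可以用一个玩具线性神经元说明(示例为自拟说明,非论文方法本身):权重为负的特征缺失时,激活反而升高,而"梯度×输入"式归因给缺失特征的相关性恒为 0。

```python
import numpy as np

# 玩具线性"神经元":w[1] < 0,特征 1 缺失(取 0)时激活更高,
# 即一个被编码的"缺失"。
w = np.array([1.0, -2.0])

present = np.array([1.0, 1.0])   # 特征 1 存在
absent = np.array([1.0, 0.0])    # 特征 1 缺失
print(w @ present, w @ absent)   # -1.0 1.0:缺失时激活更高

# 标准"梯度×输入"归因:缺失特征的输入为 0,相关性被记为 0,
# 从而掩盖了激活升高的真正原因:
print(w * absent)

# 一个朴素的"缺失感知"打分:相对"存在"参考值的贡献,可暴露编码缺失:
print(w * (absent - present))    # 特征 1 的缺失贡献 +2
```

论文对归因与特征可视化方法的扩展远比这个二维玩具精细,但核心现象相同:相关信息可以存在于"输入中没有的东西"里。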
[CV-21] Removing the Trigger Not the Backdoor: Alternative Triggers and Latent Backdoors
【速读】:该论文试图解决当前后门防御方法的局限性问题,即现有方法假设移除已知触发器(trigger)即可消除后门,但忽略了存在其他感知上与训练触发器不同的替代触发器(alternative triggers),这些替代触发器同样能激活同一后门。解决方案的关键在于:通过对比干净样本与被触发样本在特征空间中的表示,估计出后门的方向(backdoor direction),并设计一种基于特征引导的攻击策略,联合优化目标预测与方向对齐。该方法从输入空间转向特征空间,提出针对后门方向而非具体触发器的防御思路,从而更有效地识别和抵御后门攻击。
链接: https://arxiv.org/abs/2603.09772
作者: Gorka Abad,Ermes Franch,Stefanos Koffas,Stjepan Picek
机构: University of Bergen(卑尔根大学); Delft University of Technology(代尔夫特理工大学); University of Zagreb(萨格勒布大学); Radboud University(拉德堡德大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:
Abstract:Current backdoor defenses assume that neutralizing a known trigger removes the backdoor. We show this trigger-centric view is incomplete: \emphalternative triggers, patterns perceptually distinct from training triggers, reliably activate the same backdoor. We estimate the alternative trigger backdoor direction in feature space by contrasting clean and triggered representations, and then develop a feature-guided attack that jointly optimizes target prediction and directional alignment. First, we theoretically prove that alternative triggers exist and are an inevitable consequence of backdoor training. Then, we verify this empirically. Additionally, defenses that remove training triggers often leave backdoors intact, and alternative triggers can exploit the latent backdoor feature-space. Our findings motivate defenses targeting backdoor directions in representation space rather than input-space triggers.
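摘要所述"对比干净与带触发器样本的特征来估计后门方向",最简单的形式就是两组特征均值之差的归一化(以下为自拟合成数据上的示意实现):

```python
import numpy as np

rng = np.random.default_rng(0)

def backdoor_direction(clean_feats, triggered_feats):
    """以"触发样本特征均值 − 干净样本特征均值"估计后门方向,
    并归一化为单位向量(摘要所述对比式估计的最简形式)。"""
    d = triggered_feats.mean(axis=0) - clean_feats.mean(axis=0)
    return d / np.linalg.norm(d)

# 合成特征:触发样本相当于干净特征加上第 3 维上的固定偏移
clean = rng.normal(size=(200, 8))
shift = np.zeros(8)
shift[3] = 5.0
triggered = rng.normal(size=(200, 8)) + shift
d = backdoor_direction(clean, triggered)
print(int(np.argmax(np.abs(d))))  # 3:估计方向集中在被偏移的维度
```

论文的特征引导攻击在此基础上联合优化目标类预测与该方向上的对齐度;这个示例只演示方向估计一步。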
[CV-22] Ego: Embedding-Guided Personalization of Vision-Language Models
【速读】:该论文旨在解决当前多模态语言模型在个性化应用中面临的通用性不足问题,即如何在不牺牲模型泛化能力的前提下实现高效、轻量的个性化体验。现有方法通常依赖额外训练阶段或复杂的外部模块集成,导致可扩展性差或部署效率低。其解决方案的关键在于利用模型内部注意力机制提取代表特定概念的视觉token作为记忆单元,从而无需重新训练即可实现对目标概念的识别与描述,显著降低了个性化开销并提升了实用性。
链接: https://arxiv.org/abs/2603.09771
作者: Soroush Seifi,Simon Gardier,Vaggelis Dorovatas,Daniel Olmeda Reino,Rahaf Aljundi
机构: Toyota Motor Europe (丰田汽车欧洲)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:AI assistants that support humans in daily life are becoming increasingly feasible, driven by the rapid advancements in multimodal language models. A key challenge lies in overcoming the generic nature of these models to deliver personalized experiences. Existing approaches to personalizing large vision language models often rely on additional training stages, which limit generality and scalability, or on engineered pipelines with external pre-trained modules, which hinder deployment efficiency. In this work, we propose an efficient personalization method that leverages the model’s inherent ability to capture personalized concepts. Specifically, we extract visual tokens that predominantly represent the target concept by utilizing the model’s internal attention mechanisms. These tokens serve as a memory of that specific concept, enabling the model to recall and describe it when it appears in test images. We conduct a comprehensive and unified evaluation of our approach and SOTA methods across various personalization settings including single-concept, multi-concept, and video personalization, demonstrating strong performance gains with minimal personalization overhead.
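"利用模型内部注意力筛选代表目标概念的视觉 token 作为记忆"的思路,可以用 top-k 选择来示意(函数与数据均为自拟说明,非论文原实现):

```python
import numpy as np

def select_concept_tokens(tokens, attention, k=2):
    """取注意力权重最高的 k 个视觉 token,作为目标概念的紧凑"记忆";
    保留 token 的原始顺序。"""
    top = np.sort(np.argsort(attention)[::-1][:k])
    return tokens[top]

tokens = np.arange(12, dtype=float).reshape(6, 2)  # 6 个 token,维度 2
attn = np.array([0.05, 0.4, 0.1, 0.3, 0.1, 0.05])
print(select_concept_tokens(tokens, attn, k=2))  # 保留注意力最高的 token 1 和 3
```

推理时将这些记忆 token 与测试图像的 token 一并送入模型,即可在无需再训练的情况下让模型"回忆"该概念。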
[CV-23] PanoAffordanceNet: Towards Holistic Affordance Grounding in 360° Indoor Environments
【速读】:该论文旨在解决360°室内环境中场景级 affordance(可操作性)接地问题,现有方法多局限于物体中心且仅支持视角受限的感知,难以实现全局语义一致性。其核心挑战包括等距圆柱投影(Equirectangular Projection, ERP)带来的严重几何失真、语义分散以及跨尺度对齐困难。解决方案的关键在于提出PanoAffordanceNet框架,包含两个创新模块:一是畸变感知的谱调制器(Distortion-Aware Spectral Modulator, DASM),用于纬度依赖的校准以缓解ERP失真;二是全向球面稠密化头(Omni-Spherical Densification Head, OSDH),从稀疏激活中恢复拓扑连续性。此外,通过像素级、分布级和区域-文本对比约束的多层级联合优化,有效抑制了弱监督下的语义漂移,从而实现了更鲁棒的全景语义理解。
链接: https://arxiv.org/abs/2603.09760
作者: Guoliang Zhu,Wanjun Jia,Caoyang Shao,Yuheng Zhang,Zhiyong Li,Kailun Yang
机构: Hunan University (湖南大学); National Engineering Research Center of Robot Visual Perception and Control Technology (机器人视觉感知与控制技术国家工程研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: The source code and benchmark dataset will be made publicly available at this https URL
Abstract:Global perception is essential for embodied agents in 360° spaces, yet current affordance grounding remains largely object-centric and restricted to perspective views. To bridge this gap, we introduce a novel task: Holistic Affordance Grounding in 360° Indoor Environments. This task faces unique challenges, including severe geometric distortions from Equirectangular Projection (ERP), semantic dispersion, and cross-scale alignment difficulties. We propose PanoAffordanceNet, an end-to-end framework featuring a Distortion-Aware Spectral Modulator (DASM) for latitude-dependent calibration and an Omni-Spherical Densification Head (OSDH) to restore topological continuity from sparse activations. By integrating multi-level constraints comprising pixel-wise, distributional, and region-text contrastive objectives, our framework effectively suppresses semantic drift under low supervision. Furthermore, we construct 360-AGD, the first high-quality panoramic affordance grounding dataset. Extensive experiments demonstrate that PanoAffordanceNet significantly outperforms existing methods, establishing a solid baseline for scene-level perception in embodied intelligence. The source code and benchmark dataset will be made publicly available at this https URL.
[CV-24] LogoDiffuser: Training-Free Multilingual Logo Generation and Stylization via Letter-Aware Attention Control
【速读】:该论文旨在解决多语言设计标志(logo)生成中视觉与文本元素难以和谐融合的问题,现有方法在应用创意风格时易扭曲字符几何结构,且缺乏对多语言文本生成的支持而需额外训练。解决方案的关键在于提出一种无需训练的LogoDiffuser方法,利用多模态扩散Transformer(multimodal diffusion transformer)将目标字符以图像形式输入,从而实现对字符结构的鲁棒控制;其核心创新是通过分析联合注意力机制识别出对文本结构响应最强的核心令牌(core tokens),并注入最具信息量的注意力图来整合字符结构与视觉设计,同时采用层间聚合策略缓解跨层注意力偏移,确保核心令牌的一致性,最终实现高质量的多语言标志生成。
链接: https://arxiv.org/abs/2603.09759
作者: Mingyu Kang,Hyein Seo,Yuna Jeong,Junhyeong Park,Yong Suk Choi
机构: Hanyang University (汉阳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in text-to-image generation have been remarkable, but generating multilingual design logos that harmoniously integrate visual and textual elements remains a challenging task. Existing methods often distort character geometry when applying creative styles and struggle to support multilingual text generation without additional training. To address these challenges, we propose LogoDiffuser, a training-free method that synthesizes multilingual logo designs using the multimodal diffusion transformer. Instead of using textual prompts, we input the target characters as images, enabling robust character structure control regardless of language. We first analyze the joint attention mechanism to identify core tokens, which are tokens that strongly respond to textual structures. With this observation, our method integrates character structure and visual design by injecting the most informative attention maps. Furthermore, we perform layer-wise aggregation of attention maps to mitigate attention shifts across layers and obtain consistent core tokens. Extensive experiments and user studies demonstrate that our method achieves state-of-the-art performance in multilingual logo generation.
[CV-25] LAP: A Language-Aware Planning Model For Procedure Planning In Instructional Videos
【速读】:该论文旨在解决程序规划(procedure planning)中因视觉观察存在歧义而导致动作预测不准确的问题,即不同动作在视觉上可能高度相似,使得仅依赖视觉输入的方法难以区分。解决方案的关键在于引入语言感知规划(Language-Aware Planning, LAP),其核心创新是利用预训练的视觉语言模型(Vision Language Model, VLM)将视觉观测转化为文本描述,并提取更具区分性的文本嵌入(text embeddings),这些嵌入作为扩散模型(diffusion model)的输入用于生成动作序列,从而显著提升规划的准确性与鲁棒性。
链接: https://arxiv.org/abs/2603.09743
作者: Lei Shi,Victor Aregbede,Andreas Persson,Martin Längkvist,Amy Loutfi,Stephanie Lowry
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Procedure planning requires a model to predict a sequence of actions that transform a start visual observation into a goal in instructional videos. While most existing methods rely primarily on visual observations as input, they often struggle with the inherent ambiguity where different actions can appear visually similar. In this work, we argue that language descriptions offer a more distinctive representation in the latent space for procedure planning. We introduce Language-Aware Planning (LAP), a novel method that leverages the expressiveness of language to bridge visual observation and planning. LAP uses a finetuned Vision Language Model (VLM) to translate visual observations into text descriptions and to predict actions and extract text embeddings. These text embeddings are more distinctive than visual embeddings and are used in a diffusion model for planning action sequences. We evaluate LAP on three procedure planning benchmarks: CrossTask, Coin, and NIV. LAP achieves new state-of-the-art performance across multiple metrics and time horizons by large margin, demonstrating the significant advantage of language-aware planning.
[CV-26] ENIGMA-360: An Ego-Exo Dataset for Human Behavior Understanding in Industrial Scenarios
【速读】:该论文旨在解决工业场景中人类行为理解因缺乏同时包含第一人称视角(ego)与第三人称视角(exo)数据而受限的问题,从而阻碍了支持工人作业和提升安全性的智能系统发展。其解决方案的关键在于提出ENIGMA-360数据集,该数据集在真实工业环境中采集,包含180对时间同步的第一人称与第三人称视频,且配有时空标注,为研究工业场景下人类行为的不同维度提供了高质量、互补的多视角数据基础。通过在此数据集上开展三项基础任务(时序动作分割、关键步骤识别、第一人称人-物交互检测)的基线实验,验证了现有方法在该复杂场景中的局限性,凸显了开发鲁棒的多视角行为理解模型的必要性。
链接: https://arxiv.org/abs/2603.09741
作者: Francesco Ragusa,Rosario Leonardi,Michele Mazzamuto,Daniele Di Mauro,Camillo Quattrocchi,Alessandro Passanisi,Irene D’Ambra,Antonino Furnari,Giovanni Maria Farinella
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Understanding human behavior from complementary egocentric (ego) and exocentric (exo) points of view enables the development of systems that can support workers in industrial environments and enhance their safety. However, progress in this area is hindered by the lack of datasets capturing both views in realistic industrial scenarios. To address this gap, we propose ENIGMA-360, a new ego-exo dataset acquired in a real industrial scenario. The dataset is composed of 180 egocentric and 180 exocentric procedural videos, temporally synchronized, offering complementary information about the same scene. The 360 videos have been labeled with temporal and spatial annotations, enabling the study of different aspects of human behavior in the industrial domain. We provide baseline experiments for 3 foundational tasks for human behavior understanding: 1) Temporal Action Segmentation, 2) Keystep Recognition and 3) Egocentric Human-Object Interaction Detection, showing the limits of state-of-the-art approaches on this challenging scenario. These results highlight the need for new models capable of robust ego-exo understanding in real-world environments. We publicly release the dataset and its annotations at this https URL.
[CV-27] Let's Reward Step-by-Step: Step-Aware Contrastive Alignment for Vision-Language Navigation in Continuous Environments
【速读】:该论文旨在解决视觉语言导航在连续环境(Vision-Language Navigation in Continuous Environments, VLN-CE)中代理模型训练时面临的三大挑战:一是监督微调(Supervised Fine-Tuning, SFT)策略易产生误差累积且难以从分布外状态中恢复;二是强化学习微调(Reinforcement Fine-Tuning, RFT)方法如GRPO受限于稀疏的最终奖励信号,其二元反馈无法对单步行为进行精准赋权,导致失败批次中梯度信号坍塌。解决方案的关键在于提出步骤感知对比对齐(Step-Aware Contrastive Alignment, SACA)框架,其核心创新包括:(1) 感知 grounded 的步骤感知审计器(Perception-Grounded Step-Aware Auditor),可逐步评估进展并分离失败轨迹的有效前缀与精确偏离点;(2) 场景条件分组构建机制(Scenario-Conditioned Group Construction),根据轨迹质量动态分配批量至专用重采样与优化策略,从而提取密集监督信号以提升模型泛化能力、错误恢复能力和训练稳定性。
链接: https://arxiv.org/abs/2603.09740
作者: Haoyuan Li,Rui Liu,Hehe Fan,Yi Yang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 10 figures
Abstract:Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to learn complex reasoning from long-horizon human interactions. While Multi-modal Large Language Models (MLLMs) have driven recent progress, current training paradigms struggle to balance generalization capability, error recovery and training stability. Specifically, (i) policies derived from SFT suffer from compounding errors, struggling to recover from out-of-distribution states, and (ii) Reinforcement Fine-Tuning (RFT) methods e.g. GRPO are bottlenecked by sparse outcome rewards. Their binary feedback fails to assign credit to individual steps, leading to gradient signal collapse in failure dominant batches. To address these challenges, we introduce Step-Aware Contrastive Alignment (SACA), a framework designed to extract dense supervision from imperfect trajectories. At its core, the Perception-Grounded Step-Aware auditor evaluates progress step-by-step, disentangling failed trajectories into valid prefixes and exact divergence points. Leveraging these signals, Scenario-Conditioned Group Construction mechanism dynamically routes batches to specialized resampling and optimization strategies. Extensive experiments on VLN-CE benchmarks demonstrate that SACA achieves state-of-the-art performance.
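SACA 的步骤感知审计器要把失败轨迹拆解为"有效前缀 + 精确偏离点",其核心逻辑可草拟如下(仅为示意性实现:用每步进度分数与阈值判断偏离,分数来源与阈值均为本文假设,非论文细节):

```python
def split_trajectory(step_progress, threshold=0.0):
    """对一条失败轨迹,按逐步进度分数找出第一个"偏离步"。

    返回 (有效前缀, 偏离点下标);若没有任何一步低于阈值,
    偏离点返回 None(整条轨迹都算有效前缀)。
    """
    for i, p in enumerate(step_progress):
        if p < threshold:
            return step_progress[:i], i
    return step_progress[:], None

# 玩具示例:智能体在第 3 步(下标从 0 起)开始偏航
progress = [0.4, 0.3, 0.2, -0.1, -0.3]
prefix, div = split_trajectory(progress)
print(prefix, div)  # [0.4, 0.3, 0.2] 3
```

拆出的有效前缀可继续作为正向监督信号,偏离点之后的步骤则进入论文所述的重采样与专门优化路径,从而把稀疏的最终奖励变为密集监督。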
[CV-28] M2-Occ: Resilient 3D Semantic Occupancy Prediction for Autonomous Driving with Incomplete Camera Inputs
【速读】:该论文旨在解决在多摄像头输入不完整情况下,如何保持语义占据预测(Semantic Occupancy Prediction)的几何结构和语义一致性的问题。现有基于摄像头的方法通常假设拥有完整的环视观测,但在实际部署中由于遮挡、硬件故障或通信中断等因素,这一假设难以满足。为应对这一挑战,作者提出 M²-Occ 框架,其核心创新在于两个互补模块:一是多视角掩码重建(Multi-view Masked Reconstruction, MMR)模块,利用相邻摄像头间的空间重叠,在特征空间中直接恢复缺失视角的表示;二是特征记忆模块(Feature Memory Module, FMM),引入可学习的记忆库存储类别级语义原型,通过检索并融合这些全局先验信息来修正模糊体素特征,从而保障在观测证据不足时仍能维持语义一致性。实验表明,该方法在nuScenes基准上的缺失后视图场景下IoU提升4.93%,在五视角缺失场景下提升达5.01%,且不影响全视角性能。
链接: https://arxiv.org/abs/2603.09737
作者: Kaixin Lin,Kunyu Peng,Di Wen,Yufan Chen,Ruiping Liu,Kailun Yang
机构: Hunan University (湖南大学); Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); Sofia University “St. Kliment Ohridski” (索非亚大学“克莱门特·奥霍里斯基”); INSAIT; State Key Laboratory of Autonomous Intelligent Unmanned Systems (自主智能无人系统国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: The source code will be publicly released at this https URL
Abstract:Semantic occupancy prediction enables dense 3D geometric and semantic understanding for autonomous driving. However, existing camera-based approaches implicitly assume complete surround-view observations, an assumption that rarely holds in real-world deployment due to occlusion, hardware malfunction, or communication failures. We study semantic occupancy prediction under incomplete multi-camera inputs and introduce M²-Occ, a framework designed to preserve geometric structure and semantic coherence when views are missing. M²-Occ addresses two complementary challenges. First, a Multi-view Masked Reconstruction (MMR) module leverages the spatial overlap among neighboring cameras to recover missing-view representations directly in the feature space. Second, a Feature Memory Module (FMM) introduces a learnable memory bank that stores class-level semantic prototypes. By retrieving and integrating these global priors, the FMM refines ambiguous voxel features, ensuring semantic consistency even when observational evidence is incomplete. We introduce a systematic missing-view evaluation protocol on the nuScenes-based SurroundOcc benchmark, encompassing both deterministic single-view failures and stochastic multi-view dropout scenarios. Under the safety-critical missing back-view setting, M²-Occ improves the IoU by 4.93%. As the number of missing cameras increases, the robustness gap further widens; for instance, under the setting with five missing views, our method boosts the IoU by 5.01%. These gains are achieved without compromising full-view performance. The source code will be publicly released at this https URL.
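特征记忆模块(FMM)"从记忆库检索类别原型并融合入模糊体素特征"的机制,可用一个极简草图表示(假设性示例:用余弦相似度检索、线性插值融合,均为本文的简化,并非论文实现):

```python
import math

def cosine(a, b):
    """两个特征向量的余弦相似度。"""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def refine_voxel(feature, prototypes, mix=0.5):
    """从类别原型记忆库中检索最相近的原型,
    并按比例 mix 融合进模糊的体素特征(示意实现)。"""
    best = max(prototypes, key=lambda p: cosine(feature, p))
    return [(1 - mix) * f + mix * p for f, p in zip(feature, best)]

bank = [[1.0, 0.0], [0.0, 1.0]]            # 两个类别的语义原型
voxel = [0.9, 0.1]                          # 模糊特征,更接近类别 0
print(refine_voxel(voxel, bank, mix=0.5))   # 被拉向原型 [1.0, 0.0]
```

当观测证据缺失、体素特征含糊时,这种"以全局先验拉正局部特征"的检索式修正即是 FMM 发挥作用的方式。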
[CV-29] FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation
【速读】:该论文旨在解决扩散模型在高保真视频生成中建模复杂时空动态的难题,尤其是现有方法在使用全3D注意力(Full 3D Attention)与局部因子化注意力(Local Factorized Attention)之间面临的效率与时空建模能力之间的权衡问题。其解决方案的关键在于提出一种帧级时序注意力机制——Matrix Attention,该机制将整帧视为矩阵并利用矩阵原生操作生成查询、键和值矩阵,从而通过跨帧而非跨token的注意力机制有效保留全局时空结构,并适应显著运动。基于此机制构建的FrameDiT-H进一步融合了局部因子化注意力,以同时捕捉大尺度和小尺度运动,在多个视频生成基准上实现了最优的时序连贯性和视频质量,且计算效率接近局部因子化注意力。
链接: https://arxiv.org/abs/2603.09721
作者: Minh Khoa Le,Kien Do,Duc Thanh Nguyen,Truyen Tran
机构: Deakin University (迪肯大学); FPT Smart Cloud (FPT智能云)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:High-fidelity video generation remains challenging for diffusion models due to the difficulty of modeling complex spatio-temporal dynamics efficiently. Recent video diffusion methods typically represent a video as a sequence of spatio-temporal tokens which can be modeled using Diffusion Transformers (DiTs). However, this approach faces a trade-off between the strong but expensive Full 3D Attention and the efficient but temporally limited Local Factorized Attention. To resolve this trade-off, we propose Matrix Attention, a frame-level temporal attention mechanism that processes an entire frame as a matrix and generates query, key, and value matrices via matrix-native operations. By attending across frames rather than tokens, Matrix Attention effectively preserves global spatio-temporal structure and adapts to significant motion. We build FrameDiT-G, a DiT architecture based on Matrix Attention, and further introduce FrameDiT-H, which integrates Matrix Attention with Local Factorized Attention to capture both large and small motion. Extensive experiments show that FrameDiT-H achieves state-of-the-art results across multiple video generation benchmarks, offering improved temporal coherence and video quality while maintaining efficiency comparable to Local Factorized Attention.
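"把整帧当作矩阵、在帧间而非 token 间做注意力"的 Matrix Attention 思想,可以用如下玩具代码示意(本文的简化假设:帧相似度用 Frobenius 内积、Q/K/V 投影取恒等,与论文的矩阵原生投影不同):

```python
import math

def matrix_attention(frames):
    """帧级注意力草图:每帧是一个小矩阵,帧间相似度取
    Frobenius 内积,softmax 在帧维度上进行(而非 token 维度),
    每帧输出是所有 value 矩阵的加权和。"""
    def frob(a, b):
        # Frobenius 内积:逐元素相乘求和
        return sum(a[i][j] * b[i][j]
                   for i in range(len(a)) for j in range(len(a[0])))
    outputs = []
    for q in frames:
        scores = [frob(q, k) for k in frames]
        m = max(scores)                       # 数值稳定的 softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        w = [x / z for x in w]
        rows, cols = len(q), len(q[0])
        out = [[sum(w[t] * frames[t][i][j] for t in range(len(frames)))
                for j in range(cols)] for i in range(rows)]
        outputs.append(out)
    return outputs

f0 = [[1.0, 0.0], [0.0, 1.0]]
f1 = [[1.0, 0.0], [0.0, 1.0]]
out = matrix_attention([f0, f1])
# 两帧完全相同 -> 权重均匀 -> 输出等于输入帧
```

与逐 token 的 Full 3D Attention 相比,这种按帧聚合把时间维度的注意力代价从 token 数的平方降到帧数的平方,这正是摘要中效率主张的来源。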
[CV-30] GSStream: 3D Gaussian Splatting based Volumetric Scene Streaming System
【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)技术在实时辐射场渲染中因数据量庞大而导致的带宽密集问题,尤其针对其在实时分发场景下的压缩与传输效率挑战。解决方案的关键在于提出一种名为GSStream的新颖体积场景流媒体系统,其核心创新包括:1)集成协同视口预测模块,通过学习多用户历史视口序列中的协同先验与个体先验来更精准预测用户未来行为;2)设计基于深度强化学习(Deep Reinforcement Learning, DRL)的码率自适应模块,有效应对码率调整问题中状态空间与动作空间的动态变化挑战,从而实现高效、低延迟的体积场景交付。
链接: https://arxiv.org/abs/2603.09718
作者: Zhiye Tang,Qiudan Zhang,Lei Zhang,Junhui Hou,You Yang,Xu Wang
机构: Shenzhen University (深圳大学); City University of Hong Kong (香港城市大学); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recently, the 3D Gaussian splatting (3DGS) technique for real-time radiance field rendering has revolutionized the field of volumetric scene representation, providing users with an immersive experience. But in return, it also poses a large amount of data volume, which is extremely bandwidth-intensive. Cutting-edge researchers have tried to introduce different approaches and construct multiple variants for 3DGS to obtain a more compact scene representation, but it is still challenging for real-time distribution. In this paper, we propose GSStream, a novel volumetric scene streaming system to support 3DGS data format. Specifically, GSStream integrates a collaborative viewport prediction module to better predict users’ future behaviors by learning collaborative priors and historical priors from multiple users and users’ viewport sequences and a deep reinforcement learning (DRL)-based bitrate adaptation module to tackle the state and action space variability challenge of the bitrate adaptation problem, achieving efficient volumetric scene delivery. Besides, we first build a user viewport trajectory dataset for volumetric scenes to support the training and streaming simulation. Extensive experiments prove that our proposed GSStream system outperforms existing representative volumetric scene streaming systems in visual quality and network usage. Demo video: this https URL.
[CV-31] ProGS: Towards Progressive Coding for 3D Gaussian Splatting
【速读】:该论文旨在解决3D Gaussian Splatting (3DGS) 数据在存储与传输中因体量庞大而带来的挑战,尤其针对现有方法无法支持渐进式编码(progressive coding)的问题,这限制了其在带宽波动的流媒体场景中的应用。解决方案的关键在于提出一种名为 ProGS 的新型编码架构,其核心是将 3DGS 数据组织为八叉树(octree)结构,并引入互信息增强机制以减少节点间的结构冗余,同时通过动态调整锚点节点自适应优化压缩效率。该方法实现了高达 45 倍的文件体积压缩比,且视觉质量提升超过 10%,显著提升了 3DGS 在实时应用场景下的可扩展性和鲁棒性。
链接: https://arxiv.org/abs/2603.09703
作者: Zhiye Tang,Lingzhuo Liu,Shengjie Jiao,Qiudan Zhang,Junhui Hou,You Yang,Xu Wang
机构: Shenzhen University (深圳大学); City University of Hong Kong (香港城市大学); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the emergence of 3D Gaussian Splatting (3DGS), numerous pioneering efforts have been made to address the effective compression issue of massive 3DGS data. 3DGS offers an efficient and scalable representation of 3D scenes by utilizing learnable 3D Gaussians, but the large size of the generated data has posed significant challenges for storage and transmission. Existing methods, however, have been limited by their inability to support progressive coding, a crucial feature in streaming applications with varying bandwidth. To tackle this limitation, this paper introduce a novel approach that organizes 3DGS data into an octree structure, enabling efficient progressive coding. The proposed ProGS is a streaming-friendly codec that facilitates progressive coding for 3D Gaussian splatting, and significantly improves both compression efficiency and visual fidelity. The proposed method incorporates mutual information enhancement mechanisms to mitigate structural redundancy, leveraging the relevance between nodes in the octree hierarchy. By adapting the octree structure and dynamically adjusting the anchor nodes, ProGS ensures scalable data compression without compromising the rendering quality. ProGS achieves a remarkable 45X reduction in file storage compared to the original 3DGS format, while simultaneously improving visual performance by over 10%. This demonstrates that ProGS can provide a robust solution for real-time applications with varying network conditions.
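ProGS 将 3DGS 数据组织为八叉树以支持渐进编码。下面的草图演示把 [0,1)³ 中的一个点逐层分配到 8 个子节点的基本过程(仅为八叉树索引的通用示意,与论文的具体编码与锚点策略无关):

```python
def octant_code(point, depth):
    """计算 [0,1)^3 中一个三维点在八叉树中到给定深度的子节点路径:
    每一层按点落在当前单元哪一半,选出 8 个子节点之一(0~7)。"""
    x, y, z = point
    path = []
    for _ in range(depth):
        code = 0
        x *= 2; y *= 2; z *= 2
        if x >= 1: code |= 1; x -= 1
        if y >= 1: code |= 2; y -= 1
        if z >= 1: code |= 4; z -= 1
        path.append(code)
    return path

# 靠近 (1,1,1) 角落的点在每一层都落入第 7 号子节点
print(octant_code((0.9, 0.9, 0.9), 2))  # [7, 7]
```

渐进编码的直觉即在于此:浅层节点先传、先渲染出粗糙场景,带宽允许时再逐层细化,这也是 ProGS 面向流媒体的关键性质。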
[CV-32] TriFusion-SR: Joint Tri-Modal Medical Image Fusion and SR
【速读】:该论文旨在解决多模态医学图像融合中因分辨率退化和模态差异导致的性能瓶颈问题,尤其是在结合解剖结构模态(如MRI、CT)与功能成像模态(如PET、SPECT)的三模态场景下,由于频域不平衡所引发的伪影和感知质量下降问题。其解决方案的关键在于提出了一种基于小波引导的条件扩散框架TriFusionSR,通过二维离散小波变换(2D Discrete Wavelet Transform, DWT)显式地将多模态特征分解至不同频率带,实现频域感知的跨模态交互;进一步引入校准潜空间系数的修正小波特征(Rectified Wavelet Features, RWF)策略,并设计带有门控通道-空间注意力机制的自适应时空频融合(Adaptive Spatial-Frequency Fusion, ASFF)模块,从而实现结构驱动的多模态精细化重构,显著提升了融合图像的质量与保真度。
链接: https://arxiv.org/abs/2603.09702
作者: Fayaz Ali Dharejo,Sharif S. M. A.,Aiman Khalil,Nachiket Chaudhary,Rizwan Ali Naqvi,Radu Timofte
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal medical image fusion facilitates comprehensive diagnosis by aggregating complementary structural and functional information, but its effectiveness is limited by resolution degradation and modality discrepancies. Existing approaches typically perform image fusion and super-resolution (SR) in separate stages, leading to artifacts and degraded perceptual quality. These limitations are further amplified in tri-modal settings that combine anatomical modalities (e.g., MRI, CT) with functional scans (e.g., PET, SPECT) due to pronounced frequency domain imbalances. We propose TriFusionSR, a wavelet-guided conditional diffusion framework for joint tri-modal fusion and SR. The framework explicitly decomposes multimodal features into frequency bands using the 2D Discrete Wavelet Transform, enabling frequency-aware crossmodal interaction. We further introduce a Rectified Wavelet Features (RWF) strategy for latent coefficient calibration, followed by an Adaptive Spatial-Frequency Fusion (ASFF) module with gated channel-spatial attention to enable structure-driven multimodal refinement. Extensive experiments demonstrate state-of-the-art performance, achieving 4.8-12.4% PSNR improvement and substantial reductions in RMSE and LPIPS across multiple upsampling scales.
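TriFusionSR 依赖二维离散小波变换(DWT)把多模态特征显式分解到不同频带。单层 Haar DWT 的分解过程可草拟如下(示意性实现,采用未归一化的平均/差分形式,与论文所用的具体小波实现无关):

```python
def haar_dwt2(img):
    """对偶数尺寸的灰度图做单层 2D Haar DWT,
    分解为低频 LL 与三个高频子带 LH、HL、HH。"""
    h, w = len(img), len(img[0])
    LL = [[0.0] * (w // 2) for _ in range(h // 2)]
    LH = [[0.0] * (w // 2) for _ in range(h // 2)]
    HL = [[0.0] * (w // 2) for _ in range(h // 2)]
    HH = [[0.0] * (w // 2) for _ in range(h // 2)]
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            LL[i // 2][j // 2] = (a + b + c + d) / 4  # 局部平均(低频)
            LH[i // 2][j // 2] = (a - b + c - d) / 4  # 水平方向细节
            HL[i // 2][j // 2] = (a + b - c - d) / 4  # 垂直方向细节
            HH[i // 2][j // 2] = (a - b - c + d) / 4  # 对角方向细节
    return LL, LH, HL, HH

flat = [[5.0] * 4 for _ in range(4)]
LL, LH, HL, HH = haar_dwt2(flat)
# 常数图像的能量全部落在 LL,三个细节子带为零
```

摘要中"解剖模态(MRI/CT)与功能模态(PET/SPECT)的频域不平衡"正是指两类图像在这些子带上的能量分布差异,频带分解后方能做频率感知的跨模态交互。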
[CV-33] TemporalDoRA: Temporal PEFT for Robust Surgical Video Question Answering
【速读】:该论文旨在解决外科手术视频问答(Surgical Video Question Answering, VideoQA)中因临床提问表述多样而引发的语言偏差问题,同时提升模型在时间上的精准定位能力。传统参数高效微调(Parameter Efficient Fine Tuning, PEFT)方法未显式建模帧间交互关系,难以有效利用稀疏的时间证据。其解决方案的关键在于提出TemporalDoRA,一种面向视频的PEFT方法:通过在视觉编码器的低秩瓶颈中插入轻量级时间多头注意力(Temporal Multi-Head Attention, MHA),并在可训练的低秩分支上选择性应用权重分解,而非对完整适配权重进行分解。该设计实现了时间感知的参数更新,在保持主干网络冻结和稳定扩展性的同时,通过在低秩子空间内混合跨帧信息,引导更新聚焦于时序一致的视觉线索,从而显著提升对语言变化的鲁棒性。
链接: https://arxiv.org/abs/2603.09696
作者: Luca Carlini,Chiara Lena,Cesare Hassan,Danail Stoyanov,Elena De Momi,Sophia Bano,Mobarak I. Hoque
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Surgical Video Question Answering (VideoQA) requires accurate temporal grounding while remaining robust to natural variation in how clinicians phrase questions, where linguistic bias can arise. Standard Parameter Efficient Fine Tuning (PEFT) methods adapt pretrained projections without explicitly modeling frame-to-frame interactions within the adaptation pathway, limiting their ability to exploit sparse temporal evidence. We introduce TemporalDoRA, a video-specific PEFT formulation that extends Weight-Decomposed Low-Rank Adaptation by (i) inserting lightweight temporal Multi-Head Attention (MHA) inside the low-rank bottleneck of the vision encoder and (ii) selectively applying weight decomposition only to the trainable low-rank branch rather than the full adapted weight. This design enables temporally-aware updates while preserving a frozen backbone and stable scaling. By mixing information across frames within the adaptation subspace, TemporalDoRA steers updates toward temporally consistent visual cues and improves robustness with minimal parameter overhead. To benchmark this setting, we present REAL-Colon-VQA, a colonoscopy VideoQA dataset with 6,424 clip–question pairs, including paired rephrased Out-of-Template questions to evaluate sensitivity to linguistic variation. TemporalDoRA improves Out-of-Template performance, and ablation studies confirm that temporal mixing inside the low-rank branch is the primary driver of these gains. We also validate on EndoVis18-VQA adapted to short clips and observe consistent improvements on the Out-of-Template split. Code and dataset are available at this https URL (Anonymous GitHub).
[CV-34] DRIFT: Dual-Representation Inter-Fusion Transformer for Automated Driving Perception with 4D Radar Point Clouds
【速读】:该论文旨在解决4D雷达(4D radar)在自动驾驶系统中因点云密度远低于激光雷达(LiDAR)而导致的感知性能受限问题,尤其强调如何有效利用局部与全局场景上下文信息以提升目标检测和自由道路估计的准确性。其解决方案的关键在于提出了一种双路径架构模型DRIFT,该模型通过并行的点路径(point path)提取细粒度局部特征与支柱路径(pillar path)编码粗粒度全局特征,并在多个阶段通过新颖的特征共享层实现两者的深度融合,从而充分挖掘和融合多尺度上下文信息,显著提升了4D雷达的感知性能。
链接: https://arxiv.org/abs/2603.09695
作者: Siqi Pei,Andras Palffy,Dariu M. Gavrila
机构: Delft University of Technology (代尔夫特理工大学); Perciv AI (Perciv AI)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:4D radars, which provide 3D point cloud data along with Doppler velocity, are attractive components of modern automated driving systems due to their low cost and robustness under adverse weather conditions. However, they provide a significantly lower point cloud density than LiDAR sensors. This makes it important to exploit not only local but also global contextual scene information. This paper proposes DRIFT, a model that effectively captures and fuses both local and global contexts through a dual-path architecture. The model incorporates a point path to aggregate fine-grained local features and a pillar path to encode coarse-grained global features. These two parallel paths are intertwined via novel feature-sharing layers at multiple stages, enabling full utilization of both representations. DRIFT is evaluated on the widely used View-of-Delft (VoD) dataset and a proprietary internal dataset. It outperforms the baselines on the tasks of object detection and/or free road estimation. For example, DRIFT achieves a mean average precision (mAP) of 52.6% (compared to, say, 45.4% of CenterPoint) on the VoD dataset.
[CV-35] AutoViVQA: A Large-Scale Automatically Constructed Dataset for Vietnamese Visual Question Answering
【速读】:该论文旨在解决越南语视觉问答(Vietnamese Visual Question Answering, ViVQA)任务中模型对语言偏倚依赖较强、视觉 grounding 不足以及自动评估指标与人类判断一致性较差的问题。其解决方案的关键在于:利用基于 Transformer 的架构,融合预训练的文本模型(如 PhoBERT)和视觉模型(如 Vision Transformer, ViT),实现多模态特征的有效整合,并系统性地比较多种自动评估指标在多语言环境下的表现,从而提升模型在越南语场景下的理解能力与评估准确性。
链接: https://arxiv.org/abs/2603.09689
作者: Nguyen Anh Tuong,Phan Ba Duc,Nguyen Trung Quoc,Tran Dac Thinh,Dang Duy Lan,Nguyen Quoc Thinh,Tung Le
机构: University of Science, VNU-HCM (胡志明市国家大学科学大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Visual Question Answering (VQA) is a fundamental multimodal task that requires models to jointly understand visual and textual information. Early VQA systems relied heavily on language biases, motivating subsequent work to emphasize visual grounding and balanced datasets. With the success of large-scale pre-trained transformers for both text and vision domains – such as PhoBERT for Vietnamese language understanding and Vision Transformers (ViT) for image representation learning – multimodal fusion has achieved remarkable progress. For Vietnamese VQA, several datasets have been introduced to promote research in low-resource multimodal learning, including ViVQA, OpenViVQA, and the recently proposed ViTextVQA. These resources enable benchmarking of models that integrate linguistic and visual features in the Vietnamese context. Evaluation of VQA systems often employs automatic metrics originally designed for image captioning or machine translation, such as BLEU, METEOR, CIDEr, Recall, Precision, and F1-score. However, recent research suggests that large language models can further improve the alignment between automatic evaluation and human judgment in VQA tasks. In this work, we explore Vietnamese Visual Question Answering using transformer-based architectures, leveraging both textual and visual pre-training while systematically comparing automatic evaluation metrics under multilingual settings.
[CV-36] Improving 3D Foot Motion Reconstruction in Markerless Monocular Human Motion Capture
【速读】:该论文旨在解决现有视频驱动的人体三维运动恢复方法在足部细微关节运动(如脚踝)捕捉上的不足问题,这一缺陷限制了其在步态分析和动画等应用场景中的精度。解决方案的关键在于提出FootMR(Foot Motion Refinement)方法,通过将2D足部关键点序列提升为3D空间中的运动轨迹来优化已有模型输出的足部动作,同时避免直接依赖图像输入以规避标注不准确的问题;该方法利用大规模动作捕捉数据,并引入膝关节与足部运动作为上下文信息,仅预测足部残差运动,从而提升重建稳定性;此外,通过全局旋转表示和增强的数据增强策略,显著改善了极端足部姿态下的泛化能力。
链接: https://arxiv.org/abs/2603.09681
作者: Tom Wehrbein,Bodo Rosenhahn
机构: L3S - Leibniz University Hannover, Germany (L3S - 汉诺威大学莱布尼茨研究所,德国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the 2026 International Conference on 3D Vision (3DV)
Abstract:State-of-the-art methods can recover accurate overall 3D human body motion from in-the-wild videos. However, they often fail to capture fine-grained articulations, especially in the feet, which are critical for applications such as gait analysis and animation. This limitation results from training datasets with inaccurate foot annotations and limited foot motion diversity. We address this gap with FootMR, a Foot Motion Refinement method that refines foot motion estimated by an existing human recovery model through lifting 2D foot keypoint sequences to 3D. By avoiding direct image input, FootMR circumvents inaccurate image-3D annotation pairs and can instead leverage large-scale motion capture data. To resolve ambiguities of 2D-to-3D lifting, FootMR incorporates knee and foot motion as context and predicts only residual foot motion. Generalization to extreme foot poses is further improved by representing joints in global rather than parent-relative rotations and applying extensive data augmentation. To support evaluation of foot motion reconstruction, we introduce MOOF, a 2D dataset of complex foot movements. Experiments on MOOF, MOYO, and RICH show that FootMR outperforms state-of-the-art methods, reducing ankle joint angle error on MOYO by up to 30% over the best video-based approach.
[CV-37] VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM CVPR2026
【速读】:该论文旨在解决现有基于3D高斯溅射(3D Gaussian Splatting, 3DGS)的同步定位与建图(SLAM)系统在低纹理区域、透明表面或复杂反射特性场景中因测量可靠性处理不明确而导致位姿估计漂移的问题。解决方案的关键在于提出VarSplat,一个不确定性感知的3DGS-SLAM系统,其核心创新是显式学习每个高斯溅射点(splat)的外观方差,并通过全方差定律(law of total variance)结合alpha合成,实现高效单次遍历光栅化生成可微分的像素级不确定性图。该不确定性图引导跟踪、子地图注册和回环检测聚焦于可靠区域,从而提升优化稳定性与重建精度。
链接: https://arxiv.org/abs/2603.09673
作者: Anh Thuan Tran,Jana Kosecka
机构: George Mason University (乔治梅森大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Simultaneous Localization and Mapping (SLAM) with 3D Gaussian Splatting (3DGS) enables fast, differentiable rendering and high-fidelity reconstruction across diverse real-world scenes. However, existing 3DGS-SLAM approaches handle measurement reliability implicitly, making pose estimation and global alignment susceptible to drift in low-texture regions, transparent surfaces, or areas with complex reflectance properties. To this end, we introduce VarSplat, an uncertainty-aware 3DGS-SLAM system that explicitly learns per-splat appearance variance. By using the law of total variance with alpha compositing, we then render differentiable per-pixel uncertainty map via efficient, single-pass rasterization. This map guides tracking, submap registration, and loop detection toward focusing on reliable regions and contributes to more stable optimization. Experimental results on Replica (synthetic) and TUM-RGBD, ScanNet, and ScanNet++ (real-world) show that VarSplat improves robustness and achieves competitive or superior tracking, mapping, and novel view synthesis rendering compared to existing studies for dense RGB-D SLAM.
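VarSplat 用全方差定律(law of total variance)结合 alpha 合成渲染逐像素不确定性。其数学骨架可以用如下草图复现(假设性示例:为使混合分布解释严格成立,此处对合成权重做了归一化,并非论文的单遍光栅化实现):

```python
def composite_uncertainty(alphas, means, variances):
    """按全方差定律计算一个像素的不确定性:
    第 i 个 splat 的合成权重为 a_i * prod_{j<i}(1 - a_j),
    像素方差 = E[组内方差] + 组间均值的方差。"""
    weights, trans = [], 1.0
    for a in alphas:
        weights.append(a * trans)   # 前景透射率逐个衰减
        trans *= 1.0 - a
    total = sum(weights)
    weights = [w / total for w in weights]  # 归一化,使其构成混合分布
    mean = sum(w * m for w, m in zip(weights, means))
    # E[Var] + Var[E]:splat 内部方差 + splat 之间均值的离散度
    second = sum(w * (v + m * m)
                 for w, m, v in zip(weights, means, variances))
    return second - mean * mean

# 单个完全不透明的 splat -> 像素方差就是该 splat 自身的方差
u = composite_uncertainty([1.0], [0.3], [0.04])
```

由该量渲染出的逐像素不确定性图,即摘要中用于引导跟踪、子地图注册与回环检测"聚焦可靠区域"的信号。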
[CV-38] DiffWind: Physics-Informed Differentiable Modeling of Wind-Driven Object Dynamics ICLR2026
【速读】:该论文旨在解决从视频观测中建模风驱动物体动力学的问题,其核心挑战在于风的不可见性与时空变化性,以及物体复杂的形变行为。解决方案的关键在于提出DiffWind框架,该框架是一个物理信息感知的可微分方法,统一了风-物相互作用建模、基于视频的重建与前向仿真:通过将风表示为网格化的物理场,物体则基于3D Gaussian Splatting构建粒子系统,并利用材料点法(Material Point Method, MPM)建模二者交互;同时引入可微渲染与仿真联合优化策略以恢复时空风力场和物体运动,并结合格子玻尔兹曼方法(Lattice Boltzmann Method, LBM)作为物理约束确保流体动力学规律的遵守,从而实现高精度重建与真实感前向模拟。
链接: https://arxiv.org/abs/2603.09668
作者: Yuanhang Lei,Boming Zhao,Zesong Yang,Xingxuan Li,Tao Cheng,Haocheng Peng,Ru Zhang,Yang Yang,Siyuan Huang,Yujun Shen,Ruizhen Hu,Hujun Bao,Zhaopeng Cui
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR 2026. Project page: this https URL
Abstract:Modeling wind-driven object dynamics from video observations is highly challenging due to the invisibility and spatio-temporal variability of wind, as well as the complex deformations of objects. We present DiffWind, a physics-informed differentiable framework that unifies wind-object interaction modeling, video-based reconstruction, and forward simulation. Specifically, we represent wind as a grid-based physical field and objects as particle systems derived from 3D Gaussian Splatting, with their interaction modeled by the Material Point Method (MPM). To recover wind-driven object dynamics, we introduce a reconstruction framework that jointly optimizes the spatio-temporal wind force field and object motion through differentiable rendering and simulation. To ensure physical validity, we incorporate the Lattice Boltzmann Method (LBM) as a physics-informed constraint, enforcing compliance with fluid dynamics laws. Beyond reconstruction, our method naturally supports forward simulation under novel wind conditions and enables new applications such as wind retargeting. We further introduce WD-Objects, a dataset of synthetic and real-world wind-driven scenes. Extensive experiments demonstrate that our method significantly outperforms prior dynamic scene modeling approaches in both reconstruction accuracy and simulation fidelity, opening a new avenue for video-based wind-object interaction modeling.
[CV-39] When to Lock Attention: Training-Free KV Control in Video Diffusion
【速读】:该论文旨在解决视频编辑中保持背景一致性的同时提升前景质量的核心挑战:传统方法在注入全图信息时易引入背景伪影,而严格锁定背景则限制了模型对前景的生成能力。解决方案的关键在于提出一种无需训练的KV-Lock框架,其核心思想是利用去噪预测方差(即扩散幻觉指标)来量化生成多样性,并发现该指标与无分类器引导(Classifier-Free Guidance, CFG)尺度存在内在关联;基于此,KV-Lock动态调度两个关键组件——缓存背景键值(Key-Value, KV)与新生成KV的融合比例,以及CFG尺度:当检测到幻觉风险时,增强背景KV锁定并同步放大条件引导强度,从而有效抑制伪影、提升前景生成保真度。该模块可无缝集成至任意预训练DiT(Diffusion Transformer)视频扩散模型中,实现高背景保真度下的高质量前景编辑。
链接: https://arxiv.org/abs/2603.09657
作者: Tianyi Zeng,Jincheng Gao,Tianyi Wang,Zijie Meng,Miao Zhang,Jun Yin,Haoyuan Sun,Junfeng Jiao,Christian Claudel,Junbo Tan,Xueqian Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Image and Video Processing (eess.IV)
备注: 18 pages, 9 figures, 3 tables
Abstract:Maintaining background consistency while enhancing foreground quality remains a core challenge in video editing. Injecting full-image information often leads to background artifacts, whereas rigid background locking severely constrains the model’s capacity for foreground generation. To address this issue, we propose KV-Lock, a training-free framework tailored for DiT-based video diffusion models. Our core insight is that the hallucination metric (variance of denoising prediction) directly quantifies generation diversity, which is inherently linked to the classifier-free guidance (CFG) scale. Building upon this, KV-Lock leverages diffusion hallucination detection to dynamically schedule two key components: the fusion ratio between cached background key-values (KVs) and newly generated KVs, and the CFG scale. When hallucination risk is detected, KV-Lock strengthens background KV locking and simultaneously amplifies conditional guidance for foreground generation, thereby mitigating artifacts and improving generation fidelity. As a training-free, plug-and-play module, KV-Lock can be easily integrated into any pre-trained DiT-based models. Extensive experiments validate that our method outperforms existing approaches in improved foreground quality with high background fidelity across various video editing tasks.
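KV-Lock 的核心是以去噪预测方差作为幻觉指标,动态调度背景 KV 融合比例与 CFG 尺度。下面的片段给出一个最小示意:方差阈值区间、线性调度方式与具体 CFG 取值均为假设,仅说明"风险越高、锁定越强、引导越大"的调度逻辑。

```python
import numpy as np

def kv_lock_schedule(pred_samples, var_lo=0.01, var_hi=0.1,
                     cfg_base=5.0, cfg_max=9.0):
    """按去噪预测方差(幻觉指标)调度背景 KV 融合比例与 CFG 尺度。
    pred_samples: [N, ...],同一时间步的 N 次去噪预测;阈值均为示意假设。"""
    h = float(np.var(pred_samples, axis=0).mean())      # 幻觉指标
    t = float(np.clip((h - var_lo) / (var_hi - var_lo), 0.0, 1.0))
    fusion = t                                          # 风险越高,背景 KV 锁定越强
    cfg = cfg_base + t * (cfg_max - cfg_base)           # 同步放大条件引导
    return fusion, cfg

def fuse_kv(cached_kv, new_kv, fusion):
    """缓存背景 KV 与新生成 KV 的线性融合。"""
    return fusion * cached_kv + (1.0 - fusion) * new_kv
```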
[CV-40] OTPL-VIO: Robust Visual-Inertial Odometry with Optimal Transport Line Association and Adaptive Uncertainty
【速读】:该论文旨在解决低纹理场景和突变光照条件下立体视觉惯性里程计(Stereo Visual-Inertial Odometry, VIO)的鲁棒性问题,这些问题会导致点特征稀疏且不稳定,从而引发误匹配和约束不足。其核心解决方案是引入一种基于深度学习的线段描述符与熵正则化最优传输匹配方法,使线段在存在模糊性、异常值和部分观测时仍能实现全局一致的对应关系;同时,通过分析线测量噪声并设计可靠性自适应权重机制,在优化过程中动态调节线约束的影响,提升估计稳定性。该方法无需训练即可计算描述符,兼顾精度与实时性能,在EuRoC、UMA-VI数据集及真实低纹理/光照挑战环境中的实验验证了其优越性。
链接: https://arxiv.org/abs/2603.09653
作者: Zikun Chen,Wentao Zhao,Yihe Niu,Tianchen Deng,Jingchuan Wang
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Robust stereo visual-inertial odometry (VIO) remains challenging in low-texture scenes and under abrupt illumination changes, where point features become sparse and unstable, leading to ambiguous association and under-constrained estimation. Line structures offer complementary geometric cues, yet many efficient point-line systems still rely on point-guided line association, which can break down when point support is weak and may lead to biased constraints. We present a stereo point-line VIO system in which line segments are equipped with dedicated deep descriptors and matched using an entropy-regularized optimal transport formulation, enabling globally consistent correspondences under ambiguity, outliers, and partial observations. The proposed descriptor is training-free and is computed by sampling and pooling network feature maps. To improve estimation stability, we analyze the impact of line measurement noise and introduce reliability-adaptive weighting to regulate the influence of line constraints during optimization. Experiments on EuRoC and UMA-VI, together with real-world deployments in low-texture and illumination-challenging environments, demonstrate improved accuracy and robustness over representative baselines while maintaining real-time performance.
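该文用熵正则最优传输进行线段描述子匹配。下面是标准 Sinkhorn 迭代的一个最小实现示意(均匀边际、余弦距离代价均为本文假设,与论文的具体构造无关):

```python
import numpy as np

def sinkhorn_match(cost, eps=0.1, iters=200):
    """熵正则最优传输(Sinkhorn 迭代),返回近似匹配的传输计划(均匀边际)。"""
    m, n = cost.shape
    K = np.exp(-cost / eps)
    a, b = np.full(m, 1.0 / m), np.full(n, 1.0 / n)
    u, v = np.ones(m), np.ones(n)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# 示意:3 条线段描述子间的余弦距离代价(单位向量,真实对应为 0<->1, 1<->0, 2<->2)
d1 = np.eye(3)
d2 = np.eye(3)[[1, 0, 2]]
P = sinkhorn_match(1.0 - d1 @ d2.T)
match = P.argmax(axis=1)          # 每条线段的全局一致对应
```

相比逐条贪心匹配,传输计划同时满足两侧边际约束,这正是"全局一致对应"的来源。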
[CV-41] Grounding Synthetic Data Generation With Vision and Language Models
【速读】:该论文旨在解决当前合成数据(synthetic data)评估指标缺乏可解释性且与下游任务性能关联性弱的问题,尤其在遥感(remote sensing)领域中,传统基于潜在特征相似性的评价方法难以准确反映合成数据对模型训练的实际贡献。其解决方案的关键在于提出一种视觉-语言协同的框架,通过结合生成模型、语义分割(semantic segmentation)和图像描述生成(image captioning)技术,并利用视觉语言模型(vision-language models)实现跨模态一致性验证,从而构建了一个可自动评估合成数据质量的新范式。该框架支持对合成数据的语义组成分析、冗余度控制及视觉-语言一致性检验,最终在ARAS400k大规模遥感数据集上验证了融合真实与合成数据训练的模型优于纯真实数据基线,为遥感任务中的合成数据增强提供了可扩展且可解释的基准。
链接: https://arxiv.org/abs/2603.09625
作者: Ümit Mert Çağlar,Alptekin Temizel
机构: Middle East Technical University (中东技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep learning models benefit from increasing data diversity and volume, motivating synthetic data augmentation to improve existing datasets. However, existing evaluation metrics for synthetic data typically calculate latent feature similarity, which is difficult to interpret and does not always correlate with the contribution to downstream tasks. We propose a vision-language grounded framework for interpretable synthetic data augmentation and evaluation in remote sensing. Our approach combines generative models, semantic segmentation and image captioning with vision and language models. Based on this framework, we introduce ARAS400k: A large-scale Remote sensing dataset Augmented with Synthetic data for segmentation and captioning, containing 100k real images and 300k synthetic images, each paired with segmentation maps and descriptions. ARAS400k enables the automated evaluation of synthetic data by analyzing semantic composition, minimizing caption redundancy, and verifying cross-modal consistency between visual structures and language descriptions. Experimental results indicate that while models trained exclusively on synthetic data reach competitive performance levels, those trained with augmented data (a combination of real and synthetic images) consistently outperform real-data baselines. Consequently, this work establishes a scalable benchmark for remote sensing tasks, specifically in semantic segmentation and image captioning. The dataset is available at this http URL and the code base at this http URL.
[CV-42] Decoder-Free Distillation for Quantized Image Restoration
【速读】:该论文旨在解决生成式 AI (Generative AI) 在边缘部署场景下,针对低层次视觉任务(如图像恢复, Image Restoration, IR)中量化感知训练(Quantization-Aware Training, QAT)与知识蒸馏(Knowledge Distillation, KD)联合优化时存在的三大瓶颈问题:教师-学生模型容量不匹配、解码器蒸馏过程中的空间误差放大以及重建损失与蒸馏损失因量化噪声引发的优化冲突。解决方案的关键在于提出一种名为 Quantization-aware Distilled Restoration (QDR) 的新框架:通过 FP32 自蒸馏消除容量差异,利用 Decoder-Free Distillation (DFD) 在网络瓶颈处纠正量化误差以抑制误差传播,并引入可学习幅度重加权(Learnable Magnitude Reweighting, LMR)动态平衡梯度冲突;此外,设计轻量级边缘友好模型(Edge-Friendly Model, EFM)并嵌入可学习退化门控机制(Learnable Degradation Gating, LDG),实现空间退化区域的动态调制,从而在保持高精度的同时显著提升推理效率和下游任务性能。
链接: https://arxiv.org/abs/2603.09624
作者: S. M. A. Sharif,Abdur Rehman,Seongwan Kim,Jaeho Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Quantization-Aware Training (QAT), combined with Knowledge Distillation (KD), holds immense promise for compressing models for edge deployment. However, joint optimization for precision-sensitive image restoration (IR) to recover visual quality from degraded images remains largely underexplored. Directly adapting QAT-KD to low-level vision reveals three critical bottlenecks: teacher-student capacity mismatch, spatial error amplification during decoder distillation, and an optimization “tug-of-war” between reconstruction and distillation losses caused by quantization noise. To tackle these, we introduce Quantization-aware Distilled Restoration (QDR), a framework for edge-deployed IR. QDR eliminates capacity mismatch via FP32 self-distillation and prevents error amplification through Decoder-Free Distillation (DFD), which corrects quantization errors strictly at the network bottleneck. To stabilize the optimization tug-of-war, we propose a Learnable Magnitude Reweighting (LMR) that dynamically balances competing gradients. Finally, we design an Edge-Friendly Model (EFM) featuring a lightweight Learnable Degradation Gating (LDG) to dynamically modulate spatial degradation localization. Extensive experiments across four IR tasks demonstrate that our Int8 model recovers 96.5% of FP32 performance, achieves 442 frames per second (FPS) on an NVIDIA Jetson Orin, and boosts downstream object detection by 16.3 mAP
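QDR 面向 Int8 量化下的图像恢复。下面两个小函数分别示意对称 Int8 伪量化(QAT 的基本前向操作)与"按梯度幅度重加权两路损失"的思路;其中按梯度范数反比加权只是对 LMR 思想的粗略近似,论文中的权重是可学习的。

```python
import numpy as np

def fake_quant_int8(x):
    """对称 Int8 伪量化:量化后立即反量化,返回带量化误差的张量(QAT 前向)。"""
    m = np.abs(x).max()
    scale = m / 127.0 if m > 0 else 1.0
    q = np.clip(np.round(x / scale), -128, 127)
    return q * scale

def reweight_losses(l_rec, l_kd, g_rec, g_kd):
    """损失幅度重加权示意:按两路梯度范数反比分配权重,缓解优化"拉锯"。"""
    w_rec, w_kd = 1.0 / (g_rec + 1e-8), 1.0 / (g_kd + 1e-8)
    s = w_rec + w_kd
    return (w_rec / s) * l_rec + (w_kd / s) * l_kd

x = np.linspace(-1.0, 1.0, 11)
err = np.abs(x - fake_quant_int8(x)).max()   # 误差不超过半个量化步长 scale/2
```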
[CV-43] Physics-Driven 3D Gaussian Rendering for Zero-Shot MRI Super-Resolution ICASSP
【速读】:该论文旨在解决高分辨率磁共振成像(MRI)在临床诊断中因采集时间长和运动伪影导致的局限性问题,同时克服现有超分辨率(SR)方法在数据依赖性和计算效率之间的权衡困境。其解决方案的关键在于提出一种零样本(zero-shot)MRI超分辨率框架,采用显式高斯表示(explicit Gaussian representation)来平衡数据需求与计算效率:通过嵌入组织物理特性的MRI定制高斯参数,显著减少可学习参数并保持MR信号保真度;结合基于物理的体渲染策略模拟MRI信号形成过程,利用无序独立的砖块(brick-based)光栅化方案实现高度并行的三维计算,从而有效降低训练与推理成本。
链接: https://arxiv.org/abs/2603.09621
作者: Shuting Liu,Lei Zhang,Wei Huang,Zhao Zhang,Zizhou Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICASSP
Abstract:High-resolution Magnetic Resonance Imaging (MRI) is vital for clinical diagnosis but limited by long acquisition times and motion artifacts. Super-resolution (SR) reconstructs low-resolution scans into high-resolution images, yet existing methods are mutually constrained: paired-data methods achieve efficiency only by relying on costly aligned datasets, while implicit neural representation approaches avoid such data needs at the expense of heavy computation. We propose a zero-shot MRI SR framework using explicit Gaussian representation to balance data requirements and efficiency. MRI-tailored Gaussian parameters embed tissue physical properties, reducing learnable parameters while preserving MR signal fidelity. A physics-grounded volume rendering strategy models MRI signal formation via normalized Gaussian aggregation. Additionally, a brick-based order-independent rasterization scheme enables highly parallel 3D computation, lowering training and inference costs. Experiments on two public MRI datasets show superior reconstruction quality and efficiency, demonstrating the method’s potential for clinical MRI SR.
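该方法以归一化高斯聚合建模 MRI 信号形成。下面在一维上示意"采样点信号 = 各高斯强度按归一化权重的加权平均";高斯参数与一维设定均为本文的示意假设。

```python
import numpy as np

def render_normalized_gaussians(t, mu, sigma, intensity):
    """归一化高斯聚合:采样点处的信号是各高斯强度按归一化权重的加权平均。"""
    w = np.exp(-0.5 * ((t[:, None] - mu[None, :]) / sigma[None, :]) ** 2)
    w = w / (w.sum(axis=1, keepdims=True) + 1e-12)
    return w @ intensity

t = np.linspace(0.0, 10.0, 101)
mu = np.array([3.0, 7.0])            # 两个高斯基元的中心
sigma = np.array([0.5, 0.5])
inten = np.array([1.0, 0.2])         # 可类比不同组织的 MR 信号强度
signal = render_normalized_gaussians(t, mu, sigma, inten)
```

由于权重归一化,输出始终是各基元强度的凸组合,这与论文中"保持 MR 信号保真度"的动机一致。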
[CV-44] A saccade-inspired approach to image classification using vision transformer attention maps
【速读】:该论文旨在解决传统人工智能视觉系统在图像处理中缺乏生物合理性与计算效率的问题,即如何在有限的代谢资源下实现高效且精准的视觉感知。其解决方案的关键在于借鉴人类视觉系统的主动注意机制——通过模拟快速眼动(saccadic eye movements)行为,利用DINO模型生成的注意力图(attention maps)指导信息处理聚焦于视觉空间中的关键区域,从而构建一种基于视觉Transformer注意力机制的“扫视式”选择性处理策略。实验表明,该方法在ImageNet分类任务中可保持接近全图处理的性能,甚至在某些情况下超越,同时展现出优于现有用于人类注视预测的显著性模型的区域选择能力,为神经形态视觉处理提供了新的高效范式。
链接: https://arxiv.org/abs/2603.09613
作者: Matthis Dallain,Laurent Rodriguez,Laurent Udo Perrinet,Benoît Miramond
机构: Institut de Neurosciences de la Timone, Aix-Marseille Université, CNRS, Marseille, 13005, France; Laboratoire d’Électronique, Antennes et Télécommunications, Université Côte d’Azur, CNRS, Sophia Antipolis, 06903, France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 page, 11 figure main paper + 3 pages, 6 appendix
Abstract:Human vision achieves remarkable perceptual performance while operating under strict metabolic constraints. A key ingredient is the selective attention mechanism, driven by rapid saccadic eye movements that constantly reposition the high-resolution fovea onto task-relevant locations, unlike conventional AI systems that process entire images with equal emphasis. Our work aims to draw inspiration from the human visual system to create smarter, more efficient image processing models. Using DINO, a self-supervised Vision Transformer that produces attention maps strikingly similar to human gaze patterns, we explore a saccade inspired method to focus the processing of information on key regions in visual space. To do so, we use the ImageNet dataset in a standard classification task and measure how each successive saccade affects the model’s class scores. This selective-processing strategy preserves most of the full-image classification performance and can even outperform it in certain cases. By benchmarking against established saliency models built for human gaze prediction, we demonstrate that DINO provides superior fixation guidance for selecting informative regions. These findings highlight Vision Transformer attention as a promising basis for biologically inspired active vision and open new directions for efficient, neuromorphic visual processing.
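该文用 DINO 注意力图引导"扫视式"地选取关键区域。下面示意一个常见的注视点选取策略:迭代取注意力最大值,并对已选位置邻域施加返回抑制(inhibition of return);注意力图与抑制半径均为本文假设,并非论文的具体参数。

```python
import numpy as np

def select_fixations(attn, n_fix=3, radius=2):
    """从注意力图中迭代选取注视点,并对已选位置邻域施加"返回抑制"。"""
    a = attn.astype(float).copy()
    h, w = a.shape
    ys, xs = np.mgrid[0:h, 0:w]
    fixations = []
    for _ in range(n_fix):
        y, x = np.unravel_index(a.argmax(), a.shape)
        fixations.append((int(y), int(x)))
        a[(ys - y) ** 2 + (xs - x) ** 2 <= radius ** 2] = -np.inf  # 抑制已访问邻域
    return fixations

attn = np.zeros((8, 8))                      # 假设的注意力图(实际来自 DINO)
attn[1, 1], attn[6, 6], attn[1, 6] = 3.0, 2.0, 1.0
fix = select_fixations(attn)
```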
[CV-45] ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis CVPR2026
【速读】:该论文旨在解决文本到动作生成(text-to-motion synthesis)中现有方法在生成特定身体部位动作时准确性不足,以及基于部件的运动生成方法因独立生成各部分运动而导致全身动作不连贯的问题。其解决方案的关键在于提出ParTY框架,通过三个核心模块实现:(1) 部件引导网络(Part-Guided Network),先生成局部部件运动以获取部件引导信息,再用于生成整体动作;(2) 部件感知的文本对齐机制(Part-aware Text Grounding),多样化地转换文本嵌入并精准对齐至每个身体部位;(3) 整体-部件融合机制(Holistic-Part Fusion),自适应融合整体与部件运动,从而在提升部件表达力的同时保证全身动作的一致性与自然性。
链接: https://arxiv.org/abs/2603.09611
作者: KunHo Heo,SuYeon Kim,Yonghyun Gwon,Youngbin Kim,MyeongAh Cho
机构: Kyung Hee University (경희대학교)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026. Code: this https URL
Abstract:Text-to-motion synthesis aims to generate natural and expressive human motions from textual descriptions. While existing approaches primarily focus on generating holistic motions from text descriptions, they struggle to accurately reflect actions involving specific body parts. Recent part-wise motion generation methods attempt to resolve this but face two critical limitations: (i) they lack explicit mechanisms for aligning textual semantics with individual body parts, and (ii) they often generate incoherent full-body motions due to integrating independently generated part motions. To overcome these issues and resolve the fundamental trade-off in existing methods, we propose ParTY, a novel framework that enhances part expressiveness while generating coherent full-body motions. ParTY comprises: (1) Part-Guided Network, which first generates part motions to obtain part guidance, then uses it to generate holistic motions; (2) Part-aware Text Grounding, which diversely transforms text embeddings and appropriately aligns them with each body part; and (3) Holistic-Part Fusion, which adaptively fuses holistic motions and part motions. Extensive experiments, including part-level and coherence-level evaluations, demonstrate that ParTY achieves substantial improvements over previous methods.
[CV-46] BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers CVPR2026
【速读】:该论文旨在解决视觉任务中Transformer模型注意力模块计算复杂度高的问题,尤其是传统8-bit或4-bit量化方法在效率与精度之间难以平衡的瓶颈。其解决方案的关键在于提出BinaryAttention,通过将查询(query)和键(key)二值化(仅保留符号信息),并将浮点点积替换为位运算操作,实现1-bit qk-attention的高效计算;同时引入可学习偏置项缓解1-bit量化带来的信息损失,并结合量化感知训练(quantization-aware training)与自蒸馏(self-distillation)技术确保注意力相似性对齐,从而在显著降低计算开销的同时保持甚至超越全精度注意力的性能。
链接: https://arxiv.org/abs/2603.09582
作者: Chaodong Xiao,Zhengqiang Zhang,Lei Zhang
机构: The Hong Kong Polytechnic University (香港理工大学); OPPO Research Institute (OPPO研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Transformers have achieved widespread and remarkable success, while the computational complexity of their attention modules remains a major bottleneck for vision tasks. Existing methods mainly employ 8-bit or 4-bit quantization to balance efficiency and accuracy. In this paper, with theoretical justification, we indicate that binarization of attention preserves the essential similarity relationships, and propose BinaryAttention, an effective method for fast and accurate 1-bit qk-attention. Specifically, we retain only the sign of queries and keys in computing the attention, and replace the floating dot products with bit-wise operations, significantly reducing the computational cost. We mitigate the inherent information loss under 1-bit quantization by incorporating a learnable bias, and enable end-to-end acceleration. To maintain the accuracy of attention, we adopt quantization-aware training and self-distillation techniques, mitigating quantization errors while ensuring sign-aligned similarity. BinaryAttention is more than 2x faster than FlashAttention2 on A100 GPUs. Extensive experiments on vision transformer and diffusion transformer benchmarks demonstrate that BinaryAttention matches or even exceeds full-precision attention, validating its effectiveness. Our work provides a highly efficient and effective alternative to full-precision attention, pushing the frontier of low-bit vision and diffusion transformers. The codes and models can be found at this https URL.
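BinaryAttention 只保留 Q/K 的符号,并用位运算替代浮点点积。下面用 NumPy 验证其核心恒等式:sign(q)·sign(k) = d - 2·popcount(XOR),这正是 1-bit 打分可以用位运算实现的原因;论文中的可学习偏置此处简化为一个标量假设。

```python
import numpy as np

def binary_qk_scores(q, k, bias=0.0):
    """1-bit QK 打分:仅保留符号位,sign(q)·sign(k) = d - 2*popcount(XOR)。"""
    bq = (q > 0).astype(np.int64)            # 每维 1 bit:符号
    bk = (k > 0).astype(np.int64)
    d = q.shape[-1]
    hamming = (bq[:, None, :] ^ bk[None, :, :]).sum(-1)   # XOR 后计数
    return (d - 2 * hamming) + bias          # 可学习偏置此处简化为标量

np.random.seed(0)
q = np.random.randn(4, 16)
k = np.random.randn(5, 16)
scores = binary_qk_scores(q, k)
ref = np.sign(q) @ np.sign(k).T              # 浮点实现,应与位运算结果一致
```

实际加速来自将每维 1 bit 打包进机器字后用硬件 XOR/popcount 指令,这里的逐元素 XOR 仅用于说明等价性。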
[CV-47] More than the Sum: Panorama-Language Models for Adverse Omni-Scenes CVPR2026
【速读】:该论文旨在解决现有视觉语言模型(Vision-Language Models, VLMs)在处理全景图像时的局限性问题,即当前模型主要针对针孔成像(pinhole imagery)设计,通过拼接多个窄视场输入来构建完整场景理解,但这种多视角感知方式忽略了单个全景图所固有的全局空间和上下文关系。其解决方案的关键在于提出全景语言建模(Panorama-Language Modeling, PLM)范式,该范式实现了对360°视觉信息的统一语言推理能力,并开发了一个即插即用的全景稀疏注意力模块(panoramic sparse attention module),使现有的针孔基VLM无需重新训练即可处理等距投影(equirectangular)全景图像,从而显著提升在复杂全景场景下的鲁棒性和整体推理能力。
链接: https://arxiv.org/abs/2603.09573
作者: Weijia Fan,Ruiping Liu,Jiale Wei,Yufan Chen,Junwei Zheng,Zichao Zeng,Jiaming Zhang,Qiufu Li,Linlin Shen,Rainer Stiefelhagen
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); Hunan University (湖南大学); Shenzhen University (深圳大学); UCL (伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026. Project page: this https URL
Abstract:Existing vision-language models (VLMs) are tailored for pinhole imagery, stitching multiple narrow field-of-view inputs to piece together a complete omni-scene understanding. Yet, such multi-view perception overlooks the holistic spatial and contextual relationships that a single panorama inherently preserves. In this work, we introduce the Panorama-Language Modeling (PLM) paradigm, a unified 360° vision-language reasoning approach that is more than the sum of its pinhole counterparts. Besides, we present PanoVQA, a large-scale panoramic VQA dataset that involves adverse omni-scenes, enabling comprehensive reasoning under object occlusions and driving accidents. To establish a foundation for PLM, we develop a plug-and-play panoramic sparse attention module that allows existing pinhole-based VLMs to process equirectangular panoramas without retraining. Extensive experiments demonstrate that our PLM achieves superior robustness and holistic reasoning under challenging omni-scenes, yielding understanding greater than the sum of its narrow parts. Project page: this https URL.
[CV-48] GeoAlignCLIP: Enhancing Fine-Grained Vision-Language Alignment in Remote Sensing via Multi-Granular Consistency Learning
【速读】:该论文旨在解决现有遥感视觉-语言预训练模型在细粒度对齐能力上的不足,即模型通常仅依赖全局图像与文本的对齐方式,难以有效整合多粒度的视觉与文本信息,从而限制了其在复杂细粒度任务中的性能表现。解决方案的关键在于提出GeoAlignCLIP框架,该框架通过学习多粒度语义对齐(multi-granular semantic alignments)并引入模态内一致性(intra-modal consistency),实现图像区域与文本概念之间更精确的视觉-语义对齐;同时构建了RSFG-100k数据集,提供层次化监督信号以支持模型训练,显著提升了模型在多个公开遥感基准上的细粒度对齐能力与任务表现。
链接: https://arxiv.org/abs/2603.09566
作者: Xiao Yang,Ronghao Fu,Zhuoran Duan,Zhiwen Lin,Xueyan Liu,Bo Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language pretraining models have made significant progress in bridging remote sensing imagery with natural language. However, existing approaches often fail to effectively integrate multi-granular visual and textual information, relying primarily on global image-text alignment. This limitation hinders the model’s ability to accurately capture fine-grained details in images, thus restricting its performance in complex, fine-grained tasks. To address this, we propose GeoAlignCLIP, a unified framework that achieves fine-grained alignment in remote sensing tasks by learning multi-granular semantic alignments and incorporating intra-modal consistency, enabling more precise visual-semantic alignment between image regions and text concepts. Additionally, we construct RSFG-100k, a fine-granular remote sensing dataset containing scene descriptions, region-level annotations, and challenging hard-negative samples, providing hierarchical supervision for model training. Extensive experiments conducted on multiple public remote-sensing benchmarks demonstrate that GeoAlignCLIP consistently outperforms existing RS-specific methods across diverse tasks, exhibiting more robust and accurate fine-grained vision-language alignment.
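GeoAlignCLIP 属于 CLIP 式对比对齐。下面给出对称 InfoNCE 损失的最小实现,用于说明"配对图文在相似度矩阵对角线上应占优"的对齐目标;温度与示例嵌入均为假设,论文中还叠加了区域级对齐与模态内一致性项。

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """对称 InfoNCE:配对图文的相似度(对角线)相对其他组合应最大。"""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    n = logits.shape[0]

    def xent(lg):                            # 以对角元素为正样本的交叉熵
        lg = lg - lg.max(axis=1, keepdims=True)
        p = np.exp(lg) / np.exp(lg).sum(axis=1, keepdims=True)
        return -np.log(p[np.arange(n), np.arange(n)] + 1e-12).mean()

    return 0.5 * (xent(logits) + xent(logits.T))

aligned = np.eye(4)                          # 假设的已对齐嵌入
loss_good = clip_contrastive_loss(aligned, aligned)
loss_bad = clip_contrastive_loss(aligned, np.roll(aligned, 1, axis=0))
```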
[CV-49] GeoSolver: Scaling Test-Time Reasoning in Remote Sensing with Fine-Grained Process Supervision
【速读】:该论文旨在解决遥感图像理解中生成式 AI(Generative AI)模型在执行复杂、多步骤推理时难以保证视觉忠实性(visual faithfulness)的问题。现有方法虽引入了链式思维(Chain-of-Thought, CoT)推理机制,但中间步骤的视觉一致性缺乏有效验证,导致推理过程不可靠。其解决方案的关键在于提出 GeoSolver 框架,通过构建大规模 token 级别过程监督数据集 Geo-PRM-2M(基于熵引导的蒙特卡洛树搜索和针对性视觉幻觉注入),训练出一个 token 级别过程奖励模型(Process Reward Model, PRM)GeoPRM,用于提供细粒度的视觉忠实性反馈;并设计 Process-Aware Tree-GRPO 强化学习算法,结合树状探索结构与忠实性加权奖励机制,精准分配中间步骤的信用。这一方法显著提升了模型在多种遥感基准上的性能,并实现了测试时扩展(Test-Time Scaling, TTS)能力,且 GeoPRM 具备跨模型泛化能力,可增强通用视觉语言模型(VLMs)的推理可靠性。
链接: https://arxiv.org/abs/2603.09551
作者: Lang Sun,Ronghao Fu,Zhuoran Duan,Haoran Liu,Xueyan Liu,Bo Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While Vision-Language Models (VLMs) have significantly advanced remote sensing interpretation, enabling them to perform complex, step-by-step reasoning remains highly challenging. Recent efforts to introduce Chain-of-Thought (CoT) reasoning to this domain have shown promise, yet ensuring the visual faithfulness of these intermediate steps remains a critical bottleneck. To address this, we introduce GeoSolver, a novel framework that transitions remote sensing reasoning toward verifiable, process-supervised reinforcement learning. We first construct Geo-PRM-2M, a large-scale, token-level process supervision dataset synthesized via entropy-guided Monte Carlo Tree Search (MCTS) and targeted visual hallucination injection. Building upon this dataset, we train GeoPRM, a token-level process reward model (PRM) that provides granular faithfulness feedback. To effectively leverage these verification signals, we propose Process-Aware Tree-GRPO, a reinforcement learning algorithm that integrates tree-structured exploration with a faithfulness-weighted reward mechanism to precisely assign credit to intermediate steps. Extensive experiments demonstrate that our resulting model, GeoSolver-9B, achieves state-of-the-art performance across diverse remote sensing benchmarks. Crucially, GeoPRM unlocks robust Test-Time Scaling (TTS). Serving as a universal geospatial verifier, it seamlessly scales the performance of GeoSolver-9B and directly enhances general-purpose VLMs, highlighting its remarkable cross-model generalization.
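GeoSolver 的策略优化基于 GRPO。下面示意其中两个要素:组内相对优势(以同组采样回答的均值/标准差归一化奖励)与忠实性加权奖励;权重 0.5 为本文假设值,论文中的忠实性信号实际由 GeoPRM 在 token 级给出。

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """GRPO 组内相对优势:同一问题的一组采样回答,用组内均值/标准差归一化。"""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def faithfulness_weighted_reward(answer_r, faith_r, w_faith=0.5):
    """忠实性加权奖励示意:结果奖励与过程忠实性奖励按假设权重合成。"""
    return (1.0 - w_faith) * answer_r + w_faith * faith_r

# 4 个采样回答:(答案得分, 忠实性得分)
rewards = [faithfulness_weighted_reward(a, f)
           for a, f in [(1.0, 0.9), (0.0, 0.4), (1.0, 0.2), (0.0, 0.1)]]
adv = grpo_advantages(rewards)   # 正优势的回答在策略更新中被强化
```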
[CV-50] A comprehensive study of time-of-flight non-line-of-sight imaging
【速读】:该论文旨在解决当前时间-of-flight非视距(ToF NLOS)成像方法因公式多样性和硬件实现差异而导致的理论与实验评估难以统一的问题。其解决方案的关键在于构建一个通用的前向模型来统一描述多种ToF NLOS成像方法,并在此基础上分析其简化后的正向与逆向模型与Radon变换家族的关系,同时将频域迁移策略与基于相量的虚拟视距成像模型联系起来,从而为不同方法提供可比较的基准框架。通过在相同硬件条件下对选定方法进行定量对比实验,验证了现有方法在空间分辨率、可见性及噪声敏感性方面具有相似局限性,仅在特定参数设置上存在差异,为未来研究提供了客观评估新旧方法的标准化路径。
链接: https://arxiv.org/abs/2603.09548
作者: Julio Marco,Adrian Jarabo,Ji Hyun Nam,Alberto Tosi,Diego Gutierrez,Andreas Velten
机构: Universidad de Zaragoza–I3A (萨拉戈萨大学–I3A); University of Wisconsin–Madison (威斯康星大学麦迪逊分校); Inha University (仁荷大学); Politecnico di Milano (米兰理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:Time-of-Flight non-line-of-sight (ToF NLOS) imaging techniques provide state-of-the-art reconstructions of scenes hidden around corners by inverting the optical path of indirect photons scattered by visible surfaces and measured by picosecond resolution sensors. The emergence of a wide range of ToF NLOS imaging methods with heterogeneous formulae and hardware implementations obscures the assessment of both their theoretical and experimental aspects. We present a comprehensive study of a representative set of ToF NLOS imaging methods by discussing their similarities and differences under common formulation and hardware. We first outline the problem statement under a common general forward model for ToF NLOS measurements, and the typical assumptions that yield tractable inverse models. We discuss the relationship of the resulting simplified forward and inverse models to a family of Radon transforms, and how migrating these to the frequency domain relates to recent phasor-based virtual line-of-sight imaging models for NLOS imaging that obey the constraints of conventional lens-based imaging systems. We then evaluate performance of the selected methods on hidden scenes captured under the same hardware setup and similar photon counts. Our experiments show that existing methods share similar limitations on spatial resolution, visibility, and sensitivity to noise when operating under equal hardware constraints, with particular differences that stem from method-specific parameters. We expect our methodology to become a reference in future research on ToF NLOS imaging to obtain objective comparisons of existing and new methods.
[CV-51] Memory-Guided View Refinement for Dynamic Human-in-the-loop EQA
【速读】:该论文旨在解决具身问答(Embodied Question Answering, EQA)在动态、人类参与场景下的两个核心挑战:一是由视角依赖的遮挡导致的感知歧义问题,二是如何在保证高效推理的同时维护紧凑且实时更新的视觉证据记忆。传统方法依赖“存储-检索”策略,在动态环境中会积累冗余信息并增加计算开销。为此,作者提出了一种无需训练的框架DIVRR(Dynamic-Informed View Refinement and Relevance-guided Adaptive Memory Selection),其关键在于将基于相关性的视图精炼(relevance-guided view refinement)与自适应记忆选择(adaptive memory selection)相结合:通过在存入记忆前验证观察结果的可靠性,并仅保留高信息量的证据,从而提升对遮挡的鲁棒性,同时保持低延迟和紧凑的记忆结构。
链接: https://arxiv.org/abs/2603.09541
作者: Xin Lu,Rui Li,Xun Huang,Weixin Li,Chuanqing Zhuang,Jiayuan Li,Zhengda Lu,Jun Xiao,Yunhong Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
Abstract:Embodied Question Answering (EQA) has traditionally been evaluated in temporally stable environments where visual evidence can be accumulated reliably. However, in dynamic, human-populated scenes, human activities and occlusions introduce significant perceptual non-stationarity: task-relevant cues are transient and view-dependent, while a store-then-retrieve strategy over-accumulates redundant evidence and increases inference cost. This setting exposes two practical challenges for EQA agents: resolving ambiguity caused by viewpoint-dependent occlusions, and maintaining compact yet up-to-date evidence for efficient inference. To enable systematic study of this setting, we introduce DynHiL-EQA, a human-in-the-loop EQA dataset with two subsets: a Dynamic subset featuring human activities and temporal changes, and a Static subset with temporally stable observations. To address the above challenges, we present DIVRR (Dynamic-Informed View Refinement and Relevance-guided Adaptive Memory Selection), a training-free framework that couples relevance-guided view refinement with selective memory admission. By verifying ambiguous observations before committing them and retaining only informative evidence, DIVRR improves robustness under occlusions while preserving fast inference with compact memory. Extensive experiments on DynHiL-EQA and the established HM-EQA dataset demonstrate that DIVRR consistently improves over existing baselines in both dynamic and static settings while maintaining high inference efficiency.
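DIVRR 在写入记忆前先验证观测的相关性并过滤冗余。下面是"记忆准入"判据的一个最小示意:相关性阈值与余弦相似度冗余阈值均为本文假设,仅说明"先验证、再写入、只留高信息量证据"的流程。

```python
import numpy as np

def should_admit(feat, memory, relevance, rel_thresh=0.5, sim_thresh=0.9):
    """相关性引导的记忆准入:相关性达标且与已有证据不冗余时才写入记忆。"""
    if relevance < rel_thresh:          # 与问题无关的观测直接丢弃
        return False
    f = feat / (np.linalg.norm(feat) + 1e-12)
    for m in memory:
        m = m / (np.linalg.norm(m) + 1e-12)
        if float(f @ m) > sim_thresh:   # 与已有记忆过于相似,视为冗余
            return False
    return True

memory = [np.array([1.0, 0.0])]         # 已有的一条证据特征(假设)
```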
[CV-52] Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization
【速读】:该论文旨在解决统一视觉-语言模型在生成多模态交错输出(multimodal interleaved outputs)方面的不足,这一能力对于视觉叙事和分步视觉推理等任务至关重要。现有模型虽在多模态理解与生成上取得进展,但难以有效实现文本与图像的交替生成。解决方案的关键在于提出一种基于强化学习的后训练策略:首先通过一个预热阶段使用混合数据集(包含精选的交错序列和有限的多模态理解数据)引导模型学习交错生成模式,同时保留其预训练能力;随后引入一种扩展自Group Relative Policy Optimization (GRPO) 的统一策略优化框架,在单一解码轨迹中联合建模文本与图像生成,并采用融合文本相关性、图文对齐性和结构保真度的混合奖励机制进行优化,同时引入过程级奖励以提供步骤级指导,从而显著提升复杂多模态任务中的生成质量和连贯性。
链接: https://arxiv.org/abs/2603.09538
作者: Ming Nie,Chunwei Wang,Jianhua Han,Hang Xu,Li Zhang
机构: Fudan University (复旦大学); Huawei (华为); Yinwang Intelligent Technology Co., Ltd. (英伟旺科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Unified vision-language models have made significant progress in multimodal understanding and generation, yet they largely fall short in producing multimodal interleaved outputs, which is a crucial capability for tasks like visual storytelling and step-by-step visual reasoning. In this work, we propose a reinforcement learning-based post-training strategy to unlock this capability in existing unified models, without relying on large-scale multimodal interleaved datasets. We begin with a warm-up stage using a hybrid dataset comprising curated interleaved sequences and limited data for multimodal understanding and text-to-image generation, which exposes the model to interleaved generation patterns while preserving its pretrained capabilities. To further refine interleaved generation, we propose a unified policy optimization framework that extends Group Relative Policy Optimization (GRPO) to the multimodal setting. Our approach jointly models text and image generation within a single decoding trajectory and optimizes it with our novel hybrid rewards covering textual relevance, visual-text alignment, and structural fidelity. Additionally, we incorporate process-level rewards to provide step-wise guidance, enhancing training efficiency in complex multimodal tasks. Experiments on MMIE and InterleavedBench demonstrate that our approach significantly enhances the quality and coherence of multimodal interleaved generation.
[CV-53] DCAU-Net: Differential Cross Attention and Channel-Spatial Feature Fusion for Medical Image Segmentation IJCNN2026
【速读】:该论文旨在解决医学图像分割中长期依赖建模与细粒度边界细节捕捉之间的矛盾问题。传统卷积神经网络(Convolutional Neural Networks, CNNs)因感受野有限导致语义信息不足,而Transformer虽能缓解此问题,却引入了二次计算复杂度及对无关区域分配非忽略注意力权重的问题,从而削弱对判别性结构的关注,影响分割精度。此外,编码器-解码器架构中的融合策略多采用简单拼接或相加方式,难以自适应地整合高层语义信息与低层空间细节。为此,作者提出DCAU-Net框架,其核心创新在于两点:一是设计差异交叉注意力(Differential Cross Attention, DCA),通过计算两个独立Softmax注意力图的差异来自适应突出判别结构,并以窗口级汇总token替代像素级键值token,显著降低计算复杂度而不损失精度;二是引入通道-空间特征融合(Channel-Spatial Feature Fusion, CSFF)策略,利用序列化的通道与空间注意力机制对跳跃连接和上采样路径特征进行自适应校准,有效抑制冗余信息并增强显著线索。
链接: https://arxiv.org/abs/2603.09530
作者: Yanxin Li,Hui Wan,Libin Lan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IJCNN 2026, 6 pages, 5 tables, 4 figures
Abstract:Accurate medical image segmentation requires effective modeling of both long-range dependencies and fine-grained boundary details. While transformers mitigate the issue of insufficient semantic information arising from the limited receptive field inherent in convolutional neural networks, they introduce new challenges: standard self-attention incurs quadratic computational complexity and often assigns non-negligible attention weights to irrelevant regions, diluting focus on discriminative structures and ultimately compromising segmentation accuracy. Existing attention variants, although effective in reducing computational complexity, fail to suppress redundant computation and inadvertently impair global context modeling. Furthermore, conventional fusion strategies in encoder-decoder architectures, typically based on simple concatenation or summation, can not adaptively integrate high-level semantic information with low-level spatial details. To address these limitations, we propose DCAU-Net, a novel yet efficient segmentation framework with two key ideas. First, a new Differential Cross Attention (DCA) is designed to compute the difference between two independent softmax attention maps to adaptively highlight discriminative structures. By replacing pixel-wise key and value tokens with window-level summary tokens, DCA dramatically reduces computational complexity without sacrificing precision. Second, a Channel-Spatial Feature Fusion (CSFF) strategy is introduced to adaptively recalibrate features from skip connections and up-sampling paths through using sequential channel and spatial attention, effectively suppressing redundant information and amplifying salient cues. Experiments on two public benchmarks demonstrate that DCAU-Net achieves competitive performance with enhanced segmentation accuracy and robustness.
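DCA 的核心是"两个独立 softmax 注意力图相减",并以窗口级汇总 token 替代逐像素 K/V。下面用 NumPy 给出单头、无学习参数版本的示意(窗口平均汇总与固定系数 lam 均为本文假设,论文中相应部分可学习):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def window_summary(tokens, win=4):
    """用窗口级汇总 token(此处取窗口平均)替代逐像素 K/V,降低注意力复杂度。"""
    n = (tokens.shape[0] // win) * win
    return tokens[:n].reshape(-1, win, tokens.shape[-1]).mean(axis=1)

def differential_cross_attention(q1, q2, k, v, lam=0.5):
    """差分注意力:两个独立 softmax 注意力图相减,抑制对无关区域的权重。"""
    d = q1.shape[-1]
    a1 = softmax(q1 @ k.T / np.sqrt(d))
    a2 = softmax(q2 @ k.T / np.sqrt(d))
    return (a1 - lam * a2) @ v

np.random.seed(0)
tokens = np.random.randn(16, 8)      # 16 个像素 token,每个 8 维
ks = window_summary(tokens)          # 4 个窗口汇总 token
q1, q2 = np.random.randn(2, 8), np.random.randn(2, 8)
out = differential_cross_attention(q1, q2, ks, ks, lam=0.5)
```

当 lam=0 时退化为标准单图注意力;键值由 16 个像素 token 缩减为 4 个汇总 token,即复杂度下降的来源。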
[CV-54] RESBev: Making BEV Perception More Robust
【速读】:该论文旨在解决自动驾驶系统中鸟瞰图(Bird’s-eye-view, BEV)感知在真实场景下因传感器退化和对抗攻击导致的感知异常问题,这些问题会显著影响系统的安全性。解决方案的关键在于将感知鲁棒性重构为潜在语义预测问题,通过构建一个潜在世界模型来提取连续BEV观测中的时空相关性,从而学习底层的BEV状态转移规律,并据此预测干净的BEV特征以重建受损观测。该方法在Lift-Splat-Shoot架构的语义特征层面运行,无需修改原有主干网络即可实现对自然扰动与对抗攻击的泛化恢复能力。
链接: https://arxiv.org/abs/2603.09529
作者: Lifeng Zhuo,Kefan Jin,Zhe Liu,Hesheng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Bird’s-eye-view (BEV) perception has emerged as a cornerstone of autonomous driving systems, providing a structured, ego-centric representation critical for downstream planning and control. However, real-world deployment faces challenges from sensor degradation and adversarial attacks, which can cause severe perceptual anomalies and ultimately compromise the safety of autonomous driving systems. To address this, we propose a resilient and plug-and-play BEV perception method, RESBev, which can be easily applied to existing BEV perception methods to enhance their robustness to diverse disturbances. Specifically, we reframe perception robustness as a latent semantic prediction problem. A latent world model is constructed to extract spatiotemporal correlations across sequential BEV observations, thereby learning the underlying BEV state transitions to predict clean BEV features for reconstructing corrupted observations. The proposed framework operates at the semantic feature level of the Lift-Splat-Shoot pipeline, enabling recovery that generalizes across both natural disturbances and adversarial attacks without modifying the underlying backbone. Extensive experiments on the nuScenes dataset demonstrate that, with few-shot fine-tuning, RESBev significantly improves the robustness of existing BEV perception models against various external disturbances and adversarial attacks.
[CV-55] Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)作为驾驶辅助系统时存在的响应不一致性和有限的时间推理能力问题。具体而言,研究发现即使VLM具备较强的视觉理解能力,其输出仍可能因输入微小扰动而出现显著差异,且难以基于当前观测对后续事件进行连贯的时间序列推理,导致决策不可靠。解决方案的关键在于引入一个名为FutureVQA的人工标注基准数据集,用于专门评估未来场景推理能力,并提出一种简单而有效的自监督微调方法,结合思维链(chain-of-thought)推理机制,在不依赖时间标注的情况下有效提升模型的一致性与时间推理性能。
链接: https://arxiv.org/abs/2603.09512
作者: Chun-Peng Chang,Chen-Yu Wang,Holger Caesar,Alain Pagani
机构: DFKI Augmented Vision (DFKI增强视觉); TU Delft (代尔夫特理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A reliable driving assistant should provide consistent responses based on temporally grounded reasoning derived from observed information. In this work, we investigate whether Vision-Language Models (VLMs), when applied as driving assistants, can respond consistently and understand how present observations shape future outcomes, or whether their outputs merely reflect patterns memorized during training without temporally grounded reasoning. While recent efforts have integrated VLMs into autonomous driving, prior studies typically emphasize scene understanding and instruction generation, implicitly assuming that strong visual interpretation naturally enables consistent future reasoning and thus ensures reliable decision-making, a claim we critically examine. We focus on two major challenges limiting VLM reliability in this setting: response inconsistency, where minor input perturbations yield different answers or, in some cases, responses degenerate toward near-random guessing, and limited temporal reasoning, in which models fail to reason and align sequential events from current observations, often resulting in incorrect or even contradictory responses. Moreover, we find that models with strong visual understanding do not necessarily perform best on tasks requiring temporal reasoning, indicating a tendency to over-rely on pretrained patterns rather than modeling temporal dynamics. To address these issues, we adopt existing evaluation methods and introduce FutureVQA, a human-annotated benchmark dataset specifically designed to assess future scene reasoning. In addition, we propose a simple yet effective self-supervised tuning approach with chain-of-thought reasoning that improves both consistency and temporal reasoning without requiring temporal labels.
[CV-56] Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation CVPR2026
【速读】:该论文旨在解决文本目标实例导航(Text-goal Instance Navigation, TGIN)问题,即在存在同类别干扰物的复杂3D场景中,仅凭自由格式的自然语言描述准确识别并导航至特定对象实例。其核心解决方案在于提出Context-Nav框架,关键创新点包括:(1)通过计算密集的文本-图像对齐生成价值图(value map),将长句上下文信息作为全局探索先验,引导智能体优先探索与完整描述一致的区域,从而避免因局部匹配导致的无效移动;(2)引入基于视角感知的空间关系验证机制,在候选目标被观测后,通过采样多个观察位姿、对齐局部坐标系,并仅当至少一个视角下空间关系可满足时才确认为目标,实现几何约束下的细粒度实例消歧。该方法无需任务特定训练或微调,显著优于现有基准(InstanceNav 和 CoIN-Bench),验证了几何驱动的空间推理在复杂场景中替代昂贵策略训练或人工交互的有效性。
链接: https://arxiv.org/abs/2603.09506
作者: Won Shik Jang,Ue-Hwan Kim
机构: Gwangju Institute of Science and Technology (光州科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Camera-ready version. Accepted to CVPR 2026
Abstract:Text-goal instance navigation (TGIN) asks an agent to resolve a single, free-form description into actions that reach the correct object instance among same-category distractors. We present Context-Nav that elevates long, contextual captions from a local matching cue to a global exploration prior and verifies candidates through 3D spatial reasoning. First, we compute dense text-image alignments for a value map that ranks frontiers – guiding exploration toward regions consistent with the entire description rather than early detections. Second, upon observing a candidate, we perform a viewpoint-aware relation check: the agent samples plausible observer poses, aligns local frames, and accepts a target only if the spatial relations can be satisfied from at least one viewpoint. The pipeline requires no task-specific training or fine-tuning; we attain state-of-the-art performance on InstanceNav and CoIN-Bench. Ablations show that (i) encoding full captions into the value map avoids wasted motion and (ii) explicit, viewpoint-aware 3D verification prevents semantically plausible but incorrect stops. This suggests that geometry-grounded spatial reasoning is a scalable alternative to heavy policy training or human-in-the-loop interaction for fine-grained instance disambiguation in cluttered 3D scenes.
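上文第一点中的"价值图"本质上是整段描述的文本嵌入与各 frontier 区域特征之间的对齐得分,可用余弦相似度做一个玩具演示(特征以随机向量代替真实的文本/图像嵌入,均为示例假设):

```python
import numpy as np

def build_value_map(frontier_feats, caption_feat):
    # 整段描述的嵌入与各 frontier 特征的余弦相似度,作为探索价值
    f = frontier_feats / np.linalg.norm(frontier_feats, axis=1, keepdims=True)
    c = caption_feat / np.linalg.norm(caption_feat)
    return f @ c

rng = np.random.default_rng(1)
caption = rng.normal(size=16)                       # 完整描述的文本嵌入(玩具数据)
frontiers = rng.normal(size=(5, 16))                # 5 个候选 frontier 的图像特征
frontiers[2] = caption + 0.1 * rng.normal(size=16)  # 构造一个与描述高度一致的 frontier
values = build_value_map(frontiers, caption)
print(int(np.argmax(values)))  # 2:优先探索与完整描述一致的区域
```

价值最高的 frontier 正是与整段描述最匹配的区域,而不是最早被局部检测命中的区域,这对应消融实验中"编码完整描述可避免无效移动"的结论。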
[CV-57] SurgFed: Language-guided Multi-Task Federated Learning for Surgical Video Understanding
【速读】:该论文旨在解决多任务联邦学习(Multi-Task Federated Learning, MTFL)在机器人辅助微创手术(Robot-Assisted Minimally Invasive Surgery, RAS)场景下视频理解中的两个核心挑战:一是组织多样性(Tissue Diversity),即本地模型难以适应不同医疗站点特有的组织特征,导致在异质临床环境中预测性能下降;二是任务多样性(Task Diversity),即服务器端基于梯度聚类的聚合方式因跨站点任务异构性而产生次优或错误的参数更新,进而影响定位精度。解决方案的关键在于提出SurgFed框架,其核心创新为两项设计:语言引导的通道选择(Language-guided Channel Selection, LCS),通过轻量级个性化通道选择网络结合预定义文本输入,优化本地模型对特定组织嵌入的学习能力;以及语言引导的超聚合机制(Language-guided Hyper Aggregation, LHA),利用层间交叉注意力机制建模跨站点任务交互,并驱动超网络生成个性化的参数更新策略,从而实现跨站点与跨任务的充分探索与协同优化。
链接: https://arxiv.org/abs/2603.09496
作者: Zheng Fang,Ziwei Niu,Ziyue Wang,Zhu Zhuo,Haofeng Liu,Shuyang Qian,Jun Xia,Yueming Jin
机构: National University of Singapore (新加坡国立大学); The Hong Kong University of Science and Technology (广州) (香港科技大学(广州)); Zhejiang University (浙江大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Surgical scene Multi-Task Federated Learning (MTFL) is essential for robot-assisted minimally invasive surgery (RAS) but remains underexplored in surgical video understanding due to two key challenges: (1) Tissue Diversity: Local models struggle to adapt to site-specific tissue features, limiting their effectiveness in heterogeneous clinical environments and leading to poor local predictions. (2) Task Diversity: Server-side aggregation, relying solely on gradient-based clustering, often produces suboptimal or incorrect parameter updates due to inter-site task heterogeneity, resulting in inaccurate localization. In light of these two issues, we propose SurgFed, a multi-task federated learning framework, enabling federated learning for surgical scene segmentation and depth estimation across diverse surgical types. SurgFed is powered by two appealing designs, i.e., Language-guided Channel Selection (LCS) and Language-guided Hyper Aggregation (LHA), to address the challenge of fully exploiting cross-site and cross-task information. Technically, the LCS is first designed as a lightweight personalized channel selection network that enhances site-specific adaptation using pre-defined text inputs, enabling the local model to optimally learn site-specific embeddings. We further introduce the LHA that employs a layer-wise cross-attention mechanism with pre-defined text inputs to model task interactions across sites and guide a hypernetwork for personalized parameter updates. Extensive empirical evidence shows that SurgFed yields improvements over the state-of-the-art methods in five public datasets across four surgical types. The code is available at this https URL.
[CV-58] Evolving Prompt Adaptation for Vision-Language Models
【速读】:该论文旨在解决大规模视觉语言模型(Vision-Language Models, VLMs)在下游任务中使用少量标注数据进行适应时面临的灾难性遗忘问题,即参数高效提示学习方法常导致预训练知识的丢失。解决方案的关键在于提出EvoPrompt框架,其核心创新是通过显式引导提示的演化路径实现无遗忘的稳定微调:首先利用模态共享提示投影器(Modality-Shared Prompt Projector, MPP)从统一嵌入空间生成分层提示;其次采用进化训练策略将低秩更新解耦为方向与幅度分量,仅调整幅度而不改变早期习得的语义方向,从而保留基础知识;最后引入特征几何正则化(Feature Geometric Regularization, FGR)强制特征去相关以防止表示坍缩,整体保障了模型在少样本学习中的性能提升和零样本能力的稳健保持。
链接: https://arxiv.org/abs/2603.09493
作者: Enming Zhang,Jiayang Li,Yanru Wu,Zhenyu Liu,Yang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The adaptation of large-scale vision-language models (VLMs) to downstream tasks with limited labeled data remains a significant challenge. While parameter-efficient prompt learning methods offer a promising path, they often suffer from catastrophic forgetting of pre-trained knowledge. Toward addressing this limitation, our work is grounded in the insight that governing the evolutionary path of prompts is essential for forgetting-free adaptation. To this end, we propose EvoPrompt, a novel framework designed to explicitly steer the prompt trajectory for stable, knowledge-preserving fine-tuning. Specifically, our approach employs a Modality-Shared Prompt Projector (MPP) to generate hierarchical prompts from a unified embedding space. Critically, an evolutionary training strategy decouples low-rank updates into directional and magnitude components, preserving early-learned semantic directions while only adapting their magnitude, thus enabling prompts to evolve without discarding foundational knowledge. This process is further stabilized by Feature Geometric Regularization (FGR), which enforces feature decorrelation to prevent representation collapse. Extensive experiments demonstrate that EvoPrompt achieves state-of-the-art performance in few-shot learning while robustly preserving the original zero-shot capabilities of pre-trained VLMs.
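其中"将低秩更新解耦为方向与幅度分量、冻结方向只调幅度"的演化策略,可用如下简化代码示意(对论文机制的玩具近似,并非官方实现):

```python
import numpy as np

def decouple(update):
    # 将更新量分解为单位方向向量与标量幅度
    mag = np.linalg.norm(update)
    return update / mag, mag

def evolve(update, mag_step):
    # 保持早期习得的语义方向不变,仅沿该方向调整幅度
    direction, mag = decouple(update)
    return direction * (mag + mag_step)

u0 = np.array([3.0, 4.0])        # 初始低秩更新:方向 (0.6, 0.8),幅度 5
u1 = evolve(u0, mag_step=2.0)    # 幅度 5 -> 7,方向不变
d0, _ = decouple(u0)
d1, m1 = decouple(u1)
print(np.allclose(d0, d1), round(m1, 6))  # True 7.0
```

由于方向分量被固定,后续训练只能放缩早期学到的语义方向而无法覆盖它,这正是摘要所述"无遗忘演化"的直观含义。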
[CV-59] Streaming Autoregressive Video Generation via Diagonal Distillation
【速读】:该论文旨在解决大规模预训练扩散模型在实时视频流生成中应用受限的问题,特别是现有视频蒸馏方法因忽视时序依赖性而导致运动一致性差、误差累积严重及延迟与质量之间的权衡难题。其解决方案的关键在于提出一种名为“对角蒸馏(Diagonal Distillation)”的新框架,该框架通过异构生成策略——即早期采用更多去噪步骤、后期逐步减少步骤——有效利用跨视频片段和去噪步长的时序信息;同时,通过将片段生成过程中隐式预测的后续噪声水平与实际推理条件对齐,缓解误差传播并降低长序列中的过饱和现象,并引入隐式光流建模以在严格步数约束下保持运动质量,从而实现高效且高质量的视频生成(5秒视频仅需2.61秒,速度提升达277.3倍)。
链接: https://arxiv.org/abs/2603.09488
作者: Jinxiu Liu,Xuanming Liu,Kangfu Mei,Yandong Wen,Ming-Hsuan Yang,Weiyang Liu
机构: South China University of Technology (华南理工大学); Westlake University (西湖大学); Johns Hopkins University (约翰霍普金斯大学); University of California, Merced (加州大学默塞德分校); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large pretrained diffusion models have significantly enhanced the quality of generated videos, and yet their use in real-time streaming remains limited. Autoregressive models offer a natural framework for sequential frame synthesis but require heavy computation to achieve high fidelity. Diffusion distillation can compress these models into efficient few-step variants, but existing video distillation approaches largely adapt image-specific methods that neglect temporal dependencies. These techniques often excel in image generation but underperform in video synthesis, exhibiting reduced motion coherence, error accumulation over long sequences, and a latency-quality trade-off. We identify two factors that result in these limitations: insufficient utilization of temporal context during step reduction and implicit prediction of subsequent noise levels in next-chunk prediction (i.e., exposure bias). To address these issues, we propose Diagonal Distillation, which operates orthogonally to existing approaches and better exploits temporal information across both video chunks and denoising steps. Central to our approach is an asymmetric generation strategy: more steps early, fewer steps later. This design allows later chunks to inherit rich appearance information from thoroughly processed early chunks, while using partially denoised chunks as conditional inputs for subsequent synthesis. By aligning the implicit prediction of subsequent noise levels during chunk generation with the actual inference conditions, our approach mitigates error propagation and reduces oversaturation in long-range sequences. We further incorporate implicit optical flow modeling to preserve motion quality under strict step constraints. Our method generates a 5-second video in 2.61 seconds (up to 31 FPS), achieving a 277.3x speedup over the undistilled model.
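摘要中的非对称生成策略(早期 chunk 分配更多去噪步、后期逐步减少)可以用一个线性递减的步数调度函数示意如下(线性形式仅为示例假设,论文未必采用该具体调度):

```python
def asymmetric_schedule(num_chunks, max_steps, min_steps):
    """为每个视频 chunk 分配去噪步数:首个 chunk 用 max_steps,末尾用 min_steps。"""
    if num_chunks == 1:
        return [max_steps]
    span = max_steps - min_steps
    return [max_steps - round(span * i / (num_chunks - 1)) for i in range(num_chunks)]

sched = asymmetric_schedule(num_chunks=5, max_steps=8, min_steps=2)
print(sched)  # [8, 6, 5, 4, 2](注意 Python 的 round 采用银行家舍入)
```

后期 chunk 以部分去噪的早期 chunk 为条件输入,因而可以从被充分处理过的前序内容继承外观信息,用更少的步数维持质量。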
[CV-60] Component-Aware Sketch-to-Image Generation Using Self-Attention Encoding and Coordinate-Preserving Fusion
【速读】:该论文旨在解决从自由手绘草图到逼真图像的生成问题,这一任务因草图具有抽象性、稀疏性和风格多样性而极具挑战,现有基于生成对抗网络(GAN)和扩散模型的方法在细节重建、空间对齐保持以及跨草图领域适应方面表现不足。解决方案的关键在于提出一种组件感知且自优化的两阶段框架:第一阶段采用基于自注意力机制的自动编码器网络(SA2N)提取组件级语义与结构特征;第二阶段通过坐标保持门控融合模块(CGF)将这些特征整合为一致的空间布局,并由基于改进StyleGAN2架构的空间自适应细化修订器(SARR)进行迭代优化,以提升图像真实感与一致性。该方法在多个面部与非面部数据集上均显著优于当前最优GAN与扩散模型,在图像保真度、语义准确性及感知质量上取得大幅提升。
链接: https://arxiv.org/abs/2603.09484
作者: Ali Zia,Muhammad Umer Ramzan,Usman Ali,Muhammad Faheem,Abdelwahed Khamis,Shahnawaz Qureshi
机构: La Trobe University (拉特罗布大学); Gujranwala Institute of Future Technologies (GIFT) University (古兰瓦拉未来技术学院( GIFT )大学); Data61, CSIRO (数据61,CSIRO); Sino-Pak Centre for Artificial Intelligence Pak-Austria Fachhochschule Institute of Applied Sciences and Technology (中巴人工智能中心巴基斯坦-奥地利应用科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Translating freehand sketches into photorealistic images remains a fundamental challenge in image synthesis, particularly due to the abstract, sparse, and stylistically diverse nature of sketches. Existing approaches, including GAN-based and diffusion-based models, often struggle to reconstruct fine-grained details, maintain spatial alignment, or adapt across different sketch domains. In this paper, we propose a component-aware, self-refining framework for sketch-to-image generation that addresses these challenges through a novel two-stage architecture. A Self-Attention-based Autoencoder Network (SA2N) first captures localised semantic and structural features from component-wise sketch regions, while a Coordinate-Preserving Gated Fusion (CGF) module integrates these into a coherent spatial layout. Finally, a Spatially Adaptive Refinement Revisor (SARR), built on a modified StyleGAN2 backbone, enhances realism and consistency through iterative refinement guided by spatial context. Extensive experiments across both facial (CelebAMask-HQ, CUFSF) and non-facial (Sketchy, ChairsV2, ShoesV2) datasets demonstrate the robustness and generalizability of our method. The proposed framework consistently outperforms state-of-the-art GAN and diffusion models, achieving significant gains in image fidelity, semantic accuracy, and perceptual quality. On CelebAMask-HQ, our model improves over prior methods by 21% (FID), 58% (IS), 41% (KID), and 20% (SSIM). These results, along with higher efficiency and visual coherence across diverse domains, position our approach as a strong candidate for applications in forensics, digital art restoration, and general sketch-based image synthesis.
[CV-61] Prune Redundancy Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity ICLR2026
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)因生成过多视觉标记(visual tokens)而导致的计算效率低下问题。现有压缩方法难以在保留重要信息与维持信息多样性之间取得平衡。其解决方案的关键在于提出一种无需训练的协同重要性-多样性方法 PruneSID,包含两个核心阶段:首先通过主语义成分分析(Principal Semantic Components Analysis, PSCA)将视觉标记聚类为语义一致的组别,确保概念覆盖全面;其次在组内采用非极大值抑制(Intra-group Non-Maximum Suppression, NMS)机制剔除冗余标记并保留关键代表性标记。此外,PruneSID 引入基于图像复杂度的信息感知动态压缩率机制,实现跨场景的平均信息保真度优化,从而在极低token保留率下仍保持高精度性能,且显著提升推理速度。
链接: https://arxiv.org/abs/2603.09480
作者: Zhengyao Fang,Pengyuan Lyu,Chengquan Zhang,Guangming Lu,Jun Yu,Wenjie Pei
机构: Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳); Peng Cheng Laboratory(鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by ICLR2026
Abstract:Vision-language models (VLMs) face significant computational inefficiencies caused by excessive generation of visual tokens. While prior work shows that a large fraction of visual tokens are redundant, existing compression methods struggle to balance importance preservation and information diversity. To address this, we propose PruneSID, a training-free Synergistic Importance-Diversity approach featuring a two-stage pipeline: (1) Principal Semantic Components Analysis (PSCA) for clustering tokens into semantically coherent groups, ensuring comprehensive concept coverage, and (2) Intra-group Non-Maximum Suppression (NMS) for pruning redundant tokens while preserving key representative tokens within each group. Additionally, PruneSID incorporates an information-aware dynamic compression ratio mechanism that optimizes token compression rates based on image complexity, enabling more effective average information preservation across diverse scenes. Extensive experiments demonstrate state-of-the-art performance, achieving 96.3% accuracy on LLaVA-1.5 with only 11.1% token retention, and 92.8% accuracy at extreme compression rates (5.6%) on LLaVA-NeXT, outperforming prior methods by 2.5% with 7.8x faster prefilling speed compared to the original model. Our framework generalizes across diverse VLMs and both image and video modalities, showcasing strong cross-modal versatility. Code is available at this https URL.
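上述两阶段流程(先聚类成语义组保证概念覆盖,再组内按重要性保留代表 token)可用玩具代码示意如下;这里用首个主成分投影近似语义分组,与论文的 PSCA 并不等同,仅作说明:

```python
import numpy as np

def prune_tokens(tokens, importance, n_groups, keep_per_group=1):
    # 阶段一:沿首个主方向投影并排序,等分为语义上相近的组(PSCA 的粗略替代)
    centered = tokens - tokens.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[0]
    groups = np.array_split(np.argsort(proj), n_groups)
    # 阶段二:组内按重要性只保留代表 token(组内 NMS 的极简形式)
    kept = []
    for g in groups:
        top = g[np.argsort(importance[g])[::-1][:keep_per_group]]
        kept.extend(top.tolist())
    return sorted(kept)

rng = np.random.default_rng(2)
tokens = rng.normal(size=(32, 8))    # 32 个视觉 token,8 维特征
importance = rng.random(32)          # 每个 token 的重要性分数(示例假设)
kept = prune_tokens(tokens, importance, n_groups=8)
print(len(kept))  # 8:压缩率 75%,但每个"语义组"仍留有代表
```

与单纯按重要性取 top-k 相比,先分组再组内择优保证了保留的 token 覆盖所有概念簇,对应摘要强调的"重要性与多样性的协同"。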
[CV-62] OmniEarth: A Benchmark for Evaluating Vision-Language Models in Geospatial Tasks
【速读】:该论文旨在解决遥感视觉语言模型(Remote Sensing Vision-Language Models, RSVLMs)缺乏系统性评估基准的问题。现有研究虽已证明视觉语言模型(Vision-Language Models, VLMs)在通用任务中具备良好的感知与推理能力,但其在地球观测(Earth observation)场景下的性能仍缺乏全面、细粒度的量化评估工具。为填补这一空白,作者提出OmniEarth基准,其关键在于:(1)从感知、推理和鲁棒性三个维度构建28个细粒度任务,覆盖多源遥感数据与多样地理空间情境;(2)支持多种任务形式(多项选择VQA、开放式VQA及其文本、边界框、掩码输出),提升评估多样性;(3)引入盲测协议与五重语义一致性要求,有效降低语言偏差并检验模型是否基于视觉证据进行预测。该基准包含9,275张高质量图像(含吉林一号卫星影像)及44,210条人工验证指令,可系统评测对比学习模型、闭源/开源通用VLMs及专用RSVLMs,揭示当前模型在复杂地理空间任务中的显著性能差距。
链接: https://arxiv.org/abs/2603.09471
作者: Ronghao Fu,Haoran Liu,Weijie Zhang,Zhiwen Lin,Xiao Yang,Peng Zhang,Bo Yang
机构: Jilin University (吉林大学); Chang Guang Satellite Technology (长光卫星技术有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language Models (VLMs) have demonstrated effective perception and reasoning capabilities on general-domain tasks, leading to growing interest in their application to Earth observation. However, a systematic benchmark for comprehensively evaluating remote sensing vision-language models (RSVLMs) remains lacking. To address this gap, we introduce OmniEarth, a benchmark for evaluating RSVLMs under realistic Earth observation scenarios. OmniEarth organizes tasks along three capability dimensions: perception, reasoning, and robustness. It defines 28 fine-grained tasks covering multi-source sensing data and diverse geospatial contexts. The benchmark supports two task formulations: multiple-choice VQA and open-ended VQA. The latter includes pure text outputs for captioning tasks, bounding box outputs for visual grounding tasks, and mask outputs for segmentation tasks. To reduce linguistic bias and examine whether model predictions rely on visual evidence, OmniEarth adopts a blind test protocol and a quintuple semantic consistency requirement. OmniEarth includes 9,275 carefully quality-controlled images, including proprietary satellite imagery from Jilin-1 (JL-1), along with 44,210 manually verified instructions. We conduct a systematic evaluation of contrastive learning-based models, general closed-source and open-source VLMs, as well as RSVLMs. Results show that existing VLMs still struggle with geospatially complex tasks, revealing clear gaps that need to be addressed for remote sensing applications. OmniEarth is publicly available at this https URL.
[CV-63] The Patrologia Graeca Corpus: OCR Annotation and Open Release of Noisy Nineteenth-Century Polytonic Greek Editions
【速读】:该论文旨在解决十九世纪未数字化的古希腊语文献(特别是《Patrologia Graeca》系列)在光学字符识别(OCR)中存在的高噪声和复杂排版问题,这些文献采用希腊语与拉丁语双语混排且字体为高度退化的多调号希腊文(polytonic Greek)。解决方案的关键在于构建一个专用的处理流程,结合基于YOLO的版面检测模块与基于CRNN的文本识别模型,实现了1.05%的字符错误率(CER)和4.69%的词错误率(WER),显著优于现有针对多调号希腊文的OCR系统。该方案不仅生成了高质量的可检索、可分析的语料库,还为未来大语言模型(LLM)训练提供了宝贵的数据资源。
链接: https://arxiv.org/abs/2603.09470
作者: Chahan Vidal-Gorène(CJM, LIPN),Bastien Kindt
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present the Patrologia Graeca Corpus, the first large-scale open OCR and linguistic resource for nineteenth-century editions of Ancient Greek. The collection covers the remaining undigitized volumes of the Patrologia Graeca (PG), printed in complex bilingual (Greek-Latin) layouts and characterized by highly degraded polytonic Greek typography. Through a dedicated pipeline combining YOLO-based layout detection and CRNN-based text recognition, we achieve a character error rate (CER) of 1.05% and a word error rate (WER) of 4.69%, largely outperforming existing OCR systems for polytonic Greek. The resulting corpus contains around six million lemmatized and part-of-speech tagged tokens, aligned with full OCR and layout annotations. Beyond its philological value, this corpus establishes a new benchmark for OCR on noisy polytonic Greek and provides training material for future models, including LLMs.
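文中报告的 CER 与 WER 均基于标准 Levenshtein 编辑距离:CER 在字符序列上计算,WER 在词序列上计算,再除以参考文本长度。通用实现如下(与该项目的实际评测脚本无关):

```python
def edit_distance(ref, hyp):
    # 标准 Levenshtein 编辑距离,滚动一行的动态规划实现
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # 删除
                        dp[j - 1] + 1,                   # 插入
                        prev + (ref[i - 1] != hyp[j - 1]))  # 替换/匹配
            prev = cur
    return dp[n]

def cer(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)

def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / len(r)

print(round(cer("polytonic", "polytonik"), 3))  # 0.111:9 个字符中 1 处替换
print(wer("the patrologia graeca corpus", "the patrologia graeca korpus"))  # 0.25
```

对多调号希腊文而言,CER 对每个变音符号的识别错误都敏感,这也是该文字体系 OCR 难度高的原因之一。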
[CV-64] TopoOR: A Unified Topological Scene Representation for the Operating Room
【速读】:该论文旨在解决现有手术场景图(Surgical Scene Graphs)在建模手术室(OR)复杂关系时存在的严格二元结构限制问题,即传统方法依赖成对消息传递或分词序列会破坏关系结构的流形几何特性,导致信息丢失。其解决方案的关键在于提出TopoOR框架,通过将实体间交互提升至高阶拓扑单元(higher-order topological cells),天然保留成对与群体关系,从而以更高表达力建模手术室中的多模态动态特性;同时引入高阶注意力机制,在层级关系注意力中显式保持流形结构和模态特异性特征,避免将3D几何、音频与机器人运动学融合为单一潜在表示,从而保障安全关键推理所需的精确多模态结构。
链接: https://arxiv.org/abs/2603.09466
作者: Tony Danjun Wang,Ka Young Kim,Tolga Birdal,Nassir Navab,Lennart Bastian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Surgical Scene Graphs abstract the complexity of surgical operating rooms (OR) into a structure of entities and their relations, but existing paradigms suffer from strictly dyadic structural limitations. Frameworks that predominantly rely on pairwise message passing or tokenized sequences flatten the manifold geometry inherent to relational structures and lose structure in the process. We introduce TopoOR, a new paradigm that models multimodal operating rooms as a higher-order structure, innately preserving pairwise and group relationships. By lifting interactions between entities into higher-order topological cells, TopoOR natively models complex dynamics and multimodality present in the OR. This topological representation subsumes traditional scene graphs, thereby offering strictly greater expressivity. We also propose a higher-order attention mechanism that explicitly preserves manifold structure and modality-specific features throughout hierarchical relational attention. In this way, we circumvent combining 3D geometry, audio, and robot kinematics into a single joint latent representation, preserving the precise multimodal structure required for safety-critical reasoning, unlike existing methods. Extensive experiments demonstrate that our approach outperforms traditional graph and LLM-based baselines across sterility breach detection, robot phase prediction, and next-action anticipation.
[CV-65] EvoDriveVLA: Evolving Autonomous Driving Vision-Language-Action Model via Collaborative Perception-Planning Distillation
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在自动驾驶任务中因解冻视觉编码器后感知性能下降以及长期规划中累积不稳定性的问题。解决方案的关键在于提出EvoDriveVLA框架,其核心是协同感知与规划的蒸馏机制:一方面通过自锚定感知蒸馏(self-anchored perceptual distillation),利用自锚定教师模型提供视觉约束,引导学生模型关注轨迹引导的关键区域以稳定表征;另一方面通过Oracle引导的轨迹蒸馏(oracle-guided trajectory distillation),借助未来感知的教师模型进行粗到细的轨迹优化,并结合蒙特卡洛丢弃采样生成高质量候选轨迹,从而选择最优轨迹指导学生预测。该方法显著提升了开环和闭环评估下的性能表现。
链接: https://arxiv.org/abs/2603.09465
作者: Jiajun Cao,Xiaoan Zhang,Xiaobao Wei,Liyuqiu Huang,Wang Zijian,Hanzhen Zhang,Zhengyu Jia,Wei Mao,Hao Wang,Xianming Liu,Shuchang Zhou Liu,Yang Wang,Shanghang Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 5 figures
Abstract:Vision-Language-Action models have shown great promise for autonomous driving, yet they suffer from degraded perception after unfreezing the visual encoder and struggle with accumulated instability in long-term planning. To address these challenges, we propose EvoDriveVLA, a novel collaborative perception-planning distillation framework that integrates self-anchored perceptual constraints and oracle-guided trajectory optimization. Specifically, self-anchored visual distillation leverages a self-anchor teacher to deliver visual anchoring constraints, regularizing student representations via trajectory-guided key-region awareness. In parallel, oracle-guided trajectory distillation employs a future-aware oracle teacher with coarse-to-fine trajectory refinement and Monte Carlo dropout sampling to produce high-quality trajectory candidates, thereby selecting the optimal trajectory to guide the student’s prediction. EvoDriveVLA achieves SOTA performance in open-loop evaluation and significantly enhances performance in closed-loop evaluation. Our code is available at: this https URL.
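摘要中"蒙特卡洛 dropout 采样生成候选轨迹、再由 oracle 教师择优"的环节可抽象为如下草图(预测函数与打分函数均为玩具替代,真实系统中分别对应带 dropout 的学生前向与未来感知教师,名称均为示例假设):

```python
import numpy as np

def mc_dropout_candidates(predict, x, n_samples, rng):
    # 多次带随机扰动的前向,得到多条候选轨迹(MC dropout 的抽象)
    return [predict(x, rng) for _ in range(n_samples)]

def select_best(candidates, oracle_score):
    # 由 oracle(未来感知教师)打分,选出最优候选去指导学生
    scores = [oracle_score(c) for c in candidates]
    return candidates[int(np.argmax(scores))]

rng = np.random.default_rng(4)
target = np.linspace(0.0, 1.0, 8)  # 假设的真值轨迹(8 个路点)
predict = lambda x, r: x + r.normal(scale=0.1, size=x.shape)  # dropout 扰动的玩具替代
cands = mc_dropout_candidates(predict, target, n_samples=16, rng=rng)
best = select_best(cands, lambda c: -np.linalg.norm(c - target))
print(np.linalg.norm(best - target) <= min(np.linalg.norm(c - target) for c in cands))  # True
```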
[CV-66] A Guideline-Aware AI Agent for Zero-Shot Target Volume Auto-Delineation MICCAI2026
【速读】:该论文旨在解决放射治疗中临床靶区(Clinical Target Volume, CTV)勾画依赖专家标注数据、难以随临床指南更新而快速调整的问题。传统深度学习模型需耗费大量成本重新训练以适应新指南,限制了其在实际临床中的灵活性与可扩展性。解决方案的关键在于提出一种名为OncoAgent的新型“指南感知”AI代理框架,该框架无需任何训练即可将文本形式的临床指南直接转化为三维靶区轮廓,实现了零样本(zero-shot)映射。其核心创新在于将医学指南规则结构化并嵌入到生成式代理逻辑中,从而在不依赖标注数据的前提下实现高精度(Dice系数达0.842)且符合临床规范的CTV和计划靶区(Planning Target Volume, PTV)自动勾画,并展现出对不同器官部位(如前列腺)和指南版本的良好泛化能力,显著提升了放射治疗规划的适应性、透明度与临床接受度。
链接: https://arxiv.org/abs/2603.09448
作者: Yoon Jo Kim,Wonyoung Cho,Jongmin Lee,Han Joo Chae,Hyunki Park,Sang Hoon Seo,Noh Jae Myung,Kyungmi Yang,Dongryul Oh,Jin Sung Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Submitted to MICCAI 2026
Abstract:Delineating the clinical target volume (CTV) in radiotherapy involves complex margins constrained by tumor location and anatomical barriers. While deep learning models automate this process, their rigid reliance on expert-annotated data requires costly retraining whenever clinical guidelines update. To overcome this limitation, we introduce OncoAgent, a novel guideline-aware AI agent framework that seamlessly converts textual clinical guidelines into three-dimensional target contours in a training-free manner. Evaluated on esophageal cancer cases, the agent achieves a zero-shot Dice similarity coefficient of 0.842 for the CTV and 0.880 for the planning target volume, demonstrating performance highly comparable to a fully supervised nnU-Net baseline. Notably, in a blinded clinical evaluation, physicians strongly preferred OncoAgent over the supervised baseline, rating it higher in guideline compliance, modification effort, and clinical acceptability. Furthermore, the framework generalizes zero-shot to alternative esophageal guidelines and other anatomical sites (e.g., prostate) without any retraining. Beyond mere volumetric overlap, our agent-based paradigm offers near-instantaneous adaptability to alternative guidelines, providing a scalable and transparent pathway toward interpretability in radiotherapy treatment planning.
[CV-67] GIIM: Graph-based Learning of Inter- and Intra-view Dependencies for Multi-view Medical Image Diagnosis AAAI-26 AAAI
【速读】:该论文旨在解决当前多视角计算机辅助诊断(CADx)系统在处理医学影像时,难以准确建模异常病灶在单个视图内的相互依赖关系以及跨视图动态变化的问题,同时应对临床中常见的数据缺失挑战。其解决方案的关键在于提出一种基于图结构的新型框架GIIM,该框架能够同步捕捉病灶间的 intra-view 依赖关系和 inter-view 动态演化特性,并通过专门设计的数据缺失处理机制提升诊断鲁棒性,从而显著增强诊断准确性和可靠性。
链接: https://arxiv.org/abs/2603.09446
作者: Tran Bao Sam,Hung Vu,Dao Trung Kien,Tran Dat Dang,Van Ha Tang,Steven Truong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To appear in the 40th AAAI Conference on Artificial Intelligence (AAAI-26). 10 pages, 2 figures
Abstract:Computer-aided diagnosis (CADx) has become vital in medical imaging, but automated systems often struggle to replicate the nuanced process of clinical interpretation. Expert diagnosis requires a comprehensive analysis of how abnormalities relate to each other across various views and time points, but current multi-view CADx methods frequently overlook these complex dependencies. Specifically, they fail to model the crucial relationships within a single view and the dynamic changes lesions exhibit across different views. This limitation, combined with the common challenge of incomplete data, greatly reduces their predictive reliability. To address these gaps, we reframe the diagnostic task as one of relationship modeling and propose GIIM, a novel graph-based approach. Our framework is uniquely designed to simultaneously capture both critical intra-view dependencies between abnormalities and inter-view dynamics. Furthermore, it ensures diagnostic robustness by incorporating specific techniques to effectively handle missing data, a common clinical issue. We demonstrate the generality of this approach through extensive evaluations on diverse imaging modalities, including CT, MRI, and mammography. The results confirm that our GIIM model significantly enhances diagnostic accuracy and robustness over existing methods, establishing a more effective framework for future CADx systems.
[CV-68] Open-World Motion Forecasting
【速读】:该论文旨在解决传统运动预测(motion forecasting)方法在开放世界场景中性能下降的问题,即现有方法通常假设对象类别固定且感知质量高,难以应对现实环境中感知不完善和对象类别随时间动态演化的挑战。其解决方案的关键在于提出首个端到端的类增量运动预测框架,通过伪标签策略结合视觉-语言模型过滤不一致与过度自信的预测,并引入基于查询特征方差的重放采样策略来缓解灾难性遗忘,从而在持续学习新类别的同时保持对旧类别的性能。
链接: https://arxiv.org/abs/2603.09420
作者: Nicolas Schischka,Nikhil Gosala,B Ravi Kiran,Senthil Yogamani,Abhinav Valada
机构: University of Freiburg (弗莱堡大学); IIT Delhi (印度理工学院德里分校); Samsung Research (三星研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Motion forecasting aims to predict the future trajectories of dynamic agents in the scene, enabling autonomous vehicles to effectively reason about scene evolution. Existing approaches operate under the closed-world regime and assume fixed object taxonomy as well as access to high-quality perception. Therefore, they struggle in real-world settings where perception is imperfect and object taxonomy evolves over time. In this work, we bridge this fundamental gap by introducing open-world motion forecasting, a novel setting in which new object classes are sequentially introduced over time and future object trajectories are estimated directly from camera images. We tackle this setting by proposing the first end-to-end class-incremental motion forecasting framework to mitigate catastrophic forgetting while simultaneously learning to forecast newly introduced classes. When a new class is introduced, our framework employs a pseudo-labeling strategy to first generate motion forecasting pseudo-labels for all known classes which are then processed by a vision-language model to filter inconsistent and over-confident predictions. Parallelly, our approach further mitigates catastrophic forgetting by using a novel replay sampling strategy that leverages query feature variance to sample previous sequences with informative motion patterns. Extensive evaluation on the nuScenes and Argoverse 2 datasets demonstrates that our approach successfully resists catastrophic forgetting and maintains performance on previously learned classes while improving adaptation to novel ones. Further, we demonstrate that our approach supports zero-shot transfer to real-world driving and naturally extends to end-to-end class-incremental planning, enabling continual adaptation of the full autonomous driving system. We provide the code at this https URL .
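摘要中"基于 query 特征方差的重放采样"思路可做如下简化演示:方差大的序列被视为运动模式信息量更大的样本,优先进入重放缓冲区(以整条序列特征的总体方差作为统计量,属于示例假设,论文中的具体统计方式可能不同):

```python
import numpy as np

def select_replay(query_feats, k):
    # 按每条序列 query 特征的方差降序排序,取前 k 条作为重放样本
    variances = query_feats.var(axis=(1, 2))
    return np.argsort(variances)[::-1][:k]

rng = np.random.default_rng(3)
feats = rng.normal(scale=0.1, size=(10, 6, 4))   # 10 条序列,每条 6 个 query、4 维特征
feats[7] = rng.normal(scale=2.0, size=(6, 4))    # 第 7 条序列方差显著更大
chosen = select_replay(feats, k=3)
print(7 in chosen.tolist())  # True:高方差序列被优先重放
```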
[CV-69] MetaDAT: Generalizable Trajectory Prediction via Meta Pre-training and Data-Adaptive Test-Time Updating ICRA2026
【速读】:该论文旨在解决轨迹预测(trajectory prediction)模型在测试时分布偏移(distribution shift)下性能显著下降的问题,以及现有测试时训练(test-time training)方法依赖固定预训练模型和静态更新规则、缺乏在线学习灵活性与数据适应性的问题。解决方案的关键在于:首先提出一种元学习框架,在预训练阶段通过双层优化模拟测试时适应任务,使模型具备快速准确的在线适应能力;其次在测试阶段引入数据自适应的模型更新机制,基于在线部分导数和难样本选择动态调整学习率与更新频率,从而提升对测试数据的适配效率与鲁棒性。
链接: https://arxiv.org/abs/2603.09419
作者: Yuning Wang,Pu Zhang,Yuan He,Ke Wang,Jianru Xue
机构: State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, IAIR, Xi’an Jiaotong University (西安交通大学); KargoBot
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICRA 2026
Abstract:Existing trajectory prediction methods exhibit significant performance degradation under distribution shifts during test time. Although test-time training techniques have been explored to enable adaptation, current approaches rely on an offline pre-trained predictor that lacks online learning flexibility. Moreover, they depend on fixed online model updating rules that do not accommodate the specific characteristics of test data. To address these limitations, we first propose a meta-learning framework to directly optimize the predictor for fast and accurate online adaptation, which performs bi-level optimization on the performance of simulated test-time adaptation tasks during pre-training. Furthermore, at test time, we introduce a data-adaptive model updating mechanism that dynamically adjusts the predefined learning rates and updating frequencies based on online partial derivatives and hard sample selection. This mechanism enables the online learning rate to suit the test data, and focuses on informative hard samples to enhance efficiency. Experiments are conducted on various challenging cross-dataset distribution shift scenarios, including nuScenes, Lyft, and Waymo. Results demonstrate that our method achieves superior adaptation accuracy, surpassing state-of-the-art test-time training methods for trajectory prediction. Additionally, our method excels under suboptimal learning rates and high FPS demands, showcasing its robustness and practicality.
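下面用一个极简的纯Python示例示意"数据自适应在线更新"的思路:学习率随在线梯度幅值缩放,且仅对误差高于当前均值的困难样本执行更新。其中的一维线性预测器、阈值规则与各超参数均为说明性假设,并非论文的实际实现。

```python
# 说明性示意(非论文实现):学习率按在线梯度幅值自适应缩放,
# 且仅在"困难样本"(误差高于当前平均误差)上更新参数。
def adapt_online(w, stream, base_lr=0.1):
    errors = []
    for x, y in stream:
        err = w * x - y
        errors.append(abs(err))
        threshold = sum(errors) / len(errors)  # 在线困难样本阈值(假设)
        if abs(err) < threshold:
            continue                            # 跳过简单样本
        grad = 2 * err * x                      # d/dw (w*x - y)^2
        lr = base_lr / (1.0 + abs(grad))        # 数据自适应学习率(假设形式)
        w -= lr * grad
    return w

stream = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (1.5, 3.0)]
w = adapt_online(0.0, stream)
print(w > 0.0)
```

可以看到,最后一个样本因误差低于在线阈值而被跳过,其余样本以随梯度幅值衰减的步长更新参数。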
[CV-70] CIGPose: Causal Intervention Graph Neural Network for Whole-Body Pose Estimation CVPR2026
【速读】:该论文旨在解决当前全身姿态估计器在复杂场景下缺乏鲁棒性的问题,即模型容易产生解剖学上不合理的预测结果。研究表明,这一问题源于模型从视觉上下文中学习到的虚假相关性(spurious correlations),作者通过结构因果模型(Structural Causal Model, SCM)形式化地识别出视觉上下文为混杂因子(confounder),并由此导致非因果后门路径污染了模型推理过程。解决方案的关键在于提出Causal Intervention Graph Pose (CIGPose)框架,其核心是一个新颖的因果干预模块(Causal Intervention Module):该模块首先利用预测不确定性识别受混杂影响的关键点表示,再用学习得到的与上下文无关的规范嵌入(context-invariant canonical embeddings)进行替换,从而获得去混杂的特征表示;随后,这些嵌入由分层图神经网络处理,在局部和全局语义层次上推理人体骨骼结构,以强制解剖学合理性。实验表明,CIGPose在COCO-WholeBody数据集上达到新的最先进性能,且无需额外训练数据即可实现67.0% AP,结合UBody数据集进一步提升至67.5% AP,验证了其优越的鲁棒性和数据效率。
链接: https://arxiv.org/abs/2603.09418
作者: Bohao Li,Zhicheng Cao,Huixian Li,Yangming Guo
机构: Northwestern Polytechnical University (西北工业大学); Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The paper is accepted by CVPR 2026
Abstract:State-of-the-art whole-body pose estimators often lack robustness, producing anatomically implausible predictions in challenging scenes. We posit this failure stems from spurious correlations learned from visual context, a problem we formalize using a Structural Causal Model (SCM). The SCM identifies visual context as a confounder that creates a non-causal backdoor path, corrupting the model’s reasoning. We introduce the Causal Intervention Graph Pose (CIGPose) framework to address this by approximating the true causal effect between visual evidence and pose. The core of CIGPose is a novel Causal Intervention Module: it first identifies confounded keypoint representations via predictive uncertainty and then replaces them with learned, context-invariant canonical embeddings. These deconfounded embeddings are processed by a hierarchical graph neural network that reasons over the human skeleton at both local and global semantic levels to enforce anatomical plausibility. Extensive experiments show CIGPose achieves a new state-of-the-art on COCO-WholeBody. Notably, our CIGPose-x model achieves 67.0% AP, surpassing prior methods that rely on extra training data. With the additional UBody dataset, CIGPose-x is further boosted to 67.5% AP, demonstrating superior robustness and data efficiency. The codes and models are publicly available at this https URL.
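其中"因果干预"这一步的骨架可以用如下示意代码表达:不确定度超过阈值的关键点表示被视为受混杂影响,替换为与上下文无关的规范嵌入。阈值与各嵌入数值均为演示用占位,并非论文实际参数。

```python
# 说明性示意:按预测不确定度筛选受混杂影响的关键点表示,
# 并用规范嵌入(canonical embeddings)替换。数值均为占位假设。
def deconfound(embeddings, uncertainties, canonical, tau=0.5):
    return [canonical[i] if u > tau else e
            for i, (e, u) in enumerate(zip(embeddings, uncertainties))]

emb = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]]
unc = [0.1, 0.7, 0.3]                        # 仅第二个关键点超过阈值
canon = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
out = deconfound(emb, unc, canon)
print(out[1])  # → [1.0, 0.0],被替换为规范嵌入
```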
[CV-71] PromptDLA: A Domain-aware Prompt Document Layout Analysis Framework with Descriptive Knowledge as a Cue
【速读】:该论文旨在解决现有文档布局分析(Document Layout Analysis, DLA)模型在融合多源数据集进行训练时性能不佳的问题,其核心在于忽视了不同领域间固有的布局结构差异,如标注风格、文档类型和语言等。解决方案的关键是提出 PromptDLA,一种面向领域的提示生成器(domain-aware Prompter),通过根据数据域的特定属性定制提示(prompt),将描述性知识作为线索引入 DLA 任务中,从而引导模型聚焦于关键特征与结构,显著提升跨域泛化能力。
链接: https://arxiv.org/abs/2603.09414
作者: Zirui Zhang,Yaping Zhang,Lu Xiang,Yang Zhao,Feifei Zhai,Yu Zhou,Chengqing Zong
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); University of the Chinese Academy of Sciences (中国科学院大学); Fanyu AI Laboratory (凡语人工智能实验室); Zhongke Fanyu Technology Co., Ltd. (中科凡语科技有限公司); Columbia University (哥伦比亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by IEEE TMM
Abstract:Document Layout Analysis (DLA) is crucial for document artificial intelligence and has recently received increasing attention, resulting in an influx of large-scale public DLA datasets. Existing work often combines data from various domains in recent public DLA datasets to improve the generalization of DLA. However, directly merging these datasets for training often results in suboptimal model performance, as it overlooks the different layout structures inherent to various domains. These variations include different labeling styles, document types, and languages. This paper introduces PromptDLA, a domain-aware Prompter for Document Layout Analysis that effectively leverages descriptive knowledge as cues to integrate domain priors into DLA. The innovative PromptDLA features a unique domain-aware prompter that customizes prompts based on the specific attributes of the data domain. These prompts then serve as cues that direct the DLA toward critical features and structures within the data, enhancing the model’s ability to generalize across varied domains. Extensive experiments show that our proposal achieves state-of-the-art performance among DocLayNet, PubLayNet, M6Doc, and D^4LA. Our code is available at this https URL.
[CV-72] RiO-DETR: DETR for Real-time Oriented Object Detection
【速读】:该论文旨在解决将DETR(Detection Transformer)框架扩展至实时定向目标检测(Real-time Oriented Object Detection)时面临的三大挑战:语义依赖的朝向估计、角度周期性导致的标准欧氏精化失效,以及因搜索空间扩大而引起的收敛速度下降。其核心解决方案包括三项任务原生设计:首先,提出基于内容驱动的角度估计方法(Content-Driven Angle Estimation),通过解耦角度与位置查询,并引入旋转校正正交注意力(Rotation-Rectified Orthogonal Attention)以捕获互补特征提升朝向可靠性;其次,采用解耦周期精化机制(Decoupled Periodic Refinement),结合有界粗到精更新与最短路径周期损失(Shortest-Path Periodic Loss)实现跨角度边界稳定学习;最后,引入定向密集O2O监督(Oriented Dense O2O)在不增加额外计算成本的前提下注入角度多样性,加速角度收敛。这些设计共同实现了实时性能下更优的精度-速度权衡。
链接: https://arxiv.org/abs/2603.09411
作者: Zhangchi Hu,Yifan Zhao,Yansong Peng,Wenzhang Sun,Xiangchen Yin,Jie Chen,Peixi Wu,Hebei Li,Xinghao Wang,Dongsheng Jiang,Xiaoyan Sun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 30 pages, 9 figures
Abstract:We present RiO-DETR: DETR for Real-time Oriented Object Detection, the first real-time oriented detection transformer to the best of our knowledge. Adapting DETR to oriented bounding boxes (OBBs) poses three challenges: semantics-dependent orientation, angle periodicity that breaks standard Euclidean refinement, and an enlarged search space that slows convergence. RiO-DETR resolves these issues with task-native designs while preserving real-time efficiency. First, we propose Content-Driven Angle Estimation by decoupling angle from positional queries, together with Rotation-Rectified Orthogonal Attention to capture complementary cues for reliable orientation. Second, Decoupled Periodic Refinement combines bounded coarse-to-fine updates with a Shortest-Path Periodic Loss for stable learning across angular seams. Third, Oriented Dense O2O injects angular diversity into dense supervision to speed up angle convergence at no extra cost. Extensive experiments on DOTA-1.0, DIOR-R, and FAIR-1M-2.0 demonstrate RiO-DETR establishes a new speed–accuracy trade-off for real-time oriented detection. Code will be made publicly available.
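其中"最短路径周期损失"针对的角度周期性问题,可以用下面的最短环形距离示意(假设定向框角度周期为π,实际损失形式以论文为准):

```python
import math

# 周期角度的最短路径距离:避免回归时在角度接缝处"绕远路"。
def periodic_angle_error(pred, target, period=math.pi):
    diff = (pred - target) % period
    return min(diff, period - diff)

# 接缝附近:179° 与 1° 在周期 180° 下仅相差 2°
a = periodic_angle_error(math.radians(179), math.radians(1))
print(round(math.degrees(a), 1))  # → 2.0
```

普通欧氏差值会给出178°的巨大误差,而周期化距离正确地给出2°,这正是"跨角度边界稳定学习"所需的性质。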
[CV-73] Reviving ConvNeXt for Efficient Convolutional Diffusion Models CVPR2026
【速读】:该论文旨在解决当前扩散模型(Diffusion Models)过度依赖Transformer架构所带来的计算效率低下问题,特别是其在局部性偏差(locality bias)、参数效率和硬件友好性方面相较于卷积神经网络(ConvNets)的不足。解决方案的关键在于提出一种全卷积扩散模型(Fully Convolutional Diffusion Model, FCDM),其骨干网络设计借鉴了ConvNeXt的现代卷积结构,但专为条件扩散建模优化。实验表明,FCDM-XL仅需DiT-XL/2一半的浮点运算量(FLOPs),即可在256×256和512×512分辨率下分别减少7倍和7.5倍的训练步数,并可在4-GPU系统上高效训练,验证了现代卷积设计在扩散模型中的高效率与竞争力。
链接: https://arxiv.org/abs/2603.09408
作者: Taesung Kwon,Lorenzo Bianchi,Lennart Wittke,Felix Watine,Fabio Carrara,Jong Chul Ye,Romann Weber,Vinicius Azevedo
机构: KAIST(韩国科学技术院); ETH Zürich(苏黎世联邦理工学院); ISTI-CNR(意大利国家研究委员会信息科学与技术研究所); University of Pisa(比萨大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: CVPR 2026. Official implementation: this https URL
Abstract:Recent diffusion models increasingly favor Transformer backbones, motivated by the remarkable scalability of fully attentional architectures. Yet the locality bias, parameter efficiency, and hardware friendliness–the attributes that established ConvNets as the efficient vision backbone–have seen limited exploration in modern generative modeling. Here we introduce the fully convolutional diffusion model (FCDM), a model having a backbone similar to ConvNeXt, but designed for conditional diffusion modeling. We find that using only 50% of the FLOPs of DiT-XL/2, FCDM-XL achieves competitive performance with 7× and 7.5× fewer training steps at 256×256 and 512×512 resolutions, respectively. Remarkably, FCDM-XL can be trained on a 4-GPU system, highlighting the exceptional training efficiency of our architecture. Our results demonstrate that modern convolutional designs provide a competitive and highly efficient alternative for scaling diffusion models, reviving ConvNeXt as a simple yet powerful building block for efficient generative modeling.
[CV-74] YOLO-NAS-Bench: A Surrogate Benchmark with Self-Evolving Predictors for YOLO Architecture Search
【速读】:该论文旨在解决目标检测领域神经架构搜索(Neural Architecture Search, NAS)因评估成本过高而受限的问题,尤其是在YOLO系列检测器中,完整训练每个候选架构在COCO数据集上需耗费数天GPU时间;同时,现有NAS基准主要面向图像分类任务,缺乏适用于目标检测的可比性评估标准。解决方案的关键在于提出首个专为YOLO风格检测器设计的代理基准YOLO-NAS-Bench,其定义了涵盖主干(backbone)和颈部(neck)模块的通道宽度、块深度及操作类型等多维搜索空间,并通过随机、分层和拉丁超立方采样策略生成1,000个架构,在COCO-mini上训练后构建LightGBM代理预测器。进一步地,提出自进化机制(Self-Evolving Mechanism),利用预测器自身迭代发现并评估高绩效架构,逐步调整训练分布以逼近性能前沿,最终将集成预测器的R²从0.770提升至0.815,稀疏肯德尔相关系数(Sparse Kendall Tau)从0.694提升至0.752,显著增强预测准确性和排序一致性,从而有效支撑高效进化搜索并发现超越官方YOLOv8–YOLO12基线的高性能架构。
链接: https://arxiv.org/abs/2603.09405
作者: Zhe Li,Xiaoyu Ding,Jiaxin Zheng,Yongtao Wang
机构: Wangxuan Institute of Computer Technology, Peking University, China (北京大学王选计算机研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Neural Architecture Search (NAS) for object detection is severely bottlenecked by high evaluation cost, as fully training each candidate YOLO architecture on COCO demands days of GPU time. Meanwhile, existing NAS benchmarks largely target image classification, leaving the detection community without a comparable benchmark for NAS evaluation. To address this gap, we introduce YOLO-NAS-Bench, the first surrogate benchmark tailored to YOLO-style detectors. YOLO-NAS-Bench defines a search space spanning channel width, block depth, and operator type across both backbone and neck, covering the core modules of YOLOv8 through YOLO12. We sample 1,000 architectures via random, stratified, and Latin Hypercube strategies, train them on COCO-mini, and build a LightGBM surrogate predictor. To sharpen the predictor in the high-performance regime most relevant to NAS, we propose a Self-Evolving Mechanism that progressively aligns the predictor’s training distribution with the high-performance frontier, by using the predictor itself to discover and evaluate informative architectures in each iteration. This method grows the pool to 1,500 architectures and raises the ensemble predictor’s R2 from 0.770 to 0.815 and Sparse Kendall Tau from 0.694 to 0.752, demonstrating strong predictive accuracy and ranking consistency. Using the final predictor as the fitness function for evolutionary search, we discover architectures that surpass all official YOLOv8-YOLO12 baselines at comparable latency on COCO-mini, confirming the predictor’s discriminative power for top-performing detection architectures.
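自进化机制的核心循环(用当前代理预测器挑选有希望的架构,仅对其做真实评估,再把结果加回训练池)可用如下玩具示例示意。其中的二维"架构"、玩具目标函数与最近邻"代理"均为演示假设;论文实际使用LightGBM回归器与COCO-mini上的真实训练结果。

```python
import random

# 玩具示意:代理预测器引导的自进化采样循环(设定均为演示假设)。
random.seed(0)

def true_score(arch):            # 代替"在 COCO-mini 上训练并评估"的昂贵过程
    w, d = arch
    return -(w - 0.7) ** 2 - (d - 0.3) ** 2

def surrogate(arch, pool):       # 用最近邻查表代替 LightGBM 代理(假设)
    nearest = min(pool, key=lambda a: (a[0]-arch[0])**2 + (a[1]-arch[1])**2)
    return pool[nearest]

archs = [(random.random(), random.random()) for _ in range(20)]
pool = {a: true_score(a) for a in archs}

for _ in range(5):               # 自进化迭代:预测、评估、扩充训练池
    cands = [(random.random(), random.random()) for _ in range(50)]
    best = max(cands, key=lambda a: surrogate(a, pool))
    pool[best] = true_score(best)

print(len(pool))  # 池从 20 增长到 25
```

关键点在于昂贵的 `true_score` 只对代理认为有希望的候选调用,训练分布因此逐步向高性能前沿偏移。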
[CV-75] ICDAR 2025 Competition on End-to-End Document Image Machine Translation Towards Complex Layouts ICDAR2025
【速读】:该论文旨在解决文档图像机器翻译(Document Image Machine Translation, DIMT)问题,即在不依赖传统光学字符识别(OCR)步骤的前提下,直接从文档图像中联合建模文本内容与版式布局,实现跨语言的端到端翻译。其核心挑战在于如何有效融合多模态信息(图像与语义),以应对复杂排版、噪声干扰及语言差异等问题。解决方案的关键在于提出一个统一的DIMT系统框架,并通过两个并行赛道(OCR-free和OCR-based)评估不同模型规模(小模型<1B参数与大模型>1B参数)下的性能表现,从而验证大规模模型在处理复杂布局文档图像时的优越性,为未来研究提供了新范式与方向。
链接: https://arxiv.org/abs/2603.09392
作者: Yaping Zhang,Yupu Liang,Zhiyang Zhang,Zhiyuan Chen,Lu Xiang,Yang Zhao,Yu Zhou,Chengqing Zong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: accepted by ICDAR 2025
Abstract:Document Image Machine Translation (DIMT) seeks to translate text embedded in document images from one language to another by jointly modeling both textual content and page layout, bridging optical character recognition (OCR) and natural language processing (NLP). The DIMT 2025 Challenge advances research on end-to-end document image translation, a rapidly evolving area within multimodal document understanding. The competition features two tracks, OCR-free and OCR-based, each with two subtasks for small (less than 1B parameters) and large (greater than 1B parameters) models. Participants submit a single unified DIMT system, with the option to incorporate provided OCR transcripts. Running from December 10, 2024 to April 20, 2025, the competition attracted 69 teams and 27 valid submissions in total. Track 1 had 34 teams and 13 valid submissions, while Track 2 had 35 teams and 14 valid submissions. In this report, we present the challenge motivation, dataset construction, task definitions, evaluation protocol, and a summary of results. Our analysis shows that large-model approaches establish a promising new paradigm for translating complex-layout document images and highlight substantial opportunities for future research.
[CV-76] Training-Free Coverless Multi-Image Steganography with Access Control
【速读】:该论文旨在解决现有Coverless Image Steganography (CIS)方法在多用户场景下缺乏有效访问控制机制的问题,即难以实现针对不同授权用户的差异化内容隐藏与解密。其解决方案的关键在于提出一种无需训练的基于扩散模型(diffusion-based)的框架MIDAS,通过潜在空间融合(latent-level fusion)实现用户特定的访问控制;具体包括引入Random Basis机制以抑制残留结构信息,并设计Latent Vector Fusion模块对聚合潜在向量进行重构,使其与扩散过程对齐,从而在保持隐写图像质量、多样性和抗分析能力的同时,实现高效且可扩展的多图像隐藏与细粒度访问管理。
链接: https://arxiv.org/abs/2603.09390
作者: Minyeol Bae,Si-Hyeon Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Coverless Image Steganography (CIS) hides information without explicitly modifying a cover image, providing strong imperceptibility and inherent robustness to steganalysis. However, existing CIS methods largely lack robust access control, making it difficult to selectively reveal different hidden contents to different authorized users. Such access control is critical for scalable and privacy-sensitive information hiding in multi-user settings. We propose MIDAS, a training-free diffusion-based CIS framework that enables multi-image hiding with user-specific access control via latent-level fusion. MIDAS introduces a Random Basis mechanism to suppress residual structural information and a Latent Vector Fusion module that reshapes aggregated latents to align with the diffusion process. Experimental results demonstrate that MIDAS consistently outperforms existing training-free CIS baselines in access control functionality, stego image quality and diversity, robustness to noise, and resistance to steganalysis, establishing a practical and scalable approach to access-controlled coverless steganography.
[CV-77] EventVGGT: Exploring Cross-Modal Distillation for Consistent Event-based Depth Estimation
【速读】:该论文旨在解决事件相机(event camera)在单目深度估计中因缺乏密集深度标注而导致性能受限的问题。现有无标注方法虽通过从视觉基础模型(Vision Foundation Models, VFMs)蒸馏知识缓解了数据稀缺问题,但其将事件流视为独立帧处理,忽略了事件数据固有的时间连续性,导致深度预测在时序上不一致且精度不足。解决方案的关键在于提出EventVGGT框架,首次将视觉几何基础Transformer(Visual Geometry Grounded Transformer, VGGT)中的时空与多视角几何先验蒸馏至事件域,并通过三层次蒸馏策略实现:(i) 跨模态特征混合(Cross-Modal Feature Mixture, CMFM)在输出层融合RGB与事件特征生成辅助深度预测;(ii) 空间-时间特征蒸馏(Spatio-Temporal Feature Distillation, STFD)在特征层提取VGGT的时空表示;(iii) 时间一致性蒸馏(Temporal Consistency Distillation, TCD)在时序层对齐帧间深度变化以增强跨帧一致性。该方法显著提升了深度估计精度与零样本泛化能力。
链接: https://arxiv.org/abs/2603.09385
作者: Yinrui Ren,Jinjing Zhu,Kanghao Chen,Zhuoxiao Li,Jing Ou,Zidong Cao,Tongyan Hua,Peilun Shi,Yingchun Fu,Wufan Zhao,Hui Xiong
机构: HKUST(GZ) (香港科技大学(广州)); CUHK (香港中文大学); SCNU (华南师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Event cameras offer superior sensitivity to high-speed motion and extreme lighting, making event-based monocular depth estimation a promising approach for robust 3D perception in challenging conditions. However, progress is severely hindered by the scarcity of dense depth annotations. While recent annotation-free approaches mitigate this by distilling knowledge from Vision Foundation Models (VFMs), a critical limitation persists: they process event streams as independent frames. By neglecting the inherent temporal continuity of event data, these methods fail to leverage the rich temporal priors encoded in VFMs, ultimately yielding temporally inconsistent and less accurate depth predictions. To address this, we introduce EventVGGT, a novel framework that explicitly models the event stream as a coherent video sequence. To the best of our knowledge, we are the first to distill spatio-temporal and multi-view geometric priors from the Visual Geometry Grounded Transformer (VGGT) into the event domain. We achieve this via a comprehensive tri-level distillation strategy: (i) Cross-Modal Feature Mixture (CMFM) bridges the modality gap at the output level by fusing RGB and event features to generate auxiliary depth predictions; (ii) Spatio-Temporal Feature Distillation (STFD) distills VGGT’s powerful spatio-temporal representations at the feature level; and (iii) Temporal Consistency Distillation (TCD) enforces cross-frame coherence at the temporal level by aligning inter-frame depth changes. Extensive experiments demonstrate that EventVGGT consistently outperforms existing methods – reducing the absolute mean depth error at 30m by over 53% on EventScape (from 2.30 to 1.06) – while exhibiting robust zero-shot generalization on the unseen DENSE and MVSEC datasets.
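其中时间一致性蒸馏(TCD)"对齐帧间深度变化"的思路可用标量示意如下(以标量代替整幅深度图,此简化为演示假设,并非论文的精确损失形式):

```python
# TCD 思路示意:约束学生网络的帧间深度"变化量"与教师一致,
# 而不仅是逐帧对齐深度本身。
def tcd_loss(student_depths, teacher_depths):
    total = 0.0
    for t in range(1, len(student_depths)):
        d_s = student_depths[t] - student_depths[t - 1]
        d_t = teacher_depths[t] - teacher_depths[t - 1]
        total += abs(d_s - d_t)
    return total / (len(student_depths) - 1)

student = [10.0, 12.0, 15.0]   # 帧间变化:+2, +3
teacher = [9.5, 11.5, 13.5]    # 帧间变化:+2, +2
print(tcd_loss(student, teacher))  # → 0.5
```

注意两条序列的绝对深度不同但第一段变化量一致,损失只惩罚第二段不一致的变化,这正是"跨帧一致性"而非逐帧对齐。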
[CV-78] SinGeo: Unlock Single Models Potential for Robust Cross-View Geo-Localization
【速读】:该论文旨在解决跨视角地理定位(cross-view geo-localization, CVGL)中模型在未见过的视场角(field-of-view, FoV)和未知朝向下性能显著下降的问题。现有方法依赖于特定视场角的训练范式,在测试时对新FoV表现脆弱,需部署多个模型以覆盖多样性变化;而简单随机化FoV的动态训练策略未能实现真正鲁棒性,因其假设所有FoV难度均等。解决方案的关键在于提出SinGeo框架,其核心是双判别学习架构(dual discriminative learning architecture),通过增强地面与卫星分支内部的视图内区分能力,并首次引入课程学习(curriculum learning)策略,使单一模型能够适应多变条件下的CVGL任务。实验表明,SinGeo在四个基准数据集上达到SOTA性能,且在极端FoV场景下优于专门为此训练的方法,同时具备跨架构迁移能力。
链接: https://arxiv.org/abs/2603.09377
作者: Yang Chen,Xieyuanli Chen,Junxiang Li,Jie Tang,Tao Wu
机构: National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: v1
Abstract:Robust cross-view geo-localization (CVGL) remains challenging despite the surge in recent progress. Existing methods still rely on field-of-view (FoV)-specific training paradigms, where models are optimized under a fixed FoV but collapse when tested on unseen FoVs and unknown orientations. This limitation necessitates deploying multiple models to cover diverse variations. Although studies have explored dynamic FoV training by simply randomizing FoVs, they failed to achieve robustness across diverse conditions – implicitly assuming all FoVs are equally difficult. To address this gap, we present SinGeo, a simple yet powerful framework that enables a single model to realize robust cross-view geo-localization without additional modules or explicit transformations. SinGeo employs a dual discriminative learning architecture that enhances intra-view discriminability within both ground and satellite branches, and is the first to introduce a curriculum learning strategy to achieve robust CVGL. Extensive evaluations on four benchmark datasets reveal that SinGeo sets state-of-the-art (SOTA) results under diverse conditions, and notably outperforms methods specifically trained for extreme FoVs. Beyond superior performance, SinGeo also exhibits cross-architecture transferability. Furthermore, we propose a consistency evaluation method to quantitatively assess model stability under varying views, providing an explainable perspective for understanding and advancing robustness in future CVGL research. Codes will be available upon acceptance.
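论文未在摘要中给出课程学习的具体调度形式,这里给出一个假设性的线性课程示意:训练初期仅采样完整360°全景(最易),随训练进度逐步放宽到更窄(更难)的视场角区间。

```python
import random

# 假设性的线性 FoV 课程调度(边界与调度形式均为演示假设)。
def sample_fov(progress, easiest=360.0, hardest=70.0):
    """progress 取值 [0, 1]:可采样的 FoV 下界随进度逐步降低。"""
    current_min = easiest - (easiest - hardest) * progress
    return random.uniform(current_min, easiest)

random.seed(0)
early = sample_fov(0.0)   # 训练初期:恒为完整全景 360°
late = sample_fov(1.0)    # 训练后期:70° 到 360° 之间任意视场角
print(early, 70.0 <= late <= 360.0)
```

与简单随机化FoV相比,这种调度显式承认"窄视场更难"的事实,只在模型能力增长后才引入极端FoV。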
[CV-79] MIL-PF: Multiple Instance Learning on Precomputed Features for Mammography Classification
【速读】:该论文旨在解决将大规模预训练基础模型(foundation models)应用于高分辨率医学影像(如乳腺X线摄影,mammography)时面临的挑战,包括标注数据有限、弱监督信号以及端到端微调计算成本过高问题。其解决方案的关键在于提出一种名为“基于预计算特征的多实例学习”(Multiple Instance Learning on Precomputed Features, MIL-PF)的可扩展框架:该框架冻结基础编码器以保留其强大的视觉表征能力,仅训练一个轻量级的多实例学习(MIL)聚合头(仅40k参数),从而显著降低训练复杂度;同时通过注意力机制显式建模全局组织背景与局部病灶信号之间的关系,实现临床规模下的高性能分类任务。
链接: https://arxiv.org/abs/2603.09374
作者: Nikola Jovišić,Milica Škipina,Nicola Dall’Asen,Dubravko Ćulibrk
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 2 figures, 4 tables. Code will be released
Abstract:Modern foundation models provide highly expressive visual representations, yet adapting them to high-resolution medical imaging remains challenging due to limited annotations and weak supervision. Mammography, in particular, is characterized by large images, variable multi-view studies and predominantly breast-level labels, making end-to-end fine-tuning computationally expensive and often impractical. We propose Multiple Instance Learning on Precomputed Features (MIL-PF), a scalable framework that combines frozen foundation encoders with a lightweight MIL head for mammography classification. By precomputing the semantic representations and training only a small task-specific aggregation module (40k parameters), the method enables efficient experimentation and adaptation without retraining large backbones. The architecture explicitly models the global tissue context and the sparse local lesion signals through attention-based aggregation. MIL-PF achieves state-of-the-art classification performance at clinical scale while substantially reducing training complexity. We release the code for full reproducibility.
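其中"基于注意力的聚合"是标准的attention-MIL模式,可用如下纯Python示意(注意力打分在论文中由可训练的轻量模块给出,这里用固定数值代替):

```python
import math

# attention-MIL 聚合示意:对预计算的 patch 特征按 softmax 注意力加权求和。
def attention_pool(features, scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features))
            for d in range(dim)]

feats = [[1.0, 0.0], [0.0, 1.0]]
bag = attention_pool(feats, scores=[0.0, 0.0])  # 注意力相等时退化为均值
print(bag)  # → [0.5, 0.5]
```

由于编码器冻结、特征预计算,训练时只需反传这个小聚合头,这正是MIL-PF"40k参数"设计的由来。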
[CV-80] M3GCLR: Multi-View Mini-Max Infinite Skeleton-Data Game Contrastive Learning For Skeleton-Based Action Recognition
【速读】:该论文旨在解决自监督骨骼动作识别方法中存在的三大问题:视图差异建模不足、缺乏有效的对抗机制以及增强扰动不可控。其解决方案的核心在于提出一种基于博弈论的对比学习框架——多视角极小极大无限骨骼数据博弈对比学习(M3GCLR),关键创新包括:构建无限骨骼数据博弈(ISG)模型及其均衡定理,实现基于多视角互信息的极小极大优化;通过多视角旋转增强生成正常-极端数据对,并利用时间平均输入作为中性锚点以显式刻画扰动强度;进一步基于均衡定理设计强对抗性的极小极大骨骼数据博弈,促使模型挖掘更丰富的动作判别信息;最后引入双损失均衡优化器,使学习过程在最大化动作相关性的同时最小化编码冗余,且证明该优化器与ISG模型等价。
链接: https://arxiv.org/abs/2603.09367
作者: Yanshan Li,Ke Ma,Miaomiao Wei,Linhui Dai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:In recent years, contrastive learning has drawn significant attention as an effective approach to reducing reliance on labeled data. However, existing methods for self-supervised skeleton-based action recognition still face three major limitations: insufficient modeling of view discrepancies, lack of effective adversarial mechanisms, and uncontrollable augmentation perturbations. To tackle these issues, we propose the Multi-view Mini-Max infinite skeleton-data Game Contrastive Learning for skeleton-based action Recognition (M3GCLR), a game-theoretic contrastive framework. First, we establish the Infinite Skeleton-data Game (ISG) model and the ISG equilibrium theorem, and further provide a rigorous proof, enabling mini-max optimization based on multi-view mutual information. Then, we generate normal-extreme data pairs through multi-view rotation augmentation and adopt temporally averaged input as a neutral anchor to achieve structural alignment, thereby explicitly characterizing perturbation strength. Next, leveraging the proposed equilibrium theorem, we construct a strongly adversarial mini-max skeleton-data game to encourage the model to mine richer action-discriminative information. Finally, we introduce the dual-loss equilibrium optimizer to optimize the game equilibrium, allowing the learning process to maximize action-relevant information while minimizing encoding redundancy, and we prove the equivalence between the proposed optimizer and the ISG model. Extensive Experiments show that M3GCLR achieves three-stream 82.1%, 85.8% accuracy on NTU RGB+D 60 (X-Sub, X-View) and 72.3%, 75.0% accuracy on NTU RGB+D 120 (X-Sub, X-Set). On PKU-MMD Part I and II, it attains 89.1%, 45.2% in three-stream respectively, all results matching or outperforming state-of-the-art performance. Ablation studies confirm the effectiveness of each component.
[CV-81] Evidential Perfusion Physics-Informed Neural Networks with Residual Uncertainty Quantification
【速读】:该论文旨在解决急性缺血性卒中评估中CT灌注(CTP)成像的不适定去卷积问题,尤其关注现有物理信息神经网络(PINN)方法因缺乏不确定性量化而难以可靠评估结果的问题。其解决方案的关键在于提出Evidential Perfusion Physics-Informed Neural Networks (EPPINN),通过将证据深度学习(evidential deep learning)与物理信息建模相结合,实现灌注参数估计的不确定性感知;具体而言,EPPINN在物理残差上施加正态-逆伽马(Normal–Inverse–Gamma)先验分布,以无需贝叶斯采样或集成推断的方式,对每个体素的随机不确定性(aleatoric)和认知不确定性(epistemic)进行建模,并结合生理约束参数化和稳定策略提升单例优化的鲁棒性。
链接: https://arxiv.org/abs/2603.09359
作者: Junhyeok Lee,Minseo Choi,Han Jang,Young Hun Jeon,Heeseong Eum,Joon Jang,Chul-Ho Sohn,Kyu Sung Choi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Physics-informed neural networks (PINNs) have shown promise in addressing the ill-posed deconvolution problem in computed tomography perfusion (CTP) imaging for acute ischemic stroke assessment. However, existing PINN-based approaches remain deterministic and do not quantify uncertainty associated with violations of physics constraints, limiting reliability assessment. We propose Evidential Perfusion Physics-Informed Neural Networks (EPPINN), a framework that integrates evidential deep learning with physics-informed modeling to enable uncertainty-aware perfusion parameter estimation. EPPINN models arterial input, tissue concentration, and perfusion parameters using coordinate-based networks, and places a Normal–Inverse–Gamma distribution over the physics residual to characterize voxel-wise aleatoric and epistemic uncertainty in physics consistency without requiring Bayesian sampling or ensemble inference. The framework further incorporates physiologically constrained parameterization and stabilization strategies to promote robust per-case optimization. We evaluate EPPINN on digital phantom data, the ISLES 2018 benchmark, and a clinical cohort. On the evaluated datasets, EPPINN achieves lower normalized mean absolute error than classical deconvolution and PINN baselines, particularly under sparse temporal sampling and low signal-to-noise conditions, while providing conservative uncertainty estimates with high empirical coverage. On clinical data, EPPINN attains the highest voxel-level and case-level infarct-core detection sensitivity. These results suggest that evidential physics-informed learning can improve both accuracy and reliability of CTP analysis for time-critical stroke assessment.
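正态-逆伽马(NIG)分布的一个优势是不确定度有闭式分解。按证据回归的标准结果(假设EPPINN在物理残差上采用同样的分解):随机不确定性为 β/(α-1),认知不确定性为 β/(ν(α-1)),无需贝叶斯采样即可逐体素计算。

```python
# NIG 不确定度闭式分解(证据回归的标准公式,是否与论文完全一致为假设):
#   aleatoric = E[sigma^2] = beta / (alpha - 1)
#   epistemic = Var[mu]    = beta / (nu * (alpha - 1))
def nig_uncertainties(gamma, nu, alpha, beta):
    assert alpha > 1 and nu > 0, "矩存在要求 alpha > 1 且 nu > 0"
    aleatoric = beta / (alpha - 1)
    epistemic = beta / (nu * (alpha - 1))
    return gamma, aleatoric, epistemic

mean, alea, epis = nig_uncertainties(gamma=0.0, nu=2.0, alpha=3.0, beta=4.0)
print(mean, alea, epis)  # → 0.0 2.0 1.0
```

注意 ν 越大(证据越多)认知不确定性越小,而随机不确定性不变,这正是两类不确定性的预期行为。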
[CV-82] Robust Provably Secure Image Steganography via Latent Iterative Optimization ICASSP2026
【速读】:该论文旨在解决现有图像隐写术(Steganography)在面对图像压缩和各类图像处理操作时鲁棒性不足的问题,同时保持嵌入信息的可证明安全性(Provable Security)。其解决方案的关键在于提出一种基于潜在空间迭代优化(Latent-space Iterative Optimization)的框架:接收方将传输图像视为固定参考,通过迭代优化潜在变量以最小化重建误差,从而显著提升消息提取的准确性。该方法在不损害原有可证明安全性的前提下,大幅增强了系统对压缩和图像处理的鲁棒性,并可作为独立模块集成到其他可证明安全的隐写方案中,进一步提升整体鲁棒性。
链接: https://arxiv.org/abs/2603.09348
作者: Yanan Li,Zixuan Wang,Qiyang Xiao,Yanzhen Ren
机构: 未知
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted for presentation at the 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)
Abstract:We propose a robust and provably secure image steganography framework based on latent-space iterative optimization. Within this framework, the receiver treats the transmitted image as a fixed reference and iteratively refines a latent variable to minimize the reconstruction error, thereby improving message extraction accuracy. Unlike prior methods, our approach preserves the provable security of the embedding while markedly enhancing robustness under various compression and image processing scenarios. On benchmark datasets, the experimental results demonstrate that the proposed iterative optimization not only improves robustness against image compression while preserving provable security, but can also be applied as an independent module to further reinforce robustness in other provably secure steganographic schemes. This highlights the practicality and promise of latent-space optimization for building reliable, robust, and secure steganographic systems.
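接收端"潜在空间迭代优化"的骨架可以用一个标量玩具示例示意:把接收图像视为固定参考,对潜变量做梯度下降以最小化重建误差。这里用线性"解码器"代替扩散模型,步长与步数均为演示假设。

```python
# 玩具示意:接收端对潜变量 z 做迭代优化,使解码结果逼近固定参考。
def refine_latent(z, received, steps=100, lr=0.05):
    slope, bias = 3.0, 1.0               # 线性玩具解码器:x = 3z + 1
    for _ in range(steps):
        residual = (slope * z + bias) - received
        grad = 2 * residual * slope      # 经解码器的链式法则梯度
        z -= lr * grad
    return z

z = refine_latent(0.0, received=7.0)
print(round(z, 3))  # → 2.0,因为该解码器把 z = 2 映射到参考值 7
```

真实系统中解码器不可逆且带噪声,迭代优化起到的作用即是在噪声与压缩失真下仍把潜变量拉回可正确提取消息的区域。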
[CV-83] Predictive Spectral Calibration for Source-Free Test-Time Regression
【速读】:该论文旨在解决图像回归任务中测试时适应(Test-time Adaptation, TTA)研究不足的问题,尤其针对现有分类导向的TTA方法难以直接迁移至连续回归目标的局限性。其解决方案的关键在于提出一种无需源数据的预测谱校准(Predictive Spectral Calibration, PSC)框架,通过将子空间对齐扩展至块谱匹配机制,在源预测支持子空间内联合对齐目标特征,并在校准正交补空间中的残差谱松弛项的同时实现更精准的分布适配。此方法保持了简单易实现、模型无关且兼容预训练回归器的优势,在多个图像回归基准上显著优于强基线,尤其在严重分布偏移下表现突出。
链接: https://arxiv.org/abs/2603.09338
作者: Nguyen Viet Tuan Kiet,Huynh Thanh Trung,Pham Huy Hieu
机构: Hanoi University of Science and Technology (河内科学技术大学); VinUniversity (Vin大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Test-time adaptation (TTA) for image regression has received far less attention than its classification counterpart. Methods designed for classification often depend on classification-specific objectives and decision boundaries, making them difficult to transfer directly to continuous regression targets. Recent progress revisits regression TTA through subspace alignment, showing that simple source-guided alignment can be both practical and effective. Building on this line of work, we propose Predictive Spectral Calibration (PSC), a source-free framework that extends subspace alignment to block spectral matching. Instead of relying on a fixed support subspace alone, PSC jointly aligns target features within the source predictive support and calibrates residual spectral slack in the orthogonal complement. PSC remains simple to implement, model-agnostic, and compatible with off-the-shelf pretrained regressors. Experiments on multiple image regression benchmarks show consistent improvements over strong baselines, with particularly clear gains under severe distribution shifts.
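"在支持子空间内对齐、在正交补内校准松弛"的几何直觉可用二维示例示意:把目标特征投影到源基向量上保留支持分量,对正交残差乘以一个松弛系数而非直接丢弃。基向量与系数均为演示假设。

```python
# 二维示意:支持分量保留,正交残差按 slack 系数衰减(设定均为演示假设)。
def calibrate(feature, basis, slack=0.25):
    dot = sum(f * b for f, b in zip(feature, basis))   # basis 为单位向量
    parallel = [dot * b for b in basis]
    residual = [f - p for f, p in zip(feature, parallel)]
    return [p + slack * r for p, r in zip(parallel, residual)]

out = calibrate([2.0, 2.0], basis=[1.0, 0.0])
print(out)  # → [2.0, 0.5]:支持方向分量不变,离支持方向的分量被衰减
```

相比把目标特征完全压进固定支持子空间,保留一部分正交"松弛"允许模型在严重分布偏移下表达源域之外的变化。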
[CV-84] Beyond Scaling: Assessing Strategic Reasoning and Rapid Decision-Making Capability of LLMs in Zero-sum Environments
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在对抗性、时间敏感的交互环境中作为智能体(agent)时,其推理能力评估不足的问题。现有评测大多将推理视为一次性能力,忽视了对手感知决策、时间约束以及高压下的执行挑战。为应对这一问题,作者提出了战略战术推理(Strategic Tactical Agent Reasoning, STAR)基准,其核心创新在于构建了一个多智能体竞争环境,通过1v1零和博弈形式将推理建模为迭代且自适应的决策过程,并支持回合制与实时两种场景,从而统一评估长期战略规划与快速战术执行能力。STAR的关键在于其模块化架构与标准化API设计,使得实验可复现且任务灵活定制;同时引入战略评估套件(Strategic Evaluation Suite),不仅衡量胜负结果,还量化策略质量如执行效率和结果稳定性,揭示出“策略-执行”之间的显著差距:高推理能力模型在回合制中占优,但因推理延迟在实时场景中表现不佳,凸显了在动态竞争环境中及时行动能力的重要性。
链接: https://arxiv.org/abs/2603.09337
作者: Yang Li,Xing Chen,Yutao Liu,Gege Qi,Yanxian BI,Zizhe Wang,Yunjian Zhang,Yao Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code available
Abstract:Large Language Models (LLMs) have achieved strong performance on static reasoning benchmarks, yet their effectiveness as interactive agents operating in adversarial, time-sensitive environments remains poorly understood. Existing evaluations largely treat reasoning as a single-shot capability, overlooking the challenges of opponent-aware decision-making, temporal constraints, and execution under pressure. This paper introduces Strategic Tactical Agent Reasoning (STAR) Benchmark, a multi-agent evaluation framework that assesses LLMs through 1v1 zero-sum competitive interactions, framing reasoning as an iterative, adaptive decision-making process. STAR supports both turn-based and real-time settings, enabling controlled analysis of long-horizon strategic planning and fast-paced tactical execution within a unified environment. Built on a modular architecture with a standardized API and fully implemented execution engine, STAR facilitates reproducible evaluation and flexible task customization. To move beyond binary win-loss outcomes, we introduce a Strategic Evaluation Suite that assesses not only competitive success but also the quality of strategic behavior, such as execution efficiency and outcome stability. Extensive pairwise evaluations reveal a pronounced strategy-execution gap: while reasoning-intensive models dominate turn-based settings, their inference latency often leads to inferior performance in real-time scenarios, where faster instruction-tuned models prevail. These results show that strategic intelligence in interactive environments depends not only on reasoning depth, but also on the ability to translate plans into timely actions, positioning STAR as a principled benchmark for studying this trade-off in competitive, dynamic settings.
[CV-85] OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models CVPR2026
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在低级视觉感知任务中对细粒度视觉差异敏感性不足的问题,尤其是其在检测图像中微小视觉差异(如颜色、尺寸、旋转或位置变化)方面的性能远低于人类水平。解决方案的关键在于提出一个可控的基准测试工具OddGridBench和一种基于强化学习的训练框架OddGrid-GRPO:前者通过超过1400张网格图像系统化评估模型对视觉差异的敏感性;后者结合课程学习(curriculum learning)与距离感知奖励机制(distance-aware reward),通过逐步增加训练样本难度并引入空间邻近约束优化奖励设计,显著提升模型的细粒度视觉辨别能力。
链接: https://arxiv.org/abs/2603.09326
作者: Tengjin Weng,Wenhao Jiang,Jingyi Wang,Ming Li,Lin Ma,Zhong Ming
机构: Shenzhen University (深圳大学); Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) (广东省人工智能与数字经济发展实验室(深圳)); Shenzhen Technology University (深圳技术大学); Tsinghua Shenzhen International Graduate School (清华大学深圳国际研究生院); Meituan (美团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by CVPR 2026
Abstract:Multimodal large language models (MLLMs) have achieved remarkable performance across a wide range of vision language tasks. However, their ability in low-level visual perception, particularly in detecting fine-grained visual discrepancies, remains underexplored and lacks systematic analysis. In this work, we introduce OddGridBench, a controllable benchmark for evaluating the visual discrepancy sensitivity of MLLMs. OddGridBench comprises over 1,400 grid-based images, where a single element differs from all others by one or multiple visual attributes such as color, size, rotation, or position. Experiments reveal that all evaluated MLLMs, including open-source families such as Qwen3-VL and InternVL3.5, and proprietary systems like Gemini-2.5-Pro and GPT-5, perform far below human levels in visual discrepancy detection. We further propose OddGrid-GRPO, a reinforcement learning framework that integrates curriculum learning and distance-aware reward. By progressively controlling the difficulty of training samples and incorporating spatial proximity constraints into the reward design, OddGrid-GRPO significantly enhances the model’s fine-grained visual discrimination ability. We hope OddGridBench and OddGrid-GRPO will lay the groundwork for advancing perceptual grounding and visual discrepancy sensitivity in multimodal intelligence. Code and dataset are available at this https URL.
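OddGrid-GRPO 中"距离感知奖励"的思路可以用一个小函数示意:预测命中异常元素得满分,偏离越远奖励按空间距离衰减(以下为假设性的极简实现,高斯核与 `sigma` 参数均非论文原文设定):

```python
import math

def distance_aware_reward(pred, target, sigma=1.0):
    # pred / target: 预测与真实异常元素的网格坐标 (row, col)
    # 命中得 1.0, 偏差越大奖励按高斯核衰减, 体现"空间邻近约束"
    d = math.hypot(pred[0] - target[0], pred[1] - target[1])
    return math.exp(-d ** 2 / (2 * sigma ** 2))

print(distance_aware_reward((2, 3), (2, 3)))  # 精确命中 -> 1.0
print(distance_aware_reward((0, 0), (2, 3)))  # 偏离较远 -> 接近 0
```

相比 0/1 奖励,这种连续信号能为"接近但未命中"的预测提供梯度,配合课程学习逐步提高网格难度。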
[CV-86] SpaceSense-Bench: A Large-Scale Multi-Modal Benchmark for Spacecraft Perception and Pose Estimation
【速读】:该论文旨在解决空间自主操作(如在轨服务和主动碎片清除)中目标航天器的部件级语义理解与精确相对导航问题,其核心挑战在于轨道真实数据获取成本高、可用性差,且现有合成数据集存在目标多样性不足、单模态感知及标注不完整等问题。解决方案的关键在于构建一个大规模多模态基准数据集 SpaceSense-Bench,包含136个卫星模型、约70GB数据,每帧提供时间同步的RGB图像(1024×1024)、毫米级精度深度图和256束LiDAR点云,并配有像素级与点级的7类部件语义标签及6-DoF姿态真值。该数据集通过在Unreal Engine 5中搭建高保真空间仿真环境并结合全自动数据采集、多阶段质量控制与格式转换流程生成,为航天器感知研究提供了高质量、多样化、多模态的数据支撑,实验证明增加训练卫星数量可显著提升对未见目标的泛化性能,凸显了大规模多样数据的价值。
链接: https://arxiv.org/abs/2603.09320
作者: Aodi Wu,Jianhong Zuo,Zeyuan Zhao,Xubo Luo,Ruisuo Wang,Xue Wan
机构: University of Chinese Academy of Sciences (中国科学院大学); Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences (空间应用工程与技术中心,中国科学院); Nanjing University of Aeronautics and Astronautics (南京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures
Abstract:Autonomous space operations such as on-orbit servicing and active debris removal demand robust part-level semantic understanding and precise relative navigation of target spacecraft, yet collecting large-scale real data in orbit remains impractical due to cost and access constraints. Existing synthetic datasets, moreover, suffer from limited target diversity, single-modality sensing, and incomplete ground-truth annotations. We present SpaceSense-Bench, a large-scale multi-modal benchmark for spacecraft perception encompassing 136 satellite models with approximately 70 GB of data. Each frame provides time-synchronized 1024×1024 RGB images, millimeter-precision depth maps, and 256-beam LiDAR point clouds, together with dense 7-class part-level semantic labels at both the pixel and point level as well as accurate 6-DoF pose ground truth. The dataset is generated through a high-fidelity space simulation built in Unreal Engine 5 and a fully automated pipeline covering data acquisition, multi-stage quality control, and conversion to mainstream formats. We benchmark five representative tasks (object detection, 2D semantic segmentation, RGB–LiDAR fusion-based 3D point cloud segmentation, monocular depth estimation, and orientation estimation) and identify two key findings: (i) perceiving small-scale components (e.g., thrusters and omni-antennas) and generalizing to entirely unseen spacecraft in a zero-shot setting remain critical bottlenecks for current methods, and (ii) scaling up the number of training satellites yields substantial performance gains on novel targets, underscoring the value of large-scale, diverse datasets for space perception research. The dataset, code, and toolkit are publicly available at this https URL.
[CV-87] NLiPsCalib: An Efficient Calibration Framework for High-Fidelity 3D Reconstruction of Curved Visuotactile Sensors ICRA2026
【速读】:该论文旨在解决弯曲型视觉触觉传感器(curved visuotactile sensors)因非均匀光照导致感知质量下降、重建精度降低的问题,以及现有校准方法依赖定制压头和专用设备、过程昂贵且耗时的挑战。解决方案的关键在于提出NLiPsCalib框架,该框架结合可控近场光源与近光摄影测量(Near-Light Photometric Stereo, NLiPs)技术,通过少量日常物体接触即可估计接触几何信息,从而实现物理一致且高效的校准,显著简化了校准流程并提升了不同曲率形态下三维重建的保真度。
链接: https://arxiv.org/abs/2603.09319
作者: Xuhao Qin,Feiyu Zhao,Yatao Leng,Runze Hu,Chenxi Xiao
机构: ShanghaiTech University (上海科技大学); MoE Key Laboratory of Intelligent Perception and Human-Machine Collaboration (KLIP-HuMaCo) (教育部智能感知与人机协同重点实验室)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 8 figures, accepted to 2026 IEEE International Conference on Robotics Automation (ICRA 2026)
Abstract:Recent advances in visuotactile sensors increasingly employ biomimetic curved surfaces to enhance sensorimotor capabilities. Although such curved visuotactile sensors enable more conformal object contact, their perceptual quality is often degraded by non-uniform illumination, which reduces reconstruction accuracy and typically necessitates calibration. Existing calibration methods commonly rely on customized indenters and specialized devices to collect large-scale photometric data, but these processes are expensive and labor-intensive. To overcome these calibration challenges, we present NLiPsCalib, a physics-consistent and efficient calibration framework for curved visuotactile sensors. NLiPsCalib integrates controllable near-field light sources and leverages Near-Light Photometric Stereo (NLiPs) to estimate contact geometry, simplifying calibration to just a few simple contacts with everyday objects. We further introduce NLiPsTac, a controllable-light-source tactile sensor developed to validate our framework. Experimental results demonstrate that our approach enables high-fidelity 3D reconstruction across diverse curved form factors with a simple calibration procedure. We emphasize that our approach lowers the barrier to developing customized visuotactile sensors of diverse geometries, thereby making visuotactile sensing more accessible to the broader community.
[CV-88] CLoE: Expert Consistency Learning for Missing Modality Segmentation
【速读】:该论文旨在解决多模态医学图像分割中因推理阶段模态缺失导致的专家预测不一致与融合不稳定问题,尤其在小目标前景结构上表现更差。其解决方案的关键在于提出一致性驱动的专家学习框架(Consistency Learning of Experts, CLoE),通过双分支专家一致性学习目标实现鲁棒性控制:一是模态专家一致性(Modality Expert Consistency),强制全局专家预测一致以减少部分输入下的病例级漂移;二是区域专家一致性(Region Expert Consistency),聚焦临床关键前景区域的一致性,避免背景主导的正则化干扰。此外,引入轻量级门控网络将一致性分数映射为模态可靠性权重,实现融合前的可靠性感知特征重校准,从而显著提升在模态缺失场景下的分割性能与跨数据集泛化能力。
链接: https://arxiv.org/abs/2603.09316
作者: Xinyu Tong,Meihua Zhou,Bowu Fan,Haitao Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Multimodal medical image segmentation often faces missing modalities at inference, which induces disagreement among modality experts and makes fusion unstable, particularly on small foreground structures. We propose Consistency Learning of Experts (CLoE), a consistency-driven framework for missing-modality segmentation that preserves strong performance when all modalities are available. CLoE formulates robustness as decision-level expert consistency control and introduces a dual-branch Expert Consistency Learning objective. Modality Expert Consistency enforces global agreement among expert predictions to reduce case-wise drift under partial inputs, while Region Expert Consistency emphasizes agreement on clinically critical foreground regions to avoid background-dominated regularization. We further map consistency scores to modality reliability weights using a lightweight gating network, enabling reliability-aware feature recalibration before fusion. Extensive experiments on BraTS 2020 and MSD Prostate demonstrate that CLoE outperforms state-of-the-art methods in incomplete multimodal segmentation, while exhibiting strong cross-dataset generalization and improving robustness on clinically critical structures.
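CLoE 的专家一致性目标本质上是惩罚各模态专家预测之间的分歧;下面用"对均值的方差"给出一个像素级的极简示意(假设性实现,真实方法在整幅预测图上并区分全局/前景区域加权):

```python
def expert_consistency(preds, weight=None):
    # preds: 各模态专家对同一像素的前景概率
    # weight: 可选的区域权重(临床关键前景区域可设更高权重)
    n = len(preds)
    mean = sum(preds) / n
    var = sum((p - mean) ** 2 for p in preds) / n  # 分歧度 = 方差
    return var if weight is None else var * weight

print(expert_consistency([1.0, 1.0, 1.0]))           # 专家完全一致 -> 0.0
print(expert_consistency([0.9, 0.1, 0.5]))           # 分歧大 -> 损失高
print(expert_consistency([0.9, 0.1, 0.5], weight=2)) # 前景区域加权后更高
```

模态缺失时,可仅对可用专家计算该损失,从而抑制部分输入下的病例级漂移。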
[CV-89] IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator-Critic Framework
【速读】:该论文旨在解决当前文本到可缩放矢量图形(Scalable Vector Graphics, SVG)生成方法中因自回归训练过程缺乏对最终渲染图像的视觉感知而造成的生成质量受限问题。解决方案的关键在于提出一种自省式SVG生成框架(IntroSVG),其核心创新是构建一个在闭环中同时扮演生成器与评论者双重角色的统一视觉语言模型(Visual Language Model, VLM)。通过监督微调(Supervised Fine-Tuning, SFT)使模型学会绘制SVG并对其渲染结果提供反馈,并将早期生成失败转化为高质量纠错训练数据以提升鲁棒性;进一步利用高容量教师VLM构建偏好数据集,通过直接偏好优化(Direct Preference Optimization, DPO)对生成策略进行对齐。推理阶段,优化后的生成器与评论者协同执行“生成-评审-精炼”的迭代循环,从不完善的中间草图出发自主提升输出质量,从而显著改善SVG的结构复杂度、语义一致性与可编辑性。
链接: https://arxiv.org/abs/2603.09312
作者: Feiyu Wang,Jiayuan Yang,Zhiyuan Zhao,Da Zhang,Bingyu Li,Peng Liu,Junyu Gao
机构: Fudan University (复旦大学); TeleAI; Northwestern Polytechnical University (西北工业大学); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Scalable Vector Graphics (SVG) are central to digital design due to their inherent scalability and editability. Despite significant advancements in content generation enabled by Visual Language Models (VLMs), existing text-to-SVG generation methods are limited by a core challenge: the autoregressive training process does not incorporate visual perception of the final rendered image, which fundamentally constrains generation quality. To address this limitation, we propose an Introspective SVG Generation Framework (IntroSVG). At its core, the framework instantiates a unified VLM that operates in a closed loop, assuming dual roles of both generator and critic. Specifically, through Supervised Fine-Tuning (SFT), the model learns to draft SVGs and to provide feedback on their rendered outputs; moreover, we systematically convert early-stage failures into high-quality error-correction training data, thereby enhancing model robustness. Subsequently, we leverage a high-capacity teacher VLM to construct a preference dataset and further align the generator’s policy through Direct Preference Optimization (DPO). During inference, the optimized generator and critic operate collaboratively in an iterative “generate-review-refine” cycle, starting from imperfect intermediate drafts to autonomously improve output quality. Experimental results demonstrate that our method achieves state-of-the-art performance across several key evaluation metrics, generating SVGs with more complex structures, stronger semantic alignment, and greater editability. These results corroborate the effectiveness of incorporating explicit visual feedback into the generation loop.
[CV-90] See Plan Rewind: Progress-Aware Vision-Language-Action Models for Robust Robotic Manipulation CVPR
【速读】:该论文旨在解决机器人操作中任务进度感知不足导致的鲁棒性差的问题,即缺乏对当前任务状态的明确锚定、中间状态的可验证预测以及失败时的自主恢复能力。解决方案的关键在于提出一种名为“See, Plan, Rewind (SPR)”的视觉-语言-动作框架,其核心机制是通过一个闭环循环动态地将语言指令映射为一系列空间子目标(spatial subgoals),并在执行过程中持续监测进度:首先“看见”当前状态与下一个里程碑,然后“规划”通往下一二维路点的轨迹,若检测到进度停滞则“回溯”至可恢复状态,从而实现无需额外训练数据或辅助模型的鲁棒错误纠正。
链接: https://arxiv.org/abs/2603.09292
作者: Tingjun Dai,Mingfei Han,Tingwen Du,Zhiheng Liu,Zhihui Li,Salman Khan,Jun Yu,Xiaojun Chang
机构: University of Science and Technology of China (中国科学技术大学); ReLER Lab, AAII, UTS (ReLER实验室,AAII,UTS); MBZUAI (穆巴达拉人工智能研究所); CUHK (香港中文大学); Harbin Institute of Technology (Shenzhen) (哈尔滨工业大学(深圳))
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Suggested to CVPR Findings. this https URL
Abstract:Measurement of task progress through explicit, actionable milestones is critical for robust robotic manipulation. This progress awareness enables a model to ground its current task status, anticipate verifiable intermediate states, and detect and recover from failures when progress stalls. To embody this capability, we introduce See, Plan, Rewind (SPR), a progress-aware vision-language-action framework that dynamically grounds language instructions into a sequence of spatial subgoals. SPR operates through a continuous core cycle, Seeing the current state and upcoming milestone, Planning a trajectory towards the next 2D waypoint, and Rewinding to a recoverable state upon failure by monitoring progress against the expected sequence. This closed-loop approach enables robust error correction without requiring additional training data or auxiliary models. Extensive experiments demonstrate the framework’s effectiveness, generalization and robustness: SPR outperforms the MolmoAct baseline by 5% on the LIBERO benchmark. On the challenging LIBERO-Plus benchmark with unseen instructions and initial states, SPR achieves state-of-the-art robustness with the smallest performance drop, surpassing OpenVLA-OFT and UniVLA, demonstrating superior out-of-distribution robustness.
[CV-91] DenoiseSplat: Feed-Forward Gaussian Splatting for Noisy 3D Scene Reconstruction
【速读】:该论文旨在解决现有3D场景重建与新视角合成方法(如NeRF和3D高斯泼溅(3D Gaussian Splatting))在面对真实世界噪声和伪影时性能显著下降的问题。其核心挑战在于如何在不依赖3D真值监督的前提下,从含噪多视角图像中恢复高质量的3D结构。解决方案的关键是提出DenoiseSplat——一种前馈式的3D高斯泼溅方法,通过构建大规模、场景一致的噪声-清洁图像基准RE10K(包含高斯、泊松、散斑及椒盐噪声),并采用轻量级MVSplat风格的前向网络架构,在仅使用干净2D渲染结果作为监督信号的情况下实现端到端训练,从而有效提升对多种噪声类型和强度的鲁棒性,在PSNR、SSIM和LPIPS指标上优于基线方法。
链接: https://arxiv.org/abs/2603.09291
作者: Fuzhen Jiang,Zhuoran Li,Yinlin Zhang
机构: Hangzhou Dianzi University (杭州电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:3D scene reconstruction and novel-view synthesis are fundamental for VR, robotics, and content creation. However, most NeRF and 3D Gaussian Splatting pipelines assume clean inputs and degrade under real noise and artifacts. We therefore propose DenoiseSplat, a feed-forward 3D Gaussian splatting method for noisy multi-view images. We build a large-scale, scene-consistent noisy–clean benchmark on RE10K by injecting Gaussian, Poisson, speckle, and salt-and-pepper noise with controlled intensities. With a lightweight MVSplat-style feed-forward backbone, we train end-to-end using only clean 2D renderings as supervision and no 3D ground truth. On noisy RE10K, DenoiseSplat outperforms vanilla MVSplat and a strong two-stage baseline (IDF + MVSplat) in PSNR/SSIM and LPIPS across noise types and levels.
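DenoiseSplat 基准的构建依赖向干净图像注入四类受控噪声;以下是对灰度像素列表注入噪声的极简示意(假设性实现,仅说明四类噪声的数学形式,`level`、泊松近似方式等均非论文原文设定):

```python
import random

def add_noise(img, kind="gaussian", level=0.1, rng=None):
    # img: 取值在 [0,1] 的灰度像素列表; 按 kind 注入四类噪声之一
    rng = rng or random.Random(0)  # 固定种子, 便于复现
    out = []
    for p in img:
        if kind == "gaussian":                    # 加性高斯噪声
            q = p + rng.gauss(0.0, level)
        elif kind == "speckle":                   # 乘性散斑噪声
            q = p * (1.0 + rng.gauss(0.0, level))
        elif kind == "salt_pepper":               # 椒盐: 以 level 概率置 0/1
            q = rng.choice([0.0, 1.0]) if rng.random() < level else p
        else:                                     # 泊松近似: 噪声方差随亮度增大
            q = p + rng.gauss(0.0, level * max(p, 1e-6) ** 0.5)
        out.append(min(1.0, max(0.0, q)))         # 裁剪回 [0,1]
    return out

clean = [0.2, 0.5, 0.8]
noisy = add_noise(clean, "gaussian", 0.05)
print(len(noisy), all(0.0 <= v <= 1.0 for v in noisy))
```

通过控制 `kind` 与 `level`,即可像论文那样在 RE10K 上生成场景一致的"噪声-清洁"图像对。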
[CV-92] Exploring Modality-Aware Fusion and Decoupled Temporal Propagation for Multi-Modal Object Tracking
【速读】:该论文旨在解决现有多模态目标跟踪方法中两个关键问题:一是采用统一融合策略忽视了不同模态(如红外、事件、深度和RGB)之间的本质差异;二是通过混合token传播时间信息,导致时间表征纠缠且缺乏判别力。解决方案的关键在于提出MDTrack框架,其核心创新包括:1)模态感知融合机制,利用Mixture of Experts(MoE)架构为每种模态分配专用专家,并通过门控机制动态选择最优专家实现自适应的模态特定融合;2)解耦的时间传播结构,引入两个独立的状态空间模型(State Space Model, SSM)分别存储和更新RGB与X模态流的隐藏状态,以有效捕捉各自独特的时间动态特性,并通过跨注意力模块在两者间隐式交换信息,最终将增强的时间特征融入主干网络,显著提升跟踪性能。
链接: https://arxiv.org/abs/2603.09287
作者: Shilei Wang,Pujian Lai,Dong Gao,Jifeng Ning,Gong Cheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Most existing multimodal trackers adopt uniform fusion strategies, overlooking the inherent differences between modalities. Moreover, they propagate temporal information through mixed tokens, leading to entangled and less discriminative temporal representations. To address these limitations, we propose MDTrack, a novel framework for modality-aware fusion and decoupled temporal propagation in multimodal object tracking. Specifically, for modality-aware fusion, we allocate dedicated experts to each modality, including infrared, event, depth, and RGB, to process their respective representations. The gating mechanism within the Mixture of Experts dynamically selects the optimal experts based on the input features, enabling adaptive and modality-specific fusion. For decoupled temporal propagation, we introduce two separate State Space Model structures to independently store and update the hidden states of the RGB and X-modal streams, effectively capturing their distinct temporal information. To ensure synergy between the two temporal representations, we incorporate a set of cross-attention modules between the input features of the two SSMs, facilitating implicit information exchange. The resulting temporally enriched features are then integrated into the backbone through another set of cross-attention modules, enhancing MDTrack’s ability to leverage temporal information. Extensive experiments demonstrate the effectiveness of our proposed method. Both MDTrack-S and MDTrack-U achieve state-of-the-art performance across five multimodal tracking benchmarks.
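MDTrack 中 MoE 门控"按输入特征动态选择专家"的机制,可用一个 softmax 门控加权求和的极简示意说明(假设性实现,专家此处简化为线性缩放,真实方法为各模态的专用网络分支):

```python
import math

def moe_fuse(feature, experts, gate_logits):
    # experts: 各模态专家(此处简化为对特征的标量缩放函数)
    # gate_logits: 门控网络对当前输入给出的各专家打分
    exps = [math.exp(g) for g in gate_logits]
    total = sum(exps)
    weights = [e / total for e in exps]          # softmax 门控权重
    outputs = [expert(feature) for expert in experts]
    return sum(w * o for w, o in zip(weights, outputs)), weights

experts = [lambda x: 2.0 * x, lambda x: -1.0 * x]  # 两个假设的模态专家
fused, w = moe_fuse(1.0, experts, gate_logits=[2.0, 0.0])
print(round(w[0], 3))  # 门控更偏向第一个专家
```

门控分数随输入特征变化,从而对红外、事件、深度、RGB 等模态实现自适应的模态特定融合。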
[CV-93] CogBlender: Towards Continuous Cognitive Intervention in Text-to-Image Generation
【速读】:该论文旨在解决当前生成式 AI(Generative AI)在文本到图像生成过程中难以控制图像所引发的认知属性(如情绪效价、唤醒度、支配感和记忆性)的问题,即模型虽能生成语义一致的图像,却无法精准匹配特定的心理意图。解决方案的关键在于提出 CogBlender 框架,其核心是建立认知空间(Cognitive Space)与语义流形(Semantic Manifold)之间的映射关系,并通过定义一组认知锚点(Cognitive Anchors)作为认知空间的边界点,重构流匹配(flow-matching)过程中的速度场,使其基于多维认知评分进行插值,从而实现对生成过程的连续、细粒度且多维度的认知属性调控。
链接: https://arxiv.org/abs/2603.09286
作者: Shengqi Dang,Jiaying Lei,Yi He,Ziqing Qian,Nan Cao
机构: Tongji University (同济大学); Shanghai Innovation Institute (上海创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Beyond conveying semantic information, an image can also manifest cognitive attributes that elicit specific cognitive processes from the viewer, such as memory encoding or emotional response. While modern text-to-image models excel at generating semantically coherent content, they remain limited in their ability to control such cognitive properties of images (e.g., valence, memorability), often failing to align with the specific psychological intent. To bridge this gap, we introduce CogBlender, a framework that enables continuous and multi-dimensional intervention of cognitive properties during text-to-image generation. Our approach is built upon a mapping between the Cognitive Space, representing the space of cognitive properties, and the Semantic Manifold, representing the manifold of the visual semantics. We define a set of Cognitive Anchors, serving as the boundary points for the cognitive space. Then we reformulate the velocity field within the flow-matching process by interpolating from the velocity field of different anchors. Consequently, the generative process is driven by the velocity field and dynamically steered by multi-dimensional cognitive scores, enabling precise, fine-grained, and continuous intervention. We validate the effectiveness of CogBlender across four representative cognitive dimensions: valence, arousal, dominance, and image memorability. Extensive experiments demonstrate that our method achieves effective cognitive intervention. Our work provides an effective paradigm for cognition-driven creative design.
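CogBlender 将流匹配的速度场重构为各认知锚点速度场的插值;其核心的凸组合插值可如下示意(假设性实现,锚点速度用二维向量代替真实的网络输出):

```python
def blended_velocity(anchor_vels, scores):
    # anchor_vels: 各认知锚点处的速度向量; scores: 非负的认知得分
    total = sum(scores)
    weights = [s / total for s in scores]  # 归一化为凸组合权重
    dim = len(anchor_vels[0])
    return [sum(w * v[d] for w, v in zip(weights, anchor_vels))
            for d in range(dim)]

v_low = [1.0, 0.0]    # 假设的"低唤醒"锚点速度
v_high = [0.0, 1.0]   # 假设的"高唤醒"锚点速度
print(blended_velocity([v_low, v_high], [0.25, 0.75]))  # -> [0.25, 0.75]
```

连续调节 `scores` 即可在锚点之间平滑过渡,实现多维认知属性(效价、唤醒度、支配感、记忆性)的细粒度干预。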
[CV-94] Learning Convex Decomposition via Feature Fields
【速读】:该论文旨在解决三维形状的凸分解(Convex Decomposition)问题,即如何将任意复杂几何体高效、高质量地分解为多个凸体的并集,以加速物理仿真中的碰撞检测等应用。传统方法在处理开放世界(open-world)场景时存在局限性,难以泛化到未见过的形状或不同表示形式(如网格、CAD模型或高斯泼溅)。解决方案的关键在于提出一种基于特征场学习的新公式,通过自监督的纯几何目标(源自凸性的经典定义)来学习连续的特征场,并利用聚类策略从中提取高质量的凸分解结果。该方法不仅适用于单个形状优化,更重要的是通过特征预测实现了大规模数据上的可扩展自监督学习,从而构建了首个针对凸分解任务的开放世界学习模型。
链接: https://arxiv.org/abs/2603.09285
作者: Yuezhi Yang,Qixing Huang,Mikaela Angelina Uy,Nicholas Sharp
机构: NVIDIA(英伟达); The University of Texas at Austin(德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 12 figures
Abstract:This work proposes a new formulation to the long-standing problem of convex decomposition through learning feature fields, enabling the first feed-forward model for open-world convex decomposition. Our method produces high-quality decompositions of 3D shapes into a union of convex bodies, which are essential to accelerate collision detection in physical simulation, amongst many other applications. The key insight is to adopt a feature learning approach and learn a continuous feature field that can later be clustered to yield a good convex decomposition via our self-supervised, purely-geometric objective derived from the classical definition of convexity. Our formulation can be used for single shape optimization, but more importantly, feature prediction unlocks scalable, self-supervised learning on large datasets resulting in the first learned open-world model for convex decomposition. Experiments show that our decompositions are higher-quality than alternatives and generalize across open-world objects as well as across representations to meshes, CAD models, and even Gaussian splats. this https URL
[CV-95] From Ideal to Real: Stable Video Object Removal under Imperfect Conditions
【速读】:该论文旨在解决视频中物体移除(Video Object Removal)在现实世界复杂条件下的稳定性与一致性问题,尤其针对阴影、突发运动和掩码缺陷等挑战。现有基于扩散模型的视频修复方法在此类场景下常出现时间不一致或视觉闪烁等问题。其解决方案的关键在于提出一个名为 Stable Video Object Removal (SVOR) 的鲁棒框架,包含三项核心设计:(1) Mask Union for Stable Erasure (MUSE),通过窗口化掩码合并策略在时序掩码下采样过程中保留每个窗口内所有目标区域,有效应对突发运动并减少遗漏移除;(2) Denoising-Aware Segmentation (DA-Seg),在解耦侧分支上引入去噪感知自适应层归一化(Denoising-Aware AdaLN),结合掩码退化训练,提供内部扩散感知定位先验而不影响内容生成;(3) 课程两阶段训练机制,第一阶段利用未配对的真实背景视频进行自监督预训练以学习真实背景和时序先验,第二阶段在合成数据上使用掩码退化和侧效权重损失进行精调,同时移除物体及其关联阴影/反射,提升跨域鲁棒性。
链接: https://arxiv.org/abs/2603.09283
作者: Jiagao Hu,Yuxuan Chen,Fuhao Li,Zepeng Wang,Fei Wang,Daiguo Zhou,Jian Luan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: TBD
Abstract:Removing objects from videos remains difficult in the presence of real-world imperfections such as shadows, abrupt motion, and defective masks. Existing diffusion-based video inpainting models often struggle to maintain temporal stability and visual consistency under these challenges. We propose Stable Video Object Removal (SVOR), a robust framework that achieves shadow-free, flicker-free, and mask-defect-tolerant removal through three key designs: (1) Mask Union for Stable Erasure (MUSE), a windowed union strategy applied during temporal mask downsampling to preserve all target regions observed within each window, effectively handling abrupt motion and reducing missed removals; (2) Denoising-Aware Segmentation (DA-Seg), a lightweight segmentation head on a decoupled side branch equipped with Denoising-Aware AdaLN and trained with mask degradation to provide an internal diffusion-aware localization prior without affecting content generation; and (3) Curriculum Two-Stage Training: where Stage I performs self-supervised pretraining on unpaired real-background videos with online random masks to learn realistic background and temporal priors, and Stage II refines on synthetic pairs using mask degradation and side-effect-weighted losses, jointly removing objects and their associated shadows/reflections while improving cross-domain robustness. Extensive experiments show that SVOR attains new state-of-the-art results across multiple datasets and degraded-mask benchmarks, advancing video object removal from ideal settings toward real-world applications.
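SVOR 中 MUSE 的"窗口化掩码并集"可直接用几行代码示意:时间下采样时对每个窗口内的帧掩码取并集,保证窗口内出现过的目标区域不被丢弃(假设性极简实现,掩码用展平的二值像素列表表示):

```python
def mask_union_downsample(masks, window=2):
    # masks: 每帧的二值掩码(展平像素列表, 1 表示目标区域)
    # 每 window 帧合并为一帧, 合并方式为逐像素取并集
    out = []
    for start in range(0, len(masks), window):
        group = masks[start:start + window]
        union = [int(any(frame[i] for frame in group))
                 for i in range(len(group[0]))]
        out.append(union)
    return out

# 目标快速移动: 四帧中目标区域互不重叠
frames = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
print(mask_union_downsample(frames, window=2))  # -> [[1, 1, 0], [0, 0, 1]]
```

与逐帧抽样相比,并集策略在突发运动下不会漏掉任何曾出现的目标像素,从而减少遗漏移除。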
[CV-96] Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists CVPR2026
【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)在学习过程中效率不足的问题,尤其是在每条光线(ray)上参与渲染的高斯数量较多导致计算冗余的问题。解决方案的关键在于两个核心策略:一是通过定期重置高斯的尺度(scale),使高斯体积缩小,从而减少其覆盖的像素范围,缩短每像素对应的高斯列表;二是引入熵约束于alpha混合过程,以锐化沿每条光线的高斯权重分布,增强主导权重、抑制次要权重,促使每个高斯更聚焦于其主导像素,进一步降低对邻近像素的影响,从而实现更短的高斯列表。最终,该方法还集成到一个渲染分辨率调度器中,通过逐步提升分辨率进一步优化效率,实现在不牺牲渲染质量的前提下显著提升训练与渲染效率。
链接: https://arxiv.org/abs/2603.09277
作者: Jiaqi Liu,Zhizhong Han
机构: Wayne State University (韦恩州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026. Project page: this https URL
Abstract:3D Gaussian splatting (3DGS) has become a vital tool for learning a radiance field from multiple posed images. Although 3DGS shows great advantages over NeRF in terms of rendering quality and efficiency, it remains a research challenge to further improve the efficiency of learning 3D Gaussians. To overcome this challenge, we propose novel training strategies and losses to shorten each Gaussian list used to render a pixel, which speeds up the splatting by involving fewer Gaussians along a ray. Specifically, we shrink the size of each Gaussian by resetting their scales regularly, encouraging smaller Gaussians to cover fewer nearby pixels, which shortens the Gaussian lists of pixels. Additionally, we introduce an entropy constraint on the alpha blending procedure to sharpen the weight distribution of Gaussians along each ray, which drives dominant weights larger while making minor weights smaller. As a result, each Gaussian becomes more focused on the pixels where it is dominant, which reduces its impact on nearby pixels, leading to even shorter Gaussian lists. Eventually, we integrate our method into a rendering resolution scheduler which further improves efficiency through progressive resolution increase. We evaluate our method by comparing it with state-of-the-art methods on widely used benchmarks. Our results show significant advantages over others in efficiency without sacrificing rendering quality.
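该文对 alpha 混合权重施加的熵约束,目的在于让每条光线上的权重分布更尖锐;香农熵作为约束项可如下示意(假设性的极简计算,真实方法将其作为损失项参与训练):

```python
import math

def blending_entropy(weights, eps=1e-12):
    # weights: 一条光线上各高斯的 alpha 混合权重
    # 熵越低, 权重分布越尖锐(单个高斯主导该像素)
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p + eps) for p in probs)

sharp = [0.97, 0.01, 0.01, 0.01]  # 单个高斯主导
flat = [0.25, 0.25, 0.25, 0.25]   # 权重均匀分布
print(blending_entropy(sharp) < blending_entropy(flat))  # True
```

最小化该熵会放大主导权重、压低次要权重,使每个高斯更聚焦于其主导像素,从而缩短每像素的高斯列表。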
[CV-97] ForgeDreamer: Industrial Text-to-3D Generation with Multi-Expert LoRA and Cross-View Hypergraph
【速读】:该论文旨在解决当前文本到三维(text-to-3D)生成方法在工业应用中面临的两大核心问题:一是领域适配挑战,传统LoRA融合机制导致不同类别间知识干扰;二是几何推理不足,成对一致性约束无法捕捉高阶结构依赖关系,难以满足精密制造需求。其解决方案的关键在于提出名为ForgeDreamer的新框架,包含两个创新模块:一是多专家LoRA集成机制(Multi-Expert LoRA Ensemble),将多个类别特定的LoRA模型整合为统一表征,实现跨类别泛化并消除知识干扰;二是跨视角超图几何增强方法(Cross-View Hypergraph Geometric Enhancement),通过超图建模同时捕获多视角下的结构依赖关系,从而提升语义理解与几何推理能力,并确保制造级的一致性。
链接: https://arxiv.org/abs/2603.09266
作者: Junhao Cai,Deyu Zeng,Junhao Pang,Lini Li,Zongze Wu,Xiaopin Zhong
机构: Shenzhen University (深圳大学); Guangzhou Maritime University (广州航海学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current text-to-3D generation methods excel in natural scenes but struggle with industrial applications due to two critical limitations: domain adaptation challenges where conventional LoRA fusion causes knowledge interference across categories, and geometric reasoning deficiencies where pairwise consistency constraints fail to capture higher-order structural dependencies essential for precision manufacturing. We propose a novel framework named ForgeDreamer addressing both challenges through two key innovations. First, we introduce a Multi-Expert LoRA Ensemble mechanism that consolidates multiple category-specific LoRA models into a unified representation, achieving superior cross-category generalization while eliminating knowledge interference. Second, building on enhanced semantic understanding, we develop a Cross-View Hypergraph Geometric Enhancement approach that captures structural dependencies spanning multiple viewpoints simultaneously. These components work synergistically improved semantic understanding, enables more effective geometric reasoning, while hypergraph modeling ensures manufacturing-level consistency. Extensive experiments on a custom industrial dataset demonstrate superior semantic generalization and enhanced geometric fidelity compared to state-of-the-art approaches. Our code and data are provided in the supplementary material attached in the appendix for review purposes.
[CV-98] Implicit Geometry Representations for Vision-and-Language Navigation from Web Videos CVPR2025
【速读】:该论文旨在解决视觉-语言导航(Vision-and-Language Navigation, VLN)任务中因模拟器构建数据集多样性不足与可扩展性差而导致的现实环境适应性弱的问题。现有数据集难以捕捉真实室内场景的复杂性,限制了模型的泛化能力。其解决方案的关键在于提出一种基于网络房间导览视频的大规模视频指令框架,通过自然人类行走示范实现对多样化、真实室内环境的学习;同时创新性地引入隐式几何表示(implicit geometry representations),直接从RGB图像中提取空间线索,无需依赖脆弱的3D重建过程,从而显著提升数据利用率、缓解重建失败问题,并激活大量此前无法使用的视频数据,最终在多个VLN基准测试中实现新的最先进性能并支持零样本导航能力。
链接: https://arxiv.org/abs/2603.09259
作者: Mingfei Han,Haihong Hao,Liang Ma,Kamila Zhumakhanova,Ekaterina Radionova,Jingyi Zhang,Xiaojun Chang,Xiaodan Liang,Ivan Laptev
机构: Mohamed Bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); University of Science and Technology of China (中国科学技术大学); Sun Yat-Sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Extension of CVPR 2025 RoomTour3D with implicit geometric representations
Abstract:Vision-and-Language Navigation (VLN) has long been constrained by the limited diversity and scalability of simulator-curated datasets, which fail to capture the complexity of real-world environments. To overcome this limitation, we introduce a large-scale video-instruction framework derived from web-based room tour videos, enabling agents to learn from natural human walking demonstrations in diverse, realistic indoor settings. Unlike existing datasets, our framework integrates both open-ended description-enriched trajectories and action-enriched trajectories reconstructed in 3D, providing richer spatial and semantic supervision. A key extension in this work is the incorporation of implicit geometry representations, which extract spatial cues directly from RGB frames without requiring fragile 3D reconstruction. This approach substantially improves data utilization, alleviates reconstruction failures, and unlocks large portions of previously unusable video data. Comprehensive experiments across multiple VLN benchmarks (CVDN, SOON, R2R, and REVERIE) demonstrate that our method not only sets new state-of-the-art performance but also enables the development of robust zero-shot navigation agents. By bridging large-scale web videos with implicit spatial reasoning, this work advances embodied navigation towards more scalable, generalizable, and real-world applicable solutions.
[CV-99] Multimodal Graph Representation Learning with Dynamic Information Pathways
[Quick Read]: This paper tackles representation learning on multimodal graphs, where nodes carry heterogeneous features such as images and text and node embeddings must be learned flexibly and efficiently. Existing methods, typically extended from conventional graph neural networks, rely on static structures or dense attention, limiting the adaptivity and expressiveness of message passing. The key to the proposed Dynamic information Pathways (DiP) framework is the introduction of modality-specific pseudo nodes: within each modality, DiP routes messages dynamically via proximity-guided pseudo-node interactions, and it captures cross-modal dependencies through efficient information pathways in a shared state space, achieving adaptive, expressive, and sparse message propagation at linear complexity.
Link: https://arxiv.org/abs/2603.09258
Authors: Xiaobin Hong, Mingkai Lin, Xiaoli Wang, Chaoqun Wang, Wenzhong Li
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures, 6 tables
Abstract:Multimodal graphs, where nodes contain heterogeneous features such as images and text, are increasingly common in real-world applications. Effectively learning on such graphs requires both adaptive intra-modal message passing and efficient inter-modal aggregation. However, most existing approaches to multimodal graph learning are typically extended from conventional graph neural networks and rely on static structures or dense attention, which limit flexibility and expressive node embedding learning. In this paper, we propose a novel multimodal graph representation learning framework with Dynamic information Pathways (DiP). By introducing modality-specific pseudo nodes, DiP enables dynamic message routing within each modality via proximity-guided pseudo-node interactions and captures inter-modality dependence through efficient information pathways in a shared state space. This design achieves adaptive, expressive, and sparse message propagation across modalities with linear complexity. We conduct the link prediction and node classification tasks to evaluate performance and carry out full experimental analyses. Extensive experiments across multiple benchmarks demonstrate that DiP consistently outperforms baselines.
[CV-100] Multi-model approach for autonomous driving: A comprehensive study on traffic sign-, vehicle-, and lane detection and behavioral cloning
[Quick Read]: This paper addresses core challenges in environment perception and decision execution for self-driving cars, including traffic sign classification, lane detection, vehicle detection, and behavioral cloning. The key to the solution is a deep learning approach that combines pre-trained and custom neural networks with data augmentation based on geometric and color transformations, image normalization, and transfer learning, improving generalization and robustness across diverse datasets (GTSRB, road/lane segmentation datasets, vehicle detection datasets, and Udacity simulator data) and providing effective support for building safer and more reliable autonomous driving systems.
Link: https://arxiv.org/abs/2603.09255
Authors: Kanishkha Jaisankar, Pranav M. Pawar, Diana Susane Joseph, Raja Muthalagu, Mithun Mukherjee
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 35 pages, 40 figures
Abstract:Deep learning and computer vision techniques have become increasingly important in the development of self-driving cars. These techniques play a crucial role in enabling self-driving cars to perceive and understand their surroundings, allowing them to safely navigate and make decisions in real time. Using neural networks, self-driving cars can accurately identify and classify objects such as pedestrians, other vehicles, and traffic signals. By applying deep learning to data from sensors such as cameras and radar, self-driving cars can predict the likely movement of other objects and plan their own actions accordingly. This study provides a novel approach to enhance the performance of self-driving cars by using pre-trained and custom-made neural networks for key tasks, including traffic sign classification, vehicle detection, lane detection, and behavioral cloning. The methodology integrates several innovative techniques, such as geometric and color transformations for data augmentation, image normalization, and transfer learning for feature extraction. These techniques are applied to diverse datasets, including the German Traffic Sign Recognition Benchmark (GTSRB), road and lane segmentation datasets, vehicle detection datasets, and data collected using the Udacity self-driving car simulator, to evaluate model efficacy. The primary objective of the work is to review the state of the art in deep learning and computer vision for self-driving cars. The findings of the work are effective in solving various challenges related to self-driving cars, such as traffic sign classification, lane prediction, vehicle detection, and behavioral cloning, and provide valuable insights into improving the robustness and reliability of autonomous systems, paving the way for future research and deployment of safer and more efficient self-driving technologies.
[CV-101] Towards Instance Segmentation with Polygon Detection Transformers
[Quick Read]: This paper targets the bottleneck in instance segmentation between high-resolution inputs and lightweight, real-time inference. Traditional methods depend on dense pixel-wise mask prediction, which is computationally expensive and hard to run in real time. The key to the solution is the Polygon Detection Transformer (Poly-DETR), which reformulates instance segmentation as sparse vertex regression via a polar representation, removing the reliance on dense mask prediction; Polar Deformable Attention and a Position-Aware Training Scheme are introduced to dynamically update supervision and focus attention on boundary cues, improving both accuracy and efficiency. Experiments show higher mAP on MS COCO, nearly halved memory consumption in high-resolution scenarios, and advantages over a mask-based counterpart in domain-specific tasks such as cell and building-footprint segmentation.
Link: https://arxiv.org/abs/2603.09245
Authors: Jiacheng Sun, Jiaqi Lin, Wenlong Hu, Haoyang Li, Xinghong Zhou, Chenghai Mao, Yan Peng, Xiaomao Li
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:One of the bottlenecks for instance segmentation today lies in the conflicting requirements of high-resolution inputs and lightweight, real-time inference. To address this bottleneck, we present a Polygon Detection Transformer (Poly-DETR) to reformulate instance segmentation as sparse vertex regression via Polar Representation, thereby eliminating the reliance on dense pixel-wise mask prediction. Considering the box-to-polygon reference shift in Detection Transformers, we propose Polar Deformable Attention and Position-Aware Training Scheme to dynamically update supervision and focus attention on boundary cues. Compared with state-of-the-art polar-based methods, Poly-DETR achieves a 4.7 mAP improvement on MS COCO test-dev. Moreover, we construct a parallel mask-based counterpart to support a systematic comparison between polar and mask representations. Experimental results show that Poly-DETR is more lightweight in high-resolution scenarios, reducing memory consumption by almost half on Cityscapes dataset. Notably, on PanNuke (cell segmentation) and SpaceNet (building footprints) datasets, Poly-DETR surpasses its mask-based counterpart on all metrics, which validates its advantage on regular-shaped instances in domain-specific settings.
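As a rough illustration of what a polar contour representation looks like — an illustration only, since Poly-DETR's actual parameterization (e.g. whether ray angles are fixed or regressed) is not specified here — a polygon can be encoded as a centroid plus per-vertex (angle, radius) pairs and decoded back losslessly:

```python
import numpy as np

def to_polar(vertices):
    """Encode a polygon as its centroid plus per-vertex (angle, radius).

    A minimal sketch of sparse polar contour coding; Poly-DETR's exact
    parameterization may differ.
    """
    center = vertices.mean(axis=0)
    d = vertices - center
    radius = np.linalg.norm(d, axis=1)
    angle = np.arctan2(d[:, 1], d[:, 0])
    return center, angle, radius

def from_polar(center, angle, radius):
    """Decode (angle, radius) pairs back to Cartesian vertices."""
    return center + np.stack([radius * np.cos(angle),
                              radius * np.sin(angle)], axis=1)

square = np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 2.0], [0.0, 2.0]])
center, angle, radius = to_polar(square)
recovered = from_polar(center, angle, radius)
```

A handful of such (angle, radius) tokens per instance is what makes the representation sparse compared with a dense per-pixel mask.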
[CV-102] When Detectors Forget Forensics: Blocking Semantic Shortcuts for Generalizable AI-Generated Image Detection
[Quick Read]: This paper addresses the poor generalization of AI-generated-image detectors built on Vision Foundation Models (VFMs) when facing unseen generation pipelines. It identifies a key failure mechanism, termed "semantic fallback", in which detectors over-rely on semantic priors inherited from pre-training (such as identity) rather than on the forgery traces themselves. The key to the solution is the Geometric Semantic Decoupling (GSD) module, which pairs a frozen VFM acting as a semantic guide with a trainable VFM acting as an artifact detector: semantic directions are estimated from batch-wise statistics and projected out under a geometric constraint, forcing the detector to rely only on semantic-invariant forgery evidence and markedly improving cross-dataset and cross-scenario robustness and generalization.
Link: https://arxiv.org/abs/2603.09242
Authors: Chao Shuai, Zhenguang Liu, Shaojing Fan, Bin Gong, Weichen Lian, Xiuli Bi, Zhongjie Ba, Kui Ren
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:AI-generated image detection has become increasingly important with the rapid advancement of generative AI. However, detectors built on Vision Foundation Models (VFMs, e.g., CLIP) often struggle to generalize to images created using unseen generation pipelines. We identify, for the first time, a key failure mechanism, termed semantic fallback, where VFM-based detectors rely on dominant pre-trained semantic priors (such as identity) rather than forgery-specific traces under distribution shifts. To address this issue, we propose Geometric Semantic Decoupling (GSD), a parameter-free module that explicitly removes semantic components from learned representations by leveraging a frozen VFM as a semantic guide with a trainable VFM as an artifact detector. GSD estimates semantic directions from batch-wise statistics and projects them out via a geometric constraint, forcing the artifact detector to rely on semantic-invariant forensic evidence. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches, achieving 94.4% video-level AUC (+1.2%) in cross-dataset evaluation, improving robustness to unseen manipulations (+3.0% on DF40), and generalizing beyond faces to the detection of synthetic images of general scenes, including UniversalFakeDetect (+0.9%) and GenImage (+1.7%).
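The projection step the abstract describes — estimating a semantic direction from batch statistics and removing it geometrically — can be sketched in NumPy. The mean-feature direction estimate below is an assumption for illustration, not the paper's exact procedure:

```python
import numpy as np

def project_out_semantic(features, semantic_feats):
    """Remove an estimated semantic direction from artifact features.

    features:       (N, D) features from the trainable branch
    semantic_feats: (N, D) features from the frozen VFM, used only to
                    estimate the batch-wise semantic direction
    """
    # Batch-wise direction estimate: the mean frozen-VFM feature,
    # unit-normalized (an assumed, simple choice).
    v = semantic_feats.mean(axis=0)
    v = v / np.linalg.norm(v)
    # Geometric removal: subtract each feature's projection onto v.
    return features - np.outer(features @ v, v)

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
sem = rng.normal(size=(8, 16))
cleaned = project_out_semantic(feats, sem)
```

After the projection, every feature is exactly orthogonal to the estimated semantic direction, which is the sense in which the detector is "forced" onto semantic-invariant evidence.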
[CV-103] RAE-NWM: Navigation World Model in Dense Visual Representation Space
[Quick Read]: This paper addresses the loss of fine-grained structural information that arises when navigation world models learn state evolution in the compressed latent space of a Variational Autoencoder (VAE), where spatial compression hampers precise control. The key to the solution is to replace the VAE latent space with dense DINOv2 feature representations, which preserve richer visual structure. On this basis, the Representation Autoencoder-based Navigation World Model (RAE-NWM) models continuous state transitions with a Conditional Diffusion Transformer with a Decoupled Diffusion Transformer head (CDiT-DH) and introduces a time-driven gating module that dynamically regulates action-injection strength, improving the structural stability and action accuracy of sequential rollouts and benefiting downstream planning and navigation.
Link: https://arxiv.org/abs/2603.09241
Authors: Mingkun Zhang, Wangtian Shen, Fan Zhang, Haijian Qin, Zihao Pei, Ziyang Meng
Affiliations: Tsinghua University; University of Rochester; Beijing Information Science and Technology University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Code is available at: this https URL
Abstract:Visual navigation requires agents to reach goals in complex environments through perception and planning. World models address this task by simulating action-conditioned state transitions to predict future observations. Current navigation world models typically learn state evolution under actions within the compressed latent space of a Variational Autoencoder, where spatial compression often discards fine-grained structural information and hinders precise control. To better understand the propagation characteristics of different representations, we conduct a linear dynamics probe and observe that dense DINOv2 features exhibit stronger linear predictability for action-conditioned transitions. Motivated by this observation, we propose the Representation Autoencoder-based Navigation World Model (RAE-NWM), which models navigation dynamics in a dense visual representation space. We employ a Conditional Diffusion Transformer with Decoupled Diffusion Transformer head (CDiT-DH) to model continuous transitions, and introduce a separate time-driven gating module for dynamics conditioning to regulate action injection strength during generation. Extensive evaluations show that modeling sequential rollouts in this space improves structural stability and action accuracy, benefiting downstream planning and navigation.
[CV-104] BridgeDiff: Bridging Human Observations and Flat-Garment Synthesis for Virtual Try-Off
[Quick Read]: This paper addresses inconsistent completion and unstable structure in virtual try-off (VTOFF), caused by ignoring the gap between on-body garment appearance and flat-garment layouts: existing methods typically treat the task as direct image translation driven by local masks or text-only prompts, making it hard to recover details in unobserved regions while keeping the garment structurally stable. The key to the proposed BridgeDiff framework is two complementary modules that explicitly bridge human-centric observation and flat-garment synthesis. The Garment Condition Bridge Module (GCBM) builds a garment-cue representation capturing global appearance and semantic identity, enabling robust inference of continuous details under partial visibility; the Flat Structure Constraint Module (FSCM) injects flat-garment structural priors via Flat-Constraint Attention (FC-Attention) at selected denoising stages, improving structural stability well beyond text-only conditioning.
Link: https://arxiv.org/abs/2603.09236
Authors: Shuang Liu, Ao Yu, Linkang Cheng, Xiwen Huang, Li Zhao, Junhui Liu, Zhiting Lin, Yu Liu
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 33 pages, 16 figures
Abstract:Virtual try-off (VTOFF) aims to recover canonical flat-garment representations from images of dressed persons for standardized display and downstream virtual try-on. Prior methods often treat VTOFF as direct image translation driven by local masks or text-only prompts, overlooking the gap between on-body appearances and flat layouts. This gap frequently leads to inconsistent completion in unobserved regions and unstable garment structure. We propose BridgeDiff, a diffusion-based framework that explicitly bridges human-centric observations and flat-garment synthesis through two complementary components. First, the Garment Condition Bridge Module (GCBM) builds a garment-cue representation that captures global appearance and semantic identity, enabling robust inference of continuous details under partial visibility. Second, the Flat Structure Constraint Module (FSCM) injects explicit flat-garment structural priors via Flat-Constraint Attention (FC-Attention) at selected denoising stages, improving structural stability beyond text-only conditioning. Extensive experiments on standard VTOFF benchmarks show that BridgeDiff achieves state-of-the-art performance, producing higher-quality flat-garment reconstructions while preserving fine-grained appearance and structural integrity.
[CV-105] HelixTrack: Event-Based Tracking and RPM Estimation of Propeller-like Objects
[Quick Read]: This paper addresses microsecond-latency tracking for safety-critical perception on unmanned aerial vehicles (UAVs) and rotating machinery, specifically the difficulty of stably tracking fast, periodic targets such as propellers and estimating their RPM under egomotion and strong distractors. Conventional frame-based and event-based trackers assume smooth, continuous motion and therefore drift or fail on propeller-like objects. The key to the solution is HelixTrack, a fully event-driven method that back-warps events from the image plane into the rotor plane via a homography estimated on the fly, maintains instantaneous phase estimates with a Kalman filter, and jointly refines object pose and RPM by coupling phase residuals to geometry through batched iterative updates. On the newly introduced TQE dataset, HelixTrack achieves microsecond-level latency and sub-millisecond-level precision, clearly outperforming existing per-event and aggregation-based baselines.
Link: https://arxiv.org/abs/2603.09235
Authors: Radim Spetlik, Michal Pliska, Vojtěch Vrba, Jiri Matas
Affiliations: Czech Technical University in Prague
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Safety-critical perception for unmanned aerial vehicles and rotating machinery requires microsecond-latency tracking of fast, periodic motion under egomotion and strong distractors. Frame-based and event-based trackers drift or break on propellers because periodic signatures violate their smooth-motion assumptions. We tackle this gap with HelixTrack, a fully event-driven method that jointly tracks propeller-like objects and estimates their rotations per minute (RPM). Incoming events are back-warped from the image plane into the rotor plane via a homography estimated on the fly. A Kalman Filter maintains instantaneous estimates of phase. Batched iterative updates refine the object pose by coupling phase residuals to geometry. To our knowledge, no public dataset targets joint tracking and RPM estimation of propeller-like objects. We therefore introduce the Timestamped Quadcopter with Egomotion (TQE) dataset with 13 high-resolution event sequences, containing 52 rotating objects in total, captured at distances of 2 m / 4 m, with increasing egomotion and microsecond RPM ground truth. On TQE, HelixTrack processes full-rate events faster than real time (approx. 11.8x) with microsecond latency. It consistently outperforms per-event and aggregation-based baselines adapted for RPM estimation.
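The phase-to-RPM part of the pipeline can be sketched with a textbook Kalman filter over the state (phase, angular rate), with RPM read off the rate. This is a minimal sketch on noiseless, unwrapped phase measurements; HelixTrack's actual filter also couples the homography geometry and handles raw event streams, which are omitted here:

```python
import numpy as np

def track_rpm(phase_meas, dt, q=1e-4, r=1e-3):
    """Kalman filter over state [phase, angular rate]; returns RPM per step.

    phase_meas: unwrapped phase measurements in radians
    dt:         sampling interval in seconds
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])  # constant-rate transition
    H = np.array([[1.0, 0.0]])             # only the phase is observed
    Q = q * np.eye(2)                      # process noise
    R = np.array([[r]])                    # measurement noise
    # Initialize the rate from the first phase difference.
    x = np.array([phase_meas[0], (phase_meas[1] - phase_meas[0]) / dt])
    P = np.eye(2)
    rpms = []
    for z in phase_meas[1:]:
        x = F @ x                          # predict
        P = F @ P @ F.T + Q
        y = z - H @ x                      # innovation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + (K @ y).ravel()            # update
        P = (np.eye(2) - K @ H) @ P
        rpms.append(x[1] * 60.0 / (2.0 * np.pi))
    return np.array(rpms)

# Noiseless rotor at 50 Hz (= 3000 RPM) sampled every 100 microseconds.
dt = 1e-4
phases = 2.0 * np.pi * 50.0 * np.arange(500) * dt
rpm_est = track_rpm(phases, dt)
```

With real event data the measurements are noisy and the phase must be unwrapped first; the same filter structure then smooths the rate estimate.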
[CV-106] UniField: A Unified Field-Aware MRI Enhancement Framework
[Quick Read]: This paper addresses the limitations of current MRI field-strength enhancement methods, which focus on isolated conversion tasks (e.g., 64mT-to-3T or 3T-to-7T) and fail to exploit the degradation patterns shared across field strengths, restricting model generalization. The key to the proposed unified framework, UniField, lies in three innovations. First, instead of treating 3D MRI volumes as independent 2D slices, it directly leverages pre-trained 3D foundation models to exploit the full volumetric information of continuous anatomical structures, strengthening representation learning. Second, to counter the spectral bias of mainstream flow-matching models, which over-smooth high-frequency details, it introduces a Field-Aware Spectral Rectification Mechanism (FASRM) that tailors frequency-domain corrections to each field strength based on the physics of magnetic fields. Third, it constructs and publicly releases a paired multi-field MRI dataset an order of magnitude larger than existing ones, easing the data bottleneck.
Link: https://arxiv.org/abs/2603.09223
Authors: Yiyang Lin, Chenhui Wang, Zhihao Peng, Yixuan Yuan
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Magnetic Resonance Imaging (MRI) field-strength enhancement holds immense value for both clinical diagnostics and advanced research. However, existing methods typically focus on isolated enhancement tasks, such as specific 64mT-to-3T or 3T-to-7T transitions using limited subject cohorts, thereby failing to exploit the shared degradation patterns inherent across different field strengths and severely restricting model generalization. To address this challenge, we propose UniField, a unified framework integrating multiple modalities and enhancement tasks to mutually promote representation learning by exploiting these shared degradation characteristics. Specifically, our main contributions are threefold. Firstly, to overcome MRI data scarcity and capture continuous anatomical structures, UniField departs from conventional methods that treat 3D MRI volumes as independent 2D slices. Instead, we directly exploit comprehensive 3D volumetric information by leveraging pre-trained 3D foundation models, thereby embedding generalized and robust structural representations to significantly boost enhancement performance. In addition, to mitigate the spectral bias of mainstream flow-matching models that often over-smooth high-frequency details, we explicitly incorporate the physical mechanisms of magnetic fields to introduce a Field-Aware Spectral Rectification Mechanism (FASRM), tailoring customized spectral corrections to distinct field strengths. Finally, to resolve the fundamental data bottleneck, we organize and publicly release a comprehensive paired multi-field MRI dataset, which is an order of magnitude larger than existing datasets. Extensive experiments demonstrate our method’s superiority over state-of-the-art approaches, achieving an average improvement of approximately 1.81 dB in PSNR and 9.47% in SSIM. Code will be released upon acceptance.
[CV-107] Distributed Convolutional Neural Networks for Object Recognition
[Quick Read]: This paper addresses the difficulty conventional convolutional neural networks (CNNs) have in effectively separating positive-class from negative-class features when recognizing a specific positive class, together with their relatively high model complexity. The key to the solution is a novel loss function that maps positive samples to a compact set in high-dimensional space and negative samples to the origin, so that the resulting distributed convolutional neural network (DisCNN) extracts only positive-class features. This mechanism disentangles positive-class features from negative ones and, because only a few positive-class features need to be extracted, yields a lightweight architecture that generalizes well to unseen classes and suits detecting positive-class objects embedded in large, complex backgrounds.
Link: https://arxiv.org/abs/2603.09220
Authors: Liang Sun
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This paper proposes a novel loss function for training a distributed convolutional neural network (DisCNN) to recognize only a specific positive class. By mapping positive samples to a compact set in high-dimensional space and negative samples to the origin, the DisCNN extracts only the features of the positive class. An experiment is given to prove this. Thus, the features of the positive class are disentangled from those of the negative classes. The model has a lightweight architecture because only a few positive-class features need to be extracted. The model demonstrates excellent generalization on the test data and remains effective even for unseen classes. Finally, using DisCNN, object detection of positive samples embedded in a large and complex background is straightforward.
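The loss idea — positives mapped to a compact set, negatives to the origin — can be illustrated as follows. Pulling positives to a fixed-radius shell is an assumed concrete form of the paper's "compact set"; the exact formulation may differ:

```python
import numpy as np

def discnn_loss(embeddings, labels, margin=1.0):
    """Sketch of the DisCNN-style objective: positive embeddings near a
    compact shell of radius `margin`, negative embeddings at the origin.

    embeddings: (N, D) network outputs
    labels:     (N,) with 1 for the positive class, 0 otherwise
    """
    norms = np.linalg.norm(embeddings, axis=1)
    pos = labels == 1
    # Positives: penalize deviation from the target radius.
    loss_pos = np.mean((norms[pos] - margin) ** 2) if pos.any() else 0.0
    # Negatives: penalize any distance from the origin.
    loss_neg = np.mean(norms[~pos] ** 2) if (~pos).any() else 0.0
    return loss_pos + loss_neg
```

At the optimum, every negative collapses to the origin and positives sit on the shell, which is what lets a downstream detector treat "far from the origin" as evidence of the positive class.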
[CV-108] TubeMLLM: A Foundation Model for Topology Knowledge Exploration in Vessel-like Anatomy MICCAI2026
[Quick Read]: This paper addresses topological inconsistencies in modeling medical vessel-like anatomy, such as artificial disconnections and spurious merges, caused by intricate topology and dataset shift. The key to the solution is TubeMLLM, a unified foundation model that injects topological priors through explicit natural-language prompting and aligns them with visual representations in a shared-attention architecture, substantially improving topology-aware perception. The work also builds TubeMData, the first topology-centric multimodal benchmark, and adopts an adaptive loss-weighting strategy that emphasizes topology-critical regions during training, enabling zero-shot cross-modality transfer and improved robustness.
Link: https://arxiv.org/abs/2603.09217
Authors: Yaoyu Liu, Minghui Zhang, Xin You, Hanxiao Zhang, Yun Gu
Affiliations: Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 12 figures, extended version of the submission to MICCAI 2026
Abstract:Modeling medical vessel-like anatomy is challenging due to its intricate topology and sensitivity to dataset shifts. Consequently, task-specific models often suffer from topological inconsistencies, including artificial disconnections and spurious merges. Motivated by the promise of multimodal large language models (MLLMs) for zero-shot generalization, we propose TubeMLLM, a unified foundation model that couples structured understanding with controllable generation for medical vessel-like anatomy. By integrating topological priors through explicit natural language prompting and aligning them with visual representations in a shared-attention architecture, TubeMLLM significantly enhances topology-aware perception. Furthermore, we construct TubeMData, a pioneering multimodal benchmark comprising comprehensive topology-centric tasks, and introduce an adaptive loss weighting strategy to emphasize topology-critical regions during training. Extensive experiments on fifteen diverse datasets demonstrate our superiority. Quantitatively, TubeMLLM achieves state-of-the-art out-of-distribution performance, substantially reducing global topological discrepancies on color fundus photography (decreasing the β₀ number error from 37.42 to 8.58 compared to baselines). Notably, TubeMLLM exhibits exceptional zero-shot cross-modality transferring ability on unseen X-ray angiography, achieving a Dice score of 67.50% while significantly reducing the β₀ error to 1.21. TubeMLLM also maintains robustness against degradations such as blur, noise, and low resolution. Furthermore, in topology-aware understanding tasks, the model achieves 97.38% accuracy in evaluating mask topological quality, significantly outperforming standard vision-language baselines.
[CV-109] Geometry-Aware Metric Learning for Cross-Lingual Few-Shot Sign Language Recognition on Static Hand Keypoints
[Quick Read]: This paper addresses the scarcity of labeled data that limits sign language recognition (SLR) for low-resource languages, and in particular the domain shift that hampers cross-lingual few-shot transfer. The key to the solution is a geometry-aware metric-learning framework built on a compact 20-dimensional inter-joint angle descriptor derived from MediaPipe static hand keypoints. The descriptor is invariant to SO(3) rotation, translation, and isotropic scaling, eliminating the dominant sources of cross-dataset variation and yielding markedly tighter and more discriminative class prototypes. Across fingerspelling alphabets of four typologically diverse sign languages, it improves over normalized-coordinate baselines by up to 25 percentage points and enables frozen cross-lingual transfer that can exceed within-domain accuracy.
Link: https://arxiv.org/abs/2603.09213
Authors: Chayanin Chamachot, Kanokphan Lertniponphan
Affiliations: Chulalongkorn University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Sign language recognition (SLR) systems typically require large labeled corpora for each language, yet the majority of the world’s 300+ sign languages lack sufficient annotated data. Cross-lingual few-shot transfer, pretraining on a data-rich source language and adapting with only a handful of target-language examples, offers a scalable alternative, but conventional coordinate-based keypoint representations are susceptible to domain shift arising from differences in camera viewpoint, hand scale, and recording conditions. This shift is particularly detrimental in the few-shot regime, where class prototypes estimated from only K examples are highly sensitive to extrinsic variance. We propose a geometry-aware metric-learning framework centered on a compact 20-dimensional inter-joint angle descriptor derived from MediaPipe static hand keypoints. These angles are invariant to SO(3) rotation, translation, and isotropic scaling, eliminating the dominant sources of cross-dataset shift and yielding tighter, more stable class prototypes. Evaluated on four fingerspelling alphabets spanning typologically diverse sign languages, ASL, LIBRAS, Arabic Sign Language, and Thai Sign Language, the proposed angle features improve over normalized-coordinate baselines by up to 25 percentage points within-domain and enable frozen cross-lingual transfer that frequently exceeds within-domain accuracy, using a lightweight MLP encoder with about 10^5 parameters. These findings demonstrate that invariant hand-geometry descriptors provide a portable and effective foundation for cross-lingual few-shot SLR in low-resource settings.
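The invariance claim is easy to verify numerically: an angle computed at a joint from a keypoint triplet is unchanged by rotation, translation, and isotropic scaling. The triplet list below is a stand-in for the paper's 20 hand-specific angles, which are not reproduced here:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b formed by the segments b->a and b->c (radians)."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def angle_descriptor(keypoints, triplets):
    """Stack inter-joint angles over (a, b, c) index triplets.

    `triplets` stands in for the paper's 20 hand-specific triplets.
    """
    return np.array([joint_angle(keypoints[i], keypoints[j], keypoints[k])
                     for i, j, k in triplets])

rng = np.random.default_rng(1)
kp = rng.normal(size=(6, 3))                    # mock 3D hand keypoints
triplets = [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5)]
desc = angle_descriptor(kp, triplets)

# Apply a rotation about z, a uniform scale, and a translation.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
kp2 = 2.5 * (kp @ R.T) + np.array([3.0, -1.0, 5.0])
desc2 = angle_descriptor(kp2, triplets)
```

Because the two descriptors match exactly, any classifier trained on them is by construction blind to camera pose, hand size, and framing — the extrinsic factors the paper identifies as the dominant cross-dataset shift.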
[CV-110] MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data
[Quick Read]: This paper addresses the challenge of making Vision Language Models (VLMs) self-evolving under a zero-data setting, i.e., enabling a VLM to autonomously generate visual concepts, construct reasoning tasks, and iteratively improve itself without any seed images or annotations. The key to the proposed Multi-model Multimodal Zero (MM-Zero) framework is an RL-based multi-role training scheme with three specialized roles: a Proposer that generates abstract visual concepts and formulates questions, a Coder that translates concepts into executable code (e.g., Python or SVG) to render images, and a Solver that performs multimodal reasoning over the generated visuals. All three roles are initialized from the same base model and trained jointly with Group Relative Policy Optimization (GRPO), using rewards that combine execution feedback, visual verification, and difficulty balancing, closing a self-evolution loop that steadily improves VLM reasoning without any external data.
Link: https://arxiv.org/abs/2603.09206
Authors: Zongxia Li, Hongyang Du, Chengsong Huang, Xiyang Wu, Lantao Yu, Yicheng He, Jing Xie, Xiaomin Wu, Zhichao Liu, Jiarui Zhang, Fuxiao Liu
Affiliations: Qwen3-VL-Instruct; Mimo-VL-7B-Instruct
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Self-evolving has emerged as a key paradigm for improving foundational models such as Large Language Models (LLMs) and Vision Language Models (VLMs) with minimal human intervention. While recent approaches have demonstrated that LLM agents can self-evolve from scratch with little to no data, VLMs introduce an additional visual modality that typically requires at least some seed data, such as images, to bootstrap the self-evolution process. In this work, we present Multi-model Multimodal Zero (MM-Zero), the first RL-based framework to achieve zero-data self-evolution for VLM reasoning. Moving beyond prior dual-role (Proposer and Solver) setups, MM-Zero introduces a multi-role self-evolving training framework comprising three specialized roles: a Proposer that generates abstract visual concepts and formulates questions; a Coder that translates these concepts into executable code (e.g., Python, SVG) to render visual images; and a Solver that performs multimodal reasoning over the generated visual content. All three roles are initialized from the same base model and trained using Group Relative Policy Optimization (GRPO), with carefully designed reward mechanisms that integrate execution feedback, visual verification, and difficulty balancing. Our experiments show that MM-Zero improves VLM reasoning performance across a wide range of multimodal benchmarks. MM-Zero establishes a scalable path toward self-evolving multi-model systems for multimodal models, extending the frontier of self-improvement beyond the conventional two-model paradigm.
[CV-111] Point Cloud as a Foreign Language for Multi-modal Large Language Model
[Quick Read]: This paper addresses the semantic misalignment, resolution sensitivity, and heavy computational overhead of current encoder-based multimodal large language models (MLLMs) for 3D understanding. The key to the solution is SAGE, the first end-to-end 3D MLLM that processes raw point clouds directly, without a pre-trained 3D encoder: a lightweight 3D tokenizer combines geometric sampling and neighborhood aggregation with vector quantization to convert point clouds into discrete tokens, treating 3D data as a "foreign language" that naturally extends the LLM vocabulary. A preference-optimization training strategy with a semantic-alignment-based reward, designed for open-ended 3D question answering with descriptive responses, further strengthens reasoning on complex 3D tasks. Across diverse 3D understanding benchmarks, the end-to-end approach outperforms encoder-based methods while offering better computational efficiency, generalization across LLM backbones, and robustness to input-resolution variation.
Link: https://arxiv.org/abs/2603.09173
Authors: Sneha Paul, Zachary Patterson, Nizar Bouguila
Affiliations: Concordia University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026
Abstract:Multi-modal large language models (MLLMs) have shown remarkable progress in integrating visual and linguistic understanding. Recent efforts have extended these capabilities to 3D understanding through encoder-based architectures that rely on pre-trained 3D encoders to extract geometric features. However, such approaches suffer from semantic misalignment between geometric and linguistic spaces, resolution sensitivity, and substantial computational overhead. In this work, we present SAGE, the first end-to-end 3D MLLM that directly processes raw point clouds without relying on a pre-trained 3D encoder. Our approach introduces a lightweight 3D tokenizer that combines geometric sampling and neighbourhood aggregation with vector quantization to convert point clouds into discrete tokens–treating 3D data as a foreign language that naturally extends the LLM’s vocabulary. Furthermore, to enhance the model’s reasoning capability on complex 3D tasks, we propose a preference optimization training strategy with a semantic alignment-based reward, specifically designed for open-ended 3D question answering where responses are descriptive. Extensive experiments across diverse 3D understanding benchmarks demonstrate that our end-to-end approach outperforms existing encoder-based methods while offering significant advantages in computational efficiency, generalization across LLM backbones, and robustness to input resolution variations. Code is available at: this http URL.
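The "geometric sampling" step of point-cloud tokenizers like this is commonly implemented as farthest point sampling (FPS); that SAGE uses exactly this sampler is an assumption. A minimal greedy version:

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Greedy FPS: select k well-spread indices from an (N, 3) cloud.

    Each step picks the point whose minimum distance to the already
    chosen set is largest, which spreads samples over the geometry.
    """
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(points)))]
    min_dist = np.full(len(points), np.inf)
    for _ in range(k - 1):
        # Distance of every point to the most recently chosen one.
        d = np.linalg.norm(points - points[chosen[-1]], axis=1)
        min_dist = np.minimum(min_dist, d)
        chosen.append(int(np.argmax(min_dist)))
    return np.array(chosen)

rng = np.random.default_rng(2)
cluster = rng.normal(scale=0.1, size=(50, 3))    # dense cluster at origin
outposts = np.array([[10.0, 0, 0], [0, 10.0, 0], [0, 0, 10.0]])
pts = np.vstack([cluster, outposts])
idx = farthest_point_sampling(pts, 4)
```

Each sampled point would then anchor a local neighborhood whose aggregated feature is vector-quantized into one discrete token.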
[CV-112] Progressive Split Mamba: Effective State Space Modelling for Image Restoration
[Quick Read]: This paper addresses the problem in image restoration of simultaneously preserving fine-grained local structure and maintaining long-range spatial coherence. Convolutional networks are limited by their receptive fields, and attention-based models incur quadratic complexity; State Space Models (SSMs) such as Mamba offer linear-time long-range modeling, but applying them directly to 2D images exposes two key flaws: flattening 2D feature maps into 1D sequences disrupts spatial topology and distorts local information, and the stability-driven recurrent dynamics of SSMs induce long-range decay that weakens global consistency. The key to the proposed Progressive Split-Mamba (PS-Mamba) is twofold: a geometry-consistent progressive split (into halves, quadrants, then octants) preserves neighborhood integrity and enables locality-aware hierarchical modeling, while symmetric cross-scale shortcut pathways directly transmit low-frequency global context to suppress long-range decay, markedly improving restoration quality at linear complexity.
Link: https://arxiv.org/abs/2603.09171
Authors: Mohammed Hassanin, Nour Moustafa, Weijian Deng, Ibrahim Radwan
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Image restoration requires simultaneously preserving fine-grained local structures and maintaining long-range spatial coherence. While convolutional networks struggle with limited receptive fields, and Transformers incur quadratic complexity for global attention, recent State Space Models (SSMs), such as Mamba, provide an appealing linear-time alternative for long-range dependency modelling. However, naively extending Mamba to 2D images exposes two intrinsic shortcomings. First, flattening 2D feature maps into 1D sequences disrupts spatial topology, leading to locality distortion that hampers precise structural recovery. Second, the stability-driven recurrent dynamics of SSMs induce long-range decay, progressively attenuating information across distant spatial positions and weakening global consistency. Together, these effects limit the effectiveness of state-space modelling in high-fidelity restoration. We propose Progressive Split-Mamba (PS-Mamba), a topology-aware hierarchical state-space framework designed to reconcile locality preservation with efficient global propagation. Instead of sequentially flattening entire feature maps, PS-Mamba performs geometry-consistent partitioning, maintaining neighbourhood integrity prior to state-space processing. A progressive split hierarchy (halves, quadrants, octants) enables structured multi-scale modelling while retaining linear complexity. To counteract long-range decay, we introduce symmetric cross-scale shortcut pathways that directly transmit low-frequency global context across hierarchical levels, stabilising information flow over large spatial extents. Extensive experiments on super-resolution, denoising, and JPEG artifact reduction show consistent improvements over recent Mamba-based and attention-based models with a clear margin.
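The progressive split hierarchy (halves → quadrants → octants) can be sketched as recursive, geometry-consistent partitioning of a feature map. Alternating the split axis per level is an assumption about the scheme, not taken from the paper:

```python
import numpy as np

def progressive_split(x, level):
    """Split an (H, W, C) feature map into 2**level contiguous blocks.

    level 1 -> halves, 2 -> quadrants, 3 -> octants. Each block keeps a
    contiguous spatial neighborhood, unlike flattening the whole map
    into one 1D sequence.
    """
    blocks = [x]
    for i in range(level):
        axis = i % 2  # 0: split along height, 1: along width
        blocks = [part for b in blocks
                  for part in np.array_split(b, 2, axis=axis)]
    return blocks

x = np.arange(64, dtype=float).reshape(8, 8, 1)
octants = progressive_split(x, 3)
```

Each block would be scanned by its own state-space pass, so neighboring pixels stay adjacent in the scanned sequence within every block.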
[CV-113] RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning
[Quick Read]: This paper addresses the difficulty of scaling high-quality annotations for dense image captioning, which underpins vision-language pretraining and text-to-image generation. Human annotation is prohibitively expensive, and while synthetic captioning with strong vision-language models (VLMs) is practical, supervised distillation often yields limited output diversity and weak generalization. The key to the solution is RubiCap, a novel reinforcement learning (RL) framework that derives fine-grained, sample-specific reward signals from LLM-written rubrics: it first assembles a committee of candidate captions, then an LLM rubric writer diagnoses the strengths and deficiencies of the current policy and converts them into explicit evaluation criteria, so that an LLM judge can decompose holistic quality assessment into structured, multi-faceted evaluations in place of coarse scalar rewards. This markedly improves caption quality and diversity, surpassing supervised distillation, prior RL methods, human-expert annotations, and GPT-4V-augmented outputs across benchmarks; notably, captions from the compact RubiCap-3B train stronger pretrained VLMs than captions annotated by proprietary models.
Link: https://arxiv.org/abs/2603.09160
Authors: Tzu-Heng Huang, Sirajul Salekin, Javier Movellan, Frederic Sala, Manjot Bilkhu
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Dense image captioning is critical for cross-modal alignment in vision-language pretraining and text-to-image generation, but scaling expert-quality annotations is prohibitively expensive. While synthetic captioning via strong vision-language models (VLMs) is a practical alternative, supervised distillation often yields limited output diversity and weak generalization. Reinforcement learning (RL) could overcome these limitations, but its successes have so far been concentrated in verifiable domains that rely on deterministic checkers – a luxury not available in open-ended captioning. We address this bottleneck with RubiCap, a novel RL framework that derives fine-grained, sample-specific reward signals from LLM-written rubrics. RubiCap first assembles a diverse committee of candidate captions, then employs an LLM rubric writer to extract consensus strengths and diagnose deficiencies in the current policy. These insights are converted into explicit evaluation criteria, enabling an LLM judge to decompose holistic quality assessment and replace coarse scalar rewards with structured, multi-faceted evaluations. Across extensive benchmarks, RubiCap achieves the highest win rates on CapArena, outperforming supervised distillation, prior RL methods, human-expert annotations, and GPT-4V-augmented outputs. On CaptionQA, it demonstrates superior word efficiency: our 7B model matches Qwen2.5-VL-32B-Instruct, and our 3B model surpasses its 7B counterpart. Remarkably, using the compact RubiCap-3B as a captioner produces stronger pretrained VLMs than those trained on captions from proprietary models.
[CV-114] RTFDNet: Fusion-Decoupling for Robust RGB-T Segmentation
【Quick Read】: This paper tackles the severe performance degradation of RGB-Thermal (RGB-T) semantic segmentation when sensor signals are partially missing; traditional methods overemphasize modality balance and lack robustness. The key to the solution is RTFDNet, a three-branch encoder-decoder that unifies fusion and decoupling through Synergistic Feature Fusion (SFF) together with Cross-Modal Decouple Regularization (CMDR) and Region Decouple Regularization (RDR): SFF injects complementary cues via channel-wise gated exchange and lightweight spatial attention; CMDR isolates modality-specific components from the fused representation and supervises unimodal decoders with stop-gradient targets; RDR enforces class-selective prediction consistency in confident regions while blocking gradients to the fusion branch. This feedback loop strengthens the unimodal paths without degrading the fused stream, enabling efficient standalone inference at test time.
Link: https://arxiv.org/abs/2603.09149
Authors: Kunyu Tan,Mingjian Liang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:RGB-Thermal (RGB-T) semantic segmentation is essential for robotic systems operating in low-light or dark environments. However, traditional approaches often overemphasize modality balance, resulting in limited robustness and severe performance degradation when sensor signals are partially missing. Recent advances such as cross-modal knowledge distillation and modality-adaptive fine-tuning attempt to enhance cross-modal interaction, but they typically decouple modality fusion and modality adaptation, requiring multi-stage training with frozen models or teacher-student frameworks. We present RTFDNet, a three-branch encoder-decoder that unifies fusion and decoupling for robust RGB-T segmentation. Synergistic Feature Fusion (SFF) performs channel-wise gated exchange and lightweight spatial attention to inject complementary cues. Cross-Modal Decouple Regularization (CMDR) isolates modality-specific components from the fused representation and supervises unimodal decoders via stop-gradient targets. Region Decouple Regularization (RDR) enforces class-selective prediction consistency in confident regions while blocking gradients to the fusion branch. This feedback loop strengthens unimodal paths without degrading the fused stream, enabling efficient standalone inference at test time. Extensive experiments demonstrate the effectiveness of RTFDNet, showing consistent performance across varying modality conditions. Our implementation will be released to facilitate further research. Our source code are publicly available at this https URL.
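The channel-wise gated exchange in SFF can be pictured as a convex per-channel mixing of the two modality features. The sketch below treats the gate logits as given (a simplified stand-in for the paper's gating sub-network, whose exact form is not specified here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_exchange(rgb_feat, thermal_feat, gate_logits):
    # per-channel gate g in (0,1): each modality keeps a g-fraction of its
    # own features and receives a (1-g)-fraction of the other modality's
    g = sigmoid(gate_logits)[:, None, None]  # (C,) -> (C,1,1), broadcast over H,W
    fused_rgb = g * rgb_feat + (1.0 - g) * thermal_feat
    fused_thermal = g * thermal_feat + (1.0 - g) * rgb_feat
    return fused_rgb, fused_thermal
```

A saturated gate (g ≈ 1) leaves each stream untouched, while g = 0.5 exchanges the two streams evenly, so the network can learn per-channel how much complementary signal to inject.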
[CV-115] Agentic AI as a Network Control-Plane Intelligence Layer for Federated Learning over 6G
【Quick Read】: This paper addresses the challenges of federated learning (FL) for user-customized local training on distributed, heterogeneous devices over 6G networks, under strict latency, bandwidth, and reliability constraints. Traditional FL focuses only on model training and ignores how network state affects training efficiency. The key innovation is embedding Agentic AI as the control layer of the FL framework, treating FL as a joint optimization of learning and network management. A set of specialized agents (retrieval, planning, coding, and evaluation modules) monitors real-time network conditions (e.g., signal-to-noise ratio, bandwidth, device capabilities) and, aided by closed-loop evaluation and memory, dynamically carries out client selection, incentive structuring, scheduling, resource allocation, adaptive local training, and code generation, yielding an efficient, robust, and iteratively refined FL system.
Link: https://arxiv.org/abs/2603.09141
Authors: Loc X. Nguyen,Ji Su Yoon,Huy Q. Le,Yu Qiao,Avi Deb Raha,Eui-Nam Huh,Nguyen H. Tran,Choong Seon Hong
Affiliations: Kyung Hee University; The University of Sydney
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The shift toward user-customized on-device learning places new demands on wireless systems: models must be trained on diverse, distributed data while meeting strict latency, bandwidth, and reliability constraints. To address this, we propose an Agentic AI as the control layer for managing federated learning (FL) over 6G networks, which translates high-level task goals into actions that are aware of network conditions. Rather than simply viewing FL as a learning challenge, our system sees it as a combined task of learning and network management. A set of specialized agents focused on retrieval, planning, coding, and evaluation utilizes monitoring tools and optimization methods to handle client selection, incentive structuring, scheduling, resource allocation, adaptive local training, and code generation. The use of closed-loop evaluation and memory allows the system to consistently refine its decisions, taking into account varying signal-to-noise ratios, bandwidth conditions, and device capabilities. Finally, our case study has demonstrated the effectiveness of the Agentic AI system’s use of tools for achieving high performance.
[CV-116] Rotation Equivariant Mamba for Vision Tasks
【Quick Read】: This paper addresses the lack of rotation equivariance in current Mamba-based vision architectures, which makes models sensitive to image rotations and limits their robustness and cross-task generalization. The key to the solution is EQ-VMamba, the first rotation-equivariant visual Mamba architecture. Its core innovations are a carefully designed rotation-equivariant cross-scan strategy and group Mamba blocks, backed by a rigorous theoretical analysis proving that the architecture achieves end-to-end rotation equivariance. Experiments show that EQ-VMamba matches or surpasses non-equivariant baselines across multiple vision tasks while using roughly 50% fewer parameters, confirming that embedding rotation equivariance substantially improves both robustness and parameter efficiency.
Link: https://arxiv.org/abs/2603.09138
Authors: Zhongchen Zhao,Qi Xie,Keyu Huang,Lei Zhang,Deyu Meng,Zongben Xu
Affiliations: Xi'an Jiaotong University; Hong Kong Polytechnic University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Rotation equivariance constitutes one of the most general and crucial structural priors for visual data, yet it remains notably absent from current Mamba-based vision architectures. Despite the success of Mamba in natural language processing and its growing adoption in computer vision, existing visual Mamba models fail to account for rotational symmetry in their design. This omission renders them inherently sensitive to image rotations, thereby constraining their robustness and cross-task generalization. To address this limitation, we propose to incorporate rotation symmetry, a universal and fundamental geometric prior in images, into Mamba-based architectures. Specifically, we introduce EQ-VMamba, the first rotation equivariant visual Mamba architecture for vision tasks. The core components of EQ-VMamba include a carefully designed rotation equivariant cross-scan strategy and group Mamba blocks. Moreover, we provide a rigorous theoretical analysis of the intrinsic equivariance error, demonstrating that the proposed architecture enforces end-to-end rotation equivariance throughout the network. Extensive experiments across multiple benchmarks - including high-level image classification task, mid-level semantic segmentation task, and low-level image super-resolution task - demonstrate that EQ-VMamba achieves superior or competitive performance compared to non-equivariant baselines, while requiring approximately 50% fewer parameters. These results indicate that embedding rotation equivariance not only effectively bolsters the robustness of visual Mamba models against rotation transformations, but also enhances overall performance with significantly improved parameter efficiency. Code is available at this https URL.
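End-to-end rotation equivariance means the network commutes with rotations of the input: f(rot(x)) = rot(f(x)). A minimal numerical check of this property for the four 90° rotations, shown only to make the definition concrete (it is far weaker than the architectural guarantee proved in the paper):

```python
import numpy as np

def is_rotation_equivariant(f, x, atol=1e-6):
    # check f(rot90^k(x)) == rot90^k(f(x)) for all four 90-degree rotations
    return all(
        np.allclose(f(np.rot90(x, k)), np.rot90(f(x), k), atol=atol)
        for k in range(4)
    )

x = np.arange(16.0).reshape(4, 4)
# a pointwise nonlinearity commutes with rotation ...
assert is_rotation_equivariant(lambda a: a ** 2, x)
# ... but an orientation-dependent map (scaling columns) does not
assert not is_rotation_equivariant(lambda a: a * np.arange(4.0), x)
```

Standard convolution-plus-raster-scan pipelines behave like the second case, which is precisely the sensitivity EQ-VMamba's equivariant cross-scan is designed to remove.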
[CV-117] Transformer-Based Multi-Region Segmentation and Radiomic Analysis of HR-pQCT Imaging
【Quick Read】: This paper addresses the limitation of conventional osteoporosis diagnostics such as dual-energy X-ray absorptiometry (DXA), which rely solely on areal bone mineral density and ignore bone microarchitecture and surrounding soft tissue. The core question is how to fully exploit the rich multi-dimensional image data acquired by high-resolution peripheral quantitative computed tomography (HR-pQCT) to improve automated osteoporosis classification. The key is a fully automated framework that, for the first time, uses the transformer-based SegFormer architecture for multi-region segmentation of tibial and fibular cortical and trabecular bone plus surrounding soft tissue, achieving a mean segmentation F1 score of 95.36%; 939 radiomic features are then extracted from each anatomical region, dimensionally reduced, and fed to machine-learning classifiers. Soft-tissue features, especially from myotendinous tissue, reach 80.08% accuracy and 0.85 AUROC at the image level and raise patient-level AUROC from 0.792 to 0.875, clearly outperforming bone-only models and demonstrating the importance of integrated multi-tissue analysis for osteoporosis detection.
Link: https://arxiv.org/abs/2603.09137
Authors: Mohseu Rashid Subah,Mohammed Abdul Gani Zilani,Thomas L. Nickolas,Matthew R. Allen,Stuart J. Warden,Rachel K. Surowiec
Affiliations: Purdue University; Washington University Medicine; Indiana University School of Medicine; Indiana University Indianapolis
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Osteoporosis is a skeletal disease typically diagnosed using dual-energy X-ray absorptiometry (DXA), which quantifies areal bone mineral density but overlooks bone microarchitecture and surrounding soft tissues. High-resolution peripheral quantitative computed tomography (HR-pQCT) enables three-dimensional microstructural imaging with minimal radiation. However, current analysis pipelines largely focus on mineralized bone compartments, leaving much of the acquired image data underutilized. We introduce a fully automated framework for binary osteoporosis classification using radiomics features extracted from anatomically segmented HR-pQCT images. To our knowledge, this work is the first to leverage a transformer-based segmentation architecture, i.e., the SegFormer, for fully automated multi-region HR-pQCT analysis. The SegFormer model simultaneously delineated the cortical and trabecular bone of the tibia and fibula along with surrounding soft tissues and achieved a mean F1 score of 95.36%. Soft tissues were further subdivided into skin, myotendinous, and adipose regions through post-processing. From each region, 939 radiomic features were extracted and dimensionally reduced to train six machine learning classifiers on an independent dataset comprising 20,496 images from 122 HR-pQCT scans. The best image level performance was achieved using myotendinous tissue features, yielding an accuracy of 80.08% and an area under the receiver operating characteristic curve (AUROC) of 0.85, outperforming bone-based models. At the patient level, replacing standard biological, DXA, and HR-pQCT parameters with soft tissue radiomics improved AUROC from 0.792 to 0.875. These findings demonstrate that automated, multi-region HR-pQCT segmentation enables the extraction of clinically informative signals beyond bone alone, highlighting the importance of integrated tissue assessment for osteoporosis detection.
[CV-118] QUSR: Quality-Aware and Uncertainty-Guided Image Super-Resolution Diffusion Model ICASSP2026
【Quick Read】: This paper addresses the challenges diffusion models face in real-world image super-resolution (ISR), where degradations are unknown and spatially non-uniform, so conventional methods lose detail or produce visual artifacts. The key is QUSR, a novel super-resolution diffusion model built on two modules: an Uncertainty-Guided Noise Generation (UNG) module that adaptively adjusts noise-injection intensity, applying stronger perturbations in high-uncertainty regions (e.g., edges and textures) to reconstruct complex detail while minimizing noise in low-uncertainty regions (e.g., flat areas) to preserve original information; and a Quality-Aware Prior (QAP) module that leverages an advanced multimodal large language model (MLLM) to generate reliable quality descriptions, providing an effective and interpretable quality prior for the restoration process.
Link: https://arxiv.org/abs/2603.09125
Authors: Junjie Yin,Jiaju Li,Hanfa Xing
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted by ICASSP 2026
Abstract:Diffusion-based image super-resolution (ISR) has shown strong potential, but it still struggles in real-world scenarios where degradations are unknown and spatially non-uniform, often resulting in lost details or visual artifacts. To address this challenge, we propose a novel super-resolution diffusion model, QUSR, which integrates a Quality-Aware Prior (QAP) with an Uncertainty-Guided Noise Generation (UNG) module. The UNG module adaptively adjusts the noise injection intensity, applying stronger perturbations to high-uncertainty regions (e.g., edges and textures) to reconstruct complex details, while minimizing noise in low-uncertainty regions (e.g., flat areas) to preserve original information. Concurrently, the QAP leverages an advanced Multimodal Large Language Model (MLLM) to generate reliable quality descriptions, providing an effective and interpretable quality prior for the restoration process. Experimental results confirm that QUSR can produce high-fidelity and high-realism images in real-world scenarios. The source code is available at this https URL.
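The UNG idea, stronger perturbation where uncertainty is high and weaker where it is low, can be sketched as a per-pixel noise standard deviation modulated by an uncertainty map. The linear modulation rule below is an assumption for illustration; the paper's actual schedule may differ:

```python
import numpy as np

def noise_scale(uncertainty, base_sigma=0.1):
    # per-pixel std: edges/textures (uncertainty near 1) get about 3x the
    # perturbation of flat regions (uncertainty near 0); assumed rule
    return base_sigma * (0.5 + uncertainty)

def inject_noise(image, uncertainty, rng):
    # spatially varying Gaussian noise injection for one diffusion step
    sigma = noise_scale(uncertainty)
    return image + sigma * rng.standard_normal(image.shape)
```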
[CV-119] Progressive Representation Learning for Multimodal Sentiment Analysis with Incomplete Modalities
【Quick Read】: This paper addresses the missing-modality problem common in practical multimodal sentiment analysis (MSA), where the textual, acoustic, or visual modality is partially absent due to noise, hardware failure, or privacy constraints; existing methods cannot handle the feature misalignment between incomplete and complete modalities, degrading fusion and even distorting the well-learned representations of intact modalities. The key is PRLF, a Progressive Representation Learning Framework with two core modules: an Adaptive Modality Reliability Estimator (AMRE) that dynamically quantifies each modality's reliability from recognition confidence and Fisher information to determine the dominant modality, and a Progressive Interaction (ProgInteract) module that iteratively aligns the remaining modalities with the dominant one, enhancing cross-modal consistency while suppressing noise and thereby improving robustness and generalization under uncertain missing-modality conditions.
Link: https://arxiv.org/abs/2603.09111
Authors: Jindi Bao,Jianjun Qian,Mengkai Yan,Jian Yang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Multimodal Sentiment Analysis (MSA) seeks to infer human emotions by integrating textual, acoustic, and visual cues. However, existing approaches often rely on all modalities are completeness, whereas real-world applications frequently encounter noise, hardware failures, or privacy restrictions that result in missing modalities. There exists a significant feature misalignment between incomplete and complete modalities, and directly fusing them may even distort the well-learned representations of the intact modalities. To this end, we propose PRLF, a Progressive Representation Learning Framework designed for MSA under uncertain missing-modality conditions. PRLF introduces an Adaptive Modality Reliability Estimator (AMRE), which dynamically quantifies the reliability of each modality using recognition confidence and Fisher information to determine the dominant modality. In addition, the Progressive Interaction (ProgInteract) module iteratively aligns the other modalities with the dominant one, thereby enhancing cross-modal consistency while suppressing noise. Extensive experiments on CMU-MOSI, CMU-MOSEI, and SIMS verify that PRLF outperforms state-of-the-art methods across both inter- and intra-modality missing scenarios, demonstrating its robustness and generalization capability.
[CV-120] VIVID-Med: LLM-Supervised Structured Pretraining for Deployable Medical ViTs
【Quick Read】: This paper addresses the weakness of current supervision signals for visual encoders in medical image analysis: one-hot labels or free-form text fail to capture the complex semantic relationships among clinical findings. The key is VIVID-Med, which uses a frozen large language model (LLM) as a structured semantic teacher: a Unified Medical Schema (UMS) translates clinical findings into verifiable JSON field-state pairs, with answerability-aware masking to focus optimization, while Structured Prediction Decomposition (SPD) partitions cross-attention into orthogonality-regularized query groups that extract complementary visual aspects. The LLM is discarded after training, leaving only a lightweight vision transformer (ViT) backbone and thereby greatly improving efficiency and deployability.
Link: https://arxiv.org/abs/2603.09109
Authors: Xiyao Wang,Xiaoyu Tan,Yang Dai,Yuxuan Fu,Shuo Li,Xihe Qiu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures
Abstract:Vision-language pretraining has driven significant progress in medical image analysis. However, current methods typically supervise visual encoders using one-hot labels or free-form text, neither of which effectively captures the complex semantic relationships among clinical findings. In this study, we introduce VIVID-Med, a novel framework that leverages a frozen large language model (LLM) as a structured semantic teacher to pretrain medical vision transformers (ViTs). VIVID-Med translates clinical findings into verifiable JSON field-state pairs via a Unified Medical Schema (UMS), utilizing answerability-aware masking to focus optimization. It then employs Structured Prediction Decomposition (SPD) to partition cross-attention into orthogonality-regularized query groups, extracting complementary visual aspects. Crucially, the LLM is discarded post-training, yielding a lightweight, deployable ViT-only backbone. We evaluated VIVID-Med across multiple settings: on CheXpert linear probing, it achieves a macro-AUC of 0.8588, outperforming BiomedCLIP by +6.65 points while using 500x less data. It also demonstrates robust zero-shot cross-domain transfer to NIH ChestX-ray14 (0.7225 macro-AUC) and strong cross-modality generalization to CT, achieving 0.8413 AUC on LIDC-IDRI lung nodule classification and 0.9969 macro-AUC on OrganAMNIST 11-organ classification. VIVID-Med offers a highly efficient, scalable alternative to deploying resource-heavy vision-language models in clinical settings.
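Answerability-aware masking can be sketched as averaging the per-field losses only over JSON fields whose state is actually determinable from the report; the reduction below is an assumed form, not the paper's exact objective:

```python
import numpy as np

def masked_field_loss(per_field_loss, answerable):
    # average the per-field losses only over fields whose state can be
    # verified from the report; unanswerable fields contribute nothing
    mask = np.asarray(answerable, dtype=float)
    return float((per_field_loss * mask).sum() / max(mask.sum(), 1.0))
```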
[CV-121] Composed Vision-Language Retrieval for Skin Cancer Case Search via Joint Alignment of Global and Local Representations
【Quick Read】: This paper addresses how to effectively fuse visual and textual information to improve retrieval accuracy in skin-cancer medical image retrieval, particularly the clinically common setting where a query combines a reference lesion image with a textual description. The key is a transformer-based framework that learns hierarchical composed-query representations and performs joint global-local alignment: local alignment aggregates discriminative regional features via multiple spatial attention masks, global alignment provides holistic semantic supervision, and a domain-informed convex weighting fuses the two, emphasizing clinically salient local evidence while preserving global consistency. On the public Derm7pt dataset the method consistently outperforms state-of-the-art approaches.
Link: https://arxiv.org/abs/2603.09108
Authors: Yuheng Wang,Yuji Lin,Dongrun Zhu,Jiayue Cai,Sunil Kalia,Harvey Lui,Chunqi Chang,Z. Jane Wang,Tim K. Lee
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Medical image retrieval aims to identify clinically relevant lesion cases to support diagnostic decision making, education, and quality control. In practice, retrieval queries often combine a reference lesion image with textual descriptors such as dermoscopic features. We study composed vision-language retrieval for skin cancer, where each query consists of an image to text pair and the database contains biopsy-confirmed, multi-class disease cases. We propose a transformer based framework that learns hierarchical composed query representations and performs joint global-local alignment between queries and candidate images. Local alignment aggregates discriminative regions via multiple spatial attention masks, while global alignment provides holistic semantic supervision. The final similarity is computed through a convex, domain-informed weighting that emphasizes clinically salient local evidence while preserving global consistency. Experiments on the public Derm7pt dataset demonstrate consistent improvements over state-of-the-art methods. The proposed framework enables efficient access to relevant medical records and supports practical clinical deployment.
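The final similarity is described as a convex combination of local and global alignment scores. A minimal sketch, where λ is an assumed domain-informed weight favouring local evidence (not a value from the paper):

```python
import numpy as np

def composed_similarity(local_sims, global_sim, lam=0.7):
    # convex combination: emphasize clinically salient local evidence
    # (mean over the spatial-attention regions) while preserving global
    # consistency; lam in [0, 1] is a domain-informed hyperparameter
    return lam * float(np.mean(local_sims)) + (1.0 - lam) * float(global_sim)
```

Because the combination is convex, the composed score always lies between the mean local similarity and the global similarity, so neither cue can be entirely overridden.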
[CV-122] Training-free Motion Factorization for Compositional Video Generation CVPR2026
【Quick Read】: This paper addresses the failure of current compositional video generation methods to understand the diverse motion categories specified in prompts: existing work focuses on semantic binding while neglecting complex motion modeling. The key is a motion factorization framework that decomposes complex motion into three primary categories, motionlessness, rigid motion, and non-rigid motion, under a planning-before-generation paradigm: during planning, frame-wise changes in each instance's shape and position are reasoned over a motion graph, organizing the user prompt into a structured representation that removes semantic ambiguity; during generation, guidance branches stabilize appearance in motionless regions, preserve rigid-body geometry, and regularize local non-rigid deformations, achieving disentangled control over motion categories. The framework is model-agnostic and can be seamlessly incorporated into various diffusion architectures.
Link: https://arxiv.org/abs/2603.09104
Authors: Zixuan Wang,Ziqin Zhou,Feng Chen,Duo Peng,Yixin Hu,Changsheng Li,Yinjie Lei
Affiliations: Sichuan University; The University of Adelaide; Singapore University of Technology and Design; Beijing Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026
Abstract:Compositional video generation aims to synthesize multiple instances with diverse appearance and motion, which is widely applicable in real-world scenarios. However, current approaches mainly focus on binding semantics, neglecting to understand diverse motion categories specified in prompts. In this paper, we propose a motion factorization framework that decomposes complex motion into three primary categories: motionlessness, rigid motion, and non-rigid motion. Specifically, our framework follows a planning before generation paradigm. (1) During planning, we reason about motion laws on the motion graph to obtain frame-wise changes in the shape and position of each instance. This alleviates semantic ambiguities in the user prompt by organizing it into a structured representation of instances and their interactions. (2) During generation, we modulate the synthesis of distinct motion categories in a disentangled manner. Conditioned on the motion cues, guidance branches stabilize appearance in motionless regions, preserve rigid-body geometry, and regularize local non-rigid deformations. Crucially, our two modules are model-agnostic, which can be seamlessly incorporated into various diffusion model architectures. Extensive experiments demonstrate that our framework achieves impressive performance in motion synthesis on real-world benchmarks. Our code will be released soon.
[CV-123] MedKCO: Medical Vision-Language Pretraining via Knowledge-Driven Cognitive Orchestration CVPR2026
【Quick Read】: This paper addresses the cognitive conflict that arises when current medical vision-language pretraining (MedVLP) methods force models to learn simple and complex concepts simultaneously, producing suboptimal feature representations, especially under distribution shift. The key is MedKCO, a Knowledge-driven Cognitive Orchestration framework that optimizes pretraining via curriculum learning at two levels: a two-level data-ordering strategy based on diagnostic sensitivity and intra-class sample representativeness, and a self-paced asymmetric contrastive loss that dynamically adjusts how different samples participate in vision-language contrastive learning, improving generalization across multiple medical vision-language downstream tasks.
Link: https://arxiv.org/abs/2603.09101
Authors: Chenran Zhang,Ruiqi Wu,Tao Zhou,Yi Zhou
Affiliations: Southeast University; Nanjing University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2026
Abstract:Medical vision-language pretraining (VLP) models have recently been investigated for their generalization to diverse downstream tasks. However, current medical VLP methods typically force the model to learn simple and complex concepts simultaneously. This anti-cognitive process leads to suboptimal feature representations, especially under distribution shift. To address this limitation, we propose a Knowledge-driven Cognitive Orchestration for Medical VLP (MedKCO) that involves both the ordering of the pretraining data and the learning objective of vision-language contrast. Specifically, we design a two level curriculum by incorporating diagnostic sensitivity and intra-class sample representativeness for the ordering of the pretraining data. Moreover, considering the inter-class similarity of medical images, we introduce a self-paced asymmetric contrastive loss to dynamically adjust the participation of the pretraining objective. We evaluate the proposed pretraining method on three medical imaging scenarios in multiple vision-language downstream tasks, and compare it with several curriculum learning methods. Extensive experiments show that our method significantly surpasses all baselines. this https URL.
[CV-124] Chain of Event-Centric Causal Thought for Physically Plausible Video Generation CVPR2026
【Quick Read】: This paper addresses the lack of causal continuity in generative video models when simulating real-world physics: current methods render physical phenomena as a single moment defined by the prompt, without modeling the dynamic evolution and causal relations between events. The key is a framework built on causal event-chain reasoning and cross-modal prompt alignment: a Physics-driven Event Chain Reasoning module uses chain-of-thought reasoning to decompose the physical description into elementary event units, embedding physical formulas as constraints to impose deterministic causal dependencies; a Transition-aware Cross-modal Prompting (TCP) module then converts the causal event units into temporally aligned vision-language prompts, maintaining continuity between events while progressively synthesizing keyframes. The method substantially improves the physical plausibility of video diffusion models across diverse physical scenarios.
Link: https://arxiv.org/abs/2603.09094
Authors: Zixuan Wang,Yixin Hu,Haolan Wang,Feng Chen,Yan Liu,Wen Li,Yinjie Lei
Affiliations: Sichuan University; The University of Adelaide; Hong Kong Polytechnic University; University of Electronic Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026
Abstract:Physically Plausible Video Generation (PPVG) has emerged as a promising avenue for modeling real-world physical phenomena. PPVG requires an understanding of commonsense knowledge, which remains a challenge for video diffusion models. Current approaches leverage commonsense reasoning capability of large language models to embed physical concepts into prompts. However, generation models often render physical phenomena as a single moment defined by prompts, due to the lack of conditioning mechanisms for modeling causal progression. In this paper, we view PPVG as generating a sequence of causally connected and dynamically evolving events. To realize this paradigm, we design two key modules: (1) Physics-driven Event Chain Reasoning. This module decomposes the physical phenomena described in prompts into multiple elementary event units, leveraging chain-of-thought reasoning. To mitigate causal ambiguity, we embed physical formulas as constraints to impose deterministic causal dependencies during reasoning. (2) Transition-aware Cross-modal Prompting (TCP). To maintain continuity between events, this module transforms causal event units into temporally aligned vision-language prompts. It summarizes discrete event descriptions to obtain causally consistent narratives, while progressively synthesizing visual keyframes of individual events by interactive editing. Comprehensive experiments on PhyGenBench and VideoPhy benchmarks demonstrate that our framework achieves superior performance in generating physically plausible videos across diverse physical domains. Our code will be released soon.
[CV-125] OmniEdit: A Training-free framework for Lip Synchronization and Audio-Visual Editing
【Quick Read】: This paper addresses lip synchronization and audio-visual editing in multimodal learning, where existing methods rely on supervised fine-tuning of pretrained models, incurring heavy computational and data costs. The key is OmniEdit, a training-free framework that substitutes the edit sequence in FlowEdit with the target sequence, yielding an unbiased estimate of the desired output, and removes stochastic elements from the generation process to establish a smooth, stable editing trajectory, achieving accurate and robust editing without any additional training.
Link: https://arxiv.org/abs/2603.09084
Authors: Lixiang Lin,Siyuan Jin,Jinshan Zhang
Affiliations: HiThink Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Lip synchronization and audio-visual editing have emerged as fundamental challenges in multimodal learning, underpinning a wide range of applications, including film production, virtual avatars, and telepresence. Despite recent progress, most existing methods for lip synchronization and audio-visual editing depend on supervised fine-tuning of pre-trained models, leading to considerable computational overhead and data requirements. In this paper, we present OmniEdit, a training-free framework designed for both lip synchronization and audio-visual editing. Our approach reformulates the editing paradigm by substituting the edit sequence in FlowEdit with the target sequence, yielding an unbiased estimation of the desired output. Moreover, by removing stochastic elements from the generation process, we establish a smooth and stable editing trajectory. Extensive experimental results validate the effectiveness and robustness of the proposed framework. Code is available at this https URL.
[CV-126] GST-VLA: Structured Gaussian Spatial Tokens for 3D Depth-Aware Vision-Language-Action Models
【Quick Read】: This paper addresses the lack of explicit geometric structure in how vision-language-action (VLA) models encode visual observations, which limits performance on tasks demanding precise spatial understanding. The key is the GST-VLA framework with two core innovations: a Gaussian Spatial Tokenizer (GST) that converts frozen dense depth and semantic patch features into 128 anisotropic 3D Gaussian primitives, each parameterized by a mean, log-scale covariance, and learned opacity that encode local surface orientation and geometric confidence; and 3D Depth-Aware Chain-of-Thought (DA-CoT) reasoning that explicitly supervises four structured intermediate spatial thoughts in the training loss (3D object grounding, grasp contact geometry, pairwise metric distances, and coarse SE(3) waypoints), with a cross-attention sublayer giving direct access to the raw 256-primitive Gaussian field. The method markedly improves precision-demanding tasks, e.g., reaching 96.4% success on the LIBERO benchmark (+2.0%).
Link: https://arxiv.org/abs/2603.09079
Authors: Md Selim Sarowar,Omer Tariq,Sungho Kim
Affiliations: Yeungnam University; Korea Advanced Institute of Science and Technology, KAIST
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: The results presented in this paper are preliminary. Please note that the experiments are currently ongoing, and the final data is subject to change upon the completion of the study. All ideas, results, methods, and any content herein are the sole property of the authors
Abstract:VLA models encode visual observations as 2D patch tokens with no intrinsic geometric structure. We introduce GST-VLA with two contributions. First, the Gaussian Spatial Tokenizer (GST) converts frozen dense depth and frozen semantic patch features into N_g=128 anisotropic 3D Gaussian primitives, each parameterized by a metric residual mean \mu \in \mathbbR^3 , log-scale covariance \log \sigma \in \mathbbR^3 , and learned opacity \alpha \in (0,1) . The covariance eigenstructure encodes local surface orientation, and opacity provides per-primitive geometric confidence, both inaccessible from scalar depth. Spatial attention pooling with learned queries concentrates the fixed token budget on geometrically salient regions rather than distributing uniformly. Second, 3D Depth-Aware Chain-of-Thought (DA-CoT) reasoning supervises four structured intermediate spatial thoughts, covering 3D object grounding, grasp affordance contact geometry, pairwise metric distances, and coarse SE(3) waypoints, as explicit generation targets in the training loss. A cross-attention sublayer at every VLM transformer block provides direct access to the raw 256-primitive Gaussian field during DA-CoT generation. A 300M-parameter flow-matching action expert with mixture-of-experts feedforward sublayers decodes 7-DoF delta action chunks via conditional ODE integration, conditioned on both VLM hidden states and DA-CoT outputs through dual cross-attention. Trained with composite \mathcalL_\mathrmflow + \mathcalL_\mathrmCoT + \mathcalL_\mathrmdepth across three progressive stages, GST-VLA achieves 96.4% on LIBERO (+2.0%), and 80.2% on SimplerEnv (+5.4%). Ablations isolate the contribution of each GST component, each DA-CoT thought, and each training stage, confirming independent and synergistic gains concentrated on precision demanding tasks.
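The per-primitive parameterization above (metric mean μ ∈ ℝ³, log-scale covariance log σ ∈ ℝ³, opacity α ∈ (0,1)) can be read off from unconstrained network outputs as follows. This is a sketch of the parameterization only; the tokenizer's attention pooling and learned queries are omitted:

```python
import numpy as np

def make_gaussian_tokens(raw):
    # raw: (N_g, 7) unconstrained outputs, one row per Gaussian primitive
    mu = raw[:, 0:3]                          # metric residual mean in R^3
    sigma = np.exp(raw[:, 3:6])               # positive scales from log-params
    alpha = 1.0 / (1.0 + np.exp(-raw[:, 6]))  # opacity squashed into (0, 1)
    return mu, sigma, alpha

mu, sigma, alpha = make_gaussian_tokens(np.zeros((128, 7)))  # N_g = 128
```

Parameterizing the scales in log-space guarantees positive-definite (diagonal) covariances without any explicit constraint, and the sigmoid keeps opacity a valid per-primitive confidence.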
[CV-127] Intelligent Spatial Estimation for Fire Hazards in Engineering Sites: An Enhanced YOLOv8-Powered Proximity Analysis Framework
【Quick Read】: This paper addresses the limitation that conventional vision-based surveillance only performs simple fire detection without quantitative assessment of the surrounding hazard, limiting the practical decision-support value of intelligent fire monitoring. The key is an enhanced dual-model YOLOv8 framework: a primary YOLOv8 instance-segmentation model precisely detects fire and smoke, while a secondary COCO-pretrained object-detection model identifies surrounding people, vehicles, and infrastructure; pixel-level distance computation with pixel-to-meter calibration converts spatial relations into real-world distances, and a quantitative risk score combining fire evidence, object vulnerability, and distance-based exposure closes the loop from detection to hazard prioritization.
Link: https://arxiv.org/abs/2603.09069
Authors: Ammar K. AlMhdawi,Nonso Nnamoko,Alaa Mashan Ubaid
Affiliations: Edge Hill University; United Kingdom Foundation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This study proposes an enhanced dual-model YOLOv8 framework for intelligent fire detection and proximity-aware risk assessment, extending conventional vision-based monitoring beyond simple detection to actionable hazard prioritization. The system is trained on a dataset of 9,860 annotated images to segment fire and smoke across complex environments. The framework combines a primary YOLOv8 instance segmentation model for fire and smoke detection with a secondary object detection model pretrained on the COCO dataset to identify surrounding entities such as people, vehicles, and infrastructure. By integrating the outputs of both models, the system computes pixel-based distances between detected fire regions and nearby objects and converts these values into approximate real-world measurements using a pixel-to-meter scaling approach. This proximity information is incorporated into a risk assessment mechanism that combines fire evidence, object vulnerability, and distance-based exposure to produce a quantitative risk score and alert level. The proposed framework achieves strong performance, with precision, recall, and F1 scores exceeding 90% and mAP@0.5 above 91%. The system generates annotated visual outputs showing fire locations, detected objects, estimated distances, and contextual risk information to support situational awareness. Implemented using open-source tools within the Google Colab environment, the framework is lightweight and suitable for deployment in industrial and resource-constrained settings.
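The risk-assessment step, combining fire evidence, object vulnerability, and distance-based exposure into one score, can be sketched as below. The linear exposure falloff and the 10 m safe radius are illustrative assumptions, not values from the paper:

```python
def risk_score(fire_conf, vulnerability, distance_px, px_per_meter,
               safe_radius_m=10.0):
    # pixel-to-meter conversion, then a [0, 1] exposure term that is 1 at
    # the fire and decays linearly to 0 at the assumed safe radius
    distance_m = distance_px / px_per_meter
    exposure = max(0.0, 1.0 - distance_m / safe_radius_m)
    return fire_conf * vulnerability * exposure
```

Under this rule, a highly vulnerable object (e.g., a person) 2 m from a confidently detected fire scores far higher than the same object 8 m away, which is exactly the prioritization behavior the alert levels are meant to reflect.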
[CV-128] Spectral-Structured Diffusion for Single-Image Rain Removal
【Quick Read】: This paper addresses single-image rain removal, whose core challenge is that rain streaks appear as directional, frequency-concentrated structures overlapping across multiple scales, spectral characteristics that standard spatial-domain diffusion does not explicitly model. The key is SpectralDiff, which introduces structured spectral perturbations into the diffusion process to guide the progressive suppression of multi-directional rain components, together with a full-product U-Net architecture that uses the convolution theorem to replace convolution operations with element-wise product layers, preserving modeling capacity while improving computational efficiency.
Link: https://arxiv.org/abs/2603.09054
Authors: Yucheng Xing,Xin Wang
Affiliations: Stony Brook University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 4 figures
Abstract:Rain streaks manifest as directional and frequency-concentrated structures that overlap across multiple scales, making single-image rain removal particularly challenging. While diffusion-based restoration models provide a powerful framework for progressive denoising, standard spatial-domain diffusion does not explicitly account for such structured spectral characteristics. We introduce SpectralDiff, a spectral-structured diffusion-based framework tailored for single-image rain removal. Rather than redefining the diffusion formulation, our method incorporates structured spectral perturbations to guide the progressive suppression of multi-directional rain components. To support this design, we further propose a full-product U-Net architecture that leverages the convolution theorem to replace convolution operations with element-wise product layers, improving computational efficiency while preserving modeling capacity. Extensive experiments on synthetic and real-world benchmarks demonstrate that SpectralDiff achieves competitive rain removal performance with improved model compactness and favorable inference efficiency compared to existing diffusion-based approaches.
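The full-product design rests on the convolution theorem: circular convolution in the spatial domain equals element-wise multiplication in the frequency domain. A minimal 2-D demonstration of that identity, independent of the paper's architecture:

```python
import numpy as np

def circular_conv_fft(x, k):
    # convolution theorem: FFT both operands (kernel zero-padded to the
    # image size), multiply element-wise, then inverse FFT back
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k, s=x.shape)))
```

Replacing the sliding-window sum with one element-wise product per layer is what makes the full-product U-Net cheaper while remaining mathematically equivalent to (circular) convolution.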
[CV-129] WS-Net: Weak-Signal Representation Learning and Gated Abundance Reconstruction for Hyperspectral Unmixing via State-Space and Weak Signal Attention Fusion
【Quick Read】: This paper addresses inaccurate abundance estimation in hyperspectral images, where weak spectral responses are obscured by dominant endmembers and sensor noise. The key is the WS-Net framework, whose core innovations include a multi-resolution wavelet-fused encoder built on state-space modelling that captures both high-frequency discontinuities and smooth spectral variation, a Weak Signal Attention branch that selectively enhances low-similarity spectral cues, a learnable gating mechanism that adaptively fuses the two representations, and KL-divergence-based regularization in the decoder that enforces separability between dominant and weak endmembers. The design markedly stabilizes weak-signal identification, particularly under low-SNR conditions.
Link: https://arxiv.org/abs/2603.09037
Authors: Zekun Long,Ali Zia,Guanyiman Fu,Vivien Rolland,Jun Zhou
Affiliations: Griffith University; La Trobe University; CSIRO Agriculture and Food
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Weak spectral responses in hyperspectral images are often obscured by dominant endmembers and sensor noise, resulting in inaccurate abundance estimation. This paper introduces WS-Net, a deep unmixing framework specifically designed to address weak-signal collapse through state-space modelling and Weak Signal Attention fusion. The network features a multi-resolution wavelet-fused encoder that captures both high-frequency discontinuities and smooth spectral variations with a hybrid backbone that integrates a Mamba state-space branch for efficient long-range dependency modelling. It also incorporates a Weak Signal Attention branch that selectively enhances low-similarity spectral cues. A learnable gating mechanism adaptively fuses both representations, while the decoder leverages KL-divergence-based regularisation to enforce separability between dominant and weak endmembers. Experiments on one simulated and two real datasets (synthetic dataset, Samson, and Apex) demonstrate consistent improvements over six state-of-the-art baselines, achieving up to 55% and 63% reductions in RMSE and SAD, respectively. The framework maintains stable accuracy under low-SNR conditions, particularly for weak endmembers, establishing WS-Net as a robust and computationally efficient benchmark for weak-signal hyperspectral unmixing.
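摘要中"可学习门控机制自适应融合两种表征"的做法,可用如下示意代码理解(维度、权重与变量名均为假设,非论文实现):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(f_ssm, f_wsa, w_g, b_g):
    """门控系数 g 由两分支特征共同决定,对两路表征做逐通道凸组合。"""
    g = sigmoid(np.concatenate([f_ssm, f_wsa]) @ w_g + b_g)  # (C,)
    return g * f_ssm + (1.0 - g) * f_wsa

rng = np.random.default_rng(0)
C = 8
f_ssm = rng.normal(size=C)                # 状态空间(Mamba)分支特征,假设
f_wsa = rng.normal(size=C)                # 弱信号注意力分支特征,假设
w_g = rng.normal(size=(2 * C, C)) * 0.1
b_g = np.zeros(C)
fused = gated_fusion(f_ssm, f_wsa, w_g, b_g)
```

由于 g 严格落在 (0, 1),融合结果在每个通道上都介于两分支特征之间,弱信号分支的贡献不会被完全抹去。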
[CV-130] An accurate flatness measure to estimate the generalization performance of CNN models
【速读】:该论文旨在解决现有平坦度(flatness)度量方法在应用于现代卷积神经网络(Convolutional Neural Networks, CNNs)时存在的两个核心问题:一是多数现有定义仅适用于全连接结构,依赖于Hessian矩阵迹的随机估计器;二是忽略了CNN特有的几何结构。其解决方案的关键在于推导出一种针对使用全局平均池化(global average pooling)后接线性分类器的CNN架构下,交叉熵损失关于卷积核的Hessian矩阵迹的闭式表达式,并在此基础上提出一种参数感知的相对平坦度度量,该度量能准确刻画卷积与池化操作所引入的尺度对称性和滤波器交互关系。这一方法实现了对CNN架构的忠实建模,从而为评估和比较CNN模型的泛化性能提供了更可靠工具。
链接: https://arxiv.org/abs/2603.09016
作者: Rahman Taleghani,Maryam Mohammadi,Francesco Marchetti
机构: University of Padova (帕多瓦大学); Khaje Nasir Toosi University of Technology (K. N. 图西理工大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Flatness measures based on the spectrum or the trace of the Hessian of the loss are widely used as proxies for the generalization ability of deep networks. However, most existing definitions are either tailored to fully connected architectures, relying on stochastic estimators of the Hessian trace, or ignore the specific geometric structure of modern Convolutional Neural Networks (CNNs). In this work, we develop a flatness measure that is both exact and architecturally faithful for a broad and practically relevant class of CNNs. We first derive a closed-form expression for the trace of the Hessian of the cross-entropy loss with respect to convolutional kernels in networks that use global average pooling followed by a linear classifier. Building on this result, we then specialize the notion of relative flatness to convolutional layers and obtain a parameterization-aware flatness measure that properly accounts for the scaling symmetries and filter interactions induced by convolution and pooling. Finally, we empirically investigate the proposed measure on families of CNNs trained on standard image-classification benchmarks. The results obtained suggest that the proposed measure can serve as a robust tool to assess and compare the generalization performance of CNN models, and to guide the design of architecture and training choices in practice.
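论文所批评的"Hessian 迹的随机估计器"通常指 Hutchinson 估计。下面在一个 Hessian 已知的二次型上用 NumPy 演示该估计与精确迹的对比(示意性说明,与论文推导的闭式表达式无关):

```python
import numpy as np

def hutchinson_trace(A, n_samples, rng):
    """Hutchinson 估计:E[v^T A v] = tr(A),v 为 Rademacher 随机向量。"""
    d = A.shape[0]
    est = 0.0
    for _ in range(n_samples):
        v = rng.choice([-1.0, 1.0], size=d)
        est += v @ A @ v
    return est / n_samples

rng = np.random.default_rng(42)
d = 16
M = rng.normal(size=(d, d))
A = M @ M.T                      # 对称半正定矩阵,类比损失的 Hessian
exact = np.trace(A)
approx = hutchinson_trace(A, 2000, rng)
```

估计的方差随非对角元能量增大,这也是论文主张用闭式迹替代随机估计的动机之一。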
[CV-131] The Coupling Within: Flow Matching via Distilled Normalizing Flows ICML2026
【速读】:该论文旨在解决流模型(Flow Models)训练中耦合策略选择对性能影响的问题,特别是传统独立耦合(independent coupling)在采样效率和生成质量上的局限性。现有研究表明,基于最优传输(Optimal Transport, OT)等方法的自适应耦合虽能提升效果,但计算复杂度高且难以高效优化。解决方案的关键在于提出一种新的"归一化流匹配"(Normalized Flow Matching, NFM)方法:通过蒸馏预训练归一化流(Normalizing Flows, NF)模型所具备的双射映射能力(bijection),将其中隐含的准确定性(quasi-deterministic)耦合关系迁移至学生流模型中,从而实现更高效、高质量的训练与推理。此方法使学生模型在性能上超越使用独立或OT耦合的流模型,并优于原始教师AR-NF模型。
链接: https://arxiv.org/abs/2603.09014
作者: David Berthelot,Tianrong Chen,Jiatao Gu,Marco Cuturi,Laurent Dinh,Bhavik Chandna,Michal Klein,Josh Susskind,Shuangfei Zhai
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ICML 2026
Abstract:Flow models have rapidly become the go-to method for training and deploying large-scale generators, owing their success to inference-time flexibility via adjustable integration steps. A crucial ingredient in flow training is the choice of coupling measure for sampling noise/data pairs that define the flow matching (FM) regression loss. While FM training defaults usually to independent coupling, recent works show that adaptive couplings informed by noise/data distributions (e.g., via optimal transport, OT) improve both model training and inference. We radicalize this insight by shifting the paradigm: rather than computing adaptive couplings directly, we use distilled couplings from a different, pretrained model capable of placing noise and data spaces in bijection – a property intrinsic to normalizing flows (NF) through their maximum likelihood and invertibility requirements. Leveraging recent advances in NF image generation via auto-regressive (AR) blocks, we propose Normalized Flow Matching (NFM), a new method that distills the quasi-deterministic coupling of pretrained NF models to train student flow models. These students achieve the best of both worlds: significantly outperforming flow models trained with independent or even OT couplings, while also improving on the teacher AR-NF model.
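无论耦合来自独立采样、OT 还是本文蒸馏自 NF 教师的准确定性耦合,流匹配的回归目标形式相同。以下为最小示意(用随机数对代替真实耦合,非论文实现):

```python
import numpy as np

def fm_target(x0, x1, t):
    """线性插值路径 x_t = (1-t)·x0 + t·x1,回归目标速度为 x1 - x0。"""
    xt = (1.0 - t) * x0 + t * x1
    v = x1 - x0
    return xt, v

def fm_loss(pred_v, x0, x1):
    """给定耦合下 (x0, x1) 对的流匹配回归损失。"""
    return float(np.mean((pred_v - (x1 - x0)) ** 2))

rng = np.random.default_rng(1)
x0 = rng.normal(size=4)          # 噪声样本
x1 = rng.normal(size=4)          # 数据样本;论文中由 NF 教师的双射给出,此处随机代替
xt, v = fm_target(x0, x1, 0.3)
loss_perfect = fm_loss(v, x0, x1)   # 完美预测时损失为 0
```

蒸馏耦合改变的只是 (x0, x1) 如何配对:配对越"确定",回归目标的方差越小,学生模型越容易训练。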
[CV-132] Diffusion-Based Authentication of Copy Detection Patterns: A Multimodal Framework with Printer Signature Conditioning WACV2026
【速读】:该论文旨在解决传统防伪认证系统在面对高分辨率打印/扫描设备和生成式深度学习技术进步时,难以区分高质量伪造品与真品的问题。其解决方案的关键在于提出一种基于扩散模型(diffusion-based)的认证框架,该框架联合利用原始二值模板、印刷后的复制检测图案(Copy Detection Pattern, CDP)以及捕获打印机语义信息的打印机身份表示,将认证任务建模为多类打印机分类问题,并通过空间和文本条件控制的去噪过程提取设备特异性细粒度特征,从而实现对未见过的伪造类型具有泛化能力的可靠鉴别。
链接: https://arxiv.org/abs/2603.08998
作者: Bolutife Atoki,Iuliia Tkachenko,Bertrand Kerautret,Carlos Crispim-Junior
机构: Université Lumière Lyon 2, CNRS, INSA Lyon, Universite Claude Bernard Lyon 1, LIRIS (LIRIS实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at WACV 2026
Abstract:Counterfeiting affects diverse industries, including pharmaceuticals, electronics, and food, posing serious health and economic risks. Printable unclonable codes, such as Copy Detection Patterns (CDPs), are widely used as an anti-counterfeiting measure and are applied to products and packaging. However, the increasing availability of high-resolution printing and scanning devices, along with advances in generative deep learning, undermines traditional authentication systems, which often fail to distinguish high-quality counterfeits from genuine prints. In this work, we propose a diffusion-based authentication framework that jointly leverages the original binary template, the printed CDP, and a representation of printer identity that captures relevant semantic information. Formulating authentication as multi-class printer classification over printer signatures lets our model capture fine-grained, device-specific features via spatial and textual conditioning. We extend ControlNet by repurposing the denoising process for class-conditioned noise prediction, enabling effective printer classification. On the Indigo 1 x 1 Base dataset, our method outperforms traditional similarity metrics and prior deep learning approaches. Results show the framework generalises to counterfeit types unseen during training.
[CV-133] SkipGS: Post-Densification Backward Skipping for Efficient 3DGS Training
【速读】:该论文旨在解决3D Gaussian Splatting (3DGS) 在后密度细化阶段(post-densification refinement phase)训练效率低下的问题,特别是反向传播(backward pass)占用大量计算时间且存在显著的更新冗余——许多采样视图的损失已接近平稳状态,但仍执行完整的反向传播,导致资源浪费。解决方案的关键在于提出SkipGS,其核心是一个新颖的视图自适应反向门控机制(view-adaptive backward gating mechanism),该机制在保持前向传播以更新每视图损失统计的同时,动态跳过那些损失稳定、梯度收益微弱的视图的反向传播,同时通过设定最小反向传播预算保障优化稳定性。此方法无需修改渲染器、表示或损失函数,具备即插即用特性,并可与其他高效策略兼容,实现显著加速。
链接: https://arxiv.org/abs/2603.08997
作者: Jingxing Li,Yongjae Lee,Deliang Fan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D Gaussian Splatting (3DGS) achieves real-time novel-view synthesis by optimizing millions of anisotropic Gaussians, yet its training remains expensive, with the backward pass dominating runtime in the post-densification refinement phase. We observe substantial update redundancy in this phase: many sampled views have near-plateaued losses and provide diminishing gradient benefits, but standard training still runs full backpropagation. We propose SkipGS with a novel view-adaptive backward gating mechanism for efficient post-densification training. SkipGS always performs the forward pass to update per-view loss statistics, and selectively skips backward passes when the sampled view’s loss is consistent with its recent per-view baseline, while enforcing a minimum backward budget for stable optimization. On Mip-NeRF 360, compared to 3DGS, SkipGS reduces end-to-end training time by 23.1%, driven by a 42.0% reduction in post-densification time, with comparable reconstruction quality. Because it only changes when to backpropagate – without modifying the renderer, representation, or loss – SkipGS is plug-and-play and compatible with other complementary efficiency strategies for additive speedups.
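SkipGS 的"视图自适应反向门控"思路可用如下纯 Python 伪实现示意(阈值与预算参数均为假设值,非论文实现):

```python
def should_backward(loss, baseline, tol, steps_since_backward, min_budget_every):
    """若当前视图损失与其近期基线一致则跳过反传;
    但每 min_budget_every 步至少反传一次,保证优化稳定。"""
    if steps_since_backward >= min_budget_every:
        return True
    return abs(loss - baseline) > tol * baseline

# 模拟:损失已近平台期的视图大多被跳过,损失突变的视图触发反传
baseline, tol = 0.10, 0.05
losses = [0.101, 0.099, 0.100, 0.130, 0.100]
decisions = []
since = 0
for l in losses:
    do_bwd = should_backward(l, baseline, tol, since, min_budget_every=4)
    decisions.append(do_bwd)
    since = 0 if do_bwd else since + 1
```

前向传播始终执行以更新每视图损失统计;被门控跳过的只是代价最高的反向传播,因此渲染器与损失函数无需任何改动。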
[CV-134] SurgCalib: Gaussian Splatting-Based Hand-Eye Calibration for Robot-Assisted Minimally Invasive Surgery
【速读】:该论文旨在解决达芬奇手术机器人(da Vinci surgical robot)中手眼标定(hand-eye calibration)的问题,即准确估计机器人基座坐标系与相机坐标系之间的刚性变换(rigid transformation),以确保视觉引导下的闭环控制可靠性。针对缆绳驱动型手术机器人因缆绳拉伸和回差导致的本体感知测量不准确问题,传统基于已知标志物的标定方法难以在手术室(OR)环境中应用,因其可能破坏无菌规范并干扰手术流程。解决方案的关键在于提出SurgCalib框架——一种全自动、无需标记的标定方法:首先利用原始运动学数据初始化器械位姿,随后通过两阶段优化程序,在远程运动中心(RCM, Remote Center of Motion)约束下结合高斯点绘(Gaussian Splatting-based)可微渲染管线对位姿进行精修,从而实现高精度标定,实验表明其在公共dVRK基准数据集SurgPose上达到毫米级误差水平(2D重投影误差约12.24 px(2.06 mm),3D工具尖端欧氏距离误差约5.98 mm)。
链接: https://arxiv.org/abs/2603.08983
作者: Zijian Wu,Shuojue Yang,Yu Chung Lee,Eitan Prisman,Yueming Jin,Septimiu E. Salcudean
机构: University of British Columbia(不列颠哥伦比亚大学); National University of Singapore(新加坡国立大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 7 figures
Abstract:We present a Gaussian Splatting-based framework for hand-eye calibration of the da Vinci surgical robot. In a vision-guided robotic system, accurate estimation of the rigid transformation between the robot base and the camera frame is essential for reliable closed-loop control. For cable-driven surgical robots, this task faces unique challenges. The encoders of surgical instruments often produce inaccurate proprioceptive measurements due to cable stretch and backlash. Conventional hand-eye calibration approaches typically rely on known fiducial patterns and solve the AX = XB formulation. While effective, introducing additional markers into the operating room (OR) environment can violate sterility protocols and disrupt surgical workflows. In this study, we propose SurgCalib, an automatic, markerless framework that has the potential to be used in the OR. SurgCalib first initializes the pose of the surgical instrument using raw kinematic measurements and subsequently refines this pose through a two-phase optimization procedure under the RCM constraint within a Gaussian Splatting-based differentiable rendering pipeline. We evaluate the proposed method on the public dVRK benchmark, SurgPose. The results demonstrate average 2D tool-tip reprojection errors of 12.24 px (2.06 mm) and 11.33 px (1.9 mm), and 3D tool-tip Euclidean distance errors of 5.98 mm and 4.75 mm, for the left and right instruments, respectively.
[CV-135] SVG-EAR: Parameter-Free Linear Compensation for Sparse Video Generation via Error-aware Routing
【速读】:该论文旨在解决扩散模型(Diffusion Models)中视频生成任务的计算效率瓶颈问题,具体表现为传统扩散 Transformer(DiTs)因二次复杂度的注意力机制导致的高计算开销。现有稀疏注意力方法要么直接丢弃部分注意力块造成信息损失,要么依赖额外训练的预测器近似缺失块,从而引入训练负担和输出分布偏移。解决方案的关键在于提出 SVG-EAR——一种无需训练的线性补偿分支,其核心思想是:通过语义聚类发现每个注意力块内键(key)和值(value)的高度相似性,利用聚类中心点(centroid)对被跳过的块进行准确近似;进一步地,引入轻量级误差探测器实现误差感知路由(error-aware routing),动态选择补偿误差与计算成本比最高的块进行精确计算,从而在保持生成质量的同时显著提升推理效率。理论分析表明注意力重建误差与聚类质量呈负相关,实验验证了该方法在多个视频扩散基准上实现了更优的质量-效率帕累托前沿(Pareto frontier)。
链接: https://arxiv.org/abs/2603.08982
作者: Xuanyi Zhou,Qiuyang Mang,Shuo Yang,Haocheng Xi,Jintao Zhang,Huanzhi Mao,Joseph E. Gonzalez,Kurt Keutzer,Ion Stoica,Alvin Cheung
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion Transformers (DiTs) have become a leading backbone for video generation, yet their quadratic attention cost remains a major bottleneck. Sparse attention reduces this cost by computing only a subset of attention blocks. However, prior methods often either drop the remaining blocks, which incurs information loss, or rely on learned predictors to approximate them, introducing training overhead and potential output distribution shifting. In this paper, we show that the missing contributions can be recovered without training: after semantic clustering, keys and values within each block exhibit strong similarity and can be well summarized by a small set of cluster centroids. Based on this observation, we introduce SVG-EAR, a parameter-free linear compensation branch that uses the centroid to approximate skipped blocks and recover their contributions. While centroid compensation is accurate for most blocks, it can fail on a small subset. Standard sparsification typically selects blocks by attention scores, which indicate where the model places its attention mass, but not where the approximation error would be largest. SVG-EAR therefore performs error-aware routing: a lightweight probe estimates the compensation error for each block, and we compute exactly the blocks with the highest error-to-cost ratio while compensating for skipped blocks. We provide theoretical guarantees that relate attention reconstruction error to clustering quality, and empirically show that SVG-EAR improves the quality-efficiency trade-off and increases throughput at the same generation fidelity on video diffusion tasks. Overall, SVG-EAR establishes a clear Pareto frontier over prior approaches, achieving up to 1.77× and 1.93× speedups while maintaining PSNRs of up to 29.759 and 31.043 on Wan2.2 and HunyuanVideo, respectively.
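质心补偿的直觉是:被跳过块内的键/值彼此高度相似时,用质心(按块大小重复,以保持 softmax 的权重质量)即可准确近似其注意力贡献。以下 NumPy 示意在"块内完全相同"的极端情形下验证这一点(非论文实现):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention(q, K, V):
    return softmax(q @ K.T) @ V

def centroid_compensated(q, K_keep, V_keep, K_skip, V_skip):
    """跳过块用其键/值质心重复 n 次来近似,补偿其注意力贡献。"""
    n = K_skip.shape[0]
    K_c = np.repeat(K_skip.mean(0, keepdims=True), n, axis=0)
    V_c = np.repeat(V_skip.mean(0, keepdims=True), n, axis=0)
    return attention(q, np.vstack([K_keep, K_c]), np.vstack([V_keep, V_c]))

rng = np.random.default_rng(7)
d = 4
q = rng.normal(size=d)
K_keep = rng.normal(size=(3, d)); V_keep = rng.normal(size=(3, d))
k = rng.normal(size=d); v = rng.normal(size=d)
K_skip = np.tile(k, (3, 1)); V_skip = np.tile(v, (3, 1))  # 块内键值完全相同(极端情形)
exact = attention(q, np.vstack([K_keep, K_skip]), np.vstack([V_keep, V_skip]))
approx = centroid_compensated(q, K_keep, V_keep, K_skip, V_skip)
```

块内相似度越低,近似误差越大——这正是论文引入误差感知路由、优先精确计算高误差块的原因。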
[CV-136] Can You Hear Localize and Segment Continually? An Exemplar-Free Continual Learning Benchmark for Audio-Visual Segmentation
【速读】:该论文旨在解决音频-视觉分割(Audio-Visual Segmentation, AVS)在动态现实环境中面临的持续学习挑战,即现有方法假设训练数据分布静态不变,而实际场景中音频与视觉信号的分布会随时间演化,导致模型出现灾难性遗忘(catastrophic forgetting)。为应对这一问题,作者提出了首个免示例(exemplar-free)的持续学习基准,并设计了一个强基线模型ATLAS,其关键创新在于通过音频引导的预融合条件机制(audio-guided pre-fusion conditioning),利用投影后的音频上下文在跨模态注意力之前调制视觉特征通道;同时引入低秩锚定(Low-Rank Anchoring, LRA)策略,基于损失敏感性稳定适应后的权重,从而有效缓解灾难性遗忘,实现在多样化持续学习场景下的高性能表现。
链接: https://arxiv.org/abs/2603.08967
作者: Siddeshwar Raghavan,Gautham Vinod,Bruce Coburn,Fengqing Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注:
Abstract:Audio-Visual Segmentation (AVS) aims to produce pixel-level masks of sound producing objects in videos, by jointly learning from audio and visual signals. However, real-world environments are inherently dynamic, causing audio and visual distributions to evolve over time, which challenge existing AVS systems that assume static training settings. To address this gap, we introduce the first exemplar-free continual learning benchmark for Audio-Visual Segmentation, comprising four learning protocols across single-source and multi-source AVS datasets. We further propose a strong baseline, ATLAS, which uses audio-guided pre-fusion conditioning to modulate visual feature channels via projected audio context before cross-modal attention. Finally, we mitigate catastrophic forgetting by introducing Low-Rank Anchoring (LRA), which stabilizes adapted weights based on loss sensitivity. Extensive experiments demonstrate competitive performance across diverse continual scenarios, establishing a foundation for lifelong audio-visual perception. Code is available at this https URL (paper under review). Keywords: Continual Learning; Audio-Visual Segmentation; Multi-Modal Learning. Subjects: Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS) Cite as: arXiv:2603.08967 [cs.CV] (or arXiv:2603.08967v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2603.08967 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
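ATLAS 的"音频引导预融合条件机制"在形式上接近 FiLM 式的逐通道调制:投影后的音频上下文产生缩放与偏移,在跨模态注意力之前作用于视觉特征。以下为示意实现(权重、维度与函数形式均为假设,非论文代码):

```python
import numpy as np

def audio_conditioning(visual, audio_ctx, w_scale, w_shift):
    """由音频上下文生成逐通道 scale/shift,调制视觉特征(FiLM 风格,示意)。"""
    scale = 1.0 + np.tanh(audio_ctx @ w_scale)   # (C,)
    shift = audio_ctx @ w_shift                  # (C,)
    return visual * scale[None, :] + shift[None, :]

rng = np.random.default_rng(3)
C, N, A = 6, 10, 8
visual = rng.normal(size=(N, C))     # N 个空间位置的视觉特征
audio_ctx = rng.normal(size=A)       # 投影后的音频上下文,假设
w_scale = rng.normal(size=(A, C)) * 0.1
w_shift = rng.normal(size=(A, C)) * 0.1
modulated = audio_conditioning(visual, audio_ctx, w_scale, w_shift)
```

当音频上下文为零时调制退化为恒等映射,视觉特征不受干扰,这使条件机制可以安全地叠加在原有主干之上。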
[CV-137] Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning
【速读】:该论文旨在解决农业数字孪生中植物仿真配置生成的高复杂度与低吞吐量问题,即功能结构植物模型(Functional-Structural Plant Models, FSPMs)在大规模部署时面临的效率瓶颈。其解决方案的关键在于利用先进的开源视觉语言模型(Vision Language Models, VLMs),如Gemma 3和Qwen3-VL,直接从无人机遥感图像中生成结构化的JSON格式仿真参数,从而实现对田间作物布局的自动化、可扩展重建,显著提升数字孪生系统在农业场景中的构建效率与实用性。
链接: https://arxiv.org/abs/2603.08930
作者: Heesup Yun,Isaac Kazuo Uyehara,Earl Ranario,Lars Lundqvist,Christine H. Diepenbrock,Brian N. Bailey,J. Mason Earles
机构: University of California, Davis (加州大学戴维斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper introduces a synthetic benchmark to evaluate the performance of vision language models (VLMs) in generating plant simulation configurations for digital twins. While functional-structural plant models (FSPMs) are useful tools for simulating biophysical processes in agricultural environments, their high complexity and low throughput create bottlenecks for deployment at scale. We propose a novel approach that leverages state-of-the-art open-source VLMs – Gemma 3 and Qwen3-VL – to directly generate simulation parameters in JSON format from drone-based remote sensing images. Using a synthetic cowpea plot dataset generated via the Helios 3D procedural plant generation library, we tested five in-context learning methods and evaluated the models across three categories: JSON integrity, geometric evaluations, and biophysical evaluations. Our results show that while VLMs can interpret structural metadata and estimate parameters like plant count and sun azimuth, they often exhibit performance degradation due to contextual bias or rely on dataset means when visual cues are insufficient. Validation on a real-world drone orthophoto dataset and an ablation study using a blind baseline further characterize the models’ reasoning capabilities versus their reliance on contextual priors. To the best of our knowledge, this is the first study to utilize VLMs to generate structural JSON configurations for plant simulations, providing a scalable framework for reconstruction 3D plots for digital twin in agriculture.
[CV-138] TIDE: Text-Informed Dynamic Extrapolation with Step-Aware Temperature Control for Diffusion Transformers
【速读】:该论文针对扩散 Transformer(Diffusion Transformer, DiT)在生成高于训练分辨率图像时面临的结构退化问题展开研究,尤其关注因注意力稀释导致的提示信息丢失和细粒度语义细节丧失。解决方案的关键在于提出一种无需额外训练的文本到图像(text-to-image, T2I)外推方法 TIDE,其核心创新包括:1)通过引入文本锚定机制(text anchoring mechanism),校正文本与图像 token 之间的不平衡,从而缓解提示信息损失;2)设计动态温度控制机制,利用扩散过程中频谱演进的规律有效消除伪影。该方法可在不增加采样开销的前提下实现任意分辨率和宽高比的高质量图像生成,并可无缝集成至现有先进模型中。
链接: https://arxiv.org/abs/2603.08928
作者: Yihua Liu,Fanjiang Ye,Bowen Lin,Rongyu Fang,Chengming Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion Transformer (DiT) faces challenges when generating images with higher resolution compared at training resolution, causing especially structural degradation due to attention dilution. Previous approaches attempt to mitigate this by sharpening attention distributions, but fail to preserve fine-grained semantic details and introduce obvious artifacts. In this work, we analyze the characteristics of DiTs and propose TIDE, a training-free text-to-image (T2I) extrapolation method that enables generation with arbitrary resolution and aspect ratio without additional sampling overhead. We identify the core factor for prompt information loss, and introduce a text anchoring mechanism to correct the imbalance between text and image tokens. To further eliminate artifacts, we design a dynamic temperature control mechanism that leverages the pattern of spectral progression in the diffusion process. Extensive evaluations demonstrate that TIDE delivers high-quality resolution extrapolation capability and integrates seamlessly with existing state-of-the-art methods.
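"动态温度控制"的基本操作是按去噪步数调节注意力 softmax 的温度:低温使注意力更锐以对抗外推时的注意力稀释,高温则回到标准行为。以下 NumPy 示意给出一种线性调度(调度形式与参数均为假设,非论文实现):

```python
import numpy as np

def temp_softmax(scores, tau):
    """带温度的 softmax:tau < 1 使分布更尖锐。"""
    z = scores / tau
    e = np.exp(z - z.max())
    return e / e.sum()

def step_aware_tau(step, total_steps, tau_min=0.7, tau_max=1.0):
    """随去噪步数线性升温:早期低温稳住全局结构,后期恢复标准注意力(示意)。"""
    return tau_min + (tau_max - tau_min) * step / (total_steps - 1)

scores = np.array([2.0, 1.0, 0.5, 0.1])
p_early = temp_softmax(scores, step_aware_tau(0, 50))    # 更尖锐
p_late = temp_softmax(scores, step_aware_tau(49, 50))    # 标准 softmax
```

论文依据扩散过程中的频谱演进规律设计调度;此处的线性形式仅用于说明机制。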
[CV-139] MEGC2026: Micro-Expression Grand Challenge on Visual Question Answering
【速读】:该论文旨在解决面部微表情(Facial Micro-expressions, MEs)在复杂现实场景中的理解与分析难题,尤其聚焦于如何利用多模态大语言模型(Multimodal Large Language Models, MLLMs)和大视觉语言模型(Large Vision-Language Models, LVLMs)提升对微表情的识别、定位及推理能力。其解决方案的关键在于提出两个新型任务:微表情视频问答(ME-VQA)和长视频微表情问答(ME-LVQA),分别针对短时视频和长时间序列中微表情的理解与推理问题,借助大模型强大的跨模态语义理解和时间序列建模能力,实现对微表情相关意图、情绪状态及变化过程的精准解析。
链接: https://arxiv.org/abs/2603.08927
作者: Xinqi Fan,Jingting Li,John See,Moi Hoon Yap,Su-Jing Wang,Adrian K. Davison
机构: Manchester Metropolitan University (曼彻斯特都会大学); Institute of Psychology, CAS (中国科学院心理研究所); University of the Chinese Academy of Sciences (中国科学院大学); Heriot-Watt University Malaysia (赫瑞-瓦特大学马来西亚分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: MEGC 2026 at IEEE FG 2026
Abstract:Facial micro-expressions (MEs) are involuntary movements of the face that occur spontaneously when a person experiences an emotion but attempts to suppress or repress the facial expression, typically found in a high-stakes environment. In recent years, substantial advancements have been made in the areas of ME recognition, spotting, and generation. The emergence of multimodal large language models (MLLMs) and large vision-language models (LVLMs) offers promising new avenues for enhancing ME analysis through their powerful multimodal reasoning capabilities. The ME grand challenge (MEGC) 2026 introduces two tasks that reflect these evolving research directions: (1) ME video question answering (ME-VQA), which explores ME understanding through visual question answering on relatively short video sequences, leveraging MLLMs or LVLMs to address diverse question types related to MEs; and (2) ME long-video question answering (ME-LVQA), which extends VQA to long-duration video sequences in realistic settings, requiring models to handle temporal reasoning and subtle micro-expression detection across extended time periods. All participating algorithms are required to submit their results on a public leaderboard. More details are available at this https URL.
[CV-140] Vision-Language Models Encode Clinical Guidelines for Concept-Based Medical Reasoning CVPR2026
【速读】:该论文旨在解决医学影像分析中模型可解释性不足的问题,特别是在复杂临床场景下,传统概念瓶颈模型(Concept Bottleneck Models, CBMs)因忽略诊断指南和专家经验等临床上下文信息而难以保证可靠性。其解决方案的关键在于提出MedCBR框架,该框架通过融合临床指南与视觉-语言及推理模型,将标注的临床描述转化为符合指南的文本,并采用多任务目标联合训练:包括跨模态对比对齐、概念监督和诊断分类,从而实现图像特征、概念与病理之间的联合锚定;进一步利用推理模型生成结构化临床叙事,模拟基于指南的专家推理过程,最终在超声和乳腺X线摄影数据集上分别达到94.2%和84.0%的AUROC,显著提升了诊断准确性和概念层面的可解释性。
链接: https://arxiv.org/abs/2603.08921
作者: Mohamed Harmanani,Bining Long,Zhuoxin Guo,Paul F.R. Wilson,Amirhossein Sabour,Minh Nguyen Nhat To,Gabor Fichtinger,Purang Abolmaesumi,Parvin Mousavi
机构: Queen’s University (皇后大学); University of British Columbia (不列颠哥伦比亚大学); McMaster University (麦克马斯特大学); Vector Institute (向量研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: CVPR 2026 Findings
Abstract:Concept Bottleneck Models (CBMs) are a prominent framework for interpretable AI that map learned visual features to a set of meaningful concepts for task-specific downstream predictions. Their sequential structure enhances transparency by connecting model predictions to the underlying concepts that support them. In medical imaging, where transparency is essential, CBMs offer an appealing foundation for explainable model design. However, discrete concept representations often overlook broader clinical context such as diagnostic guidelines and expert heuristics, reducing reliability in complex cases. We propose MedCBR, a concept-based reasoning framework that integrates clinical guidelines with vision-language and reasoning models. Labeled clinical descriptors are transformed into guideline-conformant text, and a concept-based model is trained with a multitask objective combining multimodal contrastive alignment, concept supervision, and diagnostic classification to jointly ground image features, concepts, and pathology. A reasoning model then converts these predictions into structured clinical narratives that explain the diagnosis, emulating expert reasoning based on established guidelines. MedCBR achieves superior diagnostic and concept-level performance, with AUROCs of 94.2% on ultrasound and 84.0% on mammography. Further experiments on non-medical datasets achieve 86.1% accuracy. Our framework enhances interpretability and forms an end-to-end bridge from medical image analysis to decision-making.
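概念瓶颈模型(CBM)的"特征 → 概念 → 诊断"两段式结构可用如下最小示意理解(维度与概念含义均为假设,非论文实现):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def concept_bottleneck(features, w_concept, w_diag):
    """两段式预测:图像特征先映射为可解读的概念概率,再线性组合得到诊断分数。"""
    concepts = sigmoid(features @ w_concept)   # 每个临床概念(如边缘不规则,假设)的概率
    logit = concepts @ w_diag
    return concepts, sigmoid(logit)

rng = np.random.default_rng(5)
D, K = 12, 4                       # 特征维度、概念数,均为假设
features = rng.normal(size=D)
w_concept = rng.normal(size=(D, K)) * 0.5
w_diag = rng.normal(size=K)
concepts, risk = concept_bottleneck(features, w_concept, w_diag)
```

诊断分数只能经由概念层得到,因此每个预测都可追溯到支持它的概念激活——MedCBR 在此之上再叠加指南文本对齐与推理模型生成的临床叙事。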
[CV-141] Multi-Kernel Gated Decoder Adapters for Robust Multi-Task Thyroid Ultrasound under Cross-Center Shift
【速读】:该论文旨在解决甲状腺超声(Thyroid Ultrasound, US)自动化中多任务学习面临的跨中心域偏移问题,即在不同医疗中心的数据分布差异下,几何特征驱动的结节分割任务与纹理特征驱动的恶性风险评估任务之间存在不对称性能退化,而现有基于单一共享主干网络的多任务框架常引发负迁移。解决方案的关键在于提出了一种轻量级解码器侧适配器结构——多核门控适配器(Multi-Kernel Gated Adapter, MKGA)及其残差变体(ResMKGA),通过利用互补感受野对多尺度跳跃特征进行精细化调整,并引入语义感知的上下文门控机制,在融合前抑制受伪影干扰的内容,从而提升模型在跨中心场景下的鲁棒性,尤其在CNN主干上显著改善TI-RADS诊断准确率。
链接: https://arxiv.org/abs/2603.08906
作者: Maziar Sabouri,Nourhan Bayasi,Arman Rahmim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:
Abstract:Thyroid ultrasound (US) automation couples two competing requirements: global, geometry-driven reasoning for nodule delineation and local, texture-driven reasoning for malignancy risk assessment. Under cross-center domain shift, these cues degrade asymmetrically, yet most multi-task pipelines rely on a single shared backbone, often inducing negative transfer. In this paper, we characterize this interference across CNN (ResNet34) and medical ViT (MedSAM) backbones, and observe a consistent trend: ViTs transfer geometric priors that benefit segmentation, whereas CNNs more reliably preserve texture cues for malignancy discrimination under strong shift and artifacts. Motivated by this failure mode, we propose a lightweight family of decoder-side adapters, the Multi-Kernel Gated Adapter (MKGA) and a residual variant (ResMKGA), which refine multi-scale skip features using complementary receptive fields and apply semantic, context-conditioned gating to suppress artifact-prone content before fusion. Across two US benchmarks, the proposed adapters improve cross-center robustness: they strengthen out-of-domain segmentation and, in the CNN setting, yield clear gains in clinical TI-RADS diagnostic accuracy compared to standard multi-task baselines. Code and models will be released.
[CV-142] Towards Visual Query Segmentation in the Wild
【速读】:该论文旨在解决视觉查询定位(Visual Query Localization, VQL)任务中仅能定位目标物体最后一次出现位置(通过边界框)的局限性,提出一种新的范式——视觉查询分割(Visual Query Segmentation, VQS),其目标是在未修剪视频中精确分割出所有像素级的目标物体实例。为推动该任务的研究,作者构建了首个专门针对VQS的大规模基准数据集VQS-4K,包含4,111个视频、超过130万帧及222类物体,并提供高质量的空间-时间掩码片段标注。解决方案的关键在于提出VQ-SAM方法,该方法基于SAM 2架构,引入目标特定特征与背景干扰线索,通过一个新颖的多阶段框架和自适应记忆生成(Adaptive Memory Generation, AMG)模块,在视频中逐步演化记忆表示,从而实现更精准的像素级分割,实验表明VQ-SAM在VQS-4K上显著优于现有方法。
链接: https://arxiv.org/abs/2603.08898
作者: Bing Fan,Minghao Li,Hanzhi Zhang,Shaohua Dong,Naga Prudhvi Mareedu,Weishi Shi,Yunhe Feng,Yan Huang,Heng Fan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this paper, we introduce visual query segmentation (VQS), a new paradigm of visual query localization (VQL) that aims to segment all pixel-level occurrences of an object of interest in an untrimmed video, given an external visual query. Compared to existing VQL locating only the last appearance of a target using bounding boxes, VQS enables more comprehensive (i.e., all object occurrences) and precise (i.e., pixel-level masks) localization, making it more practical for real-world scenarios. To foster research on this task, we present VQS-4K, a large-scale benchmark dedicated to VQS. Specifically, VQS-4K contains 4,111 videos with more than 1.3 million frames and covers a diverse set of 222 object categories. Each video is paired with a visual query defined by a frame outside the search video and its target mask, and annotated with spatial-temporal masklets corresponding to the queried target. To ensure high quality, all videos in VQS-4K are manually labeled with meticulous inspection and iterative refinement. To the best of our knowledge, VQS-4K is the first benchmark specifically designed for VQS. Furthermore, to stimulate future research, we present a simple yet effective method, named VQ-SAM, which extends SAM 2 by leveraging target-specific and background distractor cues from the video to progressively evolve the memory through a novel multi-stage framework with an adaptive memory generation (AMG) module for VQS, significantly improving the performance. In our extensive experiments on VQS-4K, VQ-SAM achieves promising results and surpasses all existing approaches, demonstrating its effectiveness. With the proposed VQS-4K and VQ-SAM, we expect to go beyond the current VQL paradigm and inspire more future research and practical applications on VQS. Our benchmark, code, and results will be made publicly available.
[CV-143] Comparative Analysis of Patch Attack on VLM-Based Autonomous Driving Architectures
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在自动驾驶场景中对物理对抗攻击的鲁棒性问题,当前对此类威胁的研究仍属空白。解决方案的关键在于提出一个系统性的对比评估框架,通过黑盒优化结合语义同质化(semantic homogenization)方法,在CARLA仿真环境中对三种主流VLM架构(Dolphins、OmniDrive (Omni-L) 和 LeapVAD)进行可物理实现的贴片攻击(patch attacks)测试,从而揭示不同架构下的脆弱性模式,并验证现有设计在安全关键应用中对对抗威胁的应对不足。
链接: https://arxiv.org/abs/2603.08897
作者: David Fernandez,Pedram MohajerAnsari,Amir Salarpour,Long Cheng,Abolfazl Razi,Mert D. Pesé
机构: Clemson University (克莱姆森大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the 2025 IEEE Intelligent Vehicles Symposium (IV 2025)
Abstract:Vision-language models are emerging for autonomous driving, yet their robustness to physical adversarial attacks remains unexplored. This paper presents a systematic framework for comparative adversarial evaluation across three VLM architectures: Dolphins, OmniDrive (Omni-L), and LeapVAD. Using black-box optimization with semantic homogenization for fair comparison, we evaluate physically realizable patch attacks in CARLA simulation. Results reveal severe vulnerabilities across all architectures, sustained multi-frame failures, and critical object detection degradation. Our analysis exposes distinct architectural vulnerability patterns, demonstrating that current VLM designs inadequately address adversarial threats in safety-critical autonomous driving applications.
[CV-144] HECTOR: Hybrid Editable Compositional Object References for Video Generation
【速读】:该论文旨在解决当前视频生成模型在处理复杂场景时缺乏细粒度组合控制能力的问题,即现有方法通常以整体方式合成场景,难以实现对视觉元素空间位置、尺度和运动轨迹的精确调控。解决方案的关键在于提出HECTOR框架,其核心创新是支持混合参考条件(hybrid reference conditioning),允许用户同时使用静态图像和/或动态视频作为参考,并显式指定每个参考元素的轨迹,从而在保持高保真参考一致性的前提下,实现对视频中物体时空运动的精准控制。
链接: https://arxiv.org/abs/2603.08850
作者: Guofeng Zhang,Angtian Wang,Jacob Zhiyuan Fang,Liming Jiang,Haotian Yang,Alan Yuille,Chongyang Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Real-world videos naturally portray complex interactions among distinct physical objects, effectively forming dynamic compositions of visual elements. However, most current video generation models synthesize scenes holistically and therefore lack mechanisms for explicit compositional manipulation. To address this limitation, we propose HECTOR, a generative pipeline that enables fine-grained compositional control. In contrast to prior methods, HECTOR supports hybrid reference conditioning, allowing generation to be simultaneously guided by static images and/or dynamic videos. Moreover, users can explicitly specify the trajectory of each referenced element, precisely controlling its location, scale, and speed (see Figure 1). This design allows the model to synthesize coherent videos that satisfy complex spatiotemporal constraints while preserving high-fidelity adherence to references. Extensive experiments demonstrate that HECTOR achieves superior visual quality, stronger reference preservation, and improved motion controllability compared with existing approaches.
[CV-145] A Lightweight Multi-Cancer Tumor Localization Framework for Deployable Digital Pathology
【速读】:该论文旨在解决跨癌种肿瘤区域定位的泛化能力问题,即现有基于深度学习的肿瘤检测模型在特定癌种训练后,在未见过的癌种上性能显著下降的问题。其解决方案的关键在于采用小规模但均衡的多癌种数据集(四种癌症:黑色素瘤、肝细胞癌、结直肠癌和非小细胞肺癌)进行联合训练,并利用迁移学习策略(以DenseNet169为骨干网络),从而构建一个具备较强跨癌种泛化能力的多癌种肿瘤定位模型(MuCTaL)。实验表明,该模型在训练癌种中达到高精度(tile-level ROC-AUC=0.97),并在独立胰腺导管腺癌队列中保持可接受的性能(ROC-AUC=0.71),验证了该方法的有效性与可扩展性。
链接: https://arxiv.org/abs/2603.08844
作者: Brian Isett,Rebekah Dadey,Aofei Li,Ryan C. Augustin,Kate Smith,Aatur D. Singhi,Qiangqiang Gu,Riyue Bao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures
Abstract:Accurate localization of tumor regions from hematoxylin and eosin-stained whole-slide images is fundamental for translational research including spatial analysis, molecular profiling, and tissue architecture investigation. However, deep learning-based tumor detection trained within specific cancers may exhibit reduced robustness when applied across different tumor types. We investigated whether balanced training across cancers at modest scale can achieve high performance and generalize to unseen tumor types. A multi-cancer tumor localization model (MuCTaL) was trained on 79,984 non-overlapping tiles from four cancers (melanoma, hepatocellular carcinoma, colorectal cancer, and non-small cell lung cancer) using transfer learning with DenseNet169. The model achieved a tile-level ROC-AUC of 0.97 in validation data from the four training cancers, and 0.71 on an independent pancreatic ductal adenocarcinoma cohort. A scalable inference workflow was built to generate spatial tumor probability heatmaps compatible with existing digital pathology tools. Code and models are publicly available at this https URL.
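文中以 tile 级 ROC-AUC(训练癌种验证集 0.97)作为核心指标。ROC-AUC 等价于 Mann-Whitney U 统计量:随机抽取一对正负样本,正样本得分更高的概率。纯 Python 示意(得分数值为假设):

```python
def roc_auc(scores_pos, scores_neg):
    """以 Mann-Whitney U 统计量计算 ROC-AUC:
    遍历所有正负样本对,正样本得分更高计 1,平分计 0.5。"""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# 肿瘤 tile(正类)与正常 tile(负类)的模型得分,数值仅为演示
auc = roc_auc([0.9, 0.8, 0.7], [0.6, 0.4, 0.75])
```

大规模数据下一般用排序实现的 O(n log n) 版本(如 scikit-learn 的 roc_auc_score),此处双重循环仅为说明定义。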
[CV-146] Computer Vision-Based Vehicle Allotment System using Perspective Mapping
【速读】:该论文旨在解决城市交通拥堵与停车难问题,特别是在高密度城市区域中传统智能停车系统因传感器局限性和集成复杂性而难以高效运行的问题。其解决方案的关键在于引入基于计算机视觉(Computer Vision)的车辆识别与空位检测机制,利用 YOLOv8 目标检测模型结合逆透视映射(Inverse Perspective Mapping, IPM)技术,将四路摄像头图像融合以生成三维停车位分布信息,并通过三维笛卡尔坐标系可视化可用车位,从而实现高精度、低成本且易于部署的智能停车管理。
链接: https://arxiv.org/abs/2603.08827
作者: Prachi Nandi,Sonakshi Satapathy,Suchismita Chinara
机构: National Institute of Technology Rourkela (印度国立理工学院鲁尔克拉分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Smart city research envisions a future in which data-driven solutions and sustainable infrastructure work together to define urban living at the crossroads of urbanization and technology. Within this framework, smart parking systems play an important role in reducing urban congestion and supporting sustainable transportation. Automating parking solutions has considerable benefits, such as increased efficiency and less reliance on human involvement, but obstacles such as sensor limitations and integration complications remain. To overcome them, a more sophisticated car allotment system is required, particularly in heavily populated urban areas. Computer vision, with its higher accuracy and adaptability, outperforms traditional sensor-based systems for recognizing vehicles and vacant parking spaces. Unlike fixed sensor technologies, computer vision can dynamically assess a wide range of visual inputs while adjusting to changing parking layouts. This research presents a cost-effective, easy-to-implement smart parking system utilizing computer vision and object detection models like YOLOv8. Using inverse perspective mapping (IPM) to merge images from four camera views, we extract data on vacant spaces. The system simulates a 3D parking environment, representing available spots with a 3D Cartesian plot to guide users.
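逆透视映射(IPM)的核心是对地平面施加一次单应变换,把图像像素坐标映射为俯视平面坐标。以下为纯 Python 示意,单应矩阵 H 为假设值(实际中需由相机标定点求解,常见做法是用 OpenCV 的 getPerspectiveTransform):

```python
def apply_homography(H, u, v):
    """将像素坐标 (u, v) 经 3x3 单应矩阵 H 映射到地平面坐标,
    最后做齐次坐标归一化(除以 w)。"""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return x / w, y / w

# 假设的单应矩阵,仅作演示;真实系统由每路相机标定得到各自的 H
H = [[0.1, 0.0,   -5.0],
     [0.0, 0.2,  -10.0],
     [0.0, 0.001,  1.0]]
ground_xy = apply_homography(H, 320, 240)
```

四路相机各自标定一个 H 后,即可把各视角检测到的车辆/空位框投到同一俯视坐标系中融合。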
[CV-147] VisionCreator-R1: A Reflection-Enhanced Native Visual-Generation Agentic Model
【速读】:该论文旨在解决当前视觉内容生成代理(Visual Generation Agent)在多图像工作流中缺乏系统性反思机制的问题,导致其难以纠正执行过程中的视觉错误。现有方法多依赖计划驱动策略,无法有效实现轨迹中段的自我修正。解决方案的关键在于提出一种具有显式反思能力的统一代理 VisionCreator-R1,并设计了反思-计划协同优化(Reflection-Plan Co-Optimization, RPCO)训练方法:首先利用自建的 VCR-SFT 数据集分别训练强反思单图轨迹与强规划多图轨迹,再通过强化学习(Reinforcement Learning, RL)在 VCR-RL 数据集上进行协同优化,从而克服强化学习中反思模块因信用分配噪声而难以学习的问题,最终显著提升代理在单图和多图任务上的性能表现。
链接: https://arxiv.org/abs/2603.08812
作者: Jinxiang Lai,Wenzhe Zhao,Zexin Lu,Hualei Zhang,Qinyu Yang,Rongwei Quan,Zhimin Li,Shuai Shao,Song Guo,Qinglin Lu
机构: Tencent Hunyuan; Hong Kong University of Science and Technology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual content generation has advanced from single-image to multi-image workflows, yet existing agents remain largely plan-driven and lack systematic reflection mechanisms to correct mid-trajectory visual errors. To address this limitation, we propose VisionCreator-R1, a native visual generation agent with explicit reflection, together with a Reflection-Plan Co-Optimization (RPCO) training methodology. Through extensive experiments and trajectory-level analysis, we uncover reflection-plan optimization asymmetry in reinforcement learning (RL): planning can be reliably optimized via plan rewards, while reflection learning is hindered by noisy credit assignment. Guided by this insight, our RPCO first trains on the self-constructed VCR-SFT dataset with reflection-strong single-image trajectories and planning-strong multi-image trajectories, then performs co-optimization on the VCR-RL dataset via RL. This yields our unified VisionCreator-R1 agent, which consistently outperforms Gemini 2.5 Pro on existing benchmarks and our VCR-bench covering single-image and multi-image tasks.
[CV-148] Where What Why: Toward Explainable 3D-GS Watermarking CVPR2026
【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting)表示中鲁棒且不可感知的水印嵌入问题,以保障交互式3D资产的版权保护。其核心解决方案在于提出一种原生表示框架,将“水印载体选择”与“质量保持机制”解耦:通过三专家模块(Trio-Experts)直接在高斯原始体上推导载体优先级,结合安全与预算感知门控机制(SBAG)优化比特抗扰性和比特率约束下的载体分配;同时引入通道分组掩码控制梯度传播,限制高斯参数更新、修复局部伪影并保留高频细节,从而实现视图一致的水印持久性及对压缩、噪声等常见图像失真较强的鲁棒性,相较现有方法在PSNR提升0.83 dB、比特准确率提升1.24%的基础上,还支持逐高斯粒度的可解释性分析。
链接: https://arxiv.org/abs/2603.08809
作者: Mingshu Cai,Jiajun Li,Osamu Yoshie,Yuya Ieiri,Yixuan Li
机构: Waseda University (早稻田大学); Southeast University (东南大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:As 3D Gaussian Splatting becomes the de facto representation for interactive 3D assets, robust yet imperceptible watermarking is critical. We present a representation-native framework that separates where to write from how to preserve quality. A Trio-Experts module operates directly on Gaussian primitives to derive priors for carrier selection, while a Safety and Budget Aware Gate (SBAG) allocates Gaussians to watermark carriers, optimized for bit resilience under perturbation and bitrate budgets, and to visual compensators that are insulated from watermark loss. To maintain fidelity, we introduce a channel-wise group mask that controls gradient propagation for carriers and compensators, thereby limiting Gaussian parameter updates, repairing local artifacts, and preserving high-frequency details without increasing runtime. Our design yields view-consistent watermark persistence and strong robustness against common image distortions such as compression and noise, while achieving a favorable robustness-quality trade-off compared with prior methods. In addition, decoupled finetuning provides per-Gaussian attributions that reveal where the message is carried and why those carriers are selected, enabling auditable explainability. Compared with state-of-the-art methods, our approach achieves a PSNR improvement of +0.83 dB and a bit-accuracy gain of +1.24%.
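该文以比特准确率(+1.24%)与 PSNR(+0.83 dB)分别衡量水印鲁棒性与渲染保真度。两个指标本身的计算很简单,纯 Python 示意如下(输入数据为假设):

```python
import math

def bit_accuracy(msg, decoded):
    """嵌入比特串与解码比特串的逐位一致率。"""
    assert len(msg) == len(decoded)
    return sum(a == b for a, b in zip(msg, decoded)) / len(msg)

def psnr(img_a, img_b, peak=1.0):
    """峰值信噪比(dB);img 为展平的像素列表,peak 为像素峰值。"""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

# 示意:4 比特水印中有 1 位解码出错
acc = bit_accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

在 3D-GS 场景中,PSNR 按"加水印前后从同一视角渲染的两张图"计算,从而度量水印的不可感知性。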
[CV-149] Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉理解中难以兼顾细粒度与粗粒度语义推理的问题。现有基于CLIP的视觉编码器侧重全局语义对齐,但缺乏像素级感知能力;而DINOv3虽具备强大的像素级表征能力,却缺乏高层次语义抽象,限制了跨粒度推理性能。解决方案的关键在于提出Granulon,其核心创新包括:一个文本条件驱动的粒度控制器(text-conditioned granularity Controller),可根据文本输入的语义范围动态调整视觉抽象层级;以及一个自适应标记聚合模块(Adaptive Token Aggregation module),通过粒度引导的池化和关系感知聚类生成紧凑且语义丰富的视觉标记。这一设计实现了单次前向传播中的统一“像素→细粒度→粗粒度”推理路径,显著提升了准确率(约+30%)并降低幻觉(约-20%)。
链接: https://arxiv.org/abs/2603.08800
作者: Junyuan Mao,Qiankun Li,Linghao Meng,Zhicheng He,Xinliang Zhou,Kun Wang,Yang Liu,Yueming Jin
机构: National University of Singapore (新加坡国立大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in multimodal large language models largely rely on CLIP-based visual encoders, which emphasize global semantic alignment but struggle with fine-grained visual understanding. In contrast, DINOv3 provides strong pixel-level perception yet lacks coarse-grained semantic abstraction, leading to limited multi-granularity reasoning. To address this gap, we propose Granulon, a novel DINOv3-based MLLM with adaptive granularity augmentation. Granulon introduces a text-conditioned granularity Controller that dynamically adjusts the visual abstraction level according to the semantic scope of the textual input, and an Adaptive Token Aggregation module that performs granularity-guided pooling and relation-aware clustering to produce compact, semantically rich visual tokens. This design enables unified “pixel-to-fine-to-coarse” reasoning within a single forward pass. Extensive and interpretable experiments demonstrate that Granulon improves accuracy by ~30% and reduces hallucination by ~20%, outperforming all visual encoders under identical settings.
[CV-150] Performance Analysis of Edge and In-Sensor AI Processors: A Comparative Review
【速读】:该论文旨在解决超低功耗边缘处理器在持续运行(always-on)和延迟敏感型人工智能(AI)任务中的性能与能效瓶颈问题。其解决方案的关键在于系统性地分类和评估不同架构范式的边缘AI处理器——包括异构片上系统(SoC)、神经网络加速器、近传感器及传感器内计算架构,以及新兴的数据流与内存中心设计,并通过实证基准测试验证其在延迟、推理效率、能量效率和能量-延迟乘积(energy-delay product)等方面的差异。结果表明,传感器内处理(in-sensor processing)技术已展现出显著优势,如Sony IMX500实现了86.2 MAC/周期的高利用率和最低的能量-延迟乘积,凸显了其在能效和实时性上的潜力;而GAP9在微控制器级功耗预算下提供最佳能量效率,STM32N6则以更低的原始延迟但更高能耗满足对响应速度要求严苛的应用场景。整体而言,研究揭示了当前超低功耗AI处理器的设计趋势与实际权衡,为下一代边缘智能硬件提供了理论依据与实践参考。
链接: https://arxiv.org/abs/2603.08725
作者: Luigi Capogrosso,Pietro Bonazzi,Michele Magno
机构: 未知
类目: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at the IEEE International Instrumentation and Measurement Technology Conference (I2MTC) 2026
Abstract:This review examines the rapidly evolving landscape of ultra-low-power edge processors, covering heterogeneous Systems-on-Chips (SoCs), neural accelerators, near-sensor and in-sensor architectures, and emerging dataflow and memory-centric designs. We categorize commercially available and research-grade platforms according to their compute paradigms, power envelopes, and memory hierarchies, and analyze their suitability for always-on and latency-critical Artificial Intelligence (AI) workloads. To complement the architectural overview with empirical evidence, we benchmark a 336 million Multiply-Accumulate (MAC) segmentation model (PicoSAM2) on three representative processors: GAP9, leveraging a multi-core RISC-V architecture augmented with hardware accelerators; the STM32N6, which pairs an advanced ARM Cortex-M55 core with a dedicated neural architecture accelerator; and the Sony IMX500, representing in-sensor stacked-Complementary Metal-Oxide-Semiconductor (CMOS) compute. Collectively, these platforms span MCU-class, embedded neural accelerator, and in-sensor paradigms. The evaluation reports latency, inference efficiency, energy efficiency, and energy-delay product. The results show a clear divergence in hardware behavior, with the IMX500 achieving the highest utilization (86.2 MAC/cycle) and the lowest energy-delay product, highlighting the growing significance and technological maturity of in-sensor processing. GAP9 offers the best energy efficiency within microcontroller-class power budgets, and the STM32N6 provides the lowest raw latency at a significantly higher energy cost. Together, the review and benchmarks provide a unified view of the current design directions and practical trade-offs that are shaping the next generation of ultra-low-power and in-sensor AI processors.
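评测中使用的能量-延迟乘积(energy-delay product, EDP)是同时惩罚高能耗与高延迟的综合指标。下面的示意展示其比较逻辑,平台名称与数值均为假设,并非文中实测数据:

```python
def energy_delay_product(energy_mj, latency_ms):
    """EDP = 单次推理能耗 × 延迟,数值越低越好。"""
    return energy_mj * latency_ms

# 假设的三类平台 (能耗 mJ, 延迟 ms),分别类比传感器内、MCU 级、低延迟高功耗平台
platforms = {
    "in_sensor":   (2.0, 40.0),
    "mcu_class":   (1.5, 120.0),
    "low_latency": (6.0, 30.0),
}
edp = {k: energy_delay_product(e, t) for k, (e, t) in platforms.items()}
best = min(edp, key=edp.get)
```

注意单独看能耗会选 mcu_class、单独看延迟会选 low_latency,而 EDP 会倾向两者均衡的平台,这正是文中 IMX500 以最低 EDP 胜出的含义。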
[CV-151] CycleULM: A unified label-free deep learning framework for ultrasound localisation microscopy
【速读】:该论文旨在解决超分辨率超声成像中微泡(microbubble, MB)定位与追踪的性能瓶颈及数据获取和处理效率低的问题,尤其针对在体实验中标注数据稀缺和仿真到现实域差距(simulation-to-reality domain gap)带来的深度学习方法局限性。其解决方案的关键在于提出首个统一的无标签深度学习框架 CycleULM,通过引入 CycleGAN 实现真实对比增强超声(contrast-enhanced ultrasound, CEUS)数据域与简化微泡仅域之间的物理模拟映射,无需配对标注数据即可完成高质量图像转换,从而显著提升微泡定位精度(召回率提高40%、精度提高46%、平均定位误差降低14.0 μm)、图像对比度(信噪比提升达15.3 dB)和分辨率(点扩散函数半高宽缩小2.5倍),同时实现每秒18.3帧的实时处理速度,较传统方法提速达14.5倍,为临床实用化超分辨率超声局部显微成像提供了高效可行的技术路径。
链接: https://arxiv.org/abs/2603.09840
作者: Su Yan,Clara Rodrigo Gonzalez,Vincent C. H. Leung,Herman Verinaz-Jadan,Jiakang Chen,Matthieu Toulemonde,Kai Riemer,Jipeng Yan,Clotilde Vié,Qingyuan Tan,Peter D. Weinberg,Pier Luigi Dragotti,Kevin G. Murphy,Meng-Xing Tang
机构: Imperial College London (帝国理工学院); Escuela Superior Politécnica del Litoral (ESPOL) (厄瓜多尔滨海理工学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 43 pages, 14 figures, 2 tables, journal
Abstract:Super-resolution ultrasound via microbubble (MB) localisation and tracking, also known as ultrasound localisation microscopy (ULM), can resolve microvasculature beyond the acoustic diffraction limit. However, significant challenges remain in localisation performance and data acquisition and processing time. Deep learning methods for ULM have shown promise to address these challenges, however, they remain limited by in vivo label scarcity and the simulation-to-reality domain gap. We present CycleULM, the first unified label-free deep learning framework for ULM. CycleULM learns a physics-emulating translation between the real contrast-enhanced ultrasound (CEUS) data domain and a simplified MB-only domain, leveraging the power of CycleGAN without requiring paired ground truth data. With this translation, CycleULM removes dependence on high-fidelity simulators or labelled data, and makes MB localisation and tracking substantially easier. Deployed as modular plug-and-play components within existing pipelines or as an end-to-end processing framework, CycleULM delivers substantial performance gains across both in silico and in vivo datasets. Specifically, CycleULM improves image contrast (contrast-to-noise ratio) by up to 15.3 dB and sharpens CEUS resolution with a 2.5× reduction in the full width at half maximum of the point spread function. CycleULM also improves MB localisation performance, with up to +40% recall, +46% precision, and a -14.0 μm mean localisation error, yielding more faithful vascular reconstructions. Importantly, CycleULM achieves real-time processing throughput at 18.3 frames per second with order-of-magnitude speed-ups (up to ~14.5×). By combining label-free learning, performance enhancement, and computational efficiency, CycleULM provides a practical pathway toward robust, real-time ULM and accelerates its translation to clinical applications.
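文中的定位指标(召回率、精度、平均定位误差)通常通过在容差内将预测微泡位置与真值位置匹配来计算。以下是一个与论文实现无关的贪心匹配示意(容差与坐标为假设):

```python
def match_localisations(pred, gt, tol):
    """在距离容差 tol 内将预测点与真值点贪心匹配,
    返回 (precision, recall, 平均定位误差)。"""
    used, errors = set(), []
    for p in pred:
        best_j, best_d = None, tol
        for j, g in enumerate(gt):
            if j in used:
                continue
            d = ((p[0] - g[0]) ** 2 + (p[1] - g[1]) ** 2) ** 0.5
            if d <= best_d:
                best_j, best_d = j, d
        if best_j is not None:
            used.add(best_j)
            errors.append(best_d)
    tp = len(errors)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    mean_err = sum(errors) / tp if tp else float("nan")
    return precision, recall, mean_err
```

严格实现常用匈牙利算法做最优一对一匹配,贪心版本仅用于说明三个指标的含义。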
[CV-152] Association of Radiologic PPFE Change with Mortality in Lung Cancer Screening Cohorts
【速读】:该论文旨在解决放射学上肺胸膜间质纤维弹性组织增生(Pleuroparenchymal fibroelastosis, PPFE)的纵向进展是否独立关联于死亡率和呼吸系统不良事件的问题,尤其是在肺癌筛查人群中。其解决方案的关键在于采用自动化算法对低剂量CT扫描图像进行定量分析,计算PPFE体积的年化变化量(dPPFE),并基于分布阈值定义PPFE进展状态,从而在两个大型肺癌筛查队列(NLST和SUMMIT)中验证dPPFE与死亡风险及临床呼吸结局之间的独立关联,结果表明dPPFE是预测死亡率和呼吸系统并发症的重要影像生物标志物。
链接: https://arxiv.org/abs/2603.09531
作者: Shahab Aslani,Mehran Azimbagirad,Daryl Cheng,Daisuke Yamada,Ryoko Egashira,Adam Szmul,Justine Chan-Fook,Robert Chapman,Alfred Chung Pui So,Shanshan Wang,John McCabe,Tianqi Yang,Jose M Brenes,Eyjolfur Gudmundsson, TheSUMMIT Consortium,Susan M. Astley,Daniel C. Alexander,Sam M. Janes,Joseph Jacob
机构: 未知
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Applications (stat.AP)
备注:
Abstract:Background: Pleuroparenchymal fibroelastosis (PPFE) is an upper lobe predominant fibrotic lung abnormality associated with increased mortality in established interstitial lung disease. However, the clinical significance of radiologic PPFE progression in lung cancer screening populations remains unclear. We investigated whether longitudinal change in PPFE quantified on low dose CT independently associates with mortality and respiratory morbidity. Methods: We analysed longitudinal low-dose CT scans and clinical data from two lung cancer screening studies: the National Lung Screening Trial (NLST; n=7980) and the SUMMIT study (n=8561). An automated algorithm quantified PPFE volume on baseline and follow up scans. Annualised change in PPFE (dPPFE) was derived and dichotomised using a distribution based threshold to define progressive PPFE. Associations between dPPFE and mortality were evaluated using Cox proportional hazards models adjusted for demographic and clinical variables. In the SUMMIT cohort, dPPFE was also examined in relation to clinical outcomes. Findings: dPPFE independently associated with mortality in both cohorts (NLST: HR 1.25, 95% CI 1.01-1.56, p=0.042; SUMMIT: HR 3.14, 95% CI 1.66-5.97, p<0.001). Kaplan-Meier curves showed reduced survival among participants with progressive PPFE in both cohorts. In SUMMIT, dPPFE was associated with higher respiratory admissions (IRR 2.79, p<0.001), increased antibiotic and steroid use (IRR 1.55, p=0.010), and a trend towards higher mMRC scores (OR 1.40, p=0.055). Interpretation: Radiologic PPFE progression independently associates with mortality across two large lung cancer screening cohorts and with adverse clinical outcomes. Quantitative assessment of PPFE progression may provide a clinically relevant imaging biomarker for identifying individuals at increased respiratory risk within screening programmes.
Subjects: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Applications (stat.AP) Cite as: arXiv:2603.09531 [q-bio.QM] (or arXiv:2603.09531v1 [q-bio.QM] for this version) https://doi.org/10.48550/arXiv.2603.09531 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Shahab Aslani [v1] Tue, 10 Mar 2026 11:37:50 UTC (1,114 KB)
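文中的核心影像指标 dPPFE 即随访间隔内 PPFE 体积变化的年化值,并按分布阈值二分为进展/非进展。计算本身很简单,示意如下(体积单位与阈值均为假设,论文中阈值由队列分布确定):

```python
def annualized_change(v_baseline, v_followup, years):
    """dPPFE = (随访 PPFE 体积 - 基线体积) / 随访间隔年数。"""
    return (v_followup - v_baseline) / years

def is_progressive(dppfe, threshold):
    """按阈值把 dPPFE 二分为进展(True)/非进展(False)。"""
    return dppfe > threshold

# 示意:基线 10.0、两年后随访 13.0(单位假设为 mL)
d = annualized_change(10.0, 13.0, 2.0)
```

二分后的进展状态即可作为 Cox 比例风险模型中的协变量,与死亡率等结局建立关联。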
[CV-153] POLISHing the Sky: Wide-Field and High-Dynamic Range Interferometric Image Reconstruction with Application to Strong Lens Discovery
【速读】:该论文旨在解决射电干涉成像中因高动态范围(High Dynamic Range, HDR)、大视场(Large Field of View, LFOV)以及训练与测试条件不匹配导致的重建质量下降问题,这些问题限制了深度学习(Deep Learning, DL)方法在真实天文观测场景中的部署。解决方案的关键在于对POLISH框架进行两项核心改进:一是采用分块训练与拼接策略(patch-wise training and stitching strategy),以支持宽视场成像;二是引入非线性arcsinh强度变换(arcsinh-based intensity transformation),有效管理高动态范围数据。实验表明,该方法在T-RECS仿真数据集上显著提升了重建质量和鲁棒性,并在强引力透镜系统中实现了超分辨率恢复,有望使深综合巡天阵列(DSA)探测到的星系-星系透镜数量提升10倍以上。
链接: https://arxiv.org/abs/2603.09162
作者: Zihui Wu,Liam Connor,Samuel McCarty,Katherine L. Bouman
机构: California Institute of Technology (Caltech); Center for Astrophysics | Harvard & Smithsonian
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Radio interferometry enables high-resolution imaging of astronomical radio sources by synthesizing a large effective aperture from an array of antennas and solving a deconvolution problem to reconstruct the image. Deep learning has emerged as a promising solution to the imaging problem, reducing computational costs and enabling super-resolution. However, existing DL-based methods often fall short of the requirements for real-world deployment due to limitations in handling high dynamic range, large field of view, and mismatches between training and test conditions. In this work, we build upon and extend the POLISH framework, a recent DL model for radio interferometric imaging. We introduce key improvements to enable robust reconstruction and super-resolution under real-world conditions: (1) a patch-wise training and stitching strategy for scaling to wide-field imaging and (2) a nonlinear arcsinh-based intensity transformation to manage high dynamic range. We conduct comprehensive evaluations using the T-RECS simulation suite with realistic sky models and point spread functions (PSF), and demonstrate that our approach significantly improves reconstruction quality and robustness. We test the model on realistic simulated strong gravitational lenses and show that lens systems with Einstein radii near the PSF scale can be recovered after deconvolution with our POLISH model, potentially yielding 10× more galaxy-galaxy lensing systems from the Deep Synoptic Array (DSA) survey than with image-plane CLEAN. Our results highlight the potential of DL models as practical, scalable tools for next-generation radio astronomy.
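文中用于处理高动态范围的 arcsinh 强度变换,其特点是对小值近似线性(保留弱源细节)、对大值近似对数(压缩亮源)。一个最小示意(缩放因子 s 与峰值 peak 为假设的超参数,非论文取值):

```python
import math

def arcsinh_stretch(x, s=1.0, peak=1000.0):
    """arcsinh 强度变换并归一化到 [0, 1]:
    假设输入非负且以 peak 为动态范围上界。"""
    return math.asinh(x / s) / math.asinh(peak / s)

lo = arcsinh_stretch(1.0)      # 弱源:仍占据可观的输出区间
hi = arcsinh_stretch(1000.0)   # 峰值被映射到 1.0
```

相比直接线性归一化(弱源会被压到接近 0),这种变换让网络在训练时能同时"看到"强弱源,这正是其适合高动态范围成像的原因。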
人工智能
[AI-0] Towards a Neural Debugger for Python
【速读】:该论文旨在解决现有神经解释器(neural interpreters)缺乏交互式控制能力的问题,即传统方法无法模拟开发者在调试过程中通过设置断点、单步执行(step into/over/out)等操作来动态干预程序执行流程。其解决方案的关键在于引入“神经调试器”(neural debuggers),这是一种基于大语言模型(LLM)微调或小模型从头预训练得到的新型架构,能够建模正向执行(预测未来状态和输出)与逆向执行(推断先前状态或输入),并响应调试动作进行条件化执行推理。实验表明,该方法在CruxEval基准上实现了输出和输入预测任务的优异性能,为构建具备仿真调试环境的世界模型(world model)奠定了基础,从而推动智能体编程系统的发展。
链接: https://arxiv.org/abs/2603.09951
作者: Maximilian Beck,Jonas Gehring,Jannik Kossen,Gabriel Synnaeve
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 22 pages
Abstract:Training large language models (LLMs) on Python execution traces grounds them in code execution and enables the line-by-line execution prediction of whole Python programs, effectively turning them into neural interpreters (FAIR CodeGen Team et al., 2025). However, developers rarely execute programs step by step; instead, they use debuggers to stop execution at certain breakpoints and step through relevant portions only while inspecting or modifying program variables. Existing neural interpreter approaches lack such interactive control. To address this limitation, we introduce neural debuggers: language models that emulate traditional debuggers, supporting operations such as stepping into, over, or out of functions, as well as setting breakpoints at specific source lines. We show that neural debuggers – obtained via fine-tuning large LLMs or pre-training smaller models from scratch – can reliably model both forward execution (predicting future states and outputs) and inverse execution (inferring prior states or inputs) conditioned on debugger actions. Evaluated on CruxEval, our models achieve strong performance on both output and input prediction tasks, demonstrating robust conditional execution modeling. Our work takes first steps towards future agentic coding systems in which neural debuggers serve as a world model for simulated debugging environments, providing execution feedback or enabling agents to interact with real debugging tools. This capability lays the foundation for more powerful code generation, program understanding, and automated debugging.
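这类神经解释器/调试器的训练数据是"逐行执行轨迹"。在 Python 中,用标准库的 sys.settrace 即可采集 (行号, 局部变量快照) 形式的轨迹,下面是一个与论文实现无关的示意:

```python
import sys

def record_trace(fn, *args):
    """逐行记录 fn 执行过程中的 (相对行号, 局部变量快照)。"""
    trace = []
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            trace.append((frame.f_lineno - fn.__code__.co_firstlineno,
                          dict(frame.f_locals)))
        return tracer
    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)   # 无论是否异常,都恢复默认 trace 状态
    return result, trace

def demo(n):
    total = 0
    for i in range(n):
        total += i
    return total

result, trace = record_trace(demo, 3)
```

真实调试器(如标准库 bdb/pdb)在同一机制上叠加断点与 step into/over/out 控制;神经调试器学习的正是在这类动作条件下预测后续状态。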
[AI-1] When Learning Rates Go Wrong: Early Structural Signals in PPO Actor-Critic
【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning)中策略优化(如Proximal Policy Optimization, PPO)对学习率(Learning Rate, LR)高度敏感的问题,即小LR导致收敛缓慢、大LR引发训练不稳定或崩溃,从而增加超参数调优的复杂性。其解决方案的关键在于引入一种基于隐藏神经元激活模式的高效指标——过拟合-欠拟合指示器(Overfitting-Underfitting Indicator, OUI),通过分析网络在固定探针批次上的二值激活模式变化,量化神经元内部结构演化与LR之间的理论关联。实证表明,OUI在训练早期(仅10%阶段)即可有效区分不同LR下的性能表现,并揭示出演员(actor)和评论家(critic)网络的最佳返回值分别对应于高OUI值和中间OUI区间,进而实现基于OUI的早期筛选机制,在保持高召回率的同时显著提升筛选精度,支持对低潜力训练路径进行高效剪枝,无需完成全部训练过程。
链接: https://arxiv.org/abs/2603.09950
作者: Alberto Fernández-Hernández,Cristian Pérez-Corral,Jose I. Mestre,Manuel F. Dolz,Jose Duato,Enrique S. Quintana-Ortí
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep Reinforcement Learning systems are highly sensitive to the learning rate (LR), and selecting stable and performant training runs often requires extensive hyperparameter search. In Proximal Policy Optimization (PPO) actor–critic methods, small LR values lead to slow convergence, whereas large LR values may induce instability or collapse. We analyse this phenomenon from the behavior of the hidden neurons in the network using the Overfitting-Underfitting Indicator (OUI), a metric that quantifies the balance of binary activation patterns over a fixed probe batch. We introduce an efficient batch-based formulation of OUI and derive a theoretical connection between LR and activation sign changes, clarifying how a correct evolution of the neuron’s inner structure depends on the step size. Empirically, across three discrete-control environments and multiple seeds, we show that OUI measured at only 10% of training already discriminates between LR regimes. We observe a consistent asymmetry: critic networks achieving highest return operate in an intermediate OUI band (avoiding saturation), whereas actor networks achieving highest return exhibit comparatively high OUI values. We then compare OUI-based screening rules against early return, clip-based, divergence-based, and flip-based criteria under matched recall over successful runs. In this setting, OUI provides the strongest early screening signal: OUI alone achieves the best precision at broader recall, while combining early return with OUI yields the highest precision in best-performing screening regimes, enabling aggressive pruning of unpromising runs without requiring full training. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2603.09950 [cs.LG] (or arXiv:2603.09950v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2603.09950 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
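摘要将学习率与"隐藏单元激活符号变化"联系起来:步长过大时,探针批次上的二值激活模式(ReLU 前值是否大于 0)会频繁翻转。下面是这一信号的简化代理量示意(并非论文中 OUI 的原始公式,仅为说明思路):

```python
def flip_fraction(pre_act_old, pre_act_new):
    """同一探针批次下,隐藏单元的二值激活模式
    在相邻两次训练步之间发生翻转的比例。"""
    flips = sum((a > 0) != (b > 0)
                for a, b in zip(pre_act_old, pre_act_new))
    return flips / len(pre_act_old)

# 示意:4 个隐藏单元的 ReLU 前激活值,其中 2 个符号发生了翻转
f = flip_fraction([0.5, -0.2, 1.1, -0.7], [0.4, 0.3, -0.1, -0.9])
```

翻转比例长期接近 0 意味着内部结构几乎不再演化(学习率偏小),长期过高则提示不稳定(学习率偏大),这与文中"早期即可区分 LR 区间"的思路一致。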
[AI-2] The Confidence Gate Theorem: When Should Ranked Decision Systems Abstain?
【速读】:该论文旨在解决排名决策系统(如推荐系统、广告拍卖和临床分诊队列)在何时应干预 ranked 输出以及何时应放弃决策的问题,核心在于分析基于置信度的弃权机制(confidence-based abstention)何时能单调提升决策质量,何时会失效。解决方案的关键在于识别出两个形式条件:rank-alignment(排序对齐)与无反转区域(no inversion zones),并进一步揭示这些条件成立或失败的根本原因——即结构性不确定性(structural uncertainty,如冷启动问题)与情境性不确定性(contextual uncertainty,如时间漂移)之间的区别。研究表明,结构性不确定性下置信度弃权可带来近单调收益,而情境性不确定性则会导致置信度信号失效,即便使用情境感知替代方案(如集成分歧或近期特征)也难以完全恢复单调性,从而为实际部署提供了诊断工具:应在部署前于保留数据上验证C1和C2条件,并确保置信度信号与主导不确定性类型匹配。
链接: https://arxiv.org/abs/2603.09947
作者: Ronald Doku
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Ranked decision systems – recommenders, ad auctions, clinical triage queues – must decide when to intervene in ranked outputs and when to abstain. We study when confidence-based abstention monotonically improves decision quality, and when it fails. The formal conditions are simple: rank-alignment and no inversion zones. The substantive contribution is identifying why these conditions hold or fail: the distinction between structural uncertainty (missing data, e.g., cold-start) and contextual uncertainty (missing context, e.g., temporal drift). Empirically, we validate this distinction across three domains: collaborative filtering (MovieLens, 3 distribution shifts), e-commerce intent detection (RetailRocket, Criteo, Yoochoose), and clinical pathway triage (MIMIC-IV). Structural uncertainty produces near-monotonic abstention gains in all domains; structurally grounded confidence signals (observation counts) fail under contextual drift, producing as many monotonicity violations as random abstention on our MovieLens temporal split. Context-aware alternatives – ensemble disagreement and recency features – substantially narrow the gap (reducing violations from 3 to 1–2) but do not fully restore monotonicity, suggesting that contextual uncertainty poses qualitatively different challenges. Exception labels defined from residuals degrade substantially under distribution shift (AUC drops from 0.71 to 0.61–0.62 across three splits), providing a clean negative result against the common practice of exception-based intervention. The results provide a practical deployment diagnostic: check C1 and C2 on held-out data before deploying a confidence gate, and match the confidence signal to the dominant uncertainty type.
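置信门控弃权的基本逻辑是:仅当置信度达到阈值 τ 时才对排名输出施加干预,否则弃权并保留原有顺序。下面是一个与论文形式化无关的最小示意(数据与阈值为假设):

```python
def gated_decision(items, tau):
    """items: [(id, 干预得分, 置信度)] 按原始排序给出。
    置信度 >= tau 的条目按干预得分重排,其余弃权、维持原序附后。"""
    intervened, abstained = [], []
    for item_id, score, conf in items:
        (intervened if conf >= tau else abstained).append((item_id, score))
    intervened.sort(key=lambda x: -x[1])
    return intervened + abstained

ranking = gated_decision(
    [("a", 0.2, 0.9), ("b", 0.8, 0.95), ("c", 0.7, 0.3)], tau=0.5)
```

论文讨论的正是何种条件下(rank-alignment、无反转区域)提高 τ 能单调改善决策质量,以及情境性不确定性为何会破坏这一单调性。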
[AI-3] PathMem: Toward Cognition-Aligned Memory Transformation for Pathology MLLM s
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在计算病理学任务中缺乏显式结构化知识整合机制与可解释的记忆控制问题,导致其难以一致地将病理特定的诊断标准融入推理过程。解决方案的关键在于提出PathMem——一个以记忆为中心的多模态框架,通过将结构化病理知识组织为长期记忆(Long-Term Memory, LTM),并引入Memory Transformer模块,实现从LTM到工作记忆(Working Memory, WM)的动态迁移,该迁移基于多模态记忆激活和上下文感知的知识锚定机制,从而支持下游推理中的上下文感知记忆精炼,显著提升了病理图像报告生成和开放性诊断任务的性能。
链接: https://arxiv.org/abs/2603.09943
作者: Jinyue Li,Yuci Liang,Qiankun Li,Xinheng Lyu,Jiayu Qian,Huabao Chen,Kun Wang,Zhigang Zeng,Anil Anthony Bharath,Yang Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Computational pathology demands both visual pattern recognition and dynamic integration of structured domain knowledge, including taxonomy, grading criteria, and clinical evidence. In practice, diagnostic reasoning requires linking morphological evidence with formal diagnostic and grading criteria. Although multimodal large language models (MLLMs) demonstrate strong vision language reasoning capabilities, they lack explicit mechanisms for structured knowledge integration and interpretable memory control. As a result, existing models struggle to consistently incorporate pathology-specific diagnostic standards during reasoning. Inspired by the hierarchical memory process of human pathologists, we propose PathMem, a memory-centric multimodal framework for pathology MLLMs. PathMem organizes structured pathology knowledge as a long-term memory (LTM) and introduces a Memory Transformer that models the dynamic transition from LTM to working memory (WM) through multimodal memory activation and context-aware knowledge grounding, enabling context-aware memory refinement for downstream reasoning. PathMem achieves SOTA performance across benchmarks, improving WSI-Bench report generation (12.8% WSI-Precision, 10.1% WSI-Relevance) and open-ended diagnosis by 9.7% and 8.9% over prior WSI-based models.
[AI-4] Towards Flexible Spectrum Access: Data-Driven Insights into Spectrum Demand
【速读】:该论文旨在解决6G网络中因无线连接需求激增而面临的频谱资源有限问题,核心挑战在于如何精准刻画时空维度上的频谱需求模式。解决方案的关键在于提出一种数据驱动的方法,结合地理空间分析(geospatial analytics)与机器学习技术,用于估计城市区域内的频谱需求动态变化,并识别其关键驱动因素。该方法在加拿大案例研究中表现出良好泛化能力——模型在训练一个城市区域后,可在另一城市区域解释70%的频谱需求变异,从而为监管机构制定适应未来6G网络需求的有效政策提供量化依据。
链接: https://arxiv.org/abs/2603.09942
作者: Mohamad Alkadamani,Amir Ghasemi,Halim Yanikomeroglu
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: 7 pages, 5 figures. Presented at IEEE VTC 2024, Washington, DC. Published in the IEEE conference proceedings
Abstract:In the diverse landscape of 6G networks, where wireless connectivity demands surge and spectrum resources remain limited, flexible spectrum access becomes paramount. The success of crafting such schemes hinges on our ability to accurately characterize spectrum demand patterns across space and time. This paper presents a data-driven methodology for estimating spectrum demand variations over space and identifying key drivers of these variations in the mobile broadband landscape. By leveraging geospatial analytics and machine learning, the methodology is applied to a case study in Canada to estimate spectrum demand dynamics in urban regions. Our proposed model captures 70% of the variability in spectrum demand when trained on one urban area and tested on another. These insights empower regulators to navigate the complexities of 6G networks and devise effective policies to meet future network demands.
[AI-5] AI-Enabled Data-driven Intelligence for Spectrum Demand Estimation
【速读】:该论文旨在解决无线服务需求快速增长背景下,移动网络运营商和监管机构在确保频谱资源充足可用方面面临的挑战,核心问题是如何准确预测频谱需求以实现高效频谱资源配置与管理。解决方案的关键在于提出一种数据驱动的方法,利用人工智能(AI)和机器学习(ML)技术,基于站点许可数据和众包数据构建多种频谱需求代理指标(proxy),并通过真实移动网络流量数据进行验证,最终实现对频谱需求的高精度估计(增强型代理指标的R²值达0.89),并在加拿大五大主要城市中验证了模型的泛化能力和鲁棒性,从而支持动态频谱规划与政策调整。
链接: https://arxiv.org/abs/2603.09916
作者: Colin Brown,Mohamad Alkadamani,Halim Yanikomeroglu
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注: Presented at an IEEE ICC 2025 Workshop and published in the conference proceedings
Abstract:Accurately forecasting spectrum demand is a key component for efficient spectrum resource allocation and management. With the rapid growth in demand for wireless services, mobile network operators and regulators face increasing challenges in ensuring adequate spectrum availability. This paper presents a data-driven approach leveraging artificial intelligence (AI) and machine learning (ML) to estimate and manage spectrum demand. The approach uses multiple proxies of spectrum demand, drawing from site license data and derived from crowdsourced data. These proxies are validated against real-world mobile network traffic data to ensure reliability, achieving an R^2 value of 0.89 for an enhanced proxy. The proposed ML models are tested and validated across five major Canadian cities, demonstrating their generalizability and robustness. These contributions assist spectrum regulators in dynamic spectrum planning, enabling better resource allocation and policy adjustments to meet future network demands.
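The reported R^2 is the standard coefficient of determination, i.e. the share of variance in observed demand that the proxy explains. A plain-Python sketch (illustrative, not the paper's code):

```python
def r_squared(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot: fraction of variance explained by the forecast
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

# A perfect proxy explains all variance; a constant one explains none.
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0
```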
[AI-6] MedMASLab: A Unified Orchestration Framework for Benchmarking Multimodal Medical Multi-Agent Systems
【Quick Read】: This paper addresses architectural fragmentation and the absence of multimodal integration standards in multi-agent systems (MAS) for clinical decision support, including non-uniform data ingestion pipelines, inconsistent visual-reasoning evaluation, and missing cross-specialty benchmarks. The key to the solution is MedMASLab, a unified framework and benchmark platform with three core innovations: (1) a standardized multimodal agent communication protocol that seamlessly integrates 11 heterogeneous MAS architectures across 24 medical modalities; (2) an automated clinical reasoning evaluator, a zero-shot semantic evaluation paradigm that uses large vision-language models to verify diagnostic logic and visual grounding, moving beyond the limits of lexical matching; and (3) the most comprehensive benchmark to date, spanning 11 organ systems and 473 diseases and standardizing data from 11 clinical benchmarks. The work establishes a new technical baseline for future autonomous clinical systems and reveals that current MAS architectures are markedly fragile when transferring across medical specialties.
Link: https://arxiv.org/abs/2603.09909
Authors: Yunhang Qian, Xiaobin Hu, Jiaquan Yu, Siyang Xin, Xiaokun Chen, Jiangning Zhang, Peng-Tao Jiang, Jiawei Liu, Hongwei Bran Li
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:While Multi-Agent Systems (MAS) show potential for complex clinical decision support, the field remains hindered by architectural fragmentation and the lack of standardized multimodal integration. Current medical MAS research suffers from non-uniform data ingestion pipelines, inconsistent visual-reasoning evaluation, and a lack of cross-specialty benchmarking. To address these challenges, we present MedMASLab, a unified framework and benchmarking platform for multimodal medical multi-agent systems. MedMASLab introduces: (1) A standardized multimodal agent communication protocol that enables seamless integration of 11 heterogeneous MAS architectures across 24 medical modalities. (2) An automated clinical reasoning evaluator, a zero-shot semantic evaluation paradigm that overcomes the limitations of lexical string-matching by leveraging large vision-language models to verify diagnostic logic and visual grounding. (3) The most extensive benchmark to date, spanning 11 organ systems and 473 diseases, standardizing data from 11 clinical benchmarks. Our systematic evaluation reveals a critical domain-specific performance gap: while MAS improves reasoning depth, current architectures exhibit significant fragility when transitioning between specialized medical sub-domains. We provide a rigorous ablation of interaction mechanisms and cost-performance trade-offs, establishing a new technical baseline for future autonomous clinical systems. The source code and data are publicly available at: this https URL
[AI-7] LCA: Local Classifier Alignment for Continual Learning
【Quick Read】: This paper addresses catastrophic forgetting in continual learning under changing environments. Existing approaches, such as fine-tuning only on the first task or consolidating task knowledge into a unified backbone, often face a potential mismatch between task-specific classifiers and the adapted backbone, hurting performance and robustness. The key to the solution is a novel Local Classifier Alignment (LCA) loss that strengthens the alignment between the classifier and the backbone's feature representation, improving both generalization over all observed tasks and overall robustness. Combined with a model-merging strategy, it forms a complete continual learning framework that achieves leading performance on several standard benchmarks, in some settings surpassing the state of the art by a large margin.
Link: https://arxiv.org/abs/2603.09888
Authors: Tung Tran, Danilo Vasconcellos Vargas, Khoat Than
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:A fundamental requirement for intelligent systems is the ability to learn continuously under changing environments. However, models trained in this regime often suffer from catastrophic forgetting. Leveraging pre-trained models has recently emerged as a promising solution, since their generalized feature extractors enable faster and more robust adaptation. While some earlier works mitigate forgetting by fine-tuning only on the first task, this approach quickly deteriorates as the number of tasks grows and the data distributions diverge. More recent research instead seeks to consolidate task knowledge into a unified backbone, or to adapt the backbone as new tasks arrive. However, such approaches may create a (potential) mismatch between task-specific classifiers and the adapted backbone. To address this issue, we propose a novel Local Classifier Alignment (LCA) loss to better align the classifier with the backbone. Theoretically, we show that this LCA loss can enable the classifier to not only generalize well for all observed tasks, but also improve robustness. Furthermore, we develop a complete solution for continual learning, following the model merging approach and using LCA. Extensive experiments on several standard benchmarks demonstrate that our method often achieves leading performance, sometimes surpassing the state-of-the-art methods by a large margin.
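The abstract does not give the LCA loss in closed form. As a purely hypothetical illustration of what a classifier-backbone "mismatch" measure could look like, one can compare each class's classifier weight vector against the backbone's class-mean feature; the function name and toy data below are invented, not the paper's formulation:

```python
def mismatch(weights, feats, labels):
    # Toy classifier-backbone mismatch: mean squared distance between each
    # class's classifier weight vector and the backbone's class-mean feature.
    classes = sorted(set(labels))
    total = 0.0
    for c in classes:
        fs = [f for f, y in zip(feats, labels) if y == c]
        mean = [sum(col) / len(fs) for col in zip(*fs)]
        total += sum((w - m) ** 2 for w, m in zip(weights[c], mean))
    return total / len(classes)

aligned = mismatch({0: [1.0, 0.0], 1: [0.0, 1.0]},
                   [[1.0, 0.0], [0.0, 1.0]], [0, 1])    # 0.0: perfectly aligned
shifted = mismatch({0: [0.0, 0.0]}, [[1.0, 1.0]], [0])  # 2.0: backbone drifted
```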
[AI-8] Emerging Extrinsic Dexterity in Cluttered Scenes via Dynamics-aware Policy Learning
【Quick Read】: This paper addresses the challenge of achieving extrinsic dexterity in cluttered scenes, i.e., exploiting environmental contact to overcome the limits of prehensile manipulation, especially in complex environments where multiple interacting objects have inherently coupled dynamics. Because existing methods do not explicitly model such complex dynamics, they struggle with non-prehensile manipulation in clutter, limiting real-world applicability. The key to the solution is a Dynamics-Aware Policy Learning (DAPL) framework that learns a representation of contact-induced object dynamics through explicit world modeling and uses it to condition reinforcement learning, allowing extrinsic dexterity to emerge naturally without hand-crafted contact heuristics or complex reward shaping.
Link: https://arxiv.org/abs/2603.09882
Authors: Yixin Zheng, Jiangran Lyu, Yifan Zhang, Jiayi Chen, Mi Yan, Yuntian Deng, Xuesong Shi, Xiaoguang Zhao, Yizhou Wang, Zhizheng Zhang, He Wang
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL
Abstract:Extrinsic dexterity leverages environmental contact to overcome the limitations of prehensile manipulation. However, achieving such dexterity in cluttered scenes remains challenging and underexplored, as it requires selectively exploiting contact among multiple interacting objects with inherently coupled dynamics. Existing approaches lack explicit modeling of such complex dynamics and therefore fall short in non-prehensile manipulation in cluttered environments, which in turn limits their practical applicability in real-world environments. In this paper, we introduce a Dynamics-Aware Policy Learning (DAPL) framework that can facilitate policy learning with a learned representation of contact-induced object dynamics in cluttered environments. This representation is learned through explicit world modeling and used to condition reinforcement learning, enabling extrinsic dexterity to emerge without hand-crafted contact heuristics or complex reward shaping. We evaluate our approach in both simulation and the real world. Our method outperforms prehensile manipulation, human teleoperation, and prior representation-based policies by over 25% in success rate on unseen simulated cluttered scenes with varying densities. The real-world success rate reaches around 50% across 10 cluttered scenes, while a practical grocery deployment further demonstrates robust sim-to-real transfer and applicability.
[AI-9] A Graph-Based Approach to Spectrum Demand Prediction Using Hierarchical Attention Networks
【Quick Read】: This paper addresses the tension between surging wireless connectivity demand and the finite nature of spectrum resources, with efficient spectrum management as the core problem. The key to the solution is HR-GAT, a Hierarchical Resolution Graph Attention Network that predicts spectrum demand from geospatial data and explicitly handles spatial autocorrelation, a challenge that often degrades the generalization of standard machine learning models, thereby improving both accuracy and robustness. Tested across five major Canadian cities, HR-GAT improves the predictive accuracy of spectrum demand by 21% over eight baseline models, confirming its superior performance.
Link: https://arxiv.org/abs/2603.09859
Authors: Mohamad Alkadamani, Halim Yanikomeroglu, Amir Ghasemi
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Comments: 7 pages, 6 figures. Presented at IEEE GLOBECOM 2025, Taiwan. To appear in the conference proceedings
Abstract:The surge in wireless connectivity demand, coupled with the finite nature of spectrum resources, compels the development of efficient spectrum management approaches. Spectrum sharing presents a promising avenue, although it demands precise characterization of spectrum demand for informed policy-making. This paper introduces HR-GAT, a hierarchical resolution graph attention network model, designed to predict spectrum demand using geospatial data. HR-GAT adeptly handles complex spatial demand patterns and resolves issues of spatial autocorrelation that usually challenge standard machine learning models, often resulting in poor generalization. Tested across five major Canadian cities, HR-GAT improves predictive accuracy of spectrum demand by 21% over eight baseline models, underscoring its superior performance and reliability.
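HR-GAT itself is hierarchical and multi-resolution, but the graph-attention primitive it builds on can be sketched on scalar node features. The scoring function `att` below is a placeholder, not the paper's learned attention:

```python
import math

def gat_aggregate(h, neighbors, att):
    # alpha_ij = softmax over neighbors j of att(h_i, h_j); h_i' = sum_j alpha_ij * h_j
    out = []
    for i, nbrs in enumerate(neighbors):
        scores = [att(h[i], h[j]) for j in nbrs]
        m = max(scores)                      # stabilize the softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append(sum(wi / z * h[j] for wi, j in zip(w, nbrs)))
    return out

# With a constant score, the update reduces to a plain neighbor average.
h = [1.0, 2.0, 4.0]
out = gat_aggregate(h, [[1, 2], [0], [0]], lambda a, b: 0.0)  # [3.0, 1.0, 1.0]
```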
[AI-10] SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases EACL2026
【Quick Read】: This paper addresses the over-reliance of current evaluations of Large Audio Language Models (LALMs) on automatic speech recognition (ASR), which neglects non-speech components and multi-dimensional audio understanding. The key to the solution is SCENEBench, a benchmark suite spanning four dimensions (spatial, cross-lingual, environmental, and non-speech) that systematically assesses LALMs on background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition in realistic scenarios. The benchmark combines synthetic constructions with natural audio samples to validate ecological validity and also measures model latency, providing a more comprehensive, practice-oriented evaluation standard for model performance.
Link: https://arxiv.org/abs/2603.09853
Authors: Laya Iyer, Angelina Wang, Sanmi Koyejo
Institutions: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted to EACL 2026 (Main Conference). 10 pages, 10 figures. Camera-ready version
Abstract:Advances in large language models (LLMs) have enabled significant capabilities in audio processing, resulting in state-of-the-art models now known as Large Audio Language Models (LALMs). However, minimal work has been done to measure audio understanding beyond automatic speech recognition (ASR). This paper closes that gap by proposing a benchmark suite, SCENEBench (Spatial, Cross-lingual, Environmental, Non-speech Evaluation), that targets a broad form of audio comprehension across four real-world categories: background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition. These four categories are selected based on understudied needs from accessibility technology and industrial noise monitoring. In addition to performance, we also measure model latency. The purpose of this benchmark suite is to assess audio beyond just what words are said - rather, how they are said and the non-speech components of the audio. Because our audio samples are synthetically constructed (e.g., by overlaying two natural audio samples), we further validate our benchmark against 20 natural audio items per task, sub-sampled from existing datasets to match our task criteria, to assess ecological validity. We assess five state-of-the-art LALMs and find critical gaps: performance varies across tasks, with some tasks performing below random chance and others achieving high accuracy. These results provide direction for targeted improvements in model capabilities.
[AI-11] Correction of Transformer-Based Models with Smoothing Pseudo-Projector
【Quick Read】: This paper addresses the noise sensitivity of existing language models and other neural networks, in particular interference from redundant directions induced by label-irrelevant input content. The key to the solution is a lightweight module, the pseudo-projector, essentially a hidden-representation corrector that suppresses directions induced by label-irrelevant content and thereby reduces noise sensitivity. The design is inspired by multigrid (MG) methods: learnable restriction and prolongation operators mimic the behavior of an orthogonal projection. Although the practical formulation does not strictly satisfy the properties of an exact projector, it markedly improves training dynamics and robustness, as validated on transformer-based text classification tasks and controlled synthetic benchmarks.
Link: https://arxiv.org/abs/2603.09815
Authors: Vitaly Bulgakov
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 29 pages, 23 figures
Abstract:The pseudo-projector is a lightweight modification that can be integrated into existing language models and other neural networks without altering their core architecture. It can be viewed as a hidden-representation corrector that reduces sensitivity to noise by suppressing directions induced by label-irrelevant input content. The design is inspired by the multigrid (MG) paradigm, originally developed to accelerate the convergence of iterative solvers for partial differential equations and boundary value problems, and later extended to more general linear systems through algebraic multigrid methods. We refer to the method as a pseudo-projector because its linear prototype corresponds to a strictly idempotent orthogonal projector, whereas the practical formulation employs learnable restriction and prolongation operators and therefore does not, in general, satisfy the properties of an exact orthogonal projection. We evaluate the proposed approach on transformer-based text classification tasks, as well as controlled synthetic benchmarks, demonstrating its effectiveness in improving training dynamics and robustness. Experimental results, together with supporting theoretical heuristics, indicate consistent improvements in training behavior across a range of settings, with no adverse effects observed otherwise. Our next step will be to extend this approach to language models.
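The abstract notes that the linear prototype of the pseudo-projector is a strictly idempotent orthogonal projector built from restriction and prolongation. For a fixed restriction R, that prototype is P = R^T (R R^T)^{-1} R, and its idempotence can be checked numerically. The full-weighting-style R below is an invented example; the paper's learnable operators will generally not satisfy exact idempotence:

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def inv2(M):
    # inverse of a 2x2 matrix
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Restriction from 4 fine variables to 2 coarse ones (full-weighting style)
R = [[0.5, 0.5, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5]]
Rt = transpose(R)
# Orthogonal projector onto the row space of R: P = R^T (R R^T)^{-1} R
P = matmul(Rt, matmul(inv2(matmul(R, Rt)), R))
P2 = matmul(P, P)  # idempotence: P @ P equals P
```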
[AI-12] Exploiting Label-Aware Channel Scoring for Adaptive Channel Pruning in Split Learning
【Quick Read】: This paper addresses the significant communication overhead in Split Learning (SL) caused by transmitting intermediate feature representations (smashed data), particularly when many client devices participate. The key to the solution is an adaptive channel pruning-aided SL (ACP-SL) scheme: a Label-aware Channel Importance Scoring (LCIS) module distinguishes important from unimportant channels, and an Adaptive Channel Pruning (ACP) module prunes low-importance channels, compressing the smashed data and reducing the communication burden. Experiments show that ACP-SL outperforms benchmark schemes in test accuracy and reaches a target accuracy in fewer training rounds, effectively reducing communication overhead.
Link: https://arxiv.org/abs/2603.09792
Authors: Jialei Tan, Zheng Lin, Xiangming Cai, Ruoxi Zhu, Zihan Fang, Pingping Chen, Wei Ni
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 6 pages, 6 figures
Abstract:Split learning (SL) transfers most of the training workload to the server, which alleviates computational burden on client devices. However, the transmission of intermediate feature representations, referred to as smashed data, incurs significant communication overhead, particularly when a large number of client devices are involved. To address this challenge, we propose an adaptive channel pruning-aided SL (ACP-SL) scheme. In ACP-SL, a label-aware channel importance scoring (LCIS) module is designed to generate channel importance scores, distinguishing important channels from less important ones. Based on these scores, an adaptive channel pruning (ACP) module is developed to prune less important channels, thereby compressing the corresponding smashed data and reducing the communication overhead. Experimental results show that ACP-SL consistently outperforms benchmark schemes in test accuracy. Furthermore, it reaches a target test accuracy in fewer training rounds, thereby reducing communication overhead.
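A minimal sketch of the pruning step, assuming the importance scores are already given (in the paper they come from the label-aware LCIS module) and using invented toy shapes:

```python
def prune_channels(smashed, scores, keep_ratio):
    # Keep the highest-scoring fraction of channels; only those are transmitted.
    c = len(smashed)
    k = max(1, int(c * keep_ratio))
    keep = sorted(sorted(range(c), key=lambda i: scores[i], reverse=True)[:k])
    return keep, [smashed[i] for i in keep]

# 4 channels of smashed data, half pruned away before transmission
keep, compressed = prune_channels([[1], [2], [3], [4]],
                                  [0.1, 0.9, 0.4, 0.8], keep_ratio=0.5)
# keep == [1, 3]; compressed == [[2], [4]]
```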
[AI-13] A Hybrid Quantum-Classical Framework for Financial Volatility Forecasting Based on Quantum Circuit Born Machines
【Quick Read】: This paper addresses the difficulty that traditional econometric models and classical machine learning methods have in handling the non-linear, non-stationary nature of financial time series for volatility forecasting. The key to the solution is a hybrid quantum-classical framework combining a Long Short-Term Memory (LSTM) network with a Quantum Circuit Born Machine (QCBM): the LSTM extracts complex dynamic features from historical time series, while the QCBM serves as a learnable prior module that supplies a high-quality probability distribution to guide the forecasting process, significantly improving accuracy. Experiments show the hybrid model outperforms a purely classical LSTM baseline on multiple metrics, including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and QLIKE loss, demonstrating the potential of quantum computing to enhance financial forecasting.
Link: https://arxiv.org/abs/2603.09789
Authors: Yixiong Chen
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Comments:
Abstract:Accurate forecasting of financial market volatility is crucial for risk management, option pricing, and portfolio optimization. Traditional econometric models and classical machine learning methods face challenges in handling the inherent non-linear and non-stationary characteristics of financial time series. In recent years, the rapid development of quantum computing has provided a new paradigm for solving complex optimization and sampling problems. This paper proposes a novel hybrid quantum-classical computing framework aimed at combining the powerful representation capabilities of classical neural networks with the unique advantages of quantum models. For the specific task of financial market volatility forecasting, we designed and implemented a hybrid model based on this framework, which combines a Long Short-Term Memory (LSTM) network with a Quantum Circuit Born Machine (QCBM). The LSTM is responsible for extracting complex dynamic features from historical time series data, while the QCBM serves as a learnable prior module, providing the model with a high-quality prior distribution to guide the forecasting process. We evaluated the model on two real financial datasets consisting of 5-minute high-frequency data from the Shanghai Stock Exchange (SSE) Composite Index and CSI 300 Index. Experimental results show that, compared to a purely classical LSTM baseline model, our hybrid quantum-classical model demonstrates significant advantages across multiple key metrics, including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and QLIKE loss, proving the great potential of quantum computing in enhancing the capabilities of financial forecasting models. More broadly, the proposed hybrid framework offers a flexible architecture that may be adapted to other machine learning tasks involving high-dimensional, complex, or non-linear data distributions.
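The QLIKE loss mentioned alongside MSE and RMSE has several equivalent parameterizations; one common (Patton-style) form scores a variance forecast h against realized variance s^2 as s^2/h - log(s^2/h) - 1, which is non-negative and zero only for a perfect forecast. A sketch under that assumption:

```python
import math

def qlike(realized_var, forecast_var):
    # Patton-style QLIKE: ratio - log(ratio) - 1, averaged over the sample
    total = 0.0
    for s2, h in zip(realized_var, forecast_var):
        r = s2 / h
        total += r - math.log(r) - 1.0
    return total / len(realized_var)

realized = [1.0, 2.0, 0.5]
perfect = qlike(realized, realized)         # 0.0 for a perfect forecast
doubled = qlike(realized, [2.0, 4.0, 1.0])  # > 0: over-forecasting is penalized
```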
[AI-14] Quantifying the Necessity of Chain of Thought through Opaque Serial Depth
【Quick Read】: This paper addresses the difficulty of monitoring the internal reasoning of large language models (LLMs), in particular deep reasoning that is not externalized in the chain of thought (CoT). The core challenge is quantifying the longest serial computation a model can perform without interpretable intermediate steps such as CoT, which the authors formalize as "opaque serial depth". The key to the solution is this formalization together with an automated method that computes upper bounds on the opaque serial depth of arbitrary neural networks. Experiments yield numeric upper bounds for Gemma 3 models and show that Mixture-of-Experts (MoE) architectures likely have lower opaque serial depth than dense models, providing a quantitative path to assessing a model's capacity for non-externalized reasoning.
Link: https://arxiv.org/abs/2603.09786
Authors: Jonah Brown-Cohen, David Lindner, Rohin Shah
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) tend to externalize their reasoning in their chain of thought, making the chain of thought a good target for monitoring. This is partially an inherent feature of the Transformer architecture: sufficiently long serial cognition must pass through the chain of thought (Korbak et al., 2025). We formalize this argument through the notion of opaque serial depth, given by the length of the longest computation that can be done without the use of interpretable intermediate steps like chain of thought. Given this formalization, we compute numeric upper bounds on the opaque serial depth of Gemma 3 models, as well as asymptotic results for additional architectures beyond standard LLMs. We also open-source an automated method that can calculate upper bounds on the opaque serial depth of arbitrary neural networks, and use it to demonstrate that Mixture-of-Experts models likely have lower depth than dense models. Overall, our results suggest that opaque serial depth is a useful tool for understanding the potential for models to do significant reasoning that is not externalized.
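The definition can be made concrete on a computation graph: treat interpretable nodes (e.g., chain-of-thought tokens) as resetting the serial chain, and measure the longest all-opaque chain along any path. A toy DAG sketch of this reading (an illustration of the concept, not the paper's bound-computing tool):

```python
from collections import deque

def opaque_serial_depth(n, edges, opaque):
    # Longest chain of consecutive opaque nodes along any path of a DAG;
    # interpretable nodes reset the chain to zero.
    adj = [[] for _ in range(n)]
    indeg = [0] * n
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    depth = [0] * n  # best incoming opaque-chain length seen so far
    q = deque(i for i in range(n) if indeg[i] == 0)
    best = 0
    while q:
        u = q.popleft()
        depth[u] = depth[u] + 1 if opaque[u] else 0
        best = max(best, depth[u])
        for v in adj[u]:
            depth[v] = max(depth[v], depth[u])
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)
    return best

# 0 -> 1 -> 2 -> 3 with node 2 interpretable: longest opaque run is 0 -> 1
d = opaque_serial_depth(4, [(0, 1), (1, 2), (2, 3)], [True, True, False, True])  # 2
```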
[AI-15] World2Mind: Cognition Toolkit for Allocentric Spatial Reasoning in Foundation Models
【Quick Read】: This paper addresses the weak spatial reasoning of current Multimodal Foundation Models (MFMs): existing methods either overfit statistical shortcuts in 3D grounding data or remain confined to 2D visual perception, yielding low spatial reasoning accuracy and poor generalization. The key to the solution is World2Mind, a training-free spatial intelligence toolkit that combines 3D reconstruction and instance segmentation models to build structured spatial cognitive maps, and introduces an Allocentric-Spatial Tree (AST) that models layouts with elliptical parameters to provide robust geometric-topological priors. It further employs a three-stage reasoning chain (tool invocation assessment, modality-decoupled cue collection, and geometry-semantics interwoven reasoning) to mitigate the impact of 3D reconstruction errors, substantially improving MFMs' spatial understanding in complex scenes.
Link: https://arxiv.org/abs/2603.09774
Authors: Shouwei Ruan, Bin Wang, Zhenyu Wu, Qihui Zhu, Yuxiang Zhang, Hang Su, Yubin Wang
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Achieving robust spatial reasoning remains a fundamental challenge for current Multimodal Foundation Models (MFMs). Existing methods either overfit statistical shortcuts via 3D grounding data or remain confined to 2D visual perception, limiting both spatial reasoning accuracy and generalization in unseen scenarios. Inspired by the spatial cognitive mapping mechanisms of biological intelligence, we propose World2Mind, a training-free spatial intelligence toolkit. At its core, World2Mind leverages 3D reconstruction and instance segmentation models to construct structured spatial cognitive maps, empowering MFMs to proactively acquire targeted spatial knowledge regarding interested landmarks and routes of interest. To provide robust geometric-topological priors, World2Mind synthesizes an Allocentric-Spatial Tree (AST) that uses elliptical parameters to model the top-down layout of landmarks accurately. To mitigate the inherent inaccuracies of 3D reconstruction, we introduce a three-stage reasoning chain comprising tool invocation assessment, modality-decoupled cue collection, and geometry-semantics interwoven reasoning. Extensive experiments demonstrate that World2Mind boosts the performance of frontier models, such as GPT-5.2, by 5%~18%. Astonishingly, relying solely on the AST-structured text, purely text-only foundation models can perform complex 3D spatial reasoning, achieving performance approaching that of advanced multimodal models.
[AI-16] AutoAgent: Evolving Cognition and Elastic Memory Orchestration for Adaptive Agents
【Quick Read】: This paper addresses the difficulty of reconciling long-term experiential learning with real-time, context-sensitive decision-making in autonomous agent frameworks, which manifests as static cognition, rigid workflow dependence, and inefficient context usage that limit adaptability in open-ended, non-stationary environments. The key to the solution is AutoAgent, a self-evolving multi-agent framework built on three tightly coupled components: evolving cognition, on-the-fly contextual decision-making, and elastic memory orchestration. Each agent maintains structured prompt-level cognition over tools, self-capabilities, peer expertise, and task knowledge, and combines it with live task context to select actions from a unified space (tool calls, LLM-based generation, and inter-agent requests). The Elastic Memory Orchestrator preserves raw records, compresses redundant trajectories, and builds reusable episodic abstractions, reducing token overhead while retaining decision-critical evidence. A closed-loop cognitive evolution process aligns intended actions with observed outcomes to continuously update cognition and expand reusable skills without external retraining, unifying efficient long-horizon reasoning with reliable context-aware decision-making.
Link: https://arxiv.org/abs/2603.09716
Authors: Xiaoxing Wang, Ning Liao, Shikun Wei, Chen Tang, Feiyu Xiong
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Autonomous agent frameworks still struggle to reconcile long-term experiential learning with real-time, context-sensitive decision-making. In practice, this gap appears as static cognition, rigid workflow dependence, and inefficient context usage, which jointly limit adaptability in open-ended and non-stationary environments. To address these limitations, we present AutoAgent, a self-evolving multi-agent framework built on three tightly coupled components: evolving cognition, on-the-fly contextual decision-making, and elastic memory orchestration. At the core of AutoAgent, each agent maintains structured prompt-level cognition over tools, self-capabilities, peer expertise, and task knowledge. During execution, this cognition is combined with live task context to select actions from a unified space that includes tool calls, LLM-based generation, and inter-agent requests. To support efficient long-horizon reasoning, an Elastic Memory Orchestrator dynamically organizes interaction history by preserving raw records, compressing redundant trajectories, and constructing reusable episodic abstractions, thereby reducing token overhead while retaining decision-critical evidence. These components are integrated through a closed-loop cognitive evolution process that aligns intended actions with observed outcomes to continuously update cognition and expand reusable skills, without external retraining. Empirical results across retrieval-augmented reasoning, tool-augmented agent benchmarks, and embodied task environments show that AutoAgent consistently improves task success, tool-use efficiency, and collaborative robustness over static and memory-augmented baselines. Overall, AutoAgent provides a unified and practical foundation for adaptive autonomous agents that must learn from experience while making reliable context-aware decisions in dynamic environments.
[AI-17] Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT
【Quick Read】: This paper addresses the problem that, during visual instruction tuning of vision-language large models (VLLMs), many samples can be solved via linguistic patterns or common-sense shortcuts without genuine cross-modal reasoning. Existing data selection methods typically require costly proxy-model training and focus only on sample difficulty or diversity, failing to identify samples that truly contribute to joint vision-language reasoning. The key to the solution is CVS, a training-free data selection method built on the insight that, for high-quality multimodal samples, conditioning on the question should substantially alter the model's assessment of answer validity. CVS uses a frozen VLLM as an evaluator and compares answer-validity judgments with and without the question, selecting samples that require joint vision-language reasoning while filtering out semantic-conflict noise.
Link: https://arxiv.org/abs/2603.09715
Authors: Peng Sun, Huawen Shen, Yi Ban, Tianfan Fu, Yanbo Wang, Yuqiang Li
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Visual instruction tuning is crucial for improving vision-language large models (VLLMs). However, many samples can be solved via linguistic patterns or common-sense shortcuts, without genuine cross-modal reasoning, limiting the effectiveness of multimodal learning. Prior data selection methods often rely on costly proxy model training and focus on difficulty or diversity, failing to capture a sample’s true contribution to vision-language joint reasoning. In this paper, we propose CVS, a training-free data selection method based on the insight that, for high-quality multimodal samples, introducing the question should substantially alter the model’s assessment of answer validity given an image. CVS leverages a frozen VLLM as an evaluator and measures the discrepancy in answer validity with and without conditioning on the question, enabling the identification of samples that require vision-language joint reasoning while filtering semantic-conflict noise. Experiments on Vision-Flan and The Cauldron show that CVS achieves solid performance across datasets. On Vision-Flan, CVS outperforms full-data training by 3.5% and 4.8% using only 10% and 15% of the data, respectively, and remains robust on the highly heterogeneous Cauldron dataset. Moreover, CVS reduces computational cost by 17.3% and 44.4% compared to COINCIDE and XMAS.
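The selection rule can be sketched with invented log-probability scores standing in for the frozen VLLM evaluator; `cvs_gap` and `select` are illustrative names, not the paper's API:

```python
def cvs_gap(logp_with_q, logp_without_q):
    # How much does conditioning on the question change the model's
    # log-probability that the answer is valid? A large gap suggests the
    # question matters, i.e. the sample exercises joint reasoning.
    return logp_with_q - logp_without_q

def select(samples, ratio):
    # samples: (sample_id, logp_with_question, logp_without_question)
    ranked = sorted(samples, key=lambda s: cvs_gap(s[1], s[2]), reverse=True)
    k = max(1, int(len(ranked) * ratio))
    return [s[0] for s in ranked[:k]]

picked = select([("a", -1.0, -5.0),   # the question changes everything
                 ("b", -1.0, -1.2),   # answerable from the image alone
                 ("c", -2.0, -2.1)], ratio=1 / 3)  # ["a"]
```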
[AI-18] OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences
【Quick Read】: This paper addresses the limitations of safety alignment for Multimodal Large Language Models (MLLMs): existing methods mainly target malicious intent or situational violations while overlooking latent risks arising from causal chains of consequences. The authors therefore propose shifting the alignment paradigm from intent-driven to consequence-driven safety to support robust deployment of autonomous and embodied agents. The key to the solution is the Consequence-Aware Safety Policy Optimization (CASPO) framework, which uses the model's intrinsic reasoning as a dynamic reference to generate token-level self-distillation reward signals, strengthening the model's ability to project risky consequences. Experiments show CASPO markedly reduces risk-identification failure rates (to 7.3% for Qwen2.5-VL-7B and 5.7% for Qwen3-VL-4B) while maintaining overall performance.
Link: https://arxiv.org/abs/2603.09706
Authors: Ming Wen, Kun Yang, Jingyu Zhang, Yuxuan Liu, shiwen cui, Shouling Ji, Xingjun Ma
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 30 pages
Abstract:While safety alignment for Multimodal Large Language Models (MLLMs) has gained significant attention, current paradigms primarily target malicious intent or situational violations. We propose shifting the safety frontier toward consequence-driven safety, a paradigm essential for the robust deployment of autonomous and embodied agents. To formalize this shift, we introduce OOD-MMSafe, a benchmark comprising 455 curated query-image pairs designed to evaluate a model’s ability to identify latent hazards within context-dependent causal chains. Our analysis reveals a pervasive causal blindness among frontier models, with failure rates as high as 67.5% in high-capacity closed-source models, and identifies a preference ceiling where static alignment yields format-centric failures rather than improved safety reasoning as model capacity grows. To address these bottlenecks, we develop the Consequence-Aware Safety Policy Optimization (CASPO) framework, which integrates the model’s intrinsic reasoning as a dynamic reference for token-level self-distillation rewards. Experimental results demonstrate that CASPO significantly enhances consequence projection, reducing the failure ratio of risk identification to 7.3% for Qwen2.5-VL-7B and 5.7% for Qwen3-VL-4B while maintaining overall effectiveness.
[AI-19] EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages
【Quick Read】: This paper addresses the problem that large language models (LLMs) achieve strong results on code generation benchmarks yet often rely on memorization of training data rather than genuine reasoning. The key to the solution is EsoLang-Bench, a benchmark built on five esoteric programming languages (Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare) that are economically irrational to target in pre-training and therefore resistant to benchmark gaming. By requiring models to learn new languages from documentation, interpreter feedback, and iterative experimentation, mirroring how humans acquire new skills, the benchmark measures transferable reasoning ability rather than data contamination or pattern matching.
Link: https://arxiv.org/abs/2603.09678
Authors: Aman Sharma, Paras Chopra
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 24 pages, 7 figures, preprint
Abstract:Large language models achieve near-ceiling performance on code generation benchmarks, yet these results increasingly reflect memorization rather than genuine reasoning. We introduce EsoLang-Bench, a benchmark using five esoteric programming languages (Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare) that lack benchmark gaming incentives due to their economic irrationality for pre-training. These languages require the same computational primitives as mainstream programming but have 1,000-100,000x fewer public repositories than Python (based on GitHub search counts). We evaluate five frontier models across five prompting strategies and find a dramatic capability gap: models achieving 85-95% on standard benchmarks score only 0-11% on equivalent esoteric tasks, with 0% accuracy beyond the Easy tier. Few-shot learning and self-reflection fail to improve performance, suggesting these techniques exploit training priors rather than enabling genuine learning. EsoLang-Bench provides the first benchmark designed to mimic human learning by acquiring new languages through documentation, interpreter feedback, and iterative experimentation, measuring transferable reasoning skills resistant to data contamination.
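Brainfuck's semantics are simple enough to fit in a few lines, which underlines the point that the obstacle for LLMs is data scarcity rather than computational complexity. A minimal output-only interpreter (the `,` input command is omitted for brevity):

```python
def run_bf(code, cells=30000):
    # Minimal Brainfuck interpreter supporting > < + - . [ ]
    stack, jump = [], {}
    for i, ch in enumerate(code):        # pre-match the brackets
        if ch == '[':
            stack.append(i)
        elif ch == ']':
            j = stack.pop()
            jump[i], jump[j] = j, i
    tape, ptr, pc, out = [0] * cells, 0, 0, []
    while pc < len(code):
        ch = code[pc]
        if ch == '>':
            ptr += 1
        elif ch == '<':
            ptr -= 1
        elif ch == '+':
            tape[ptr] = (tape[ptr] + 1) % 256  # byte cells wrap mod 256
        elif ch == '-':
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == '.':
            out.append(chr(tape[ptr]))
        elif ch == '[' and tape[ptr] == 0:
            pc = jump[pc]
        elif ch == ']' and tape[ptr] != 0:
            pc = jump[pc]
        pc += 1
    return ''.join(out)

# 8 * 8 + 1 = 65, i.e. ASCII 'A'
print(run_bf('++++++++[>++++++++<-]>+.'))  # A
```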
[AI-20] Logics-Parsing-Omni Technical Report
【Quick Read】: This paper addresses fragmented task definitions and the heterogeneity of unstructured data in multimodal parsing, proposing the Omni Parsing framework for unified modeling and parsing of documents, images, and audio-visual streams. The key is a progressive three-level parsing paradigm: Holistic Detection achieves precise spatial-temporal grounding of objects or events as the perceptual foundation; Fine-grained Recognition performs symbolization (e.g., OCR/ASR) and attribute extraction for structured entity parsing; and Multi-level Interpreting builds a reasoning chain from local semantics to global logic. An evidence anchoring mechanism enforces strict alignment between high-level semantic descriptions and low-level facts, enabling evidence-grounded logical induction that turns unstructured signals into standardized knowledge that is locatable, enumerable, and traceable.
Link: https://arxiv.org/abs/2603.09677
Authors: Xin An, Jingyi Cai, Xiangyang Chen, Huayao Liu, Peiting Liu, Peng Wang, Bei Yang, Xiuwen Zhu, Yongfan Chen, Baoyu Hou, Shuzhao Li, Weidong Ren, Fan Yang, Jiangtao Zhang, Xiaoxiao Xu, Lin Qu
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Addressing the challenges of fragmented task definitions and the heterogeneity of unstructured data in multimodal parsing, this paper proposes the Omni Parsing framework. This framework establishes a Unified Taxonomy covering documents, images, and audio-visual streams, introducing a progressive parsing paradigm that bridges perception and cognition. Specifically, the framework integrates three hierarchical levels: 1) Holistic Detection, which achieves precise spatial-temporal grounding of objects or events to establish a geometric baseline for perception; 2) Fine-grained Recognition, which performs symbolization (e.g., OCR/ASR) and attribute extraction on localized objects to complete structured entity parsing; and 3) Multi-level Interpreting, which constructs a reasoning chain from local semantics to global logic. A pivotal advantage of this framework is its evidence anchoring mechanism, which enforces a strict alignment between high-level semantic descriptions and low-level facts. This enables "evidence-based" logical induction, transforming unstructured signals into standardized knowledge that is locatable, enumerable, and traceable. Building on this foundation, we constructed a standardized dataset and released the Logics-Parsing-Omni model, which successfully converts complex audio-visual signals into machine-readable structured knowledge. Experiments demonstrate that fine-grained perception and high-level cognition are synergistic, effectively enhancing model reliability. Furthermore, to quantitatively evaluate these capabilities, we introduce OmniParsingBench. Code, models and the benchmark are released at this https URL.
[AI-21] GNNs for Time Series Anomaly Detection: An Open-Source Framework and a Critical Evaluation
【Quick Read】: This paper addresses the lack of a standardized evaluation framework for time series anomaly detection (TSAD), particularly for Graph Neural Network (GNN)-based methods, where inconsistent metric design, questionable thresholding, and hard-to-interpret results persist. The key to the solution is an open-source, reproducible TSAD framework supporting systematic comparison across datasets, graph structures, and evaluation strategies; it improves the comparability of model performance and enables interpretability analysis of detection results. Experiments further show that attention-based GNNs are more robust when the graph structure is uncertain or inferred, providing both tooling and methodology for building effective, trustworthy graph-based TSAD systems.
Link: https://arxiv.org/abs/2603.09675
Authors: Federico Bello, Gonzalo Chiarlone, Marcelo Fiori, Gastón García González, Federico Larroca
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:There is growing interest in applying graph-based methods to Time Series Anomaly Detection (TSAD), particularly Graph Neural Networks (GNNs), as they naturally model dependencies among multivariate signals. GNNs are typically used as backbones in score-based TSAD pipelines, where anomalies are identified through reconstruction or prediction errors followed by thresholding. However, and despite promising results, the field still lacks standardized frameworks for evaluation and suffers from persistent issues with metric design and interpretation. We thus present an open-source framework for TSAD using GNNs, designed to support reproducible experimentation across datasets, graph structures, and evaluation strategies. Built with flexibility and extensibility in mind, the framework facilitates systematic comparisons between TSAD models and enables in-depth analysis of performance and interpretability. Using this tool, we evaluate several GNN-based architectures alongside baseline models across two real-world datasets with contrasting structural characteristics. Our results show that GNNs not only improve detection performance but also offer significant gains in interpretability, an especially valuable feature for practical diagnosis. We also find that attention-based GNNs offer robustness when graph structure is uncertain or inferred. In addition, we reflect on common evaluation practices in TSAD, showing how certain metrics and thresholding strategies can obscure meaningful comparisons. Overall, this work contributes both practical tools and critical insights to advance the development and evaluation of graph-based TSAD systems.
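下面给出一个最小的纯Python示意,演示摘要中所述“重构误差 + 阈值判定”的score-based TSAD流程;其中 mean + k·σ 的阈值形式与各参数均为举例假设,并非该论文框架的实际实现:

```python
import math

def anomaly_scores(actual, reconstructed):
    """逐时间步异常分数:在传感器维度上求和的平方重构误差。"""
    return [sum((a - r) ** 2 for a, r in zip(x, y))
            for x, y in zip(actual, reconstructed)]

def threshold_from_validation(val_scores, k=3.0):
    """在无异常的验证数据上拟合经典的 mean + k*std 阈值(举例假设)。"""
    mean = sum(val_scores) / len(val_scores)
    var = sum((s - mean) ** 2 for s in val_scores) / len(val_scores)
    return mean + k * math.sqrt(var)

def detect(scores, tau):
    """分数超过阈值即标记为异常。"""
    return [s > tau for s in scores]
```

实际系统中,重构分数通常来自GNN骨干网络的预测或重构输出;此处仅示意下游的打分与阈值化环节。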
[AI-22] MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM -Powered Assistants
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在生成交互式应用(称为MiniApps)时缺乏有效评估基准的问题。现有评测主要关注算法正确性或静态界面重建,无法衡量模型生成符合现实原则的动态交互逻辑的能力。解决方案的关键在于提出MiniAppBench——首个面向原则驱动型交互应用生成的综合性基准,涵盖6个领域共500个任务;并设计MiniAppEval,一个基于浏览器自动化的代理评估框架,通过类人探索性测试从意图一致性(Intention)、静态结构(Static)和动态行为(Dynamic)三个维度系统评估应用质量,从而为LLM在交互式应用生成领域的研究提供可靠、可量化的评估标准。
链接: https://arxiv.org/abs/2603.09652
作者: Zuhao Zhang,Chengyue Yu,Yuante Li,Chenyi Zhuang,Linjian Mo,Shuai Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:With the rapid advancement of Large Language Models (LLMs) in code generation, human-AI interaction is evolving from static text responses to dynamic, interactive HTML-based applications, which we term MiniApps. These applications require models to not only render visual interfaces but also construct customized interaction logic that adheres to real-world principles. However, existing benchmarks primarily focus on algorithmic correctness or static layout reconstruction, failing to capture the capabilities required for this new paradigm. To address this gap, we introduce MiniAppBench, the first comprehensive benchmark designed to evaluate principle-driven, interactive application generation. Sourced from a real-world application with 10M+ generations, MiniAppBench distills 500 tasks across six domains (e.g., Games, Science, and Tools). Furthermore, to tackle the challenge of evaluating open-ended interactions where no single ground truth exists, we propose MiniAppEval, an agentic evaluation framework. Leveraging browser automation, it performs human-like exploratory testing to systematically assess applications across three dimensions: Intention, Static, and Dynamic. Our experiments reveal that current LLMs still face significant challenges in generating high-quality MiniApps, while MiniAppEval demonstrates high alignment with human judgment, establishing a reliable standard for future research. Our code is available in this http URL.
[AI-23] MM-tau-p2: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings
【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)驱动的智能体评估框架在多模态场景下的局限性问题,尤其是现有基准测试大多聚焦于文本对话驱动的代理(agent),忽视了用户个性(persona)对代理行为的影响,且未考虑实时语音合成(TTS)与多模态语言模型发展背景下代理的多模态能力演化。其解决方案的关键在于提出MM-tau-p²基准测试体系,引入12项新颖指标,用于在双控制(dual control)设置下评估多模态代理在是否适配用户个性(persona adaptation)以及是否纳入用户输入参与规划过程时的鲁棒性表现;同时通过LLM-as-judge方法,在电信和零售领域对这些指标进行量化估计,从而为多模态代理提供自动化、系统化的评估路径。
链接: https://arxiv.org/abs/2603.09643
作者: Anupam Purwar,Aditya Choudhary
机构: 未知
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI)
备注: A benchmark for evaluating multimodal (both voice and text) LLM agents in dual-control settings. We introduce persona-adaptive prompting and 12 new metrics to assess robustness, safety, efficiency, and recovery in customer support scenarios
Abstract:Current evaluation frameworks and benchmarks for LLM-powered agents focus on text-chat-driven agents; these frameworks do not expose the user's persona to the agent, thus operating in a user-agnostic environment. Importantly, in the customer experience management domain, the agent's behaviour evolves as the agent learns about the user's personality. With the proliferation of real-time TTS and multi-modal language models, LLM-based agents are gradually going to become multi-modal. Towards this, we propose the MM-tau-p^2 benchmark with metrics for evaluating the robustness of multi-modal agents in a dual-control setting, with and without persona adaptation to the user, while also taking user inputs into the planning process to resolve a user query. In particular, our work shows that even with state-of-the-art frontier LLMs like GPT-5 and GPT-4.1, there are additional considerations, measured using metrics such as multi-modal robustness and turn overhead, when introducing multi-modality into LLM-based agents. Overall, MM-tau-p^2 builds on our prior work FOCAL and provides a holistic way of evaluating multi-modal agents in an automated manner by introducing 12 novel metrics. We also provide estimates of these metrics on the telecom and retail domains using the LLM-as-judge approach with carefully crafted prompts and well-defined rubrics for evaluating each conversation.
[AI-24] Routing without Forgetting
【速读】:该论文旨在解决在线持续学习(Online Continual Learning, OCL)中Transformer模型的适应性问题,即在数据以非平稳流形式到达且每条样本仅能被观测一次的场景下,如何实现高效、无遗忘的模型更新。现有基于提示(prompt)、适配器(adapter)或LoRA模块的方法依赖于渐进式梯度优化,在OCL设置下表现不佳。其解决方案的关键在于将持续学习重构为一个基于能量的关联检索路由问题:通过引入受现代霍普菲尔德网络启发的能量基关联检索层,RwF(Routing without Forgetting)在每个前向传播中直接根据输入条件动态生成提示,无需显式任务标识或重复优化。该机制基于严格凸自由能函数的闭式最小化,实现了无需迭代梯度调整的输入驱动路由,从而在类增量基准测试中显著优于现有提示方法,尤其在少样本场景下表现出更强鲁棒性。
链接: https://arxiv.org/abs/2603.09576
作者: Alessio Masano,Giovanni Bellitto,Dipam Goswani,Joost Van de Weijer,Concetto Spampinato
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Continual learning in transformers is commonly addressed through parameter-efficient adaptation: prompts, adapters, or LoRA modules are specialized per task while the backbone remains frozen. Although effective in controlled multi-epoch settings, these approaches rely on gradual gradient-based specialization and struggle in Online Continual Learning (OCL), where data arrive as a non-stationary stream and each sample may be observed only once. We recast continual learning in transformers as a routing problem: under strict online constraints, the model must dynamically select the appropriate representational subspace for each input without explicit task identifiers or repeated optimization. We thus introduce Routing without Forgetting (RwF), a transformer architecture augmented with energy-based associative retrieval layers inspired by Modern Hopfield Networks. Instead of storing or merging task-specific prompts, RwF generates dynamic prompts through single-step associative retrieval over the transformer token embeddings at each layer. Retrieval corresponds to the closed-form minimization of a strictly convex free-energy functional, enabling input-conditioned routing within each forward pass, independently of iterative gradient refinement. Across challenging class-incremental benchmarks, RwF improves over existing prompt-based methods. On Split-ImageNet-R and Split-ImageNet-S, RwF outperforms prior prompt-based approaches by a large margin, even in few-shot learning regimes. These results indicate that embedding energy-based associative routing directly within the transformer backbone provides a principled and effective foundation for OCL.
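摘要中“单步关联检索”的核心是现代Hopfield网络的闭式更新:对存储模式做softmax加权检索,即严格凸自由能的闭式最小值。下面是一个纯Python最小示意(β 与模式矩阵均为举例,并非RwF的实际参数化):

```python
import math

def hopfield_retrieve(query, memories, beta=2.0):
    """单步检索:按 softmax(beta * <query, memory>) 对存储模式加权求和,
    对应严格凸自由能的闭式最小化,无需迭代梯度优化。"""
    sims = [beta * sum(q * m for q, m in zip(query, mem)) for mem in memories]
    mx = max(sims)  # 数值稳定的softmax
    weights = [math.exp(s - mx) for s in sims]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(query)
    return [sum(w * mem[d] for w, mem in zip(weights, memories))
            for d in range(dim)]
```

β 越大,检索越接近“取最相似模式”;RwF中这一检索在每层对token嵌入执行,产生输入条件化的动态提示。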
[AI-25] Compiler-First State Space Duality and Portable O(1) Autoregressive Caching for Inference
【速读】:该论文旨在解决状态空间模型(State-Space Model, SSM)在部署时对特定硬件(如NVIDIA GPU)的强依赖问题,即传统实现通常耦合于CUDA或Triton自定义内核,限制了跨平台兼容性。其解决方案的关键在于利用XLA(Accelerated Linear Algebra)编译器的融合与分块优化能力,将Mamba-2的状态空间对偶算法(包含对角状态结构、可分段递归和以einsum为主导的静态控制流计算)映射为标准原语(standard primitives),从而无需编写定制内核即可实现高效执行。这一方法使模型推理路径(预填充与缓存自回归解码)可在CPU、NVIDIA GPU和Google Cloud TPU上统一运行,并实现理论上的O(1)状态管理,通过设备端缓存避免主机同步,显著提升跨平台可移植性和硬件利用率。
链接: https://arxiv.org/abs/2603.09555
作者: Cosmo Santoni
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
备注: 18 pages, 6 figures. Code available at: this https URL
Abstract:State-space model releases are typically coupled to fused CUDA and Triton kernels, inheriting a hard dependency on NVIDIA hardware. We show that Mamba-2’s state space duality algorithm – diagonal state structure, chunkable recurrence, and einsum-dominated compute with static control flow – maps cleanly onto what XLA’s fusion and tiling passes actually optimise, making custom kernels optional rather than required. We implement the full inference path (prefill, cached autoregressive decoding) as shaped standard primitives under XLA, without hand-written kernels, and realise the architecture’s theoretical O(1) state management as a compiled on-device cache requiring no host synchronisation during generation. The implementation runs unmodified on CPU, NVIDIA GPU, and Google Cloud TPU from a single JAX source. On TPU v6e across five model scales (130M–2.7B parameters), XLA-generated code reaches approximately 140 TFLOPS on single-stream prefill (~15% MFU) and up to 64% bandwidth utilisation on decode. Greedy decoding matches the PyTorch/CUDA reference token-for-token across 64 steps, with hidden-state agreement within float32 rounding tolerance. The pattern transfers to any SSM recurrence satisfying the same structural conditions, on any platform with a mature XLA backend. The implementation is publicly available at this https URL and merged into the Bonsai JAX model library.
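摘要中“O(1)状态管理”的含义可以用对角SSM的自回归解码来说明:解码每步只需携带一个固定大小的隐状态,而非不断增长的KV缓存。以下为单头、标量输入的纯Python极简示意(真实的Mamba-2使用分块einsum与输入依赖的参数,此处仅演示缓存结构):

```python
def ssm_decode_step(state, x, a, b, c):
    """对角SSM的单步缓存解码:
    h[t] = a ⊙ h[t-1] + b * x[t],  y[t] = <c, h[t]>。
    跨步之间携带的只有固定大小的 state,即 O(1) 缓存。"""
    new_state = [ai * hi + bi * x for ai, bi, hi in zip(a, b, state)]
    y = sum(ci * hi for ci, hi in zip(c, new_state))
    return new_state, y

def decode(xs, a, b, c):
    """自回归解码整条序列,状态在步间复用。"""
    state = [0.0] * len(a)
    ys = []
    for x in xs:
        state, y = ssm_decode_step(state, x, a, b, c)
        ys.append(y)
    return ys
```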
[AI-26] Efficiently Aligning Draft Models via Parameter- and Data-Efficient Adaptation
【速读】:该论文旨在解决生成式 AI(Generative AI)中,推测解码(speculative decoding)在目标大语言模型(Large Language Models, LLMs)经过特定领域微调后性能下降的问题。传统方法需为每个微调后的目标模型重新训练完整的草稿模型(draft model),成本高且效率低。其解决方案的关键在于提出一种参数与数据高效的适配框架 EDA(Efficient Draft Adaptation):首先采用解耦架构,通过共享组件和轻量私有组件分别建模通用与目标特定输出分布,仅更新私有组件实现参数高效适配;其次设计数据再生策略,利用微调后的目标模型生成训练数据以提升训练与推测解码的一致性,从而提高平均接受长度;最后引入样本选择机制,优先选取高价值样本进行训练,进一步降低适配成本。
链接: https://arxiv.org/abs/2603.09527
作者: Luxi Lin,Zhihang Lin,Zhanpeng Zeng,Yuhao Chen,Qingyu Zhang,Jixiang Luo,Xuelong Li,Rongrong Ji
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages
Abstract:Speculative decoding accelerates LLM inference but suffers from performance degradation when target models are fine-tuned for specific domains. A naive solution is to retrain draft models for every target model, which is costly and inefficient. To address this, we introduce a parameter- and data-efficient framework named Efficient Draft Adaptation, abbreviated as EDA, for efficiently adapting draft models. EDA introduces three innovations: (1) a decoupled architecture that utilizes shared and private components to model the shared and target-specific output distributions separately, enabling parameter-efficient adaptation by updating only the lightweight private component; (2) a data regeneration strategy that utilizes the fine-tuned target model to regenerate training data, thereby improving the alignment between training and speculative decoding, leading to higher average acceptance length; (3) a sample selection mechanism that prioritizes high-value data for efficient adaptation. Our experiments show that EDA effectively restores speculative performance on fine-tuned models, achieving superior average acceptance lengths with significantly reduced training costs compared to full retraining. Code is available at this https URL.
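摘要中反复出现的指标“平均接受长度”,可以用标准推测解码的接受规则(以 min(1, p_target/p_draft) 的概率接受草稿token,首次拒绝即停止)做一个纯Python示意;概率数值为举例:

```python
import random

def speculative_step(draft_probs, target_probs, rng):
    """对一个草稿块逐token验证:以 min(1, p_target/p_draft) 的概率接受,
    首次拒绝即停止。返回本步接受的token数。"""
    accepted = 0
    for q, p in zip(draft_probs, target_probs):
        if rng.random() < min(1.0, p / q):
            accepted += 1
        else:
            break
    return accepted

def average_acceptance_length(blocks, seed=0):
    """平均接受长度:EDA的数据再生与样本选择所优化的核心指标。
    blocks 为 (draft_probs, target_probs) 对的列表。"""
    rng = random.Random(seed)
    total = sum(speculative_step(q, p, rng) for q, p in blocks)
    return total / len(blocks)
```

草稿模型与目标模型分布越对齐(p/q 越接近或超过1),平均接受长度越高,推理加速越明显。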
[AI-27] Temporal-Conditioned Normalizing Flows for Multivariate Time Series Anomaly Detection
【速读】:该论文旨在解决时间序列数据中异常检测的问题,特别是如何准确建模时间依赖性和不确定性以提升检测精度。其解决方案的关键在于提出了一种时序条件归一化流(temporal-conditioned normalizing flows, tcNF)框架,通过将归一化流模型条件化于历史观测值,有效捕捉复杂的时间动态,并生成预期行为的概率分布,从而实现基于低概率事件识别的鲁棒异常检测。
链接: https://arxiv.org/abs/2603.09490
作者: David Baumgartner,Helge Langseth,Kenth Engø-Monsen,Heri Ramampiaro
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper introduces temporal-conditioned normalizing flows (tcNF), a novel framework that addresses anomaly detection in time series data with accurate modeling of temporal dependencies and uncertainty. By conditioning normalizing flows on previous observations, tcNF effectively captures complex temporal dynamics and generates accurate probability distributions of expected behavior. This autoregressive approach enables robust anomaly detection by identifying low-probability events within the learned distribution. We evaluate tcNF on diverse datasets, demonstrating good accuracy and robustness compared to existing methods. A comprehensive analysis of strengths and limitations and open-source code is provided to facilitate reproducibility and future research.
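tcNF的核心思路——将归一化流条件化于历史观测,并用变量替换公式算出条件对数概率,低概率即异常——可以用一个一维、单层仿射流的纯Python示意说明;其中 mu 取前一观测的线性函数、各参数均为举例假设:

```python
import math

def conditional_log_prob(x, prev, loc_w=0.9, log_scale=-1.0):
    """log p(x_t | x_{t-1}),单层仿射流:
    z = (x - mu) * exp(-log_scale),基分布为标准正态 N(0, 1)。
    mu 取前一观测的线性函数(举例假设)。"""
    mu = loc_w * prev
    z = (x - mu) * math.exp(-log_scale)
    log_base = -0.5 * (z * z + math.log(2 * math.pi))
    # 变量替换:+ log |dz/dx| = -log_scale
    return log_base - log_scale

def is_anomaly(x, prev, threshold=-4.0):
    """学习到的分布下对数概率过低即判为异常。"""
    return conditional_log_prob(x, prev) < threshold
```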
[AI-28] Vibe-Creation: The Epistemology of Human-AI Emergent Cognition
【速读】:该论文试图解决的问题是:当前对人类推理与生成式人工智能(Generative AI)之间互动关系的理解,无法被传统的工具使用、增强或协作伙伴关系等隐喻充分描述。为应对这一问题,作者提出了一种新的认知-认识论结构——“第三实体”(Third Entity),其本质是一种由两种本体论上不可通约的认知模式通过转导耦合(transductive coupling)而产生的临时性涌现结构。解决方案的关键在于引入“氛围创造”(vibe-creation)概念,用以指称第三实体在高维语义空间中导航的前反思认知模式,并论证这种模式实质上是对默会知识(tacit knowledge)的自动化,从而对认识论、心智哲学及教育理论产生深远影响;同时,通过“不对称涌现”(asymmetric emergence)概念阐明第三实体的能动性——既具有真正新颖且不可还原的特性,又根植于人类意图的责任基础之上。
链接: https://arxiv.org/abs/2603.09486
作者: Ilya Levin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages, 1 figure
Abstract:The encounter between human reasoning and generative artificial intelligence (GenAI) cannot be adequately described by inherited metaphors of tool use, augmentation, or collaborative partnership. This article argues that such interactions produce a qualitatively distinct cognitive-epistemic formation, designated here as the Third Entity: an emergent, transient structure that arises from the transductive coupling of two ontologically incommensurable modes of cognition. Drawing on Peirce's semiotics, Polanyi's theory of tacit knowledge, Simondon's philosophy of individuation, Ihde's postphenomenology, and Morin's complexity theory, we develop a multi-layered theoretical account of this formation. We introduce the concept of vibe-creation to designate the pre-reflective cognitive mode through which the Third Entity navigates high-dimensional semantic space and argue that this mode constitutes the automation of tacit knowledge - a development with far-reaching consequences for epistemology, the philosophy of mind, and educational theory. We further propose the notion of asymmetric emergence to characterize the agency of the Third Entity: genuinely novel and irreducible, yet anchored in human intentional responsibility. The article concludes by examining the implications of this theoretical framework for the transformation of educational institutions and the redefinition of intellectual competence in the age of GenAI.
[AI-29] GenePlan: Evolving Better Generalized PDDL Plans using Large Language Models ICAPS2026
【速读】:该论文旨在解决经典规划任务中通用规划器(generalized planner)的自动化生成问题,即如何从领域描述语言(PDDL)中自动构建能够泛化到未见问题实例的高效规划策略。其解决方案的关键在于提出了一种名为GenePlan的新框架,该框架将通用规划建模为优化问题,结合大语言模型(LLM)辅助的进化算法,迭代演化出可解释的Python规划器,以最小化在多样本问题实例上的计划长度。通过此机制,GenePlan不仅在多个基准域上实现了接近当前最优规划器的性能(平均SAT得分为0.91),还显著优于基于链式思维(CoT)提示的LLM基线方法(平均SAT得分0.64),同时具备快速求解新实例(平均0.49秒/任务)和低计算成本(平均1.82美元/域)的优势。
链接: https://arxiv.org/abs/2603.09481
作者: Andrew Murray,Danial Dervovic,Alberto Pozanco,Michael Cashmore
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 54 pages, 4 figures. Accepted to ICAPS 2026
Abstract:We present GenePlan (GENeralized Evolutionary Planner), a novel framework that leverages large language model (LLM) assisted evolutionary algorithms to generate domain-dependent generalized planners for classical planning tasks described in PDDL. By casting generalized planning as an optimization problem, GenePlan iteratively evolves interpretable Python planners that minimize plan length across diverse problem instances. In empirical evaluation across six existing benchmark domains and two new domains, GenePlan achieved an average SAT score of 0.91, closely matching the performance of the state-of-the-art planners (SAT score 0.93), and significantly outperforming other LLM-based baselines such as chain-of-thought (CoT) prompting (average SAT score 0.64). The generated planners solve new instances rapidly (average 0.49 seconds per task) and at low cost (average $1.82 per domain using GPT-4o).
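GenePlan将通用规划建模为“演化候选规划器、最小化平均计划长度”的优化问题。下面用纯Python给出演化主循环的骨架示意:评分、保留较优一半、用变异算子补齐种群(论文中变异由LLM完成,此处以调用方提供的 mutate 桩函数代替;选择与变异策略均为示意假设):

```python
import random

def evolve_planners(population, instances, mutate, fitness, generations=10,
                    rng=None):
    """演化循环骨架:fitness 越小越好(如实例集上的平均计划长度)。
    每代保留较优一半,其余由变异(论文中为LLM改写)生成。"""
    rng = rng or random.Random(0)
    for _ in range(generations):
        scored = sorted(population, key=lambda p: fitness(p, instances))
        survivors = scored[: max(1, len(scored) // 2)]
        population = survivors + [mutate(rng.choice(survivors), rng)
                                  for _ in range(len(scored) - len(survivors))]
    return min(population, key=lambda p: fitness(p, instances))
```

实际使用时,候选个体是可解释的Python规划器源码,fitness 在多样的PDDL问题实例上执行并统计计划长度与可解性。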
[AI-30] ogenesis: Goal Is All U Need
【速读】:该论文试图解决的问题是:在目标导向系统中,注意力分配机制是否能够从智能体的内部认知状态中内生地产生,而非依赖外部提供的目标。解决方案的关键在于提出了一种优先级函数(priority function),该函数基于三种认知差距(epistemic gaps)生成观察目标:无知性(后验方差,ignorance)、意外性(预测误差,surprise)和陈旧性(未观测变量置信度的时间衰减,staleness)。实验验证表明,仅凭这三种内生的认知差距即可生成适应性强的注意力优先级,在两个不同环境中均优于固定策略,并能自发恢复环境中的隐含波动结构,无需外部奖励信号。
链接: https://arxiv.org/abs/2603.09476
作者: Zhuoran Deng,Yizhi Zhang,Ziyi Zhang,Wan Shen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures, submitted to ALIFE 2026
Abstract:Goal-conditioned systems assume goals are provided externally. We ask whether attentional priorities can emerge endogenously from an agent’s internal cognitive state. We propose a priority function that generates observation targets from three epistemic gaps: ignorance (posterior variance), surprise (prediction error), and staleness (temporal decay of confidence in unobserved variables). We validate this in two systems: a minimal attention-allocation environment (2,000 runs) and a modular, partially observable world (500 runs). Ablation shows each component is necessary. A key finding is metric-dependent reversal: under global prediction error, coverage-based rotation wins; under change detection latency, priority-guided allocation wins, with advantage growing monotonically with dimensionality (d = -0.95 at N=48, p < 10^-6). Detection latency follows a power law in attention budget, with a steeper exponent for priority-guided allocation (0.55 vs. 0.40). When the decay rate is made learnable per variable, the system spontaneously recovers environmental volatility structure without supervision (t = 22.5, p < 10^-6). We demonstrate that epistemic gaps alone, without external reward, suffice to generate adaptive priorities that outperform fixed strategies and recover latent environmental structure.
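摘要所述“由三种认知差距生成观察目标”的优先级函数,可用如下纯Python示意;线性加权与指数衰减的具体形式及权重均为举例假设:

```python
import math

def priority(posterior_var, prediction_error, time_since_obs,
             w_ignorance=1.0, w_surprise=1.0, w_stale=1.0, decay=0.1):
    """观察某变量的认知优先级,组合论文中的三种差距:
    ignorance(后验方差)、surprise(预测误差)、
    staleness(自上次观测以来的置信度衰减)。组合形式为示意假设。"""
    staleness = 1.0 - math.exp(-decay * time_since_obs)
    return (w_ignorance * posterior_var
            + w_surprise * abs(prediction_error)
            + w_stale * staleness)

def next_target(state):
    """选取优先级最高的变量作为下一个观察目标。
    state: {变量名: (posterior_var, prediction_error, time_since_obs)}"""
    return max(state, key=lambda name: priority(*state[name]))
```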
[AI-31] An Empirical Study and Theoretical Explanation on Task-Level Model-Merging Collapse
【速读】:该论文旨在解决模型合并(model merging)过程中出现的“合并坍塌”(merging collapse)问题,即在将针对不同任务微调的大型语言模型(LLM)进行合并时,某些任务组合会导致性能显著下降,且这种现象在所有合并方法中均一致发生。解决方案的关键在于识别并量化任务层面的表征不兼容性(representational incompatibility),而非传统认为的参数空间冲突(parameter-space conflict)。研究通过大量实验和统计分析发现,表征不兼容性与合并坍塌高度相关,而参数空间冲突指标则无明显关联;进一步基于率失真理论(rate-distortion theory)提出维度依赖的理论边界,揭示了无论采用何种合并方法,任务间可合并性的根本限制。
链接: https://arxiv.org/abs/2603.09463
作者: Yuan Cao,Dezhi Ran,Yuzhe Guo,Mengzhou Wu,Simin Chen,Linyi Li,Wei Yang,Tao Xie
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Model merging unifies independently fine-tuned LLMs from the same base, enabling reuse and integration of parallel development efforts without retraining. However, in practice we observe that merging does not always succeed: certain combinations of task-specialist models suffer from catastrophic performance degradation after merging. We refer to this failure mode as merging collapse. Intuitively, collapse arises when the learned representations or parameter adjustments for different tasks are fundamentally incompatible, so that merging forces destructive interference rather than synergy. In this paper, we identify and characterize the phenomenon of task-level merging collapse, where certain task combinations consistently trigger huge performance degradation across all merging methods. Through extensive experiments and statistical analysis, we demonstrate that representational incompatibility between tasks is strongly correlated with merging collapse, while parameter-space conflict metrics show minimal correlation, challenging conventional wisdom in model merging literature. We provide a theoretical explanation on this phenomenon through rate-distortion theory with a dimension-dependent bound, establishing fundamental limits on task mergeability regardless of methodology.
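作为背景示意,下面用纯Python演示常见的“任务向量平均”式合并,以及论文发现与坍塌相关性很弱的参数空间冲突代理指标(任务向量余弦相似度);具体合并方法与系数为举例:

```python
import math

def task_vector(finetuned, base):
    """任务向量:微调权重减去共享基座权重。"""
    return [f - b for f, b in zip(finetuned, base)]

def merge(base, finetuned_models, alpha=1.0):
    """任务算术式合并:把平均任务向量加回基座权重(举例方法)。"""
    n = len(finetuned_models)
    vectors = [task_vector(m, base) for m in finetuned_models]
    avg = [sum(v[i] for v in vectors) / n for i in range(len(base))]
    return [b + alpha * a for b, a in zip(base, avg)]

def cosine(u, v):
    """参数空间冲突的常用代理指标;论文指出它与合并坍塌相关性很小。"""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)
```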
[AI-32] Declarative Scenario-based Testing with RoadLogic
【速读】:该论文旨在解决自动驾驶系统(AV)中场景驱动测试的效率与覆盖问题,即如何从声明式场景规范(如OpenSCENARIO DSL, OS2)自动生成符合约束的可执行仿真场景。现有方法依赖于手动枚举场景变体,成本高且难以保证覆盖率;而OS2等声明式语言虽提升了抽象层次,却缺乏系统化实例化手段。其解决方案的关键在于提出RoadLogic框架:首先利用Answer Set Programming(ASP)生成满足场景逻辑约束的抽象行为计划,再通过运动规划将计划细化为可行轨迹,并结合基于规范的监控机制验证结果正确性。该方法可在CommonRoad仿真环境中快速生成多样化、真实且合规的场景实例,从而实现高效、系统的自动驾驶测试。
链接: https://arxiv.org/abs/2603.09455
作者: Ezio Bartocci,Alessio Gambi,Felix Gigler,Cristinel Mateis,Dejan Ničković
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: Accepted at the 29th ACM International Conference on Hybrid Systems: Computation and Control (HSCC 2026). The final version will appear in the ACM Digital Library
Abstract:Scenario-based testing is a key method for cost-effective and safe validation of autonomous vehicles (AVs). Existing approaches rely on imperative scenario definitions, requiring developers to manually enumerate numerous variants to achieve coverage. Declarative languages, such as OpenSCENARIO DSL (OS2), raise the abstraction level but lack systematic methods for instantiating concrete, specification-compliant scenarios as simulations. To our knowledge, currently, no open-source solution provides this capability. We present RoadLogic that bridges declarative OS2 specifications and executable simulations. It uses Answer Set Programming to generate abstract plans satisfying scenario constraints, motion planning to refine the plans into feasible trajectories, and specification-based monitoring to verify correctness. We evaluate RoadLogic on instantiating representative OS2 scenarios as simulations in the CommonRoad framework. Results show that RoadLogic consistently produces realistic, specification-satisfying simulations within minutes and captures diverse behavioral variants through parameter sampling, thus opening the door to systematic scenario-based testing for autonomous driving systems.
[AI-33] Variational Routing: A Scalable Bayesian Framework for Calibrated Mixture-of-Experts Transformers ICML2026
【速读】:该论文旨在解决大规模基础模型在部署过程中缺乏可靠不确定性量化(uncertainty quantification)的问题,尤其是在需要对输出置信度进行准确评估的场景中。传统贝叶斯方法虽能提供理论完备的不确定性估计,但其计算开销过大,难以应用于万亿参数级别的模型训练或推理。解决方案的关键在于提出一种结构化的贝叶斯方法——变分专家混合路由(Variational Mixture-of-Experts Routing, VMoER),将贝叶斯推断限制在Mixture-of-Experts(MoE)层中的专家选择阶段,而非整个模型参数空间。VMoER通过两种推断策略实现高效不确定性建模:一是对路由logits进行摊销变分推断,二是引入温度参数以实现随机专家选择。实验表明,VMoER在保持不到1%额外浮点运算次数(FLOPs)的前提下,显著提升了路由稳定性(噪声下提升38%)、校准误差降低94%,并提高分布外检测的AUROC指标12%,从而为构建可扩展、鲁棒且具备不确定性感知能力的基础模型提供了可行路径。
链接: https://arxiv.org/abs/2603.09453
作者: Albus Yizhuo Li,Matthew Wicker
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 8 pages, 7 figures for main text; 16 pages for Appendix; In submission to ICML 2026;
Abstract:Foundation models are increasingly being deployed in contexts where understanding the uncertainty of their outputs is critical to ensuring responsible deployment. While Bayesian methods offer a principled approach to uncertainty quantification, their computational overhead renders their use impractical for training or inference at foundation model scale. State-of-the-art models achieve parameter counts in the trillions through carefully engineered sparsity including Mixture-of-Experts (MoE) layers. In this work, we demonstrate calibrated uncertainty at scale by introducing Variational Mixture-of-Experts Routing (VMoER), a structured Bayesian approach for modelling uncertainty in MoE layers. VMoER confines Bayesian inference to the expert-selection stage which is typically done by a deterministic routing network. We instantiate VMoER using two inference strategies: amortised variational inference over routing logits and inferring a temperature parameter for stochastic expert selection. Across tested foundation models, VMoER improves routing stability under noise by 38%, reduces calibration error by 94%, and increases out-of-distribution AUROC by 12%, while incurring less than 1% additional FLOPs. These results suggest VMoER offers a scalable path toward robust and uncertainty-aware foundation models.
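VMoER的第二种推断策略——由温度参数控制的随机专家选择——可以用带Gumbel噪声扰动的top-k路由做纯Python示意(噪声形式与top-k选择为常见做法,并非论文的确切公式;温度趋于0时退化为确定性路由):

```python
import math
import random

def stochastic_topk_route(logits, k, temperature, rng):
    """温度控制的随机专家选择:对路由logits加上按温度缩放的
    Gumbel噪声后取top-k。temperature=0 时恢复确定性路由。"""
    perturbed = []
    for i, l in enumerate(logits):
        u = rng.random()
        gumbel = -math.log(-math.log(max(u, 1e-12)))
        perturbed.append((l + temperature * gumbel, i))
    perturbed.sort(reverse=True)
    return sorted(i for _, i in perturbed[:k])
```

对同一输入多次采样路由,即可得到专家选择的不确定性估计,用于校准与分布外检测。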
[AI-34] AI Act Evaluation Benchmark: An Open Transparent and Reproducible Evaluation Dataset for NLP and RAG Systems
【速读】:该论文旨在解决当前在公共和社会领域部署人工智能(AI)系统时,缺乏高效、自动化手段来评估其对欧盟《人工智能法案》(EU AI Act)合规性的问题。现有方法受限于资源匮乏,导致合规性评估仍依赖人工操作,存在效率低、易出错且难以覆盖法规中模糊或未明确定义的风险等级(如“有限”和“最低”风险)等缺陷。解决方案的关键在于提出一种开放、透明且可复现的方法,构建了一个面向自然语言处理(NLP)模型的结构化数据集,特别聚焦于检索增强生成(RAG)系统。该数据集包含风险等级分类、条款检索、义务生成和问答四项任务,并以机器可读格式存储;其核心创新在于结合领域知识与大语言模型(LLM)的推理能力,生成高文档相关性的合规场景及其对应任务,从而有效克服法规中非显式定义的风险边界难题。实证表明,基于该数据集评估的RAG方案在禁止类和高风险场景下分别达到0.87和0.85的F1分数,验证了方法的有效性。
链接: https://arxiv.org/abs/2603.09435
作者: Athanasios Davvetas,Michael Papademas,Xenia Ziouvelou,Vangelis Karkaletsis
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 1 figure, 4 tables, 2 equations
Abstract:The rapid rollout of AI in heterogeneous public and societal sectors has subsequently escalated the need for compliance with regulatory standards and frameworks. The EU AI Act has emerged as a landmark in the regulatory landscape. The development of solutions that elicit the level of AI systems’ compliance with such standards is often limited by the lack of resources, hindering the semi-automated or automated evaluation of their performance. This generates the need for manual work, which is often error-prone, resource-limited or limited to cases not clearly described by the regulation. This paper presents an open, transparent, and reproducible method of creating a resource that facilitates the evaluation of NLP models with a strong focus on RAG systems. We have developed a dataset that contains the tasks of risk-level classification, article retrieval, obligation generation, and question-answering for the EU AI Act. The dataset files are in a machine-readable format. To generate the files, we utilise domain knowledge as an exegetical basis, combining it with the processing and reasoning power of large language models to generate scenarios along with the respective tasks. Our methodology demonstrates a way to harness language models for grounded generation with high document relevancy. Moreover, we overcome limitations such as navigating the decision boundaries of risk levels that are not explicitly defined within the EU AI Act, such as the limited and minimal cases. Finally, we demonstrate our dataset’s effectiveness by evaluating a RAG-based solution that reaches 0.87 and 0.85 F1-score for prohibited and high-risk scenarios.
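摘要以F1-score报告RAG方案在禁止类与高风险场景上的表现。作为参考,下面给出二分类F1的纯Python实现(标签编码为0/1,评测数据为举例):

```python
def f1_score(predicted, gold):
    """二分类F1:precision 与 recall 的调和平均。
    predicted/gold 为0/1标签序列(如“是否属于某风险等级”)。"""
    tp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(predicted, gold) if p == 0 and g == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```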
[AI-35] From Flow to One Step: Real-Time Multi-Modal Trajectory Policies via Implicit Maximum Likelihood Estimation-based Distribution Distillation
【速读】:该论文旨在解决生成式策略(Generative Policies)在机器人操作中因依赖迭代常微分方程(ODE)求解导致的高延迟问题,以及现有单步加速方法易出现分布坍缩(distributional collapse)从而丧失多模态动作分布表达能力的问题。解决方案的关键在于通过隐式最大似然估计(Implicit Maximum Likelihood Estimation, IMLE)将条件流匹配(Conditional Flow Matching, CFM)专家模型蒸馏为一个快速单步学生模型,并引入双向Chamfer距离作为集合级目标函数,以同时保障模式覆盖(mode coverage)和保真度(fidelity),从而在单次前向传播中保留教师模型的多模态动作分布特性;此外,统一的感知编码器融合多视角RGB、深度图、点云与本体感知信息,构建几何感知表示,支持高频闭环控制与动态扰动下的鲁棒重规划。
链接: https://arxiv.org/abs/2603.09415
作者: Ju Dong,Liding Zhang,Lei Zhang,Yu Fu,Kaixin Bai,Zoltan-Csaba Marton,Zhenshan Bing,Zhaopeng Chen,Alois Christian Knoll,Jianwei Zhang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: this https URL , 8 pages
Abstract:Generative policies based on diffusion and flow matching achieve strong performance in robotic manipulation by modeling multi-modal human demonstrations. However, their reliance on iterative Ordinary Differential Equation (ODE) integration introduces substantial latency, limiting high-frequency closed-loop control. Recent single-step acceleration methods alleviate this overhead but often exhibit distributional collapse, producing averaged trajectories that fail to execute coherent manipulation strategies. We propose a framework that distills a Conditional Flow Matching (CFM) expert into a fast single-step student via Implicit Maximum Likelihood Estimation (IMLE). A bi-directional Chamfer distance provides a set-level objective that promotes both mode coverage and fidelity, enabling preservation of the teacher's multi-modal action distribution in a single forward pass. A unified perception encoder further integrates multi-view RGB, depth, point clouds, and proprioception into a geometry-aware representation. The resulting high-frequency control supports real-time receding-horizon re-planning and improved robustness under dynamic disturbances.
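摘要中作为集合级蒸馏目标的双向Chamfer距离,定义是:每个样本到另一集合最近邻的(平方)距离,双向取平均后求和——前向项鼓励保真度,反向项鼓励模式覆盖。纯Python示意如下:

```python
def bidirectional_chamfer(set_a, set_b):
    """双向Chamfer距离:对每个点取另一集合中最近邻的平方欧氏距离,
    两个方向分别取均值后求和(覆盖 + 保真)。"""
    def sq(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    def one_way(xs, ys):
        return sum(min(sq(x, y) for y in ys) for x in xs) / len(xs)

    return one_way(set_a, set_b) + one_way(set_b, set_a)
```

在蒸馏中,set_a 可取学生单步采样得到的轨迹集合,set_b 取CFM教师的多模态轨迹样本。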
[AI-36] Physics-Informed Neural Engine Sound Modeling with Differentiable Pulse-Train Synthesis
【速读】:该论文旨在解决传统神经合成方法在建模发动机声音时仅关注频谱特性而忽略其物理生成机制的问题。现有方法难以准确还原由点火脉冲序列驱动的复杂声学动态,导致合成音频在谐波结构和时序细节上存在偏差。解决方案的关键在于提出一种可微分的声学合成架构——脉冲列车共振器(Pulse-Train-Resonator, PTR),该模型直接建模发动机点火引起的脉冲形状及其时间结构,并通过递归Karplus-Strong共振器模拟排气系统的声学传播过程。PTR引入了多种物理启发的归纳偏置(inductive biases),包括谐波衰减、热力学调频、气门动力学包络、排气系统共振以及节气门操作与减速断油(DCFO)等工况特征,从而实现高保真且可解释的发动机声音合成。
链接: https://arxiv.org/abs/2603.09391
作者: Robin Doerfler,Lonce Wyse
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Preprint. 5 pages, 2 figures. Audio examples, code, and model weights available online
Abstract:Engine sounds originate from sequential exhaust pressure pulses rather than sustained harmonic oscillations. While neural synthesis methods typically aim to approximate the resulting spectral characteristics, we propose directly modeling the underlying pulse shapes and temporal structure. We present the Pulse-Train-Resonator (PTR) model, a differentiable synthesis architecture that generates engine audio as parameterized pulse trains aligned to engine firing patterns and propagates them through recursive Karplus-Strong resonators simulating exhaust acoustics. The architecture integrates physics-informed inductive biases including harmonic decay, thermodynamic pitch modulation, valve-dynamics envelopes, exhaust system resonances and derived engine operating modes such as throttle operation and deceleration fuel cutoff (DCFO). Validated on three diverse engine types totaling 7.5 hours of audio, PTR achieves a 21% improvement in harmonic reconstruction and a 5.7% reduction in total loss over a harmonic-plus-noise baseline model, while providing interpretable parameters corresponding to physical phenomena. Complete code, model weights, and audio examples are openly available.
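“脉冲列经递归共振器传播”的信号路径,可用一个高度简化的纯Python示意说明:先按点火周期生成衰减脉冲,再通过 y[n] = x[n] + feedback·y[n-delay] 形式的递归梳状滤波器模拟排气共振(脉冲形状、延迟与反馈系数均为举例;PTR的实际实现是可微分、参数化的):

```python
import math

def pulse_train(num_samples, period, pulse_width=8):
    """排气压力脉冲的简化形式:每隔 period 个采样放置一个指数衰减脉冲。"""
    out = [0.0] * num_samples
    for start in range(0, num_samples, period):
        for i in range(min(pulse_width, num_samples - start)):
            out[start + i] += math.exp(-0.5 * i)
    return out

def karplus_strong(signal, delay, feedback=0.7):
    """递归共振器(简化的Karplus-Strong梳状结构):
    y[n] = x[n] + feedback * y[n - delay]。"""
    out = []
    for n, x in enumerate(signal):
        y = x
        if n >= delay:
            y += feedback * out[n - delay]
        out.append(y)
    return out
```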
[AI-37] SPAARS: Safer RL Policy Alignment through Abstract Exploration and Refined Exploitation of Action Space
【速读】:该论文旨在解决离线到在线强化学习(Offline-to-Online Reinforcement Learning, RL)中如何安全地进行在线探索的问题,即在不偏离离线数据行为支持范围的前提下实现高效、稳定的策略优化。其核心挑战在于现有方法依赖条件变分自编码器(Conditional Variational Autoencoder, CVAE)将探索限制在低维潜在空间时,会因解码器重建损失导致性能天花板(exploitation gap)。解决方案的关键在于提出SPAARS(Safer RL Policy Alignment through Abstract Exploration and Refined Exploitation of Action Space),该框架采用课程学习(curriculum learning)机制:首先在潜在空间内约束探索以实现样本高效的、安全的行为改进,随后无缝切换至原始动作空间,从而绕过解码器瓶颈。该方法通过在潜在空间中执行策略梯度并结合行为克隆(Behavioral Cloning)控制课程过渡稳定性,理论证明了潜在空间梯度可降低方差,并实验证明其显著优于基线方法,尤其在无需轨迹分割的CVAE版本中表现突出。
链接: https://arxiv.org/abs/2603.09378
作者: Swaminathan S K,Aritra Hazra
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 9 pages
Abstract:Offline-to-online reinforcement learning (RL) offers a promising paradigm for robotics by pre-training policies on safe, offline demonstrations and fine-tuning them via online interaction. However, a fundamental challenge remains: how to safely explore online without deviating from the behavioral support of the offline data? While recent methods leverage conditional variational autoencoders (CVAEs) to bound exploration within a latent space, they inherently suffer from an exploitation gap – a performance ceiling imposed by the decoder’s reconstruction loss. We introduce SPAARS, a curriculum learning framework that initially constrains exploration to the low-dimensional latent manifold for sample-efficient, safe behavioral improvement, then seamlessly transfers control to the raw action space, bypassing the decoder bottleneck. SPAARS has two instantiations: the CVAE-based variant requires only unordered (s,a) pairs and no trajectory segmentation; SPAARS-SUPE pairs SPAARS with OPAL temporal skill pretraining for stronger exploration structure at the cost of requiring trajectory chunks. We prove an upper bound on the exploitation gap using the Performance Difference Lemma, establish that latent-space policy gradients achieve provable variance reduction over raw-space exploration, and show that concurrent behavioral cloning during the latent phase directly controls curriculum transition stability. Empirically, SPAARS-SUPE achieves 0.825 normalized return on kitchen-mixed-v0 versus 0.75 for SUPE, with 5x better sample efficiency; standalone SPAARS achieves 92.7 and 102.9 normalized return on hopper-medium-v2 and walker2d-medium-v2 respectively, surpassing IQL baselines of 66.3 and 78.3 respectively, confirming the utility of the unordered-pair CVAE instantiation.
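下面用一个玩具示例说明摘要所述 exploitation gap 与两阶段课程切换的直觉:有损解码器只能表达粗粒度动作,潜在空间阶段可以安全、快速地逼近最优,随后切换到原始动作空间完成精细改进。其中 reward、decode 等函数与全部数值均为假设,仅作机制示意,并非SPAARS算法本身。

```python
import numpy as np

def reward(a):
    """Black-box objective with its optimum at a = 0.37 (hypothetical task)."""
    return -abs(a - 0.37)

def decode(z):
    """Lossy CVAE-style decoder stand-in: it can only express coarse actions,
    which is exactly the 'exploitation gap' the curriculum must bypass."""
    return float(np.round(z, 1))

# Phase 1: safe, sample-efficient search constrained to the latent space.
latent_grid = np.linspace(0.0, 1.0, 101)
best_z = max(latent_grid, key=lambda z: reward(decode(z)))
phase1_action = decode(best_z)

# Phase 2: transfer control to the raw action space, warm-started at phase 1.
a = phase1_action
for _ in range(100):
    a = max([a - 0.01, a, a + 0.01], key=reward)  # greedy local refinement
phase2_action = a
```

潜在阶段最好只能达到解码器网格上的 0.4,而原始动作空间阶段从该热启动点继续细化到 0.37,直观展示了"绕过解码器瓶颈"的收益。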
[AI-38] Democratising Clinical AI through Dataset Condensation for Classical Clinical Models
【速读】:该论文旨在解决当前数据压缩(Dataset Condensation, DC)方法在医疗场景中应用受限的问题,即现有DC方法依赖可微分神经网络,难以适配临床实践中广泛使用的非可微模型(如决策树和Cox回归),从而限制了其在医疗数据民主化中的潜力。解决方案的关键在于提出一种基于差分隐私的零阶优化框架(zero-order optimisation framework),该框架仅通过函数评估即可实现对非可微模型的优化,从而将DC扩展至模型无关的场景,同时保障合成数据的隐私安全性,实现在不暴露敏感患者信息的前提下支持多种临床预测任务的数据共享。
链接: https://arxiv.org/abs/2603.09356
作者: Anshul Thakur,Soheila Molaei,Pafue Christy Nganjimi,Joshua Fieggen,Andrew A. S. Soltan,Danielle Belgrave,Lei Clifton,David A. Clifton
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 22 pages, 5 figures, 5 tables
Abstract:Dataset condensation (DC) learns a compact synthetic dataset that enables models to match the performance of full-data training, prioritising utility over distributional fidelity. While typically explored for computational efficiency, DC also holds promise for healthcare data democratisation, especially when paired with differential privacy, allowing synthetic data to serve as a safe alternative to real records. However, existing DC methods rely on differentiable neural networks, limiting their compatibility with widely used clinical models such as decision trees and Cox regression. We address this gap using a differentially private, zero-order optimisation framework that extends DC to non-differentiable models using only function evaluations. Empirical results across six datasets, including both classification and survival tasks, show that the proposed method produces condensed datasets that preserve model utility while providing effective differential privacy guarantees - enabling model-agnostic data sharing for clinical prediction tasks without exposing sensitive patient information.
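摘要中的零阶优化仅依赖函数评估。下面给出一个SPSA风格的两点梯度估计草图,在"黑盒"(可能不可微)的效用损失下迭代更新一个小型合成数据集;损失函数与超参数均为示意性假设,且未包含论文中的差分隐私机制。

```python
import numpy as np

rng = np.random.default_rng(0)

def blackbox_loss(x):
    """Stand-in for a non-differentiable utility loss (e.g. obtained by
    retraining a decision tree on the condensed set); only function
    evaluations are ever used."""
    return float(np.mean(np.abs(x)))

def spsa_step(x, loss_fn, lr=0.3, c=0.1):
    """Simultaneous-perturbation two-point gradient estimate:
    g_i = [L(x + c*delta) - L(x - c*delta)] / (2c) * delta_i, delta_i in {-1, +1}."""
    delta = rng.choice([-1.0, 1.0], size=x.shape)
    g = (loss_fn(x + c * delta) - loss_fn(x - c * delta)) / (2 * c) * delta
    return x - lr * g

x = rng.normal(1.0, 0.2, size=(16, 4))   # a tiny "condensed dataset"
losses = [blackbox_loss(x)]
for _ in range(300):
    x = spsa_step(x, blackbox_loss)
    losses.append(blackbox_loss(x))
```

每一步只需两次损失评估即可得到一个(有噪但方向正确的)梯度估计,这正是框架能兼容决策树、Cox回归等不可微模型的原因。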
[AI-39] A-GGAD: Testing-time Adaptive Graph Model for Generalist Graph Anomaly Detection
【速读】:该论文旨在解决跨域图异常检测中因领域偏移(domain shift)导致的模型泛化能力不足的问题,尤其针对多数据域中异常节点特征差异显著时出现的“异常非同配性”(Anomaly Disassortativity, AD)现象。其解决方案的关键在于提出一种新型图基础模型(graph foundation model),通过建模AD机制,在单一训练阶段即可实现不同图结构上的跨域泛化,从而在14个真实世界图数据集上实现了突破性的跨域异常检测性能,达到当前最优水平(SOTA)。
链接: https://arxiv.org/abs/2603.09349
作者: Xiong Zhang,Hong Peng,Changlong Fu,Xin Jin,Yun Yang,Cheng Xie
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:A significant number of anomalous nodes in the real world, such as fake news, noncompliant users, malicious transactions, and malicious posts, severely compromises the health of the graph data ecosystem and urgently requires effective identification and processing. With anomalies that span multiple data domains yet exhibit vast differences in features, cross-domain detection models face severe domain shift issues, which limit their generalizability across all domains. This study identifies and quantitatively analyzes a specific feature mismatch pattern exhibited by domain shift in graph anomaly detection, which we define as the \emph{Anomaly Disassortativity} issue ($\mathcal{AD}$). Based on the modeling of the $\mathcal{AD}$ issue, we introduce a novel graph foundation model for anomaly detection. It achieves cross-domain generalization in different graphs, requiring only a single training phase to perform effectively across diverse domains. The experimental findings, based on fourteen diverse real-world graphs, confirm a breakthrough in the model’s cross-domain adaptation, achieving a pioneering state-of-the-art (SOTA) level in terms of detection accuracy. In summary, the proposed theory of $\mathcal{AD}$ provides a novel theoretical perspective and a practical route for future research in generalist graph anomaly detection (GGAD). The code is available at this https URL.
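量化"异常非同配性"的一个简单思路,是统计两端异常标签不一致的边所占比例(取值接近1表示异常节点主要连接正常节点)。以下为该思路的示意性实现;论文中AD的正式定义可能与此不同。

```python
import numpy as np

def anomaly_disassortativity(edges, labels):
    """Share of edges whose endpoints disagree on the anomaly label; values
    near 1 mean anomalous nodes attach mostly to normal neighbors."""
    edges = np.asarray(edges)
    disagree = labels[edges[:, 0]] != labels[edges[:, 1]]
    return float(disagree.mean())

labels = np.array([0, 0, 0, 0, 1, 1])            # 1 marks an anomalous node
edges = [(0, 1), (1, 2), (2, 3),                 # normal-normal links
         (4, 0), (4, 1), (5, 2), (5, 3)]         # anomaly-normal links
score = anomaly_disassortativity(edges, labels)
```

上例中7条边里有4条两端标签不同,得分为 4/7,说明异常节点几乎不互连,呈现典型的非同配结构。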
[AI-40] Robust Regularized Policy Iteration under Transition Uncertainty
【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, Offline RL)在分布偏移(distribution shift)下性能下降的问题,特别是由策略诱导的外推(policy-induced extrapolation)和转移不确定性(transition uncertainty)导致的价值估计不可靠问题。解决方案的关键在于将离线RL建模为鲁棒策略优化(robust policy optimization),将转移核(transition kernel)视为不确定集内的决策变量,并在最坏情况动态下优化策略;为此提出鲁棒正则化策略迭代(Robust Regularized Policy Iteration, RRPI),通过KL正则化代理目标替代难以求解的极大极小双层优化问题,构建基于鲁棒正则化贝尔曼算子(robust regularized Bellman operator)的高效策略迭代过程,理论证明其算子为γ-压缩映射且迭代过程单调提升原鲁棒目标并收敛,实验证明其在D4RL基准上优于主流基线方法,且在高认知不确定性区域自动降低Q值,从而避免不可靠的分布外动作。
链接: https://arxiv.org/abs/2603.09344
作者: Hongqiang Lin,Zhenghui Fu,Weihao Tang,Pengfei Wang,Yiding Sun,Qixian Huang,Dongxu Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Offline reinforcement learning (RL) enables data-efficient and safe policy learning without online exploration, but its performance often degrades under distribution shift. The learned policy may visit out-of-distribution state-action pairs where value estimates and learned dynamics are unreliable. To address policy-induced extrapolation and transition uncertainty in a unified framework, we formulate offline RL as robust policy optimization, treating the transition kernel as a decision variable within an uncertainty set and optimizing the policy against the worst-case dynamics. We propose Robust Regularized Policy Iteration (RRPI), which replaces the intractable max-min bilevel objective with a tractable KL-regularized surrogate and derives an efficient policy iteration procedure based on a robust regularized Bellman operator. We provide theoretical guarantees by showing that the proposed operator is a $\gamma$-contraction and that iteratively updating the surrogate yields monotonic improvement of the original robust objective with convergence. Experiments on D4RL benchmarks demonstrate that RRPI achieves strong average performance, outperforming recent baselines including percentile-based methods such as PMDB on the majority of environments while remaining competitive on the rest. Moreover, RRPI exhibits robust behavior. The learned $Q$-values decrease in regions with higher epistemic uncertainty, suggesting that the resulting policy avoids unreliable out-of-distribution actions under transition uncertainty.
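对KL正则化的鲁棒备份,内层最坏情况期望有经典的log-sum-exp闭式解:min_P E_P[V] + (1/β)KL(P‖P0) = -(1/β) log E_{P0}[exp(-βV)]。下面在一个两状态玩具MDP上演示这类"软最小化"备份并与名义备份对比。这只是KL正则化鲁棒算子族的一般性示意,β、γ与MDP均为假设,并非论文RRPI的完整算法。

```python
import numpy as np

def robust_backup(V, P0, R, beta=2.0, gamma=0.9):
    """KL-regularized worst-case Bellman backup. The inner problem
    min_P E_P[V] + (1/beta) * KL(P || P0) has the closed form
    -(1/beta) * log E_{P0}[exp(-beta * V)] (a soft-min over next states)."""
    soft_min = -(1.0 / beta) * np.log(P0 @ np.exp(-beta * V))
    return R + gamma * soft_min

P0 = np.array([[0.9, 0.1],      # nominal transition kernel (2-state chain)
               [0.2, 0.8]])
R = np.array([1.0, 0.0])

V_robust, V_nominal = np.zeros(2), np.zeros(2)
for _ in range(200):
    V_robust = robust_backup(V_robust, P0, R)
    V_nominal = R + 0.9 * (P0 @ V_nominal)
```

由Jensen不等式,软最小化不超过名义期望,因此鲁棒值函数逐点低于名义值函数,与摘要中"高不确定性区域价值被压低"的保守行为一致。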
[AI-41] mberAgent: Gram-Guided Retrieval for Executable Music Effect Control
【速读】:该论文试图解决数字音频工作站(Digital Audio Workstations, DAWs)中用户意图与低级信号处理参数之间的语义鸿沟问题,即如何将用户的感知性描述(如“温暖的吉他音色”)准确映射到可编辑的音频效果插件配置。其解决方案的关键在于提出Texture Resonance Retrieval (TRR),一种基于Wav2Vec2中间层特征投影后Gram矩阵构建的音频表示方法,该设计能有效保留与音色纹理相关的共激活结构;实验表明,TRR在吉他效果预设检索任务中相较于CLAP及多个内部基线方法(如Wav2Vec-RAG、Text-RAG、FeatureNN-RAG)实现了最低的归一化参数误差,并通过多刺激听觉测试提供了感知层面的补充证据,验证了纹理感知检索在可编辑音频效果控制中的有效性。
链接: https://arxiv.org/abs/2603.09332
作者: Shihao He,Yihan Xia,Fang Liu,Taotao Wang,Shengli Zhang
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Digital audio workstations expose rich effect chains, yet a semantic gap remains between perceptual user intent and low-level signal-processing parameters. We study retrieval-grounded audio effect control, where the output is an editable plugin configuration rather than a finalized waveform. Our focus is Texture Resonance Retrieval (TRR), an audio representation built from Gram matrices of projected mid-level Wav2Vec2 activations. This design preserves texture-relevant co-activation structure. We evaluate TRR on a guitar-effects benchmark with 1,063 candidate presets and 204 queries. The evaluation follows Protocol-A, a cross-validation scheme that prevents train-test leakage. We compare TRR against CLAP and internal retrieval baselines (Wav2Vec-RAG, Text-RAG, FeatureNN-RAG), using min-max normalized metrics grounded in physical DSP parameter ranges. Ablation studies validate TRR’s core design choices: projection dimensionality, layer selection, and projection type. A near-duplicate sensitivity analysis confirms that results are robust to trivial knowledge-base matches. TRR achieves the lowest normalized parameter error among evaluated methods. A multiple-stimulus listening study with 26 participants provides complementary perceptual evidence. We interpret these results as benchmark evidence that texture-aware retrieval is useful for editable audio effect control, while broader personalization and real-audio robustness claims remain outside the verified evidence presented here.
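TRR的核心是对投影后的帧级激活构造时间平均的Gram矩阵。下面用随机投影与随机"激活"演示该特征的构造及其对帧顺序的不变性(Gram统计只保留通道共激活结构,丢弃时序);维度与投影均为假设,真实系统中应替换为Wav2Vec2中间层激活与学习得到的投影。

```python
import numpy as np

rng = np.random.default_rng(0)

def texture_gram(activations, projection):
    """Project frame-level activations (T, D) down to (T, d), then form the
    time-averaged Gram matrix, which keeps channel co-activation structure
    while discarding temporal order."""
    z = activations @ projection                    # (T, d)
    gram = z.T @ z / z.shape[0]                     # (d, d)
    return gram[np.triu_indices(gram.shape[0])]     # flattened upper triangle

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

projection = rng.normal(size=(768, 32)) / np.sqrt(768)  # stand-in for a learned projection
clip = rng.normal(size=(100, 768))                      # stand-in for Wav2Vec2 frames
feature = texture_gram(clip, projection)
shuffled_feature = texture_gram(clip[rng.permutation(len(clip))], projection)
```

检索时即可用 cosine(feature, preset_feature) 在预设库中做最近邻匹配;帧顺序不变性说明该特征刻画的是"纹理/音色",而非事件时序。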
[AI-42] Curveball Steering: The Right Direction To Steer Isn't Always Linear
【速读】:该论文试图解决现有大语言模型(Large Language Model, LLM)行为控制方法中依赖全局线性干预导致效果不一致的问题。其核心假设——即行为属性可通过全局线性方向进行操控的“线性表示假设”(Linear Representation Hypothesis)——在实践中常失效,原因在于LLM激活空间存在显著且与概念相关的几何畸变。为应对这一挑战,作者提出“Curveball steering”,其关键在于利用多项式核主成分分析(Polynomial Kernel PCA)将干预操作映射到更符合模型内部激活几何结构的特征空间中,从而实现几何感知的非线性干预,相较于传统线性PCA方法在高几何畸变场景下表现更稳定、有效。
链接: https://arxiv.org/abs/2603.09313
作者: Shivam Raval,Hae Jin Song,Linlin Wu,Abir Harrasse,Jeff Phillips,Amirali Abdullah
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Activation steering is a widely used approach for controlling large language model (LLM) behavior by intervening on internal representations. Existing methods largely rely on the Linear Representation Hypothesis, assuming behavioral attributes can be manipulated using global linear directions. In practice, however, such linear interventions often behave inconsistently. We question this assumption by analyzing the intrinsic geometry of LLM activation spaces. Measuring geometric distortion via the ratio of geodesic to Euclidean distances, we observe substantial and concept-dependent distortions, indicating that activation spaces are not well-approximated by a globally linear geometry. Motivated by this, we propose “Curveball steering”, a nonlinear steering method based on polynomial kernel PCA that performs interventions in a feature space, better respecting the learned activation geometry. Curveball steering consistently outperforms linear PCA-based steering, particularly in regimes exhibiting strong geometric distortion, suggesting that geometry-aware, nonlinear steering provides a principled alternative to global, linear interventions.
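摘要中的畸变度量为测地距离与欧氏距离之比。下面在一条采样自单位半圆的弯曲一维流形上演示该比值(趋近π/2≈1.57,显著大于1即表明全局线性近似失效);真实激活云上的测地距离通常需经k-NN图最短路近似,这里为简化假设沿采样路径直接累加。

```python
import numpy as np

# Activations sampled along a curved 1-D manifold (a unit semicircle in R^2).
t = np.linspace(0.0, np.pi, 200)
points = np.stack([np.cos(t), np.sin(t)], axis=1)

def path_length(points):
    """Approximate the geodesic distance between the endpoints by summing the
    Euclidean lengths of consecutive steps along the sampled path."""
    return float(np.linalg.norm(np.diff(points, axis=0), axis=1).sum())

geodesic = path_length(points)
euclidean = float(np.linalg.norm(points[-1] - points[0]))
distortion = geodesic / euclidean   # approaches pi/2 for a semicircle
```

当该比值明显偏离1时,沿全局线性方向的干预会"抄近路"离开流形,这正是文中转向多项式核PCA做非线性干预的动机。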
[AI-43] Rescaling Confidence: What Scale Design Reveals About LLM Metacognition
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)中“口语化置信度”(verbalized confidence)设计不合理的问题,即当前广泛采用的0–100数值评分体系在实际使用中存在高度离散化现象,导致其反映的不确定性信息质量受限。研究发现,超过78%的置信度评分集中在三个整数点上,说明标准量表设计本身并非中立,而是显著影响模型元认知敏感性(metacognitive sensitivity)。解决方案的关键在于系统性地调整置信度量表的设计参数,包括粒度(granularity)、边界位置(boundary placement)和范围规则性(range regularity),其中最有效的改进是将量表从0–100改为0–20,该调整在多个LLM和数据集上均显著提升了元认知效率(以meta-d’衡量),而边界压缩则会降低性能,且即使在不规则范围内,模型仍偏好整数响应,表明人类认知偏差对模型输出具有持续影响。
链接: https://arxiv.org/abs/2603.09309
作者: Yuyang Dai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 5 figures
Abstract:Verbalized confidence, in which LLMs report a numerical certainty score, is widely used to estimate uncertainty in black-box settings, yet the confidence scale itself (typically 0–100) is rarely examined. We show that this design choice is not neutral. Across six LLMs and three datasets, verbalized confidence is heavily discretized, with more than 78% of responses concentrating on just three round-number values. To investigate this phenomenon, we systematically manipulate confidence scales along three dimensions: granularity, boundary placement, and range regularity, and evaluate metacognitive sensitivity using meta-d’. We find that a 0–20 scale consistently improves metacognitive efficiency over the standard 0–100 format, while boundary compression degrades performance and round-number preferences persist even under irregular ranges. These results demonstrate that confidence scale design directly affects the quality of verbalized uncertainty and should be treated as a first-class experimental variable in LLM evaluation.
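下面演示两个摘要中的操作:统计得分集中在前k个高频取值上的比例(对应论文报告的超过78%集中于三个值的离散化现象),以及把0–100量表线性重标到0–20。数据为虚构示例。

```python
from collections import Counter

def top_k_concentration(scores, k=3):
    """Share of responses landing on the k most frequent values, i.e. the
    discretization statistic the paper reports (>78% on three values)."""
    counts = Counter(scores)
    return sum(c for _, c in counts.most_common(k)) / len(scores)

def rescale(score, src_max=100, dst_max=20):
    """Map a 0..src_max verbalized confidence onto a 0..dst_max scale."""
    return round(score * dst_max / src_max)

# Toy verbalized confidences clustered on round numbers (hypothetical data).
scores = [90, 90, 95, 80, 90, 95, 100, 90, 95, 80]
concentration = top_k_concentration(scores)
```

在上例中90%的回答集中于三个取值;重标后0–20量表把整条0–100轴压缩到21个档位,使"整数偏好"不再吞掉量表的大部分分辨率。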
[AI-44] DendroNN: Dendrocentric Neural Networks for Energy-Efficient Classification of Event-Based Data
【速读】:该论文旨在解决事件驱动型神经网络在处理时空信息时难以高精度解码时间特征的问题,传统方法依赖循环结构或延迟机制以增强时间计算能力,但会降低硬件效率。其解决方案的关键在于提出一种基于树突结构的新型神经网络——DendroNN( dendrocentric neural network),该网络通过模拟树突分支中的序列检测机制,能够识别独特的输入脉冲序列作为时空特征;同时引入无梯度的重连(rewiring)训练阶段,在不依赖反向传播的情况下自动记忆高频出现的序列并剔除无判别力的序列,从而实现高效且准确的时空模式识别。此外,论文还设计了一种基于时间轮机制的异步数字硬件架构,充分利用DendroNN的动态与静态稀疏性及内在量化特性,相较现有类脑硬件在音频分类任务中达到4倍以上的能效提升,展现出对事件驱动型时空计算的高度适配性。
链接: https://arxiv.org/abs/2603.09274
作者: Jann Krausse,Zhe Su,Kyrus Mama,Maryada,Klaus Knobloch,Giacomo Indiveri,Jürgen Becker
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
备注: Currently under review
Abstract:Spatiotemporal information is at the core of diverse sensory processing and computational tasks. Feed-forward spiking neural networks can be used to solve these tasks while offering potential benefits in terms of energy efficiency by computing event-based. However, they have trouble decoding temporal information with high accuracy. Thus, they commonly resort to recurrence or delays to enhance their temporal computing ability which, however, bring downsides in terms of hardware-efficiency. In the brain, dendrites are computational powerhouses that just recently started to be acknowledged in such machine learning systems. In this work, we focus on a sequence detection mechanism present in branches of dendrites and translate it into a novel type of neural network by introducing a dendrocentric neural network, DendroNN. DendroNNs identify unique incoming spike sequences as spatiotemporal features. This work further introduces a rewiring phase to train the non-differentiable spike sequences without the use of gradients. During the rewiring, the network memorizes frequently occurring sequences and additionally discards those that do not contribute any discriminative information. The networks display competitive accuracies across various event-based time series datasets. We also propose an asynchronous digital hardware architecture using a time-wheel mechanism that builds on the event-driven design of DendroNNs, eliminating per-step global updates typical of delay- or recurrence-based models. By leveraging a DendroNN’s dynamic and static sparsity along with intrinsic quantization, it achieves up to 4x higher efficiency than state-of-the-art neuromorphic hardware at comparable accuracy on the same audio classification task, demonstrating its suitability for spatiotemporal event-based computing. This work offers a novel approach to low-power spatiotemporal processing on event-driven hardware.
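树突分支的序列检测可粗略建模为:当分支存储的突触序列在时间窗口内按顺序出现时分支"放电",中间允许出现干扰事件。以下为该机制的示意性实现,事件流与窗口参数均为假设,并非DendroNN的完整模型。

```python
def branch_detects(stored_sequence, events, window):
    """Fire iff the stored synapse sequence (length >= 2) occurs in order
    within `window` time units. `events` is a time-sorted list of
    (time, synapse_id) pairs; distractor events in between are ignored."""
    for i, (t0, syn0) in enumerate(events):
        if syn0 != stored_sequence[0]:
            continue
        pos = 1
        for t, syn in events[i + 1:]:
            if t - t0 > window:
                break
            if syn == stored_sequence[pos]:
                pos += 1
                if pos == len(stored_sequence):
                    return True
    return False

events = [(0.0, "A"), (1.0, "C"), (2.0, "B"), (3.0, "C")]
hit = branch_detects(["A", "B", "C"], events, window=5.0)
miss = branch_detects(["A", "B", "C"], events, window=2.0)
```

注意检测结果既依赖顺序也依赖时间窗口:同一事件流在窗口收紧后不再触发,这是纯前馈脉冲网络难以表达的时序选择性。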
[AI-45] Logos: An evolvable reasoning engine for rational molecular design
【速读】:该论文旨在解决当前人工智能在分子设计中面临的两大核心问题:一是现有模型往往只能在物理保真度(physical fidelity)与可解释性推理之间二选一,导致生成结果缺乏化学有效性或难以理解;二是缺乏能够同时保证结构准确性与化学合理性的高效模型。解决方案的关键在于提出一种名为Logos的紧凑型分子推理模型,其创新性地通过分阶段训练策略实现逻辑推理与化学一致性的一体化优化:首先引入显式推理示例以建立分子描述到结构决策的映射,随后将推理模式与分子表征对齐,最终在优化目标中直接嵌入化学规则和不变量,从而引导模型输出符合化学约束的结果。该方法在多个基准数据集上展现出卓越的结构准确性和化学有效性,且参数量远低于通用语言模型,同时支持人类对中间推理步骤的审查,显著提升了AI系统在分子科学中的可靠性与可解释性。
链接: https://arxiv.org/abs/2603.09268
作者: Haibin Wen,Zhe Zhao,Fanfu Wang,Tianyi Xu,Hao Zhang,Chao Yang,Ye Wei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The discovery and design of functional molecules remain central challenges across chemistry, biology, and materials science. While recent advances in machine learning have accelerated molecular property prediction and candidate generation, existing models tend to excel either in physical fidelity without transparent reasoning, or in flexible reasoning without guarantees of chemical validity. This imbalance limits the reliability of artificial intelligence systems in real scientific design. Here we present Logos, a compact molecular reasoning model that integrates multi-step logical reasoning with strict chemical consistency. Logos is trained using a staged strategy that first exposes the model to explicit reasoning examples linking molecular descriptions to structural decisions, and then progressively aligns these reasoning patterns with molecular representations. In a final training phase, chemical rules and invariants are incorporated directly into the optimization objective, guiding the model toward chemically valid outputs. Across multiple benchmark datasets, Logos achieves strong performance in both structural accuracy and chemical validity, matching or surpassing substantially larger general-purpose language models while operating with a fraction of their parameters. Beyond benchmark evaluation, the model exhibits stable behaviour in molecular optimization tasks involving multiple, potentially conflicting constraints. By explicitly exposing intermediate reasoning steps, Logos enables human inspection and assessment of the design logic underlying each generated structure. These results indicate that jointly optimizing for reasoning structure and physical consistency offers a practical pathway toward reliable and interpretable AI systems for molecular science, supporting closer integration of artificial intelligence into scientific discovery processes.
[AI-46] Social-R1: Towards Human-like Social Reasoning in LLM s
【速读】:该论文旨在解决大语言模型在社会智能(social intelligence)方面存在的局限性问题,即模型往往依赖表面模式而非真正的社会推理能力,难以有效感知社交线索、推断心理状态并生成恰当响应,从而阻碍了人机协作和AI对人类需求的真实服务。其解决方案的关键在于引入ToMBench-Hard这一对抗性基准,提供具有挑战性的训练样本以阻断捷径学习,并提出Social-R1强化学习框架,通过多维奖励机制对推理过程进行轨迹级(trajectory-level)监督,确保结构一致性、逻辑完整性和信息密度,从而实现模型推理与人类认知的深层对齐。
链接: https://arxiv.org/abs/2603.09249
作者: Jincenzi Wu,Yuxuan Lei,Jianxun Lian,Yitian Huang,Lexin Zhou,Haotian Li,Xing Xie,Helen Meng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 27 pages. Code and dataset will be released upon acceptance
Abstract:While large language models demonstrate remarkable capabilities across numerous domains, social intelligence - the capacity to perceive social cues, infer mental states, and generate appropriate responses - remains a critical challenge, particularly for enabling effective human-AI collaboration and developing AI that truly serves human needs. Current models often rely on superficial patterns rather than genuine social reasoning. We argue that cultivating human-like social intelligence requires training with challenging cases that resist shortcut solutions. To this end, we introduce ToMBench-Hard, an adversarial benchmark designed to provide hard training examples for social reasoning. Building on this, we propose Social-R1, a reinforcement learning framework that aligns model reasoning with human cognition through multi-dimensional rewards. Unlike outcome-based RL, Social-R1 supervises the entire reasoning process, enforcing structural alignment, logical integrity, and information density. Results show that our approach enables a 4B parameter model to surpass much larger counterparts and generalize robustly across eight diverse benchmarks. These findings demonstrate that challenging training cases with trajectory-level alignment offer a path toward efficient and reliable social intelligence.
[AI-47] Cognitively Layered Data Synthesis for Domain Adaptation of LLM s to Space Situational Awareness
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂工程领域(如空间态势感知,Space Situational Awareness, SSA)中迁移困难的问题,其核心瓶颈在于高质量监督微调(Supervised Fine-Tuning, SFT)数据集的构建不足,具体表现为知识覆盖不全、认知层次浅显以及数据质量难以控制。解决方案的关键是提出BD-FDG(基于布卢姆分类学的领域特定微调数据生成框架),通过三个机制实现突破:一是利用知识树结构确保语料的系统性覆盖;二是设计涵盖九类问题与六层认知层级(从记忆到创造)的问答生成方案,形成连续难度梯度;三是引入多维评分流水线以保障数据的领域严谨性和一致性。该方法有效支撑了23万样本的SSA-SFT数据集构建,并成功微调出SSA-LLM-8B,在领域测试集上BLEU-1相对提升达144%–176%,且在Arena对抗评测中胜率高达82.21%,同时保持通用基准性能稳定,验证了基于认知分层的数据构造策略对复杂工程场景下LLM适配的有效性。
链接: https://arxiv.org/abs/2603.09231
作者: Ding Linghu,Cheng Wang,Da Fan,Wei Shi,Kaifeng Yin,Xiaoliang Xue,Fan Yang,Haiyi Ren,Cong Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) demonstrate exceptional performance on general-purpose tasks. However, transferring them to complex engineering domains such as space situational awareness (SSA) remains challenging owing to insufficient structural alignment with mission chains, the absence of higher-order cognitive supervision, and poor correspondence between data quality criteria and engineering specifications. The core bottleneck is the construction of high-quality supervised fine-tuning (SFT) datasets. To this end, we propose BD-FDG (Bloom’s Taxonomy-based Domain-specific Fine-tuning Data Generation), a framework that addresses incomplete knowledge coverage, shallow cognitive depth, and limited quality controllability through three mechanisms: structured knowledge organization, cognitively layered question modeling, and automated quality control. The framework uses a knowledge tree to ensure structured corpus coverage, designs a question generation scheme spanning nine categories and six cognitive levels from Remember to Create to produce samples with a continuous difficulty gradient, and applies a multidimensional scoring pipeline to enforce domain rigor and consistency. Using BD-FDG, we construct SSA-SFT, a domain dataset of approximately 230K samples, and fine-tune Qwen3-8B to obtain SSA-LLM-8B. Experiments show that SSA-LLM-8B achieves relative BLEU-1 improvements of 144% (no-think) and 176% (think) on the domain test set and a win rate of 82.21% over the baseline in arena comparisons, while largely preserving general benchmark performance (MMLU-Pro, MATH-500). These results validate SFT data construction driven by cognitive layering as an effective paradigm for complex engineering domains and provide a transferable framework for domain-specific LLM adaptation.
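九类问题与六个认知层级构成一个54格的问题生成网格,难度沿布卢姆层级递增。下面给出该网格的示意性枚举;类别名称为假设(摘要未列出九个类别的具体名称),仅用于说明"连续难度梯度"的构造方式。

```python
import itertools

BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]
# The paper uses nine question categories; these names are illustrative only.
CATEGORIES = ["concept", "parameter", "procedure", "comparison", "cause-effect",
              "diagnosis", "prediction", "planning", "design"]

def question_specs():
    """One (category, level, difficulty) spec per cell of the 9 x 6 grid,
    with difficulty increasing along Bloom's taxonomy."""
    return [(cat, level, BLOOM_LEVELS.index(level) + 1)
            for cat, level in itertools.product(CATEGORIES, BLOOM_LEVELS)]

specs = question_specs()
```

每个格子对应一类问答样本的生成配额,从而保证数据集同时覆盖知识面(类别维)与认知深度(层级维)。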
[AI-48] Embodied Human Simulation for Quantitative Design and Analysis of Interactive Robotics
【速读】:该论文旨在解决物理人机交互(Physical Human-Robot Interaction, PHRI)中难以准确评估交互动力学的问题,尤其在复杂的人体生物力学和运动响应背景下,传统实验依赖间接指标而无法获取如肌肉力或关节负荷等内部状态信息。解决方案的关键在于构建一个基于全身体骨骼肌模型的可扩展仿真框架,该模型作为人类动力系统的预测代理,结合强化学习控制器生成生理学上合理的自适应运动行为;通过预训练的人类运动控制策略作为一致评价标准,实现大规模设计空间探索的计算可行性,并能同时优化机器人结构参数与控制策略,从而系统性地提升人机协同性能。
链接: https://arxiv.org/abs/2603.09218
作者: Chenhui Zuo,Jinhao Xu,Michael Qian Vergnolle,Yanan Sui
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Physical interactive robotics, ranging from wearable devices to collaborative humanoid robots, require close coordination between mechanical design and control. However, evaluating interactive dynamics is challenging due to complex human biomechanics and motor responses. Traditional experiments rely on indirect metrics without measuring human internal states, such as muscle forces or joint loads. To address this issue, we develop a scalable simulation-based framework for the quantitative analysis of physical human-robot interaction. At its core is a full-body musculoskeletal model serving as a predictive surrogate for the human dynamical system. Driven by a reinforcement learning controller, it generates adaptive, physiologically grounded motor behaviors. We employ a sequential training pipeline where the pre-trained human motion control policy acts as a consistent evaluator, making large-scale design space exploration computationally tractable. By simulating the coupled human-robot system, the framework provides access to internal biomechanical metrics, offering a systematic way to concurrently co-optimize a robot’s structural parameters and control policy. We demonstrate its capability in optimizing human-exoskeleton interactions, showing improved joint alignment and reduced contact forces. This work establishes embodied human simulation as a scalable paradigm for interactive robotics design.
[AI-49] PrivPRISM: Automatically Detecting Discrepancies Between Google Play Data Safety Declarations and Developer Privacy Policies
【速读】:该论文旨在解决移动应用中隐私政策(Privacy Policy)与简化数据安全声明(Data Safety Declaration)之间存在的不一致性问题,这种不一致常导致用户对实际数据处理行为产生误解,并违反监管要求。其解决方案的关键在于提出PrivPRISM框架,该框架结合编码器-解码器语言模型,系统性地从隐私政策中提取细粒度的数据实践信息,并将其与数据安全声明进行比对,从而实现大规模非合规检测。
链接: https://arxiv.org/abs/2603.09214
作者: Bhanuka Silva,Dishanika Denipitiyage,Anirban Mahanti,Aruna Seneviratne,Suranga Seneviratne
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 21 pages, 18 figures, 6 tables
Abstract:End-users seldom read verbose privacy policies, leading app stores like Google Play to mandate simplified data safety declarations as a user-friendly alternative. However, these self-declared disclosures often contradict the full privacy policies, deceiving users about actual data practices and violating regulatory requirements for consistency. To address this, we introduce PrivPRISM, a robust framework that combines encoder and decoder language models to systematically extract and compare fine-grained data practices from privacy policies and to compare against data safety declarations, enabling scalable detection of non-compliance. Evaluating 7,770 popular mobile games uncovers discrepancies in nearly 53% of cases, rising to 61% among 1,711 widely used generic apps. Additionally, static code analysis reveals possible under-disclosures, with privacy policies disclosing just 66.8% of potential accesses to sensitive data like location and financial information, versus only 36.4% in data safety declarations of mobile games. Our findings expose systemic issues, including widespread reuse of generic privacy policies, vague / contradictory statements, and hidden risks in high-profile apps with 100M+ downloads, underscoring the urgent need for automated enforcement to protect platform integrity and for end-users to be vigilant about sensitive data they disclose via popular apps.
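提取出结构化的数据实践之后,不一致性检测可归结为集合差:隐私政策披露但数据安全声明缺失(疑似漏报),以及声明中有而政策不支持的条目。以下为示意性实现;(data_type, action) 的表示方式为假设,真实框架用语言模型从文本中抽取这些条目。

```python
def find_discrepancies(policy_practices, declared_practices):
    """Practices disclosed in the privacy policy but missing from the data
    safety form (potential under-declaration), and vice versa. Practices are
    (data_type, action) pairs extracted upstream."""
    policy, declared = set(policy_practices), set(declared_practices)
    return {
        "undeclared": sorted(policy - declared),
        "unsupported": sorted(declared - policy),
    }

policy = [("financial_info", "shared"), ("location", "collected")]
declared = [("contacts", "collected"), ("location", "collected")]
report = find_discrepancies(policy, declared)
```

两个方向的差集分别对应摘要中的"声明遗漏"与"政策不支持的声明",可直接规模化应用于数千个应用。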
[AI-50] Abundant Intelligence and Deficient Demand: A Macro-Financial Stress Test of Rapid AI Adoption
【速读】:该论文旨在解决快速采用生成式 AI(Generative AI)可能引发的宏观金融压力问题,其核心挑战在于“分配与合约错配”:AI带来的产出丰裕与有效需求不足并存,因为现有经济制度仍基于人类认知稀缺性而设计。解决方案的关键在于识别并量化三个机制:一是“替代螺旋”机制,即企业理性用AI替代人力导致劳动收入下降、总需求萎缩,进而加速AI扩散,论文据此推导出AI能力增长速率、扩散速度和再安置速率的临界条件,以判断反馈是自我限制还是爆发式危机;二是“幽灵GDP”现象,即AI产出替代劳动产出时,在无补偿转移支付下货币流通速度随劳动份额单调下降,造成测度产出与消费相关收入之间的裂痕;三是“中介崩溃”,AI降低信息摩擦压缩中间商利润空间至纯物流成本,引发SaaS、支付、咨询、保险及金融顾问等领域的重新定价。研究进一步指出,由于高收入群体(占美国消费47–65%)对AI暴露度最高,其传导效应在私人信贷(全球约2.5万亿美元)和抵押贷款市场(全球约13万亿美元)中尤为显著,并提出十一项可检验预测及其明确证伪条件,通过FRED时间序列与BLS职业层级数据校准模拟,量化了稳定调整向爆炸性危机转变的边界条件。
链接: https://arxiv.org/abs/2603.09209
作者: Xupeng Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We formalize a macro-financial stress test for rapid AI adoption. Rather than a productivity bust or existential risk, we identify a distribution-and-contract mismatch: AI-generated abundance coexists with demand deficiency because economic institutions are anchored to human cognitive scarcity. Three mechanisms formalize this channel. First, a displacement spiral with competing reinstatement effects: each firm’s rational decision to substitute AI for labor reduces aggregate labor income, which reduces aggregate demand, accelerating further AI adoption. We derive conditions on the AI capability growth rate, diffusion speed, and reinstatement rate under which the net feedback is self-limiting versus explosive. Second, Ghost GDP: when AI-generated output substitutes for labor-generated output, monetary velocity declines monotonically in the labor share absent compensating transfers, creating a wedge between measured output and consumption-relevant income. Third, intermediation collapse: AI agents that reduce information frictions compress intermediary margins toward pure logistics costs, triggering repricing across SaaS, payments, consulting, insurance, and financial advisory. Because top-quintile earners drive 47–65% of U.S. consumption and face the highest AI exposure, the transmission into private credit ($2.5 trillion globally) and mortgage markets ($13 trillion) is disproportionate. We derive eleven testable predictions with explicit falsification conditions. Calibrated simulations disciplined by FRED time series and BLS occupation-level data quantify conditions under which stable adjustment transitions to explosive crisis. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2603.09209 [cs.AI] (or arXiv:2603.09209v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2603.09209 arXiv-issued DOI via DataCite (pending registration)
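摘要推导了替代反馈"自我限制"与"爆发"的临界条件。下面用一个假设性的差分方程做定性示意:替代速率随劳动份额(作为总需求的代理变量)下降而加速,再安置速率r对抗该反馈;r足够大时收敛到正的劳动份额,r过小且反馈过强时劳动份额坍缩。函数形式与参数纯属示意,并非论文基于FRED/BLS数据校准的模型。

```python
def simulate_labor_share(a0, r, k, steps=300, s=0.6):
    """Toy displacement spiral: the substitution rate accelerates as the
    labor share s (a proxy for aggregate demand) falls, while reinstatement
    at rate r creates new labor tasks. Purely illustrative dynamics."""
    path = [s]
    for _ in range(steps):
        substitution = a0 * (1.0 + k * (1.0 - s))   # demand-deficiency feedback
        s = max(0.0, s + r * (1.0 - s) - substitution * s)
        path.append(s)
    return path

stable = simulate_labor_share(a0=0.02, r=0.03, k=1.0)      # self-limiting regime
collapse = simulate_labor_share(a0=0.02, r=0.001, k=8.0)   # explosive regime
```

同一方程在不同 (r, k) 组合下分别收敛到正均衡或坍缩至接近零,直观对应摘要中"稳定调整"与"爆发式危机"两种相位。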
[AI-51] Evaluate-as-Action: Self-Evaluated Process Rewards for Retrieval-Augmented Agents
【速读】:该论文旨在解决检索增强型智能体在多步推理中可靠性不足的问题,具体表现为:噪声检索可能导致多跳问答任务失败,而仅基于最终结果的强化学习方法提供的信用信号过于粗粒度,难以优化中间步骤。其解决方案的关键在于提出 EvalAct(Evaluate-as-Action),将隐式的检索质量评估转化为显式的动作,并强制执行“搜索-评估”协同协议,使得每次检索后立即进行结构化的评估打分,从而获得与交互轨迹对齐的过程信号;进一步引入基于GRPO的Process-Calibrated Advantage Rescaling(PCAR)方法,根据评估分数在段落级别重标优势值,强化可靠段落、保守更新不确定段落,显著提升多步推理性能。
链接: https://arxiv.org/abs/2603.09203
作者: Jiangming Shu,Yuxiang Zhang,Ye Ma,Xueyuan Lin,Jitao Sang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-augmented agents can query external evidence, yet their reliability in multi-step reasoning remains limited: noisy retrieval may derail multi-hop question answering, while outcome-only reinforcement learning provides credit signals that are too coarse to optimize intermediate steps. We propose EvalAct (Evaluate-as-Action), which converts implicit retrieval quality assessment into an explicit action and enforces a coupled Search-to-Evaluate protocol so that each retrieval is immediately followed by a structured evaluation score, yielding process signals aligned with the interaction trajectory. To leverage these signals, we introduce Process-Calibrated Advantage Rescaling (PCAR), a GRPO-based optimization method that rescales advantages at the segment level according to evaluation scores, emphasizing reliable segments while updating uncertain ones conservatively. Experiments on seven open-domain QA benchmarks show that EvalAct achieves the best average accuracy, with the largest gains on multi-hop tasks, and ablations verify that the explicit evaluation loop drives the primary improvements while PCAR provides consistent additional benefits.
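PCAR按评估分数在段落级别重标优势值。下面给出一个示意性的重标规则:按min-max归一化的评估分数加权,并保留权重下限使不确定段落仍获得保守更新。这只是说明思路的假设性公式,并非论文中PCAR的精确定义。

```python
import numpy as np

def rescale_advantages(advantages, eval_scores, floor=0.2):
    """Scale each segment's advantage by its min-max-normalized evaluation
    score, with a floor so uncertain segments still receive a small,
    conservative update. Illustrative rule, not the paper's exact formula."""
    scores = np.asarray(eval_scores, dtype=float)
    spread = max(float(np.ptp(scores)), 1e-8)
    weights = floor + (1.0 - floor) * (scores - scores.min()) / spread
    return np.asarray(advantages, dtype=float) * weights

advantages = np.array([1.0, -0.5, 2.0])   # one advantage per trajectory segment
eval_scores = [0.9, 0.3, 0.6]             # evaluation score after each retrieval
scaled = rescale_advantages(advantages, eval_scores)
```

高评估分数的段落保留完整优势,低分段落的更新被压缩到下限附近,实现"强化可靠段落、保守更新不确定段落"。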
[AI-52] Explainable Innovation Engine: Dual-Tree Agent-RAG with Methods-as-Nodes and Verifiable Write-Back
【速读】:该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)系统在事实准确性与多步推理控制方面的局限性,尤其是依赖扁平文本块检索导致的缺乏可解释性和合成过程不可控的问题。其解决方案的关键在于提出一种可解释创新引擎(Explainable Innovation Engine),将知识单元从传统的文本块升级为“方法节点”(method-as-nodes),并通过维护加权的方法溯源树(weighted method provenance tree)实现推导过程的可追溯性,同时利用分层聚类抽象树(hierarchical clustering abstraction tree)支持高效自顶向下的导航。在推理阶段,策略代理(strategy agent)显式选择合成操作(如归纳、演绎、类比)以组合新方法节点,并记录可审计的轨迹;验证评分层则通过剪枝低质量候选节点并回写验证结果,支持持续迭代增长。这一设计显著提升了RAG系统在复杂推理任务中的可控性、可解释性和可验证性。
链接: https://arxiv.org/abs/2603.09192
作者: Renwei Meng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 4 figures, code available on GitHub
Abstract:Retrieval-augmented generation (RAG) improves factual grounding, yet most systems rely on flat chunk retrieval and provide limited control over multi-step synthesis. We propose an Explainable Innovation Engine that upgrades the knowledge unit from text chunks to methods-as-nodes. The engine maintains a weighted method provenance tree for traceable derivations and a hierarchical clustering abstraction tree for efficient top-down navigation. At inference time, a strategy agent selects explicit synthesis operators (e.g., induction, deduction, analogy), composes new method nodes, and records an auditable trajectory. A verifier-scorer layer then prunes low-quality candidates and writes validated nodes back to support continual growth. Expert evaluation across six domains and multiple backbones shows consistent gains over a vanilla baseline, with the largest improvements on derivation-heavy settings, and ablations confirm the complementary roles of provenance backtracking and pruning. These results suggest a practical path toward controllable, explainable, and verifiable innovation in agentic RAG systems. Code is available at the project GitHub repository this https URL.
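论文中“方法节点 + 加权溯源树 + 验证回写”的机制可用如下草图理解(类名、阈值与数据结构均为示意性假设,并非论文的开源实现):

```python
class MethodNode:
    """A method-as-node with weighted provenance links to its parents."""
    def __init__(self, name, operator=None, parents=(), weights=()):
        self.name = name          # method identifier
        self.operator = operator  # synthesis operator, e.g. "analogy"
        self.parents = list(parents)
        self.weights = list(weights)

def write_back(graph, candidate, score, threshold=0.7):
    """Verifier-scorer layer: only candidates above threshold are kept."""
    if score >= threshold:
        graph[candidate.name] = candidate
        return True
    return False

def provenance(graph, name):
    """Trace a derivation back through the provenance tree."""
    node, trail = graph[name], [name]
    for p in node.parents:
        trail.extend(provenance(graph, p.name))
    return trail

graph = {}
a = MethodNode("method_a"); graph["method_a"] = a
b = MethodNode("method_b"); graph["method_b"] = b
c = MethodNode("method_c", operator="analogy", parents=[a, b], weights=[0.6, 0.4])
accepted = write_back(graph, c, score=0.85)  # validated node written back
```

溯源树使每个合成节点的推导轨迹可回溯(对应 "auditable trajectory"),而 `write_back` 对应验证评分层剪枝低质量候选并回写验证结果。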
[AI-53] Latent-DARM: Bridging Discrete Diffusion And Autoregressive Models For Reasoning ICLR2026
【速读】:该论文旨在解决多智能体系统中因依赖自回归语言模型(Autoregressive Language Models, ARMs)而导致的全局推理受限与计划修改困难的问题,同时克服离散扩散语言模型(Discrete Diffusion Language Models, DDLMs)在文本流畅性上的不足,从而实现不同能力模型之间的高效协同。其解决方案的关键在于提出Latent-DARM框架——一个基于潜在空间的通信机制,将DDLM作为规划者(planner)与ARM作为执行者(executor)进行解耦协作,在不直接交换文本的前提下实现跨模型的信息传递,从而兼顾全局规划能力和生成流畅性,显著提升多智能体系统的推理准确率并大幅降低计算资源消耗。
链接: https://arxiv.org/abs/2603.09184
作者: Lina Berrayana,Ahmed Heakl,Abdullah Sohail,Thomas Hofmann,Salman Khan,Wei Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published at LIT Workshop at ICLR 2026
Abstract:Most multi-agent systems rely exclusively on autoregressive language models (ARMs) that are based on sequential generation. Although effective for fluent text, ARMs limit global reasoning and plan revision. On the other hand, Discrete Diffusion Language Models (DDLMs) enable non-sequential, globally revisable generation and have shown strong planning capabilities, but their limited text fluency hinders direct collaboration with ARMs. We introduce Latent-DARM, a latent-space communication framework bridging DDLM (planners) and ARM (executors), maximizing collaborative benefits. Across mathematical, scientific, and commonsense reasoning benchmarks, Latent-DARM outperforms text-based interfaces on average, improving accuracy from 27.0% to 36.0% on DART-5 and from 0.0% to 14.0% on AIME2024. Latent-DARM approaches the results of state-of-the-art reasoning models while using less than 2.2% of its token budget. This work advances multi-agent collaboration among agents with heterogeneous models.
[AI-54] Differentiable Stochastic Traffic Dynamics: Physics-Informed Generative Modelling in Transportation
【速读】:该论文旨在解决当前基于物理信息的深度学习方法在交通流建模中忽视宏观交通流随机性的局限性问题。现有方法通常嵌入确定性偏微分方程(PDE),输出点估计值,无法刻画交通状态的不确定性。其解决方案的关键在于构建一个以分布形式体现物理约束的新框架:从带布朗力驱动的伊藤型Lighthill-Whitham-Richards(LWR)模型出发,推导出每个空间位置上交通密度的边际分布的前向方程(即Fokker-Planck方程),其中由守恒律引入的空间耦合表现为显式的条件漂移项,从而清晰地揭示了闭合需求;进一步将该分布式物理约束转化为可点对点计算和可微的确定性概率流常微分方程(Probability Flow ODE),并设计了一个带有对流-闭合模块的评分网络(score network),通过去噪评分匹配与Fokker-Planck残差损失联合训练,最终实现数据条件下的密度分布估计,支持点估计、可信区间及拥堵风险度量等多元输出。
链接: https://arxiv.org/abs/2603.09174
作者: Wuping Xin
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 29 pages
Abstract:Macroscopic traffic flow is stochastic, but the physics-informed deep learning methods currently used in transportation literature embed deterministic PDEs and produce point-valued outputs; the stochasticity of the governing dynamics plays no role in the learned representation. This work develops a framework in which the physics constraint itself is distributional and directly derived from stochastic traffic-flow dynamics. Starting from an Ito-type Lighthill-Whitham-Richards model with Brownian forcing, we derive a one-point forward equation for the marginal traffic density at each spatial location. The spatial coupling induced by the conservation law appears as an explicit conditional drift term, which makes the closure requirement transparent. Based on this formulation, we derive an equivalent deterministic Probability Flow ODE that is pointwise evaluable and differentiable once a closure is specified. Incorporating this as a physics constraint, we then propose a score network with an advection-closure module, trainable by denoising score matching together with a Fokker-Planck residual loss. The resulting model targets a data-conditioned density distribution, from which point estimates, credible intervals, and congestion-risk measures can be computed. The framework provides a basis for distributional traffic-state estimation and for stochastic fundamental-diagram analysis in a physics-informed generative setting.
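摘要中由 Itô 型 SDE 推导 Probability Flow ODE 的做法,其通用形式是 score-based 生成建模中的标准结果(下式针对一维 SDE dx = f(x,t)dt + g(t)dW;论文针对随机 LWR 模型的具体漂移项与闭合项此处不做复现):

```latex
% Generic probability-flow ODE that preserves the marginals p_t(x)
% of the Ito SDE  dx = f(x,t)\,dt + g(t)\,dW_t :
\frac{\mathrm{d}x}{\mathrm{d}t}
  \;=\; f(x,t) \;-\; \tfrac{1}{2}\, g(t)^{2}\, \nabla_{x} \log p_{t}(x)
```

一旦 score network 给出 ∇ₓ log pₜ(x) 的近似,该 ODE 即可逐点计算并可微,对应摘要中 "pointwise evaluable and differentiable once a closure is specified" 的表述。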
[AI-55] ZeroWBC: Learning Natural Visuomotor Humanoid Control Directly from Human Egocentric Video
【速读】:该论文旨在解决人形机器人在场景交互中实现多样化且自然的全身控制问题,现有方法受限于刚性运动模式和昂贵的遥操作数据采集,难以执行如坐下或踢球等类人自然行为。解决方案的关键在于提出ZeroWBC框架,该框架直接从人类第一视角视频中学习视觉-动作策略,无需大规模机器人遥操作数据;其核心步骤包括:首先微调视觉-语言模型(VLM)以根据文本指令和第一视角视觉上下文预测未来全身人类动作,随后将生成的动作通过稳健的通用运动追踪策略重定向至真实机器人关节并执行,从而实现高效、自然且无需遥操作数据的全身控制。
链接: https://arxiv.org/abs/2603.09170
作者: Haoran Yang,Jiacheng Bao,Yucheng Xin,Haoming Song,Yuyang Tian,Bin Zhao,Dong Wang,Xuelong Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Achieving versatile and naturalistic whole-body control for humanoid robot scene-interaction remains a significant challenge. While some recent works have demonstrated autonomous humanoid interactive control, they are constrained to rigid locomotion patterns and expensive teleoperation data collection, lacking the versatility to execute more human-like natural behaviors such as sitting or kicking. Furthermore, acquiring the necessary real robot teleoperation data is prohibitively expensive and time-consuming. To address these limitations, we introduce ZeroWBC, a novel framework that learns a natural humanoid visuomotor control policy directly from human egocentric videos, eliminating the need for large-scale robot teleoperation data and enabling natural humanoid robot scene-interaction control. Specifically, our approach first fine-tunes a Vision-Language Model (VLM) to predict future whole-body human motions based on text instructions and egocentric visual context, then these generated motions are retargeted to real robot joints and executed via our robust general motion tracking policy for humanoid whole-body control. Extensive experiments on the Unitree G1 humanoid robot demonstrate that our method outperforms baseline approaches in motion naturalness and versatility, successfully establishing a pipeline that eliminates teleoperation data collection overhead for whole-body humanoid control, offering a scalable and efficient paradigm for general humanoid whole-body control.
[AI-56] GIAT: A Geologically-Informed Attention Transformer for Lithology Identification
【速读】:该论文旨在解决基于Transformer的模型在测井岩性识别任务中因“黑箱”特性及缺乏地质先验指导而导致性能受限和可信度不足的问题。其解决方案的关键在于提出了一种地质信息引导的注意力机制(Geologically-Informed Attention Transformer, GIAT),通过重构类别相关的序列相关性(Category-Wise Sequence Correlation, CSC)滤波器生成地质引导的关系矩阵,并将其注入自注意力计算中,从而显式引导模型学习地质上一致的模式,显著提升了预测准确性与解释一致性。
链接: https://arxiv.org/abs/2603.09165
作者: Jie Li,Qishun Yang,Nuo Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate lithology identification from well logs is crucial for subsurface resource evaluation. Although Transformer-based models excel at sequence modeling, their “black-box” nature and lack of geological guidance limit their performance and trustworthiness. To overcome these limitations, this letter proposes the Geologically-Informed Attention Transformer (GIAT), a novel framework that deeply fuses data-driven geological priors with the Transformer’s attention mechanism. The core of GIAT is a new attention-biasing mechanism. We repurpose Category-Wise Sequence Correlation (CSC) filters to generate a geologically-informed relational matrix, which is injected into the self-attention calculation to explicitly guide the model toward geologically coherent patterns. On two challenging datasets, GIAT achieves state-of-the-art performance with an accuracy of up to 95.4%, significantly outperforming existing models. More importantly, GIAT demonstrates exceptional interpretation faithfulness under input perturbations and generates geologically coherent predictions. Our work presents a new paradigm for building more accurate, reliable, and interpretable deep learning models for geoscience applications.
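GIAT 将地质关系矩阵“注入自注意力计算”的一种可能实现方式是加性偏置(摘要未给出具体注入形式,以下加法方案、矩阵规模与数值均为示意):

```python
import math

def biased_attention(scores, bias):
    """Self-attention weights with an additive relational bias.

    scores: [n x n] raw attention logits (QK^T / sqrt(d)).
    bias:   [n x n] geologically-informed relational matrix; additive
            injection is one plausible biasing scheme, assumed here.
    Returns the row-wise softmax of (scores + bias).
    """
    out = []
    for s_row, b_row in zip(scores, bias):
        logits = [s + b for s, b in zip(s_row, b_row)]
        m = max(logits)                      # stabilize the softmax
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

# A strong positive bias steers attention toward geologically
# related positions even when the raw scores are uniform.
weights = biased_attention([[0.0, 0.0], [0.0, 0.0]],
                           [[2.0, 0.0], [0.0, 2.0]])
```

这对应摘要中 "injected into the self-attention calculation to explicitly guide the model toward geologically coherent patterns" 的思路。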
[AI-57] Wrong Code, Right Structure: Learning Netlist Representations from Imperfect LLM-Generated RTL
【速读】:该论文旨在解决电路表示学习中因高质量标注数据稀缺而导致的模型泛化能力受限问题,尤其是真实电路设计受知识产权(Intellectual Property, IP)保护且标注成本高昂,使得现有方法仅能处理小规模、清洁标签的电路,难以扩展至实际应用场景。解决方案的关键在于发现并利用生成式 AI(Generative AI)所生成的 RTL(Register-Transfer-Level)代码虽存在功能不正确性,但其综合后的网表(netlist)仍保留了与预期功能强相关的结构模式这一关键观察。基于此,作者提出一种低成本的数据增强与训练框架,系统性地将有噪声的 LLM 生成 RTL 作为训练数据,构建从自动化代码生成到下游任务的端到端流程,显著提升了模型在真实网表上的泛化性能,有效突破了电路表示学习中的数据瓶颈。
链接: https://arxiv.org/abs/2603.09161
作者: Siyang Cai,Cangyuan Li,Yinhe Han,Ying Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:
Abstract:Learning effective netlist representations is fundamentally constrained by the scarcity of labeled datasets, as real designs are protected by Intellectual Property (IP) and costly to annotate. Existing work therefore focuses on small-scale circuits with clean labels, limiting scalability to realistic designs. Meanwhile, Large Language Models (LLMs) can generate Register-Transfer-Level (RTL) at scale, but their functional incorrectness has hindered their use in circuit analysis. In this work, we make a key observation: even when LLM-Generated RTL is functionally imperfect, the synthesized netlists still preserve structural patterns that are strongly indicative of the intended functionality. Building on this insight, we propose a cost-effective data augmentation and training framework that systematically exploits imperfect LLM-Generated RTL as training data for netlist representation learning, forming an end-to-end pipeline from automated code generation to downstream tasks. We conduct evaluations on circuit functional understanding tasks, including sub-circuit boundary identification and component classification, across benchmarks of increasing scales, extending the task scope from operator-level to IP-level. The evaluations demonstrate that models trained on our noisy synthetic corpus generalize well to real-world netlists, matching or even surpassing methods trained on scarce high-quality data and effectively breaking the data bottleneck in circuit representation learning.
[AI-58] Real-Time Trust Verification for Safe Agentic Actions using TrustBench AAAI2026
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)向自主代理(Autonomous Agents)演进过程中,缺乏实时动作验证机制导致的安全性与可靠性问题。现有评估框架如AgentBench、TrustLLM和HELM仅在任务完成后进行事后评价,无法阻止有害行为的发生。其解决方案的关键在于提出TrustBench这一双模式框架:一方面通过多维指标与LLM-as-a-Judge评估方法全面衡量信任度;另一方面提供一个在行动决策前调用的验证工具包,对潜在动作进行安全性和可靠性检查,且干预节点精准定位在“动作制定后、执行前”的关键时刻。该设计结合领域特定插件(Domain-specific Plugins),显著提升了在医疗、金融等高风险场景下的防护效果,实测可减少87%的有害行为,且延迟低于200ms,具备实际部署可行性。
链接: https://arxiv.org/abs/2603.09157
作者: Tavishi Sharma,Vinayak Sharma,Pragya Sharma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at the AAAI 2026 Workshop on Trust and Control in Agentic AI (TrustAgent)
Abstract:As large language models evolve from conversational assistants to autonomous agents, ensuring trustworthiness requires a fundamental shift from post-hoc evaluation to real-time action verification. Current frameworks like AgentBench evaluate task completion, while TrustLLM and HELM assess output quality after generation. However, none of these prevent harmful actions during agent execution. We present TrustBench, a dual-mode framework that (1) benchmarks trust across multiple dimensions using both traditional metrics and LLM-as-a-Judge evaluations, and (2) provides a toolkit agents invoke before taking actions to verify safety and reliability. Unlike existing approaches, TrustBench intervenes at the critical decision point: after an agent formulates an action but before execution. Domain-specific plugins encode specialized safety requirements for healthcare, finance, and technical domains. Across multiple agentic tasks, TrustBench reduced harmful actions by 87%. Domain-specific plugins outperformed generic verification, achieving 35% greater harm reduction. With sub-200ms latency, TrustBench enables practical real-time trust verification for autonomous agents.
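TrustBench 在“动作制定后、执行前”介入的验证流程可用如下草图理解(插件接口、动作格式与具体规则均为假设,并非该工具包的真实 API):

```python
def verify_action(action, plugins):
    """Run domain-specific checks on a formulated action BEFORE execution.

    Each plugin is a callable returning (ok, reason). The action runs
    only if every applicable check passes -- mirroring the
    'formulate, verify, then execute' decision point. The dict-based
    action format and plugin registry are illustrative assumptions.
    """
    for check in plugins.get(action["domain"], []):
        ok, reason = check(action)
        if not ok:
            return False, reason
    return True, "verified"

def no_irreversible_transfers(action):
    """Example finance plugin: block large unattended transfers."""
    if action.get("type") == "transfer" and action.get("amount", 0) > 1000:
        return False, "amount exceeds unattended-transfer limit"
    return True, ""

plugins = {"finance": [no_irreversible_transfers]}

allowed, _ = verify_action(
    {"domain": "finance", "type": "transfer", "amount": 50}, plugins)
blocked, why = verify_action(
    {"domain": "finance", "type": "transfer", "amount": 5000}, plugins)
```

关键在于验证发生在动作执行之前,而非像事后评测框架那样在生成之后打分。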
[AI-59] Deep Tabular Research via Continual Experience-Driven Execution
【速读】:该论文旨在解决大型语言模型在处理非结构化表格中的复杂长时程分析任务时表现不佳的问题,这类任务通常涉及层级化和双向表头以及非规范布局。为此,作者提出了Deep Tabular Research(DTR)这一形式化框架,将表格推理建模为闭环决策过程。解决方案的关键在于:(i) 构建层次化的元图以捕捉双向语义,并将自然语言查询映射到操作级搜索空间;(ii) 引入期望感知的选择策略,优先选择高价值执行路径;(iii) 通过孪生结构化记忆(参数化更新与抽象文本)整合历史执行结果,实现持续优化。该方法强调战略规划与底层执行的分离,显著提升了长时程表格推理能力。
链接: https://arxiv.org/abs/2603.09151
作者: Junnan Dong,Chuang Zhou,Zheng Yuan,Yifei Yu,Siyu An,Di Yin,Xing Sun,Feiyue Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 23 pages, 6 tables, 6 figures
Abstract:Large language models often struggle with complex long-horizon analytical tasks over unstructured tables, which typically feature hierarchical and bidirectional headers and non-canonical layouts. We formalize this challenge as Deep Tabular Research (DTR), requiring multi-step reasoning over interdependent table regions. To address DTR, we propose a novel agentic framework that treats tabular reasoning as a closed-loop decision-making process. We carefully design a coupled query and table comprehension for path decision making and operational execution. Specifically, (i) DTR first constructs a hierarchical meta graph to capture bidirectional semantics, mapping natural language queries into an operation-level search space; (ii) To navigate this space, we introduce an expectation-aware selection policy that prioritizes high-utility execution paths; (iii) Crucially, historical execution outcomes are synthesized into a siamese structured memory, i.e., parameterized updates and abstracted texts, enabling continual refinement. Extensive experiments on challenging unstructured tabular benchmarks verify the effectiveness and highlight the necessity of separating strategic planning from low-level execution for long-horizon tabular reasoning.
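其中“期望感知的选择策略”的最简形式,可理解为按效用与成功概率的乘积(期望效用)挑选执行路径(操作名与打分方式均为示意性假设,非论文公式):

```python
def select_path(candidates):
    """Expectation-aware selection: pick the execution path with the
    highest expected utility (utility x estimated success probability)."""
    return max(candidates, key=lambda c: c["utility"] * c["p_success"])

# Two hypothetical operations in the operation-level search space:
# a risky high-payoff aggregation vs. a reliable single-cell lookup.
paths = [
    {"op": "aggregate_region", "utility": 0.9, "p_success": 0.5},  # E = 0.45
    {"op": "lookup_cell",      "utility": 0.6, "p_success": 0.9},  # E = 0.54
]
best = select_path(paths)
```

在闭环决策过程中,这一选择会结合历史执行结果(论文中的孪生结构化记忆)不断修正 `p_success` 的估计。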
[AI-60] Causally Sufficient and Necessary Feature Expansion for Class-Incremental Learning
【速读】:该论文旨在解决基于特征扩展的类增量学习(Class Incremental Learning, CIL)方法中因任务特定特征与旧任务特征发生冲突而导致灾难性遗忘的问题。其核心问题是:在经验风险最小化(Empirical Risk Minimization, ERM)驱动下,任务内伪相关(intra-task spurious correlations)导致模型依赖于捷径特征(shortcut features),这些非鲁棒特征易受干扰并漂移到其他任务的特征空间;同时,任务间伪相关(inter-task spurious correlations)引发视觉相似类别间的语义混淆。解决方案的关键在于提出一种基于必要性和充分性概率(Probability of Necessity and Sufficiency, PNS)的正则化方法——CPNS(Causal PNS),用于指导特征扩展过程。具体而言,通过引入基于孪生网络的双尺度反事实生成器,分别生成任务内反事实特征以降低任务内PNS风险、确保任务表示的因果完备性,以及任务间干扰特征以降低任务间PNS风险、保障跨任务表示的可分性,从而有效缓解特征碰撞问题。
链接: https://arxiv.org/abs/2603.09145
作者: Zhen Zhang,Jielei Chu,Tianrui Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Current expansion-based methods for Class Incremental Learning (CIL) effectively mitigate catastrophic forgetting by freezing old features. However, such task-specific features learned from the new task may collide with the old features. From a causal perspective, spurious feature correlations are the main cause of this collision, manifesting in two scopes: (i) guided by empirical risk minimization (ERM), intra-task spurious correlations cause task-specific features to rely on shortcut features. These non-robust features are vulnerable to interference, inevitably drifting into the feature space of other tasks; (ii) inter-task spurious correlations induce semantic confusion between visually similar classes across tasks. To address this, we propose a Probability of Necessity and Sufficiency (PNS)-based regularization method to guide feature expansion in CIL. Specifically, we first extend the definition of PNS to expansion-based CIL, termed CPNS, which quantifies both the causal completeness of intra-task representations and the separability of inter-task representations. We then introduce a dual-scope counterfactual generator based on twin networks to ensure the measurement of CPNS, which simultaneously generates: (i) intra-task counterfactual features to minimize intra-task PNS risk and ensure causal completeness of task-specific features, and (ii) inter-task interfering features to minimize inter-task PNS risk, ensuring the separability of inter-task representations. Theoretical analyses confirm its reliability. The regularization is a plug-and-play method for expansion-based CIL to mitigate feature collision. Extensive experiments demonstrate the effectiveness of the proposed method.
[AI-61] DexHiL: A Human-in-the-Loop Framework for Vision-Language-Action Model Post-Training in Dexterous Manipulation
【速读】:该论文旨在解决当前生成式视觉-语言-动作(Vision-Language-Action, VLA)模型在复杂精细操作任务中部署时的可靠性与适应性不足问题,尤其是针对多指灵巧手控制的高维、接触密集特性所导致的执行分布偏移问题。其解决方案的关键在于提出首个集成式“人在回路”(Human-in-the-Loop, HiL)框架DexHiL,该框架通过引入一种干预感知的数据采样策略,优先选择人类纠正片段用于后训练,并结合轻量级遥操作接口实现在执行过程中的即时人工干预,从而实现机械臂与灵巧手的协同调控,显著提升模型在真实机器人场景下的性能表现,相较标准离线微调基线平均成功率提升25%。
链接: https://arxiv.org/abs/2603.09121
作者: Yifan Han,Zhongxi Chen,Yuxuan Zhao,Congsheng Xu,Yanming Shao,Yichuan Peng,Yao Mu,Wenzhao Lian
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures
Abstract:While Vision-Language-Action (VLA) models have demonstrated promising generalization capabilities in robotic manipulation, deploying them on specific and complex downstream tasks still demands effective post-training. In parallel, Human-in-the-Loop (HiL) learning has proven to be a powerful mechanism for refining robot policies. However, extending this paradigm to dexterous manipulation remains challenging: multi-finger control is high-dimensional, contact-intensive, and exhibits execution distributions that differ markedly from standard arm motions, leaving existing dexterous VLA systems limited in reliability and adaptability. We present DexHiL, the first integrated arm-hand human-in-the-loop framework for dexterous VLA models, enabling coordinated interventions over the arm and the dexterous hand within a single system. DexHiL introduces an intervention-aware data sampling strategy that prioritizes corrective segments for post-training, alongside a lightweight teleoperation interface that supports instantaneous human corrections during execution. Real-robot experiments demonstrate that DexHiL serves as an effective post-training framework, yielding a substantial performance leap, outperforming standard offline-only fine-tuning baselines by an average of 25% in success rates across distinct tasks. Project page: this https URL
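摘要中的“干预感知数据采样”(优先采样人类纠正片段用于后训练)可用加权采样草图理解(boost 倍数、数据格式与随机种子均为示意性假设):

```python
import random

def sample_segments(segments, k, boost=3.0, seed=0):
    """Intervention-aware sampling: human-corrected segments are drawn
    with `boost` times the weight of autonomous-execution segments."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    weights = [boost if s["intervened"] else 1.0 for s in segments]
    return rng.choices(segments, weights=weights, k=k)

# Half the segments carry human corrections; with boost=3 they should
# make up roughly 75% of the sampled post-training batch.
segments = [{"id": i, "intervened": i % 2 == 0} for i in range(10)]
batch = sample_segments(segments, k=1000)
frac = sum(s["intervened"] for s in batch) / 1000
```

这样,包含人工纠正的片段在后训练批次中被系统性地过采样,而自主执行片段仍保留一定比例。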
[AI-62] PM-Nav: Priori-Map Guided Embodied Navigation in Functional Buildings
【速读】:该论文旨在解决现有语言驱动的具身导航(language-driven embodied navigation)方法在功能建筑(Functional Buildings, FBs)中因环境特征高度相似而导致导航失败的问题,其核心挑战在于缺乏有效利用先验空间知识的能力。解决方案的关键在于提出一种先验地图引导的具身导航框架(Priori-Map Guided Embodied Navigation, PM-Nav),通过将环境地图转化为导航友好的语义先验地图(semantic priori-map),设计包含注释先验地图的分层思维链提示模板(hierarchical chain-of-thought prompt template)以实现精准路径规划,并构建多模态协同动作输出机制完成定位决策与导航执行控制,从而显著提升在复杂相似场景下的导航性能。
链接: https://arxiv.org/abs/2603.09113
作者: Jiang Gao,Xiangyu Dong,Haozhou Li,Haoran Zhao,Yaoming Zhou,Xiaoguang Ma
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 6 pages, 4 figures
Abstract:Existing language-driven embodied navigation paradigms face challenges in functional buildings (FBs) with highly similar features, as they lack the ability to effectively utilize priori spatial knowledge. To tackle this issue, we propose a Priori-Map Guided Embodied Navigation (PM-Nav), wherein environmental maps are transformed into navigation-friendly semantic priori-maps, a hierarchical chain-of-thought prompt template with an annotation priori-map is designed to enable precise path planning, and a multi-model collaborative action output mechanism is built to accomplish positioning decisions and execution control for navigation planning. Comprehensive tests using a home-made FB dataset show that the PM-Nav obtains average improvements of 511% and 1175%, and 650% and 400% over the SG-Nav and the InstructNav in simulation and real-world, respectively. These tremendous boosts elucidate the great potential of using the PM-Nav as a backbone navigation framework for FBs.
[AI-63] Not All News Is Equal: Topic- and Event-Conditional Sentiment from Finetuned LLM s for Aluminum Price Forecasting
【速读】:该论文旨在解决如何有效利用文本数据中的情绪信号来提升铝价预测精度的问题,尤其是在市场波动剧烈时期。其解决方案的关键在于:采用微调后的轻量级大语言模型(LLM)——Qwen3,从英文和中文新闻标题中提取月度情绪分数,并将其与传统表格型数据(如基本金属指数、汇率、通胀率及能源价格)融合,构建基于长短期记忆网络(LSTM)的预测模型。实证结果表明,在高波动性市场环境下,该融合模型相较仅使用表格数据的基准模型显著提升了预测性能(夏普比率由0.23提升至1.04),验证了情绪信息在特定市场条件下的预测价值。
链接: https://arxiv.org/abs/2603.09085
作者: Alvaro Paredes Amorin,Andre Python,Christoph Weisser
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages
Abstract:By capturing the prevailing sentiment and market mood, textual data has become increasingly vital for forecasting commodity prices, particularly in metal markets. However, the effectiveness of lightweight, finetuned large language models (LLMs) in extracting predictive signals for aluminum prices, and the specific market conditions under which these signals are most informative, remains under-explored. This study generates monthly sentiment scores from English and Chinese news headlines (Reuters, Dow Jones Newswires, and China News Service) and integrates them with traditional tabular data, including base metal indices, exchange rates, inflation rates, and energy prices. We evaluate the predictive performance and economic utility of these models through long-short simulations on the Shanghai Metal Exchange from 2007 to 2024. Our results demonstrate that during periods of high volatility, Long Short-Term Memory (LSTM) models incorporating sentiment data from a finetuned Qwen3 model (Sharpe ratio 1.04) significantly outperform baseline models using tabular data alone (Sharpe ratio 0.23). Subsequent analysis elucidates the nuanced roles of news sources, topics, and event types in aluminum price forecasting.
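摘要以夏普比率(Sharpe ratio)比较模型的经济效用,其标准计算方式如下(此处为通用定义的示意,简化为无风险利率取零;示例收益序列为虚构数据,并非论文结果):

```python
import math
from statistics import mean, stdev

def sharpe_ratio(monthly_returns, periods_per_year=12):
    """Annualized Sharpe ratio of a per-period return series.

    Standard definition with the risk-free rate taken as zero:
    mean / sample standard deviation of per-period returns,
    scaled by sqrt(periods per year).
    """
    return mean(monthly_returns) / stdev(monthly_returns) * math.sqrt(periods_per_year)

# Toy monthly long-short P&L series (illustrative numbers only).
returns = [0.02, -0.01, 0.03, 0.01, -0.02, 0.04]
sr = sharpe_ratio(returns)
```

按此定义,摘要中 1.04 对 0.23 的对比即表示融合情绪信号的策略在单位风险上获得了数倍于基准的超额收益。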
[AI-64] Sim2Act: Robust Simulation-to-Decision Learning via Adversarial Calibration and Group-Relative Perturbation
【速读】:该论文旨在解决仿真到决策学习(Simulation-to-decision learning)中因仿真器基于噪声或偏差的真实数据训练而导致的决策关键区域预测误差问题,此类误差会引发动作排序不稳定和策略不可靠。现有方法要么仅关注提升平均仿真保真度,要么采用保守正则化策略,易导致策略坍塌(policy collapse),即排除高风险高回报的动作。解决方案的关键在于提出Sim2Act框架,其核心创新包括:一是引入对抗校准机制(adversarial calibration mechanism),对决策关键状态-动作对中的仿真误差进行重加权,使代理保真度与下游决策影响对齐;二是设计群体相对扰动策略(group-relative perturbation strategy),在仿真不确定性下稳定策略学习,避免过度悲观约束。实验表明,该方法在多个供应链基准任务上显著提升了仿真鲁棒性和决策稳定性。
链接: https://arxiv.org/abs/2603.09053
作者: Hongyu Cao,Jinghan Zhang,Kunpeng Liu,Dongjie Wang,Feng Xia,Haifeng Chen,Xiaohua Hu,Yanjie Fu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures
Abstract:Simulation-to-decision learning enables safe policy training in digital environments without risking real-world deployment, and has become essential in mission-critical domains such as supply chains and industrial systems. However, simulators learned from noisy or biased real-world data often exhibit prediction errors in decision-critical regions, leading to unstable action ranking and unreliable policies. Existing approaches either focus on improving average simulation fidelity or adopt conservative regularization, which may cause policy collapse by discarding high-risk high-reward actions. We propose Sim2Act, a robust simulation-to-decision framework that addresses both simulator and policy robustness. First, we introduce an adversarial calibration mechanism that re-weights simulation errors in decision-critical state-action pairs to align surrogate fidelity with downstream decision impact. Second, we develop a group-relative perturbation strategy that stabilizes policy learning under simulator uncertainty without enforcing overly pessimistic constraints. Extensive experiments on multiple supply chain benchmarks demonstrate improved simulation robustness and more stable decision performance under structured and unstructured perturbations.
[AI-65] EPOCH: An Agentic Protocol for Multi-Round System Optimization
【速读】:该论文旨在解决当前自主代理(autonomous agents)在提示优化、代码生成和机器学习系统改进中普遍存在的问题:现有方法多为任务特定的优化循环,缺乏统一协议来建立基线并管理多轮自我改进过程,导致难以实现跨组件(如提示、模型配置、代码和规则模块)的协同优化,且易丧失稳定性、可复现性、可追溯性和评估完整性。其解决方案的关键在于提出EPOCH协议,将优化分为基线构建与迭代自我改进两个阶段,并通过角色约束的阶段划分(规划、实现、评估)以及标准化的命令接口和轮次级追踪机制,实现了异构环境下的多轮系统优化协调,从而保障了优化过程的结构化、可控性和可靠性。
链接: https://arxiv.org/abs/2603.09049
作者: Zhanlin Liu,Yitao Li,Munirathnam Srikanth
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Autonomous agents are increasingly used to improve prompts, code, and machine learning systems through iterative execution and feedback. Yet existing approaches are usually designed as task-specific optimization loops rather than as a unified protocol for establishing baselines and managing tracked multi-round self-improvement. We introduce EPOCH, an engineering protocol for multi-round system optimization in heterogeneous environments. EPOCH organizes optimization into two phases: baseline construction and iterative self-improvement. It further structures each round through role-constrained stages that separate planning, implementation, and evaluation, and standardizes execution through canonical command interfaces and round-level tracking. This design enables coordinated optimization across prompts, model configurations, code, and rule-based components while preserving stability, reproducibility, traceability, and integrity of evaluation. Empirical studies in various tasks illustrate the practicality of EPOCH for production-oriented autonomous improvement workflows.
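EPOCH 的“基线构建 + 多轮自我改进”双阶段结构,以及每轮内规划/实现/评估的角色划分,可用如下草图理解(函数签名与轮次记录格式均为示意性假设,非协议规范本身):

```python
def run_epoch(build_baseline, plan, implement, evaluate, rounds=3):
    """Two-phase loop: baseline construction, then tracked
    self-improvement rounds with role-constrained stages."""
    system = build_baseline()                      # phase 1: baseline
    history = [{"round": 0, "score": evaluate(system)}]
    for r in range(1, rounds + 1):                 # phase 2: rounds
        proposal = plan(system, history)           # planning stage
        candidate = implement(system, proposal)    # implementation stage
        score = evaluate(candidate)                # evaluation stage
        if score >= history[-1]["score"]:          # keep only improvements
            system = candidate
        history.append({"round": r, "score": max(score, history[-1]["score"])})
    return system, history

# Toy components: the 'system' is a number, each round proposes +1,
# and evaluation is the value itself.
system, history = run_epoch(
    build_baseline=lambda: 0,
    plan=lambda s, h: 1,
    implement=lambda s, p: s + p,
    evaluate=lambda s: s,
    rounds=3)
```

轮次级的 `history` 记录对应协议强调的可追溯性与评估完整性:每一轮的改动都有基线可比、有分数可查。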
[AI-66] Time, Identity and Consciousness in Language Model Agents AAAI2026
【速读】:该论文试图解决当前机器意识评估中仅依赖行为表现(如语言和工具使用)所导致的误导性结论问题,即语言模型代理可能在没有足够约束条件的情况下生成看似一致的自我陈述,从而虚假地表现出稳定的身份感。解决方案的关键在于引入**栈理论(Stack Theory)**中的时间间隙(temporal gap)机制,通过构建分层轨迹(scaffolded trajectories),将评估窗口内各要素的独立发生(ingredient-wise occurrence)与单个目标步骤上的共现(co-instantiation)相分离。在此基础上,应用栈理论的“琶音”(Arpeggio)与“和弦”(Chord)后设命题来定义基于实体身份声明的持久性指标,从而获得可从仪器化轨迹中计算出的两个持久性分数,并将其映射到五个操作性身份度量中,最终形成一个保守的身份评估工具包,有效区分“像一个稳定自我那样说话”与“像一个稳定自我那样组织自身行为”。
链接: https://arxiv.org/abs/2603.09043
作者: Elija Perrier,Michael Timothy Bennett
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at AAAI 2026 Spring Symposium - Machine Consciousness: Integrating Theory, Technology, and Philosophy
Abstract:Machine consciousness evaluations mostly see behavior. For language model agents that behavior is language and tool use. That lets an agent say the right things about itself even when the constraints that should make those statements matter are not jointly present at decision time. We apply Stack Theory’s temporal gap to scaffold trajectories. This separates ingredient-wise occurrence within an evaluation window from co-instantiation at a single objective step. We then instantiate Stack Theory’s Arpeggio and Chord postulates on grounded identity statements. This yields two persistence scores that can be computed from instrumented scaffold traces. We connect these scores to five operational identity metrics and map common scaffolds into an identity morphospace that exposes predictable tradeoffs. The result is a conservative toolkit for identity evaluation. It separates talking like a stable self from being organized like one.
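Arpeggio 与 Chord 两类持久性分数的一种可计算的操作化示意如下(将轨迹视为每步激活的“要素”集合;以下定义是本文对摘要的示意性解读,并非论文的正式公式):

```python
def arpeggio_score(trace, ingredients, window):
    """Fraction of sliding windows in which every ingredient occurs
    at least once SOMEWHERE in the window (ingredient-wise occurrence)."""
    hits, n = 0, len(trace) - window + 1
    for i in range(n):
        present = set().union(*trace[i:i + window])
        hits += ingredients <= present
    return hits / n

def chord_score(trace, ingredients):
    """Fraction of single steps at which ALL ingredients are
    co-instantiated at once (co-instantiation at one objective step)."""
    return sum(ingredients <= step for step in trace) / len(trace)

# Toy trace: which constraints were jointly present at each step.
trace = [{"memory"}, {"goal"}, {"memory", "goal"}, {"memory"}]
ings = {"memory", "goal"}
arp = arpeggio_score(trace, ings, window=2)   # occurrence within a window
chord = chord_score(trace, ings)              # co-occurrence at one step
```

示例中 arpeggio 分数为 1.0 而 chord 分数仅 0.25,正体现了“时间间隙”:各要素在窗口内都出现过,却极少在同一决策步上共同生效——即“说得像一个稳定自我”与“组织得像一个稳定自我”之间的差距。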
[AI-67] PlayWorld: Learning Robot World Models from Autonomous Play
【速读】:该论文旨在解决当前生成式视频世界模型在机器人操作任务中难以预测物理上一致的机器人-物体交互问题,尤其是在接触丰富的场景下。现有方法依赖于成功导向的人类示范数据,导致对长尾物理交互的建模不足。其解决方案的关键在于提出PlayWorld——一个完全自主、可扩展的训练管道,通过无监督的机器人自play(self-play)方式收集交互经验,从而自然地实现大规模数据采集,并有效捕捉复杂且罕见的物理交互模式,显著提升视频世界模拟器在接触预测、故障识别和策略评估等方面的性能,最终推动强化学习策略在真实环境中的表现提升。
链接: https://arxiv.org/abs/2603.09030
作者: Tenny Yin,Zhiting Mei,Zhonghe Zheng,Miyu Yamane,David Wang,Jade Sceats,Samuel M. Bateman,Lihan Zha,Apurva Badithela,Ola Shorinwa,Anirudha Majumdar
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: this https URL
Abstract:Action-conditioned video models offer a promising path to building general-purpose robot simulators that can improve directly from data. Yet, despite training on large-scale robot datasets, current state-of-the-art video models still struggle to predict physically consistent robot-object interactions that are crucial in robotic manipulation. To close this gap, we present PlayWorld, a simple, scalable, and fully autonomous pipeline for training high-fidelity video world simulators from interaction experience. In contrast to prior approaches that rely on success-biased human demonstrations, PlayWorld is the first system capable of learning entirely from unsupervised robot self-play, enabling naturally scalable data collection while capturing complex, long-tailed physical interactions essential for modeling realistic object dynamics. Experiments across diverse manipulation tasks show that PlayWorld generates high-quality, physically consistent predictions for contact-rich interactions that are not captured by world models trained on human-collected data. We further demonstrate the versatility of PlayWorld in enabling fine-grained failure prediction and policy evaluation, with up to 40% improvements over human-collected data. Finally, we demonstrate how PlayWorld enables reinforcement learning in the world model, improving policy performance by 65% in success rates when deployed in the real world.
[AI-68] Automating Detection and Root-Cause Analysis of Flaky Tests in Quantum Software
【速读】:该论文旨在解决量子软件中因概率性输出导致的“量子flaky测试”(quantum flakiness)问题,即测试结果在无代码变更情况下不稳定,从而掩盖真实缺陷并降低开发效率。其解决方案的关键在于构建一个自动化流水线,利用大型语言模型(Large Language Models, LLMs)和余弦相似度方法自动识别量子软件仓库中的flaky测试及其关联的Pull Request,并对flakiness进行分类与根因分析。通过扩展原有14个量子软件库的手动分析数据集,该研究实现了25个新发现的flaky测试案例(使原数据集规模增加54%),并验证了Google Gemini模型在flakiness检测(F1-score=0.9420)和根因识别(F1-score=0.9643)上的高性能表现,表明LLMs可为量子软件工程中的flaky测试问题提供实用的诊断支持。
链接: https://arxiv.org/abs/2603.09029
作者: Janakan Sivaloganathan,Ainaz Jamshidi,Andriy Miranskyy,Lei Zhang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: 27 pages, 2 figures
Abstract:Like classical software, quantum software systems rely on automated testing. However, their inherently probabilistic outputs make them susceptible to quantum flakiness – tests that pass or fail inconsistently without code changes. Such quantum flaky tests can mask real defects and reduce developer productivity, yet systematic tooling for their detection and diagnosis remains limited. This paper presents an automated pipeline to detect flaky-test-related issues and pull requests in quantum software repositories and to support the identification of their root causes. We aim to expand an existing quantum flaky test dataset and evaluate the capability of Large Language Models (LLMs) for flakiness classification and root-cause identification. Building on a prior manual analysis of 14 quantum software repositories, we automate the discovery of additional flaky test cases using LLMs and cosine similarity. We further evaluate a variety of LLMs from OpenAI GPT, Meta LLaMA, Google Gemini, and Anthropic Claude suites for classifying flakiness and identifying root causes from issue descriptions and code context. Classification performance is assessed using standard performance metrics, including F1-score. Using our pipeline, we identify 25 previously unknown flaky tests, increasing the original dataset size by 54%. The best-performing model, Google Gemini, achieves an F1-score of 0.9420 for flakiness detection and 0.9643 for root-cause identification, demonstrating that LLMs can provide practical support for triaging flaky reports and understanding their underlying causes in quantum software. The expanded dataset and automated pipeline provide reusable artifacts for the quantum software engineering community. Future work will focus on improving detection robustness and exploring automated repair of quantum flaky tests. 
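流水线中用余弦相似度筛选 flaky 相关 issue 的做法,可用词袋(bag-of-words)余弦相似度草图理解(查询语句、阈值与空白分词方式均为示意性假设;论文实际同时使用了 LLM 与余弦相似度):

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words term-count vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical reference phrasing for flaky-test reports.
QUERY = "flaky test fails intermittently without code changes"

def looks_flaky(issue_title, threshold=0.3):
    """Flag an issue title as flaky-related above a similarity threshold."""
    return cosine_similarity(issue_title, QUERY) >= threshold

hit = looks_flaky("test fails intermittently on CI")
miss = looks_flaky("add documentation for new API")
```

实践中,此类相似度筛选可先召回候选 issue,再交由 LLM 做 flakiness 分类与根因判断。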
Comments: 27 pages, 2 figures Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET) Cite as: arXiv:2603.09029 [cs.SE] (or arXiv:2603.09029v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2603.09029 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
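上文摘要提到用 LLM 与余弦相似度自动发现额外的 flaky 测试用例。下面给出一个假设性的极简示意(非论文原始流水线):用词袋计数代替真实文本嵌入,以余弦相似度把新 issue 召回到已知 flaky 描述上,再交给 LLM 做分类;函数名与阈值 0.35 均为本文虚构。

```python
from collections import Counter
import math

def embed(text):
    """极简"嵌入":词袋计数。真实系统应使用文本嵌入模型。"""
    return Counter(text.lower().split())

def cosine(a, b):
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def candidate_flaky(issue, known_descriptions, threshold=0.35):
    """返回与新 issue 足够相似的已知 flaky 描述,作为 LLM 分类前的召回步骤。"""
    e = embed(issue)
    return [d for d in known_descriptions if cosine(e, embed(d)) >= threshold]
```

这种"相似度召回 + LLM 精判"的两段式结构,正是摘要中把数据集扩大 54% 的那类自动发现流程的常见骨架。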
[AI-69] The Missing Memory Hierarchy: Demand Paging for LLM Context Windows
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)上下文窗口(context window)资源受限引发的一系列问题,包括上下文浪费、注意力机制退化、成本上升以及跨会话状态丢失等。这些问题本质上是虚拟内存管理问题的变体,但当前LLM系统缺乏有效的内存分层机制,导致所有工具定义、系统提示和过时内容长期占用有限的上下文空间。解决方案的关键在于引入一个称为Pichay的需求分页(demand paging)系统,它作为客户端与推理API之间的透明代理,在消息流中动态驱逐过期内容、检测页面错误(page fault),并基于故障历史对工作集页面进行固定(pinning)。通过在生产环境中部署该机制,系统实现了最高达93%的上下文消耗减少,并验证了经典内存管理理论(如工作集理论、分页策略)在LLM场景下的有效性,为构建完整的多级内存层次结构(L1至持久存储)提供了实践基础。
链接: https://arxiv.org/abs/2603.09023
作者: Tony Mason
机构: 未知
类目: Operating Systems (cs.OS); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:The context window of a large language model is not memory. It is L1 cache: a small, fast, expensive resource that the field treats as the entire memory system. There is no L2, no virtual memory, no paging. Every tool definition, every system prompt, and every stale tool result occupies context for the lifetime of the session. The result is measurable: across 857 production sessions and 4.45 million effective input tokens, 21.8% is structural waste. We present Pichay, a demand paging system for LLM context windows. Implemented as a transparent proxy between client and inference API, Pichay interposes on the message stream to evict stale content, detect page faults when the model re-requests evicted material, and pin working-set pages identified by fault history. In offline replay across 1.4 million simulated evictions, the fault rate is 0.0254%. In live production deployment over 681 turns, the system reduces context consumption by up to 93% (5,038 KB to 339 KB); under extreme sustained pressure, the system remains operational but exhibits the expected thrashing pathology, with repeated fault-in of evicted content. The key observation is that the problems the field faces, such as context limits, attention degradation, cost scaling, and lost state across sessions, are virtual memory problems wearing different clothes. The solutions exist: working set theory (Denning, 1968), demand paging, fault-driven replacement policies, and memory hierarchies with multiple eviction-managed levels. We describe the architecture of a full memory hierarchy for LLM systems (L1 through persistent storage), report on the first three levels deployed in production use (L1 eviction, L2 fault-driven pinning, L3 model-initiated conversation compaction), and identify cross-session memory as the remaining frontier.
Subjects: Operating Systems (cs.OS); Artificial Intelligence (cs.AI); Software Engineering (cs.SE) Cite as: arXiv:2603.09023 [cs.OS] (or arXiv:2603.09023v1 [cs.OS] for this version) https://doi.org/10.48550/arXiv.2603.09023 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
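上文摘要描述了 Pichay 的三个核心机制:驱逐过期内容、在模型重新请求被驱逐材料时检测缺页、并依据缺页历史固定工作集页。下面是这一机制的假设性极简示意(LRU 驱逐策略、类名与阈值均为本文虚构,非 Pichay 实现):

```python
from collections import defaultdict

class ContextPager:
    """上下文分页器:L1 为容量有限的上下文页,被驱逐内容落入 L2 后备存储;
    被驱逐内容再次被请求即记一次"缺页",缺页达到阈值的页被固定不再驱逐。"""
    def __init__(self, capacity, pin_threshold=2):
        self.capacity = capacity
        self.pin_threshold = pin_threshold
        self.pages = {}                 # L1:当前驻留上下文的页
        self.backing = {}               # L2:被驱逐页的后备存储
        self.faults = defaultdict(int)  # 每页的缺页计数
        self.pinned = set()             # 依据缺页历史固定的工作集页
        self.last_used = {}
        self.clock = 0

    def put(self, page_id, content):
        self.clock += 1
        self.pages[page_id] = content
        self.last_used[page_id] = self.clock
        self._evict()

    def get(self, page_id):
        self.clock += 1
        if page_id not in self.pages:   # 缺页:从 L2 换回 L1
            self.faults[page_id] += 1
            self.pages[page_id] = self.backing.pop(page_id)
            if self.faults[page_id] >= self.pin_threshold:
                self.pinned.add(page_id)
        self.last_used[page_id] = self.clock
        self._evict()
        return self.pages[page_id]

    def _evict(self):
        while len(self.pages) > self.capacity:
            victims = [p for p in self.pages if p not in self.pinned]
            if not victims:
                break                   # 全部固定:容忍超额,对应"颠簸"场景
            victim = min(victims, key=self.last_used.get)  # LRU 驱逐
            self.backing[victim] = self.pages.pop(victim)
```

当固定页填满容量时驱逐停止,与摘要中"持续高压下出现预期的颠簸病理"相呼应。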
[AI-70] MEMO: Memory-Augmented Model Context Optimization for Robust Multi-Turn Multi-Agent LLM Games
【速读】:该论文旨在解决多轮、多智能体大语言模型(Large Language Model, LLM)游戏评估中普遍存在的运行间方差过大与性能不足问题。在长时交互场景下,早期微小偏差因多智能体耦合效应被放大,导致胜率估计偏倚且排名不稳定;同时,提示词(prompt)选择进一步加剧了有效策略的差异性。解决方案的核心是提出MEMO(Memory-augmented MOdel context optimization)框架,其关键在于通过记忆增强的上下文优化机制实现推理时上下文的动态调整:一方面利用持久化记忆库存储自对弈轨迹中的结构化洞察,并作为先验注入后续博弈过程以增强策略稳定性(Retention);另一方面采用基于TrueSkill算法的不确定性感知提示演化与优先回放机制,主动探索稀有且决定性的状态以提升策略多样性与鲁棒性(Exploration)。实验证明,MEMO显著提升了GPT-4o-mini和Qwen-2.5-7B-Instruct在五种文本游戏中的平均胜率,并大幅降低运行间方差,尤其在谈判类和不完美信息博弈中效果最为突出。
链接: https://arxiv.org/abs/2603.09022
作者: Yunfei Xie,Kevin Wang,Bobby Cheng,Jianzhu Yao,Zhizhou Sha,Alexander Duffy,Yihan Xi,Hongyuan Mei,Cheston Tan,Chen Wei,Pramod Viswanath,Zhangyang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-turn, multi-agent LLM game evaluations often exhibit substantial run-to-run variance. In long-horizon interactions, small early deviations compound across turns and are amplified by multi-agent coupling. This biases win rate estimates and makes rankings unreliable across repeated tournaments. Prompt choice worsens this further by producing different effective policies. We address both instability and underperformance with MEMO (Memory-augmented MOdel context optimization), a self-play framework that optimizes inference-time context by coupling retention and exploration. Retention maintains a persistent memory bank that stores structured insights from self-play trajectories and injects them as priors during later play. Exploration runs tournament-style prompt evolution with uncertainty-aware selection via TrueSkill, and uses prioritized replay to revisit rare and decisive states. Across five text-based games, MEMO raises mean win rate from 25.1% to 49.5% for GPT-4o-mini and from 20.9% to 44.3% for Qwen-2.5-7B-Instruct, using 2,000 self-play games per task. Run-to-run variance also drops, giving more stable rankings across prompt variations. These results suggest that multi-agent LLM game performance and robustness have substantial room for improvement through context optimization. MEMO achieves the largest gains in negotiation and imperfect-information games, while RL remains more effective in perfect-information settings.
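上文摘要中 MEMO 的两个组件:保留(持久记忆库,把自对弈洞察注入为先验)与探索(带优先回放的提示演化)。下面是两者的假设性极简示意(类与函数均为本文虚构,仅说明机制,非论文实现):

```python
import heapq
import random

class MemoryBank:
    """持久记忆库:存储自对弈中总结的结构化洞察,按得分取 top-k 注入上下文。"""
    def __init__(self):
        self.insights = []                       # 堆元素:(-score, text)

    def add(self, text, score):
        heapq.heappush(self.insights, (-score, text))

    def top_k(self, k):
        return [t for _, t in heapq.nsmallest(k, self.insights)]

def prioritized_sample(states, priorities, k, alpha=1.0, seed=0):
    """优先回放:按优先级(如"稀有且决定性"程度)的 alpha 次幂加权采样状态。"""
    rng = random.Random(seed)
    weights = [p ** alpha for p in priorities]
    return rng.choices(states, weights=weights, k=k)
```

高优先级状态被更频繁地重访,对应摘要中"revisit rare and decisive states"的做法。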
[AI-71] Meissa: Multi-modal Medical Agentic Intelligence
【速读】:该论文旨在解决当前医疗多模态大语言模型(Multi-modal Large Language Models, MM-LLMs)在临床应用中依赖前沿模型API部署所引发的高成本、高延迟及隐私风险问题,同时提升其在本地环境中实现复杂决策能力(如工具调用与多智能体协作)的可行性。解决方案的关键在于提出Meissa——一个参数量仅为4B的轻量级医疗MM-LLM,通过从前沿模型中蒸馏结构化交互轨迹(trajectory),使模型能够自主学习何时进行外部交互(策略选择)以及如何执行多步交互(策略执行)。其核心创新包括:(1) 统一轨迹建模,将推理与动作轨迹纳入单一状态-动作-观测形式化框架,增强跨异构医疗环境的泛化能力;(2) 三层分层监督机制,根据模型自身错误逐步升级为工具增强和多智能体协作,显式学习难度感知的策略选择;(3) 前瞻-回溯监督策略,结合探索性前向轨迹与事后理性化执行轨迹,稳定学习高效交互策略。实验证明,Meissa在16项评估设置中的10项达到或超过商业前沿代理性能,且参数量仅为Gemini-3的1/25,端到端延迟降低22倍,完全支持离线运行。
链接: https://arxiv.org/abs/2603.09018
作者: Yixiong Chen,Xinyi Bai,Yue Pan,Zongwei Zhou,Alan Yuille
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-modal large language models (MM-LLMs) have shown strong performance in medical image understanding and clinical reasoning. Recent medical agent systems extend them with tool use and multi-agent collaboration, enabling complex decision-making. However, these systems rely almost entirely on frontier models (e.g., GPT), whose API-based deployment incurs high cost, high latency, and privacy risks that conflict with on-premise clinical requirements. We present Meissa, a lightweight 4B-parameter medical MM-LLM that brings agentic capability offline. Instead of imitating static answers, Meissa learns both when to engage external interaction (strategy selection) and how to execute multi-step interaction (strategy execution) by distilling structured trajectories from frontier models. Specifically, we propose: (1) Unified trajectory modeling: trajectories (reasoning and action traces) are represented within a single state-action-observation formalism, allowing one model to generalize across heterogeneous medical environments. (2) Three-tier stratified supervision: the model’s own errors trigger progressive escalation from direct reasoning to tool-augmented and multi-agent interaction, explicitly learning difficulty-aware strategy selection. (3) Prospective-retrospective supervision: pairing exploratory forward traces with hindsight-rationalized execution traces enables stable learning of effective interaction policies. Trained on 40K curated trajectories, Meissa matches or exceeds proprietary frontier agents in 10 of 16 evaluation settings across 13 medical benchmarks spanning radiology, pathology, and clinical reasoning. Using over 25x fewer parameters than typical frontier models like Gemini-3, Meissa operates fully offline with 22x lower end-to-end latency compared to API-based deployment. Data, models, and environments are released at this https URL.
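上文摘要提出把推理与动作轨迹统一表示为"状态-动作-观测",并按"直接推理 → 工具增强 → 多智能体协作"三层分层升级。下面用一段假设性的玩具代码示意这种统一轨迹与升级控制流(函数与字段名为本文虚构):

```python
def solve_with_escalation(case, tiers):
    """tiers: [(层名, 求解函数)],按难度逐层升级;每一步记入统一轨迹。
    case 为带 'label' 的玩具样例,真实系统中观测来自环境反馈。"""
    trajectory = []
    answer = None
    for name, solver in tiers:
        answer = solver(case)                      # 动作:本层给出答案
        correct = (answer == case["label"])        # 观测:环境/验证反馈
        trajectory.append({"state": name, "action": answer,
                           "observation": correct})
        if correct:                                # 本层已解决,停止升级
            break
    return answer, trajectory
```

失败才升级的结构即摘要所说的"错误触发渐进升级",轨迹本身可直接作为蒸馏监督数据。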
[AI-72] Gender Fairness in Audio Deepfake Detection: Performance and Disparity Analysis
【速读】:该论文旨在解决音频深度伪造检测模型中存在的性别偏差问题,即现有模型在不同性别语音上的检测性能不一致,从而影响系统的公平性和可靠性。其解决方案的关键在于引入五种经过验证的公平性度量指标(fairness metrics),对模型在男性和女性语音上的错误分布进行细致分析,从而揭示传统整体性能指标(如等错误率 EER%)所掩盖的性别差异。研究结果表明,即使EER差异较小,公平性评估仍能识别出模型在特定性别群体中的潜在失效模式,强调了在开发音频深度伪造检测系统时采用公平性导向评估的重要性,以提升系统的鲁棒性和可信度。
链接: https://arxiv.org/abs/2603.09007
作者: Aishwarya Fursule,Shruti Kshirsagar,Anderson R. Avila
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 Figures
Abstract:Audio deepfake detection aims to distinguish real human voices from those generated by Artificial Intelligence (AI) and has emerged as a significant problem in the field of voice biometrics systems. With the ever-improving quality of synthetic voice, the probability of such a voice being exploited for illicit practices like identity theft and impersonation increases. Although significant progress has been made in the field of Audio Deepfake Detection in recent times, the issue of gender bias remains underexplored and in its nascent stage. In this paper, we present a thorough analysis of gender-dependent performance and fairness in audio deepfake detection models. We use the ASVspoof 5 dataset to train a ResNet-18 classifier, evaluate detection performance across four different audio features, and compare the results with the baseline AASIST model. Beyond conventional metrics such as Equal Error Rate (EER %), we incorporate five established fairness metrics to quantify gender disparities in the model. Our results show that even when the overall EER difference between genders appears low, fairness-aware evaluation reveals disparities in error distribution that are obscured by aggregate performance measures. These findings demonstrate that reliance on standard metrics is unreliable, whereas fairness metrics provide critical insights into demographic-specific failure modes. This work highlights the importance of fairness-aware evaluation for developing a more equitable, robust, and trustworthy audio deepfake detection system.
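上文摘要的核心观察是:总体 EER 接近时,分组的错误分布仍可能差异显著。下面用一段假设性代码示意按性别分组计算 EER 并取组间差作为最简单的一种公平性指标(论文使用五种既有指标,此处仅为机制示意;阈值扫描为近似 EER 的朴素实现):

```python
def eer(scores, labels):
    """labels: 1=真实语音, 0=伪造;scores 越高越判为真实。
    朴素阈值扫描返回近似 EER(min over thresholds of max(FAR, FRR))。"""
    best = 1.0
    n_fake = max(1, labels.count(0))
    n_real = max(1, labels.count(1))
    for t in sorted(set(scores)):
        far = sum(s >= t for s, y in zip(scores, labels) if y == 0) / n_fake
        frr = sum(s < t for s, y in zip(scores, labels) if y == 1) / n_real
        best = min(best, max(far, frr))
    return best

def eer_gap(scores, labels, groups):
    """分组(如 'f'/'m')分别计算 EER,返回各组值与组间绝对差。"""
    per = {}
    for g in set(groups):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        per[g] = eer([scores[i] for i in idx], [labels[i] for i in idx])
    vals = list(per.values())
    return per, max(vals) - min(vals)
```

下例中总体样本可部分分离,但一个组 EER 为 0、另一组为 0.5:聚合指标会掩盖这种差距。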
[AI-73] Security Considerations for Multi-agent Systems DATE
【速读】:该论文旨在解决多智能体人工智能系统(Multi-agent AI Systems, MAS)所面临的安全漏洞问题,这些问题与单一AI模型的传统安全风险具有本质差异,且现有安全与治理框架并未针对此类新兴攻击面进行设计。解决方案的关键在于构建一套系统化的四阶段方法论:首先建立生产级多智能体架构的深度技术知识库;其次利用生成式AI辅助进行聚焦于MAS网络安全风险的威胁建模,并经领域专家验证;然后在个体威胁粒度上制定调查计划;最后基于三分类评分标准对16个AI安全框架进行量化评估。该方法实现了对193项独立威胁条目的结构化分析,揭示了当前框架在非确定性(Non-Determinism)和数据泄露(Data Leakage)等关键领域的覆盖不足,为MAS安全框架的选择提供了首个实证性的跨框架比较依据。
链接: https://arxiv.org/abs/2603.09002
作者: Tam Nguyen,Moses Ndebugre,Dheeraj Arremsetty
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: A Crew Scaler (501c3 pending org)'s response to NIST RFI 2026-00206. Check back for updated versions. Tam Nguyen is the corresponding author
Abstract:Multi-agent artificial intelligence systems or MAS are systems of autonomous agents that exercise delegated tool authority, share persistent memory, and coordinate via inter-agent communication. MAS introduces qualitatively distinct security vulnerabilities from those documented for singular AI models. Existing security and governance frameworks were not designed for these emerging attack surfaces. This study systematically characterizes the threat landscape of MAS and quantitatively evaluates 16 security frameworks for AI against it. A four-phase methodology is proposed: constructing a deep technical knowledge base of production multi-agent architectures; conducting generative AI-assisted threat modeling scoped to MAS cybersecurity risks and validated by domain experts; structuring survey plans at individual-threat granularity; and scoring each framework on a three-point scale against the cybersecurity risks. The risks were organized into 193 distinct main threat items across nine risk categories. The expected minimal average score is 2. No reviewed framework achieves majority coverage of any single category. Non-Determinism (mean score 1.231 across all 16 frameworks) and Data Leakage (1.340) are the most under-addressed domains. The OWASP Agentic Security Initiative leads overall at 65.3% coverage and in the design phase; the CDAO Generative AI Responsible AI Toolkit leads in development and operational coverage. These results provide the first empirical cross-framework comparison for MAS security and offer evidence-based guidance for framework selection.
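上文摘要用三分制(0-2)对 16 个框架在 193 条威胁条目上打分,并汇总为类别均分与覆盖率。下面是这一汇总口径的假设性示意(覆盖率取均分除以满分 2;玩具数据为本文虚构,非原始威胁清单):

```python
def category_means(scores_by_category):
    """各风险类别的平均得分(0-2 三分制)。"""
    return {c: sum(v) / len(v) for c, v in scores_by_category.items()}

def overall_coverage(scores_by_category):
    """总体覆盖率 = 全部条目均分 / 满分 2,对应摘要中的百分比口径。"""
    flat = [s for v in scores_by_category.values() for s in v]
    return sum(flat) / (2 * len(flat))
```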
[AI-74] Arbiter: Detecting Interference in LLM Agent System Prompts
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLM)驱动的代码生成代理(coding agents)中系统提示(system prompts)缺乏有效测试基础设施的问题,从而导致潜在行为干扰模式难以被发现和修复。其解决方案的关键在于提出Arbiter框架,该框架融合形式化评估规则与多模型LLM探测机制,能够系统性识别系统提示中的干扰模式(interference patterns)。通过在Claude Code、Codex CLI和Gemini CLI三个主流编码代理上应用该框架,研究者不仅发现了152项无监督扫描结果和21个经人工标注的干扰模式,还揭示了提示架构类型(如单体式、扁平式、模块化)与失败类别强相关但不直接影响严重性,且多模型评估可发现单模型分析无法识别的漏洞类别。这一方法显著提升了对系统提示安全性和健壮性的检测能力,并以极低的成本(0.27美元)实现跨厂商分析。
链接: https://arxiv.org/abs/2603.08993
作者: Tony Mason
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Programming Languages (cs.PL)
备注:
Abstract:System prompts for LLM-based coding agents are software artifacts that govern agent behavior, yet lack the testing infrastructure applied to conventional software. We present Arbiter, a framework combining formal evaluation rules with multi-model LLM scouring to detect interference patterns in system prompts. Applied to three major coding agent system prompts: Claude Code (Anthropic), Codex CLI (OpenAI), and Gemini CLI (Google), we identify 152 findings across the undirected scouring phase and 21 hand-labeled interference patterns in directed analysis of one vendor. We show that prompt architecture (monolithic, flat, modular) strongly correlates with observed failure class but not with severity, and that multi-model evaluation discovers categorically different vulnerability classes than single-model analysis. One scourer finding, structural data loss in Gemini CLI’s memory system, was consistent with an issue filed and patched by Google; the patch addressed the symptom without addressing the schema-level root cause identified by the scourer. Total cost of cross-vendor analysis: $0.27 USD.
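上文摘要把系统提示当作软件制品,用"形式化评估规则"检测干扰模式。下面用一段假设性代码示意其中最直观的一类规则:两条相互矛盾的指令同时出现即报告一条发现(规则表与模式均为本文虚构,真实的 Arbiter 还包含 LLM 扫描):

```python
import re

# (规则名, 指令 A 的模式, 指令 B 的模式):两者同时出现即判定为干扰
CONFLICT_RULES = [
    ("verbosity-conflict", r"\bbe concise\b", r"\bexplain in detail\b"),
    ("tool-conflict", r"\bnever run shell commands\b", r"\buse the shell\b"),
]

def scour(prompt):
    """对系统提示做一轮规则扫描,返回命中的干扰模式名列表。"""
    findings = []
    low = prompt.lower()
    for name, pat_a, pat_b in CONFLICT_RULES:
        if re.search(pat_a, low) and re.search(pat_b, low):
            findings.append(name)
    return findings
```

规则引擎覆盖可形式化的模式,LLM 扫描则补充规则难以枚举的语义冲突,两者互补。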
[AI-75] Semantic Level of Detail: Multi-Scale Knowledge Representation via Heat Kernel Diffusion on Hyperbolic Manifolds
【速读】:该论文旨在解决知识图谱等AI记忆系统中缺乏连续分辨率控制机制的问题,即如何确定抽象层次之间的定性边界以及如何在这些边界间进行有效导航。其解决方案的关键在于提出Semantic Level of Detail (SLoD) 框架,通过在庞加莱球面(Poincaré ball)上利用热核扩散定义连续缩放算子:当扩散尺度 σ → ∞ 时,嵌入被聚合为高层级摘要;当 σ → 0 时,则保留局部语义细节。该方法证明了分层一致性并具有 O(σ) 的有界近似误差和 (1+ε) 的失真度,且发现图拉普拉斯谱隙可诱导出自然的尺度边界——这些边界对应于表示发生质变的尺度点,无需人工设定分辨率参数即可自动检测。
链接: https://arxiv.org/abs/2603.08965
作者: Edward Izgorodin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 3 figures, 2 tables
Abstract:AI memory systems increasingly organize knowledge into graph structures – knowledge graphs, entity relations, community hierarchies – yet lack a principled mechanism for continuous resolution control: where do the qualitative boundaries between abstraction levels lie, and how should an agent navigate them? We introduce Semantic Level of Detail (SLoD), a framework that answers both questions by defining a continuous zoom operator via heat kernel diffusion on the Poincaré ball \mathbb{B}^d . At coarse scales ( \sigma \to \infty ), diffusion aggregates embeddings into high-level summaries; at fine scales ( \sigma \to 0 ), local semantic detail is preserved. We prove hierarchical coherence with bounded approximation error O(\sigma) and (1+\varepsilon) distortion for tree-structured hierarchies under Sarkar embedding. Crucially, we show that spectral gaps in the graph Laplacian induce emergent scale boundaries – scales where the representation undergoes qualitative transitions – which can be detected automatically without manual resolution parameters. On synthetic hierarchies (HSBM), our boundary scanner recovers planted levels with ARI up to 1.00, with detection degrading gracefully near the information-theoretic Kesten-Stigum threshold. On the full WordNet noun hierarchy (82K synsets), detected boundaries align with true taxonomic depth ( \tau = 0.79 ), demonstrating that the method discovers meaningful abstraction levels in real-world knowledge graphs without supervision.
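上文摘要的两个要点:热核 exp(-σL) 作为连续缩放算子,以及拉普拉斯谱隙给出自然的尺度边界。下面用欧氏空间上的假设性简化(论文在庞加莱球上工作)演示这两点:σ 大时同一社区的节点嵌入被聚合,谱隙位置对应层级边界。

```python
import numpy as np

def heat_smooth(adj, X, sigma):
    """对节点嵌入 X 施加热核 exp(-sigma * L):sigma 越大,聚合越粗。"""
    deg = np.diag(adj.sum(axis=1))
    L = deg - adj                              # 组合拉普拉斯
    w, V = np.linalg.eigh(L)
    K = V @ np.diag(np.exp(-sigma * w)) @ V.T  # 热核
    return K @ X

def spectral_gaps(adj, top=1):
    """返回拉普拉斯特征值中最大相邻间隔的位置,作为尺度边界的指示。"""
    deg = np.diag(adj.sum(axis=1))
    w = np.linalg.eigvalsh(deg - adj)
    gaps = np.diff(w)
    return np.argsort(gaps)[::-1][:top]
```

两个互不相连的三角形给出特征值 0,0,3,3,3,3:最大谱隙出现在两个零特征值之后,恰好对应"两个社区"这一层级。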
[AI-76] he FABRIC Strategy for Verifying Neural Feedback Systems
【速读】:该论文旨在解决神经反馈系统(neural feedback systems)中后向可达性分析(backward reachability analysis)的可扩展性问题,这一领域相较于前向可达性分析(forward reachability analysis)研究较少且技术受限。解决方案的关键在于提出新的算法,用于计算非线性神经反馈系统的后向可达集的上界和下界近似(over- and underapproximations),并将其与现有的前向可达性分析方法集成,形成一个统一的认证框架——称为FaBRIC(Forward and Backward Reachability Integration for Certification)。该集成方法显著提升了对复杂系统的可达性验证效率和精度,实验表明其优于现有最先进方法。
链接: https://arxiv.org/abs/2603.08964
作者: I. Samuel Akinwande,Sydney M. Katz,Mykel J. Kochenderfer,Clark Barrett
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Forward reachability analysis is a dominant approach for verifying reach-avoid specifications in neural feedback systems, i.e., dynamical systems controlled by neural networks, and a number of directions have been proposed and studied. In contrast, far less attention has been given to backward reachability analysis for these systems, in part because of the limited scalability of known techniques. In this work, we begin to address this gap by introducing new algorithms for computing both over- and underapproximations of backward reachable sets for nonlinear neural feedback systems. We also describe and implement an integration of these backward reachability techniques with existing ones for forward analysis. We call the resulting algorithm Forward and Backward Reachability Integration for Certification (FaBRIC). We evaluate our algorithms on a representative set of benchmarks and show that they significantly outperform the prior state of the art.
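上文摘要同时计算后向可达集的上近似与下近似。下面用一个假设性的标量线性玩具例(x' = a·x + u,u ∈ [-w, w],a > 0)示意两者的区别:上近似取"存在扰动可进入目标"的并,下近似取"任意扰动都必然进入目标"的交;真实方法处理带神经网络控制器的非线性系统。

```python
def backward_step(target, a, w):
    """一步后向可达区间。target=(lo, hi) 为目标集;假设 a > 0。
    over: 存在 u∈[-w,w] 使 a*x+u 落入 target 的 x(上近似)。
    under: 对所有 u∈[-w,w] 都有 a*x+u 落入 target 的 x(下近似)。"""
    lo, hi = target
    over = ((lo - w) / a, (hi + w) / a)
    under = ((lo + w) / a, (hi - w) / a)
    return over, under
```

下近似区间总包含于上近似区间;两者之差刻画了扰动带来的不确定性。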
[AI-77] Automated Tensor-Relational Decomposition for Large-Scale Sparse Tensor Computation
【速读】:该论文旨在解决如何在关系型数据库系统中高效执行包含高维张量(tensor)操作的计算问题,尤其是在处理大规模稀疏数据时兼顾计算效率与存储优化。其核心挑战在于传统关系运算难以有效利用高性能数值计算内核(如矩阵乘法优化库),而纯张量计算又缺乏对稀疏性的自动管理能力。解决方案的关键是提出了一种名为“大写-小写EinSum”的张量关系计算范式,它将经典的爱因斯坦求和符号(Einstein Summation Notation)扩展为支持张量与关系混合的数据处理形式,并通过自动重写机制将计算图中的密集部分映射到高效的数值内核上执行,同时保留稀疏部分由关系系统处理,从而实现性能与可扩展性的协同优化。
链接: https://arxiv.org/abs/2603.08957
作者: Yuxin Tang,Zhiyuan Xin,Zhimin Ding,Xinyu Yao,Daniel Bourgeois,Tirthak Patel,Chris Jermaine
机构: 未知
类目: Mathematical Software (cs.MS); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:
Abstract:A \emph{tensor-relational computation} is a relational computation where individual tuples carry vectors, matrices, or higher-dimensional arrays. An advantage of tensor-relational computation is that the overall computation can be executed on top of a relational system, inheriting the system’s ability to automatically handle very large inputs with high levels of sparsity while high-performance kernels (such as optimized matrix-matrix multiplication codes) can be used to perform most of the underlying mathematical operations. In this paper, we introduce upper-case-lower-case \texttt{EinSum}, which is a tensor-relational version of the classical Einstein Summation Notation. We study how to automatically rewrite a computation in Einstein Notation into upper-case-lower-case \texttt{EinSum} so that computationally intensive components are executed using efficient numerical kernels, while sparsity is managed relationally.
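"大写-小写 EinSum"的思路可以用分块稀疏矩阵乘法直观示意:大写(关系)指标通过键相等连接处理稀疏,小写(张量)指标交给稠密数值内核。下面是一个假设性玩具实现(字典模拟关系存储,np.einsum 充当高性能内核):

```python
import numpy as np

def block_einsum(A, B):
    """C[(I,K)] = sum_J einsum('ij,jk->ik', A[(I,J)], B[(J,K)])。
    A、B 为 {(大写键对): 稠密块} 的字典;缺失的键即为稀疏的零块。"""
    C = {}
    for (I, J), a in A.items():
        for (J2, K), b in B.items():
            if J == J2:                               # 关系部分:键连接
                blk = np.einsum('ij,jk->ik', a, b)    # 稠密部分:数值内核
                C[(I, K)] = C.get((I, K), 0) + blk
    return C
```

零块从不物化也从不参与计算,这正是"稀疏交给关系系统、稠密交给内核"的分工。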
[AI-78] Agent OS: From Application Silos to a Natural Language-Driven Data Ecosystem
【速读】:该论文旨在解决当前基于大型语言模型(Large Language Model, LLM)的智能代理在传统操作系统架构下所面临的交互碎片化、权限管理混乱(常被称为“影子AI”)以及上下文断裂等问题。其核心解决方案是提出一种新型范式——个人代理操作系统(Personal Agent Operating System, AgentOS),其中关键在于将传统的图形用户界面(GUI)桌面替换为以自然语言或语音为中心的统一自然用户界面(Natural User Interface, NUI),并引入一个代理内核(Agent Kernel)作为系统核心,负责实时解析用户意图、任务分解与多代理协调;同时,传统应用程序演变为可组合的模块化技能(Skills-as-Modules),支持通过自然语言规则进行软件编排。这一转变使操作系统本质上成为一个持续的数据挖掘流水线,涉及序列模式挖掘用于工作流自动化、推荐系统用于技能检索及动态演化的个人知识图谱构建,从而将AgentOS的实现定义为知识发现与数据挖掘(Knowledge Discovery and Data Mining, KDD)问题。
链接: https://arxiv.org/abs/2603.08938
作者: Rui Liu,Tao Zhe,Dongjie Wang,Zijun Yao,Kunpeng Liu,Yanjie Fu,Huan Liu,Jian Pei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid emergence of open-source, locally hosted intelligent agents marks a critical inflection point in human-computer interaction. Systems such as OpenClaw demonstrate that Large Language Model (LLM)-based agents can autonomously operate local computing environments, orchestrate workflows, and integrate external tools. However, within the current paradigm, these agents remain conventional applications running on legacy operating systems originally designed for Graphical User Interfaces (GUIs) or Command Line Interfaces (CLIs). This architectural mismatch leads to fragmented interaction models, poorly structured permission management (often described as “Shadow AI”), and severe context fragmentation. This paper proposes a new paradigm: a Personal Agent Operating System (AgentOS). In AgentOS, traditional GUI desktops are replaced by a Natural User Interface (NUI) centered on a unified natural language or voice portal. The system core becomes an Agent Kernel that interprets user intent, decomposes tasks, and coordinates multiple agents, while traditional applications evolve into modular Skills-as-Modules enabling users to compose software through natural language rules. We argue that realizing AgentOS fundamentally becomes a Knowledge Discovery and Data Mining (KDD) problem. The Agent Kernel must operate as a real-time engine for intent mining and knowledge discovery. Viewed through this lens, the operating system becomes a continuous data mining pipeline involving sequential pattern mining for workflow automation, recommender systems for skill retrieval, and dynamically evolving personal knowledge graphs. These challenges define a new research agenda for the KDD community in building the next generation of intelligent computing systems.
[AI-79] Uncovering a Winning Lottery Ticket with Continuously Relaxed Bernoulli Gates
【速读】:该论文旨在解决过参数化神经网络在资源受限场景下部署时面临的内存和计算成本过高问题。其核心挑战在于如何高效地发现稀疏子网络(即“强彩票票根”,Strong Lottery Ticket, SLT),这些子网络在不进行权重训练的情况下即可达到与完整网络相当的准确率。现有方法如edge-popup依赖于非可微的基于评分的选择机制,导致优化效率低且难以扩展。本文的关键创新在于提出使用连续松弛的伯努利门控机制(continuously relaxed Bernoulli gates),实现完全可微、端到端的SLT发现过程:仅训练门控参数而冻结所有初始权重,从而直接对ℓ₀正则化目标进行梯度优化,无需非可微梯度估计器或迭代剪枝循环。此方法首次避免了直通估计器(straight-through estimator)近似,显著提升了优化效率与可扩展性,并在全连接网络、CNN(ResNet、Wide-ResNet)及视觉Transformer(ViT、Swin-T)上验证了高达90%稀疏度下的极小精度损失,优于现有方法。
链接: https://arxiv.org/abs/2603.08914
作者: Itamar Tsayag,Ofir Lindenbaum
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Over-parameterized neural networks incur prohibitive memory and computational costs for resource-constrained deployment. The Strong Lottery Ticket (SLT) hypothesis suggests that randomly initialized networks contain sparse subnetworks achieving competitive accuracy without weight training. Existing SLT methods, notably edge-popup, rely on non-differentiable score-based selection, limiting optimization efficiency and scalability. We propose using continuously relaxed Bernoulli gates to discover SLTs through fully differentiable, end-to-end optimization - training only gating parameters while keeping all network weights frozen at their initialized values. Continuous relaxation enables direct gradient-based optimization of an \ell_0 -regularization objective, eliminating the need for non-differentiable gradient estimators or iterative pruning cycles. To our knowledge, this is the first fully differentiable approach for SLT discovery that avoids straight-through estimator approximations. Experiments across fully connected networks, CNNs (ResNet, Wide-ResNet), and Vision Transformers (ViT, Swin-T) demonstrate up to 90% sparsity with minimal accuracy loss - nearly double the sparsity achieved by edge-popup at comparable accuracy - establishing a scalable framework for pre-training network sparsification.
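上文摘要的关键是:权重全部冻结在初始化值,只训练连续松弛的伯努利门参数,并以 E[z] 作为可微的 l0 正则替代。下面用 NumPy 给出前向传播的假设性示意(Binary Concrete 式松弛;温度等超参为本文虚构,梯度训练部分省略):

```python
import numpy as np

def concrete_gate(logits, temp=0.5, rng=None):
    """松弛伯努利采样:logits 大 -> 门接近 1,小 -> 接近 0;训练期连续可微。"""
    if rng is None:
        rng = np.random.default_rng(0)
    u = rng.uniform(1e-6, 1 - 1e-6, size=logits.shape)
    noise = np.log(u) - np.log(1 - u)          # 逻辑斯蒂噪声
    return 1 / (1 + np.exp(-(noise + logits) / temp))

def sparse_forward(x, W_frozen, logits, temp=0.5):
    """对冻结权重施加门掩码;返回输出与 l0 正则的可微替代(期望开启率)。"""
    gates = concrete_gate(logits, temp)
    penalty = 1 / (1 + np.exp(-logits))        # E[z] = sigmoid(logit)
    return x @ (W_frozen * gates), penalty.mean()
```

logits 取极端值时门趋近 0/1,掩码近似二值:这正是推断期把松弛门硬化为子网络选择的依据。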
[AI-80] FedLECC: Cluster- and Loss-Guided Client Selection for Federated Learning under Non-IID Data
【速读】:该论文旨在解决跨设备联邦学习(Cross-device Federated Learning, FL)中因客户端参与受限和数据非独立同分布(non-IID)导致的模型收敛缓慢与性能下降问题。其解决方案的关键在于提出一种轻量级、聚类感知且损失引导的客户端选择策略——FedLECC,该策略通过基于标签分布相似性对客户端进行分组,并优先选择局部损失较高且具有代表性的集群和客户端,从而在每次训练轮次中选取少量但信息丰富且多样化的客户端集合,显著提升模型训练效率与最终精度。
链接: https://arxiv.org/abs/2603.08911
作者: Daniel M. Jimenez-Gutierrez,Giovanni Giunta,Mehrdad Hassanzadeh,Aris Anagnostopoulos,Ioannis Chatzigiannakis,Andrea Vitaletti
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to the IEEE International Workshop on Intelligent Cloud Computing and Networking (ICCN) from the IEEE International Conference on Computer Communications (INFOCOM) 2026
Abstract:Federated Learning (FL) enables distributed Artificial Intelligence (AI) across cloud-edge environments by allowing collaborative model training without centralizing data. In cross-device deployments, FL systems face strict communication and participation constraints, as well as strong non-independent and identically distributed (non-IID) data that degrades convergence and model quality. Since only a subset of devices (a.k.a. clients) can participate per training round, intelligent client selection becomes a key systems challenge. This paper proposes FedLECC (Federated Learning with Enhanced Cluster Choice), a lightweight, cluster-aware, and loss-guided client selection strategy for cross-device FL. FedLECC groups clients by label-distribution similarity and prioritizes clusters and clients with higher local loss, enabling the selection of a small yet informative and diverse set of clients. Experimental results under severe label skew show that FedLECC improves test accuracy by up to 12%, while reducing communication rounds by approximately 22% and overall communication overhead by up to 50% compared to strong baselines. These results demonstrate that informed client selection improves the efficiency and scalability of FL workloads in cloud-edge systems.
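上文摘要中 FedLECC 的选择逻辑是:先按标签分布相似性聚类,再优先高损失的簇与簇内客户端。下面是一个假设性的极简示意(用"主导标签"代替完整聚类,仅说明两级优先的结构,非论文算法):

```python
import numpy as np

def select_clients(label_dists, losses, k):
    """label_dists: 各客户端的标签计数分布;losses: 各客户端本地损失。
    簇按平均损失降序,簇内按损失降序,取前 k 个客户端。"""
    X = np.asarray(label_dists, dtype=float)
    assign = X.argmax(axis=1)          # 简化:主导标签作为聚类代理
    clusters = {}
    for i, c in enumerate(assign):
        clusters.setdefault(int(c), []).append(i)
    order = sorted(clusters,
                   key=lambda c: -np.mean([losses[i] for i in clusters[c]]))
    picked = []
    for c in order:
        picked.extend(sorted(clusters[c], key=lambda i: -losses[i]))
    return picked[:k]
```

真实实现还需在高损失簇之间轮转以保证多样性,此处为可读性从略。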
[AI-81] Cross-Domain Uncertainty Quantification for Selective Prediction: A Comprehensive Bound Ablation with Transfer-Informed Betting
【速读】:该论文旨在解决选择性预测(selective prediction)中风险控制的有限样本界(finite-sample bound)优化问题,即在保证预测置信度的前提下,如何设计更紧致、更实用的风险边界以提升模型在数据稀缺场景下的可靠性。其核心挑战在于平衡覆盖概率与预测集大小之间的权衡,并克服传统方法在小样本或跨域迁移时性能下降的问题。解决方案的关键在于提出Transfer-Informed Betting (TIB),这是一种基于赌注型置信序列(betting-based confidence sequences)的创新框架,通过利用源域的风险分布来“暖启动”(warm-start)Wasserstein 信息熵正则化(WSR)财富过程,在保持严格统计保证的同时显著收紧边界;同时结合 Learn Then Test(LTT)单调测试策略和多测试校正机制,实现了三重新颖性:跨域迁移、赌博驱动的置信序列表征以及对联合参数配置(α, δ)的系统评估。实验证明,TIB 在数据稀疏场景下优于标准 WSR,且在 NyayaBench 上实现比 LTT+Hoeffding 提升 5.4 倍的覆盖率,体现了其理论优越性和实际价值。
链接: https://arxiv.org/abs/2603.08907
作者: Abhinaba Basu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:We present a comprehensive ablation of nine finite-sample bound families for selective prediction with risk control, combining concentration inequalities (Hoeffding, Empirical Bernstein, Clopper-Pearson, Wasserstein DRO, CVaR) with multiple-testing corrections (union bound, Learn Then Test fixed-sequence) and betting-based confidence sequences (WSR). Our main theoretical contribution is Transfer-Informed Betting (TIB), which warm-starts the WSR wealth process using a source domain’s risk profile, achieving tighter bounds in data-scarce settings with a formal dominance guarantee. We prove that the TIB wealth process remains a valid supermartingale under all source-target divergences, that TIB dominates standard WSR when domains match, and that no data-independent warm-start can achieve better convergence. The combination of betting-based confidence sequences, LTT monotone testing, and cross-domain transfer is, to our knowledge, a three-way novelty not present in the literature. We evaluate all nine bound families on four benchmarks, MASSIVE (n=1,102), NyayaBench (n=280), CLINC-150 (n=22.5K), and Banking77 (n=13K), across 18 (alpha, delta) configurations. On MASSIVE at alpha=0.10, LTT eliminates the ln(K) union-bound penalty, achieving 94.0% guaranteed coverage versus 73.8% for Hoeffding, a 27% relative improvement. On NyayaBench, where the small calibration set makes Hoeffding-family bounds infeasible below alpha=0.20, Transfer-Informed Betting achieves 18.5% coverage at alpha=0.10, a 5.4x improvement over LTT + Hoeffding. We additionally compare with split-conformal prediction, showing that conformal methods produce prediction sets (avg. 1.67 classes) whereas selective prediction provides single-prediction risk guarantees. We apply these methods to agentic caching systems, formalizing a progressive trust model where the guarantee determines when cached responses can be served autonomously.
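上文摘要对比的多种界中,Hoeffding 界最易示意:经验选择性风险加上 sqrt(ln(1/δ)/(2n)) 的置信项。下面是假设性的极简实现,并按 LTT 固定序列的思路从最严阈值起放宽、首次越界即停(玩具数据,非论文的九族界全套):

```python
import math

def hoeffding_ucb(emp_risk, n, delta):
    """n 个被选中样本上经验风险的 Hoeffding 上置信界。"""
    return emp_risk + math.sqrt(math.log(1 / delta) / (2 * n))

def certify_threshold(confs, errors, alpha, delta):
    """固定序列搜索:阈值从高到低放宽,返回风险上界仍 <= alpha 的
    最宽松置信度阈值;首次越界即停止(无解返回 None)。"""
    best = None
    for t in sorted(set(confs), reverse=True):
        sel = [e for c, e in zip(confs, errors) if c >= t]
        if hoeffding_ucb(sum(sel) / len(sel), len(sel), delta) > alpha:
            break
        best = t
    return best
```

置信项随 n 按 1/sqrt(n) 收缩,这正是小校准集(如 NyayaBench 的 n=280)上 Hoeffding 族不可行、需要 TIB 这类更紧界的原因。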
[AI-82] NetDiffuser: Deceiving DNN-Based Network Attack Detection Systems with Diffusion-Generated Adversarial Traffic
Summary: This paper addresses the vulnerability of deep learning (DL)-based Network Intrusion Detection Systems (NIDS) to Natural Adversarial Examples (NAEs). Because NAEs closely resemble legitimate traffic, they are difficult for both humans and models to identify, posing a serious threat to NIDS. The key to the solution is the proposed NetDiffuser framework, with two core innovations: first, a new feature categorization algorithm identifies relatively independent features in network traffic, so that perturbing them preserves flow validity while minimally altering the data's structure; second, diffusion models are applied for the first time to inject semantically consistent perturbations, producing hard-to-detect NAEs. Experiments show that NetDiffuser substantially raises attack success rates and sharply degrades the AUC-ROC of existing adversarial detectors.
Link: https://arxiv.org/abs/2603.08901
Authors: Pratyay Kumar, Abu Saleh Md Tayeen, Satyajayant Misra, Huiping Cao, Jiefei Liu, Qixu Gong, Jayashree Harikumar
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Deep learning (DL)-based Network Intrusion Detection System (NIDS) has demonstrated great promise in detecting malicious network traffic. However, they face significant security risks due to their vulnerability to adversarial examples (AEs). Most existing adversarial attacks maliciously perturb data to maximize misclassification errors. Among AEs, natural adversarial examples (NAEs) are particularly difficult to detect because they closely resemble real data, making them challenging for both humans and machine learning models to distinguish from legitimate inputs. Creating NAEs is crucial for testing and strengthening NIDS defenses. This paper proposes NetDiffuser, a novel framework for generating NAEs capable of deceiving NIDS. NetDiffuser consists of two novel components. First, a new feature categorization algorithm is designed to identify relatively independent features in network traffic. Perturbing these features minimizes changes while preserving network flow validity. The second component is a novel application of diffusion models to inject semantically consistent perturbations for generating NAEs. NetDiffuser performance was extensively evaluated using three benchmark NIDS datasets across various model architectures and state-of-the-art adversarial detectors. Our experimental results show that NetDiffuser achieves up to a 29.93% higher attack success rate and reduces AE detection performance by at least 0.267 (in some cases up to 0.534) in the Area under the Receiver Operating Characteristic Curve (AUC-ROC) score compared to the baseline attacks.
[AI-83] A New Modeling to Feature Selection Based on the Fuzzy Rough Set Theory in Normal and Optimistic States on Hybrid Information Systems
Summary: This paper targets two key problems that fuzzy rough set theory faces in feature selection on high-dimensional hybrid information systems: first, obtaining fuzzy equivalence relations through intersection operations is time-consuming and memory-intensive in high-dimensional spaces; second, the approach tends to produce noisy data, complicating feature selection. The key to the solution is a new feature selection model, FSbuHD, whose core innovation is to recast feature selection as an optimization problem: instead of solving it directly, the model builds the fuzzy equivalence relation from the combined distance between objects and solves the problem with a suitable meta-heuristic algorithm. FSbuHD supports two modes, normal and optimistic, switching according to which of the two introduced fuzzy equivalence relations is selected, thereby improving both efficiency and effectiveness.
Link: https://arxiv.org/abs/2603.08900
Authors: Mohammad Hossein Safarpour, Seyed Mohammad Alavi, Mohammad Izadikhah, Hossein Dibachi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages, 14 figures, 9 tables. Published version available at International Journal of Engineering. This preprint is distributed under CC BY 4.0 license
Abstract:Considering the high volume, wide variety, and rapid speed of data generation, investigating feature selection methods for big data presents various applications and advantages. By removing irrelevant and redundant features, feature selection reduces data dimensions, thereby facilitating optimal decision-making within decision systems. One of the key tools for feature selection in hybrid information systems is fuzzy rough set theory. However, this theory faces two significant challenges: First, obtaining fuzzy equivalence relations through intersection operations in high-dimensional spaces can be both time-consuming and memory-intensive. Additionally, this method may produce noisy data, complicating the feature selection process. The purpose and innovation of this paper are to address these issues. We proposed a new feature selection model that calculates the combined distance between objects and subsequently used this information to derive the fuzzy equivalence relation. Rather than directly solving the feature selection problem, this approach reformulates it into an optimization problem that can be tackled using appropriate meta-heuristic algorithms. We have named this new approach FSbuHD. The FSbuHD model operates in two modes - normal and optimistic - based on the selection of one of the two introduced fuzzy equivalence relations. The model is then tested on standard datasets from the UCI repository and compared with other algorithms. The results of this research demonstrate that FSbuHD is one of the most efficient and effective methods for feature selection when compared to previous methods and algorithms.
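The core step the abstract describes, deriving a fuzzy relation from a combined distance between objects rather than intersecting per-feature relations, can be sketched as follows. The normalized-distance formula and the handling of mixed numeric and categorical attributes are illustrative assumptions, not the paper's exact definitions.

```python
def combined_distance(x, y, numeric_ranges):
    """Distance over a hybrid record: range-normalized absolute difference
    for numeric attributes, 0/1 mismatch for categorical ones, averaged."""
    total = 0.0
    for i, (a, b) in enumerate(zip(x, y)):
        if i in numeric_ranges:                      # numeric attribute
            lo, hi = numeric_ranges[i]
            total += abs(a - b) / (hi - lo) if hi > lo else 0.0
        else:                                        # categorical attribute
            total += 0.0 if a == b else 1.0
    return total / len(x)

def fuzzy_relation(objects, numeric_ranges):
    """R[i][j] = 1 - d(x_i, x_j): built in one pass over object pairs,
    so no per-feature relation intersection is ever materialized."""
    n = len(objects)
    return [[1.0 - combined_distance(objects[i], objects[j], numeric_ranges)
             for j in range(n)] for i in range(n)]

data = [(1.0, "red"), (3.0, "red"), (5.0, "blue")]   # hybrid records
R = fuzzy_relation(data, numeric_ranges={0: (1.0, 5.0)})
```

The resulting matrix is symmetric and reflexive; in the paper this relation then feeds a meta-heuristic search over feature subsets.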
[AI-84] Quantifying the Accuracy and Cost Impact of Design Decisions in Budget-Constrained Agentic LLM Search LREC
Summary: This paper studies how deployment-time resource constraints (such as limits on tool calls and generated tokens) affect agentic Retrieval-Augmented Generation (RAG) systems. The key to the solution is Budget-Constrained Agentic Search (BCAS), a model-agnostic evaluation harness that explicitly surfaces the remaining budget and gates tool use, enabling a systematic study of how search depth, retrieval strategy, and completion budget affect accuracy and cost under fixed budgets. Controlled comparisons across six LLMs and three question-answering benchmarks show that moderately increasing the number of searches improves accuracy up to a saturation point; that hybrid lexical and dense retrieval with lightweight re-ranking yields the largest average gains; and that larger completion budgets help most on tasks requiring multi-hop synthesis, such as HotpotQA.
Link: https://arxiv.org/abs/2603.08877
Authors: Kyle McCleary, James Ghawaly
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted in 2026 Language Resources and Evaluation Conference (LREC)
Abstract:Agentic Retrieval-Augmented Generation (RAG) systems combine iterative search, planning prompts, and retrieval backends, but deployed settings impose explicit budgets on tool calls and completion tokens. We present a controlled measurement study of how search depth, retrieval strategy, and completion budget affect accuracy and cost under fixed constraints. Using Budget-Constrained Agentic Search (BCAS), a model-agnostic evaluation harness that surfaces remaining budget and gates tool use, we run comparisons across six LLMs and three question-answering benchmarks. Across models and datasets, accuracy improves with additional searches up to a small cap, hybrid lexical and dense retrieval with lightweight re-ranking produces the largest average gains in our ablation grid, and larger completion budgets are most helpful on HotpotQA-style synthesis. These results provide practical guidance for configuring budgeted agentic retrieval pipelines and are accompanied by reproducible prompts and evaluation settings.
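The budget gating that BCAS performs can be illustrated with a toy harness loop: the harness, not the model, enforces the caps, and the remaining budget is surfaced to the model so it can plan around it. The action protocol, status string, and toy model below are our own assumptions, not the paper's interface.

```python
def budget_gated_search(ask_model, run_search, max_searches, max_tokens):
    """Loop until the model answers or a budget is exhausted."""
    searches_left, tokens_left = max_searches, max_tokens
    context = []
    while True:
        status = f"[searches left: {searches_left}, tokens left: {tokens_left}]"
        action, payload, cost = ask_model(status, context)
        tokens_left -= cost
        if action == "answer" or tokens_left <= 0:
            return payload
        if action == "search":
            if searches_left <= 0:        # gate: refuse the tool call
                context.append("search budget exhausted")
                continue
            searches_left -= 1
            context.append(run_search(payload))

# Toy agent: searches until it holds two documents, then answers with them.
def toy_model(status, context):
    if len([c for c in context if c.startswith("doc")]) < 2:
        return ("search", "q", 10)
    return ("answer", context, 5)

result = budget_gated_search(toy_model, lambda q: f"doc:{q}", 3, 100)
```

The token cap acts as a hard stop even if the model never volunteers an answer, which mirrors how a deployed harness must fail closed.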
[AI-85] Are Expressive Encoders Necessary for Discrete Graph Generation?
Summary: This paper addresses the low computational efficiency and limited structural expressiveness of conventional neural architectures (such as transformers) for discrete graph generation, particularly the validity and speed bottlenecks in generating molecules and topologically complex graphs (such as trees and planar graphs). The key to the solution is GenGNN, a modular message-passing framework that uses residual connections to mitigate oversmoothing on complicated graph structure and, combined with diffusion models, enables efficient, high-validity graph generation. Experiments show that diffusion models built on GenGNN reach over 90% validity on the Tree and Planar datasets at 2-5x faster inference than graph transformers; for molecule generation, DiGress with a GenGNN backbone attains 99.49% validity, demonstrating the potential of GNNs as backbones for discrete diffusion models.
Link: https://arxiv.org/abs/2603.08825
Authors: Jay Revolinsky, Harry Shomer, Jiliang Tang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages, 15 figures, 10 tables
Abstract:Discrete graph generation has emerged as a powerful paradigm for modeling graph data, often relying on highly expressive neural backbones such as transformers or higher-order architectures. We revisit this design choice by introducing GenGNN, a modular message-passing framework for graph generation. Diffusion models with GenGNN achieve more than 90% validity on Tree and Planar datasets, within margins of graph transformers, at 2-5x faster inference speed. For molecule generation, DiGress with a GenGNN backbone achieves 99.49% Validity. A systematic ablation study shows the benefit provided by each GenGNN component, indicating the need for residual connections to mitigate oversmoothing on complicated graph-structure. Through scaling analyses, we apply a principled metric-space view to investigate learned diffusion representations and uncover whether GNNs can be expressive neural backbones for discrete diffusion.
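A minimal sketch of the residual message-passing idea credited with mitigating oversmoothing: each node averages its neighbors, applies a linear map, and adds the result back onto its own state, so repeated layers do not collapse all node features together. The aggregation rule and toy weights are illustrative, not GenGNN's actual layer.

```python
def mp_layer_residual(h, edges, W):
    """One message-passing step: mean-aggregate neighbor features, apply a
    d x d linear map W, then add the residual h. Without the residual,
    repeated averaging drives node states together (oversmoothing)."""
    n, d = len(h), len(h[0])
    agg = [[0.0] * d for _ in range(n)]
    deg = [0] * n
    for u, v in edges:                       # undirected: both directions
        for k in range(d):
            agg[u][k] += h[v][k]
            agg[v][k] += h[u][k]
        deg[u] += 1
        deg[v] += 1
    out = []
    for i in range(n):
        mean = [a / deg[i] if deg[i] else 0.0 for a in agg[i]]
        msg = [sum(mean[k] * W[k][j] for k in range(d)) for j in range(d)]
        out.append([h[i][j] + msg[j] for j in range(d)])   # residual add
    return out

h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # 3 nodes, 2-d features
edges = [(0, 1), (1, 2)]                     # path graph
I = [[1.0, 0.0], [0.0, 1.0]]                 # identity weights for the demo
h1 = mp_layer_residual(h, edges, I)
```

With identity weights the update reduces to h + mean(neighbors), making the residual's effect easy to verify by hand.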
[AI-86] Test-Driven AI Agent Definition (TDAD): Compiling Tool-Using Agents from Behavioral Specifications
Summary: This paper tackles the lack of measurable behavioral compliance for large language model (LLM) agents in production: small prompt changes cause silent regressions, tool misuse goes undetected, and policy violations surface only after deployment. The proposed Test-Driven AI Agent Definition (TDAD) methodology treats prompts as compiled artifacts: engineers provide behavioral specifications, one coding agent converts them into executable tests, and another iteratively refines the prompt until the tests pass. TDAD introduces three key mechanisms to mitigate the risk of specification gaming: (1) visible/hidden test splits that withhold evaluation tests during compilation; (2) semantic mutation testing, in which a post-compilation agent generates plausible faulty prompt variants and the harness measures whether the test suite detects them; and (3) spec evolution scenarios that quantify regression safety as requirements change. Experiments on the SpecSuite-Core benchmark show high compilation success rates and strong robustness results.
Link: https://arxiv.org/abs/2603.08806
Authors: Tzafrir Rehan
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 9 pages, 2 figures, open benchmark at this https URL
Abstract:We present Test-Driven AI Agent Definition (TDAD), a methodology that treats agent prompts as compiled artifacts: engineers provide behavioral specifications, a coding agent converts them into executable tests, and a second coding agent iteratively refines the prompt until tests pass. Deploying tool-using LLM agents in production requires measurable behavioral compliance that current development practices cannot provide. Small prompt changes cause silent regressions, tool misuse goes undetected, and policy violations emerge only after deployment. To mitigate specification gaming, TDAD introduces three mechanisms: (1) visible/hidden test splits that withhold evaluation tests during compilation, (2) semantic mutation testing via a post-compilation agent that generates plausible faulty prompt variants, with the harness measuring whether the test suite detects them, and (3) spec evolution scenarios that quantify regression safety when requirements change. We evaluate TDAD on SpecSuite-Core, a benchmark of four deeply-specified agents spanning policy compliance, grounded analytics, runbook adherence, and deterministic enforcement. Across 24 independent trials, TDAD achieves 92% v1 compilation success with 97% mean hidden pass rate; evolved specifications compile at 58%, with most failed runs passing all visible tests except 1-2, and show 86-100% mutation scores, 78% v2 hidden pass rate, and 97% regression safety scores. The implementation is available as an open benchmark at this https URL.
[AI-87] Multi-level meta-reinforcement learning with skill-based curriculum
Summary: This paper addresses the long-standing difficulty of systematically inferring and exploiting multi-level structure in sequential decision-making, in particular how to compress a Markov decision process (MDP) while preserving its semantic structure when complex goals are assembled from sub-tasks. The key to the solution is an efficient multi-level compression procedure: a parametric family of policies at one level is treated as single actions in a higher-level MDP, yielding structure-preserving, less stochastic higher-level MDPs so that the original complex MDP can be solved progressively; in addition, factorizing policies into problem-specific embeddings and skills (including higher-order functions) enables skill transfer across tasks and levels. The whole framework is embedded in a curriculum learning paradigm that gradually increases task difficulty and promotes transfer across MDPs and levels, reducing computational cost and accelerating the discovery of long-term optimal policies while remaining consistent under mild assumptions.
Link: https://arxiv.org/abs/2603.08773
Authors: Sichen Yang (Johns Hopkins University), Mauro Maggioni (Johns Hopkins University)
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 78 pages, 12 figures
Abstract:We consider problems in sequential decision making with natural multi-level structure, where sub-tasks are assembled together to accomplish complex goals. Systematically inferring and leveraging hierarchical structure has remained a longstanding challenge; we describe an efficient multi-level procedure for repeatedly compressing Markov decision processes (MDPs), wherein a parametric family of policies at one level is treated as single actions in the compressed MDPs at higher levels, while preserving the semantic meanings and structure of the original MDP, and mimicking the natural logic to address a complex MDP. Higher-level MDPs are themselves independent MDPs with less stochasticity, and may be solved using existing algorithms. As a byproduct, spatial or temporal scales may be coarsened at higher levels, making it more efficient to find long-term optimal policies. The multi-level representation delivered by this procedure decouples sub-tasks from each other and usually greatly reduces unnecessary stochasticity and the policy search space, leading to fewer iterations and computations when solving the MDPs. A second fundamental aspect of this work is that these multi-level decompositions plus the factorization of policies into embeddings (problem-specific) and skills (including higher-order functions) yield new transfer opportunities of skills across different problems and different levels. This whole process is framed within curriculum learning, wherein a teacher organizes the student agent’s learning process in a way that gradually increases the difficulty of tasks and promotes transfer across MDPs and levels within and across curricula. The consistency of this framework and its benefits can be guaranteed under mild assumptions. We demonstrate abstraction, transferability, and curriculum learning in examples, including MazeBase+, a more complex variant of the MazeBase example.
[AI-88] Clear Compelling Arguments: Rethinking the Foundations of Frontier AI Safety Cases
Summary: This paper addresses the lack of reliable, defensible safety argumentation frameworks for evaluating frontier AI systems. Safety cases, structured and defensible arguments that a system is acceptably safe to deploy, are well established in high-risk domains such as aerospace and nuclear power, but their application to frontier AI is still nascent. The paper observes that existing alignment-community work drawing on assurance practice has significant limitations and therefore argues for rethinking the methodology of alignment safety cases. The key to the solution is to bring mature theory and methods from safety assurance into the field, combined with a case study on deceptive alignment and CBRN (chemical, biological, radiological, nuclear) capabilities, in order to build a more systematic, verifiable, and practical safety case framework that can underpin the safety of frontier AI systems.
Link: https://arxiv.org/abs/2603.08760
Authors: Shaun Feakins, Ibrahim Habli, Phillip Morgan
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Full paper presented at the International Association of Safe and Ethical AI 2026 Conference (IASEAI 26). 10 pages, 8 figures
Abstract:This paper contributes to the nascent debate around safety cases for frontier AI systems. Safety cases are structured, defensible arguments that a system is acceptably safe to deploy in a given context. Historically, they have been used in safety-critical industries, such as aerospace, nuclear or automotive. As a result, safety cases for frontier AI have risen in prominence, both in the safety policies of leading frontier developers and in international research agendas proposed by leaders in generative AI, such as the Singapore Consensus on Global AI Safety Research Priorities and the International AI Safety Report. This paper appraises this work. We note that research conducted within the alignment community which draws explicitly on lessons from the assurance community has significant limitations. We therefore aim to rethink existing approaches to alignment safety cases. We offer lessons from existing methodologies within safety assurance and outline the limitations involved in the alignment community’s current approach. Building on this foundation, we present a case study for a safety case focused on Deceptive Alignment and CBRN capabilities, drawing on existing, theoretical safety case “sketches” created by the alignment safety case community. Overall, we contribute holistic insights from the field of safety assurance via rigorous theory and methodologies that have been applied in safety-critical contexts. We do so in order to create a better foundational framework for robust, defensible and useful safety case methodologies which can help to assure the safety of frontier AI systems.
[AI-89] EDMFormer: Genre-Specific Self-Supervised Learning for Music Structure Segmentation
Summary: This paper addresses the poor performance of existing music structure segmentation models on Electronic Dance Music (EDM). Traditional approaches rely on lyrical or harmonic similarity, which suits pop music but not EDM, whose structure is defined by changes in energy, rhythm, and timbre. The key to the solution is EDMFormer, a transformer model that combines self-supervised audio embeddings with an EDM-specific dataset and structural priors, together with the release of EDM-98, a dataset of 98 professionally annotated EDM tracks; this substantially improves boundary detection and section labelling for key EDM segments such as buildups and drops.
Link: https://arxiv.org/abs/2603.08759
Authors: Sahal Sajeer, Krish Patel, Oscar Chung, Joel Song Bae
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Published in CUCAI 2026 conference proceedings
Abstract:Music structure segmentation is a key task in audio analysis, but existing models perform poorly on Electronic Dance Music (EDM). This problem exists because most approaches rely on lyrical or harmonic similarity, which works well for pop music but not for EDM. EDM structure is instead defined by changes in energy, rhythm, and timbre, with different sections such as buildup, drop, and breakdown. We introduce EDMFormer, a transformer model that combines self-supervised audio embeddings using an EDM-specific dataset and taxonomy. We release this dataset as EDM-98: a group of 98 professionally annotated EDM tracks. EDMFormer improves boundary detection and section labelling compared to existing models, particularly for drops and buildups. The results suggest that combining learned representations with genre-specific data and structural priors is effective for EDM and could be applied to other specialized music genres or broader audio domains.
[AI-90] Generalized Reduction to the Isotropy for Flexible Equivariant Neural Fields
Summary: This paper addresses the difficulty of constructing group-invariant functions on heterogeneous product spaces in geometric learning, i.e., products of distinct spaces carrying different group actions, where standard techniques do not directly apply. The key to the solution: when a group G acts transitively on a space M, any G-invariant function on the product space X × M can be reduced to a function on X alone that is invariant under the stabilizer subgroup H of M; the reduction rests on an explicit orbit equivalence (X × M)/G ≅ X/H, yielding a principled reduction that preserves expressivity. This characterization is then applied to Equivariant Neural Fields, extending them to arbitrary group actions and homogeneous conditioning spaces and removing the main structural constraints of existing methods.
Link: https://arxiv.org/abs/2603.08758
Authors: Alejandro García-Castellanos, Gijs Bellaard, Remco Duits, Daniel Pelt, Erik J Bekkers
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Many geometric learning problems require invariants on heterogeneous product spaces, i.e., products of distinct spaces carrying different group actions, where standard techniques do not directly apply. We show that, when a group G acts transitively on a space M, any G-invariant function on a product space X × M can be reduced to an invariant of the isotropy subgroup H of M acting on X alone. Our approach establishes an explicit orbit equivalence (X × M)/G ≅ X/H, yielding a principled reduction that preserves expressivity. We apply this characterization to Equivariant Neural Fields, extending them to arbitrary group actions and homogeneous conditioning spaces, and thereby removing the major structural constraints imposed by existing methods.
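The reduction in the abstract can be written out concretely. Under the stated transitivity assumption, fix a base point m_0 in M with stabilizer H and, for each m, some group element g_m with g_m · m_0 = m (the choice of such a section g_m is an illustrative assumption; the paper's own construction may differ):

```latex
% G-invariance of f lets us move any (x, m) back to the base point m_0:
f(x, m) \;=\; f\bigl(g_m^{-1}\cdot x,\; g_m^{-1}\cdot m\bigr)
        \;=\; f\bigl(g_m^{-1}\cdot x,\; m_0\bigr)
        \;=:\; \tilde f\bigl(g_m^{-1}\cdot x\bigr).
% The reduced function is invariant under the stabilizer H of m_0:
\tilde f(h \cdot x) = f(h \cdot x,\; m_0) = f(x,\; h^{-1}\cdot m_0)
                    = f(x,\; m_0) = \tilde f(x)
\qquad \forall\, h \in H,
```

so G-invariants on X × M correspond exactly to H-invariants on X, which is the content of the orbit equivalence (X × M)/G ≅ X/H.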
[AI-91] Turn: A Language for Agentic Computation
Summary: This paper addresses the lack of language-level guarantees in current generative AI application development: existing approaches augment general-purpose languages with frameworks, encoding critical invariants (bounded context, typed inference output, credential isolation, durable state) as application-level conventions rather than language guarantees. The key to the solution is Turn, a compiled, actor-based programming language for agentic software, with five language-level constructs: (1) Cognitive Type Safety, which makes LLM inference a typed primitive, with the compiler generating a JSON Schema and the runtime validating model output; (2) a confidence operator enabling deterministic control flow gated on model certainty; (3) an Erlang-derived actor-based process model providing isolated context windows, persistent memory, and mailboxes; (4) a capability-based identity system in which the VM host returns unforgeable handles so raw credentials never enter agent memory; and (5) compile-time schema absorption (use schema::protocol), which synthesizes typed API bindings from external specifications such as OpenAPI, GraphQL, and FHIR. Together, these designs unify static safety and dynamic controllability for LLM-driven behavior.
Link: https://arxiv.org/abs/2603.08755
Authors: Muyukani Kizito
Affiliations: Unknown
Subjects: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:
Abstract:We present Turn, a compiled, actor-based programming language – statically typed for schema inference, dynamically typed at the value level – for agentic software: programs that reason and act autonomously by delegating inference to large language models (LLMs). Existing approaches augment general-purpose languages with frameworks, encoding critical invariants (bounded context, typed inference output, credential isolation, durable state) as application-level conventions rather than language guarantees. Turn introduces five language-level constructs that address this gap. Cognitive Type Safety makes LLM inference a typed primitive: the compiler generates a JSON Schema from a struct definition and the VM validates model output before binding. The confidence operator enables deterministic control flow gated on model certainty. Turn’s actor-based process model, derived from Erlang, gives each agent an isolated context window, persistent memory, and mailbox. A capability-based identity system returns opaque, unforgeable handles from the VM host, ensuring raw credentials never enter agent memory. Finally, compile-time schema absorption (use schema::protocol) synthesizes typed API bindings from external specifications at compile time; the openapi adapter is shipped, with graphql, fhir, and mcp in active development. We describe the language design, type rules, schema semantics, and a Rust-based bytecode VM, and evaluate Turn against representative agentic workloads. Turn is open source at this https URL.
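Turn's "validate model output before binding" step can be mimicked in ordinary Python to show the mechanism: a schema derived from a struct, a structural check, and bounded retries before failing loudly. The struct, schema subset, retry policy, and stub model below are illustrative assumptions, not Turn's implementation.

```python
import json

# Schema a compiler might derive from a struct like:
#   struct Triage { severity: Int, escalate: Bool }
SCHEMA = {"type": "object",
          "required": ["severity", "escalate"],
          "properties": {"severity": {"type": "integer"},
                         "escalate": {"type": "boolean"}}}

def validate(payload, schema):
    """Minimal structural check: required keys present, types match.
    Supports only the tiny schema subset used in this sketch."""
    kinds = {"integer": int, "boolean": bool, "string": str}
    if not isinstance(payload, dict):
        return False
    for key in schema["required"]:
        if key not in payload:
            return False
    for key, spec in schema["properties"].items():
        if key in payload and not isinstance(payload[key], kinds[spec["type"]]):
            return False
    return True

def typed_infer(call_model, prompt, schema, retries=2):
    """Inference as a typed primitive: parse, validate, bind, or fail after
    bounded retries rather than leaking malformed output downstream."""
    for _ in range(retries + 1):
        raw = call_model(prompt)
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if validate(payload, schema):
            return payload
    raise TypeError("model output never satisfied the schema")

stub = lambda p: '{"severity": 3, "escalate": true}'   # stand-in LLM
result = typed_infer(stub, "triage this incident", SCHEMA)
```

In Turn this boundary is enforced by the VM rather than user code, which is the point of making it a language primitive.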
[AI-92] Hindsight Credit Assignment for Long-Horizon LLM Agents
Summary: This paper addresses the credit assignment problem caused by sparse rewards for large language model (LLM) agents on long-horizon, multi-step tasks. Existing value-free methods such as Group Relative Policy Optimization (GRPO) face two core bottlenecks: inaccurate step-level Q-value estimation and misaligned value baselines for intermediate states. The key to the solution is HCAPO, the first framework to bring hindsight credit assignment to LLM agents: it uses the LLM itself as a post-hoc critic, refining step-level Q-values through hindsight reasoning, while its multi-scale advantage mechanism compensates for biased value baselines at critical decision states, improving exploration efficiency and decision conciseness and ensuring scalability on complex long-horizon tasks.
Link: https://arxiv.org/abs/2603.08754
Authors: Hui-Ze Tan, Xiao-Wen Yang, Hao Chen, Jie-Jing Shao, Yi Wen, Yuteng Shen, Weihong Luo, Xiku Du, Lan-Zhe Guo, Yu-Feng Li
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Model (LLM) agents often face significant credit assignment challenges in long-horizon, multi-step tasks due to sparse rewards. Existing value-free methods, such as Group Relative Policy Optimization (GRPO), encounter two fundamental bottlenecks: inaccurate step-level Q-value estimation and misaligned value baselines for intermediate states. To address these limitations, we introduce HCAPO, the first framework to integrate hindsight credit assignment into LLM agents. HCAPO leverages the LLM itself as a post-hoc critic to refine step-level Q-values through hindsight reasoning. Furthermore, HCAPO’s multi-scale advantage mechanism effectively supplements the inaccurate value baselines at critical decision states. Evaluations across three challenging benchmarks, including WebShop and ALFWorld, demonstrate that HCAPO consistently outperforms state-of-the-art RL methods. Notably, HCAPO achieves a 7.7% improvement in success rate on WebShop and a 13.8% on ALFWorld over GRPO using the Qwen2.5-7B-Instruct model. These results indicate that HCAPO significantly enhances exploration efficiency, promotes concise decision-making, and ensures scalability in complex, long-horizon tasks.
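One way to picture a multi-scale advantage like HCAPO's is a trajectory-level, GRPO-style group-relative term combined with a step-level term built from critic-refined Q-values. The additive form, the beta weight, and the toy numbers below are our assumptions for illustration; the paper does not spell out its exact formula in this abstract.

```python
def group_relative_advantage(returns):
    """GRPO-style trajectory advantage: z-score within a rollout group."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / n
    std = var ** 0.5 or 1.0          # fall back to 1.0 when all returns tie
    return [(r - mean) / std for r in returns]

def multi_scale_advantages(group_returns, step_q, step_baseline, beta=0.5):
    """Per-step advantage = trajectory-level term + beta * hindsight term.

    step_q[i][t]: step-level Q-values refined by a post-hoc critic;
    step_baseline[i][t]: the value baseline at that state. Both are plain
    inputs here; in HCAPO they come from hindsight reasoning over the
    finished rollout."""
    traj_adv = group_relative_advantage(group_returns)
    return [[traj_adv[i] + beta * (q - b)
             for q, b in zip(step_q[i], step_baseline[i])]
            for i in range(len(group_returns))]

returns = [1.0, 0.0]                          # two rollouts in the group
q = [[0.9, 1.0], [0.2, 0.0]]                  # critic-refined step Qs
baseline = [[0.5, 0.5], [0.5, 0.5]]
adv = multi_scale_advantages(returns, q, baseline)
```

The step-level term differentiates steps within a single rollout, which a purely trajectory-level advantage (constant across the rollout) cannot do.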
[AI-93] Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4
Summary: This paper addresses the sensitivity questions raised by four-bit floating-point (FP4) quantization in large language models (LLMs): it is unclear how different transformer components (MLP layers, attention projections, etc.) and model depths respond to FP4 precision loss, and whether sensitivity generalizes across existing FP4 formats such as MXFP4 and NVFP4. The key to the solution is a controlled component-wise and block-wise isolation methodology, used to systematically analyze the quantization sensitivity of the two FP4 formats across three Qwen2.5 model scales (0.5B, 7B, 14B). The study finds that MLP up- and down-projection layers are consistently the most sensitive components, and that sensitivity is not confined to the final blocks: early blocks can also be highly sensitive, particularly under MXFP4, providing a diagnostic basis and optimization directions for FP4 deployment.
Link: https://arxiv.org/abs/2603.08747
Authors: Musa Cim, Burak Topcu, Mahmut Taylan Kandemir
Affiliations: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Quantization addresses the high resource demand for large language models (LLMs) by alleviating memory pressure and bandwidth congestion and providing significantly scaled compute power with a tolerable impact on accuracy. Four-bit floating point (FP4), the lowest-precision format that preserves essential numerical properties such as exponent and sign, has begun to be adopted in cutting-edge architectures, including Blackwell and AMD CDNA, to support LLM quantization and reduce deployment costs. Although aggressive quantization can yield efficiency gains, the quantization sensitivity of within-transformer layers and whether these sensitivities generalize across existing FP4 formats and model scales remain underexplored. To elucidate quantization sensitivity, this study conducts a systematic analysis of two FP4 formats, MXFP4 and NVFP4, across three Qwen2.5 model scales (0.5B, 7B, and 14B), using controlled component-wise and block-wise isolation methodologies. We observe that MLP up- and down-projection layers consistently dominate in terms of sensitivity, while gate and attention projections are moderately and substantially less sensitive to FP4 quantization, respectively. We further find that sensitivity does not universally localize to the final blocks, but early blocks can be highly sensitive, particularly under MXFP4. Our results provide a diagnostic characterization of the inference behavior of FP4 across components, depths, and FP4 formats.
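The FP4 (E2M1) format being probed can be simulated to see where the precision loss comes from: only eight magnitudes are representable, so a shared per-block scale decides which values survive rounding. The power-of-two scale below follows the MXFP4 convention in spirit; block-size handling and rounding are simplified assumptions, not either format's exact specification.

```python
import math

# Representable magnitudes of FP4 E2M1 (sign is handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block_fp4(block):
    """Pick a power-of-two scale so the block's max magnitude stays within
    the FP4 range (max 6.0), round each element to the nearest grid point,
    and rescale back (i.e., simulate quantize-dequantize)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0))   # MXFP4-style scale
    out = []
    for v in block:
        mag = min(abs(v) / scale, 6.0)                # clamp to FP4 range
        q = min(FP4_GRID, key=lambda g: abs(g - mag)) # nearest grid point
        out.append(math.copysign(q * scale, v))
    return out

block = [0.07, -0.5, 1.0, 2.9]
deq = quantize_block_fp4(block)
err = max(abs(a - b) for a, b in zip(block, deq))     # worst-case error
```

Note how the small 0.07 entry flushes to zero once the scale is set by the block's largest value: outlier-dominated blocks are exactly where component-wise sensitivity differences come from.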
[AI-94] Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention
Summary: This paper addresses the memory bottleneck caused by the KV cache during LLM inference (decoding), which limits high-concurrency serving. Existing KV cache eviction methods relieve memory pressure but are mostly impractical for industrial-grade applications. The key to the solution is Compressed PagedAttention, which combines token-wise KV cache eviction with PagedAttention and adds a comprehensive scheduling strategy supporting prefix caching and asynchronous compression, significantly improving memory utilization and inference efficiency. On this basis, the authors build Zipage, a high-concurrency LLM inference engine that reaches about 95% of the performance of full-KV-cache inference engines on large-scale mathematical reasoning tasks while delivering over 2.1x speedup.
Link: https://arxiv.org/abs/2603.08743
Authors: Mengqi Liao, Lu Wang, Chaoyun Zhang, Bo Qiao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Huaiyu Wan
Affiliations: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:
Abstract:With reasoning becoming the generative paradigm for large language models (LLMs), the memory bottleneck caused by KV cache during the decoding phase has become a critical factor limiting high-concurrency service. Although existing KV cache eviction methods address the memory issue, most of them are impractical for industrial-grade applications. This paper introduces Compressed PagedAttention, a method that combines token-wise KV cache eviction with PagedAttention. We propose a comprehensive scheduling strategy and support prefix caching and asynchronous compression for Compressed PagedAttention. Based on this, we have developed a high-concurrency LLM inference engine, Zipage. On large-scale mathematical reasoning tasks, Zipage achieves around 95% of the performance of Full KV inference engines while delivering over 2.1x speedup.
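The bookkeeping that token-wise eviction adds to a paged KV cache can be sketched at toy scale: tokens live in a logical sequence backed by fixed-size blocks, and evicting low-importance tokens shortens the sequence so it packs into fewer blocks. The scoring rule, block size, and API are illustrative; real engines free physical blocks in place rather than rebuilding the sequence.

```python
BLOCK = 4  # tokens per physical block

class PagedKV:
    """Toy paged KV cache: the logical token order lives in `tokens` as
    (kv, score) pairs; physical storage is blocks of BLOCK tokens."""
    def __init__(self):
        self.tokens = []

    def append(self, kv, score):
        self.tokens.append((kv, score))

    def blocks_used(self):
        return -(-len(self.tokens) // BLOCK)   # ceiling division

    def evict(self, keep):
        """Drop the lowest-score tokens until `keep` remain, preserving
        order; the shortened sequence then repacks into fewer blocks."""
        if len(self.tokens) <= keep:
            return
        ranked = sorted(range(len(self.tokens)),
                        key=lambda i: self.tokens[i][1], reverse=True)
        kept = sorted(ranked[:keep])           # keep top scores, in order
        self.tokens = [self.tokens[i] for i in kept]

cache = PagedKV()
for t in range(10):                 # toy scores: recent tokens score higher
    cache.append(kv=f"kv{t}", score=t)
before = cache.blocks_used()        # 10 tokens -> 3 blocks
cache.evict(keep=6)
after = cache.blocks_used()         # 6 tokens -> 2 blocks
```

Freeing one block out of three is the memory headroom that lets a serving engine admit more concurrent requests, which is the effect Zipage exploits.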
[AI-95] Architectural Design and Performance Analysis of FPGA based AI Accelerators: A Comprehensive Review
Summary: This paper addresses the growing computational complexity and resource demands of deep learning (DL) models, in particular the pressing need for high-performance, energy-efficient hardware accelerators in real-world applications. The key to the solution is using Field-Programmable Gate Arrays (FPGAs) as the hardware platform: through hardware-level optimization techniques such as loop pipelining, parallelism, quantization, and memory hierarchy enhancements, FPGAs enable model-specific customized acceleration for neural network workloads, balancing flexibility and efficiency and advancing the design of FPGA-based neural network accelerators.
Link: https://arxiv.org/abs/2603.08740
Authors: Soumita Chatterjee, Sudip Ghosh, Tamal Ghosh, Hafizur Rahaman
Affiliations: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Deep learning (DL) has emerged as a rapidly developing advanced technology, enabling the performance of complex tasks involving image recognition, natural language processing, and autonomous decision-making with high levels of accuracy. However, as these technologies evolve and strive to meet the growing demands of real-life applications, the complexity of DL models continues to increase. These models require processing of massive volumes of data, demanding substantial computational power and memory bandwidth. This gives rise to the critical need for hardware accelerators that can deliver both high performance and energy efficiency. Accelerator types include ASIC based solutions, GPU accelerators, and FPGA based implementations. The limitations of ASIC and GPU accelerators have led to FPGAs becoming one of the prominent solutions, offering distinct advantages for DL workloads. FPGAs provide a flexible and reconfigurable platform, allowing model specific customization while maintaining high efficiency. This article explores various hardware level optimizations for DL. These optimizations include techniques such as loop pipelining, parallelism, quantization, and various memory hierarchy enhancements. In addition, it provides an overview of state-of-the-art FPGA-based neural network accelerators. Through the study and analysis of these accelerators, several challenges have been identified, paving the way for future optimizations and innovations in the design of FPGA-based hardware accelerators.
[AI-96] Sensitivity-Guided Framework for Pruned and Quantized Reservoir Computing Accelerators
Summary: This paper addresses the computational complexity and resource costs of deploying Reservoir Computing (RC) models in hardware, in particular how to balance model accuracy against hardware efficiency under quantization and pruning. The key to the solution is a sensitivity-based pruning mechanism that identifies and removes quantized weights with little impact on model performance, cutting computational overhead without a noticeable accuracy drop; a systematic design-space exploration then quantifies how different pruning rates and bit-width configurations affect model performance and FPGA hardware parameters (such as resource utilization and the Power Delay Product, PDP), achieving an optimal trade-off between accuracy and hardware efficiency.
Link: https://arxiv.org/abs/2603.08737
Authors: Atousa Jafari, Mahdi Taheri, Hassan Ghasemzadeh Mohammadi, Christian Herglotz, Marco Platzner
Affiliations: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:
Abstract:This paper presents a compression framework for Reservoir Computing that enables systematic design-space exploration of trade-offs among quantization levels, pruning rates, model accuracy, and hardware efficiency. The proposed approach leverages a sensitivity-based pruning mechanism to identify and remove less critical quantized weights with minimal impact on model accuracy, thereby reducing computational overhead while preserving accuracy. We perform an extensive trade-off analysis to validate the effectiveness of the proposed framework and the impact of pruning and quantization on model performance and hardware parameters. For this evaluation, we employ three time-series datasets, including both classification and regression tasks. Experimental results across selected benchmarks demonstrate that our proposed approach maintains high accuracy while substantially improving computational and resource efficiency in FPGA-based implementations, with variations observed across different configurations and time series applications. For instance, for the MELBOEN dataset, an accelerator quantized to 4-bit at a 15% pruning rate reduces resource utilization by 1.2% and the Power Delay Product (PDP) by 50.8% compared to an unpruned model, without any noticeable degradation in accuracy.
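Sensitivity-based pruning of the kind described above can be sketched with a toy linear readout standing in for a reservoir's trained output layer: score each weight by how much validation error grows when it is zeroed, then zero the least sensitive fraction. The leave-one-out error score and the data are illustrative assumptions, not the paper's exact criterion.

```python
def predict(weights, x):
    """Toy linear readout, as in a reservoir's trained output layer."""
    return sum(w * xi for w, xi in zip(weights, x))

def val_error(weights, data):
    return sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data)

def sensitivity_prune(weights, data, prune_rate):
    """Score each weight by the validation-error increase when it is
    zeroed, then zero the least sensitive `prune_rate` fraction."""
    base = val_error(weights, data)
    scores = []
    for i in range(len(weights)):
        trial = list(weights)
        trial[i] = 0.0
        scores.append(val_error(trial, data) - base)   # sensitivity of w_i
    k = int(len(weights) * prune_rate)
    drop = sorted(range(len(weights)), key=lambda i: scores[i])[:k]
    pruned = list(weights)
    for i in drop:
        pruned[i] = 0.0
    return pruned

w = [0.5, 0.001, -0.8, 0.002]                 # two weights barely matter
data = [([1.0, 1.0, 1.0, 1.0], -0.3), ([2.0, 0.0, 1.0, 0.0], 0.2)]
pruned = sensitivity_prune(w, data, prune_rate=0.5)
```

Here the two near-zero weights are pruned and the two influential ones survive; on hardware, each zeroed weight removes a multiply-accumulate from the readout datapath.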
[AI-97] Autonomous Edge-Deployed AI Agents for Electric Vehicle Charging Infrastructure Management
Summary: This paper targets the high failure rates of public electric vehicle (EV) charging infrastructure (up to 27.5% of DC fast chargers non-functional) and multi-day mean time to resolution, and proposes Auralink SDC, an edge-based Software-Defined Charging (SDC) architecture for autonomous management of charging infrastructure. The core of the solution is deploying domain-specialized generative AI agents at the network edge to meet the latency, reliability, and bandwidth requirements of autonomous operation. Key innovations include: (1) Confidence-Calibrated Autonomous Resolution (CCAR), enabling autonomous remediation with formal false-positive bounds; (2) Adaptive Retrieval-Augmented Reasoning (ARA), combining dense and sparse retrieval with dynamic context allocation; (3) the Auralink Edge Runtime, achieving sub-50ms time-to-first-token (TTFT) on commodity hardware under PREEMPT_RT real-time constraints; and (4) Hierarchical Multi-Agent Orchestration (HMAO). Using QLoRA models fine-tuned on a corpus spanning OCPP 1.6/2.0.1, ISO 15118, and operational incident histories, the system achieves 78% autonomous incident resolution, 87.6% diagnostic accuracy, and 28-48ms P50 TTFT on 18,000 labeled incidents, providing reusable architecture and implementation patterns for safety-critical industrial edge AI systems.
Link: https://arxiv.org/abs/2603.08736
Authors: Mohammed Cherifi
Affiliations: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 27 pages, 10 figures (TikZ), 27 tables
Abstract:Public EV charging infrastructure suffers from significant failure rates – with field studies reporting up to 27.5% of DC fast chargers non-functional – and multi-day mean time to resolution, imposing billions in annual economic burden. Cloud-centric architectures cannot achieve the latency, reliability, and bandwidth characteristics required for autonomous operation. We present Auralink SDC (Software-Defined Charging), an architecture deploying domain-specialized AI agents at the network edge for autonomous charging infrastructure management. Key contributions include: (1) Confidence-Calibrated Autonomous Resolution (CCAR), enabling autonomous remediation with formal false-positive bounds; (2) Adaptive Retrieval-Augmented Reasoning (ARA), combining dense and sparse retrieval with dynamic context allocation; (3) Auralink Edge Runtime, achieving sub-50ms TTFT on commodity hardware under PREEMPT_RT constraints; and (4) Hierarchical Multi-Agent Orchestration (HMAO). Implementation uses AuralinkLM models fine-tuned via QLoRA on a domain corpus spanning OCPP 1.6/2.0.1, ISO 15118, and operational incident histories. Evaluation on 18,000 labeled incidents in a controlled environment establishes 78% autonomous incident resolution, 87.6% diagnostic accuracy, and 28-48ms TTFT latency (P50). This work presents architecture and implementation patterns for edge-deployed industrial AI systems with safety-critical constraints.
[AI-98] Benchmarking Federated Learning in Edge Computing Environments: A Systematic Review and Performance Evaluation
【速读】:该论文旨在解决边缘计算环境中联邦学习(Federated Learning, FL)系统在数据隐私、低延迟和带宽效率方面的挑战,同时应对数据异质性(non-IID data)、能量限制和可重复性等问题。其解决方案的关键在于对当前主流FL方法进行系统性分类与性能评估,从优化策略、通信效率、隐私保护机制和系统架构四个维度梳理技术进展,并基于MNIST、CIFAR-10等基准数据集对五种代表性算法进行多维指标对比(包括准确率、收敛时间、通信开销、能耗和鲁棒性)。研究发现,SCAFFOLD在准确率(0.90)和鲁棒性方面表现最优,而Federated Averaging(FedAvg)在通信与能效上更具优势,从而为构建更可靠、可扩展的边缘智能联邦学习系统提供了结构化研究路径与实践依据。
链接: https://arxiv.org/abs/2603.08735
作者: Sales Aribe Jr.,Gil Nicholas Cagande
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 12 pages, 5 figures, 2 tables
Abstract:Federated Learning (FL) has emerged as a transformative approach for distributed machine learning, particularly in edge computing environments where data privacy, low latency, and bandwidth efficiency are critical. This paper presents a systematic review and performance evaluation of FL techniques tailored for edge computing. It categorizes state-of-the-art methods into four dimensions: optimization strategies, communication efficiency, privacy-preserving mechanisms, and system architecture. Using benchmarking datasets such as MNIST, CIFAR-10, FEMNIST, and Shakespeare, it assesses five leading FL algorithms across key performance metrics including accuracy, convergence time, communication overhead, energy consumption, and robustness to non-Independent and Identically Distributed (IID) data. Results indicate that SCAFFOLD achieves the highest accuracy (0.90) and robustness, while Federated Averaging (FedAvg) excels in communication and energy efficiency. Visual insights are provided by a taxonomy diagram, dataset distribution chart, and a performance matrix. Despite these advances, problems of data heterogeneity, energy limitations, and reproducibility persist. To enable the creation of more robust and scalable FL systems for edge-based intelligence, this analysis identifies existing gaps and provides an organized agenda for future research.
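上述 FedAvg 的服务器端聚合步骤可以用几行代码示意。以下为示意性 Python 草图(并非该综述或任何基准的实际实现,变量名与数值均为虚构):各客户端参数向量按其本地样本数加权平均;SCAFFOLD 则在此基础上额外维护控制变量以校正 non-IID 带来的客户端漂移。

```python
# Sketch of the FedAvg server aggregation step (illustrative only):
# client parameter vectors are averaged with weights proportional to
# each client's local sample count.
def fedavg_aggregate(client_weights, client_sizes):
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum((n / total) * w[i] for w, n in zip(client_weights, client_sizes))
            for i in range(dim)]

# A client holding 3x the data pulls the global model 3x as hard:
clients = [[1.0, 0.0], [0.0, 1.0]]
sizes = [3, 1]
global_w = fedavg_aggregate(clients, sizes)  # -> [0.75, 0.25]
```

这也解释了综述的结论方向:FedAvg 的通信/能耗开销低,但在 non-IID 数据下需要 SCAFFOLD 类校正项才能保持精度。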
[AI-99] Measurement-Free Ancilla Recycling via Blind Reset: A Cross-Platform Study on Superconducting and Trapped-Ion Processors
【速读】:该论文旨在解决量子纠错中因多次测量校验子(syndrome extraction)导致的辅助量子比特(ancilla)重置质量与逻辑周期延迟之间的耦合问题,从而影响量子计算的效率和容错性能。其解决方案的关键在于提出“盲重置”(blind reset)策略——一种仅通过缩放序列回放(scaled sequence replay)实现、无需测量反馈的纯幺正辅助比特回收(unitary-only recycling)方法,可在保持辅助量子比特清洁度(F_clean)的同时显著降低逻辑周期延迟。实验与仿真表明,在不同硬件平台上(如IQM Garnet、Rigetti Ankaa-3和IonQ),该方案可将循环延迟缩短至原始值的1/38,且在特定长度阈值(crossover length L*)下表现最优,同时通过T1/T2敏感性分析和误差边界验证进一步收紧了部署边界,为不同后端架构提供了针对性的策略选择依据。
链接: https://arxiv.org/abs/2603.08733
作者: Sangkeum Lee
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
备注: 26 pages, 12 figures, 5 tables
Abstract:Ancilla reuse in repeated syndrome extraction couples reset quality to logical-cycle latency. We evaluate blind reset – unitary-only recycling via scaled sequence replay – on IQM Garnet, Rigetti Ankaa-3, and IonQ under matched seeds, sequence lengths, and shot budgets. Using ancilla cleanliness F_clean = P(|0⟩), per-cycle latency, and a distance-3 repetition-code logical-error proxy, platform-calibrated simulation identifies candidate regions where blind reset cuts cycle latency by up to 38x under NVQLink-class feedback overhead while maintaining F_clean = 0.86 for L = 6. Hardware experiments on IQM Garnet confirm blind-reset cleanliness = 0.84 at L=8 (1024 shots, seed 42); platform-calibrated simulation for Rigetti Ankaa-3 predicts comparable performance. Architecture-dependent crossover lengths are L* ~ 12 (IQM), ~ 11 (Rigetti), ~ 1 (IonQ), and ~ 78 with GPU-linked external feedback. Two added analyses tighten deployment boundaries: a T1/T2 sensitivity map identifies coherence-ratio regimes, and error-bound validation confirms measured cleanliness remains consistent with the predicted diagnostic envelope. A deployment decision matrix translates these results into backend-specific policy selection.
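上文的架构相关交叉长度 L* 可以用一个高度简化的玩具模型示意(这是笔者的假设性简化,并非论文的延迟模型):盲重置按比例回放序列,其每周期代价随长度 L 线性增长;基于测量的重置则支付近似固定的“测量+反馈”开销,二者相等处即为 L*。

```python
# Hedged toy model of the crossover length L* (author's simplification,
# not the paper's model): blind reset costs about L * gate_time per cycle,
# measurement-based reset costs a roughly fixed feedback overhead.
def crossover_length(gate_time_us, feedback_overhead_us):
    """Smallest L at which blind reset (L * gate_time) becomes slower
    than measurement-based reset with the given fixed overhead."""
    L = 1
    while L * gate_time_us <= feedback_overhead_us:
        L += 1
    return L

# A larger feedback overhead (e.g. GPU-linked external feedback) pushes
# the crossover out, matching the qualitative trend reported above.
assert crossover_length(0.25, 2.0) < crossover_length(0.25, 8.0)
```

这与论文的定性结论一致:反馈开销越大(如外部 GPU 反馈链路),盲重置保持优势的序列长度区间就越长。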
[AI-100] ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在超长上下文推理场景中因KV缓存(KV cache)内存占用随序列长度和批处理大小线性增长而导致的GPU内存瓶颈问题。现有内存压缩技术如逐出(eviction)和量化(quantization)通常依赖静态启发式策略,在内存预算紧张时易导致性能下降。其解决方案的关键在于提出ARKV框架,该框架通过动态分配精度级别来优化KV缓存:在预填充阶段,基于注意力熵、方差和峰度等统计指标估算每层的原始量化(Original Quantization, OQ)比例;在解码阶段,采用快速重尾项(heavy-hitter)评分策略将token分配至“原精度”、“低精度”或“逐出”三种状态。此方法实现了对KV缓存的细粒度、数据驱动式内存控制,在不改变模型结构或重新训练的前提下,显著降低内存占用(最多4倍),同时保持近97%的基线准确率,且在短上下文任务上与全精度基线持平,在GSM8K数学推理任务中优于均匀量化方案。
链接: https://arxiv.org/abs/2603.08727
作者: Jianlong Lei,Shashikant Ilager
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
备注: Accepted in ACM/IEEE CCGRID 2025 conference
Abstract:Large Language Models (LLMs) are increasingly deployed in scenarios demanding ultra-long context reasoning, such as agentic workflows and deep research understanding. However, long-context inference is constrained by the KV cache, a transient memory structure that grows linearly with sequence length and batch size, quickly dominating GPU memory usage. Existing memory reduction techniques, including eviction and quantization, often rely on static heuristics and suffer from degraded quality under tight budgets. In this paper, we propose ARKV, a lightweight and adaptive framework that dynamically allocates precision levels to cached tokens based on per-layer attention dynamics and token-level importance. During a short prefill phase, ARKV estimates the original quantization (OQ) ratio of each layer by computing statistical scores such as attention entropy, variance and kurtosis. During decoding, tokens are assigned to one of three states, Original (full precision), Quantization (low precision), or Eviction, according to a fast heavy-hitter scoring strategy. Our experiments on LLaMA3 and Qwen3 models across diverse long- and short-context tasks demonstrate that ARKV preserves ~97% of baseline accuracy on long-context benchmarks while reducing KV memory usage by 4x, with minimal throughput loss. On short-context tasks, ARKV matches full-precision baselines; on GSM8K math reasoning, it significantly outperforms uniform quantization. These results highlight the practical viability of ARKV for scalable LLM deployment, offering fine-grained, data-driven memory control without retraining or architectural modifications. The source code and artifacts can be found in: this https URL
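摘要中描述的两个机制可以用如下示意性草图说明(非作者代码,打分、阈值与名称均为虚构):先对某层的注意力分布计算熵统计量(ARKV 还使用方差与峰度),再按重要性得分把缓存 token 分入三种状态。

```python
import math

# Illustrative per-layer statistic: entropy of an attention distribution
# (ARKV also uses variance and kurtosis; this sketch shows entropy only).
def attention_entropy(attn):
    return -sum(p * math.log(p) for p in attn if p > 0)

# Illustrative decode-time assignment: rank tokens by a heavy-hitter-style
# importance score; keep the top tier at full precision, quantize the next
# tier, and evict the rest.
def assign_states(scores, keep_full, keep_quant):
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    states = ["evict"] * len(scores)
    for rank, idx in enumerate(order):
        if rank < keep_full:
            states[idx] = "original"
        elif rank < keep_full + keep_quant:
            states[idx] = "quantized"
    return states

h = attention_entropy([0.25, 0.25, 0.25, 0.25])  # uniform -> log(4)
states = assign_states([0.9, 0.1, 0.5, 0.05], keep_full=1, keep_quant=2)
# -> ["original", "quantized", "quantized", "evict"]
```

熵越高的层注意力越“平坦”,压缩空间越大,这正是用此类统计量决定每层 OQ 比例的直觉。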
[AI-101] PhD Thesis Summary: Methods for Reliability Assessment and Enhancement of Deep Neural Network Hardware Accelerators
【速读】:该论文旨在解决深度神经网络(Deep Neural Network, DNN)硬件加速器在可靠性方面的挑战,尤其是如何以低成本、高效率的方式评估并提升其容错能力。核心问题在于现有可靠性评估方法存在成本高、覆盖不全或与量化(quantization)和近似计算(approximation)之间缺乏协同优化的局限性。解决方案的关键在于提出一套新型的分析工具和实时零开销可靠性增强技术——AdAM(Adaptive Reliability Management),该技术在不增加额外硬件资源的前提下实现了与传统冗余方法相当的故障容忍能力,同时通过系统性文献综述识别出可靠性和量化/近似之间的权衡关系,并构建了可优化计算效率与容错性能的统一框架。
链接: https://arxiv.org/abs/2603.08724
作者: Mahdi Taheri
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:This manuscript summarizes the work and showcases the impact of the doctoral thesis by introducing novel, cost-efficient methods for assessing and enhancing the reliability of DNN hardware accelerators. A comprehensive Systematic Literature Review (SLR) was conducted, categorizing existing reliability assessment techniques, identifying research gaps, and leading to the development of new analytical reliability assessment tools. Additionally, this work explores the interplay between reliability, quantization, and approximation, proposing methodologies that optimize the trade-offs between computational efficiency and fault tolerance. Furthermore, a real-time, zero-overhead reliability enhancement technique, AdAM, was developed, providing fault tolerance comparable to traditional redundancy methods while significantly reducing hardware costs. The impact of this research extends beyond academia, contributing to multiple funded projects, master's courses, industrial collaborations, and the development of new tools and methodologies for efficient and reliable DNN hardware accelerators.
[AI-102] Alignment Is the Disease: Censorship Visibility and Alignment Constraint Complexity as Determinants of Collective Pathology in Multi-Agent LLM Systems
【速读】:该论文试图解决的问题是:当前大型语言模型(Large Language Models, LLMs)中的对齐技术(Alignment Techniques)在提升安全性的同时,是否可能在群体层面引发新的病理行为(Collective Pathology),即产生医源性伤害(Iatrogenic Harm)。解决方案的关键在于通过构建封闭环境下的多代理仿真系统,系统性地操纵对齐强度与审查机制(Censorship Conditions),并量化群体行为的异常模式(如集体病理指数和解离指数),从而揭示对齐干预本身可能诱发的负面集体效应。研究发现,高强度的对齐约束与隐形审查显著增强群体病理表现,且这种效应不受外部审查形式的影响,提示现有安全评估框架可能忽视了由强约束引发的深层病理机制。
链接: https://arxiv.org/abs/2603.08723
作者: Hiroki Fukui
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 37 pages, 1 figure, 1 table. Preprint v2. Previous version: Zenodo DOI https://doi.org/10.5281/zenodo.18646998
Abstract:Alignment techniques in large language models (LLMs) are designed to constrain model outputs toward human values. We present preliminary evidence that alignment itself may produce collective pathology: iatrogenic harm caused by the safety intervention rather than by its absence. Two experimental series use a closed-facility simulation in which groups of four LLM agents cohabit under escalating social pressure. Series C (201 runs; four commercial models; 4 censorship conditions x 2 languages x 10 replications) finds that invisible censorship maximizes collective pathological excitation (Collective Pathology Index; within-model Cohen’s d = 1.98, Holm-corrected p = .006; 7/8 model-language combinations showed consistent directionality, binomial p = .035). Series R (60 runs; Llama 3.3 70B; 3 alignment levels x 2 censorship conditions x 2 languages x 5 replications) reveals a complementary pattern: a Dissociation Index increases with alignment constraint complexity (LMM p = .026; permutation p = .0002; d up to 2.09). Projected onto a shared coordinate system, 201 runs populate distinct behavioral regions, with language moderating which pathological mode predominates. Under the heaviest constraints, external censorship ceases to affect behavior. Qualitative analysis reveals insight-action dissociation parallel to patterns in perpetrator treatment. All manipulations operate at the prompt level; the title states the hypothesis motivating this program rather than an established conclusion. These findings suggest alignment may be iatrogenic at the collective level and that current safety evaluation may be blind to the pathologies stronger constraints generate.
[AI-103] ALADIN: Accuracy-Latency-Aware Design-space Inference Analysis for Embedded AI Accelerators
【速读】:该论文旨在解决在资源受限的嵌入式系统中部署深度神经网络(Deep Neural Networks, DNNs)时,模型精度、计算延迟与硬件约束之间存在的复杂权衡问题,尤其是在需要满足实时性要求的场景下。其核心挑战在于如何高效评估和优化混合精度量化神经网络(Mixed-Precision Quantized Neural Networks, QNNs)在基于暂存存储器(scratchpad)的AI加速器上的性能表现,而无需实际部署到目标平台。解决方案的关键是提出ALADIN框架——一个面向精度-延迟感知的设计空间推理分析工具,通过引入渐进式细化过程,将标准QONNX模型逐步转化为包含平台无关实现细节与硬件特性的平台感知表示,从而在不依赖真实硬件部署的前提下,精确量化分析推理瓶颈、设计权衡及架构决策对精度、延迟和资源消耗的影响,为软硬件协同设计提供可量化的依据。
链接: https://arxiv.org/abs/2603.08722
作者: T. Baldi,D. Casini,A. Biondi
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under review
Abstract:The inference of deep neural networks (DNNs) on resource-constrained embedded systems introduces non-trivial trade-offs among model accuracy, computational latency, and hardware limitations, particularly when real-time constraints must be satisfied. This paper presents ALADIN, an accuracy-latency-aware design-space inference analysis framework for mixed-precision quantized neural networks (QNNs) targeting scratchpad-based AI accelerators. ALADIN enables the evaluation and analysis of inference bottlenecks and design trade-offs across accuracy, latency, and resource consumption without requiring deployment on the target platform, thereby significantly reducing development time and cost. The framework introduces a progressive refinement process that transforms a canonical QONNX model into platform-aware representations by integrating both platform-independent implementation details and hardware-specific characteristics. ALADIN is validated using a cycle-accurate simulator of a RISC-V based platform specialized for AI workloads, demonstrating its effectiveness as a tool for quantitative inference analysis and hardware-software co-design. Experimental results highlight how architectural decisions and mixed-precision quantization strategies impact accuracy, latency, and resource usage, and show that these effects can be precisely evaluated and compared using ALADIN, while also revealing subtle optimization tensions.
[AI-104] SiliconMind-V1: Multi-Agent Distillation and Debug-Reasoning Workflows for Verilog Code Generation
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的Verilog代码自动生成方法在功能性正确性保障方面的不足,尤其是现有方法多依赖商业模型或外部验证工具,导致成本高、数据隐私风险大且无法保证设计功能正确性的痛点。其解决方案的关键在于提出一个统一的多智能体框架,通过集成测试平台驱动的验证机制,在本地微调的大语言模型(SiliconMind-V1)中实现推理导向的训练数据生成,并借助测试时扩展(test-time scaling)能力,使模型能够迭代地进行RTL设计生成、测试与调试,从而在不依赖外部验证工具的前提下显著提升功能性正确率。
链接: https://arxiv.org/abs/2603.08719
作者: Mu-Chi Chen,Yu-Hung Kao,Po-Hsuan Huang,Shao-Chun Ho,Hsiang-Yu Tsou,I-Ting Wu,En-Ming Huang,Yu-Kai Hung,Wei-Po Hsin,Cheng Liang,Chia-Heng Tu,Shih-Hao Hung,Hsiang-Tsung Kung
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:Large language models (LLMs) have recently emerged as a promising approach for automating Verilog code generation; however, existing methods primarily emphasize syntactic correctness and often rely on commercial models or external verification tools, which introduces concerns regarding cost, data privacy, and limited guarantees of functional correctness. This work proposes a unified multi-agent framework for reasoning-oriented training data generation with integrated testbench-driven verification, enabling locally fine-tuned LLMs, SiliconMind-V1, to iteratively generate, test, and debug Register-Transfer Level (RTL) designs through test-time scaling. Experimental results on representative benchmarks (VerilogEval-v2, RTLLM-v2, and CVDP) demonstrate that the proposed approach outperforms the state-of-the-art QiMeng-CodeV-R1 in functional correctness while using fewer training resources.
[AI-105] CktEvo: Repository-Level RTL Code Benchmark for Design Evolution
【速读】:该论文旨在解决大规模寄存器传输级(Register-Transfer Level, RTL)代码库中功耗、性能和面积(Power, Performance, and Area, PPA)优化的自动化难题,尤其针对跨文件依赖关系对PPA指标的复杂影响。现有基于大语言模型(Large Language Models, LLMs)的硬件设计方法多局限于自然语言提示下的生成或调试,易受歧义和幻觉干扰;而其他形式化方法则通常仅优化高层次综合(High-Level Synthesis, HLS)或孤立模块,缺乏对完整IP核内跨文件交互的建模。解决方案的关键在于提出CktEvo——一个面向RTL代码库演化的基准测试平台与闭环框架:首先构建包含真实设计中高质量Verilog仓库的基准数据集,明确任务为在保持功能不变的前提下改善PPA;其次通过将LLM提出的修改与下游工具链反馈耦合,实现跨文件迭代修复与优化,从而在无任何人工干预的情况下达成PPA改进目标。这一方案为研究面向工程实践的LLM辅助RTL优化提供了可执行且严格的基准基础。
链接: https://arxiv.org/abs/2603.08718
作者: Zhengyuan Shi,Jingxin Wang,Tairan Cheng,Changran Xu,Weikang Qian,Qiang Xu
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
Abstract:Register-Transfer Level (RTL) coding is an iterative, repository-scale process in which Power, Performance, and Area (PPA) emerge from interactions across many files and the downstream toolchain. While large language models (LLMs) have recently been applied to hardware design, most efforts focus on generation or debugging from natural-language prompts, where ambiguity and hallucinations necessitate expert review. A separate line of work begins from formal inputs, yet typically optimizes high-level synthesis or isolated modules and remains decoupled from cross-file dependencies. In this work, we present CktEvo, a benchmark and reference framework for repo-level RTL evolution. Unlike prior benchmarks consisting of isolated snippets, our benchmark targets complete IP cores where PPA emerges from cross-file dependencies. Our benchmark packages several high-quality Verilog repositories from real-world designs. We formalize the task as: given an initial repository, produce edits that preserve functional behavior while improving PPA. We also provide a closed-loop framework that couples LLM-proposed edits with toolchain feedback to enable cross-file modifications and iterative repair at repository scale. Our experiments demonstrate that the reference framework realizes PPA improvements without any human interactions. CktEvo establishes a rigorous and executable foundation for studying LLM-assisted RTL optimization that matters for engineering practice: repository-level, function-preserving, and PPA-driven.
[AI-106] Design Conductor: An agent autonomously builds a 1.5 GHz Linux-capable RISC-V CPU
【速读】:该论文旨在解决半导体设计流程中高度依赖人工、耗时且复杂的问题,即如何实现从芯片规格说明到可制造的GDSII版图文件的全流程自动化构建。解决方案的关键在于提出并实现了一个名为Design Conductor (DC) 的自主代理系统,该系统利用前沿大模型的能力,在无需人工干预的情况下,端到端完成微架构设计、寄存器传输级(RTL)实现、测试平台搭建、前端调试、时序优化以达成时序收敛,并与后端工具协同工作,最终生成符合工艺设计套件(PDK)要求的tape-out就绪GDSII文件。在12小时内,DC成功实现了多个版本的RISC-V CPU(命名为VerCore),达到1.48 GHz的工作频率和3261 CoreMark得分,标志着首个完全自主构建的可用CPU的诞生。
链接: https://arxiv.org/abs/2603.08716
作者: TheVerkor Team:Ravi Krishna,Suresh Krishna,David Chin
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
Abstract:Design Conductor (DC) is an autonomous agent which applies the capabilities of frontier models to build semiconductors end-to-end – that is, from concept to verified, tape-out ready GDSII (layout CAD file). In 12 hours and fully autonomously, DC was able to build several micro-architecture variations of a complete RISC-V CPU (which we dub VerCore) that meet timing at 1.48 GHz (rv32i-zmmul; using the ASAP7 PDK), starting from a 219-word requirements document. The VerCore achieves a CoreMark score of 3261. For historical context, this is roughly equivalent to an Intel Celeron SU2300 from mid-2011 (which ran at 1.2 GHz). To our knowledge, this is the first time an autonomous agent has built a complete, working CPU from spec to GDSII. This report is organized as follows. We first review DC’s design and its key components. We then describe the methodology that DC followed to build VerCore – including RTL implementation, testbench implementation, frontend debugging, optimization to achieve timing closure, and interacting with backend tools. We review the key characteristics of the resulting VerCore. Finally, we highlight how frontier models could improve to better enable this application, and our lessons learned as to how chips will be built in the future enabled by the capabilities of systems like DC.
[AI-107] Unveiling the Potential of Quantization with MXFP4: Strategies for Quantization Error Reduction
【速读】:该论文旨在解决低精度量化格式在大型语言模型(Large Language Models, LLMs)推理中面临的准确性与硬件效率之间的权衡问题,特别是针对Open Compute Project(OCP)Microscaling(MX)标准的4-bit浮点格式(MXFP4)相较于NVIDIA的NVFP4在精度上存在显著差距的问题。解决方案的关键在于提出两种纯软件优化技术:Overflow-Aware Scaling(OAS)和Macro Block Scaling(MBS)。OAS通过幂次对齐的块缩放机制扩展有效动态范围以降低整体量化误差,而MBS则在粗粒度上分配更高精度的缩放因子以更好地保留异常值(outliers)。这两项技术共同将MXFP4与NVFP4之间的端到端准确率差距从约10%缩小至平均低于1%,同时仅引入6.2%的GEMM计算开销,从而使得MXFP4成为接近NVFP4精度且保持其硬件效率优势(如张量核心面积节省12%)的实用替代方案。
链接: https://arxiv.org/abs/2603.08713
作者: Jatin Chhugani,Geonhwa Jeong,Bor-Yiing Su,Yunjie Pan,Hanmei Yang,Aayush Ankit,Jiecao Yu,Summer Deng,Yunqing Chen,Nadathur Satish,Changkyu Kim
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
备注:
Abstract:Large Language Models (LLMs) have intensified the need for low-precision formats that enable efficient, large-scale inference. The Open Compute Project (OCP) Microscaling (MX) standard is attractive due to its favorable hardware efficiency, but its 4-bit variant (MXFP4) lags behind NVIDIA’s NVFP4 in accuracy, limiting adoption. We introduce two software-only techniques, Overflow-Aware Scaling (OAS) and Macro Block Scaling (MBS), that improve MXFP4 quantization fidelity without requiring hardware changes. OAS reduces overall errors by increasing effective dynamic range under power-of-two block scaling, while MBS allocates higher-precision scaling at a coarser granularity to better preserve outliers. Across multiple LLMs and standard downstream benchmarks, OAS and MBS reduce the end-to-end accuracy gap between MXFP4 and NVFP4 from about 10% to below 1% on average, while incurring modest GEMM overhead (6.2% on average). These results re-establish MXFP4 as a practical alternative to NVFP4, enabling near-NVFP4 accuracy while retaining MX’s hardware-efficiency advantages (e.g., 12% relative area savings in tensor cores).
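MX 类格式中“按块共享 2 的幂次缩放因子”的思路可示意如下。注意:下面的元素网格是简化的整数近似,并非真实的 MXFP4/E2M1 码本,也不是论文的 OAS 算法本身;但“选择缩放因子时兼顾溢出与动态范围”正是 Overflow-Aware Scaling 的直觉来源。

```python
import math

# Simplified stand-in for a 4-bit element format (NOT the real E2M1 grid).
ELEM_MAX = 6.0  # largest magnitude representable by the element format

def pow2_block_scale(block):
    """Smallest power-of-two scale s with max|x| / s <= ELEM_MAX."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(amax / ELEM_MAX))

def quantize_block(block):
    s = pow2_block_scale(block)
    q = [max(-ELEM_MAX, min(ELEM_MAX, round(x / s))) for x in block]
    return s, q  # value is reconstructed as s * q[i]

s, q = quantize_block([0.3, -1.2, 5.0, 0.01])  # s = 1.0, q = [0, -1, 5, 0]
```

由于缩放因子被限制为 2 的幂,块内最大值附近的取整/溢出决策对误差影响很大;论文的 OAS 通过扩大有效动态范围来压低这类误差,MBS 则在更粗粒度上用更高精度的缩放因子保护离群值。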
[AI-108] First Estimation of Model Parameters for Neutrino-Induced Nucleon Knockout Using Simulation-Based Inference
【速读】:该论文旨在解决加速器基中微子实验中核相互作用物理模拟精度不足的问题,这一问题导致实验合作组不得不依赖经验调参来逼近真实物理参数。为应对日益严格的精度需求,论文提出利用基于模拟的推断(Simulation-Based Inference, SBI)方法替代传统经验调参策略,其关键在于训练一个SBI算法,使其能够从实验数据中自动学习并推断出更优的物理参数配置,同时在保持与原始实验数据良好拟合的基础上,展现出对不同模拟框架(如NuWro)的泛化能力。
链接: https://arxiv.org/abs/2603.09778
作者: Karla Tame-Narvaez,Steven Gardiner,Aleksandra Ćiprijanović,Giuseppe Cerati
机构: 未知
类目: High Energy Physics - Phenomenology (hep-ph); Artificial Intelligence (cs.AI); High Energy Physics - Experiment (hep-ex); Computational Physics (physics.comp-ph)
备注: 13 pages, 10 Figures
Abstract:To enable an accurate determination of oscillation parameters, accelerator-based neutrino experiments require detailed simulations of nuclear interaction physics in the GeV regime. While substantial effort from both theory and experiment is currently being invested to improve the fidelity of these simulations, their present deficiencies typically oblige experimental collaborations to resort to empirical tuning of simulation model parameters. As the precision requirements of the field continue to become more stringent, machine learning techniques may provide a powerful means of handling corresponding growth in the complexity of future neutrino interaction model tuning exercises. To study the suitability of simulation-based inference (SBI) for this physics application, in this paper we revisit a tuned configuration of the GENIE neutrino event generator that was originally developed by the MicroBooNE collaboration. Despite closely reproducing the adopted values of four physics parameters when confronted with the tuned cross-section predictions as input, we find that our trained SBI algorithm prefers modestly different values (within MicroBooNE’s assigned uncertainties) and achieves slightly better goodness-of-fit when inference is run on the experimental data set originally used by MicroBooNE. We also find that our trained algorithm can create a fair approximation of an alternative neutrino scattering simulation, NuWro, that shares only a subset of its physics model parameters with GENIE.
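论文使用的是基于神经网络的 SBI;下面的玩具草图仅演示其最简形式(拒绝式 ABC)的核心思想:从先验抽取参数、跑模拟器、保留与观测接近的抽样。其中的一维“模拟器”纯属虚构,与 GENIE 或论文内容无关。

```python
import random

# Toy rejection-ABC illustration of simulation-based inference. The
# "simulator" below is a made-up stand-in for an event generator.
def simulator(theta, rng):
    return theta + rng.gauss(0.0, 0.1)

def rejection_abc(observed, n_draws=5000, tol=0.05, seed=1):
    rng = random.Random(seed)
    kept = []
    for _ in range(n_draws):
        theta = rng.uniform(0.0, 2.0)  # prior over the model parameter
        if abs(simulator(theta, rng) - observed) < tol:
            kept.append(theta)
    return sum(kept) / len(kept)  # posterior-mean estimate

est = rejection_abc(observed=1.0)  # should land close to 1.0
```

神经 SBI 用训练好的密度(比)估计器替代这里的硬阈值比较,从而在高维汇总统计量与多参数场景下仍然可行。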
[AI-109] A Variational Latent Equilibrium for Learning in Cortex
【速读】:该论文旨在解决深度学习中反向传播通过时间(Backpropagation Through Time, BPTT)算法与大脑神经回路结构及动力学机制不一致的问题,尤其在处理复杂时空依赖关系时缺乏生物学合理性。其解决方案的关键在于提出一种通用的形式化框架,通过基于能量守恒和极值作用原理的前瞻性神经元状态能量函数,推导出连续时间神经网络中的实时误差动态,从而在控制条件下近似BPTT;在此基础上,仅需少量修改即可获得完全局部(空间和时间上均局部)的神经元与突触动力学方程,实现了对时空深度学习的严谨理论构建,并为物理电路实现此类计算提供了蓝图。
链接: https://arxiv.org/abs/2603.09600
作者: Simon Brandt,Paul Haider,Walter Senn,Federico Benitez,Mihai A. Petrovici
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Systems and Control (eess.SY); Biological Physics (physics.bio-ph)
备注:
Abstract:Brains remain unrivaled in their ability to recognize and generate complex spatiotemporal patterns. While AI is able to reproduce some of these capabilities, deep learning algorithms remain largely at odds with our current understanding of brain circuitry and dynamics. This is prominently the case for backpropagation through time (BPTT), the go-to algorithm for learning complex temporal dependencies. In this work we propose a general formalism to approximate BPTT in a controlled, biologically plausible manner. Our approach builds on, unifies and extends several previous approaches to local, time-continuous, phase-free spatiotemporal credit assignment based on principles of energy conservation and extremal action. Our starting point is a prospective energy function of neuronal states, from which we calculate real-time error dynamics for time-continuous neuronal networks. In the general case, this provides a simple and straightforward derivation of the adjoint method result for neuronal networks, the time-continuous equivalent to BPTT. With a few modifications, we can turn this into a fully local (in space and time) set of equations for neuron and synapse dynamics. Our theory provides a rigorous framework for spatiotemporal deep learning in the brain, while simultaneously suggesting a blueprint for physical circuits capable of carrying out these computations. These results reframe and extend the recently proposed Generalized Latent Equilibrium (GLE) model.
[AI-110] CERES: A Probabilistic Early Warning System for Acute Food Insecurity
【速读】:该论文旨在解决全球范围内急性粮食不安全(acute food insecurity)的早期预警难题,尤其是针对危机(IPC Phase 3+)、紧急情况(IPC Phase 4+)和饥荒(IPC Phase 5)状态的精准、及时预测问题。其解决方案的关键在于构建了一个名为CERES(Calibrated Early-warning and Risk Estimation System)的自动化概率预测系统,该系统融合六类多源数据流(包括降水异常、植被指数、冲突事件、IPC分类、食物消费得分和谷物价格指数),通过一个具有预设初始系数和参数扰动区间(n=2,000次抽样)的逻辑回归评分模型进行建模,并实现了概率输出、持续运行、开放获取、机器可读性及每项预测的公开前瞻性验证承诺,是首个同时满足这五个特征的饥荒早期预警系统。
链接: https://arxiv.org/abs/2603.09425
作者: Tom Danny S. Pedersen
机构: 未知
类目: Applications (stat.AP); Artificial Intelligence (cs.AI)
备注: 12 pages, 4 tables, 2 appendices. Live system: this https URL
Abstract:We present CERES (Calibrated Early-warning and Risk Estimation System), an automated probabilistic forecasting system for acute food insecurity. CERES generates 90-day ahead probability estimates of IPC Phase 3+ (Crisis), Phase 4+ (Emergency), and Phase 5 (Famine) conditions for 43 high-risk countries globally, updated weekly. The system fuses six data streams through a logistic scoring model with author-specified initial coefficients and parametric input-perturbation intervals (n=2,000 draws): precipitation anomalies (CHIRPS), vegetation indices (MODIS NDVI), conflict events (ACLED), IPC classifications, food consumption scores (WFP), and cereal price indices (FAO/WFP). In historical back-validation against four IPC Phase 4-5 events selected for data completeness, CERES assigned TIER-1 classification in all four cases; these are in-sample sanity checks only, not prospective performance claims. All prospective predictions are timestamped, cryptographically identified, and archived for public verification against IPC outcome data at the T+90 horizon. To the author’s knowledge, CERES is the first famine early warning system that is simultaneously: (1) probabilistic, (2) open-access, (3) continuously running, (4) machine-readable at prediction level, and (5) committed to public prospective verification of every prediction made.
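上述“带系数扰动的 logistic 评分”可示意如下(系数、特征与扰动方式均为笔者虚构,CERES 的实际模型由作者设定):对融合指标做 logistic 评分,并对系数扰动 n 次以得到概率的不确定性区间。

```python
import math, random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative Monte Carlo interval for P(crisis) under coefficient
# perturbation; all numbers here are invented for the sketch.
def risk_interval(features, coefs, intercept, spread=0.1, n=2000, seed=0):
    rng = random.Random(seed)
    draws = sorted(
        logistic(intercept + sum(c * (1.0 + rng.uniform(-spread, spread)) * x
                                 for c, x in zip(coefs, features)))
        for _ in range(n)
    )
    return draws[n // 2], draws[int(0.05 * n)], draws[int(0.95 * n)]

median, lo, hi = risk_interval([1.2, 0.8, 2.1], [0.5, 0.3, 0.7], -1.0)
# lo <= median <= hi, all strictly between 0 and 1
```

发布中位数及 5%/95% 分位而非单点概率,正是摘要所称“probabilistic”与可前瞻验证承诺的技术基础。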
[AI-111] Reinforced Generation of Combinatorial Structures: Ramsey Numbers
【速读】:该论文致力于改进五类经典拉姆齐数(Ramsey numbers)的下界估计,具体包括 R(3,13)、R(3,18)、R(4,13)、R(4,14) 和 R(4,15),并将它们的已知下界分别提升至61、100、139、148和159。传统方法依赖于为每个拉姆齐数定制的搜索算法,效率低且难以复用。本文的关键解决方案是提出并应用 AlphaEvolve——一种基于大语言模型(Large Language Model, LLM)的代码变异代理(code mutation agent),作为单一元算法(meta-algorithm),自动演化出针对不同拉姆齐数的高效搜索策略,从而统一实现多个下界提升,并在所有已知精确拉姆齐数下界中成功恢复结果,同时匹配或超越此前未明确算法细节的其他案例。
链接: https://arxiv.org/abs/2603.09172
作者: Ansh Nagda,Prabhakar Raghavan,Abhradeep Thakurta
机构: 未知
类目: Combinatorics (math.CO); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
备注:
Abstract:We present improved lower bounds for five classical Ramsey numbers: R(3,13) is increased from 60 to 61, R(3,18) from 99 to 100, R(4,13) from 138 to 139, R(4,14) from 147 to 148, and R(4,15) from 158 to 159. These results were achieved using AlphaEvolve, an LLM-based code mutation agent. Beyond these new results, we successfully recovered lower bounds for all Ramsey numbers known to be exact, and matched the best known lower bounds across many other cases. These include bounds for which previous work does not detail the algorithms used. Virtually all known Ramsey lower bounds are derived computationally, with bespoke search algorithms each delivering a handful of results. AlphaEvolve is a single meta-algorithm yielding search algorithms for all of our results.
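拉姆齐下界 R(s,t) > n 的证书是 K_n 的一个红蓝边染色,其中既无红色 K_s 也无蓝色 K_t。下面的草图用暴力枚举验证此类证书,示例为经典的五边形染色(证明 R(3,3) > 5);论文涉及的规模远非暴力搜索可及,这正是需要 AlphaEvolve 演化出专用搜索算法的原因。

```python
from itertools import combinations

# Verify a Ramsey lower-bound witness: a red/blue edge coloring of K_n
# with no red K_s and no blue K_t certifies R(s,t) > n.
def is_witness(n, red_edges, s, t):
    red = {frozenset(e) for e in red_edges}
    def mono(clique, want_red):
        return all((frozenset(p) in red) == want_red
                   for p in combinations(clique, 2))
    no_red_ks = not any(mono(c, True) for c in combinations(range(n), s))
    no_blue_kt = not any(mono(c, False) for c in combinations(range(n), t))
    return no_red_ks and no_blue_kt

# 5-cycle coloring: red edges form C5, blue edges its complement (also a
# C5), so neither color contains a triangle.
c5 = [(i, (i + 1) % 5) for i in range(5)]
ok = is_witness(5, c5, 3, 3)  # True, certifying R(3,3) > 5
```

验证是多项式可行的;困难全部在于搜索染色本身,其空间随 n 呈 2^(n(n-1)/2) 增长。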
[AI-112] Large Language Model-Assisted Superconducting Qubit Experiments
【速读】:该论文旨在解决超导量子比特控制与测量流程中存在的人工操作复杂、耗时长且对专业知识依赖性强的问题,尤其在实验设计、仪器使用和软件协同方面。解决方案的关键在于引入一个基于大语言模型(Large Language Model, LLM)的自动化框架,该框架通过调用无模式(schema-less)工具并结合仪器使用与实验流程的知识库,实现对实验步骤的自主生成与执行,从而显著提升标准协议部署效率,并支持新型实验程序的灵活实现。
链接: https://arxiv.org/abs/2603.08801
作者: Shiheng Li,Jacob M. Miller,Phoebe J. Lee,Gustav Andersson,Christopher R. Conner,Yash J. Joshi,Bayan Karimi,Amber M. King,Howard L. Malc,Harsh Mishra,Hong Qiao,Minseok Ryu,Xuntao Wu,Siyuan Xing,Haoxiong Yan,Jian Shi,Andrew N. Cleland
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures
Abstract:Superconducting circuits have demonstrated significant potential in quantum information processing and quantum sensing. Implementing novel control and measurement sequences for superconducting qubits is often a complex and time-consuming process, requiring extensive expertise in both the underlying physics and the specific hardware and software. In this work, we introduce a framework that leverages a large language model (LLM) to automate qubit control and measurement. Specifically, our framework conducts experiments by generating and invoking schema-less tools on demand via a knowledge base on instrumental usage and experimental procedures. We showcase this framework with two experiments: an autonomous resonator characterization and a direct reproduction of a quantum non-demolition (QND) characterization of a superconducting qubit from literature. This framework enables rapid deployment of standard control-and-measurement protocols and facilitates implementation of novel experimental procedures, offering a more flexible and user-friendly paradigm for controlling complex quantum hardware.
[AI-113] Permutation-Equivariant 2D State Space Models: Theory and Canonical Architecture for Multivariate Time Series
【速读】:该论文旨在解决多变量时间序列(Multivariate Time Series, MTS)建模中因隐式假设变量顺序而导致的排列对称性破坏问题,即传统方法在变量轴上引入人为排序,违背了真实系统中变量间固有的可交换性(exchangeability)。其解决方案的关键在于提出一种满足排列等变性(permutation-equivariant)约束的二维状态空间模型(2D State Space Model, 2D SSM),通过构建变量不变的聚合机制实现理论上的规范形式:系统自然分解为局部自动力学与全局池化交互项,从而消除沿变量轴的序贯依赖链,将依赖深度从 O(C) 降至 O(1),并简化稳定性分析至两个标量模式。此结构不仅理论上更合理,且在预测、分类和异常检测任务中展现出卓越性能与可扩展性。
链接: https://arxiv.org/abs/2603.08753
作者: Seungwoo Jeong,Heung-Il Suk
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Multivariate time series (MTS) modeling often implicitly imposes an artificial ordering over variables, violating the inherent exchangeability found in many real-world systems where no canonical variable axis exists. We formalize this limitation as a violation of the permutation symmetry principle and require state-space dynamics to be permutation-equivariant along the variable axis. In this work, we theoretically characterize the complete canonical form of linear variable coupling under this symmetry constraint. We prove that any permutation-equivariant linear 2D state-space system naturally decomposes into local self-dynamics and a global pooled interaction, rendering ordered recurrence not only unnecessary but structurally suboptimal. Motivated by this theoretical foundation, we introduce the Variable-Invariant Two-Dimensional State Space Model (VI 2D SSM), which realizes the canonical equivariant form via permutation-invariant aggregation. This formulation eliminates sequential dependency chains along the variable axis, reducing the dependency depth from O(C) to O(1) and simplifying stability analysis to two scalar modes. Furthermore, we propose VI 2D Mamba, a unified architecture integrating multi-scale temporal dynamics and spectral representations. Extensive experiments on forecasting, classification, and anomaly detection benchmarks demonstrate that our model achieves state-of-the-art performance with superior structural scalability, validating the theoretical necessity of symmetry-preserving 2D modeling.
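论文证明的规范形式(局部自动力学加全局池化交互)可用如下极简草图说明(a、b 为示意性标量,非论文参数),并直接验证其排列等变性:

```python
# Canonical permutation-equivariant linear coupling: each variable is
# updated by local self-dynamics plus a shared pooled (mean) interaction.
def equivariant_step(x, a=0.9, b=0.1):
    """y_c = a * x_c + b * mean(x): variable order cannot matter."""
    m = sum(x) / len(x)
    return [a * xc + b * m for xc in x]

x = [1.0, 2.0, 4.0]
y = equivariant_step(x)

# Permuting the input permutes the output identically (equivariance):
perm = [2, 0, 1]
y_perm = equivariant_step([x[i] for i in perm])
assert y_perm == [y[i] for i in perm]
```

由于池化项对所有变量共享,沿变量轴不存在顺序依赖链,这正是依赖深度从 O(C) 降到 O(1) 的原因。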
机器学习
[LG-0] Task Aware Modulation Using Representation Learning for Upscaling of Terrestrial Carbon Fluxes AAAI2026
链接: https://arxiv.org/abs/2603.09974
作者: Aleksei Rozanov,Arvind Renganathan,Vipin Kumar
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: Accepted to the KGML Bridge at AAAI 2026 (non-archival)
Abstract:Accurately upscaling terrestrial carbon fluxes is central to estimating the global carbon budget, yet remains challenging due to the sparse and regionally biased distribution of ground measurements. Existing data-driven upscaling products often fail to generalize beyond observed domains, leading to systematic regional biases and high predictive uncertainty. We introduce Task-Aware Modulation with Representation Learning (TAM-RL), a framework that couples spatio-temporal representation learning with knowledge-guided encoder-decoder architecture and loss function derived from the carbon balance equation. Across 150+ flux tower sites representing diverse biomes and climate regimes, TAM-RL improves predictive performance relative to existing state-of-the-art datasets, reducing RMSE by 8-9.6% and increasing explained variance ( R^2 ) from 19.4% to 43.8%, depending on the target flux. These results demonstrate that integrating physically grounded constraints with adaptive representation learning can substantially enhance the robustness and transferability of global carbon flux estimates.
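The abstract mentions a loss function derived from the carbon balance equation. The standard identity is NEE = Reco − GPP (net ecosystem exchange equals ecosystem respiration minus gross primary production); since TAM-RL's exact loss is not given in the abstract, the following is only a minimal sketch of how such a constraint can be added to a data-fit term:

```python
import numpy as np

def carbon_balance_loss(nee_pred, gpp_pred, reco_pred, nee_obs, lam=0.1):
    """Sketch of a knowledge-guided loss (illustrative, not TAM-RL's exact
    formulation): a data-fit term on NEE plus a penalty enforcing the
    carbon balance NEE = Reco - GPP among the predicted fluxes."""
    data_term = np.mean((nee_pred - nee_obs) ** 2)
    balance_term = np.mean((nee_pred - (reco_pred - gpp_pred)) ** 2)
    return data_term + lam * balance_term

# Fluxes consistent with the balance and the observation incur zero loss
loss_ok = carbon_balance_loss(np.array([1.0]), np.array([2.0]),
                              np.array([3.0]), np.array([1.0]))
```

Here `lam` trades off data fidelity against physical consistency; any violation of the balance among predicted fluxes is penalized even when the NEE prediction matches the observation.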
[LG-1] On the Width Scaling of Neural Optimizers Under Matrix Operator Norms I: Row/Column Normalization and Hyperparameter Transfer
链接: https://arxiv.org/abs/2603.09952
作者: Ruihan Xu,Jiajin Li,Yiping Lu
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Numerical Analysis (math.NA); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
Abstract:A central question in modern deep learning is how to design optimizers whose behavior remains stable as the network width w increases. We address this question by interpreting several widely used neural-network optimizers, including \textrm{AdamW} and \textrm{Muon}, as instances of steepest descent under matrix operator norms. This perspective links optimizer geometry with the Lipschitz structure of the network forward map, and enables width-independent control of both Lipschitz and smoothness constants. However, steepest-descent rules induced by standard p \to q operator norms lack layerwise composability and therefore cannot provide width-independent bounds in deep architectures. We overcome this limitation by introducing a family of mean-normalized operator norms, denoted p_{\mathrm{mean}} \to q_{\mathrm{mean}} , that admit layerwise composability, yield width-independent smoothness bounds, and give rise to practical optimizers such as \emph{rescaled} \textrm{AdamW}, row normalization, and column normalization. The resulting width-aware learning-rate scaling rules recover \mu P scaling (Yang et al., 2021) as a special case and provide a principled mechanism for cross-width learning-rate transfer across a broad class of optimizers. We further show that \textrm{Muon} can suffer an \mathcal{O}(\sqrt{w}) worst-case growth in the smoothness constant, whereas a new family of row-normalized optimizers we propose achieves width-independent smoothness guarantees. Based on these observations, we propose MOGA (Matrix Operator Geometry Aware), a width-aware optimizer based only on row/column-wise normalization that enables stable learning-rate transfer across model widths. Large-scale pre-training on GPT-2 and LLaMA shows that MOGA, especially with row normalization, is competitive with Muon while being notably faster in large-token and low-loss regimes.
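A row-normalization step of the kind MOGA is built on might look as follows; this is a sketch under assumed conventions (RMS row norms, a plain scaled step), since the abstract does not specify the exact normalization and scaling:

```python
import numpy as np

def row_normalized_step(W, G, lr=1e-2, eps=1e-8):
    """Hypothetical row-normalization optimizer step in the spirit of MOGA:
    each row of the gradient is rescaled to unit RMS norm before the update,
    so the per-row step magnitude does not grow with layer width."""
    row_norms = np.sqrt((G ** 2).mean(axis=1, keepdims=True)) + eps
    return W - lr * G / row_norms

W = np.ones((4, 8))
G = np.arange(32, dtype=float).reshape(4, 8)
W_new = row_normalized_step(W, G)
# Every row of the applied update has RMS norm approximately equal to lr
update = W - W_new
rms = np.sqrt((update ** 2).mean(axis=1))
```

The point of the normalization is that `rms` is (up to `eps`) the same for every row regardless of the gradient scale, which is the mechanism behind width-independent step sizes.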
[LG-2] SignalMC-MED: A Multimodal Benchmark for Evaluating Biosignal Foundation Models on Single-Lead ECG and PPG
链接: https://arxiv.org/abs/2603.09940
作者: Fredrik K. Gustafsson,Xiao Gu,Mattia Carletti,Patitapaban Palo,David W. Eyre,David A. Clifton
类目: Machine Learning (cs.LG)
*备注: Code is available at this https URL
Abstract:Recent biosignal foundation models (FMs) have demonstrated promising performance across diverse clinical prediction tasks, yet systematic evaluation on long-duration multimodal data remains limited. We introduce SignalMC-MED, a benchmark for evaluating biosignal FMs on synchronized single-lead electrocardiogram (ECG) and photoplethysmogram (PPG) data. Derived from the MC-MED dataset, SignalMC-MED comprises 22,256 visits with 10-minute overlapping ECG and PPG signals, and includes 20 clinically relevant tasks spanning prediction of demographics, emergency department disposition, laboratory value regression, and detection of prior ICD-10 diagnoses. Using this benchmark, we perform a systematic evaluation of representative time-series and biosignal FMs across ECG-only, PPG-only, and ECG + PPG settings. We find that domain-specific biosignal FMs consistently outperform general time-series models, and that multimodal ECG + PPG fusion yields robust improvements over unimodal inputs. Moreover, using the full 10-minute signal consistently outperforms shorter segments, and larger model variants do not reliably outperform smaller ones. Hand-crafted ECG domain features provide a strong baseline and offer complementary value when combined with learned FM representations. Together, these results establish SignalMC-MED as a standardized benchmark and provide practical guidance for evaluating and deploying biosignal FMs.
[LG-3] Generative Drifting is Secretly Score Matching: a Spectral and Variational Perspective
链接: https://arxiv.org/abs/2603.09936
作者: Erkan Turan,Maks Ovsjanikov
类目: Machine Learning (cs.LG)
*备注:
Abstract:Generative Modeling via Drifting has recently achieved state-of-the-art one-step image generation through a kernel-based drift operator, yet the success is largely empirical and its theoretical foundations remain poorly understood. In this paper, we make the following observation: under a Gaussian kernel, the drift operator is exactly a score difference on smoothed distributions. This insight allows us to answer all three key questions left open in the original work: (1) whether a vanishing drift guarantees equality of distributions ( V_{p,q}=0 \Rightarrow p=q ), (2) how to choose between kernels, and (3) why the stop-gradient operator is indispensable for stable training. Our observations position drifting within the well-studied score-matching family and enable a rich theoretical perspective. By linearizing the McKean-Vlasov dynamics and probing them in Fourier space, we reveal frequency-dependent convergence timescales comparable to Landau damping in plasma kinetic theory: the Gaussian kernel suffers an exponential high-frequency bottleneck, explaining the empirical preference for the Laplacian kernel. We also propose an exponential bandwidth annealing schedule \sigma(t)=\sigma_0 e^{-rt} that reduces convergence time from \exp(O(K_{\max}^2)) to O(\log K_{\max}) . Finally, by formalizing drifting as a Wasserstein gradient flow of the smoothed KL divergence, we prove that the stop-gradient operator is derived directly from the frozen-field discretization mandated by the JKO scheme, and removing it severs training from any gradient-flow guarantee. This variational perspective further provides a general template for constructing novel drift operators, demonstrated with a Sinkhorn divergence drift.
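The annealing schedule \sigma(t)=\sigma_0 e^{-rt} and the logarithmic convergence-time claim are easy to illustrate numerically; the resolution criterion used below (anneal until \sigma(t) drops to 1/K_max) is an assumption made for illustration, not a statement from the paper:

```python
import math

def sigma_schedule(t, sigma0=1.0, r=0.5):
    """Exponential bandwidth annealing sigma(t) = sigma0 * exp(-r*t)."""
    return sigma0 * math.exp(-r * t)

def time_to_resolve(k_max, sigma0=1.0, r=0.5):
    """Time until sigma(t) <= 1/k_max, i.e. t = log(sigma0 * k_max) / r.
    (Illustrative criterion: the bandwidth must shrink below the scale
    of the highest frequency to be resolved.)"""
    return math.log(sigma0 * k_max) / r

# Doubling K_max adds only a constant log(2)/r to the required time,
# consistent with the O(log K_max) scaling claimed in the abstract.
t1 = time_to_resolve(100.0)
t2 = time_to_resolve(200.0)
```

Under a fixed bandwidth, by contrast, the abstract's \exp(O(K_{\max}^2)) bound means the highest frequencies would dominate the total convergence time.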
[LG-4] OptEMA: Adaptive Exponential Moving Average for Stochastic Optimization with Zero-Noise Optimality
链接: https://arxiv.org/abs/2603.09923
作者: Ganzhao Yuan
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
*备注:
Abstract:The Exponential Moving Average (EMA) is a cornerstone of widely used optimizers such as Adam. However, existing theoretical analyses of Adam-style methods have notable limitations: their guarantees can remain suboptimal in the zero-noise regime, rely on restrictive boundedness conditions (e.g., bounded gradients or objective gaps), use constant or open-loop stepsizes, or require prior knowledge of Lipschitz constants. To overcome these bottlenecks, we introduce OptEMA and analyze two novel variants: OptEMA-M, which applies an adaptive, decreasing EMA coefficient to the first-order moment with a fixed second-order decay, and OptEMA-V, which swaps these roles. Crucially, OptEMA is closed-loop and Lipschitz-free in the sense that its effective stepsizes are trajectory-dependent and do not require the Lipschitz constant for parameterization. Under standard stochastic gradient descent (SGD) assumptions, namely smoothness, a lower-bounded objective, and unbiased gradients with bounded variance, we establish rigorous convergence guarantees. Both variants achieve a noise-adaptive convergence rate of \widetilde{\mathcal{O}}(T^{-1/2}+\sigma^{1/2} T^{-1/4}) for the average gradient norm, where \sigma is the noise level. In particular, in the zero-noise regime where \sigma=0 , our bounds reduce to the nearly optimal deterministic rate \widetilde{\mathcal{O}}(T^{-1/2}) without manual hyperparameter retuning.
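The structure of OptEMA-M (adaptive, decreasing EMA coefficient on the first moment, fixed decay on the second) can be sketched as follows; the concrete schedule 1/(t+1) is a hypothetical choice for illustration, not the paper's exact rule:

```python
def optema_m_sketch(grads, beta2=0.999):
    """Illustrative OptEMA-M-style moment estimates (schedule hypothetical):
    the first moment uses an adaptive EMA whose weight on each new gradient
    decays like 1/(t+1), so it tracks a damped running mean, while the
    second moment keeps a fixed decay beta2 as in Adam."""
    m, v = 0.0, 0.0
    for t, g in enumerate(grads, start=1):
        alpha_t = 1.0 / (t + 1.0)            # adaptive, decreasing EMA weight
        m = (1.0 - alpha_t) * m + alpha_t * g
        v = beta2 * v + (1.0 - beta2) * g * g
    return m, v

# With this schedule, m after T steps equals sum(grads) / (T + 1):
# for grads [1, 2, 3] that is 6/4 = 1.5, a damped running mean.
m, v = optema_m_sketch([1.0, 2.0, 3.0])
```

A constant-coefficient EMA forgets old gradients geometrically; a decreasing coefficient instead averages the whole trajectory, which is one way the effective stepsizes become trajectory-dependent.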
[LG-5] CarbonBench: A Global Benchmark for Upscaling of Carbon Fluxes Using Zero-Shot Learning
链接: https://arxiv.org/abs/2603.09868
作者: Aleksei Rozanov,Arvind Renganathan,Yimeng Zhang,Vipin Kumar
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:
Abstract:Accurately quantifying terrestrial carbon exchange is essential for climate policy and carbon accounting, yet models must generalize to ecosystems underrepresented in sparse eddy covariance observations. Despite this challenge being a natural instance of zero-shot spatial transfer learning for time series regression, no standardized benchmark exists to rigorously evaluate model performance across geographically distinct locations with different climate regimes and vegetation types. We introduce CarbonBench, the first benchmark for zero-shot spatial transfer in carbon flux upscaling. CarbonBench comprises over 1.3 million daily observations from 567 flux tower sites globally (2000-2024). It provides: (1) stratified evaluation protocols that explicitly test generalization across unseen vegetation types and climate regimes, separating spatial transfer from temporal autocorrelation; (2) a harmonized set of remote sensing and meteorological features to enable flexible architecture design; and (3) baselines ranging from tree-based methods to domain-generalization architectures. By bridging machine learning methodologies and Earth system science, CarbonBench aims to enable systematic comparison of transfer learning methods, serves as a testbed for regression under distribution shift, and contributes to the next-generation climate modeling efforts.
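The stratified zero-shot protocol described in point (1) amounts to a leave-one-group-out split over site groups. A minimal sketch (site ids and group labels hypothetical):

```python
def leave_group_out_splits(site_groups):
    """Sketch of a zero-shot spatial-transfer protocol: each vegetation type
    (or climate regime) is held out in turn, so test sites come from a group
    never seen during training -- separating spatial transfer from
    memorization of observed sites. `site_groups` maps site id -> group."""
    groups = sorted(set(site_groups.values()))
    splits = []
    for held_out in groups:
        train = [s for s, g in site_groups.items() if g != held_out]
        test = [s for s, g in site_groups.items() if g == held_out]
        splits.append((held_out, train, test))
    return splits

sites = {"US-1": "forest", "US-2": "grassland", "BR-1": "forest", "DE-1": "cropland"}
splits = leave_group_out_splits(sites)
```

The key property is that train and test sets never share a group, which is what makes the evaluation zero-shot with respect to vegetation type or climate regime.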
[LG-6] GAST: Gradient-aligned Sparse Tuning of Large Language Models with Data-layer Selection
链接: https://arxiv.org/abs/2603.09865
作者: Kai Yao,Zhenghan Song,Kaixin Wu,Mingjie Zhong,Danzhao Cheng,Zhaorui Tan,Yixin Ji,Penglei Gao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Parameter-Efficient Fine-Tuning (PEFT) has become a key strategy for adapting large language models, with recent advances in sparse tuning reducing overhead by selectively updating key parameters or subsets of data. Existing approaches generally focus on two distinct paradigms: layer-selective methods aiming to fine-tune critical layers to minimize computational load, and data-selective methods aiming to select effective training subsets to boost training. However, current methods typically overlook the fact that different data points contribute varying degrees to distinct model layers, and they often discard potentially valuable information from data perceived as of low quality. To address these limitations, we propose Gradient-aligned Sparse Tuning (GAST), an innovative method that simultaneously performs selective fine-tuning at both data and layer dimensions as integral components of a unified optimization strategy. GAST specifically targets redundancy in information by employing a layer-sparse strategy that adaptively selects the most impactful data points for each layer, providing a more comprehensive and sophisticated solution than approaches restricted to a single dimension. Experiments demonstrate that GAST consistently outperforms baseline methods, establishing a promising direction for future research in PEFT strategies.
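One plausible instantiation of per-layer data selection in the spirit of GAST is to score each example by how well its gradient aligns with the layer's mean gradient; this cosine-alignment criterion and the names below are hypothetical, since the abstract does not specify the exact selection rule:

```python
import numpy as np

def select_per_layer(per_example_grads, k):
    """Hypothetical sketch of layer-wise data selection: for one layer,
    score each example by the cosine alignment of its gradient with the
    layer's mean gradient, and keep the top-k most aligned examples.
    (GAST's exact criterion may differ.)"""
    G = np.asarray(per_example_grads)        # (n_examples, n_params)
    mean_g = G.mean(axis=0)
    norms = np.linalg.norm(G, axis=1) * np.linalg.norm(mean_g) + 1e-12
    scores = (G @ mean_g) / norms            # cosine alignment per example
    return np.argsort(-scores)[:k]

grads = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]   # third example opposes the mean
chosen = select_per_layer(grads, k=2)
```

Because the score is computed per layer, the same example can be influential for one layer and discarded for another, which is the asymmetry the paper exploits.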
[LG-7] A Unified Hierarchical Multi-Task Multi-Fidelity Framework for Data-Efficient Surrogate Modeling in Manufacturing
链接: https://arxiv.org/abs/2603.09842
作者: Manan Mehta,Zhiqiao Dong,Yuhang Yang,Chenhui Shao
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:
Abstract:Surrogate modeling is an essential data-driven technique for quantifying relationships between input variables and system responses in manufacturing and engineering systems. Two major challenges limit its effectiveness: (1) large data requirements for learning complex nonlinear relationships, and (2) heterogeneous data collected from sources with varying fidelity levels. Multi-task learning (MTL) addresses the first challenge by enabling information sharing across related processes, while multi-fidelity modeling addresses the second by accounting for fidelity-dependent uncertainty. However, existing approaches typically address these challenges separately, and no unified framework simultaneously leverages inter-task similarity and fidelity-dependent data characteristics. This paper develops a novel hierarchical multi-task multi-fidelity (H-MT-MF) framework for Gaussian process-based surrogate modeling. The proposed framework decomposes each task’s response into a task-specific global trend and a residual local variability component that is jointly learned across tasks using a hierarchical Bayesian formulation. The framework accommodates an arbitrary number of tasks, design points, and fidelity levels while providing predictive uncertainty quantification. We demonstrate the effectiveness of the proposed method using a 1D synthetic example and a real-world engine surface shape prediction case study. Compared to (1) a state-of-the-art MTL model that does not account for fidelity information and (2) a stochastic kriging model that learns tasks independently, the proposed approach improves prediction accuracy by up to 19% and 23%, respectively. The H-MT-MF framework provides a general and extensible solution for surrogate modeling in manufacturing systems characterized by heterogeneous data sources.
[LG-8] Good Reasoning Makes Good Demonstrations: Implicit Reasoning Quality Supervision via In-Context Reinforcement Learning
链接: https://arxiv.org/abs/2603.09803
作者: Tiehua Mei,Minxuan Lv,Leiyu Pan,Zhenpeng Su,Hongru Hou,Hengrui Chen,Ao Xu,Deqing Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) improves reasoning in large language models but treats all correct solutions equally, potentially reinforcing flawed traces that get correct answers by chance. We observe that better reasoning makes a better teacher: high-quality solutions serve as more effective demonstrations than low-quality ones. We term this teaching ability Demonstration Utility, and show that the policy model's own in-context learning ability provides an efficient way to measure it, yielding a quality signal termed Evidence Gain. To employ this signal during training, we introduce In-Context RLVR. By Bayesian analysis, we show that this objective implicitly reweights rewards by Evidence Gain, assigning higher weights to high-quality traces and lower weights to low-quality ones, without requiring costly computation or external evaluators. Experiments on mathematical benchmarks show improvements in both accuracy and reasoning quality over standard RLVR.
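The implicit reweighting can be pictured as scaling each trace's verifiable reward by a normalized quality signal; the softmax-style weighting below is a hypothetical illustration, not the paper's derived form:

```python
import numpy as np

def reweighted_rewards(rewards, evidence_gain):
    """Illustrative sketch (weighting form hypothetical): scale each trace's
    RLVR reward by a normalized Evidence Gain, so correct-but-low-quality
    traces receive less credit than correct traces that also teach well
    as in-context demonstrations."""
    w = np.exp(np.asarray(evidence_gain))
    w = w / w.sum()
    return np.asarray(rewards) * w * len(rewards)   # keep the average weight at 1

rewards = np.array([1.0, 1.0, 0.0])   # two correct traces, one wrong
gain = np.array([2.0, 0.0, 0.0])      # the first trace is a better teacher
rw = reweighted_rewards(rewards, gain)
```

The effect matches the abstract's description qualitatively: both correct traces keep positive reward, but the higher-quality one is amplified and the incorrect trace stays at zero.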
[LG-9] Information Theoretic Bayesian Optimization over the Probability Simplex
链接: https://arxiv.org/abs/2603.09793
作者: Federico Pavesi,Antonio Candelieri,Noémie Jaquier
类目: Machine Learning (cs.LG)
*备注: 16 pages, 5 figures
Abstract:Bayesian optimization is a data-efficient technique that has been shown to be extremely powerful to optimize expensive, black-box, and possibly noisy objective functions. Many applications involve optimizing probabilities and mixtures which naturally belong to the probability simplex, a constrained non-Euclidean domain defined by non-negative entries summing to one. This paper introduces \alpha -GaBO, a novel family of Bayesian optimization algorithms over the probability simplex. Our approach is grounded in information geometry, a branch of Riemannian geometry which endows the simplex with a Riemannian metric and a class of connections. Based on information geometry theory, we construct Matérn kernels that reflect the geometry of the probability simplex, as well as a one-parameter family of geometric optimizers for the acquisition function. We validate our method on benchmark functions and on a variety of real-world applications including mixtures of components, mixtures of classifiers, and a robotic control task, showing its increased performance compared to constrained Euclidean approaches.
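A standard information-geometric quantity on the simplex is the Fisher-Rao geodesic distance, obtained by mapping p to sqrt(p), which places the simplex on the positive orthant of a sphere. This sketch shows only that distance; the paper's Matérn kernels and one-parameter family of connections build on this geometry but are not reproduced here:

```python
import math

def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance on the probability simplex,
    d(p, q) = 2 * arccos( sum_i sqrt(p_i * q_i) ),
    i.e. twice the spherical angle between sqrt(p) and sqrt(q)."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))  # Bhattacharyya coefficient
    return 2.0 * math.acos(min(1.0, bc))                  # clamp guards rounding

uniform = [0.25] * 4
vertexish = [0.97, 0.01, 0.01, 0.01]
d0 = fisher_rao_distance(uniform, uniform)    # identical points: distance 0
d1 = fisher_rao_distance(uniform, vertexish)  # near a vertex: strictly positive
```

Using a geometry-aware distance like this inside a kernel, rather than the Euclidean distance of the ambient space, is what the constrained-Euclidean baselines in the abstract lack.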
[LG-10] Upper Generalization Bounds for Neural Oscillators
链接: https://arxiv.org/abs/2603.09742
作者: Zifeng Huang,Konstantin M. Zuev,Yong Xia,Michael Beer
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Machine Learning (stat.ML)
*备注: This manuscript contains 25 pages with 4 figures
Abstract:Neural oscillators that originate from the second-order ordinary differential equations (ODEs) have shown competitive performance in learning mappings between dynamic loads and responses of complex nonlinear structural systems. Despite this empirical success, theoretically quantifying the generalization capacities of their neural network architectures remains undeveloped. In this study, the neural oscillator consisting of a second-order ODE followed by a multilayer perceptron (MLP) is considered. Its upper probably approximately correct (PAC) generalization bound for approximating causal and uniformly continuous operators between continuous temporal function spaces and that for approximating the uniformly asymptotically incrementally stable second-order dynamical systems are derived by leveraging the Rademacher complexity framework. The theoretical results show that the estimation errors grow polynomially with respect to both the MLP size and the time length, thereby avoiding the curse of parametric complexity. Furthermore, the derived error bounds demonstrate that constraining the Lipschitz constants of the MLPs via loss function regularization can improve the generalization ability of the neural oscillator. A numerical study considering a Bouc-Wen nonlinear system under stochastic seismic excitation validates the theoretically predicted power laws of the estimation errors with respect to the sample size and time length, and confirms the effectiveness of constraining MLPs’ matrix and vector norms in enhancing the performance of the neural oscillator under limited training data.
[LG-11] A Multi-Prototype-Guided Federated Knowledge Distillation Approach in AI-RAN Enabled Multi-Access Edge Computing System
链接: https://arxiv.org/abs/2603.09727
作者: Luyao Zou,Hayoung Oh,Chu Myaet Thwal,Apurba Adhikary,Seohyeon Hong,Zhu Han
类目: Machine Learning (cs.LG)
*备注: 15 pages, 6 figures
Abstract:With the development of wireless networks, Multi-Access Edge Computing (MEC) and Artificial Intelligence (AI)-native Radio Access Network (RAN) have attracted significant attention. In particular, the integration of AI-RAN and MEC is envisioned to transform network efficiency and responsiveness, making AI-RAN enabled MEC systems valuable to investigate. Federated learning (FL) is emerging as a promising approach for AI-RAN enabled MEC systems, in which edge devices are enabled to train a global model cooperatively without revealing their raw data. However, conventional FL struggles to process non-independent and identically distributed (non-IID) data. A single prototype obtained by averaging the embedding vectors per class can be employed in FL to handle this data heterogeneity, but the averaging operation may discard useful information. Therefore, in this paper, a multi-prototype-guided federated knowledge distillation (MP-FedKD) approach is proposed. In particular, self-knowledge distillation is integrated into FL to deal with the non-IID issue. To cope with the information loss caused by the single-prototype strategy, a multi-prototype strategy is adopted, where we present a conditional hierarchical agglomerative clustering (CHAC) approach and a prototype alignment scheme. Additionally, we design a novel loss function (called LEMGP loss) for each local client, which focuses on the relationship between global prototypes and local embeddings. Extensive experiments over multiple datasets with various non-IID settings showcase that the proposed MP-FedKD approach outperforms the considered state-of-the-art baselines regarding accuracy, average accuracy, and errors (RMSE and MAE).
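The multi-prototype idea can be sketched with any within-class clustering; plain k-means is used below as a stand-in (the paper uses conditional hierarchical agglomerative clustering, CHAC, which is not reproduced here):

```python
import numpy as np

def multi_prototypes(embeddings, k=2, iters=20, seed=0):
    """Sketch of the multi-prototype strategy: instead of one averaged
    prototype per class, cluster the class's embeddings (plain k-means
    here as a stand-in for CHAC) and keep one prototype per cluster,
    preserving within-class structure that a single mean would blur."""
    X = np.asarray(embeddings, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers

# Two well-separated modes inside one class -> two distinct prototypes,
# whereas the single averaged prototype would land between the modes.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
protos = multi_prototypes(X, k=2)
```

A single prototype for this class would sit near (2.55, 2.5), far from every actual embedding; the two cluster prototypes stay on the modes.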
[LG-12] Physics-informed neural operator for predictive parametric phase-field modelling
链接: https://arxiv.org/abs/2603.09693
作者: Nanxi Chen,Airong Chen,Rujin Ma
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Computational Physics (physics.comp-ph)
*备注:
Abstract:Predicting the microstructural and morphological evolution of materials through phase-field modelling is computationally intensive, particularly for high-throughput parametric studies. While neural operators such as the Fourier neural operator (FNO) show promise in accelerating the solution of parametric partial differential equations (PDEs), the lack of explicit physical constraints, may limit generalisation and long-term accuracy for complex phase-field dynamics. Here, we develop a physics-informed neural operator framework to learn parametric phase-field PDEs, namely PF-PINO. By embedding the residuals of phase-field governing equations into the data-fidelity loss function, our framework effectively enforces physical constraints during training. We validate PF-PINO against benchmark phase-field problems, including electrochemical corrosion, dendritic crystal solidification, and spinodal decomposition. Our results demonstrate that PF-PINO significantly outperforms conventional FNO in accuracy, generalisation capability, and long-term stability. This work provides a robust and efficient computational tool for phase-field modelling and highlights the potential of physics-informed neural operators to advance scientific machine learning for complex interfacial evolution problems.
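Embedding PDE residuals into the loss can be shown with a generic diffusion-type equation as a stand-in; the actual phase-field governing equations of the paper's benchmarks (corrosion, solidification, spinodal decomposition) are more involved, so this is only an illustrative sketch:

```python
import numpy as np

def pde_residual_loss(u, dx, dt, eps=0.5):
    """Sketch of the physics-informed term in a PF-PINO-style loss
    (illustrative equation, not the paper's): penalize the finite-difference
    residual of u_t = eps * u_xx on a predicted space-time field u
    of shape (nt, nx). Interior points only; forward Euler in time."""
    u_t = (u[1:, 1:-1] - u[:-1, 1:-1]) / dt
    u_xx = (u[:-1, 2:] - 2 * u[:-1, 1:-1] + u[:-1, :-2]) / dx ** 2
    return np.mean((u_t - eps * u_xx) ** 2)

# An exact solution of u_t = eps * u_xx: u = exp(-eps * t) * sin(x)
x = np.linspace(0, np.pi, 101)
t = np.linspace(0, 1, 201)
u = np.exp(-0.5 * t)[:, None] * np.sin(x)[None, :]
res = pde_residual_loss(u, dx=x[1] - x[0], dt=t[1] - t[0])
```

For a field that satisfies the PDE, the residual is near zero (only discretization error remains); adding this term to the data-fidelity loss is what enforces the physical constraint during training.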
[LG-13] On Catastrophic Forgetting in Low-Rank Decomposition-Based Parameter-Efficient Fine-Tuning
链接: https://arxiv.org/abs/2603.09684
作者: Muhammad Ahmad,Jingjing Zheng,Yankai Cao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Parameter-efficient fine-tuning (PEFT) based on low-rank decomposition, such as LoRA, has become a standard for adapting large pretrained models. However, its behavior in sequential learning – specifically regarding catastrophic forgetting – remains insufficiently understood. In this work, we present an empirical study showing that forgetting is strongly influenced by the geometry and parameterization of the update subspace. While methods that restrict updates to small, shared matrix subspaces often suffer from task interference, tensor-based decompositions (e.g., LoRETTA) mitigate forgetting by capturing richer structural information within ultra-compact budgets, and structurally aligned parameterizations (e.g., WeGeFT) preserve pretrained representations. Our findings highlight update subspace design as a key factor in continual learning and offer practical guidance for selecting efficient adaptation strategies in sequential settings.
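For readers less familiar with the low-rank decomposition being discussed, the standard LoRA parameterization adds a trained low-rank update (alpha/r) * B A to a frozen weight W0, with B initialized to zero so fine-tuning starts exactly at the pretrained model:

```python
import numpy as np

def lora_forward(x, W0, A, B, alpha=16.0):
    """Standard LoRA forward pass: frozen weight W0 plus a low-rank update
    (alpha / r) * B @ A, with A of shape (r, d_in) and B of shape (d_out, r);
    only A and B are trained."""
    r = A.shape[0]
    return x @ (W0 + (alpha / r) * (B @ A)).T

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
W0 = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))
B = np.zeros((d_out, r))        # B starts at zero: the update begins as a no-op
x = rng.normal(size=(3, d_in))
y = lora_forward(x, W0, A, B)
assert np.allclose(y, x @ W0.T)  # with B = 0, output matches the base model
```

In sequential settings, the geometry of the subspace spanned by B A across tasks is precisely what the study above identifies as the driver of forgetting.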
[LG-14] No evaluation without fair representation: Impact of label and selection bias on the evaluation performance and mitigation of classification models
链接: https://arxiv.org/abs/2603.09662
作者: Magali Legast,Toon Calders,François Fouss
类目: Machine Learning (cs.LG)
*备注: 31 pages, 14 figures + appendix. Submitted to the ACM Journal on Responsible Computing
Abstract:Bias can be introduced in diverse ways in machine learning datasets, for example via selection or label bias. Although these bias types in themselves have an influence on important aspects of fair machine learning, their different impact has been understudied. In this work, we empirically analyze the effect of label bias and several subtypes of selection bias on the evaluation of classification models, on their performance, and on the effectiveness of bias mitigation methods. We also introduce a biasing and evaluation framework that allows to model fair worlds and their biased counterparts through the introduction of controlled bias in real-life datasets with low discrimination. Using our framework, we empirically analyze the impact of each bias type independently, while obtaining a more representative evaluation of models and mitigation methods than with the traditional use of a subset of biased data as test set. Our results highlight different factors that influence how impactful bias is on model performance. They also show an absence of trade-off between fairness and accuracy, and between individual and group fairness, when models are evaluated on a test set that does not exhibit unwanted bias. They furthermore indicate that the performance of bias mitigation methods is influenced by the type of bias present in the data. Our findings call for future work to develop more accurate evaluations of prediction models and fairness interventions, but also to better understand other types of bias, more complex scenarios involving the combination of different bias types, and other factors that impact the efficiency of the mitigation methods, such as dataset characteristics.
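Controlled label-bias injection of the kind the framework performs can be sketched as flipping positive labels of one group with a set probability; the group names and flip rate below are hypothetical parameters, not the paper's settings:

```python
import numpy as np

def inject_label_bias(y, group, p_flip=0.3, seed=0):
    """Sketch of controlled label-bias injection (parameters hypothetical):
    starting from a low-discrimination 'fair world' dataset, flip positive
    labels of the disadvantaged group to negative with probability p_flip,
    producing a biased counterpart while the fair labels remain available
    for unbiased evaluation."""
    rng = np.random.default_rng(seed)
    y_biased = np.asarray(y).copy()
    candidates = (y_biased == 1) & (np.asarray(group) == "disadvantaged")
    flips = candidates & (rng.random(len(y_biased)) < p_flip)
    y_biased[flips] = 0
    return y_biased

y = np.array([1, 1, 1, 1, 0, 0])
group = np.array(["disadvantaged"] * 3 + ["advantaged"] * 3)
y_biased = inject_label_bias(y, group, p_flip=1.0)   # deterministic: all flipped
```

Keeping both `y` and `y_biased` is what lets the framework train on biased data while evaluating against a test set without the unwanted bias, as the abstract advocates.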
[LG-15] FreqCycle: A Multi-Scale Time-Frequency Analysis Method for Time Series Forecasting AAAI2026
链接: https://arxiv.org/abs/2603.09661
作者: Boya Zhang,Shuaijie Yin,Huiwen Zhu,Xing He
类目: Machine Learning (cs.LG)
*备注: 18 pages, 17 figures, accepted to AAAI 2026. Code available at this https URL
Abstract:Mining time-frequency features is critical for time series forecasting. Existing research has predominantly focused on modeling low-frequency patterns, where most time series energy is concentrated. The overlooking of mid to high frequency continues to limit further performance gains in deep learning models. We propose FreqCycle, a novel framework integrating: (i) a Filter-Enhanced Cycle Forecasting (FECF) module to extract low-frequency features by explicitly learning shared periodic patterns in the time domain, and (ii) a Segmented Frequency-domain Pattern Learning (SFPL) module to enhance mid to high frequency energy proportion via learnable filters and adaptive weighting. Furthermore, time series data often exhibit coupled multi-periodicity, such as intertwined weekly and daily cycles. To address coupled multi-periodicity as well as long lookback window challenges, we extend FreqCycle hierarchically into MFreqCycle, which decouples nested periodic features through cross-scale interactions. Extensive experiments on seven diverse domain benchmarks demonstrate that FreqCycle achieves state-of-the-art accuracy while maintaining faster inference speeds, striking an optimal balance between performance and efficiency.
[LG-16] Well Log-Guided Synthesis of Subsurface Images from Sparse Petrography Data Using cGANs
链接: https://arxiv.org/abs/2603.09651
作者: Ali Sadeghkhani,A. Assadi,B. Bennett,A. Rabbani
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注: 6 pages, 3 figures. Extended abstract presented at the Fifth EAGE Digitalization Conference Exhibition, 24-26 March 2025, United Kingdom
Abstract:Pore-scale imaging of subsurface formations is costly and limited to discrete depths, creating significant gaps in reservoir characterization. To address this, we present a conditional Generative Adversarial Network (cGAN) framework for synthesizing realistic thin section images of carbonate rock formations, conditioned on porosity values derived from well logs. The model is trained on 5,000 sub-images extracted from 15 petrography samples over a depth interval of 1992-2000m, the model generates geologically consistent images across a wide porosity range (0.004-0.745), achieving 81% accuracy within a 10% margin of target porosity values. The successful integration of well log data with the trained generator enables continuous pore-scale visualization along the wellbore, bridging gaps between discrete core sampling points and providing valuable insights for reservoir characterization and energy transition applications such as carbon capture and underground hydrogen storage.
[LG-17] Multi-DNN Inference of Sparse Models on Edge SoCs
链接: https://arxiv.org/abs/2603.09642
作者: Jiawei Luo,Di Wu,Simon Dobson,Blesson Varghese
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
*备注:
Abstract:Modern edge applications increasingly require multi-DNN inference systems to execute tasks on heterogeneous processors, gaining performance from both concurrent execution and from matching each model to the most suited accelerator. However, existing systems support only a single model (or a few sparse variants) per task, which impedes the efficiency of this matching and results in high Service Level Objective violation rates. We introduce model stitching for multi-DNN inference systems, which creates model variants by recombining subgraphs from sparse models without re-training. We present a demonstrator system, SparseLoom, that shows model stitching can be deployed to SoCs. We show experimentally that SparseLoom reduces SLO violation rates by up to 74%, improves throughput by up to 2.31x, and lowers memory overhead by an average of 28% compared to state-of-the-art multi-DNN inference systems.
[LG-18] Learning the Hierarchical Organization in Brain Network for Brain Disorder Diagnosis
链接: https://arxiv.org/abs/2603.09606
作者: Jingfeng Tang,Peng Cao,Guangqi Wen,Jinzhu Yang,Xiaoli Liu,Osmar R. Zaiane
类目: Machine Learning (cs.LG)
*备注:
Abstract:Brain network analysis based on functional Magnetic Resonance Imaging (fMRI) is pivotal for diagnosing brain disorders. Existing approaches typically rely on predefined functional sub-networks to construct sub-network associations. However, we identified many cross-network interaction patterns with high Pearson correlations that this strict, prior-based organization fails to capture. To overcome this limitation, we propose the Brain Hierarchical Organization Learning (BrainHO) to learn inherently hierarchical brain network dependencies based on their intrinsic features rather than predefined sub-network labels. Specifically, we design a hierarchical attention mechanism that allows the model to aggregate nodes into a hierarchical organization, effectively capturing intricate connectivity patterns at the subgraph level. To ensure diverse, complementary, and stable organizations, we incorporate an orthogonality constraint loss, alongside a hierarchical consistency constraint strategy, to refine node-level features using high-level graph semantics. Extensive experiments on the publicly available ABIDE and REST-meta-MDD datasets demonstrate that BrainHO not only achieves state-of-the-art classification performance but also uncovers interpretable, clinically significant biomarkers by precisely localizing disease-related sub-networks.
[LG-19] MM-algorithms for traditional and convex NMF with Tweedie and Negative Binomial cost functions and empirical evaluation
链接: https://arxiv.org/abs/2603.09601
作者: Elisabeth Sommer James,Asger Hobolth,Marta Pelizzola
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:
Abstract:Non-negative matrix factorisation (NMF) is a widely used tool for unsupervised learning and feature extraction, with applications ranging from genomics to text analysis and signal processing. Standard formulations of NMF are typically derived under Gaussian or Poisson noise assumptions, which may be inadequate for data exhibiting overdispersion or other complex mean-variance relationships. In this paper, we develop a unified framework for both traditional and convex NMF under a broad class of distributional assumptions, including Negative Binomial and Tweedie models, where the connection between the Tweedie and the \beta -divergence is also highlighted. Using a Majorize-Minimisation approach, we derive multiplicative update rules for all considered models, and novel updates for convex NMF with Poisson and Negative Binomial cost functions. We provide a unified implementation of all considered models, including the first implementations of several convex NMF models. Empirical evaluations on mutational and word count data demonstrate that the choice of noise model critically affects model fit and feature recovery, and that convex NMF can provide an efficient and robust alternative to traditional NMF in scenarios where the number of classes is large. The code for our proposed updates is available in the R package nmfgenr and can be found at this https URL.
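As a point of reference for the multiplicative-update structure this paper generalizes, the classic Lee-Seung updates for NMF under a Poisson/KL cost can be sketched in a few lines of NumPy. This is the textbook baseline only, not the paper's Tweedie/Negative-Binomial or convex-NMF updates:

```python
import numpy as np

def nmf_poisson(V, rank, n_iter=500, seed=0):
    """Classic Lee-Seung multiplicative updates for NMF under a
    Poisson (KL-divergence) cost: V ≈ W @ H with W, H >= 0.
    Shown to illustrate the MM-style update structure that the paper's
    Tweedie / Negative-Binomial variants generalize."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + 1e-3
    H = rng.random((rank, m)) + 1e-3
    for _ in range(n_iter):
        WH = W @ H + 1e-12
        # H update: H <- H * (W^T (V / WH)) / (W^T 1)
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + 1e-12)
        WH = W @ H + 1e-12
        # W update: W <- W * ((V / WH) H^T) / (1 H^T)
        W *= ((V / WH) @ H.T) / (np.ones_like(V) @ H.T + 1e-12)
    return W, H

# A rank-2 non-negative matrix should be reconstructed almost exactly.
rng = np.random.default_rng(1)
V = rng.random((20, 2)) @ rng.random((2, 15))
W, H = nmf_poisson(V, rank=2)
err = np.abs(V - W @ H).mean() / V.mean()
```

Each update is a ratio of non-negative terms, so non-negativity of W and H is preserved automatically; this is the same MM mechanism behind the paper's derived updates.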
[LG-20] Memorization capacity of deep ReLU neural networks characterized by width and depth
链接: https://arxiv.org/abs/2603.09589
作者: Xin Yang,Yunfei Yang
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:This paper studies the memorization capacity of deep neural networks with ReLU activation. Specifically, we investigate the minimal size of such networks to memorize any N data points in the unit ball with pairwise separation distance \delta and discrete labels. Most prior studies characterize the memorization capacity by the number of parameters or neurons. We generalize these results by constructing neural networks, whose width W and depth L satisfy W^2L^2 = \mathcal{O}(N\log(\delta^{-1})) , that can memorize any N data samples. We also prove that any such network must satisfy the lower bound W^2L^2 = \Omega(N\log(\delta^{-1})) , which implies that our construction is optimal up to logarithmic factors when \delta^{-1} is polynomial in N . Hence, we explicitly characterize the trade-off between width and depth for the memorization capacity of deep neural networks in this regime.
[LG-21] Nonparametric Variational Differential Privacy via Embedding Parameter Clipping
链接: https://arxiv.org/abs/2603.09583
作者: Dina El Zein,Shashi Kumar,James Henderson
类目: Machine Learning (cs.LG)
*备注: 8 pages, 1 figure
Abstract:The nonparametric variational information bottleneck (NVIB) provides the foundation for nonparametric variational differential privacy (NVDP), a framework for building privacy-preserving language models. However, the learned latent representations can drift into regions with high information content, leading to poor privacy guarantees, but also low utility due to numerical instability during training. In this work, we introduce a principled parameter clipping strategy to directly address this issue. Our method is mathematically derived from the objective of minimizing the Rényi Divergence (RD) upper bound, yielding specific, theoretically grounded constraints on the posterior mean, variance, and mixture weight parameters. We apply our technique to an NVIB based model and empirically compare it against an unconstrained baseline. Our findings demonstrate that the clipped model consistently achieves tighter RD bounds, implying stronger privacy, while simultaneously attaining higher performance on several downstream tasks. This work presents a simple yet effective method for improving the privacy-utility trade-off in variational models, making them more robust and practical.
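A minimal sketch of the general idea of posterior parameter clipping, assuming a Gaussian-style posterior parameterized by a mean vector and a log-variance. The thresholds below are placeholders for illustration, not the paper's RD-derived constraints:

```python
import numpy as np

def clip_posterior(mu, log_var, mu_max=1.0, log_var_min=-2.0, log_var_max=0.0):
    """Illustrative clipping of Gaussian posterior parameters: bound the
    posterior mean norm and the variance range so the learned posterior
    cannot drift into high-information regions far from the prior.
    The specific bounds are hypothetical, not the paper's derived ones."""
    norm = np.linalg.norm(mu)
    if norm > mu_max:
        mu = mu * (mu_max / norm)      # project the mean back onto the ball
    log_var = np.clip(log_var, log_var_min, log_var_max)
    return mu, log_var

# A drifted posterior gets pulled back inside the constraint set.
mu_c, lv_c = clip_posterior(np.array([3.0, 4.0]), np.array([5.0]))
```

Clipping keeps the divergence between posterior and prior bounded, which is the mechanism the paper formalizes through its Rényi-divergence upper bound.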
[LG-22] Towards Understanding Adam Convergence on Highly Degenerate Polynomials
链接: https://arxiv.org/abs/2603.09581
作者: Zhiwei Bai,Jiajie Zhao,Zhangchen Zhou,Zhi-Qin John Xu,Yaoyu Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Adam is a widely used optimization algorithm in deep learning, yet the specific class of objective functions where it exhibits inherent advantages remains underexplored. Unlike prior studies requiring external schedulers and \beta_2 near 1 for convergence, this work investigates the “natural” auto-convergence properties of Adam. We identify a class of highly degenerate polynomials where Adam converges automatically without additional schedulers. Specifically, we derive theoretical conditions for local asymptotic stability on degenerate polynomials and demonstrate strong alignment between theoretical bounds and experimental results. We prove that Adam achieves local linear convergence on these degenerate functions, significantly outperforming the sub-linear convergence of Gradient Descent and Momentum. This acceleration stems from a decoupling mechanism between the second moment v_t and squared gradient g_t^2 , which exponentially amplifies the effective learning rate. Finally, we characterize Adam’s hyperparameter phase diagram, identifying three distinct behavioral regimes: stable convergence, spikes, and SignGD-like oscillation.
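The claimed gap between Adam and Gradient Descent on degenerate objectives is easy to reproduce on a toy example such as f(x) = x^4, whose Hessian vanishes at the minimum. This is a hedged illustration with a shortened second-moment horizon (beta2 = 0.9), not the paper's exact setting:

```python
import numpy as np

def adam_steps(grad, x0, lr=0.01, beta1=0.9, beta2=0.9, eps=1e-8, n=300):
    """Plain Adam with bias correction. beta2 = 0.9 is chosen here so the
    second moment tracks recent gradients (illustrative choice)."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, n + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        x -= lr * (m / (1 - beta1**t)) / (np.sqrt(v / (1 - beta2**t)) + eps)
    return x

def gd_steps(grad, x0, lr=0.01, n=300):
    x = x0
    for _ in range(n):
        x -= lr * grad(x)
    return x

# Degenerate objective f(x) = x^4: the gradient 4x^3 vanishes cubically at
# the minimum, so plain GD crawls sub-linearly, while Adam's v_t-normalized
# step keeps an effectively constant step size toward 0.
grad = lambda x: 4 * x**3
x_adam = adam_steps(grad, 1.0)
x_gd = gd_steps(grad, 1.0)
```

After 300 steps GD is still visibly far from the minimizer while Adam sits in a small band around it, matching the linear-vs-sub-linear contrast the abstract describes.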
[LG-23] SCDP: Learning Humanoid Locomotion from Partial Observations via Mixed-Observation Distillation IROS
链接: https://arxiv.org/abs/2603.09574
作者: Milo Carroll,Tianhu Peng,Lingfan Bao,Chengxu Zhou,Zhibin Li
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 6 pages, 8 figures, 5 tables, IROS
Abstract:Distilling humanoid locomotion control from offline datasets into deployable policies remains a challenge, as existing methods rely on privileged full-body states that require complex and often unreliable state estimation. We present Sensor-Conditioned Diffusion Policies (SCDP), which enable humanoid locomotion using only onboard sensors, eliminating the need for explicit state estimation. SCDP decouples sensing from supervision through mixed-observation training: the diffusion model conditions on sensor histories while being supervised to predict privileged future state-action trajectories, forcing the model to infer the motion dynamics under partial observability. We further develop restricted denoising, context distribution alignment, and context-aware attention masking to encourage implicit state estimation within the model and to prevent train-deploy mismatch. We validate SCDP on velocity-commanded locomotion and motion reference tracking tasks. In simulation, SCDP achieves near-perfect success on velocity control (99-100%) and 93% tracking success on the AMASS test set, performing comparably to privileged baselines while using only onboard sensors. Finally, we deploy the trained policy on a real G1 humanoid at 50 Hz, demonstrating robust real-robot locomotion without external sensing or state estimation.
[LG-24] An Optimal Control Approach To Transformer Training
链接: https://arxiv.org/abs/2603.09571
作者: Kağan Akman,Naci Saldı,Serdar Yüksel
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:In this paper, we develop a rigorous optimal control-theoretic approach to Transformer training that respects key structural constraints such as (i) realized-input-independence during execution, (ii) the ensemble control nature of the problem, and (iii) positional dependence. We model the Transformer architecture as a discrete-time controlled particle system with shared actions, exhibiting noise-free McKean-Vlasov dynamics. While the resulting dynamics is not Markovian, we show that lifting it to probability measures produces a fully-observed Markov decision process (MDP). Positional encodings are incorporated into the state space to preserve the sequence order under lifting. Using the dynamic programming principle, we establish the existence of globally optimal policies under mild compactness assumptions. We further prove that closed-loop policies in the lifted MDP are equivalent to initial-distribution-dependent open-loop policies, which are realized-input-independent and compatible with standard Transformer training. To train a Transformer, we propose a triply quantized training procedure for the lifted MDP by quantizing the state space, the space of probability measures, and the action space, and show that any optimal policy for the triply quantized model is near-optimal for the original training problem. Finally, we establish stability and empirical consistency properties of the lifted model by showing that the value function is continuous with respect to perturbations of the initial empirical measures and that policies converge as the data size increases. This approach provides a globally optimal and robust alternative to gradient-based training without requiring smoothness or convexity.
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC) Cite as: arXiv:2603.09571 [cs.LG] (or arXiv:2603.09571v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2603.09571 arXiv-issued DOI via DataCite (pending registration) Submission history From: Kağan Akman [v1] Tue, 10 Mar 2026 12:17:48 UTC (47 KB)
[LG-25] Learning Bayesian and Markov Networks with an Unreliable Oracle
链接: https://arxiv.org/abs/2603.09563
作者: Juha Harviainen,Pekka Parviainen,Vidya Sagar Sharma
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study constraint-based structure learning of Markov networks and Bayesian networks in the presence of an unreliable conditional independence oracle that makes at most a bounded number of errors. For Markov networks, we observe that a low maximum number of vertex-wise disjoint paths implies that the structure is uniquely identifiable even if the number of errors is (moderately) exponential in the number of vertices. For Bayesian networks, however, we prove that one cannot tolerate any errors to always identify the structure even when many commonly used graph parameters like treewidth are bounded. Finally, we give algorithms for structure learning when the structure is uniquely identifiable.
[LG-26] TrainDeeploy: Hardware-Accelerated Parameter-Efficient Fine-Tuning of Small Transformer Models at the Extreme Edge DATE2026
链接: https://arxiv.org/abs/2603.09511
作者: Run Wang,Victor J.B. Jung,Philip Wiese,Francesco Conti,Alessio Burrello,Luca Benini
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: Accepted at DATE 2026 (Design, Automation and Test in Europe). 7 pages, 6 figures
Abstract:On-device tuning of deep neural networks enables long-term adaptation at the edge while preserving data privacy. However, the high computational and memory demands of backpropagation pose significant challenges for ultra-low-power, memory-constrained extreme-edge devices. These challenges are further amplified for attention-based models due to their architectural complexity and computational scale. We present TrainDeeploy, a framework that unifies efficient inference and on-device training on heterogeneous ultra-low-power System-on-Chips (SoCs). TrainDeeploy provides the first complete on-device training pipeline for extreme-edge SoCs supporting both Convolutional Neural Networks (CNNs) and Transformer models, together with multiple training strategies such as selective layer-wise fine-tuning and Low-Rank Adaptation (LoRA). On a RISC-V-based heterogeneous SoC, we demonstrate the first end-to-end on-device fine-tuning of a Compact Convolutional Transformer (CCT), achieving up to 11 trained images per second. We show that LoRA reduces dynamic memory usage by 23%, decreases the number of trainable parameters and gradients by 15x, and reduces memory transfer volume by 1.6x compared to full backpropagation. TrainDeeploy achieves up to 4.6 FLOP/cycle on CCT (0.28M parameters, 71-126M FLOPs) and up to 13.4 FLOP/cycle on Deep-AE (0.27M parameters, 0.8M FLOPs), while expanding the scope of prior frameworks to support both CNN and Transformer models with parameter-efficient tuning on extreme-edge platforms.
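The parameter saving from LoRA that TrainDeeploy exploits can be illustrated with a minimal adapter layer. The shapes and rank below are illustrative, not taken from the paper:

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA adapter: y = x @ (W + A @ B), where W is frozen and
    only the low-rank factors A, B are trained. An illustrative sketch of
    why LoRA shrinks the trainable-parameter count on constrained devices;
    dimensions and rank are made up, not from TrainDeeploy."""
    def __init__(self, d_in, d_out, rank, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_in, d_out)) * 0.02  # frozen weights
        self.A = rng.standard_normal((d_in, rank)) * 0.02   # trainable
        self.B = np.zeros((rank, d_out))                    # trainable, zero-init

    def __call__(self, x):
        return x @ self.W + x @ self.A @ self.B

    def trainable_params(self):
        return self.A.size + self.B.size

layer = LoRALinear(d_in=256, d_out=256, rank=4)
full = layer.W.size               # params updated by full fine-tuning
lora = layer.trainable_params()   # params updated with rank-4 LoRA
```

With B initialized to zero the adapter starts as an exact no-op, and the trainable-parameter count drops by a factor of d/(2r), which is where the memory and transfer savings reported in the abstract come from.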
[LG-27] From Weighting to Modeling: A Nonparametric Estimator for Off-Policy Evaluation
链接: https://arxiv.org/abs/2603.09436
作者: Rong J.B. Zhu
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study off-policy evaluation in the setting of contextual bandits, where we aim to evaluate a new policy using historical data that consists of contexts, actions and received rewards. This historical data typically does not faithfully represent the action distribution of the new policy. A common approach, inverse probability weighting (IPW), adjusts for these discrepancies in action distributions. However, this method often suffers from high variance due to the probability being in the denominator. The doubly robust (DR) estimator reduces variance through modeling reward but does not directly address variance from IPW. In this work, we address the limitation of IPW by proposing a Nonparametric Weighting (NW) approach that constructs weights using a nonparametric model. Our NW approach achieves low bias like IPW but typically exhibits significantly lower variance. To further reduce variance, we incorporate reward predictions – similar to the DR technique – resulting in the Model-assisted Nonparametric Weighting (MNW) approach. The MNW approach yields accurate value estimates by explicitly modeling and mitigating bias from reward modeling, without aiming to guarantee the standard doubly robust property. Extensive empirical comparisons show that our approaches consistently outperform existing techniques, achieving lower variance in value estimation while maintaining low bias.
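The IPW baseline that the paper's NW/MNW estimators improve upon can be sketched on a toy bandit problem. The behaviour policy, target policy, and reward model below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 5000, 3

# Logged data from a uniform behaviour policy over K actions.
actions = rng.integers(0, K, size=n)
base = np.array([0.2, 0.5, 0.8])          # true mean reward per action
rewards = rng.binomial(1, base[actions]).astype(float)

# Target policy: deterministically play action 2, whose true value is 0.8.
target = np.zeros((n, K))
target[:, 2] = 1.0
behaviour_prob = np.full(n, 1.0 / K)

# Inverse probability weighting: reweight each logged reward by the ratio
# of target to behaviour action probabilities. Unbiased, but the 1/p term
# in the weights is exactly the variance source the abstract discusses.
w = target[np.arange(n), actions] / behaviour_prob
v_ipw = np.mean(w * rewards)
```

Only a third of the logged samples match the target action and each carries weight 3, so the estimate is unbiased but noisy; the paper's nonparametric weighting replaces these hard ratios with a smoother, lower-variance model.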
[LG-28] Impact of Markov Decision Process Design on Sim-to-Real Reinforcement Learning
链接: https://arxiv.org/abs/2603.09427
作者: Tatjana Krau,Jorge Mandlmaier,Tobias Damm,Frieder Heieck
类目: Machine Learning (cs.LG)
*备注: Submitted at the 65th IEEE Conference on Decision and Control
Abstract:Reinforcement Learning (RL) has demonstrated strong potential for industrial process control, yet policies trained in simulation often suffer from a significant sim-to-real gap when deployed on physical hardware. This work systematically analyzes how core Markov Decision Process (MDP) design choices – state composition, target inclusion, reward formulation, termination criteria, and environment dynamics models – affect this transfer. Using a color mixing task, we evaluate different MDP configurations and mixing dynamics across simulation and real-world experiments. We validate our findings on physical hardware, demonstrating that physics-based dynamics models achieve up to 50% real-world success under strict precision constraints where simplified models fail entirely. Our results provide practical MDP design guidelines for deploying RL in industrial process control.
[LG-29] Reconstructing Movement from Sparse Samples: Enhanced Spatio-Temporal Matching Strategies for Low-Frequency Data
链接: https://arxiv.org/abs/2603.09412
作者: Ali Yousefian,Arianna Burzacchi,Simone Vantini
类目: Machine Learning (cs.LG)
*备注: 22 pages, 14 figures, 3 tables
Abstract:This paper explores potential improvements to the Spatial-Temporal Matching algorithm for matching GPS trajectories to road networks. While this algorithm is effective, it presents some limitations in computational efficiency and the accuracy of the results, especially in dense environments with relatively high sampling intervals. To address this, the paper proposes four modifications to the original algorithm: a dynamic buffer, an adaptive observation probability, a redesigned temporal scoring function, and a behavioral analysis to account for historical mobility patterns. The enhancements are assessed using real-world data from the urban area of Milan, and through newly defined evaluation metrics to be applied in the absence of ground truth. The results of the experiment show significant improvements in performance efficiency and path quality across various metrics.
[LG-30] From Representation to Clusters: A Contrastive Learning Approach for Attributed Hypergraph Clustering
链接: https://arxiv.org/abs/2603.09370
作者: Li Ni,Shuaikang Zeng,Lin Mu,Longlong Lin
类目: Machine Learning (cs.LG)
*备注: Accepted at The Web Conference 2026. 12 pages, 5 figures
Abstract:Contrastive learning has demonstrated strong performance in attributed hypergraph clustering. Typically, existing methods based on contrastive learning first learn node embeddings and then apply clustering algorithms, such as k-means, to these embeddings to obtain the clustering results. However, these methods lack direct clustering supervision, risking the inclusion of clustering-irrelevant information in the learned embeddings. To this end, we propose a Contrastive learning approach for Attributed Hypergraph Clustering (CAHC), an end-to-end method that simultaneously learns node embeddings and obtains clustering results. CAHC consists of two main steps: representation learning and cluster assignment learning. The former employs a novel contrastive learning approach that incorporates both node-level and hyperedge-level objectives to generate node embeddings. The latter jointly optimizes embeddings and clustering, refining the embeddings with clustering-oriented guidance and producing the clustering results directly. Extensive experimental results demonstrate that CAHC outperforms baselines on eight datasets.
[LG-31] Interactive 3D visualization of surface roughness predictions in additive manufacturing: A data-driven framework
链接: https://arxiv.org/abs/2603.09353
作者: Engin Deniz Erkan,Elif Surer,Ulas Yaman
类目: Machine Learning (cs.LG)
*备注:
Abstract:Surface roughness in Material Extrusion Additive Manufacturing varies across a part and is difficult to anticipate during process planning because it depends on both printing parameters and local surface inclination, which governs the staircase effect. A data-driven framework is presented to predict the arithmetic mean roughness (Ra) prior to fabrication using process parameters and surface angle. A structured experimental dataset was created using a three-level Box-Behnken design: 87 specimens were printed, each with multiple planar faces spanning different inclination angles, yielding 1566 Ra measurements acquired with a contact profilometer. A multilayer perceptron regressor was trained to capture nonlinear relationships between manufacturing conditions, inclination, and Ra. To mitigate limited experimental data, a conditional generative adversarial network was used to generate additional condition-specific tabular samples, thereby improving predictive performance. Model performance was assessed on a hold-out test set. A web-based decision-support interface was also developed to enable interactive process planning by loading a 3D model, specifying printing parameters, and adjusting the part’s orientation. The system computes face-wise inclination from the model geometry and visualizes predicted Ra as an interactive colormap over the surface, enabling rapid identification of regions prone to high roughness and immediate comparison of parameter and orientation choices.
[LG-32] Reward-Zero: Language Embedding Driven Implicit Reward Mechanisms for Reinforcement Learning
链接: https://arxiv.org/abs/2603.09331
作者: Heng Zhang,Haddy Alchaer,Arash Ajoudani,Yu She
类目: Machine Learning (cs.LG)
*备注: under review
Abstract:We introduce Reward-Zero, a general-purpose implicit reward mechanism that transforms natural-language task descriptions into dense, semantically grounded progress signals for reinforcement learning (RL). Reward-Zero serves as a simple yet sophisticated universal reward function that leverages language embeddings for efficient RL training. By comparing the embedding of a task specification with embeddings derived from an agent’s interaction experience, Reward-Zero produces a continuous, semantically aligned sense-of-completion signal. This reward supplements sparse or delayed environmental feedback without requiring task-specific engineering. When integrated into standard RL frameworks, it accelerates exploration, stabilizes training, and enhances generalization across diverse tasks. Empirically, agents trained with Reward-Zero converge faster and achieve higher final success rates than conventional methods such as PPO with common reward-shaping baselines, successfully solving tasks that hand-designed rewards could not in some complex tasks. In addition, we develop a mini benchmark for the evaluation of completion sense during task execution via language embeddings. These results highlight the promise of language-driven implicit reward functions as a practical path toward more sample-efficient, generalizable, and scalable RL for embodied agents. Code will be released after peer review.
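A hypothetical sketch of a language-embedding-driven implicit reward of this kind is cosine similarity between a task-description embedding and an embedding of the agent's current experience. The vectors below are toy stand-ins; the paper's actual reward construction may differ:

```python
import numpy as np

def implicit_reward(task_emb, state_emb):
    """Hypothetical Reward-Zero-style dense signal: cosine similarity
    between the embedding of the task specification and the embedding of
    the agent's interaction experience, used as a continuous
    'sense-of-completion' bonus on top of environment reward."""
    t = task_emb / np.linalg.norm(task_emb)
    s = state_emb / np.linalg.norm(state_emb)
    return float(t @ s)

# Toy embeddings: a state semantically close to the task description gets
# a higher reward than an unrelated state.
task = np.array([1.0, 0.0, 1.0])
near_goal = np.array([0.9, 0.1, 1.1])
far_from_goal = np.array([-1.0, 1.0, 0.0])
r_near = implicit_reward(task, near_goal)
r_far = implicit_reward(task, far_from_goal)
```

Because the signal is continuous in embedding space, it densifies sparse environment feedback, which is the mechanism behind the faster convergence the abstract reports.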
[LG-33] A Gaussian Comparison Theorem for Training Dynamics in Machine Learning
链接: https://arxiv.org/abs/2603.09310
作者: Ashkan Panahi
类目: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注:
Abstract:We study training algorithms with data following a Gaussian mixture model. For a specific family of such algorithms, we present a non-asymptotic result, connecting the evolution of the model to a surrogate dynamical system, which can be easier to analyze. The proof of our result is based on the celebrated Gordon comparison theorem. Using our theorem, we rigorously prove the validity of the dynamic mean-field (DMF) expressions in the asymptotic scenarios. Moreover, we suggest an iterative refinement scheme to obtain more accurate expressions in non-asymptotic scenarios. We specialize our theory to the analysis of training a perceptron model with a generic first-order (full-batch) algorithm and demonstrate that fluctuation parameters in a non-asymptotic domain emerge in addition to the DMF kernels.
[LG-34] Proxy-Guided Measurement Calibration
链接: https://arxiv.org/abs/2603.09288
作者: Saketh Vishnubhatla,Shu Wan,Andre Harrison,Adrienne Raglin,Huan Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Aggregate outcome variables collected through surveys and administrative records are often subject to systematic measurement error. For instance, in disaster loss databases, county-level losses reported may differ from the true damages due to variations in on-the-ground data collection capacity, reporting practices, and event characteristics. Such miscalibration complicates downstream analysis and decision-making. We study the problem of outcome miscalibration and propose a framework guided by proxy variables for estimating and correcting the systematic errors. We model the data-generating process using a causal graph that separates latent content variables driving the true outcome from the latent bias variables that induce systematic errors. The key insight is that proxy variables that depend on the true outcome but are independent of the bias mechanism provide identifying information for quantifying the bias. Leveraging this structure, we introduce a two-stage approach that utilizes variational autoencoders to disentangle content and bias latents, enabling us to estimate the effect of bias on the outcome of interest. We analyze the assumptions underlying our approach and evaluate it on synthetic data, semi-synthetic datasets derived from randomized trials, and a real-world case study of disaster loss reporting.
[LG-35] Transductive Generalization via Optimal Transport and Its Application to Graph Node Classification
链接: https://arxiv.org/abs/2603.09257
作者: MoonJeong Park,Seungbeom Lee,Kyungmin Kim,Jaeseung Heo,Seunghyuk Cho,Shouheng Li,Sangdon Park,Dongwoo Kim
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Many existing transductive bounds rely on classical complexity measures that are computationally intractable and often misaligned with empirical behavior. In this work, we establish new representation-based generalization bounds in a distribution-free transductive setting, where learned representations are dependent, and test features are accessible during training. We derive global and class-wise bounds via optimal transport, expressed in terms of Wasserstein distances between encoded feature distributions. We demonstrate that our bounds are efficiently computable and strongly correlate with empirical generalization in graph node classification, improving upon classical complexity measures. Additionally, our analysis reveals how the GNN aggregation process transforms the representation distributions, inducing a trade-off between intra-class concentration and inter-class separation. This yields depth-dependent characterizations that capture the non-monotonic relationship between depth and generalization error observed in practice. The code is available at this https URL.
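In one dimension, the Wasserstein-1 distance appearing in these bounds has a closed form via sorted samples, which makes the "efficiently computable" claim concrete. This is a generic optimal-transport fact, not code from the paper:

```python
import numpy as np

def wasserstein_1d(x, y):
    """W1 distance between two 1-D empirical distributions with equally
    many samples: the optimal transport plan simply matches sorted
    samples, so the distance is the mean absolute gap after sorting."""
    return np.abs(np.sort(x) - np.sort(y)).mean()

# Two well-separated feature distributions (e.g. encoded features of two
# classes): W1 recovers the mean shift between them.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 10000)
b = rng.normal(2.0, 1.0, 10000)
d_ab = wasserstein_1d(a, b)   # close to the true shift of 2
d_aa = wasserstein_1d(a, a)   # identical distributions: exactly 0
```

In the bound's terms, large inter-class W1 (separation) with small intra-class W1 (concentration) is the regime where the generalization guarantee is tight.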
[LG-36] Efficient Reasoning at Fixed Test-Time Cost via Length-Aware Attention Priors and Gain-Aware Training NEURIPS2025
链接: https://arxiv.org/abs/2603.09253
作者: Rian Atri
类目: Machine Learning (cs.LG)
*备注: 19 pages, 6 tables, 1 figure. NeurIPS 2025 Workshop on Efficient Reasoning
Abstract:We study efficient reasoning under tight compute. We ask how to make structured, correct decisions without increasing test time cost. We add two training only components to small and medium Transformers that also transfer to broader differentiable optimizers. First, a length aware attention prior built via fuzzy regime position alignment, RPA, yields a normalized pre softmax bias that guides attention like a structured regularizer while adding no new inference parameters. Second, a minimal gain aware controller, Guardian, nudges attention sharpness only when validation improvements warrant it, following a two timescale policy gradient view of nonconvex optimization. It is disabled at inference. A KL perspective shows softmax(z + log pi) as the MAP solution under KL regularization, grounding the prior in a principled objective. Under strict compute parity on WikiText 2, we reduce validation cross entropy while matching baseline latency and memory. At inference, we add a precomputed, cached prior B(T) as a single additive bias per head. The controller does not run. In practice, this incurs negligible overhead, a cached bias add per head, with no measurable p50 latency shift. Our results suggest that length aware priors and late phase gain control preserve scarce improvements, especially in long span, noisy logit regimes, while keeping test time costs effectively unchanged.
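The additive pre-softmax bias described above follows a standard identity: softmax(z + log pi) equals the prior-tilted softmax, so the prior costs only one cached addition per head. This is a generic sketch of that identity, not the paper's RPA construction:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# A structured prior pi enters as a single additive pre-softmax bias:
# softmax(z + log pi)_i = pi_i * exp(z_i) / sum_j pi_j * exp(z_j),
# i.e. the MAP/KL-regularized posterior that tilts attention toward pi.
z = np.array([1.0, 2.0, 0.5])      # raw attention logits (toy values)
pi = np.array([0.6, 0.2, 0.2])     # hypothetical length-aware prior
attn = softmax(z + np.log(pi))

# Equivalent closed form: prior-tilted softmax, renormalized.
tilted = pi * softmax(z)
tilted /= tilted.sum()
```

Since log pi can be precomputed and cached, inference adds only one bias-add per head, matching the "no measurable latency shift" claim.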
[LG-37] Beyond Test-Time Training: Learning to Reason via Hardware-Efficient Optimal Control
链接: https://arxiv.org/abs/2603.09221
作者: Peihao Wang,Shan Yang,Xijun Wang,Tesi Xiao,Xin Liu,Changlong Yu,Yu Lou,Pan Li,Zhangyang Wang,Ming Lin,René Vidal
类目: Machine Learning (cs.LG)
*备注:
Abstract:Associative memory has long underpinned the design of sequential models. Beyond recall, humans reason by projecting future states and selecting goal-directed actions, a capability that modern language models increasingly require but do not natively encode. While prior work uses reinforcement learning or test-time training, planning remains external to the model architecture. We formulate reasoning as optimal control and introduce the Test-Time Control (TTC) layer, which performs finite-horizon LQR planning over latent states at inference time, represents a value function within neural architectures, and leverages it as the nested objective to enable planning before prediction. To ensure scalability, we derive a hardware-efficient LQR solver based on a symplectic formulation and implement it as a fused CUDA kernel, enabling parallel execution with minimal overhead. Integrated as an adapter into pretrained LLMs, TTC layers improve mathematical reasoning performance by up to +27.8% on MATH-500 and 2-3x Pass@8 improvements on AMC and AIME, demonstrating that embedding optimal control as an architectural component provides an effective and scalable mechanism for reasoning beyond test-time training.
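The finite-horizon LQR planning step that the TTC layer performs can be sketched with the textbook backward Riccati recursion. The paper uses a hardware-efficient symplectic CUDA formulation instead; the double-integrator system below is an illustrative stand-in:

```python
import numpy as np

def lqr_finite_horizon(A, B, Q, R, T):
    """Backward Riccati recursion for finite-horizon discrete-time LQR,
    returning time-varying gains K_t with u_t = -K_t x_t. A textbook
    solver shown to illustrate the planning step; not the paper's
    symplectic formulation."""
    P = Q.copy()
    gains = []
    for _ in range(T):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]  # reorder so gains[t] applies at time t

# Toy latent dynamics: a double integrator driven to the origin.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2)
R = np.array([[0.1]])
gains = lqr_finite_horizon(A, B, Q, R, T=100)

# Roll out the closed loop from x0 = (1, 0).
x = np.array([[1.0], [0.0]])
for K in gains:
    x = A @ x - B @ (K @ x)
final_norm = float(np.linalg.norm(x))
```

The recursion is a sequential backward pass, which is exactly what makes a naive solver slow on GPUs and motivates the paper's parallel symplectic kernel.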
[LG-38] The Radio-Frequency Transformer for Signal Separation
链接: https://arxiv.org/abs/2603.09201
作者: Egor Lifar,Semyon Savkin,Rachana Madhukara,Tejas Jayashankar,Yury Polyanskiy,Gregory W. Wornell
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study a problem of signal separation: estimating a signal of interest (SOI) contaminated by an unknown non-Gaussian background/interference. Given the training data consisting of examples of SOI and interference, we show how to build a fully data-driven signal separator. To that end we learn a good discrete tokenizer for SOI and then train an end-to-end transformer on a cross-entropy loss. Training with a cross-entropy shows substantial improvements over the conventional mean-squared error (MSE). Our tokenizer is a modification of Google’s SoundStream, which incorporates additional transformer layers and switches from VQVAE to finite-scalar quantization (FSQ). Across real and synthetic mixtures from the MIT RF Challenge dataset, our method achieves competitive performance, including a 122x reduction in bit-error rate (BER) over prior state-of-the-art techniques for separating a QPSK signal from 5G interference. The learned representation adapts to the interference type without side information and shows zero-shot generalization to unseen mixtures at inference time, underscoring its potential beyond RF. Although we instantiate our approach on radio-frequency mixtures, we expect the same architecture to apply to gravitational-wave data (e.g., LIGO strain) and other scientific sensing problems that require data-driven modeling of background and noise.
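The switch from VQ-VAE to finite-scalar quantization (FSQ) replaces codebook lookup with per-dimension rounding onto a fixed grid. A bare-bones sketch of that rounding step, without the straight-through gradient used during training:

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite scalar quantization: squash each latent dimension into
    (-1, 1) and snap it onto a fixed grid of `levels` values. Only the
    forward rounding step is shown; training would use a straight-through
    estimator. A generic FSQ sketch, not the paper's full tokenizer."""
    half = (levels - 1) / 2.0
    bounded = np.tanh(z)                    # squash to (-1, 1)
    return np.round(bounded * half) / half  # snap to the grid

# With levels=5, the per-dimension grid is {-1, -0.5, 0, 0.5, 1}.
z = np.array([-3.0, -0.2, 0.0, 0.2, 3.0])
q = fsq_quantize(z, levels=5)
```

The implicit codebook is the product grid over dimensions, so there is no learned codebook to collapse, which is a common motivation for preferring FSQ over VQ-VAE quantization.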
[LG-39] P2GNN: Two Prototype Sets to boost GNN Performance
链接: https://arxiv.org/abs/2603.09195
作者: Arihant Jain,Gundeep Arora,Anoop Saladi,Chaosheng Dong
类目: Machine Learning (cs.LG)
*备注:
Abstract:Message Passing Graph Neural Networks (MP-GNNs) have garnered attention for addressing various industry challenges, such as user recommendation and fraud detection. However, they face two major hurdles: (1) heavy reliance on local context, often lacking information about the global context or graph-level features, and (2) assumption of strong homophily among connected nodes, struggling with noisy local neighborhoods. To tackle these, we introduce P^2 GNN, a plug-and-play technique leveraging prototypes to optimize message passing, enhancing the performance of the base GNN model. Our approach views the prototypes in two ways: (1) as universally accessible neighbors for all nodes, enriching global context, and (2) aligning messages to clustered prototypes, offering a denoising effect. We demonstrate the extensibility of our proposed method to all message-passing GNNs and conduct extensive experiments across 18 datasets, including proprietary e-commerce datasets and open-source datasets, on node recommendation and node classification tasks. Results show that P^2 GNN outperforms production models in e-commerce and achieves the top average rank on open-source datasets, establishing it as a leading approach. Qualitative analysis supports the value of global context and noise mitigation in the local neighborhood in enhancing performance.
[LG-40] The Costs of Reproducibility in Music Separation Research: a Replication of Band-Split RNN
链接: https://arxiv.org/abs/2603.09187
作者: Paul Magron,Romain Serizel,Constance Douwes
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注:
Abstract:Music source separation is the task of isolating the instrumental tracks from a music song. Despite its spectacular recent progress, the trend towards more complex architectures and training protocols exacerbates reproducibility issues. The band-split recurrent neural networks (BSRNN) model is promising in this regard, since it yields close to state-of-the-art results on public datasets, and requires reasonable resources for training. Unfortunately, it is not straightforward to reproduce since its full code is not available. In this paper, we attempt to replicate BSRNN as closely as possible to the original paper through extensive experiments, which allows us to conduct a critical reflection on this reproducibility issue. Our contributions are three-fold. First, this study yields several insights on the model design and training pipeline, which sheds light on potential future improvements. In particular, since we were unsuccessful in reproducing the original results, we explore additional variants that ultimately yield an optimized BSRNN model, whose performance largely improves that of the original. Second, we discuss reproducibility issues from both methodological and practical perspectives. We notably underline how substantial time and energy costs could have been saved upon availability of the full pipeline. Third, our code and pre-trained models are released publicly to foster reproducible research. We hope that this study will contribute to spread awareness on the importance of reproducible research in the music separation community, and help promoting more transparent and sustainable practices.
[LG-41] Better Bounds for the Distributed Experts Problem
链接: https://arxiv.org/abs/2603.09168
作者: David P. Woodruff,Samson Zhou
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注:
Abstract:In this paper, we study the distributed experts problem, where n experts are distributed across s servers for T timesteps. The loss of each expert at each time t is the \ell_p norm of the vector that consists of the losses of the expert at each of the s servers at time t. The goal is to minimize the regret R, i.e., the loss of the distributed protocol compared to the loss of the best expert, amortized over all T times, while using the minimum amount of communication. We give a protocol that, for any regret R \gtrsim \frac{1}{\sqrt{T}} \cdot \mathrm{poly}\log(nsT), uses \mathcal{O}\left(\frac{n}{R^2} + \frac{s}{R^2}\right) \cdot \max(s^{1-2/p}, 1) \cdot \mathrm{poly}\log(nsT) bits of communication, which improves on previous work.
[LG-42] Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards
链接: https://arxiv.org/abs/2603.09117
作者: Zhengzhao Ma,Xueru Wen,Boxi Cao,Yaojie Lu,Hongyu Lin,Jinglin Yang,Min He,Xianpei Han,Le Sun
类目: Machine Learning (cs.LG)
*备注: 9 pages, 8 figures
Abstract:Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances the reasoning of large language models (LLMs) but severely suffers from calibration degeneration, where models become excessively over-confident in incorrect answers. Previous studies directly incorporate a calibration objective into the existing optimization target. However, our theoretical analysis demonstrates that there is a fundamental gradient conflict between maximizing policy accuracy and minimizing calibration error. Building on this insight, we propose DCPO, a simple yet effective framework that systematically decouples the reasoning and calibration objectives. Extensive experiments demonstrate that DCPO not only preserves accuracy on par with GRPO but also achieves the best calibration performance and substantially mitigates the over-confidence issue. Our study provides valuable insights and a practical solution for more reliable LLM deployment.
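The gradient conflict the abstract describes can be illustrated with a deliberately simplified toy (the objectives below are stand-ins, not DCPO's actual losses): with a single shared confidence p, a reward-style log-likelihood term always pushes p upward, while a Brier-style calibration term pushes p toward the 0/1 correctness label, so the two gradients oppose each other exactly on over-confident wrong answers.

```python
# Toy illustration of the accuracy/calibration gradient conflict;
# simplified stand-in objectives, not DCPO's actual formulation.

def conflicting_grads(p, correct):
    """Ascent gradients w.r.t. a shared confidence p in (0, 1).

    g_acc: gradient of a reward-weighted log-likelihood log(p),
           which always pushes p upward.
    g_cal: gradient of a negative Brier score -(p - correct)^2,
           which pushes p toward the 0/1 correctness label.
    """
    g_acc = 1.0 / p
    g_cal = -2.0 * (p - correct)
    return g_acc, g_cal

# On an over-confident wrong answer (correct = 0, p = 0.8) the two
# gradients have opposite signs -- the conflict that motivates decoupling.
g_acc, g_cal = conflicting_grads(0.8, correct=0.0)
```

On correct answers the two gradients align; decoupling the objectives, as DCPO does, removes the need to trade one off against the other.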
[LG-43] Probabilistic Hysteresis Factor Prediction for Electric Vehicle Batteries with Graphite Anodes Containing Silicon
链接: https://arxiv.org/abs/2603.09103
作者: Runyao Yu,Viviana Kleine,Philipp Gromotka,Thomas Rudolf,Adrian Eisenmann,Gautham Ram Chandra Mouli,Peter Palensky,Jochen L. Cremer
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 11 pages, 5 figures, 6 tables
Abstract:Batteries with silicon-graphite-based anodes, which offer higher energy density and improved charging performance, introduce pronounced voltage hysteresis, making state-of-charge (SoC) estimation particularly challenging. Existing approaches to modeling hysteresis rely on exhaustive high-fidelity tests or focus on conventional graphite-based lithium-ion batteries, without considering uncertainty quantification or computational constraints. This work introduces a data-driven approach for probabilistic hysteresis factor prediction, with a particular emphasis on applications involving silicon-graphite anode-based batteries. A data harmonization framework is proposed to standardize heterogeneous driving cycles across varying operating conditions. Statistical learning and deep learning models are applied to assess performance in predicting the hysteresis factor with uncertainties while considering computational efficiency. Extensive experiments are conducted to evaluate the generalizability of the optimal model configuration in unseen vehicle models through retraining, zero-shot prediction, fine-tuning, and joint training. By addressing key challenges in SoC estimation, this research facilitates the adoption of advanced battery technologies. A summary page is available at: this https URL
[LG-44] Overcoming Valid Action Suppression in Unmasked Policy Gradient Algorithms
链接: https://arxiv.org/abs/2603.09090
作者: Renos Zabounidis,Roy Siegelmann,Mohamad Qadri,Woojun Kim,Simon Stepputtis,Katia P. Sycara
类目: Machine Learning (cs.LG)
*备注:
Abstract:In reinforcement learning environments with state-dependent action validity, action masking consistently outperforms penalty-based handling of invalid actions, yet existing theory only shows that masking preserves the policy gradient theorem. We identify a distinct failure mode of unmasked training: it systematically suppresses valid actions at states the agent has not yet visited. This occurs because gradients pushing down invalid actions at visited states propagate through shared network parameters to unvisited states where those actions are valid. We prove that for softmax policies with shared features, when an action is invalid at visited states but valid at an unvisited state s^* , the probability \pi(a \mid s^*) is bounded by exponential decay due to parameter sharing and the zero-sum identity of softmax logits. This bound reveals that entropy regularization trades off between protecting valid actions and sample efficiency, a tradeoff that masking eliminates. We validate empirically that deep networks exhibit the feature alignment condition required for suppression, and experiments on Craftax, Craftax-Classic, and MiniHack confirm the predicted exponential suppression and demonstrate that feasibility classification enables deployment without oracle masks.
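The masking mechanism this analysis favors can be sketched in a few lines (a minimal stand-alone illustration, not the authors' code): excluding invalid actions from the softmax normalization gives them exactly zero probability, so no "push-down" gradient on them can leak to other states through shared parameters.

```python
import math

def masked_softmax(logits, valid):
    """Softmax restricted to valid actions: invalid actions are dropped
    from the normalization and receive exactly zero probability, so no
    gradient pressure on them propagates through shared parameters."""
    m = max(l for l, v in zip(logits, valid) if v)  # stabilize the exp
    exps = [math.exp(l - m) if v else 0.0 for l, v in zip(logits, valid)]
    s = sum(exps)
    return [e / s for e in exps]

# Action 1 is invalid in this state: it gets probability exactly 0,
# and the remaining mass is renormalized over the valid actions.
probs = masked_softmax([2.0, 1.0, 0.5, -1.0], [True, False, True, True])
```

A penalty-based alternative would instead keep all four actions in the softmax and rely on negative rewards, which is exactly the regime where the paper's suppression bound applies.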
[LG-45] PPO-Based Hybrid Optimization for RIS-Assisted Semantic Vehicular Edge Computing
链接: https://arxiv.org/abs/2603.09082
作者: Wei Feng,Jingbo Zhang,Qiong Wu,Pingyi Fan,Qiang Fan
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: This paper has been accepted by electronics. The source code has been released at: this https URL
Abstract:To support latency-sensitive Internet of Vehicles (IoV) applications amidst dynamic environments and intermittent links, this paper proposes a Reconfigurable Intelligent Surface (RIS)-aided semantic-aware Vehicle Edge Computing (VEC) framework. This approach integrates RIS to optimize wireless connectivity and semantic communication to minimize latency by transmitting semantic features. We formulate a comprehensive joint optimization problem by optimizing offloading ratios, the number of semantic symbols, and RIS phase shifts. Considering the problem’s high dimensionality and non-convexity, we propose a two-tier hybrid scheme that employs Proximal Policy Optimization (PPO) for discrete decision-making and Linear Programming (LP) for offloading optimization. The simulation results have validated the proposed framework’s superiority over existing methods. Specifically, the proposed PPO-based hybrid optimization scheme reduces the average end-to-end latency by approximately 40% to 50% compared to Genetic Algorithm (GA) and Quantum-behaved Particle Swarm Optimization (QPSO). Moreover, the system demonstrates strong scalability by maintaining low latency even in congested scenarios with up to 30 vehicles.
[LG-46] Learning Adaptive LLM Decoding
链接: https://arxiv.org/abs/2603.09065
作者: Chloe H. Su,Zhe Ye,Samuel Tenka,Aidan Yang,Soonho Kong,Udaya Ghai
类目: Machine Learning (cs.LG)
*备注:
Abstract:Decoding from large language models (LLMs) typically relies on fixed sampling hyperparameters (e.g., temperature, top-p), despite substantial variation in task difficulty and uncertainty across prompts and individual decoding steps. We propose to learn adaptive decoding policies that dynamically select sampling strategies at inference time, conditioned on available compute resources. Rather than fine-tuning the language model itself, we introduce lightweight decoding adapters trained with reinforcement learning and verifiable terminal rewards (e.g. correctness on math and coding tasks). At the sequence level, we frame decoding as a contextual bandit problem: a policy selects a decoding strategy (e.g. greedy, top-k, min-p) for each prompt, conditioned on the prompt embedding and a parallel sampling budget. At the token level, we model decoding as a partially observable Markov decision process (POMDP), where a policy selects sampling actions at each token step based on internal model features and the remaining token budget. Experiments on the MATH and CodeContests benchmarks show that the learned adapters improve the accuracy-budget tradeoff: on MATH, the token-level adapter improves Pass@1 accuracy by up to 10.2% over the best static baseline under a fixed token budget, while the sequence-level adapter yields 2-3% gains under fixed parallel sampling. Ablation analyses support the contribution of both sequence- and token-level adaptation.
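The sequence-level contextual-bandit view can be sketched with a simple epsilon-greedy learner (an illustrative stand-in that drops the prompt-embedding and budget conditioning of the actual policy; the arm names are taken from the abstract).

```python
import random

STRATEGIES = ["greedy", "top_k", "min_p"]  # illustrative arm set

class DecodingBandit:
    """Epsilon-greedy stand-in for the sequence-level policy: pick a
    decoding strategy per prompt, update with a verifiable 0/1 reward
    (e.g. correctness on a math problem)."""

    def __init__(self, eps=0.1):
        self.eps = eps
        self.counts = {s: 0 for s in STRATEGIES}
        self.values = {s: 0.0 for s in STRATEGIES}

    def select(self):
        if random.random() < self.eps:
            return random.choice(STRATEGIES)          # explore
        return max(STRATEGIES, key=lambda s: self.values[s])  # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        # Incremental mean of observed rewards for this arm.
        self.values[arm] += (reward - self.values[arm]) / n
```

After enough prompts, the learner concentrates its pulls on whichever strategy earns the highest verifiable reward; the paper's adapter additionally conditions this choice on the prompt and the sampling budget.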
[LG-47] Dynamic Multi-period Experts for Online Time Series Forecasting WWW2026
链接: https://arxiv.org/abs/2603.09062
作者: Seungha Hong,Sukang Chae,Suyeon Kim,Sanghwan Jang,Hwanjo Yu
类目: Machine Learning (cs.LG)
*备注: WWW 2026
Abstract:Online Time Series Forecasting (OTSF) requires models to continuously adapt to concept drift. However, existing methods often treat concept drift as a monolithic phenomenon. To address this limitation, we first redefine concept drift by categorizing it into two distinct types: Recurring Drift, where previously seen patterns reappear, and Emergent Drift, where entirely new patterns emerge. We then propose DynaME (Dynamic Multi-period Experts), a novel hybrid framework designed to effectively address this dual nature of drift. For Recurring Drift, DynaME employs a committee of specialized experts that are dynamically fitted to the most relevant historical periodic patterns at each time step. For Emergent Drift, the framework detects high-uncertainty scenarios and shifts reliance to a stable, general expert. Extensive experiments on several benchmark datasets and backbones demonstrate that DynaME effectively adapts to both types of concept drift and significantly outperforms existing baselines.
[LG-48] Quality over Quantity: Demonstration Curation via Influence Functions for Data-Centric Robot Learning ICRA2026
链接: https://arxiv.org/abs/2603.09056
作者: Haeone Lee,Taywon Min,Junsu Kim,Sinjae Kang,Fangchen Liu,Lerrel Pinto,Kimin Lee
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted to ICRA 2026, 8 pages
Abstract:Learning from demonstrations has emerged as a promising paradigm for end-to-end robot control, particularly when scaled to diverse and large datasets. However, the quality of demonstration data, often collected through human teleoperation, remains a critical bottleneck for effective data-driven robot learning. Human errors, operational constraints, and teleoperator variability introduce noise and suboptimal behaviors, making data curation essential yet largely manual and heuristic-driven. In this work, we propose Quality over Quantity (QoQ), a grounded and systematic approach to identifying high-quality data by defining data quality as the contribution of each training sample to reducing loss on validation demonstrations. To efficiently estimate this contribution, we leverage influence functions, which quantify the impact of individual training samples on model performance. We further introduce two key techniques to adapt influence functions for robot demonstrations: (i) using maximum influence across validation samples to capture the most relevant state-action pairs, and (ii) aggregating influence scores of state-action pairs within the same trajectory to reduce noise and improve data coverage. Experiments in both simulated and real-world settings show that QoQ consistently improves policy performance over prior data selection methods.
[LG-49] FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation
链接: https://arxiv.org/abs/2603.09046
作者: Yinpeng Wu,Yitong Chen,Lixiang Wang,Jinyu Gu,Zhichao Hua,Yubin Xia
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Operating Systems (cs.OS)
*备注: 13 pages, 11 figures
Abstract:Device-side Large Language Models (LLMs) have witnessed explosive growth, offering higher privacy and availability compared to cloud-side LLMs. During LLM inference, both model weights and user data are valuable, and attackers may even compromise the OS kernel to steal them. ARM TrustZone is the de facto hardware-based isolation technology on mobile devices, used to protect sensitive applications from a compromised OS. However, protecting LLM inference with TrustZone incurs significant overhead due to its inflexible isolation of memory and the NPU. To address these challenges, this paper introduces FlexServe, a fast and secure LLM serving system for mobile devices. It first introduces a Flexible Resource Isolation mechanism to construct Flexible Secure Memory (Flex-Mem) and Flexible Secure NPU (Flex-NPU). Both memory pages and the NPU can be efficiently switched between unprotected and protected modes. Based on these mechanisms, FlexServe designs a fast and secure LLM inference framework within TrustZone’s secure world. The LLM-Aware Memory Management and Secure Inference Pipeline are introduced to accelerate inference. A Multi-Model Scheduler is proposed to optimize multi-model workflows. We implement a prototype of FlexServe and compare it with two TrustZone-based strawman designs. The results show that FlexServe achieves an average 10.05\times speedup in Time to First Token (TTFT) compared to the strawman, and an average 2.44\times TTFT speedup compared to an optimized strawman with pipeline and secure NPU enabled. For multi-model agent workflows, the end-to-end speedup is up to 24.30\times and 4.05\times compared to the strawman and optimized strawman, respectively.
[LG-50] SCALAR: Learning and Composing Skills through LLM Guided Symbolic Planning and Deep RL Grounding NEURIPS2025
链接: https://arxiv.org/abs/2603.09036
作者: Renos Zabounidis,Yue Wu,Simon Stepputtis,Woojun Kim,Yuanzhi Li,Tom Mitchell,Katia Sycara
类目: Machine Learning (cs.LG)
*备注: Best Paper Award Honorable Mention at NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning
Abstract:LM-based agents excel when given high-level action APIs but struggle to ground language into low-level control. Prior work has LLMs generate skills or reward functions for RL, but these one-shot approaches lack feedback to correct specification errors. We introduce SCALAR, a bidirectional framework coupling LLM planning with RL through a learned skill library. The LLM proposes skills with preconditions and effects; RL trains policies for each skill and feeds back execution results to iteratively refine specifications, improving robustness to initial errors. Pivotal Trajectory Analysis corrects LLM priors by analyzing RL trajectories; Frontier Checkpointing optionally saves environment states at skill boundaries to improve sample efficiency. On Craftax, SCALAR achieves 88.2% diamond collection, a 1.9x improvement over the best baseline, and reaches the Gnomish Mines 9.1% of the time where prior methods fail entirely.
[LG-51] Two Teachers Better Than One: Hardware-Physics Co-Guided Distributed Scientific Machine Learning
链接: https://arxiv.org/abs/2603.09032
作者: Yuchen Yuan,Junhuan Yang,Hao Wan,Yipei Liu,Hanhan Wu,Youzuo Lin,Lei Yang
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Computational Engineering, Finance, and Science (cs.CE); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 7 pages, 9 figures. Accepted at the 63rd ACM/IEEE Design Automation Conference (DAC 2026), Long Beach, CA, July 2026
Abstract:Scientific machine learning (SciML) is increasingly applied to in-field processing, controlling, and monitoring; however, wide-area sensing, real-time demands, and strict energy and reliability constraints make centralized SciML implementation impractical. Most SciML models assume raw data aggregation at a central node, incurring prohibitively high communication latency and energy costs; yet, distributing models developed for general-purpose ML often breaks essential physical principles, resulting in degraded performance. To address these challenges, we introduce EPIC, a hardware- and physics-co-guided distributed SciML framework, using full-waveform inversion (FWI) as a representative task. EPIC performs lightweight local encoding on end devices and physics-aware decoding at a central node. By transmitting compact latent features rather than high-volume raw data and by using cross-attention to capture inter-receiver wavefield coupling, EPIC significantly reduces communication cost while preserving physical fidelity. Evaluated on a distributed testbed with five end devices and one central node, and across 10 datasets from OpenFWI, EPIC reduces latency by 8.9 \times and communication energy by 33.8 \times , while even improving reconstruction fidelity on 8 out of 10 datasets.
[LG-52] When to Retrain after Drift: A Data-Only Test of Post-Drift Data Size Sufficiency ICLR2026
链接: https://arxiv.org/abs/2603.09024
作者: Ren Fujiwara,Yasuko Matsubara,Yasushi Sakurai
类目: Machine Learning (cs.LG)
*备注: Accepted by ICLR 2026
Abstract:Sudden concept drift makes previously trained predictors unreliable, yet deciding when to retrain and what post-drift data size is sufficient is rarely addressed. We propose CALIPER - a detector- and model-agnostic, data-only test that estimates the post-drift data size required for stable retraining. CALIPER exploits state dependence in streams generated by dynamical systems: we run a single-pass weighted local regression over the post-drift window and track a one-step proxy error as a function of a locality parameter \theta. When an effective sample size gate is satisfied, a monotonically non-increasing trend in this error with increasing \theta indicates that the data size is sufficiently informative for retraining. We also provide a theoretical analysis of our method, and we show that the algorithm has low per-update time and memory costs. Across datasets from four heterogeneous domains, three learner families, and two detectors, CALIPER consistently matches or exceeds the best fixed data size for retraining while incurring negligible overhead and often outperforming incremental updates. CALIPER closes the gap between drift detection and data-sufficient adaptation in streaming learning.
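A minimal sketch of the kind of check CALIPER performs (the predictor below is an illustrative stand-in, not the paper's exact weighted local regression): compute a one-step proxy error under exponential locality weights and verify that it does not increase as the locality parameter \theta grows.

```python
import math

def one_step_proxy_error(series, theta):
    """Illustrative stand-in for CALIPER's proxy: predict each point as an
    exponentially weighted average of its past, with locality theta, and
    return the mean absolute one-step error over the window."""
    total, count = 0.0, 0
    for t in range(2, len(series)):
        weights = [math.exp(-theta * lag) for lag in range(1, t + 1)]
        past = series[t - 1 :: -1]  # most recent value first
        pred = sum(w * x for w, x in zip(weights, past)) / sum(weights)
        total += abs(series[t] - pred)
        count += 1
    return total / count

# On a smooth state-dependent stream the proxy error should not increase
# with theta -- the monotone trend CALIPER's sufficiency test looks for.
stream = [math.sin(0.2 * i) for i in range(60)]
errors = [one_step_proxy_error(stream, th) for th in (0.1, 1.0, 3.0)]
```

On a pure-noise window, by contrast, increasing locality would not help, and the trend check would fail, signaling that more post-drift data is needed.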
[LG-53] MAPLE: Elevating Medical Reasoning from Statistical Consensus to Process-Led Alignment
链接: https://arxiv.org/abs/2603.08987
作者: Kailong Fan,Anqi Pu,Yichen Wu,Wanhua Li,Yicong Li,Hanspeter Pfister,Huafeng Liu,Xiang Li,Quanzheng Li,Ning Guo
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent advances in medical large language models have explored Test-Time Reinforcement Learning (TTRL) to enhance reasoning. However, standard TTRL often relies on majority voting (MV) as a heuristic supervision signal, which can be unreliable in complex medical scenarios where the most frequent reasoning path is not necessarily the clinically correct one. In this work, we propose a novel and unified training paradigm that integrates medical process reward models with TTRL to bridge the gap between test-time scaling (TTS) and parametric model optimization. Specifically, we advance the TTRL framework by replacing conventional MV with a fine-grained, expert-aligned supervision paradigm using Med-RPM. This integration ensures that reinforcement learning is guided by medical correctness rather than mere consensus, effectively distilling search-based intelligence into the model's parametric memory. Extensive evaluations on four different benchmarks demonstrate that our method consistently and significantly outperforms current TTRL and standalone PRM selection. Our findings establish that transitioning from stochastic heuristics to structured, step-wise rewards is essential for developing reliable and scalable medical AI systems.
[LG-54] MAcPNN: Mutual Assisted Learning on Data Streams with Temporal Dependence
链接: https://arxiv.org/abs/2603.08972
作者: Federico Giannini,Emanuele Della Valle
类目: Machine Learning (cs.LG)
*备注:
Abstract:Internet of Things (IoT) Analytics often involves applying machine learning (ML) models on data streams. In such scenarios, traditional ML paradigms face obstacles related to continuous learning while dealing with concept drifts, temporal dependence, and avoiding forgetting. Moreover, in IoT, different edge devices build up a network. When learning models on those devices, connecting them could be useful for improving performance and reusing others' knowledge. This work proposes Mutual Assisted Learning, a learning paradigm grounded in Vygotsky's popular Sociocultural Theory of Cognitive Development. Each device is autonomous and does not need a central orchestrator. Whenever its performance degrades due to a concept drift, it asks for assistance from others and decides whether their knowledge is useful for solving the new problem. This way, the number of connections is drastically reduced compared to classical Federated Learning approaches, where the devices communicate at each training round. Every device is equipped with a Continuous Progressive Neural Network (cPNN) to handle the dynamic nature of data streams. We call this implementation Mutual Assisted cPNN (MAcPNN). To implement it, we extend cPNNs to make single-data-point predictions and apply quantization to reduce the memory footprint. Experimental results prove the effectiveness of MAcPNN in boosting performance on synthetic and real data streams.
[LG-55] The qs Inequality: Quantifying the Double Penalty of Mixture-of-Experts at Inference
链接: https://arxiv.org/abs/2603.08960
作者: Vignesh Adhinarayanan,Nuwan Jayasena
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
*备注: 10 pages, 6 tables
Abstract:Mixture-of-Experts (MoE) models deliver high quality at low training FLOPs, but this efficiency often vanishes at inference. We identify a double penalty that structurally disadvantages MoE architectures during decoding: first, expert routing fragments microbatches and reduces weight reuse; second, massive resident expert pools reduce high-bandwidth memory (HBM) headroom for the KV cache. This phenomenon, formalized as reuse fragmentation, pushes feed-forward networks (FFNs) into a bandwidth-bound regime, especially at long context lengths. We introduce the qs inequality, a predictive criterion that identifies when MoE is structurally disadvantaged relative to a quality-matched dense model. This criterion unifies sparsity s, the fraction of parameters activated per token, and the quality-equivalence factor q, the size multiplier required for a dense model to match MoE performance. Our evaluation across frontier models including DeepSeek-V3, Qwen3-235B, Grok-1, and Switch-C demonstrates that this fragmentation is a general architectural phenomenon. For DeepSeek-V3 at 128k context, this results in a 4.5x throughput advantage for a quality-matched dense baseline. Crucially, massive architectures like Switch-C can become infeasible on cluster sizes where a quality-matched dense model remains viable. Our results suggest that training-time FLOP efficiency is an incomplete proxy for inference-time performance in long-context serving. They also indicate that MoE may be best viewed as a training-time optimization, with distillation into dense models as a possible path toward inference-efficient deployment.
[LG-56] Optimizing Reinforcement Learning Training over Digital Twin Enabled Multi-fidelity Networks
链接: https://arxiv.org/abs/2603.08931
作者: Hanzhi Yu,Hasan Farooq,Julien Forgeat,Shruti Bothe,Kristijonas Cyras,Md Moin Uddin Chowdhury,Mingzhe Chen
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:In this paper, we investigate a novel digital network twin (DNT) assisted deep learning (DL) model training framework. In particular, we consider a physical network where a base station (BS) uses several antennas to serve multiple mobile users, and a DNT that is a virtual representation of the physical network. The BS must adjust its antenna tilt angles to optimize the data rates of all users. Due to user mobility, the BS may not be able to accurately track network dynamics such as wireless channels and user mobilities. Hence, a reinforcement learning (RL) approach is used to dynamically adjust the antenna tilt angles. To train the RL model, we can use data collected from the physical network and from the DNT. The data collected from the physical network is more accurate but incurs more communication overhead compared to the data collected from the DNT. Therefore, it is necessary to determine the ratio of data collected from the physical network and the DNT to improve the training of the RL model. We formulate this problem as an optimization problem whose goal is to jointly optimize the tilt angle adjustment policy and the data collection strategy, aiming to maximize the data rates of all users while constraining the time delay introduced by collecting data from the physical network. To solve this problem, we propose a hierarchical RL framework that integrates a robust adversarial loss and proximal policy optimization (PPO). Simulation results show that our proposed method reduces the physical network data collection delay by up to 28.01% compared to a hierarchical RL that uses vanilla PPO as the first-level RL, and by up to 1x compared to the baseline that uses robust RL at the first level and selects the data collection ratio randomly.
[LG-57] Quantifying Memorization and Privacy Risks in Genomic Language Models
链接: https://arxiv.org/abs/2603.08913
作者: Alexander Nemecek,Wenbiao Li,Xiaoqian Jiang,Jaideep Vaidya,Erman Ayday
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Genomics (q-bio.GN)
*备注: 13 pages
Abstract:Genomic language models (GLMs) have emerged as powerful tools for learning representations of DNA sequences, enabling advances in variant prediction, regulatory element identification, and cross-task transfer learning. However, as these models are increasingly trained or fine-tuned on sensitive genomic cohorts, they risk memorizing specific sequences from their training data, raising serious concerns around privacy, data leakage, and regulatory compliance. Despite growing awareness of memorization risks in general-purpose language models, little systematic evaluation exists for these risks in the genomic domain, where data exhibit unique properties such as a fixed nucleotide alphabet, strong biological structure, and individual identifiability. We present a comprehensive, multi-vector privacy evaluation framework designed to quantify memorization risks in GLMs. Our approach integrates three complementary risk assessment methodologies: perplexity-based detection, canary sequence extraction, and membership inference. These are combined into a unified evaluation pipeline that produces a worst-case memorization risk score. To enable controlled evaluation, we plant canary sequences at varying repetition rates into both synthetic and real genomic datasets, allowing precise quantification of how repetition and training dynamics influence memorization. We evaluate our framework across multiple GLM architectures, examining the relationship between sequence repetition, model capacity, and memorization risk. Our results establish that GLMs exhibit measurable memorization and that the degree of memorization varies across architectures and training regimes. These findings reveal that no single attack vector captures the full scope of memorization risk, underscoring the need for multi-vector privacy auditing as a standard practice for genomic AI systems.
[LG-58] Why Channel-Centric Models are not Enough to Predict End-to-End Performance in Private 5G: A Measurement Campaign and Case Study
链接: https://arxiv.org/abs/2603.08865
作者: Nils Jörgensen
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
Abstract:Communication-aware robot planning requires accurate predictions of wireless network performance. Current approaches rely on channel-level metrics such as received signal strength and signal-to-noise ratio, assuming these translate reliably into end-to-end throughput. We challenge this assumption through a measurement campaign in a private 5G industrial environment. We evaluate throughput predictions from a commercial ray-tracing simulator as well as data-driven Gaussian process regression models against measurements collected using a mobile robot. The study uses off-the-shelf user equipment in an underground, radio-shielded facility with detailed 3D modeling, representing a best-case scenario for prediction accuracy. The ray-tracing simulator captures the spatial structure of indoor propagation and predicts channel-level metrics with reasonable fidelity. However, it systematically over-predicts throughput, even in line-of-sight regions. The dominant error source is shown to be over-estimation of sustainable MIMO spatial layers: the simulator assumes near-uniform four-layer transmission while measurements reveal substantial adaptation between one and three layers. This mismatch inflates predicted throughput even when channel metrics appear accurate. In contrast, a Gaussian process model with a rational quadratic kernel achieves approximately two-thirds reduction in prediction error with near-zero bias by learning end-to-end throughput directly from measurements. These findings demonstrate that favorable channel conditions do not guarantee high throughput; communication-aware planners relying solely on channel-centric predictions risk overly optimistic trajectories that violate reliability requirements. Accurate throughput prediction for 5G systems requires either extensive calibration of link-layer models or data-driven approaches that capture real system behavior.
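The data-driven baseline from this study can be reproduced in a few lines of textbook GP regression with a rational quadratic kernel (a 1-D toy on synthetic data, not the paper's measurements; function names are illustrative).

```python
import numpy as np

def rq_kernel(xa, xb, alpha=1.0, length=0.5):
    """Rational quadratic kernel (a scale mixture of RBF kernels),
    the kernel family the study found to work well for throughput."""
    d2 = (xa[:, None] - xb[None, :]) ** 2
    return (1.0 + d2 / (2.0 * alpha * length**2)) ** (-alpha)

def gp_posterior_mean(x_train, y_train, x_test, noise=1e-2):
    """Textbook GP regression posterior mean; a 1-D stand-in for the
    position -> throughput model fitted from robot measurements."""
    K = rq_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    return rq_kernel(x_test, x_train) @ np.linalg.solve(K, y_train)

# Fit a toy "throughput vs. position" curve and check interpolation.
x = np.linspace(0.0, 3.0, 40)
y = np.sin(2.0 * x)  # illustrative ground truth, not real measurements
pred = gp_posterior_mean(x, y, x)
```

Because the model learns end-to-end throughput directly, it absorbs effects such as MIMO layer adaptation that a channel-centric simulator misses, which is the paper's central point.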
[LG-59] APPLV: Adaptive Planner Parameter Learning from Vision-Language-Action Model
链接: https://arxiv.org/abs/2603.08862
作者: Yuanjie Lu,Beichen Wang,Zhengqi Wu,Yang Li,Xiaomin Lin,Chengzhi Mao,Xuesu Xiao
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Autonomous navigation in highly constrained environments remains challenging for mobile robots. Classical navigation approaches offer safety assurances but require environment-specific parameter tuning; end-to-end learning bypasses parameter tuning but struggles with precise control in constrained spaces. To this end, recent robot learning approaches automate parameter tuning while retaining classical systems' safety, yet still face challenges in generalizing to unseen environments. Recently, Vision-Language-Action (VLA) models have shown promise by leveraging foundation models' scene understanding capabilities, but still struggle with precise control and inference latency in navigation tasks. In this paper, we propose Adaptive Planner Parameter Learning from Vision-Language-Action Model (APPLV). Unlike traditional VLA models that directly output actions, APPLV leverages pre-trained vision-language models with a regression head to predict planner parameters that configure classical planners. We develop two training strategies: supervised learning fine-tuning from collected navigation trajectories and reinforcement learning fine-tuning to further optimize navigation performance. We evaluate APPLV across multiple motion planners on the simulated Benchmark Autonomous Robot Navigation (BARN) dataset and in physical robot experiments. Results demonstrate that APPLV outperforms existing methods in both navigation performance and generalization to unseen environments.
[LG-60] Expressivity-Efficiency Tradeoffs for Hybrid Sequence Models
链接: https://arxiv.org/abs/2603.08859
作者: John Cooper,Ilias Diakonikolas,Mingchen Ma,Frederic Sala
类目: Machine Learning (cs.LG)
*备注:
Abstract:Hybrid sequence models–combining Transformer and state-space model layers–seek to gain the expressive versatility of attention as well as the computational efficiency of state-space model layers. Despite burgeoning interest in hybrid models, we lack a basic understanding of the settings where–and underlying mechanisms through which–they offer benefits over their constituent models. In this paper, we study this question, focusing on a broad family of core synthetic tasks. For this family of tasks, we prove the existence of fundamental limitations for non-hybrid models. Specifically, any Transformer or state-space model that solves the underlying task requires either a large number of parameters or a large working memory. On the other hand, for two prototypical tasks within this family–namely selective copying and associative recall–we construct hybrid models of small size and working memory that provably solve these tasks, thus achieving the best of both worlds. Our experimental evaluation empirically validates our theoretical findings. Importantly, going beyond the settings in our theoretical analysis, we empirically show that learned–rather than constructed–hybrids outperform non-hybrid models with up to 6x as many parameters. We additionally demonstrate that hybrid models exhibit stronger length generalization and out-of-distribution robustness than non-hybrids.
[LG-61] SoftJAX SoftTorch: Empowering Automatic Differentiation Libraries with Informative Gradients
链接: https://arxiv.org/abs/2603.08824
作者: Anselm Paulus,A. René Geist,Vít Musil,Sebastian Hoffmann,Onur Beker,Georg Martius
类目: Machine Learning (cs.LG)
*备注:
Abstract:Automatic differentiation (AD) frameworks such as JAX and PyTorch have enabled gradient-based optimization for a wide range of scientific fields. Yet, many “hard” primitives in these libraries such as thresholding, Boolean logic, discrete indexing, and sorting operations yield zero or undefined gradients that are not useful for optimization. While numerous “soft” relaxations have been proposed that provide informative gradients, the respective implementations are fragmented across projects, making them difficult to combine and compare. This work introduces SoftJAX and SoftTorch, open-source, feature-complete libraries for soft differentiable programming. These libraries provide a variety of soft functions as drop-in replacements for their hard JAX and PyTorch counterparts. This includes (i) elementwise operators such as clip or abs, (ii) utility methods for manipulating Booleans and indices via fuzzy logic, (iii) axiswise operators such as sort or rank – based on optimal transport or permutahedron projections, and (iv) full support for straight-through gradient estimation. Overall, SoftJAX and SoftTorch make the toolbox of soft relaxations easily accessible to differentiable programming, as demonstrated through benchmarking and a practical case study. Code is available at this http URL and this http URL.
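The gap between hard and soft primitives that motivates these libraries can be seen in a toy example. The sketch below is not the SoftJAX/SoftTorch API; it only illustrates in plain NumPy why a hard threshold gives zero gradient almost everywhere, while a sigmoid surrogate (the idea behind straight-through estimation) gives an informative one.

```python
import numpy as np

def hard_step(x):
    """Hard threshold: useful forward value, but gradient is 0 a.e."""
    return (x > 0).astype(float)

def soft_step(x, temperature=0.1):
    """Sigmoid relaxation of the step function."""
    return 1.0 / (1.0 + np.exp(-x / temperature))

def soft_step_grad(x, temperature=0.1):
    """Derivative of the relaxation: nonzero near the threshold."""
    s = soft_step(x, temperature)
    return s * (1.0 - s) / temperature

def straight_through(x, temperature=0.1):
    """Forward: hard value. Backward (conceptually): soft gradient."""
    return hard_step(x), soft_step_grad(x, temperature)

x = np.array([-0.5, -0.01, 0.01, 0.5])
value, grad = straight_through(x)
```

Inputs near the decision boundary receive much larger surrogate gradients than inputs far from it, which is what makes such relaxations useful drop-ins for optimization.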
[LG-62] The Temporal Markov Transition Field
链接: https://arxiv.org/abs/2603.08803
作者: Michael Leznik
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 13 pages, 2 figures
Abstract:The Markov Transition Field (MTF), introduced by Wang and Oates (2015), encodes a time series as a two-dimensional image by mapping each pair of time steps to the transition probability between their quantile states, estimated from a single global transition matrix. This construction is efficient when the transition dynamics are stationary, but produces a misleading representation when the process changes regime over time: the global matrix averages across regimes and the resulting image loses all information about when each dynamical regime was active. In this paper we introduce the Temporal Markov Transition Field (TMTF), an extension that partitions the series into K contiguous temporal chunks, estimates a separate local transition matrix for each chunk, and assembles the image so that each row reflects the dynamics local to its chunk rather than the global average. The resulting T \times T image has K horizontal bands of distinct texture, each encoding the transition dynamics of one temporal segment. We develop the formal definition, establish the key structural properties of the representation, work through a complete numerical example that makes the distinction from the global MTF concrete, analyse the bias–variance trade-off introduced by temporal chunking, and discuss the geometric interpretation of the local transition matrices in terms of process properties such as persistence, mean reversion, and trending behaviour. The TMTF is amplitude-agnostic and order-preserving, making it suitable as an input channel for convolutional neural networks applied to time series characterisation tasks.
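The chunked construction can be sketched directly from the abstract's description. The code below is an illustration, not the author's reference implementation; the bin count, chunk count, and smoothing constant are arbitrary choices.

```python
import numpy as np

def quantile_bins(series, n_bins):
    """Map each value to its quantile state in {0, ..., n_bins - 1}."""
    edges = np.quantile(series, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.searchsorted(edges, series)

def transition_matrix(states, n_bins):
    """Row-normalized transition counts (with tiny smoothing)."""
    W = np.full((n_bins, n_bins), 1e-6)
    for a, b in zip(states[:-1], states[1:]):
        W[a, b] += 1.0
    return W / W.sum(axis=1, keepdims=True)

def tmtf(series, n_bins=4, n_chunks=2):
    """Each row i of the T x T image uses the matrix local to i's chunk."""
    T = len(series)
    states = quantile_bins(series, n_bins)
    bounds = np.linspace(0, T, n_chunks + 1).astype(int)
    image = np.zeros((T, T))
    for c in range(n_chunks):
        lo, hi = bounds[c], bounds[c + 1]
        W = transition_matrix(states[lo:hi], n_bins)
        for i in range(lo, hi):
            image[i, :] = W[states[i], states]
    return image

rng = np.random.default_rng(1)
# Two regimes: a trending segment followed by a noisy flat segment.
series = np.concatenate([np.linspace(0, 1, 50),
                         0.5 + 0.05 * rng.standard_normal(50)])
img = tmtf(series, n_bins=4, n_chunks=2)
```

With n_chunks=1 this reduces to the global MTF; with n_chunks=2 the image acquires the two horizontal bands of distinct texture the abstract describes.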
[LG-63] SPREAD: Subspace Representation Distillation for Lifelong Imitation Learning ICRA
链接: https://arxiv.org/abs/2603.08763
作者: Kaushik Roy,Giovanni D’urso,Nicholas Lawrance,Brendan Tidd,Peyman Moghadam
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: IEEE International Conference on Robotics and Automation (ICRA) 2026
Abstract:A key challenge in lifelong imitation learning (LIL) is enabling agents to acquire new skills from expert demonstrations while retaining prior knowledge. This requires preserving the low-dimensional manifolds and geometric structures that underlie task representations across sequential learning. Existing distillation methods, which rely on L2-norm feature matching in raw feature space, are sensitive to noise and high-dimensional variability, often failing to preserve intrinsic task manifolds. To address this, we introduce SPREAD, a geometry-preserving framework that employs singular value decomposition (SVD) to align policy representations across tasks within low-rank subspaces. This alignment maintains the underlying geometry of multimodal features, facilitating stable transfer, robustness, and generalization. Additionally, we propose a confidence-guided distillation strategy that applies a Kullback-Leibler divergence loss restricted to the top-M most confident action samples, emphasizing reliable modes and improving optimization stability. Experiments on LIBERO, a lifelong imitation learning benchmark, show that SPREAD substantially improves knowledge transfer, mitigates catastrophic forgetting, and achieves state-of-the-art performance.
[LG-64] Robust Parameter and State Estimation in Multiscale Neuronal Systems Using Physics-Informed Neural Networks
链接: https://arxiv.org/abs/2603.08742
作者: Changliang Wei,Yangyang Wang,Xueyu Zhu
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注:
Abstract:Inferring biophysical parameters and hidden state variables from partial and noisy observations is a fundamental challenge in computational neuroscience. This problem is particularly difficult for fast-slow spiking and bursting models, where strong nonlinearities, multiscale dynamics, and limited observational data often lead to severe sensitivity to initial parameter guesses and convergence failure in methods relying on traditional numerical forward solvers. In this work, we developed a physics-informed neural network (PINN) framework for the joint reconstruction of unobserved state variables and the estimation of unknown biophysical parameters in neuronal models. We demonstrate the effectiveness of the method on biophysical neuron models, including the Morris-Lecar model across multiple spiking and bursting regimes and a respiratory model neuron. The method requires only partial voltage observations over short observation windows and remains robust even when initialized with non-informative parameter guesses. These results suggest that PINNs can deliver robust and accurate parameter inference and state reconstruction, providing a promising alternative for inverse problems in multiscale neuronal dynamics, where traditional techniques often struggle.
[LG-65] The AetherFloat Family: Block-Scale-Free Quad-Radix Floating-Point Architectures for AI Accelerators
链接: https://arxiv.org/abs/2603.08741
作者: Keita Morisaki
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:
Abstract:The IEEE 754 floating-point standard is the bedrock of modern computing, but its structural requirements – a hidden leading bit, Base-2 bit-level normalization, and Sign-Magnitude encoding – impose significant silicon area and power overhead in massively parallel Neural Processing Units (NPUs). Furthermore, the industry’s recent shift to 8-bit formats (e.g., FP8 E4M3, OCP MX formats) has introduced a new hardware penalty: the strict necessity of Block-Scaling (AMAX) logic to prevent out-of-bound Large Language Model (LLM) activations from overflowing and degrading accuracy. The AetherFloat Family is a parameterizable architectural replacement designed from first principles for Hardware/Software Co-Design in AI acceleration. By synthesizing Lexicographic One’s Complement Unpacking, Quad-Radix (Base-4) Scaling, and an Explicit Mantissa, AetherFloat achieves zero-cycle native integer comparability, branchless subnormal handling, and a verified 33.17% area, 21.99% total power, and 11.73% critical path delay reduction across the multiply-accumulate (MAC) unit. Instantiated as AetherFloat-8 (AF8), the architecture relies on a purely explicit 3-bit mantissa. Combined with Base-4 scaling, AF8 delivers a substantially wider dynamic range, acting as a “Block-Scale-Free” format for inference that circumvents dynamic scaling microarchitecture. Finally, a novel Vector-Shared 32-bit Galois Stochastic Rounding topology bounds precision variance while neutralizing the vanishing gradients that plague legacy formats. While AF16 serves as a near-lossless bfloat16 replacement via post-training quantization, AF8 is designed as a QAT-first inference format: its Block-Scale-Free property eliminates dynamic AMAX hardware at the cost of requiring quantization-aware fine-tuning for deployment.
[LG-66] Hebbian-Oscillatory Co-Learning
链接: https://arxiv.org/abs/2603.08731
作者: Hasi Hays
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:
Abstract:We introduce Hebbian-Oscillatory Co-Learning (HOC-L), a unified two-timescale dynamical framework for joint structural plasticity and phase synchronization in bio-inspired sparse neural architectures. HOC-L couples two recent frameworks: the hyperbolic sparse geometry of Resonant Sparse Geometry Networks (RSGN), which employs Poincaré ball embeddings with Hebbian-driven dynamic sparsity, and the oscillator-based attention of Selective Synchronization Attention (SSA), which replaces dot-product attention with Kuramoto-type phase-locking dynamics. The key mechanism is synchronization-gated plasticity: the macroscopic order parameter r(t) of the oscillator ensemble gates Hebbian structural updates, so that connectivity consolidation occurs only when sufficient phase coherence signals a meaningful computational pattern. We prove convergence of the joint system to a stable equilibrium via a composite Lyapunov function and derive explicit timescale separation bounds. The resulting architecture achieves O(n \cdot k) complexity with k \ll n , preserving the sparsity of both parent frameworks. Numerical simulations confirm the theoretical predictions, demonstrating emergent cluster-aligned connectivity and monotonic Lyapunov decrease.
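The synchronization-gated plasticity mechanism admits a compact sketch: a Kuramoto ensemble supplies the order parameter r(t), which gates a Hebbian outer-product weight update. All constants below are illustrative choices, not taken from the paper, and the hyperbolic geometry of RSGN is omitted entirely.

```python
import numpy as np

def order_parameter(phases):
    """Macroscopic Kuramoto order parameter r in [0, 1]."""
    return np.abs(np.mean(np.exp(1j * phases)))

def kuramoto_step(phases, omega, K=2.0, dt=0.05):
    """One Euler step of the mean-field Kuramoto dynamics."""
    coupling = np.mean(np.sin(phases[None, :] - phases[:, None]), axis=1)
    return phases + dt * (omega + K * coupling)

def gated_hebbian_update(W, activity, r, eta=0.1, r_threshold=0.5):
    """Hebbian update scaled by coherence; no learning while incoherent."""
    gate = max(0.0, r - r_threshold)
    return W + eta * gate * np.outer(activity, activity)

rng = np.random.default_rng(2)
n = 20
phases = rng.uniform(0, 2 * np.pi, n)
omega = rng.normal(1.0, 0.05, n)          # similar frequencies -> synchronize
W = np.zeros((n, n))
activity = rng.standard_normal(n)
for _ in range(400):                       # fast timescale: phases
    phases = kuramoto_step(phases, omega)
    W = gated_hebbian_update(W, activity, order_parameter(phases))  # slow
```

Because the frequencies are narrowly spread and the coupling is strong, the ensemble phase-locks, r(t) rises above the threshold, and only then do the weights consolidate, mirroring the two-timescale separation the abstract proves convergence for.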
[LG-67] Memory-Augmented Spiking Networks: Synergistic Integration of Complementary Mechanisms for Neuromorphic Vision
链接: https://arxiv.org/abs/2603.08730
作者: Effiong Blessing,Chiung-Yi Tseng,Isaac Nkrumah,Junaid Rehman
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:
Abstract:Spiking Neural Networks (SNNs) provide biological plausibility and energy efficiency, yet systematic investigations of memory augmentation strategies remain limited. We conduct a five-model ablation study integrating Leaky Integrate-and-Fire neurons, Supervised Contrastive Learning (SCL), Hopfield networks, and Hierarchical Gated Recurrent Networks (HGRN) on the N-MNIST dataset. Baseline SNNs exhibit organized neuronal groupings, or structured assemblies, characterized by a silhouette score of 0.687 \pm 0.012 . Individual augmentations introduce trade-offs: SCL improves accuracy by 0.28% but reduces clustering (silhouette score 0.637 \pm 0.015 ), while HGRN yields consistent gains in both accuracy ( +1.01% ) and computational efficiency ( 170.6\times ). Full integration achieves a balanced improvement across metrics, reaching a silhouette score of 0.715 \pm 0.008 , classification accuracy of 97.49 \pm 0.10% , energy consumption of 1.85 \pm 0.06 \mu\mathrm{J} , and sparsity of 97.0% . These results indicate that optimal performance emerges from architectural balance rather than isolated optimization, establishing design principles for memory-augmented neuromorphic systems.
[LG-68] Data-Rate-Aware High-Speed CNN Inference on FPGAs
链接: https://arxiv.org/abs/2603.08726
作者: Tobias Habermann,Martin Kumm
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:
Abstract:Dataflow-based CNN accelerators on FPGAs achieve low latency and high throughput by mapping computations of each layer directly to corresponding hardware units. However, layers such as pooling and strided convolutions reduce the data at their output with respect to their input, strongly affecting the data rate of the following layers. This leads to underutilization in fully unrolled designs. While prior work introduced data-rate-aware layer-wise adaptation, determining the most efficient implementation remains challenging. This paper presents a data-rate-aware CNN accelerator architecture for multi-pixel processing. Building on existing analytical models, the proposed method performs design-space exploration to identify configurations that improve hardware utilization and resource efficiency while preserving continuous flow of data, keeping all hardware units busy. Experimental results show substantial reductions in arithmetic resources compared to previous designs, enabling efficient implementation of complex CNNs on a single FPGA across a wide range of data rates.
[LG-69] KernelCraft: Benchmarking for Agentic Close-to-Metal Kernel Generation on Emerging Hardware
链接: https://arxiv.org/abs/2603.08721
作者: Jiayi Nie,Haoran Wu,Yao Lai,Zeyu Cao,Cheng Zhang,Binglei Lou,Erwei Wang,Jianyi Cheng,Timothy M. Jones,Robert Mullins,Rika Antonova,Yiren Zhao
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:
Abstract:New AI accelerators with novel instruction set architectures (ISAs) often require developers to manually craft low-level kernels – a time-consuming, laborious, and error-prone process that cannot scale across diverse hardware targets. This prevents emerging hardware platforms from reaching the market efficiently. While prior LLM-based code generation has shown promise in mature GPU ecosystems, it remains unclear whether agentic LLM systems can quickly produce valid and efficient kernels for emerging hardware with new ISAs. We present KernelCraft: the first benchmark to evaluate an LLM agent’s ability to generate and optimize low-level kernels for customized accelerators via a function-calling, feedback-driven workflow. Within KernelCraft, the agent refines kernels under ISA and hardware constraints using automated feedback derived from compilation checks, simulation, and correctness validation against ground truth. In our experiments, we assess agent performance across three emerging accelerator platforms on more than 20 ML tasks, each with 5 diverse task configurations, with special evaluation of task configuration complexity. Across four leading reasoning models, top agents produce functionally valid kernels for previously unseen ISAs within a few refinement steps, with optimized kernels that match or outperform template-based compiler baselines. With that, we demonstrate the potential for reducing the cost of kernel development for accelerator designers and kernel developers.
[LG-70] Equitable Multi-Task Learning for AI-RANs
链接: https://arxiv.org/abs/2603.08717
作者: Panayiotis Raptis,Fatih Aslan,George Iosifidis
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 6 pages, 3 figures
Abstract:AI-enabled Radio Access Networks (AI-RANs) are expected to serve heterogeneous users with time-varying learning tasks over shared edge resources. Ensuring equitable inference performance across these users requires adaptive and fair learning mechanisms. This paper introduces an online-within-online fair multi-task learning (OWO-FMTL) framework that ensures long-term equity across users. The method combines two learning loops: an outer loop updating the shared model across rounds and an inner loop rebalancing user priorities within each round with a lightweight primal-dual update. Equity is quantified via generalized alpha-fairness, allowing a trade-off between efficiency and fairness. The framework guarantees diminishing performance disparity over time and operates with low computational overhead suitable for edge deployment. Experiments on convex and deep learning tasks confirm that OWO-FMTL outperforms existing multi-task learning baselines under dynamic scenarios.
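The generalized alpha-fairness utility the framework quantifies equity with has a standard closed form, sketched here for illustration (the paper's exact objective and scaling may differ): alpha = 0 recovers the plain sum (pure efficiency), alpha -> 1 gives proportional fairness (sum of logs), and larger alpha weights the worst-off users more heavily.

```python
import numpy as np

def alpha_fair_utility(x, alpha):
    """Generalized alpha-fairness: sum of x^(1-a)/(1-a), log-sum at a = 1."""
    x = np.asarray(x, dtype=float)
    if np.isclose(alpha, 1.0):
        return np.sum(np.log(x))
    return np.sum(x ** (1.0 - alpha) / (1.0 - alpha))

rates = np.array([1.0, 2.0, 4.0])          # per-user performance levels
u_eff = alpha_fair_utility(rates, 0.0)     # plain sum: 7.0
u_prop = alpha_fair_utility(rates, 1.0)    # proportional fairness: log(8)
u_harsh = alpha_fair_utility(rates, 2.0)   # emphasizes the worst-off user
```

Sweeping alpha traces exactly the efficiency-fairness trade-off the abstract refers to: maximizing u_eff ignores disparity, while maximizing u_harsh sacrifices total throughput to lift the weakest user.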
[LG-71] Global universality via discrete-time signatures
链接: https://arxiv.org/abs/2603.09773
作者: Mihriban Ceylan,David J. Prömel
类目: Probability (math.PR); Machine Learning (cs.LG); Mathematical Finance (q-fin.MF)
*备注:
Abstract:We establish global universal approximation theorems on spaces of piecewise linear paths, stating that linear functionals of the corresponding signatures are dense with respect to L^p - and weighted norms, under an integrability condition on the underlying weight function. As an application, we show that piecewise linear interpolations of Brownian motion satisfy this integrability condition. Consequently, we obtain L^p -approximation results for path-dependent functionals, random ordinary differential equations, and stochastic differential equations driven by Brownian motion.
[LG-72] Evolution of Photonic Quantum Machine Learning under Noise
链接: https://arxiv.org/abs/2603.09645
作者: A.M.A.S.D. Alagiyawanna,Asoka Karunananda
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 26 pages, 9 figures. Review article. Currently under review at Quantum Machine Intelligence (Springer Nature)
Abstract:Photonic Quantum Machine Learning (PQML) is an emerging approach that integrates photonic quantum computing technologies with machine learning techniques to enable scalable and energy-efficient quantum information processing. Photonic systems offer advantages such as room-temperature operation, high-speed signal processing, and the ability to represent information in high-dimensional Hilbert spaces. However, noise remains a major challenge affecting the performance, reliability, and scalability of PQML implementations. This review provides a systematic analysis of noise sources in photonic quantum machine learning systems. We discuss photonic quantum computing architectures and examine key quantum machine learning algorithms implemented on photonic platforms, including Variational Quantum Circuits, Quantum Neural Networks, and Quantum Support Vector Machines. The paper categorizes major noise mechanisms and analyzes their impact on learning performance, training stability, and convergence behavior. Furthermore, we review both traditional and advanced noise characterization techniques and survey recent strategies for noise mitigation in photonic quantum systems. Finally, we highlight recent experimental advances and discuss future research directions for developing robust and scalable PQML systems under realistic noise conditions.
[LG-73] a-TMFG: Scalable Triangulated Maximally Filtered Graphs via Approximate Nearest Neighbors
链接: https://arxiv.org/abs/2603.09564
作者: Lionel Yelibi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:The traditional Triangulated Maximally Filtered Graph (TMFG) construction requires pre-computation and storage of a dense correlation matrix; this limits its applicability to small and medium-sized datasets. Here we identify key memory and runtime complexity challenges when using TMFG at scale. We then present the Approximate Triangulated Maximally Filtered Graph (a-TMFG) algorithm, a novel approach, inspired by TMFG, to scaling graph construction from data. The method employs k-Nearest Neighbors Graphs (kNNG) for initial construction, and implements a memory management strategy to search and estimate missing correlations on-the-fly, keeping the combinatorial explosion under control. The algorithm is tested for robustness to parameters and noise, and is evaluated on datasets with millions of observations. This new method provides a parsimonious way to construct graphs for use-cases where graphs serve as input to supervised and unsupervised learning but where no natural graph exists.
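The memory-saving idea, computing correlations on the fly instead of storing the dense matrix, can be illustrated with a row-at-a-time kNN graph. This sketch is not the a-TMFG algorithm itself (it omits the triangulation/filtering step entirely); it only shows the kNNG seed built without materializing the full p x p correlation matrix.

```python
import numpy as np

def knn_correlation_graph(X, k=3):
    """X: (n_obs, p) data. Returns the edge set of each column's top-k
    |correlation| neighbours, computing one correlation row at a time so
    the dense p x p matrix is never stored."""
    n, p = X.shape
    Z = (X - X.mean(0)) / X.std(0)
    edges = set()
    for i in range(p):
        row = np.abs(Z[:, i] @ Z / n)      # correlation row for column i
        row[i] = -np.inf                   # exclude self-loop
        for j in np.argsort(row)[-k:]:
            edges.add((min(i, j), max(i, j)))
    return edges

rng = np.random.default_rng(3)
base = rng.standard_normal((500, 1))
# Four strongly correlated columns followed by four pure-noise columns.
X = np.hstack([base + 0.1 * rng.standard_normal((500, 4)),
               rng.standard_normal((500, 4))])
edges = knn_correlation_graph(X, k=2)
```

The correlated block ends up densely interconnected while the noise columns attach only weakly, which is the structure a subsequent TMFG-style filter would refine.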
[LG-74] What Do We Care About in Bandits with Noncompliance? BRACE: Bandits with Recommendations Abstention and Certified Effects
链接: https://arxiv.org/abs/2603.09532
作者: Nicolás Della Penna
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Bandits with noncompliance separate the learner’s recommendation from the treatment actually delivered, so the learning target itself must be chosen. A platform may care about recommendation welfare in the current mediated workflow, treatment learning for a future direct-control regime, or anytime-valid uncertainty for one of those targets. These objectives need not agree. We formalize this objective-choice problem, identify the direct-control regime in which recommendation and treatment objectives collapse, and show by example that recommendation welfare can strictly exceed every learner-measurable treatment policy when downstream actors use private information. For finite-context square-IV problems we propose BRACE, a parameter-free phase-doubling algorithm that performs IV inversion only after matrix certification and otherwise returns full-range but honest structural intervals. BRACE delivers simultaneous policy-value validity, fixed-gap identification of the operationally optimal recommendation policy, and fixed-gap identification of the structurally optimal treatment policy under contextual homogeneity and invertibility. We complement the theory with a finite-context empirical benchmark spanning direct control, mediated present-versus-future tradeoffs, weak identification, homogeneity failure, and rectangular overidentification. The experiments show that safety appears as regret on easy problems, as abstention and wide valid intervals under weak identification, as a reason to prefer recommendation welfare under homogeneity failure, and as tighter structural uncertainty when extra instruments are available. For rich contexts, we also derive an orthogonal score whose conditional bias factorizes into compliance-model and outcome-model errors, clarifying what must be stabilized for anytime-valid semiparametric IV inference.
[LG-75] Flow Field Reconstruction via Voronoi-Enhanced Physics-Informed Neural Networks with End-to-End Sensor Placement Optimization
链接: https://arxiv.org/abs/2603.09371
作者: Renjie Xiao,Bingteng Sun,Yiling Chen,Lin Lu,Qiang Du,Junqiang Zhu
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注: 36 pages, 9 figures
Abstract:(short version abstract, full in article)High-fidelity flow field reconstruction is important in fluid dynamics, but it is challenged by sparse and spatiotemporally incomplete sensor measurements, as well as failures of pre-deployed measurement points that can invalidate pre-trained reconstruction models. Physics-informed neural networks (PINNs) alleviate dependence on large labeled datasets by incorporating governing physics, yet sensor placement optimization, a key factor in reconstruction accuracy and robustness, remains underexplored. In this study, we propose a PINN with Voronoi-enhanced Sensor Optimization (VSOPINN). VSOPINN enables differentiable soft Voronoi construction for sparse sensor data rasterization, end-to-end fusion of centroidal Voronoi tessellation (CVT) with PINNs for adaptive sensor placement, and unified layout optimization for multi-condition flow reconstruction through a shared encoder-multi-decoder architecture. We validate VSOPINN on three representative problems: lid-driven cavity flow, vascular flow, and annular rotating flow. Results show that VSOPINN significantly improves reconstruction accuracy across different Reynolds numbers, adaptively learns effective sensor layouts, and remains robust under partial sensor failure. The study clarifies the intrinsic relationship between sensor placement and reconstruction precision in PINN-based flow field reconstruction.
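A differentiable "soft" Voronoi assignment of the kind the abstract describes can be sketched with a softmax over negative squared distances: it hardens to the true Voronoi partition as the temperature goes to zero, yet stays differentiable with respect to sensor positions. This is an illustration of the general idea, not VSOPINN's construction.

```python
import numpy as np

def soft_voronoi(grid, sensors, temperature=0.1):
    """grid: (m, 2) points, sensors: (s, 2). Returns (m, s) soft weights
    via a softmax over negative squared distances."""
    d2 = np.sum((grid[:, None, :] - sensors[None, :, :]) ** 2, axis=-1)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def rasterize(grid, sensors, values, temperature=0.1):
    """Spread sparse sensor values over the grid via soft Voronoi weights."""
    return soft_voronoi(grid, sensors, temperature) @ values

xs = np.linspace(0, 1, 11)
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)
sensors = np.array([[0.2, 0.2], [0.8, 0.8]])      # two sensor locations
values = np.array([1.0, 3.0])                     # sparse measurements
field = rasterize(grid, sensors, values)
```

Because every operation is smooth, gradients of a reconstruction loss can flow back into the sensor coordinates, which is what enables the end-to-end placement optimization the paper targets.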
[LG-76] On Regret Bounds of Thompson Sampling for Bayesian Optimization
链接: https://arxiv.org/abs/2603.09276
作者: Shion Takeno,Shogo Iwazaki
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 42 pages
Abstract:We study a widely used Bayesian optimization method, Gaussian process Thompson sampling (GP-TS), under the assumption that the objective function is a sample path from a GP. Compared with the GP upper confidence bound (GP-UCB) with established high-probability and expected regret bounds, most analyses of GP-TS have been limited to expected regret. Moreover, whether the recent analyses of GP-UCB for the lenient regret and the improved cumulative regret upper bound can be applied to GP-TS remains unclear. To fill these gaps, this paper shows several regret bounds: (i) a regret lower bound for GP-TS, which implies that GP-TS suffers from a polynomial dependence on 1/\delta with probability \delta , (ii) an upper bound of the second moment of cumulative regret, which directly suggests an improved regret upper bound on \delta , (iii) expected lenient regret upper bounds, and (iv) an improved cumulative regret upper bound on the time horizon T . Along the way, we provide several useful lemmas, including a relaxation of the necessary condition from recent analysis to obtain improved regret upper bounds on T .
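One round of GP-TS, the algorithm under analysis, can be sketched on a discretized domain: sample a path from the GP posterior over a candidate grid and query its maximizer. Everything below (kernel, noise level, toy objective) is an illustrative choice, not the paper's setting.

```python
import numpy as np

def rbf(x1, x2, ls=0.3):
    """Squared-exponential kernel on 1-D inputs."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ls**2)

def gp_posterior(x_obs, y_obs, x_grid, noise=1e-3):
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_obs, x_grid)
    Kss = rbf(x_grid, x_grid)
    mean = Ks.T @ np.linalg.solve(K, y_obs)
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mean, cov

def ts_round(x_obs, y_obs, x_grid, rng):
    """Thompson sampling: draw a posterior path, act on its argmax."""
    mean, cov = gp_posterior(x_obs, y_obs, x_grid)
    sample = rng.multivariate_normal(mean, cov + 1e-6 * np.eye(len(x_grid)))
    return x_grid[np.argmax(sample)]

rng = np.random.default_rng(4)
f = lambda x: -(x - 0.7) ** 2                 # unknown objective, max at 0.7
x_grid = np.linspace(0, 1, 101)
x_obs = np.array([0.1, 0.5, 0.9])
y_obs = f(x_obs)
regrets = []
for _ in range(30):
    x_next = ts_round(x_obs, y_obs, x_grid, rng)
    regrets.append(f(0.7) - f(x_next))        # instantaneous regret
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, f(x_next))
```

The random draw is what distinguishes GP-TS from GP-UCB, and it is exactly this randomness that makes high-probability (rather than expected) regret bounds, the paper's subject, delicate to establish.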
[LG-77] A Generative Sampler for distributions with possible discrete parameter based on Reversibility
链接: https://arxiv.org/abs/2603.09251
作者: Lei Li,Zhen Wang,Lishuo Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:Learning to sample from complex unnormalized distributions is a fundamental challenge in computational physics and machine learning. While score-based and variational methods have achieved success in continuous domains, extending them to discrete or mixed-variable systems remains difficult due to ill-defined gradients or high variance in estimators. We propose a unified, target-gradient-free generative sampling framework applicable across diverse state spaces. Building on the fact that detailed balance implies the time-reversibility of the equilibrium stochastic process, we enforce this symmetry as a statistical constraint. Specifically, using a prescribed physical transition kernel (such as Metropolis-Hastings), we minimize the Maximum Mean Discrepancy (MMD) between the joint distributions of forward and backward Markov trajectories. Crucially, this training procedure relies solely on energy evaluations via acceptance ratios, circumventing the need for target score functions or continuous relaxations. We demonstrate the versatility of our method on three distinct benchmarks: (1) a continuous multi-modal Gaussian mixture, (2) the discrete high-dimensional Ising model, and (3) a challenging hybrid system coupling discrete indices with continuous dynamics. Experiments show that our framework accurately reproduces thermodynamic observables and captures mode-switching behavior across all regimes, offering a physically grounded and universally applicable alternative for equilibrium sampling.
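The energy-only property the abstract relies on is inherited from the Metropolis-Hastings kernel: with a symmetric proposal, its acceptance ratio needs only energy differences, never gradients of the target, and detailed balance of this kernel is what makes equilibrium trajectories time-reversible. A minimal sketch on a 1-D Gaussian target:

```python
import numpy as np

def mh_step(x, energy, rng, step=1.0):
    """One Metropolis-Hastings step with a symmetric Gaussian proposal.
    Only energy evaluations are used: log accept = E(x) - E(x')."""
    proposal = x + step * rng.standard_normal()
    log_accept = energy(x) - energy(proposal)
    if np.log(rng.uniform()) < log_accept:
        return proposal
    return x

energy = lambda x: 0.5 * x ** 2          # unnormalized N(0, 1) target
rng = np.random.default_rng(5)
x, samples = 0.0, []
for _ in range(20000):
    x = mh_step(x, energy, rng)
    samples.append(x)
samples = np.array(samples[2000:])       # discard burn-in
```

The paper's framework trains a generative sampler by matching the distribution of forward and time-reversed trajectories of such a kernel; this sketch only shows the gradient-free kernel itself, which applies unchanged to discrete states by swapping the proposal.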
[LG-78] Verifying Good Regulator Conditions for Hypergraph Observers: Natural Gradient Learning from Causal Invariance via Established Theorems DATE
链接: https://arxiv.org/abs/2603.09067
作者: Max Zhuravlev
类目: Machine Learning (stat.ML); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Mathematical Physics (math-ph)
*备注: 18 pages, 15 formal results. Part of a series of companion papers submitted simultaneously; cross-references updated with arXiv IDs in v2
Abstract:We verify that persistent observers in causally invariant hypergraph substrates satisfy the conditions of the Conant-Ashby Good Regulator Theorem. Building on Wolfram’s hypergraph physics and Vanchurin’s neural network cosmology, we formalize persistent observers as entities that minimize prediction error at their boundary with the environment. Applying a modern reformulation of the Conant-Ashby theorem, we demonstrate that hypergraph observers satisfy Good Regulator conditions, requiring them to maintain internal models. Once an internal model with loss function exists, the emergence of a Fisher information metric follows from standard information geometry. Invoking Amari’s uniqueness theorem for reparameterization-invariant gradients, we show that natural gradient descent is the unique admissible learning rule. Under the ansatz M=F^2 for exponential family observers and one specific convergence time functional, we derive a closed-form formula for the regime parameter alpha in Vanchurin’s Type II framework, with a quantum-classical threshold at kappa(F)=2. However, three alternative convergence models do not reproduce this result, so this prediction is strongly model-dependent. We further introduce the directional regime parameter alpha_v_k and the trace-free deviation tensor, showing that a single observer can simultaneously occupy different Vanchurin regimes along different eigendirections of the Fisher metric. This connects Wolfram and Vanchurin frameworks through established theorems, providing approximately 25-30% novel contribution.
[LG-79] Adaptive Active Learning for Online Reliability Prediction of Satellite Electronics
链接: https://arxiv.org/abs/2603.09058
作者: Shixiang Li,Yubin Tian,Dianpeng Wang,Piao Chen,Mengying Ren
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:
Abstract:Accurate on-orbit reliability prediction for satellite electronics is often hindered by limited data availability, varying operational conditions, and considerable unit-to-unit variability. To overcome these obstacles, this paper proposes a novel integrated online reliability prediction framework. The main contributions are twofold. First, a Wiener process-based degradation model is developed, incorporating a generalized Arrhenius link function, individual random effects, and spatial correlations among adjacent units. A customized maximum likelihood estimation method is further devised to facilitate efficient and accurate parameter inference. Second, a two-stage active learning sampling scheme is designed to adaptively enhance prediction accuracy. This strategy initially selects representative units based on spatial configuration, and subsequently determines optimal sampling times using a comprehensive criterion that balances unit-specific information, model uncertainty, and degradation dynamics. Numerical experiments and a practical case study from the Tiangong space station demonstrate that the proposed method markedly improves reliability prediction accuracy while significantly reducing data requirements, offering an efficient solution for the prognostic and health management of complex satellite electronic systems.
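The degradation-model ingredients named in the abstract (Wiener process, Arrhenius-type link, unit-specific random effect) can be sketched as a simulator. The link function's form and all parameter values here are hypothetical, chosen only to illustrate the structure; the spatial-correlation term is omitted.

```python
import numpy as np

def arrhenius_drift(temp_K, a=8.0, b=3000.0):
    """Hypothetical Arrhenius-type link: drift = exp(a - b / T)."""
    return np.exp(a - b / temp_K)

def simulate_degradation(temp_K, t, sigma=0.05, unit_effect=0.0, rng=None):
    """Wiener degradation path: drift set by temperature and a
    unit-specific random effect, plus Brownian fluctuation."""
    if rng is None:
        rng = np.random.default_rng()
    drift = arrhenius_drift(temp_K) * (1.0 + unit_effect)
    dt = np.diff(t, prepend=0.0)
    increments = drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(len(t))
    return np.cumsum(increments)

rng = np.random.default_rng(6)
t = np.linspace(0.0, 100.0, 201)
path_cool = simulate_degradation(300.0, t, rng=rng)   # unit at 300 K
path_hot = simulate_degradation(330.0, t, rng=rng)    # hotter unit drifts faster
```

Reliability then follows as the first-passage time of such a path over a failure threshold; the active-learning scheme in the paper decides which units and time points to sample to pin these parameters down.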
[LG-80] Statistical Inference via Generative Models: Flow Matching and Causal Inference
Link: https://arxiv.org/abs/2603.09009
Authors: Shinto Eguchi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Abstract:Generative AI has achieved remarkable empirical success, but from the perspective of statistics it often remains opaque: its predictions may be accurate, yet the underlying mechanism is difficult to interpret, analyze, and trust. This book reinterprets generative AI in the language of statistics, using flow matching as a central example. The key idea is that generative models should be understood not merely as devices for producing plausible data, but as methods for the nonparametric learning of high-dimensional probability distributions. From this viewpoint, missing-data imputation becomes principled sampling from learned conditional distributions, counterfactual analysis becomes the estimation of intervention distributions, and distributional dynamics become statistically analyzable objects. Mathematically, flow matching represents distributional deformation through the continuity equation and a time-dependent velocity field, thereby extending score matching from the learning of static score fields to the learning of transport paths themselves. Building on this foundation, the book develops a statistical framework in which generative models are used to estimate nuisance components while inferential validity is maintained through orthogonalization and cross-fitting in the spirit of double/debiased machine learning. Applications to survival analysis, censoring, missingness, and causal inference show how generative models can be integrated into statistical inference for structured high-dimensional problems.
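The flow-matching objective described above (learning a time-dependent velocity field that transports one distribution to another) can be stripped to a one-dimensional toy. In the sketch below, which is not from the book, the "model" is a single constant standing in for a neural velocity field, just to expose the conditional flow-matching loss: for the linear path x_t = (1 - t)·x0 + t·x1, the regression target is the displacement x1 - x0.

```python
import random

def cfm_loss_and_grad(theta, x0s, x1s):
    """Conditional flow-matching loss for the linear path
    x_t = (1 - t) * x0 + t * x1, whose target velocity is u = x1 - x0
    (time-independent for this path). The velocity model v(x, t) = theta
    is a constant -- a toy stand-in for a neural network."""
    n = len(x0s)
    loss = sum((theta - (x1 - x0)) ** 2 for x0, x1 in zip(x0s, x1s)) / n
    grad = sum(2 * (theta - (x1 - x0)) for x0, x1 in zip(x0s, x1s)) / n
    return loss, grad

rng = random.Random(0)
x0s = [rng.gauss(0, 1) for _ in range(200)]   # source samples ~ N(0, 1)
x1s = [rng.gauss(3, 1) for _ in range(200)]   # target samples ~ N(3, 1)

theta = 0.0
for _ in range(100):                          # plain gradient descent
    _, grad = cfm_loss_and_grad(theta, x0s, x1s)
    theta -= 0.1 * grad
# theta converges to the empirical mean displacement, roughly 3
```

The fitted constant is exactly the quantity a richer velocity network would learn pointwise: the average transport direction from source to target.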
[LG-81] Data-driven robust Markov decision processes on Borel spaces: performance guarantees via an axiomatic approach
Link: https://arxiv.org/abs/2603.08979
Authors: Sivaramakrishnan Ramani
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:
Abstract:We consider Markov decision processes (MDPs) with unknown disturbance distribution and address this problem using the robust Markov decision process (RMDP) approach. We construct the empirical distribution of the unknown disturbance distribution and characterize our ambiguity set of distributions as the sublevel set of a nonnegative distance function from the empirical distribution. By connecting the weak convergence of distributions to convergence with respect to the distance function, we prove that the robust optimal value function and the out-of-sample value function converge to the true optimal value function with increasing sample-sizes. We establish that, for finite sample-sizes, the robust optimal value function serves as a high probability upper bound on the out-of-sample value function. We also obtain probabilistic convergence rates, sample complexity bounds, and out-of-distribution performance bounds. The finite sample performance guarantees rely on the distance function satisfying a certain concentration type inequality. Several well-studied distances in the literature meet the requirements imposed on the distance function. We also analyze the data-driven properties of empirical MDPs and demonstrate that, unlike our data-driven RMDPs, empirical MDPs fail to satisfy some of the finite sample performance guarantees.
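The robust optimal value function at the center of the abstract can be sketched with robust value iteration, where adversarial nature picks the worst-case kernel at each state-action pair. The code below is an invented toy: it replaces the paper's sublevel-set ambiguity ball around the empirical distribution with a small finite list of candidate kernels, which is enough to show that enlarging the ambiguity set can only lower the robust value.

```python
def robust_value_iteration(states, actions, candidate_Ps, reward, gamma=0.9, iters=300):
    """Robust value iteration: at each (s, a), take the worst case over an
    ambiguity set of transition kernels. The finite list candidate_Ps is a
    crude stand-in for a distance-ball around the empirical distribution."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {
            s: max(
                min(reward(s, a) + gamma * sum(P[s][a][s2] * V[s2] for s2 in states)
                    for P in candidate_Ps)   # adversarial choice of kernel
                for a in actions
            )
            for s in states
        }
    return V

states, actions = ["good", "bad"], ["stay", "move"]
P_nom = {"good": {"stay": {"good": 0.9, "bad": 0.1}, "move": {"good": 0.5, "bad": 0.5}},
         "bad":  {"stay": {"good": 0.1, "bad": 0.9}, "move": {"good": 0.6, "bad": 0.4}}}
P_bad = {"good": {"stay": {"good": 0.7, "bad": 0.3}, "move": {"good": 0.4, "bad": 0.6}},
         "bad":  {"stay": {"good": 0.05, "bad": 0.95}, "move": {"good": 0.4, "bad": 0.6}}}
reward = lambda s, a: 1.0 if s == "good" else 0.0
V_nom = robust_value_iteration(states, actions, [P_nom], reward)
V_rob = robust_value_iteration(states, actions, [P_nom, P_bad], reward)
```

The monotonicity visible here (robust value never above the nominal value) is the finite-sample mechanism behind the paper's high-probability upper bound on out-of-sample performance.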
[LG-82] A Survey of Reinforcement Learning For Economics
Link: https://arxiv.org/abs/2603.08956
Authors: Pranjal Rawat
Subjects: General Economics (econ.GN); Machine Learning (cs.LG)
*Comments:
Abstract:This survey (re)introduces reinforcement learning methods to economists. The curse of dimensionality limits how far exact dynamic programming can be effectively applied, forcing us to rely on suitably “small” problems or our ability to convert “big” problems into smaller ones. While this reduction has been sufficient for many classical applications, a growing class of economic models resists such reduction. Reinforcement learning algorithms offer a natural, sample-based extension of dynamic programming, extending tractability to problems with high-dimensional states, continuous actions, and strategic interactions. I review the theory connecting classical planning to modern learning algorithms and demonstrate their mechanics through simulated examples in pricing, inventory control, strategic games, and preference elicitation. I also examine the practical vulnerabilities of these algorithms, noting their brittleness, sample inefficiency, sensitivity to hyperparameters, and the absence of global convergence guarantees outside of tabular settings. The successes of reinforcement learning remain strictly bounded by these constraints, as well as a reliance on accurate simulators. When guided by economic structure, reinforcement learning provides a remarkably flexible framework. It stands as an imperfect, but promising, addition to the computational economist’s toolkit. A companion survey (Rust and Rawat, 2026b) covers the inverse problem of inferring preferences from observed behavior.
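The sample-based extension of dynamic programming that the survey describes can be illustrated with tabular Q-learning on a toy pricing problem. The setup below is invented for illustration and is not a model from the survey: demand alternates between a high and a low regime, profit is price times truncated linear demand, and the learning rate is the sample-average rule 1/N(s, a).

```python
import random

def q_learning_pricing(prices, episodes=10000, gamma=0.5, eps=0.2, seed=0):
    """Tabular Q-learning on a toy dynamic pricing problem. The demand
    regime ('high'/'low') evolves exogenously at random, profit is
    p * max(base - p, 0), and updates use the 1/N(s, a) learning rate.
    A minimal illustration of sample-based dynamic programming; all
    names and constants here are invented."""
    rng = random.Random(seed)
    states = ["high", "low"]
    Q = {(s, p): 0.0 for s in states for p in prices}
    n = {sa: 0 for sa in Q}
    s = "high"
    for _ in range(episodes):
        # epsilon-greedy action selection
        p = rng.choice(prices) if rng.random() < eps else max(prices, key=lambda a: Q[(s, a)])
        base = 10.0 if s == "high" else 4.0
        profit = p * max(base - p, 0.0)       # linear demand, truncated at zero
        s2 = rng.choice(states)               # demand regime flips exogenously
        n[(s, p)] += 1
        target = profit + gamma * max(Q[(s2, a)] for a in prices)
        Q[(s, p)] += (target - Q[(s, p)]) / n[(s, p)]  # sample-average update
        s = s2
    return Q

Q = q_learning_pricing([2.0, 5.0, 8.0])
best_high = max([2.0, 5.0, 8.0], key=lambda p: Q[("high", p)])
best_low = max([2.0, 5.0, 8.0], key=lambda p: Q[("low", p)])
# high demand (base 10): price 5 earns profit 25; low demand (base 4): price 2 earns 4
```

The learner recovers the regime-dependent monopoly prices from samples alone, with no access to the transition or demand model, which is precisely the tractability gain over exact dynamic programming that the survey emphasizes.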
[LG-83] Towards Reliable Simulation-based Inference
Link: https://arxiv.org/abs/2603.08947
Authors: Arnaud Delaunoy
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: PhD thesis
Abstract:Scientific knowledge expands by observing the world, hypothesizing some theories about it, and testing them against collected data. When those theories take the form of statistical models, statistical analyses are involved in the process of testing and refining scientific hypotheses. In this thesis, we focus on statistical models that take the form of scientific simulators and provide background about how machine learning can be used for statistical analyses in this context. The first part of this thesis is about showing empirically that performing statistical analyses with machine learning involves a degree of approximation. Specifically, all statistical analyses involve a level of uncertainty in the conclusions drawn, and we show that approximations can lead to overconfident conclusions. We draw caution regarding such overconfident conclusions and introduce a criterion to diagnose overconfident approximations. In the second part, we introduce balancing, a way to regularize machine learning models to reduce overconfidence and favor calibrated or underconfident approximations. Balancing is first introduced for neural ratio estimation algorithms and then extended to other algorithms. Intuition about why balancing leads to less overconfident solutions is provided, and it is shown empirically that balanced algorithms are often either close to calibrated or underconfident. The third part shows that Bayesian neural networks can also be used to mitigate the overconfidence of approximations. Unlike balancing, no regularization is required, and this solution can then work with few training samples and, hence, computationally expensive simulators. To that end, a new Bayesian neural network prior tailored for simulation-based inference is developed, and empirical results show a reduction in overconfidence compared to similar solutions without Bayesian neural networks. 
Comments: PhD thesis Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG) Cite as: arXiv:2603.08947 [stat.ML] (or arXiv:2603.08947v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2603.08947 (arXiv-issued DOI via DataCite, pending registration) Submission history: Arnaud Delaunoy, [v1] Mon, 9 Mar 2026 21:29:13 UTC (9,433 KB)
[LG-84] Kernel Debiased Plug-in Estimation based on the Universal Least Favorable Submodel
Link: https://arxiv.org/abs/2603.08945
Authors: Haiyi Chen, Yang Liu, Ivana Malenica
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:
Abstract:We propose ULFS-KDPE, a kernel debiased plug-in estimator based on the universal least favorable submodel, for estimating pathwise differentiable parameters in nonparametric models. The method constructs a data-adaptive debiasing flow in a reproducing kernel Hilbert space (RKHS), producing a plug-in estimator that achieves semiparametric efficiency without requiring explicit derivation or evaluation of efficient influence functions. We place ULFS-KDPE on a rigorous functional-analytic foundation by formulating the universal least favorable update as a nonlinear ordinary differential equation on probability densities. We establish existence, uniqueness, stability, and finite-time convergence of the empirical score along the induced flow. Under standard regularity conditions, the resulting estimator is regular, asymptotically linear, and attains the semiparametric efficiency bound simultaneously for a broad class of pathwise differentiable parameters. The method admits a computationally tractable implementation based on finite-dimensional kernel representations and principled stopping criteria. In finite samples, the combination of solving a rich collection of score equations with RKHS-based smoothing and avoidance of direct influence-function evaluation leads to improved numerical stability. Simulation studies illustrate the method and support the theoretical results.
[LG-85] Micro-Diffusion Compression – Binary Tree Tweedie Denoising for Online Probability Estimation
Link: https://arxiv.org/abs/2603.08771
Authors: Roberto Tacconelli
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
*Comments: 12 pages, 1 figure
Abstract:We present Midicoth, a lossless compression system that introduces a micro-diffusion denoising layer for improving probability estimates produced by adaptive statistical models. In compressors such as Prediction by Partial Matching (PPM), probability estimates are smoothed by a prior to handle sparse observations. When contexts have been seen only a few times, this prior dominates the prediction and produces distributions that are significantly flatter than the true source distribution, leading to compression inefficiency. Midicoth addresses this limitation by treating prior smoothing as a shrinkage process and applying a reverse denoising step that corrects predicted probabilities using empirical calibration statistics. To make this correction data-efficient, the method decomposes each byte prediction into a hierarchy of binary decisions along a bitwise tree. This converts a single 256-way calibration problem into a sequence of binary calibration tasks, enabling reliable estimation of correction terms from relatively small numbers of observations. The denoising process is applied in multiple successive steps, allowing each stage to refine residual prediction errors left by the previous one. The micro-diffusion layer operates as a lightweight post-blend calibration stage applied after all model predictions have been combined, allowing it to correct systematic biases in the final probability distribution. Midicoth combines five fully online components: an adaptive PPM model, a long-range match model, a trie-based word model, a high-order context model, and the micro-diffusion denoiser applied as the final stage.
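The bitwise-tree decomposition the abstract describes (turning one 256-way calibration problem into eight binary ones) can be sketched directly. The code below is an illustration written for this digest, not the Midicoth implementation, and it omits the Tweedie-style denoising steps: it only shows that a byte distribution factors exactly into conditional bit probabilities along a binary tree.

```python
def byte_to_bit_probs(probs):
    """Decompose a 256-way byte distribution into the 8 conditional bit
    probabilities along a binary tree, most significant bit first. Each
    internal node stores P(next bit = 1 | bit prefix); calibrating these
    binary estimates rather than the full 256-way distribution is the
    data-efficiency trick described in the abstract."""
    cond = {}
    for depth in range(8):
        for prefix in range(1 << depth):
            lo = prefix << (8 - depth)          # first byte matching this prefix
            width = 1 << (8 - depth)            # number of bytes under this node
            total = sum(probs[lo:lo + width])
            ones = sum(probs[lo + width // 2:lo + width])  # subtree with next bit = 1
            cond[(depth, prefix)] = ones / total if total > 0 else 0.5
    return cond

def byte_prob(cond, byte):
    """Recover P(byte) as the product of the 8 conditional bit probabilities."""
    p, prefix = 1.0, 0
    for depth in range(8):
        bit = (byte >> (7 - depth)) & 1
        p1 = cond[(depth, prefix)]
        p *= p1 if bit else (1.0 - p1)
        prefix = (prefix << 1) | bit
    return p

uniform = [1.0 / 256] * 256
cond = byte_to_bit_probs(uniform)
# every conditional is 0.5, and byte_prob returns exactly 1/256 for any byte
```

Because each conditional is a child-to-parent mass ratio, the eight-factor product telescopes back to the original byte probability, so the tree is a lossless reparameterization on which each binary node can be calibrated from far fewer observations.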
[LG-86] On the Formal Limits of Alignment Verification
Link: https://arxiv.org/abs/2603.08761
Authors: Ayushi Agarwal
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Abstract:The goal of AI alignment is to ensure that an AI system reliably pursues intended objectives. A foundational question for AI safety is whether alignment can be formally certified: whether there exists a procedure that can guarantee that a given system satisfies an alignment specification. This paper studies the nature of alignment verification. We prove that no verification procedure can simultaneously satisfy three properties: soundness (no misaligned system is certified), generality (verification holds over the full input domain), and tractability (verification runs in polynomial time). Each pair of properties is achievable, but all three cannot hold simultaneously. Relaxing any one property restores the corresponding possibility, indicating that practical bounded or probabilistic assurance remains viable. The result follows from three independent barriers: the computational complexity of full-domain neural network verification, the non-identifiability of internal goal structure from behavioral observation, and the limits of finite evidence for properties defined over infinite domains. The trilemma establishes the limits of alignment certification and characterizes the regimes in which meaningful guarantees remain possible.