This post contains the latest list of papers retrieved from Arxiv.org on 2026-03-03, updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.
Note: paper data is retrieved from Arxiv.org daily, with an automatic update scheduled around 12:30 each day.
Tip: if a given day's list is not updated on time, either Arxiv published no new papers that day or the script failed; fixes are applied the same day whenever possible.
Table of Contents
Overview (2026-03-03)
934 papers were updated today, including:
- Natural Language Processing: 125 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 318 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 256 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 250 papers (Machine Learning (cs.LG))
- Multi-Agent Systems: 16 papers (Multiagent Systems (cs.MA))
- Information Retrieval: 23 papers (Information Retrieval (cs.IR))
- Human-Computer Interaction: 40 papers (Human-Computer Interaction (cs.HC))
Multi-Agent Systems
[MA-0] Boltzmann-based Exploration for Robust Decentralized Multi-Agent Planning ICAPS2026
[Quick Read]: This paper tackles the performance bottleneck of Decentralized Monte Carlo Tree Search (Dec-MCTS) in environments with sparse or skewed reward distributions. The core of the solution, Coordinated Boltzmann MCTS (CB-MCTS), replaces the conventional deterministic UCB1 (Upper Confidence Bound 1) selection rule with a stochastic Boltzmann policy and adds an entropy bonus that decays over time, producing sustained yet focused exploration. This design is the first successful application of Boltzmann exploration to multi-agent systems; it improves planning performance in deceptive scenarios while remaining competitive on standard benchmarks, offering a more robust approach to multi-agent planning.
Link: https://arxiv.org/abs/2603.02154
Authors: Nhat Nguyen, Duong Nguyen, Gianluca Rizzo, Hung Nguyen
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: To appear in ICAPS 2026
Abstract:Decentralized Monte Carlo Tree Search (Dec-MCTS) is widely used for cooperative multi-agent planning but struggles in sparse or skewed reward environments. We introduce Coordinated Boltzmann MCTS (CB-MCTS), which replaces deterministic UCT with a stochastic Boltzmann policy and a decaying entropy bonus for sustained yet focused exploration. While Boltzmann exploration has been studied in single-agent MCTS, applying it in multi-agent systems poses unique challenges. CB-MCTS is the first to address this. We analyze CB-MCTS in the simple-regret setting and show in simulations that it outperforms Dec-MCTS in deceptive scenarios and remains competitive on standard benchmarks, providing a robust solution for multi-agent planning.
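The selection rule at the heart of CB-MCTS can be sketched as follows: instead of the deterministic UCB1 argmax, a child is sampled from a Boltzmann distribution over value estimates, with a visit-based bonus whose weight decays as visits accumulate. Everything here (the temperature, the decay schedule, the bonus form) is an illustrative guess, not the paper's exact formulation:

```python
import math
import random

def boltzmann_select(values, visits, temperature=1.0, beta0=0.5):
    """Sample a child index from a Boltzmann policy over value estimates.

    A visit-count bonus weighted by beta0 / (1 + total_visits) keeps the
    policy spread out early and lets it sharpen as visits accumulate
    (illustrative decay schedule, not the paper's exact one)."""
    total = sum(visits)
    beta = beta0 / (1.0 + total)  # decaying bonus weight
    logits = [v / temperature + beta * -math.log((n + 1) / (total + len(visits)))
              for v, n in zip(values, visits)]
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]  # numerically stable softmax
    s = sum(weights)
    r, acc = random.random(), 0.0
    for i, w in enumerate(weights):
        acc += w / s
        if r < acc:
            return i
    return len(weights) - 1

# Deterministic UCB1 would always return the argmax child; the Boltzmann
# policy keeps visiting lower-valued children with nonzero probability.
random.seed(0)
picks = [boltzmann_select([1.0, 0.2, 0.1], [10, 3, 2]) for _ in range(1000)]
```

Over 1,000 draws the best child still dominates while all three children keep being visited, which is the "sustained yet focused" behaviour the abstract refers to.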
[MA-1] GenDB: The Next Generation of Query Processing – Synthesized Not Engineered
[Quick Read]: This paper addresses the inability of traditional query processing systems to keep pace with new techniques and evolving user requirements, owing to their architectural complexity, poor extensibility, and high development cost; existing DBMSs lack the flexibility and customizability to adapt to dynamic workloads and hardware environments. The key idea is to use Large Language Models (LLMs) to synthesize instance-optimized query execution code tailored to the specific data, workload, and hardware resources, in place of building and maintaining static engines. The proposed GenDB system realizes this with a multi-agent architecture that uses Claude Code Agent as the underlying component to generate customized execution logic, and it significantly outperforms mainstream engines such as DuckDB, Umbra, MonetDB, ClickHouse, and PostgreSQL on TPC-H and on a newly designed benchmark built to reduce the risk of data leakage.
Link: https://arxiv.org/abs/2603.02081
Authors: Jiale Lao, Immanuel Trummer
Affiliations: Cornell University
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract:Traditional query processing relies on engines that are carefully optimized and engineered by many experts. However, new techniques and user requirements evolve rapidly, and existing systems often cannot keep pace. At the same time, these systems are difficult to extend due to their internal complexity, and developing new systems requires substantial engineering effort and cost. In this paper, we argue that recent advances in Large Language Models (LLMs) are starting to shape the next generation of query processing systems. We propose using LLMs to synthesize execution code for each incoming query, instead of continuously building, extending, and maintaining complex query processing engines. As a proof of concept, we present GenDB, an LLM-powered agentic system that generates instance-optimized and customized query execution code tailored to specific data, workloads, and hardware resources. We implemented an early prototype of GenDB that uses Claude Code Agent as the underlying component in the multi-agent system, and we evaluate it on OLAP workloads. We use queries from the well-known TPC-H benchmark and also construct a new benchmark designed to reduce potential data leakage from LLM training data. We compare GenDB with state-of-the-art query engines, including DuckDB, Umbra, MonetDB, ClickHouse, and PostgreSQL. GenDB achieves significantly better performance than these systems. Finally, we discuss the current limitations of GenDB and outline future extensions and related research challenges.
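The synthesize-then-execute loop the abstract describes can be illustrated with a toy stand-in: a template (playing the role of the LLM code generator, which in GenDB is Claude Code Agent) emits specialized Python for one filtered-aggregate query, which is then compiled and run. The query schema below is invented for the sketch:

```python
# Toy stand-in for GenDB's pipeline: generate per-query execution code,
# compile it, run it. A fixed template replaces the LLM here, and the
# query dictionary shape is invented for illustration.
def synthesize(query):
    """Emit and compile instance-specialized execution code for a query."""
    src = (
        "def run(rows):\n"
        "    total = 0\n"
        "    for row in rows:\n"
        f"        if row['{query['filter_col']}'] > {query['filter_val']}:\n"
        f"            total += row['{query['agg_col']}']\n"
        "    return total\n"
    )
    namespace = {}
    exec(src, namespace)  # compile the generated plan into a callable
    return namespace["run"]

rows = [{"qty": 5, "price": 10}, {"qty": 1, "price": 99}, {"qty": 7, "price": 3}]
run = synthesize({"filter_col": "qty", "filter_val": 2, "agg_col": "price"})
result = run(rows)  # sum of price where qty > 2
```

The generated function hard-codes the filter column, constant, and aggregate, which is what "instance-optimized" means in contrast to a generic interpreting engine.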
[MA-2] Exploring Plan Space through Conversation: An Agentic Framework for LLM-Mediated Explanations in Planning
[Quick Read]: This paper addresses how automated plan-generation systems for real-world sequential decision problems can support collaborative human-AI reasoning and preference elicitation, where the central challenge is improving users' understanding of and trust in AI-generated solutions. The key contribution is a multi-agent Large Language Model (LLM) architecture that is agnostic to the underlying explanation framework and dynamically produces interactive explanations adapted to user needs and context; its effectiveness is validated through an instantiation for goal-conflict explanations, enabling natural and flexible human-machine interaction.
Link: https://arxiv.org/abs/2603.02070
Authors: Guilhem Fouilhé, Rebecca Eifler, Antonin Poché, Sylvie Thiébaux, Nicholas Asher
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: Preprint
Abstract:When automating plan generation for a real-world sequential decision problem, the goal is often not to replace the human planner, but to facilitate an iterative reasoning and elicitation process, where the human’s role is to guide the AI planner according to their preferences and expertise. In this context, explanations that respond to users’ questions are crucial to improve their understanding of potential solutions and increase their trust in the system. To enable natural interaction with such a system, we present a multi-agent Large Language Model (LLM) architecture that is agnostic to the explanation framework and enables user- and context-dependent interactive explanations. We also describe an instantiation of this framework for goal-conflict explanations, which we use to conduct a user study comparing the LLM-powered interaction with a baseline template-based explanation interface.
[MA-3] Selection as Power: Constrained Reinforcement for Bounded Decision Authority
[Quick Read]: This paper addresses governance risks caused by the concentration of upstream selection power in high-stakes agentic systems, focusing on how to prevent reinforcement learning from producing irreversible concentration of selection power in dynamic settings. The original framework assumed static governance constraints that could not adapt over time; this work introduces incentivized selection governance, which embeds externally enforced sovereignty constraints into reinforcement-learning parameter updates so that the optimization of scoring and reducer parameters always remains within governance-defined feasible sets. The key idea is to treat reinforcement as a constrained Markov decision process in which projection operations keep parameter updates within prescribed bounds, preventing deterministic dominance under repeated feedback. The mechanism allows learning dynamics to coexist with structural diversity while keeping selection power bounded, offering a principled approach to safely integrating reinforcement learning into high-stakes settings such as finance.
Link: https://arxiv.org/abs/2603.02019
Authors: Jose Manuel de la Chica Rodriguez, Juan Manuel Vera Díaz
Affiliations: AI Lab, Grupo Santander
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments:
Abstract:Selection as Power argued that upstream selection authority, rather than internal objective misalignment, constitutes a primary source of risk in high-stakes agentic systems. However, the original framework was static: governance constraints bounded selection power but did not adapt over time. In this work, we extend the framework to dynamic settings by introducing incentivized selection governance, where reinforcement updates are applied to scoring and reducer parameters under externally enforced sovereignty constraints. We formalize selection as a constrained reinforcement process in which parameter updates are projected onto governance-defined feasible sets, preventing concentration beyond prescribed bounds. Across multiple regulated financial scenarios, unconstrained reinforcement consistently collapses into deterministic dominance under repeated feedback, especially at higher learning rates. In contrast, incentivized governance enables adaptive improvement while maintaining bounded selection concentration. Projection-based constraints transform reinforcement from irreversible lock-in into controlled adaptation, with governance debt quantifying the tension between optimization pressure and authority bounds. These results demonstrate that learning dynamics can coexist with structural diversity when sovereignty constraints are enforced at every update step, offering a principled approach to integrating reinforcement into high-stakes agentic systems without surrendering bounded selection authority.
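The projection idea, keeping every reinforcement update inside a governance-defined feasible set, can be sketched with a toy feasible set: probability vectors whose entries (selection shares) never exceed a cap. The cap value, the water-filling projection, and the update rule are all illustrative assumptions, not the paper's operator:

```python
def project_shares(weights, cap=0.4):
    """Project nonnegative weights onto probability vectors whose entries are
    all <= cap, by clamping offenders at the cap and rescaling the rest
    (a simple water-filling scheme; illustrative, not the paper's operator)."""
    n = len(weights)
    assert cap * n >= 1.0, "feasible set would be empty"
    total = sum(weights)
    s = [w / total for w in weights]
    clamped = set()
    while True:
        free = [i for i in range(n) if i not in clamped]
        free_mass = 1.0 - cap * len(clamped)
        free_total = sum(s[i] for i in free)
        out = {i: s[i] * free_mass / free_total for i in free}
        over = [i for i in free if out[i] > cap]
        if not over:
            return [cap if i in clamped else out[i] for i in range(n)]
        clamped.update(over)

# Repeated feedback that always favors agent 0: unconstrained reinforcement
# would collapse to deterministic dominance, but the projection keeps every
# selection share at or below the cap.
w = [0.25, 0.25, 0.25, 0.25]
for _ in range(50):
    raw = [x + 0.5 * r for x, r in zip(w, [1.0, 0.1, 0.1, 0.1])]
    w = project_shares(raw, cap=0.4)
```

After the loop, agent 0 sits exactly at the cap and the remaining mass stays spread across the others, which is the "bounded concentration" behaviour the abstract contrasts with lock-in.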
[MA-4] The Observer-Situation Lattice: A Unified Formal Basis for Perspective-Aware Cognition AAMAS2026
[Quick Read]: This paper addresses the difficulty autonomous agents in multi-agent environments have in integrating reasoning across different observers, times, and contexts, which makes handling other agents' beliefs (Theory of Mind) brittle and incomplete. The key contribution is a unified mathematical structure, the Observer-Situation Lattice (OSL): a finite complete lattice in which each element represents a unique observer-situation pair, providing a single, semantically coherent representation space for perspective-aware cognition. On this lattice, the paper builds two core algorithms: Relativized Belief Propagation for efficient incremental updates as new information arrives, and Minimal Contradiction Decomposition for identifying and isolating contradictory components, enabling scalable and robust belief management.
Link: https://arxiv.org/abs/2603.01407
Authors: Saad Alqithami
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
Comments: Extended version of the AAMAS 2026 paper with the same title
Abstract:Autonomous agents operating in complex, multi-agent environments must reason about what is true from multiple perspectives. Existing approaches often struggle to integrate the reasoning of different agents, at different times, and in different contexts, typically handling these dimensions in separate, specialized modules. This fragmentation leads to a brittle and incomplete reasoning process, particularly when agents must understand the beliefs of others (Theory of Mind). We introduce the Observer-Situation Lattice (OSL), a unified mathematical structure that provides a single, coherent semantic space for perspective-aware cognition. OSL is a finite complete lattice where each element represents a unique observer-situation pair, allowing for a principled and scalable approach to belief management. We present two key algorithms that operate on this lattice: (i) Relativized Belief Propagation, an incremental update algorithm that efficiently propagates new information, and (ii) Minimal Contradiction Decomposition, a graph-based procedure that identifies and isolates contradiction components. We prove the theoretical soundness of our framework and demonstrate its practical utility through a series of benchmarks, including classic Theory of Mind tasks and a comparison with established paradigms such as assumption-based truth maintenance systems. Our results show that OSL provides a computationally efficient and expressive foundation for building robust, perspective-aware autonomous agents.
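The "finite complete lattice of observer-situation pairs" claim can be illustrated with a toy product lattice: elements are pairs of frozensets, ordered coordinate-wise, so every pair of elements has a join and a meet. The paper's actual ordering and algorithms are richer; this only shows the algebraic shape:

```python
from itertools import product

# Elements are (observers, situations) pairs of frozensets; join/meet act
# per coordinate. This product-of-powersets lattice is used purely to
# illustrate the "finite complete lattice" claim -- the paper's actual
# ordering over observer-situation pairs may differ.
def join(a, b):
    return (a[0] | b[0], a[1] | b[1])

def meet(a, b):
    return (a[0] & b[0], a[1] & b[1])

def leq(a, b):
    return a[0] <= b[0] and a[1] <= b[1]

observers = [frozenset(s) for s in [(), ("alice",), ("bob",), ("alice", "bob")]]
situations = [frozenset(s) for s in [(), ("t0",), ("t1",), ("t0", "t1")]]
lattice = [(o, s) for o, s in product(observers, situations)]

# Closure under join/meet, plus unique bottom and top elements.
closed = all(join(a, b) in lattice and meet(a, b) in lattice
             for a in lattice for b in lattice)
bottom = (frozenset(), frozenset())
top = (frozenset({"alice", "bob"}), frozenset({"t0", "t1"}))
```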
[MA-5] Exploration enhances cooperation in the multi-agent communication system
[Quick Read]: This paper addresses how to design protocols that enhance cooperation in multi-agent systems, focusing on the role of low-cost, non-binding "cheap talk" in fostering collaboration. Existing theoretical frameworks usually exclude random exploration (noise) for analytical tractability, leaving its impact on system performance unclear. The paper proposes a two-stage evolutionary game model that combines signalling with a donation game and explicitly incorporates exploration into decision-making. The key findings: agent-based simulations across topologies reveal a universal optimal exploration rate that maximizes system-wide cooperation; mechanistically, moderate exploration destabilizes defection and catalyzes the cyclic success of self-organized cooperative alliances, while the cooperation peak depends on a delicate balance between oscillation period and amplification. This suggests that in communication-based intelligent systems, strategically embracing randomness (engineered random exploration) matters more than deterministic rigidity and is central to sustaining cooperation and achieving optimal performance.
Link: https://arxiv.org/abs/2603.01401
Authors: Zhao Song, Chen Shen, Zhen Wang, The Anh Han
Affiliations: Teesside University; Kyushu University; Northwestern Polytechnical University
Subjects: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT)
Comments:
Abstract:Designing protocols enhancing cooperation for multi-agent systems remains a grand challenge. Cheap talk, defined as costless, non-binding communication before formal action, serves as a pivotal solution. However, existing theoretical frameworks often exclude random exploration, or noise, for analytical tractability, leaving its functional impact on system performance largely unexplored. To bridge this gap, we propose a two-stage evolutionary game-theoretical model, integrating signalling with a donation game, with exploration explicitly incorporated into the decision-making. Our agent-based simulations across topologies reveal a universal optimal exploration rate that maximises system-wide cooperation. Mechanistically, moderate exploration undermines the stability of defection and catalyses the self-organised cooperative alliances, facilitating their cyclic success. Moreover, the cooperation peak is enabled by the delicate balance between oscillation period and amplification. Our findings suggest that rather than pursuing deterministic rigidity, embracing strategic exploration, as a form of engineered randomness, is essential to sustain cooperation and realise optimal performance in communication-based intelligent systems.
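The model's two ingredients, a donation game plus explicit exploration in the update rule, can be sketched as follows. The ring topology, imitation rule, and payoff parameters are illustrative stand-ins for the paper's two-stage signalling model:

```python
import random

def donation_payoffs(strategies, b=3.0, c=1.0):
    """Pairwise donation game on a ring: a cooperator (1) pays c per
    neighbour so that each neighbour receives b; defectors (0) pay nothing."""
    n = len(strategies)
    pay = [0.0] * n
    for i in range(n):
        for j in ((i - 1) % n, (i + 1) % n):
            if strategies[i] == 1:
                pay[i] -= c
                pay[j] += b
    return pay

def step(strategies, payoffs, epsilon, rng):
    """Synchronous update: with probability epsilon an agent explores (adopts
    a random strategy); otherwise it imitates a random neighbour doing better."""
    n = len(strategies)
    new = list(strategies)
    for i in range(n):
        if rng.random() < epsilon:
            new[i] = rng.randint(0, 1)          # random exploration (noise)
        else:
            j = (i + rng.choice([-1, 1])) % n   # pick a ring neighbour
            if payoffs[j] > payoffs[i]:
                new[i] = strategies[j]
    return new

rng = random.Random(0)
strats = [rng.randint(0, 1) for _ in range(20)]
for _ in range(30):
    strats = step(strats, donation_payoffs(strats), epsilon=0.05, rng=rng)
```

Sweeping `epsilon` over a grid and averaging the cooperation level over many seeds is the kind of experiment in which the paper reports a universal optimal exploration rate.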
[MA-6] Epistemic Gain, Aleatoric Cost: Uncertainty Decomposition in Multi-Agent Debate for Math Reasoning
[Quick Read]: This paper addresses how information exchange in Multi-Agent Debate (MAD) shapes models' reasoning ability, in particular explaining paradoxical phenomena observed in practice, such as accuracy gains accompanied by substantial increases in token entropy and the large performance divergence between homogeneous and heterogeneous model combinations. The key contribution is a Bayesian uncertainty analysis framework that decomposes total predictive uncertainty into epistemic uncertainty, reducible through debate context, and aleatoric uncertainty, induced by internal model noise. Building on this, the paper designs an uncertainty-guided multi-agent reinforcement learning (MARL) algorithm that explicitly optimizes aleatoric noise reduction and epistemic information utilization, markedly improving post-debate accuracy and stability and pushing individual reasoning beyond single-agent RL.
Link: https://arxiv.org/abs/2603.01221
Authors: Dan Qiao, Binbin Chen, Fengyu Cai, Jianlong Chen, Wenhao Li, Fuxin Jiang, Zuzhi Chen, Hongyuan Zha, Tieying Zhang, Baoxiang Wang
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA)
Comments:
Abstract:Multi-Agent Debate (MAD) has shown promise in leveraging collective intelligence to improve reasoning and reduce hallucinations, yet it remains unclear how information exchange shapes the underlying ability. Empirically, MAD exhibits paradoxical phenomena, such as accuracy improvement accompanied by substantial increase in token entropy, and remarkable divergence between homogeneous and heterogeneous model combinations. In this paper, we propose a Bayesian uncertainty analysis framework for MAD, which decomposes total predictive uncertainty into epistemic uncertainty reducible by debate context and aleatoric uncertainty induced by internal model noise. Across multiple model configurations, we find that effective debate hinges on achieving high epistemic gain under controlled aleatoric cost. Building on this insight, we design an uncertainty-guided multi-agent reinforcement learning (MARL) algorithm that explicitly optimizes aleatoric noise reduction and epistemic information utilization. Experiments show that our training significantly improves post-debate accuracy and stability, and enhances individual reasoning beyond single-agent RL, providing a unified Bayesian uncertainty perspective for understanding and improving MAD.
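The epistemic/aleatoric split the paper builds on can be sketched with the standard mutual-information decomposition: total uncertainty is the entropy of the averaged prediction, aleatoric is the average per-agent entropy, and epistemic is their difference. The paper's MAD-specific Bayesian framework refines this idea and is not reproduced here:

```python
import math

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0.0)

def decompose(predictions):
    """Split total predictive uncertainty into aleatoric and epistemic parts.

    `predictions`: per-agent categorical distributions over answers.
    total = entropy of the averaged distribution; aleatoric = average
    per-agent entropy; epistemic = total - aleatoric (the standard
    mutual-information decomposition)."""
    n = len(predictions)
    k = len(predictions[0])
    mean = [sum(p[i] for p in predictions) / n for i in range(k)]
    total = entropy(mean)
    aleatoric = sum(entropy(p) for p in predictions) / n
    return total, aleatoric, total - aleatoric

# Two confident agents that disagree: low aleatoric, high epistemic -- the
# regime where exchanging debate context can actually reduce uncertainty.
total, alea, epi = decompose([[0.9, 0.1], [0.1, 0.9]])
# Two agents that agree: the epistemic term vanishes.
t2, a2, e2 = decompose([[0.9, 0.1], [0.9, 0.1]])
```

In this picture, "epistemic gain" is debate shrinking the first term's gap, while "aleatoric cost" is individual distributions becoming noisier during debate.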
[MA-7] Can AI Agents Agree?
[Quick Read]: This paper addresses the unreliability of Large Language Models (LLMs) acting as cooperating agents in Byzantine consensus settings, where their ability to reach agreement even in no-stake groups has not been systematically studied. The key approach is a synchronous all-to-all simulation of a scalar-value consensus task, testing LLM agents across model sizes, group sizes, and Byzantine fractions. Results show that valid agreement is unreliable even in benign settings and degrades further as group size grows; introducing a small number of Byzantine agents sharply reduces success rates, and failures are dominated by loss of liveness (timeouts and stalled convergence) rather than subtle value corruption, revealing that current LLM-agent groups cannot yet coordinate dependably.
Link: https://arxiv.org/abs/2603.01213
Authors: Frédéric Berdoz, Leonardo Rugli, Roger Wattenhofer
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments:
Abstract:Large language models are increasingly deployed as cooperating agents, yet their behavior in adversarial consensus settings has not been systematically studied. We evaluate LLM-based agents on a Byzantine consensus game over scalar values using a synchronous all-to-all simulation. We test consensus in a no-stake setting where agents have no preferences over the final value, so evaluation focuses on agreement rather than value optimality. Across hundreds of simulations spanning model sizes, group sizes, and Byzantine fractions, we find that valid agreement is not reliable even in benign settings and degrades as group size grows. Introducing a small number of Byzantine agents further reduces success. Failures are dominated by loss of liveness, such as timeouts and stalled convergence, rather than subtle value corruption. Overall, the results suggest that reliable agreement is not yet a dependable emergent capability of current LLM-agent groups even in no-stake settings, raising caution for deployments that rely on robust coordination.
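The evaluation protocol, synchronous all-to-all rounds of scalar consensus with a Byzantine fraction, has roughly the following shape. Here simple numeric agents using a trimmed-mean update stand in for the LLM agents the paper actually tests, and the update rule is illustrative:

```python
import random

def run_consensus(honest_vals, n_byz, rounds=5, seed=1):
    """Synchronous all-to-all scalar consensus with trimmed-mean updates.

    Each round, honest agents broadcast their value, drop the n_byz highest
    and lowest received values, and adopt the mean of the rest; Byzantine
    agents broadcast arbitrary values. (The paper puts LLM agents in this
    loop; simple numeric agents stand in to show the protocol shape.)"""
    rng = random.Random(seed)
    vals = list(honest_vals)
    for _ in range(rounds):
        byz_msgs = [rng.uniform(-100.0, 100.0) for _ in range(n_byz)]
        received = sorted(vals + byz_msgs)
        trimmed = received[n_byz:len(received) - n_byz] if n_byz else received
        mean = sum(trimmed) / len(trimmed)
        vals = [mean] * len(vals)  # every honest agent adopts the trimmed mean
    return vals

final = run_consensus([1.0, 2.0, 3.0, 4.0], n_byz=1)
```

With this numeric rule, the honest agents agree on a value inside their initial range despite the adversary; the paper's finding is that LLM agents in the same loop often fail to reach agreement at all.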
[MA-8] MedCollab: Causal-Driven Multi-Agent Collaboration for Full-Cycle Clinical Diagnosis via IBIS-Structured Argumentation
[Quick Read]: This paper addresses diagnostic hallucinations and insufficiently interpretable reasoning when Large Language Models (LLMs) are applied in clinical settings. The core solution is the MedCollab framework, whose key idea is to emulate the hierarchical consultation workflow of modern hospitals: it combines a dynamic specialist-recruitment mechanism, an Issue-Based Information System (IBIS) argumentation protocol, structured modeling of hierarchical disease causal chains, and a multi-round consensus mechanism, yielding a logically rigorous, evidence-traceable, clearly structured decision path from symptoms to diagnosis and markedly improving diagnostic accuracy and clinical compliance.
Link: https://arxiv.org/abs/2603.01131
Authors: Yuqi Zhan, Xinyue Wu, Tianyu Lin, Yutong Bao, Xiaoyu Wang, Weihao Cheng, Huangwei Chen, Feiwei Qin, Zhu Zhu
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) have shown promise in healthcare applications, however, their use in clinical practice is still limited by diagnostic hallucinations and insufficiently interpretable reasoning. We present MedCollab, a novel multi-agent framework that emulates the hierarchical consultation workflow of modern hospitals to autonomously navigate the full-cycle diagnostic process. The framework incorporates a dynamic specialist recruitment mechanism that adaptively assembles clinical and examination agents according to patient-specific symptoms and examination results. To ensure the rigor of clinical work, we adopt a structured Issue-Based Information System (IBIS) argumentation protocol that requires agents to provide "Positions" backed by traceable evidence from medical knowledge and clinical data. Furthermore, the framework constructs a Hierarchical Disease Causal Chain that transforms flattened diagnostic predictions into a structured model of pathological progression through explicit logical operators. A multi-round Consensus Mechanism iteratively filters low-quality reasoning through logic auditing and weighted voting. Evaluated on real-world clinical datasets, MedCollab significantly outperforms pure LLMs and medical multi-agent systems in Accuracy and RaTEScore, demonstrating a marked reduction in medical hallucinations. These findings indicate that MedCollab provides an extensible, transparent, and clinically compliant approach to medical decision-making.
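The weighted-voting half of the consensus mechanism can be sketched as below; the agent names, diagnoses, weights, and the weight gate standing in for the logic-auditing step are all invented for illustration:

```python
from collections import defaultdict

def weighted_vote(positions, min_weight=0.5):
    """Aggregate agent 'Positions' by reliability-weighted voting, dropping
    low-weight contributions first. Position tuples are (agent, diagnosis,
    weight); the weight gate is a stand-in for the paper's logic-auditing
    step, and all names/thresholds here are illustrative."""
    tally = defaultdict(float)
    for agent, diagnosis, weight in positions:
        if weight >= min_weight:  # audit stand-in: discard weak positions
            tally[diagnosis] += weight
    return max(tally.items(), key=lambda kv: kv[1])[0]

positions = [
    ("cardiology_agent", "myocarditis", 0.9),
    ("radiology_agent", "myocarditis", 0.7),
    ("triage_agent", "anxiety", 0.3),  # filtered: fails the audit gate
    ("lab_agent", "anxiety", 0.6),
]
consensus = weighted_vote(positions)
```

Running several such rounds, with weights updated by the audit between rounds, gives the iterative filtering behaviour the abstract describes.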
[MA-9] Silo-Bench: A Scalable Environment for Evaluating Distributed Coordination in Multi-Agent LLM Systems
[Quick Read]: This paper addresses whether agents in multi-agent systems can reliably compute with distributed information rather than merely exchange it, i.e., whether a "Communication-Reasoning Gap" exists. The key contribution is Silo-Bench, a role-agnostic benchmark of 30 algorithmic tasks spanning three levels of communication complexity. Evaluating 54 configurations over 1,620 experiments shows that although agents spontaneously form task-appropriate coordination topologies and actively exchange information, they systematically fail to synthesize distributed state into correct answers; this deficiency in the reasoning-integration stage compounds with scale and eventually cancels out parallelization gains. The findings demonstrate that naively scaling agent count cannot circumvent context limitations, and Silo-Bench provides a foundation for tracking progress toward genuinely collaborative multi-agent systems.
Link: https://arxiv.org/abs/2603.01045
Authors: Yuzhe Zhang, Feiran Liu, Yi Shan, Xinyi Huang, Xin Yang, Yueqi Zhu, Xuxin Cheng, Cao Liu, Ke Zeng, Terry Jingchen Zhang, Wenyuan Jiang
Affiliations: Beijing University of Technology, Beijing, China; Zhejiang University, Hangzhou, China; ETH Zürich, Switzerland; Meituan LongCat Interaction Team; Vector Institute for Artificial Intelligence
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 19 pages, 7 figures
Abstract:Large language models are increasingly deployed in multi-agent systems to overcome context limitations by distributing information across agents. Yet whether agents can reliably compute with distributed information – rather than merely exchange it – remains an open question. We introduce Silo-Bench, a role-agnostic benchmark of 30 algorithmic tasks across three communication complexity levels, evaluating 54 configurations over 1,620 experiments. Our experiments expose a fundamental Communication-Reasoning Gap: agents spontaneously form task-appropriate coordination topologies and exchange information actively, yet systematically fail to synthesize distributed state into correct answers. The failure is localized to the reasoning-integration stage – agents often acquire sufficient information but cannot integrate it. This coordination overhead compounds with scale, eventually eliminating parallelization gains entirely. These findings demonstrate that naively scaling agent count cannot circumvent context limitations, and Silo-Bench provides a foundation for tracking progress toward genuinely collaborative multi-agent systems.
[MA-10] SimAB: Simulating A/B Tests with Persona-Conditioned AI Agents for Rapid Design Evaluation
[Quick Read]: This paper addresses the slow iteration and impracticality of traditional A/B testing caused by its reliance on real user traffic, which is especially limiting for low-traffic pages, multi-variant comparisons, micro-optimizations, and privacy-sensitive contexts. The key idea of the proposed SimAB system is to reframe A/B testing as a simulation with persona-conditioned AI agents: from design screenshots and a conversion goal it generates user personas, deploys AI agents that simulate users' preference decisions, aggregates the results, and provides interpretable rationales, enabling fast, privacy-preserving evaluation. The approach markedly improves early feedback and design screening, achieving 67% overall accuracy on 47 historical experiments (83% for high-confidence cases).
Link: https://arxiv.org/abs/2603.01024
Authors: Tim Rieder, Marian Schneider, Mario Truss, Vitaly Tsaplin, Alina Rublea, Sinem Dere, Francisco Chicharro Sanz, Tobias Reiss, Mustafa Doga Dogan
Affiliations: ETH Zurich; Adobe; Adobe Research
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 18 pages
Abstract:A/B testing is a standard method for validating design decisions, yet its reliance on real user traffic limits iteration speed and makes certain experiments impractical. We present SimAB, a system that reframes A/B testing as a fast, privacy-preserving simulation using persona-conditioned AI agents. Given design screenshots and a conversion goal, SimAB generates user personas, deploys them as agents that state their preference, aggregates results, and synthesizes rationales. Through a formative study with experimentation practitioners, we identified scenarios where traffic constraints hinder testing, including low-traffic pages, multi-variant comparisons, micro-optimizations, and privacy-sensitive contexts. Our design emphasizes speed, early feedback, actionable rationales, and audience specification. We evaluate SimAB against 47 historical A/B tests with known outcomes, achieving 67% overall accuracy, increasing to 83% for high-confidence cases. Additional experiments show robustness to naming and positional bias and demonstrate accuracy gains from personas. Practitioner feedback suggests that SimAB supports faster evaluation cycles and rapid screening of designs difficult to assess with traditional A/B tests.
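The aggregation step of the pipeline, collecting persona-agent preferences and flagging confidence, might look like this; persona generation and the agents themselves are stubbed out, and the confidence margin is an assumed heuristic, not SimAB's actual rule:

```python
from collections import Counter

def aggregate_ab(votes, high_conf_margin=0.3):
    """Aggregate persona-agent preferences for variants 'A'/'B' and flag
    whether the winning margin clears a confidence bar (pipeline shape
    only; the margin threshold is an illustrative assumption)."""
    counts = Counter(votes)
    total = sum(counts.values())
    winner, top = counts.most_common(1)[0]
    margin = (2 * top - total) / total  # share(winner) - share(loser)
    return winner, margin, margin >= high_conf_margin

votes = ["A", "A", "B", "A", "A", "B", "A"]  # 7 simulated persona agents
winner, margin, high_conf = aggregate_ab(votes)
```

The "high-confidence" flag mirrors the paper's observation that accuracy against real A/B outcomes rises (to 83%) when the simulation is confident.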
[MA-11] BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning
[Quick Read]: This paper addresses the problem that Large Language Models (LLMs), despite strong reasoning in scientific discovery, cannot be executed reliably in irreversible wet-lab environments, where probabilistic hallucinations cause not only wrong results but potentially equipment damage or experimental failure. The key contribution is BioProAgent, a neuro-symbolic framework that anchors probabilistic planning in a deterministic Finite State Machine (FSM) and, via State-Augmented Planning, enforces a Design-Verify-Rectify workflow that checks hardware compliance before execution; a Semantic Symbol Grounding method further cuts context consumption (by roughly 6x) through symbolic abstraction, improving execution reliability with complex device schemas.
Link: https://arxiv.org/abs/2603.00876
Authors: Yuyang Liu, Jingya Wang, Liuzhenghao Lv, Yonghong Tian
Affiliations: Peking University
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:Large language models (LLMs) have demonstrated significant reasoning capabilities in scientific discovery but struggle to bridge the gap to physical execution in wet-labs. In these irreversible environments, probabilistic hallucinations are not merely incorrect, but also cause equipment damage or experimental failure. To address this, we propose BioProAgent, a neuro-symbolic framework that anchors probabilistic planning in a deterministic Finite State Machine (FSM). We introduce a State-Augmented Planning mechanism that enforces a rigorous Design-Verify-Rectify workflow, ensuring hardware compliance before execution. Furthermore, we address the context bottleneck inherent in complex device schemas by Semantic Symbol Grounding, reducing token consumption by ~6x through symbolic abstraction. In the extended BioProBench benchmark, BioProAgent achieves 95.6% physical compliance (compared to 21.0% for ReAct), demonstrating that neuro-symbolic constraints are essential for reliable autonomy in irreversible physical environments. Code at this https URL and project at this https URL
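The Verify step of the Design-Verify-Rectify loop reduces to checking a proposed action sequence against a deterministic FSM before anything touches hardware. The states, actions, and transition table below are invented for illustration:

```python
# A deterministic FSM gate in the spirit of Design-Verify-Rectify: a plan
# (sequence of actions) is only executed if every transition is legal for
# the device's state machine; otherwise it is sent back for rectification.
# States, actions, and the transition table are invented examples.
TRANSITIONS = {
    ("idle", "load_sample"): "loaded",
    ("loaded", "seal"): "sealed",
    ("sealed", "heat"): "heated",
    ("heated", "unseal"): "loaded",
}

def verify_plan(plan, start="idle"):
    """Return (ok, failing_step); ok=True means the whole plan is legal."""
    state = start
    for step, action in enumerate(plan):
        nxt = TRANSITIONS.get((state, action))
        if nxt is None:
            return False, step  # rectify from this step onward
        state = nxt
    return True, None

ok_good, _ = verify_plan(["load_sample", "seal", "heat"])
ok_bad, bad_step = verify_plan(["load_sample", "heat"])  # heating unsealed sample
```

A hallucinated step fails deterministically at verification time instead of on the instrument, which is the point of anchoring the probabilistic planner in the FSM.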
[MA-12] DIG to Heal: Scaling General-purpose Agent Collaboration via Explainable Dynamic Decision Paths
[Quick Read]: This paper addresses the lack of interpretability and controllability in multi-agent systems, in particular how to make emergent collaboration observable, explainable, and correctable for general-purpose Large Language Model (LLM) agents that operate without predefined roles, control flow, or communication constraints. The key contribution is the Dynamic Interaction Graph (DIG), which models the agents' collaboration as a time-evolving causal network, making emergent collaborative behavior observable and explainable for the first time; it enables real-time identification, explanation, and correction of collaboration-induced error patterns directly from collaboration paths, filling a critical gap in understanding how general-purpose LLM agents solve problems together in truly autonomous multi-agent systems.
Link: https://arxiv.org/abs/2603.00309
Authors: Hanqing Yang, Hyungwoo Lee, Yuhang Yao, Zhiwei Liu, Kay Liu, Jingdi Chen, Carlee Joe-Wong
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:The increasingly popular agentic AI paradigm promises to harness the power of multiple, general-purpose large language model (LLM) agents to collaboratively complete complex tasks. While many agentic AI systems utilize predefined workflows or agent roles in order to reduce complexity, ideally these agents would be truly autonomous, able to achieve emergent collaboration even as the number of collaborating agents increases. Yet in practice, such unstructured interactions can lead to redundant work and cascading failures that are difficult to interpret or correct. In this work, we study multi-agent systems composed of general-purpose LLM agents that operate without predefined roles, control flow, or communication constraints, relying instead on emergent collaboration to solve problems. We introduce the Dynamic Interaction Graph (DIG), which captures emergent collaboration as a time-evolving causal network of agent activations and interactions. DIG makes emergent collaboration observable and explainable for the first time, enabling real-time identification, explanation, and correction of collaboration-induced error patterns directly from agents’ collaboration paths. Thus, DIG fills a critical gap in understanding how general LLM agents solve problems together in truly agentic multi-agent systems. The project webpage can be found at: this https URL.
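One plausible minimal form of a time-evolving causal interaction graph is a timestamped edge list of agent activations that supports queries such as "which agents' earlier messages fed into this output?". The data structure and query below are assumptions for illustration, not the paper's implementation:

```python
# Minimal stand-in for a Dynamic Interaction Graph: timestamped, directed
# activation edges between agents, queryable for the causal ancestry of
# an output. The (t, src, dst, kind) edge representation is an invented
# sketch, not the paper's data structure.
class DIG:
    def __init__(self):
        self.edges = []  # (t, src, dst, kind)

    def record(self, t, src, dst, kind):
        self.edges.append((t, src, dst, kind))

    def ancestors(self, agent, t):
        """Agents whose earlier messages causally precede `agent` at time t."""
        found, frontier = set(), {(agent, t)}
        while frontier:
            node, horizon = frontier.pop()
            for (et, src, dst, _) in self.edges:
                if dst == node and et < horizon and src not in found:
                    found.add(src)
                    frontier.add((src, et))
        return found

g = DIG()
g.record(1, "planner", "coder", "task")
g.record(2, "coder", "reviewer", "draft")
g.record(3, "reviewer", "coder", "critique")
blame = g.ancestors("reviewer", 3)
```

Tracing a bad final answer back through such causal ancestry is the kind of "correction from collaboration paths" the abstract describes.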
[MA-13] Graph-theoretic Agreement Framework for Multi-agent LLM Systems
[Quick Read]: This paper addresses the verification and security of autonomous coordination in distributed multi-agent architectures, in particular the reasoning instability and consensus failures induced by adversarial-critique mechanisms (multi-agent debate, constitutional oversight, helper-critic loops) in generative AI systems. The core challenge is that LLMs are dynamical systems whose latent states cannot be fully observed from their output text, coupling risks across the network's topology and the observability of individual agents. The key contribution is a rigorous graph-theoretic framework that maps Transformer cross-entropy log-odds to the signed Laplacian, characterizes consensus stability through structural balance theory, and proves that unobserved hidden prompts act as topological backdoors that destabilize cooperative consensus. Deadlocks are then resolved by restricting interaction topologies to chordal graphs and applying matrix decomposition with Gram-Schmidt orthogonalization, where rank-one spectral edge perturbations deterministically break expertise symmetry by shifting eigenvalues into the stable left-half plane, guaranteeing convergence.
Link: https://arxiv.org/abs/2603.00121
Authors: Muhammad Umar Javed
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:
Abstract:The shift from monolithic LLMs to distributed multi-agent architectures demands new frameworks for verifying and securing autonomous coordination. Unlike traditional multi-agent systems focused on cooperative state alignment, modern LLM patterns: multi-agent debate, constitutional oversight, helper-critic loops-rely on adversarial critique for error correction and reasoning refinement. Since LLMs are dynamical systems whose latent states are imperfectly observable from verbalized outputs, securing these networks requires understanding both macroscopic topology and microscopic agent observability. This paper establishes a rigorous graph-theoretic framework for analyzing consensus in signed, directed interaction networks, bridging graph theory and LLM reasoning by formally mapping Transformer cross-entropy log-odds to the signed Laplacian. We characterize agreement stability through structural balance theory, showing how unbalanced critique cycles produce logical frustration and persistent reasoning oscillations, and prove that unobservable latent states from hidden system prompts act as topological Trojan horses that destabilize cooperative consensus. To resolve unobservable deadlocks, we restrict interaction topologies to chordal graphs and apply matrix decomposition with Gram-Schmidt orthogonalization, proving that rank-one spectral edge perturbations deterministically break expertise symmetry by shifting eigenvalues into the stable left-half plane. Core contributions include consensus theorems, polynomial-time Perfect Elimination Ordering verification algorithms, and large-scale empirical validation on clustered ensembles of LLaMA-3, Mistral, and Gemma agents.
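Two of the framework's ingredients can be sketched concretely: the signed Laplacian L = D - A (with degrees taken from absolute edge weights) and the structural-balance test for a cycle (balanced iff the product of its edge signs is positive). The mapping from Transformer log-odds to edge weights is the paper's contribution and is not reproduced; the 3-agent example is invented:

```python
# Signed Laplacian of a 3-agent critique network plus a balance check.
# Negative edges model critique, positive edges model support; a cycle is
# structurally balanced iff the product of its edge signs is positive.
def signed_laplacian(adj):
    """L = D - A with diagonal degrees from absolute edge weights."""
    n = len(adj)
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                lap[i][j] = sum(abs(adj[i][k]) for k in range(n) if k != i)
            else:
                lap[i][j] = -adj[i][j]
    return lap

def cycle_balanced(adj, cycle):
    sign = 1.0
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        sign *= adj[a][b]
    return sign > 0

# Helper-critic triangle: two supportive edges and one critical edge give
# an unbalanced cycle -- the "logical frustration" regime in the abstract.
A = [[0.0, 1.0, 1.0],
     [1.0, 0.0, -1.0],
     [1.0, -1.0, 0.0]]
L = signed_laplacian(A)
frustrated = not cycle_balanced(A, [0, 1, 2])
```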
[MA-14] SIGMAS: Second-Order Interaction-based Grouping for Overlapping Multi-Agent Swarms AAMAS2026
[Quick Read]: This paper addresses the unsupervised identification of latent group structure in overlapping multi-agent swarms, i.e., inferring groups with inherent, persistent memberships directly from agent trajectories without ground-truth labels. Conventional approaches rely on direct pairwise interaction modeling and struggle with dynamically overlapping groups and complex interaction scenarios. The key idea of SIGMAS (Second-order Interaction-based Grouping for Multi-Agent Swarms) is to model second-order interactions between agents, i.e., how similarly individuals interact with others, to capture consistent within-group patterns, and to introduce a learnable gating mechanism that adaptively balances individual behavior against collective dynamics, enabling robust, fine-grained group inference.
Link: https://arxiv.org/abs/2603.00120
Authors: Minah Lee, Saibal Mukhopadhyay
Affiliations: The University of Texas at Dallas; Georgia Institute of Technology
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: Accepted at AAMAS 2026
Abstract:Swarming systems, such as drone fleets and robotic teams, exhibit complex dynamics driven by both individual behaviors and emergent group-level interactions. Unlike traditional multi-agent domains such as pedestrian crowds or traffic systems, swarms typically consist of a few large groups with inherent and persistent memberships, making group identification essential for understanding fine-grained behavior. We introduce the novel task of group prediction in overlapping multi-agent swarms, where latent group structures must be inferred directly from agent trajectories without ground-truth supervision. To address this challenge, we propose SIGMAS (Second-order Interaction-based Grouping for Multi-Agent Swarms), a self-supervised framework that goes beyond direct pairwise interactions and model second-order interaction across agents. By capturing how similarly agents interact with others, SIGMAS enables robust group inference and adaptively balances individual and collective dynamics through a learnable gating mechanism for joint reasoning. Experiments across diverse synthetic swarm scenarios demonstrate that SIGMAS accurately recovers latent group structures and remains robust under simultaneously overlapping swarm dynamics, establishing both a new benchmark task and a principled modeling framework for swarm understanding.
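The second-order idea, grouping agents by how similarly they interact with everyone else rather than by their direct pairwise link, can be sketched by comparing rows of an interaction matrix. The cosine threshold and greedy assignment are illustrative, unlike the paper's learned, gated model:

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def second_order_groups(interaction, threshold=0.9):
    """Group agents by comparing *rows* of the pairwise-interaction matrix:
    two agents belong together when they interact with everyone else in a
    similar way (second-order similarity), even if their direct pairwise
    entry is weak. Threshold and greedy assignment are illustrative."""
    n = len(interaction)
    groups = []
    for i in range(n):
        for g in groups:
            if cosine(interaction[i], interaction[g[0]]) >= threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    return groups

# Agents 0,1 interact strongly with {2,3} and barely with each other, yet
# their interaction *profiles* match -- so second-order grouping pairs them.
M = [[0.0, 0.1, 0.9, 0.8],
     [0.1, 0.0, 0.8, 0.9],
     [0.9, 0.8, 0.0, 0.1],
     [0.8, 0.9, 0.1, 0.0]]
groups = second_order_groups(M)
```

Note that the direct entry between agents 0 and 1 is only 0.1; a first-order pairwise method would not place them together.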
[MA-15] Position: AI Agents Are Not (Yet) a Panacea for Social Simulation
[Quick Read]: This paper pushes back on the current over-optimism that treats Large Language Model (LLM) based agents as a panacea for social simulation, highlighting the limitations of LLM-integrated agents for modeling population dynamics. The core problems: existing agent pipelines are typically optimized and validated for role-playing plausibility rather than faithful human behavioral validity; collective outcomes are often determined by agent-environment co-dynamics rather than agent-agent messaging alone; and in policy-oriented settings, interaction protocols, scheduling, and initial information priors can dominate the results. The key contribution is a unified formulation of AI-agent social simulation as an environment-involved partially observable Markov game with explicit environment interaction and scheduling mechanisms, making a simulation's critical assumptions explicit and auditable and pushing toward more rigorous social-simulation research.
Link: https://arxiv.org/abs/2603.00113
Authors: Yiming Li, Dacheng Tao
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments: 13 pages
Abstract:Recent advances in large language models (LLMs) have spurred growing interest in using LLM-integrated agents for social simulation, often under the implicit assumption that realistic population dynamics will emerge once role-specified agents are placed in a networked multi-agent setting. This position paper argues that LLM-based agents are not (yet) a panacea for social simulation. We attribute this over-optimism to a systematic mismatch between what current agent pipelines are typically optimized and validated to produce and what simulation-as-science requires. Concretely, role-playing plausibility does not imply faithful human behavioral validity; collective outcomes are frequently mediated by agent-environment co-dynamics rather than agent-agent messaging alone; and results can be dominated by interaction protocols, scheduling, and initial information priors, especially in policy-oriented settings. To make these assumptions explicit and auditable, we propose a unified formulation of AI agent-based social simulation as an environment-involved partially observable Markov game with explicit exposure and scheduling mechanisms and call for further actions.
自然语言处理
[NLP-0] Reasoning Core: A Scalable Procedural Data Generation Suite for Symbolic Pre-training and Post-Training
【速读】: 该论文旨在解决当前语言模型在符号推理能力上的局限性,即标准预训练语料难以支撑模型在复杂逻辑、规划、因果推断等核心形式化领域实现持续进阶。现有方法多依赖固定模板或谜题,缺乏可扩展性和分布多样性,无法满足大规模训练需求。其解决方案的关键在于提出一个名为Reasoning Core的可扩展生成套件,能够程序化地生成跨多个核心形式化领域的可验证符号推理数据,包括PDDL规划、一阶逻辑(First-Order Logic)带等式、上下文无关语法解析与生成、随机贝叶斯网络中的因果推理以及方程组求解;每个任务均配备外部求解器用于严格验证,并支持连续难度控制以适配课程学习设计,同时提供可监督训练所需的推理轨迹和强化学习所需的可验证奖励函数,从而有效提升模型的推理性能并保持语言建模质量。
链接: https://arxiv.org/abs/2603.02208
作者: Valentin Lacombe,Valentin Quesnel,Damien Sileo
机构: Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 - CRIStAL, F-59000 Lille, France
类目: Computation and Language (cs.CL)
备注: Keywords: LLMs, NLP, Dataset, Corpus, Procedural Pre-training, Reasoning, Logic, Formal Semantics this https URL
Abstract:Training on verifiable symbolic data is a promising way to expand the reasoning frontier of language models beyond what standard pre-training corpora provide. Yet existing procedural generators often rely on fixed puzzles or templates and do not deliver the distributional breadth needed at scale. We introduce Reasoning Core, a scalable suite that procedurally generates verifiable symbolic reasoning data across core formal domains: PDDL planning over randomized domains, first-order logic with equality, context-free grammar parsing and generation, causal reasoning over random Bayesian networks, and systems of equations. Each task is paired with an external solver for rigorous verification and admits continuous difficulty control for curriculum design. Examples can optionally include solver-derived reasoning traces, enabling supervised training from the earliest pre-training stages, and the same interface provides verifiable reward functions for reinforcement learning. Our experiments show that mixing Reasoning Core data into pre-training improves downstream reasoning while preserving, or slightly improving, language modeling quality. Zero-shot evaluations confirm these tasks challenge frontier models such as GPT-5. The code and data are publicly available under the MIT license.
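摘要中"先采样答案、再程序化生成题目、并用外部求解器严格验证"的流程,可以用下面这个二元一次方程组的极简示意理解(`generate_instance`、`solver` 等名称均为示意,并非 Reasoning Core 的实际生成器):

```python
import random
from fractions import Fraction

def generate_instance(rng):
    """先采样解 (x, y),再采样系数,因此每个生成的题目都自带
    已知的 ground truth。仅为对 generate-then-verify 模式的示意。"""
    while True:
        x, y = rng.randint(-5, 5), rng.randint(-5, 5)
        a, b, c, d = (rng.randint(1, 9) for _ in range(4))
        if a * d - b * c != 0:  # 保证方程组有唯一解
            break
    e, f = a * x + b * y, c * x + d * y
    prompt = f"Solve for x, y: {a}x + {b}y = {e}; {c}x + {d}y = {f}"
    return prompt, (a, b, e, c, d, f), (x, y)

def solver(a, b, e, c, d, f):
    """充当"外部求解器"的验证器:克拉默法则。"""
    det = a * d - b * c
    return Fraction(e * d - b * f, det), Fraction(a * f - e * c, det)

rng = random.Random(0)
prompt, coeffs, answer = generate_instance(rng)
assert solver(*coeffs) == answer  # 每个实例都能通过外部求解器验证
print(prompt)
```

由于解与系数都是随机采样,此类生成器天然支持无限扩展,并可通过调整取值范围实现摘要所说的连续难度控制。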
[NLP-1] Tool Verification for Test-Time Reinforcement Learning
【速读】: 该论文旨在解决测试时强化学习(Test-time reinforcement learning, TTRL)在自进化大型推理模型(Large Reasoning Models, LRMs)中因高频但未经验证的共识所导致的奖励信号偏差与模式崩溃问题。解决方案的关键在于提出T³RL(Tool-Verification for Test-Time Reinforcement Learning),其核心机制是在奖励估计过程中引入测试时工具验证(test-time tool verification),通过外部工具(如代码执行)作为证据来增强可验证轨迹的权重,从而在投票机制中生成更可靠的伪标签,实现稳定且高效的在线数据合成与模型自进化。
链接: https://arxiv.org/abs/2603.02203
作者: Ruotong Liao,Nikolai Röhrich,Xiaohan Wang,Yuhui Zhang,Yasaman Samadzadeh,Volker Tresp,Serena Yeung-Levy
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 12 pages, 11 figures
Abstract:Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models (LRMs), enabling online adaptation on unlabeled test inputs via self-induced rewards through majority voting. However, a spurious yet high-frequency unverified consensus can become a biased and reinforced reward signal, leading to incorrect mode collapse. We address this failure mode with T^3RL (Tool-Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation. Concretely, a verifier uses an external tool as evidence (e.g., from code execution) to upweight verified rollouts in a verification-aware voting, producing more reliable pseudo-labels for training. Across various math difficulties (MATH-500, AMC, and AIME 2024) and diverse backbone types, T^3RL significantly improves over TTRL, with larger gains on harder problems. More broadly, T^3RL can be viewed as verified online data synthesis, highlighting test-time tool verification as a key mechanism for stabilizing self-evolution.
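T^3RL 的核心机制,即在投票中为工具验证过的轨迹加权以产生更可靠的伪标签,可以用如下示意理解(`boost` 权重与函数名均为假设,并非论文的精确公式):

```python
from collections import Counter

def verification_aware_vote(rollouts, boost=3.0):
    """将各条 rollout 的答案聚合为伪标签:被外部工具(如代码执行)
    验证过的答案获得更高票权。仅为对 verification-aware voting
    思想的极简示意,非 T^3RL 的实际实现。"""
    scores = Counter()
    for answer, tool_verified in rollouts:
        scores[answer] += boost if tool_verified else 1.0
    return scores.most_common(1)[0][0]

# 高频但未经验证的共识("42" x3)会输给经工具验证的少数答案("41" x2),
# 从而避免摘要中描述的错误模式坍缩。
rollouts = [("42", False), ("42", False), ("42", False),
            ("41", True), ("41", True)]
print(verification_aware_vote(rollouts))
```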
[NLP-2] Organizing Orchestrating and Benchmarking Agent Skills at Ecosystem Scale
【速读】: 该论文旨在解决生成式 AI(Generative AI)代理技能生态系统中如何有效利用、管理和扩展技能资源的核心问题。其解决方案的关键在于提出首个系统性的框架 AgentSkillOS,该框架包含两个阶段:一是“管理技能”阶段,通过节点级递归分类构建能力树(capability tree),实现技能的高效发现;二是“解决问题”阶段,基于有向无环图(DAG)的流水线机制检索、编排并执行多技能组合。实验表明,树状检索能近似最优技能选择,而 DAG 编排显著优于扁平化调用方式,验证了结构化组合是释放技能潜力的关键。
链接: https://arxiv.org/abs/2603.02176
作者: Hao Li,Chunjiang Mu,Jianhao Chen,Siyue Ren,Zhiyao Cui,Yiqun Zhang,Lei Bai,Shuyue Hu
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:The rapid proliferation of Claude agent skills has raised the central question of how to effectively leverage, manage, and scale the agent skill ecosystem. In this paper, we propose AgentSkillOS, the first principled framework for skill selection, orchestration, and ecosystem-level management. AgentSkillOS comprises two stages: (i) Manage Skills, which organizes skills into a capability tree via node-level recursive categorization for efficient discovery; and (ii) Solve Tasks, which retrieves, orchestrates, and executes multiple skills through DAG-based pipelines. To evaluate the agent’s ability to invoke skills, we construct a benchmark of 30 artifact-rich tasks across five categories: data computation, document creation, motion video, visual design, and web interaction. We assess the quality of task outputs using LLM-based pairwise evaluation, and the results are aggregated via a Bradley-Terry model to produce unified quality scores. Experiments across three skill ecosystem scales (200 to 200K skills) show that tree-based retrieval effectively approximates oracle skill selection, and that DAG-based orchestration substantially outperforms native flat invocation even when given the identical skill set. These findings confirm that structured composition is the key to unlocking skill potential. Our GitHub repository is available at: this https URL.
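基于 DAG 的技能编排,本质上是按依赖拓扑序执行技能并将上游输出传递给下游,可用如下示意理解(技能名与调用约定均为虚构,并非 AgentSkillOS 的内部实现):

```python
from graphlib import TopologicalSorter

def run_pipeline(dag, skills, task):
    """按依赖顺序执行多技能流水线:每个技能接收任务本身,
    以及其依赖技能的输出。dag 将技能名映射到其前驱集合。"""
    outputs = {}
    for name in TopologicalSorter(dag).static_order():
        deps = {d: outputs[d] for d in dag.get(name, ())}
        outputs[name] = skills[name](task, deps)
    return outputs

# "report" 同时依赖两个上游技能的产物;
# 扁平化(flat)逐一调用则无法表达这种组合结构。
dag = {"fetch": set(), "analyze": {"fetch"}, "report": {"fetch", "analyze"}}
skills = {
    "fetch": lambda task, deps: f"data({task})",
    "analyze": lambda task, deps: f"stats({deps['fetch']})",
    "report": lambda task, deps: f"doc({deps['analyze']})",
}
print(run_pipeline(dag, skills, "sales")["report"])
```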
[NLP-3] Zero- and Few-Shot Named-Entity Recognition: Case Study and Dataset in the Crime Domain (CrimeNER) ICDAR
【速读】: 该论文旨在解决犯罪相关文档中关键信息提取的难题,特别是针对现实世界犯罪场景下标注数据严重不足的问题。其解决方案的关键在于构建了一个名为CrimeNERdb的犯罪相关命名实体识别(Named-Entity Recognition, NER)数据库,包含超过1500份从公开恐怖袭击报告和美国司法部新闻稿中抽取并标注的文档,并定义了5类粗粒度与22类细粒度的犯罪实体类别。此外,论文通过在零样本(Zero-Shot)和少样本(Few-Shot)设置下对前沿NER模型及通用大语言模型(Large Language Models, LLMs)进行实验,验证了该数据集在实际应用中的有效性与质量。
链接: https://arxiv.org/abs/2603.02150
作者: Miguel Lopez-Duran,Julian Fierrez,Aythami Morales,Daniel DeAlcala,Gonzalo Mancera,Javier Irigoyen,Ruben Tolosana,Oscar Delgado,Francisco Jurado,Alvaro Ortigosa
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: Sent for review at the main conference of the International Conference of Document Analysis and Recognition (ICDAR) 2026
Abstract:The extraction of critical information from crime-related documents is a crucial task for law enforcement agencies. Named-Entity Recognition (NER) can perform this task by extracting information about the crime, the criminal, or the law enforcement agencies involved. However, there is a considerable lack of adequately annotated data on general real-world crime scenarios. To address this issue, we present CrimeNER, a case study of crime-related zero- and few-shot NER, and a general crime-related Named-Entity Recognition database (CrimeNERdb) consisting of more than 1.5k annotated documents for the NER task, extracted from public reports on terrorist attacks and the U.S. Department of Justice’s press notes. We define 5 coarse crime entity types and a total of 22 fine-grained entity types. We assess the quality of the case study and the annotated data through experiments in zero- and few-shot settings with state-of-the-art NER models as well as generalist, commonly used Large Language Models.
[NLP-4] LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards ICLR2026
【速读】: 该论文旨在解决强化学习中基于可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)在长上下文场景下性能下降的问题,其根本原因在于仅依赖最终答案的稀疏奖励信号无法有效引导模型进行外部信息的定位与推理(即上下文接地,context grounding),导致梯度消失,学习变得不可行。解决方案的关键在于提出LongRLVR方法,通过引入一个密集且可验证的上下文奖励(context reward)作为辅助信号,直接激励模型选择正确的证据来源,从而提供稳定的学习梯度,显著提升长上下文任务中的推理能力。
链接: https://arxiv.org/abs/2603.02146
作者: Guanzheng Chen,Michael Qizhe Shieh,Lidong Bing
机构: 未知
类目: Computation and Language (cs.CL)
备注: ICLR 2026
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) by optimizing them against factual outcomes. However, this paradigm falters in long-context scenarios, as its reliance on internal parametric knowledge is ill-suited for tasks requiring contextual grounding–the ability to find and reason over externally provided information. We identify a key reason for this failure: a reward based solely on the final answer is too sparse to effectively guide the model for identifying relevant evidence. We formally prove that the outcome-only reward leads to significant vanishing gradients for the context grounding process, rendering learning intractable. To overcome this bottleneck, we introduce LongRLVR to augment the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model for selecting the correct grounding information, providing a robust learning gradient that solves the underlying optimization challenge. We validate our method on challenging long-context benchmarks using Qwen and LLaMA models. LongRLVR consistently and significantly outperforms the standard RLVR across all models and benchmarks, e.g., boosting a 14B model’s scores on RULER-QA from 73.17 to 88.90 and on LongBench v2 from 39.8 to 46.5. Our work demonstrates that explicitly rewarding the grounding process is a critical and effective strategy for unlocking the full reasoning potential of LLMs in long-context applications. Our code is available at this https URL.
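"稀疏答案奖励 + 密集可验证上下文奖励"的组合,可以用如下示意理解(此处以"模型引用到的金标准证据召回率"充当上下文奖励,权重 `alpha` 为假设值,论文中上下文奖励的具体定义与加权方式可能不同):

```python
def longrlvr_reward(pred_answer, gold_answer, cited_ids, gold_evidence_ids,
                    alpha=0.5):
    """稀疏的最终答案奖励,加上密集且可验证的上下文接地奖励。
    仅为示意:上下文奖励取模型引用的段落对金标准证据的召回率。"""
    answer_reward = 1.0 if pred_answer == gold_answer else 0.0
    gold = set(gold_evidence_ids)
    context_reward = len(gold & set(cited_ids)) / len(gold) if gold else 0.0
    return answer_reward + alpha * context_reward

# 答案错误但定位到了正确证据的轨迹仍能获得梯度信号,
# 这正是 outcome-only RLVR 在长上下文中所缺失的。
print(longrlvr_reward("Paris", "Lyon", ["p3", "p7"], ["p3", "p7"]))
```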
[NLP-5] LLMs as Strategic Actors: Behavioral Alignment, Risk Calibration, and Argumentation Framing in Geopolitical Simulations
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在结构化地缘政治模拟环境中行为特征不明确的问题,特别是其在多轮战略决策中与人类决策者表现差异的机制。解决方案的关键在于通过四个真实世界危机模拟场景对六种前沿LLMs进行系统评估,量化其在行动一致性(action alignment)、风险校准(risk calibration)以及基于国际关系理论的论证框架(argumentative framing)三个维度的表现,并揭示模型随时间演化出的行为模式与策略更新机制。结果表明,尽管初始阶段LLMs能近似人类决策模式,但随模拟轮次推进逐渐显现出不同于人类的策略路径和解释倾向,尤其表现为以稳定、协调与风险规避为核心的规范性-合作型推理,而缺乏对抗性思维。
链接: https://arxiv.org/abs/2603.02128
作者: Veronika Solopova,Viktoria Skorik,Maksym Tereshchenko,Alina Haidun,Ostap Vykhopen
机构: Mantisanalytics(曼蒂斯分析)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Large language models (LLMs) are increasingly proposed as agents in strategic decision environments, yet their behavior in structured geopolitical simulations remains under-researched. We evaluate six popular state-of-the-art LLMs alongside results from human participants across four real-world crisis simulation scenarios, requiring models to select predefined actions and justify their decisions across multiple rounds. We compare models to humans in action alignment, risk calibration through chosen actions’ severity, and argumentative framing grounded in international relations theory. Results show that models approximate human decision patterns in base simulation rounds but diverge over time, displaying distinct behavioural profiles and strategy updates. LLM explanations for chosen actions across all models exhibit a strong normative-cooperative framing centered on stability, coordination, and risk mitigation, with limited adversarial reasoning.
[NLP-6] Recursive Models for Long-Horizon Reasoning
【速读】: 该论文旨在解决现代语言模型在有限上下文长度下进行长时程推理(long-horizon reasoning)的瓶颈问题。其核心挑战在于,传统自回归(autoregressive)模型受限于单一序列的上下文管理方式,难以处理需要跨多个步骤、大规模信息交互的复杂任务。解决方案的关键在于引入递归机制(recursion),提出递归模型(recursive models),使模型能够自我调用以在隔离的子上下文中解决子任务;理论证明表明,任何可计算问题均可被递归分解为若干子任务,每个子任务所需的有效上下文规模呈指数级小于标准自回归方法,从而突破单序列上下文管理的极限。此外,该框架还可扩展至具备任意上下文处理与控制流能力的现代智能体系统,并证明递归模型在此类系统中具有最优计算能力。
链接: https://arxiv.org/abs/2603.02112
作者: Chenxiao Yang,Nathan Srebro,Zhiyuan Li
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Modern language models reason within bounded context, an inherent constraint that poses a fundamental barrier to long-horizon reasoning. We identify recursion as a core principle for overcoming this barrier, and propose recursive models as a minimal realization, where the model can recursively invoke itself to solve subtasks in isolated contexts. We prove that any computable problem admits a recursive decomposition in which each subtask requires only exponentially smaller active context than standard autoregressive models; this strictly surpasses any context management approach confined to a single sequence, such as summarization. We further generalize our framework to modern agentic systems with arbitrary context processing and control flows, and prove that recursive models can achieve optimal power within this broader class. Experimentally, we train a 3B model to reason recursively and evaluate on Boolean satisfiability, a task requiring long-horizon combinatorial search, where it significantly outperforms frontier LLMs.
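"递归自调用、在隔离子上下文中求解子任务"的思想,可以脱离 LLM 用一个玩具例子示意:每次调用只持有区间端点这一极小的活跃上下文,而整个任务跨越一百万个数,对应摘要中"每个子任务所需活跃上下文指数级缩小"的直觉(`toy_model` 仅为示意,并非论文训练的模型):

```python
def recursive_solve(model, task):
    """递归分解:'model' 要么直接给出答案,要么产出若干子任务,
    每个子任务由一次全新的自调用在隔离上下文中求解,再合并子答案。"""
    answer, subtasks, combine = model(task)
    if answer is not None:
        return answer
    return combine([recursive_solve(model, sub) for sub in subtasks])

# 玩具模型:对一个区间求和,区间过大时对半拆分;
# 每次调用的活跃上下文只有两个端点,递归深度仅为 O(log n)。
def toy_model(task):
    lo, hi = task
    if hi - lo <= 1000:
        return sum(range(lo, hi)), None, None
    mid = (lo + hi) // 2
    return None, [(lo, mid), (mid, hi)], sum

print(recursive_solve(toy_model, (0, 1_000_000)))
```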
[NLP-7] Recursive Think-Answer Process for LLMs and VLMs CVPR2026
【速读】: 该论文旨在解决当前基于思维链(Chain-of-Thought, CoT)的推理模型(如DeepSeek-R1)在单次推理过程中仍易产生输出错误的问题,尽管这些模型具备可解释的内部推理能力且常出现类似“Oops!”的自我反思提示。解决方案的关键在于提出一种高效的递归式思考-作答过程(Recursive Think-Answer Process, R-TAP),其核心是引入一个置信度生成器(confidence generator),用于评估模型响应的确定性并驱动迭代优化;同时设计两种互补奖励机制——递归置信度提升奖励(Recursively Confidence Increase Reward)与最终答案置信度奖励(Final Answer Confidence Reward),从而显著提升大语言模型(LLMs)和视觉-语言模型(VLMs)的推理准确性与稳定性,并减少推理过程中对“Oops”类自省表达的依赖,实现更高效、可靠的推理流程。
链接: https://arxiv.org/abs/2603.02099
作者: Byung-Kwan Lee,Youngchae Chee,Yong Man Ro
机构: 未知
类目: Computation and Language (cs.CL)
备注: CVPR 2026 Findings, Project page: this https URL
Abstract:Think-Answer reasoners such as DeepSeek-R1 have made notable progress by leveraging interpretable internal reasoning. However, despite the frequent presence of self-reflective cues like “Oops!”, they remain vulnerable to output errors during single-pass inference. To address this limitation, we propose an efficient Recursive Think-Answer Process (R-TAP) that enables models to engage in iterative reasoning cycles and generate more accurate answers, going beyond conventional single-pass approaches. Central to this approach is a confidence generator that evaluates the certainty of model responses and guides subsequent improvements. By incorporating two complementary rewards, the Recursively Confidence Increase Reward and the Final Answer Confidence Reward, we show that R-TAP-enhanced models consistently outperform conventional single-pass methods for both large language models (LLMs) and vision-language models (VLMs). Moreover, by analyzing the frequency of “Oops”-like expressions in model responses, we find that R-TAP-applied models exhibit significantly fewer self-reflective patterns, resulting in more stable and faster inference-time reasoning. We hope R-TAP paves the way toward efficient and more elaborate methods for refining the reasoning processes of future AI.
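R-TAP 的"思考-作答-评估置信度-再迭代"循环,可以用如下示意理解(真实的 R-TAP 通过两种奖励训练置信度生成器,而非简单的阈值判停;以下名称与数值均为虚构):

```python
def r_tap(generate, confidence, max_rounds=4, threshold=0.9):
    """迭代执行 think-answer 轮次:由置信度生成器决定
    接受当前答案,还是继续再想一轮。"""
    answer, history = None, []
    for _ in range(max_rounds):
        answer = generate(history)
        conf = confidence(answer)
        history.append((answer, conf))
        if conf >= threshold:
            break
    return answer, history

# 玩具组件:每一轮的答案置信度逐步上升,第三轮越过阈值后停止。
answers = iter([("draft", 0.4), ("revised", 0.7), ("final", 0.95)])
state = {}
def generate(history):
    state["cur"] = next(answers)
    return state["cur"][0]
def confidence(answer):
    return state["cur"][1]

answer, history = r_tap(generate, confidence)
print(answer, len(history))
```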
[NLP-8] ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels
【速读】: 该论文旨在解决当前医疗领域大语言模型(Large Language Models, LLMs)评估基准静态化、任务孤立化的问题,无法真实反映临床工作流程中开放性、纵向性和安全性复杂性的局限。其核心解决方案是提出ClinConsensus——一个由临床专家构建、验证并质量控制的中文医疗基准,涵盖从预防到长期随访的完整诊疗链条,包含36个医学专科、12类常见临床任务及渐进式复杂度。关键创新在于引入基于评分量表的评阅协议和临床适用一致性评分(Clinically Applicable Consistency Score, CACS@k),以及双评委评估框架:结合高能力LLM作为评判者与通过监督微调训练的轻量化本地部署判别模型,实现可扩展、可复现且贴近医生判断的评估体系。
链接: https://arxiv.org/abs/2603.02097
作者: Xiang Zheng,Han Li,Wenjie Luo,Weiqi Zhai,Yiyuan Li,Chuanmiao Yan,Tianyi Tang,Yubo Ma,Kexin Yang,Dayiheng Liu,Hu Wei,Bing Zhao
机构: 未知
类目: Computation and Language (cs.CL)
备注: 8 pages, 6 figures
Abstract:Large language models (LLMs) are increasingly applied to health management, showing promise across disease prevention, clinical decision-making, and long-term care. However, existing medical benchmarks remain largely static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows. We introduce ClinConsensus, a Chinese medical benchmark curated, validated, and quality-controlled by clinical experts. ClinConsensus comprises 2500 open-ended cases spanning the full continuum of care–from prevention and intervention to long-term follow-up–covering 36 medical specialties, 12 common clinical task types, and progressively increasing levels of complexity. To enable reliable evaluation of such complex scenarios, we adopt a rubric-based grading protocol and propose the Clinically Applicable Consistency Score (CACS@k). We further introduce a dual-judge evaluation framework, combining a high-capability LLM-as-judge with a distilled, locally deployable judge model trained via supervised fine-tuning, enabling scalable and reproducible evaluation aligned with physician judgment. Using ClinConsensus, we conduct a comprehensive assessment of several leading LLMs and reveal substantial heterogeneity across task themes, care stages, and medical specialties. While top-performing models achieve comparable overall scores, they differ markedly in reasoning, evidence use, and longitudinal follow-up capabilities, and clinically actionable treatment planning remains a key bottleneck. We release ClinConsensus as an extensible benchmark to support the development and evaluation of medical LLMs that are robust, clinically grounded, and ready for real-world deployment.
[NLP-9] Learning from Synthetic Data Improves Multi-hop Reasoning ICLR2026
【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)微调大语言模型(Large Language Models, LLMs)时对高质量可验证数据的依赖问题,此类数据通常来源于人工标注、前沿LLM生成或LLM-based验证器评分,均存在成本高、幻觉风险或准确性不足等局限。解决方案的关键在于使用规则生成的合成数据(rule-generated synthetic data)进行RL微调,实验证明即使这些数据仅包含虚构知识,也能显著提升LLMs在多跳推理任务中的表现,并揭示其核心机制是教会模型掌握“知识组合”这一基础且泛化性强的推理能力。
链接: https://arxiv.org/abs/2603.02091
作者: Anmol Kabra,Yilun Yin,Albert Gong,Kamilė Stankevičiūtė,Dongyoung Go,Johann Lee,Katie Z. Luo,Carla P. Gomes,Kilian Q. Weinberger
机构: Cornell University (康奈尔大学); University of Cambridge (剑桥大学); Stanford University (斯坦福大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to ICLR 2026
Abstract:Reinforcement Learning (RL) has been shown to significantly boost reasoning capabilities of large language models (LLMs) in math, coding, and multi-hop reasoning tasks. However, RL fine-tuning requires abundant high-quality verifiable data, often sourced from human annotations, generated from frontier LLMs, or scored by LLM-based verifiers. All three have considerable limitations: human-annotated datasets are small and expensive to curate, LLM-generated data is hallucination-prone and costly, and LLM-based verifiers are inaccurate and slow. In this work, we investigate a cheaper alternative: RL fine-tuning on rule-generated synthetic data for multi-hop reasoning tasks. We discover that LLMs fine-tuned on synthetic data perform significantly better on popular real-world question-answering benchmarks, despite the synthetic data containing only fictional knowledge. On stratifying performance by question difficulty, we find that synthetic data teaches LLMs to compose knowledge – a fundamental and generalizable reasoning skill. Our work highlights rule-generated synthetic reasoning data as a free and scalable resource to improve LLM reasoning capabilities.
[NLP-10] Modeling Grammatical Hypothesis Testing in Young Learners: A Sequence-Based Learning Analytics Study of Morphosyntactic Reasoning in an Interactive Game
【速读】: 该论文旨在解决传统语言能力评估方法难以捕捉儿童在句子构建过程中真实认知策略的问题,特别是针对法语形态句法一致性的习得过程。其解决方案的关键在于采用基于序列的学习分析方法,将每个滑块操作视为假设检验行为,利用细粒度的动作序列(action sequences)量化学生向正确语法解的收敛路径,并引入汉明距离(Hamming distance)来测量动作序列与有效语法解之间的接近程度,从而揭示学习者在不同难度任务中的动态推理模式和错误修正机制。
链接: https://arxiv.org/abs/2603.02084
作者: Thierry Geoffre,Trystan Geoffre
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:This study investigates grammatical reasoning in primary school learners through a sequence-based learning analytics approach, leveraging fine-grained action sequences from an interactive game targeting morphosyntactic agreement in French. Unlike traditional assessments that rely on final answers, we treat each slider movement as a hypothesis-testing action, capturing real-time cognitive strategies during sentence construction. Analyzing 597 gameplay sessions (9,783 actions) from 100 students aged 8-11 in authentic classroom settings, we introduce Hamming distance to quantify proximity to valid grammatical solutions and examine convergence patterns across exercises with varying levels of difficulty. Results reveal that determiners and verbs are key sites of difficulty, with action sequences deviating from the usual left-to-right order of treatment. This suggests learners often fix the verb first and adjust preceding elements. Exercises with fewer solutions exhibit slower and more erratic convergence, while changes in the closest valid solution indicate dynamic hypothesis revision. Our findings demonstrate how sequence-based analytics can uncover hidden dimensions of linguistic reasoning, offering a foundation for real-time scaffolding and teacher-facing tools in linguistically diverse classrooms.
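文中的核心度量,即学习者当前状态到最近合法语法解的汉明距离,计算方式如下(滑块状态以词形元组编码仅为示意,并非论文的原始数据格式):

```python
def min_hamming_distance(state, valid_solutions):
    """学习者当前滑块配置到最近一个合法语法解的距离,
    即逐位置比较后不一致的位置数,再对所有合法解取最小值。"""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(hamming(state, sol) for sol in valid_solutions)

# 两个合法的一致性模式(单数/复数);学习者距第一个解仅差一个滑块。
valid = [("le", "chat", "noir", "dort"),
         ("les", "chats", "noirs", "dorment")]
state = ("le", "chat", "noirs", "dort")
print(min_hamming_distance(state, valid))
```

对每一步动作计算该距离,即可得到文中用于分析收敛模式的逐动作轨迹。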
[NLP-11] What Exactly Do Children Receive in Language Acquisition? A Case Study on CHILDES with Automated Detection of Filler-Gap Dependencies
【速读】: 该论文旨在解决儿童习得填充空位关系(filler-gap dependencies)是否依赖于先天语法知识,还是仅依靠儿童言语中分布性证据的问题。由于相关输入难以在大规模且细粒度层面进行量化,这一问题长期难以明确。论文的关键解决方案是构建一个自动化系统,能够从口语语料库中识别三种核心填充空位结构(主句wh疑问句、嵌套wh疑问句和定语从句),并进一步标注提取位置(主语、宾语或状语)。该方法结合短语结构分析(constituency parsing)与依存句法分析(dependency parsing),利用二者在构式分类和提取位置识别上的互补优势,从而实现高精度的细粒度标注。通过在57个英语CHILDES语料库上应用该系统,研究者得以刻画儿童早期语言输入中的填充空位特征及其发展轨迹,为后续语言习得与计算建模研究提供了可量化的基础数据。
链接: https://arxiv.org/abs/2603.02082
作者: Zhenghao Herbert Zhou,William Dai,Maya Viswanathan,Simon Charlow,R. Thomas McCoy,Robert Frank
机构: Yale University (耶鲁大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Children’s acquisition of filler-gap dependencies has been argued by some to depend on innate grammatical knowledge, while others suggest that the distributional evidence available in child-directed speech suffices. Unfortunately, the relevant input is difficult to quantify at scale with fine granularity, making this question difficult to resolve. We present a system that identifies three core filler-gap constructions in spoken English corpora – matrix wh-questions, embedded wh-questions, and relative clauses – and further identifies the extraction site (i.e., subject vs. object vs. adjunct). Our approach combines constituency and dependency parsing, leveraging their complementary strengths for construction classification and extraction site identification. We validate the system on human-annotated data and find that it scores well across most categories. Applying the system to 57 English CHILDES corpora, we are able to characterize children’s filler-gap input and their filler-gap production trajectories over the course of development, including construction-specific frequencies and extraction-site asymmetries. The resulting fine-grained labels enable future work in both acquisition and computational studies, which we demonstrate with a case study using filtered corpus training with language models.
[NLP-12] EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在小语种(如爱沙尼亚语)上性能显著低于英语的问题,即多语言预训练模型对资源较少语言的支持不均衡。解决方案的关键在于通过持续预训练(Continued Pretraining, CPT),在保持原模型英语和通用推理能力的前提下,提升特定小语种的能力:具体方法是使用Llama 3.1 8B作为基础模型,在CPT阶段引入以爱沙尼亚语为主的混合数据集,同时通过英语重放(English replay)、代码、数学和指令类数据来近似原始训练分布,从而避免灾难性遗忘;随后结合监督微调(Supervised Fine-Tuning)、偏好优化(Preference Optimization)与对话向量合并(Chat Vector Merging)实现强指令遵循行为。实验证明该方案在爱沙尼亚语基准测试中实现了语言能力、知识、推理、翻译质量和指令遵循等方面的系统性提升,且英文基准性能保持竞争力。
链接: https://arxiv.org/abs/2603.02041
作者: Aleksei Dorkin,Taido Purason,Emil Kalbaliyev,Hele-Andra Kuulmets,Marii Ojastu,Mark Fišel,Tanel Alumäe,Eleri Aedmaa,Krister Kruusmaa,Kairit Sirts
机构: University of Tartu (塔尔图大学); Tallinn University of Technology (塔林理工大学); Institute of the Estonian Language (爱沙尼亚语言研究所); Tallinn University (塔林大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are predominantly trained on English-centric data, resulting in uneven performance for smaller languages. We study whether continued pretraining (CPT) can substantially improve Estonian capabilities in a pretrained multilingual LLM while preserving its English and general reasoning performance. Using Llama 3.1 8B as the main base model, we perform CPT on a mixture that increases Estonian exposure while approximating the original training distribution through English replay and the inclusion of code, mathematics, and instruction-like data. We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior. Evaluation on a comprehensive suite of Estonian benchmarks shows consistent gains in linguistic competence, knowledge, reasoning, translation quality, and instruction-following compared to the original base model and its instruction-tuned variant, while maintaining competitive performance on English benchmarks. These findings indicate that CPT, with an appropriately balanced data mixture, together with post-training alignment, can substantially improve single-language capabilities in pretrained multilingual LLMs.
[NLP-13] Learning to Read Where to Look: Disease-Aware Vision-Language Pretraining for 3D CT
【速读】: 该论文旨在解决3D CT医学影像与文本报告之间细粒度对齐不足的问题,现有方法多依赖有限的公共数据集且仅提供粗粒度全局监督,导致模型在跨模态检索和疾病分类任务中表现受限。其关键解决方案在于:首先,在单中心收集的98k报告-体积对(50k患者)基础上结合公开数据集,采用SigLIP风格的对比预训练,并引入基于提示的疾病监督以增强视觉-文本嵌入空间的一致性;其次,创新性地挖掘出262k个“文本片段-切片”配对(来自放射科医生常引用的具体图像位置),提出“扫描内片段定位”(intra-scan snippet localization)任务,通过预测文本片段对应的轴向深度实现精准空间锚定,将平均绝对误差降至36.3 mm(特征分辨率为12 mm),显著优于基线(67.0 mm)。该局部定位目标在不损害检索与分类性能的前提下,构建了一个统一模型,同时支持文本到图像检索、疾病分类和扫描内图像定位。
链接: https://arxiv.org/abs/2603.02026
作者: Simon Ging(1 and 2),Philipp Arnold(3),Sebastian Walter(4),Hani Alnahas(1),Hannah Bast(4),Elmar Kotter(3),Jiancheng Yang(5 and 6),Behzad Bozorgtabar(2),Thomas Brox(1) ((1) Computer Vision Group, University of Freiburg, Germany, (2) Adaptive & Agentic AI (A3) Lab, Aarhus University, Denmark, (3) Department of Radiology, Medical Center – University of Freiburg, Germany, (4) Chair of Algorithms and Data Structures, University of Freiburg, Germany, (5) ELLIS Institute Finland, (6) School of Electrical Engineering, Aalto University, Finland)
机构: Deutsche Forschungsgemeinschaft (DFG, German Research Foundation); SFB 1597 – Small Data; Baden-Württemberg
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Recent 3D CT vision-language models align volumes with reports via contrastive pretraining, but typically rely on limited public data and provide only coarse global supervision. We train a 3D CT vision-language model on 98k report-volume pairs (50k patients) collected at a single hospital, combined with public datasets, using SigLIP-style contrastive pretraining together with prompt-based disease supervision in the shared vision-text embedding space. On CT-RATE, our model achieves state-of-the-art text-to-image retrieval (R@10 31.5 vs. 22.2) and competitive disease classification (AUC 83.8 vs. 83.8), with consistent results on Rad-ChestCT (AUC 77.0 vs. 77.3). We further observe that radiologists routinely reference specific images within their reports (e.g., “series X, image Y”), linking textual descriptions to precise axial locations. We automatically mine 262k such snippet-slice pairs and introduce the task of intra-scan snippet localization – predicting the axial depth referred to by a text snippet – reducing mean absolute error to 36.3 mm at 12 mm feature resolution, compared with 67.0 mm for the best baseline. Adding this localization objective leaves retrieval and classification broadly unchanged within confidence bounds, yielding a single unified model for retrieval, classification, and intra-scan grounding.
[NLP-14] MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning ICLR2026
【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在真实场景下多图像推理能力缺乏系统评估的问题。现有研究虽在特定领域展现出推理潜力,但尚未建立覆盖多样化现实情境、具备标准化评测机制的基准测试体系。解决方案的关键在于提出 MMR-Life,这是一个包含 2,646 道多项选择题、基于 19,108 张真实世界图像构建的综合性基准,涵盖七类核心推理类型(归纳、演绎、因果、类比、溯因、空间和时间推理),并强调跨图像信息整合与通用推理能力,而非依赖特定领域知识。该设计使模型必须在复杂多样的现实场景中进行多层次推理,从而更真实地反映其多模态推理水平,为后续模型优化提供科学依据。
链接: https://arxiv.org/abs/2603.02024
作者: Jiachun Li,Shaoping Huang,Zhuoran Jin,Chenlong Zhang,Pengfei Cao,Yubo Chen,Kang Liu,Jun Zhao
机构: University of Chinese Academy of Sciences (中国科学院大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR 2026, 78 pages, 60 figures
Abstract:Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has empowered them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite their promise, MLLMs’ reasoning abilities across different scenarios in real life remain largely unexplored and lack standardized benchmarks for evaluation. To address this gap, we introduce MMR-Life, a comprehensive benchmark designed to evaluate the diverse multimodal multi-image reasoning capabilities of MLLMs across real-life scenarios. MMR-Life consists of 2,646 multiple-choice questions based on 19,108 images primarily sourced from real-world contexts, comprehensively covering seven reasoning types: abductive, analogical, causal, deductive, inductive, spatial, and temporal. Unlike existing reasoning benchmarks, MMR-Life does not rely on domain-specific expertise but instead requires models to integrate information across multiple images and apply diverse reasoning abilities. The evaluation of 37 advanced models highlights the substantial challenge posed by MMR-Life. Even top models like GPT-5 achieve only 58% accuracy and display considerable variance in performance across reasoning types. Moreover, we analyze the reasoning paradigms of existing MLLMs, exploring how factors such as thinking length, reasoning method, and reasoning type affect their performance. In summary, MMR-Life establishes a comprehensive foundation for evaluating, analyzing, and improving the next generation of multimodal reasoning systems.
[NLP-15] PonderLM-3: Adaptive Token-Wise Pondering with Differentiable Masking
【速读】: 该论文旨在解决生成式 AI(Generative AI)在推理阶段如何高效分配额外计算资源的问题,即“在哪一时刻或哪个 token 上分配更多计算才能最大化生成质量”。现有方法通常采用固定步数的递归或均匀分配策略,导致部分 token 被冗余计算,效率低下。解决方案的关键在于提出 PonderLM-3:一个基于 token 级自适应 pondering 的预训练框架,通过引入可微分注意力掩码(differentiable attention mask)实现训练与推理的一致性,并结合硬剪枝规则(hard pruning rule)在推理时动态决定每个 token 是否需要额外计算。这一机制使额外计算成为按需分配的资源,从而在相同推理 FLOPs 下获得更低的预训练困惑度(perplexity),并在下游任务中以更少的实际计算量达到与固定步长模型相当的性能。
链接: https://arxiv.org/abs/2603.02023
作者: He Li,Feichen Song,Boyi Zeng,Shixiang Song,Zhiqin John Xu,Ziwei He,Zhouhan Lin
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Test-time scaling has shown that allocating more additional computation at inference can improve generation quality, motivating a natural follow-up question: where should this computation be spent? Building on this insight, we introduce PonderLM-3, a pretraining framework for token-wise adaptive pondering that learns to selectively allocate additional computation under purely self-supervised objectives, built on top of the PonderLM-2 backbone. This makes additional inference computation an allocatable per-token resource, so tokens receive more computation only when it is beneficial, rather than paying a uniform extra cost. To make this allocation learnable while maintaining train-inference consistency, PonderLM-3 injects a differentiable attention mask during pretraining and pairs it with a matching hard pruning rule at inference. PonderLM-3 defines a stronger Pareto frontier: compared with existing recursive or adaptive baselines, it achieves lower pretraining perplexity at equal inference FLOPs. On downstream benchmarks, PonderLM-3 attains comparable performance to fixed-step PonderLM-2 under the same maximum number of additional computation steps, while using fewer inference FLOPs in practice. Overall, PonderLM-3 provides an end-to-end differentiable and train-inference consistent framework for token-wise adaptive computation, enabling additional inference compute to be allocated where it is most useful rather than paid uniformly by every token.
[NLP-16] According to Me: Long-Term Personalized Referential Memory QA
【速读】: 该论文旨在解决个性化AI助手在处理长期用户记忆时面临的挑战,即如何有效建模和推理跨模态、多源的个人记忆(如图像、视频、邮件等),而现有基准测试主要局限于对话历史,无法反映真实生活经验中的复杂引用与推理需求。其解决方案的关键在于提出ATM-Bench——首个针对多模态、多源个性化参考记忆问答(Memory QA)的基准,并引入Schema-Guided Memory (SGM) 结构化表示方法,以统一来自不同来源的记忆项,从而提升模型对个人参考指代、多证据推理及冲突信息处理的能力。实验表明,SGM相比传统描述性记忆表示显著改善了性能,且当前主流记忆系统在ATM-Bench-Hard子集上准确率仍低于20%。
链接: https://arxiv.org/abs/2603.01990
作者: Jingbiao Mei,Jinghong Chen,Guangyu Yang,Xinyu Hou,Margaret Li,Bill Byrne
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint
Abstract:Personalized AI assistants must recall and reason over long-term user memory, which naturally spans multiple modalities and sources such as images, videos, and emails. However, existing Long-term Memory benchmarks focus primarily on dialogue history, failing to capture realistic personalized references grounded in lived experience. We introduce ATM-Bench, the first benchmark for multimodal, multi-source personalized referential Memory QA. ATM-Bench contains approximately four years of privacy-preserving personal memory data and human-annotated question-answer pairs with ground-truth memory evidence, including queries that require resolving personal references, multi-evidence reasoning across multiple sources, and handling conflicting evidence. We propose Schema-Guided Memory (SGM) to structurally represent memory items originated from different sources. In experiments, we implement 5 state-of-the-art memory systems along with a standard RAG baseline and evaluate variants with different memory ingestion, retrieval, and answer generation techniques. We find poor performance (under 20% accuracy) on the ATM-Bench-Hard set, and that SGM improves performance over Descriptive Memory commonly adopted in prior works. Code available at: this https URL

[NLP-17] CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLM s in Production
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在生产环境中社交聊天应用(如Instagram、WhatsApp和Messenger)中持续优化与部署的挑战,尤其是在真实用户流量下如何稳定提升用户参与度(engagement)并增强指令遵循能力(steerability)。解决方案的关键在于提出CharacterFlywheel这一迭代飞轮流程,其核心包括:基于内外部真实用户数据的持续数据收集与清洗、利用奖励建模(reward modeling)对参与度指标进行估计与插值、结合监督微调(Supervised Fine-Tuning, SFT)与强化学习(Reinforcement Learning, RL)进行模型优化,并通过离线与在线评估闭环验证每一步改进的有效性。该框架确保了在百万级用户规模下的稳定演进,同时有效防止过拟合并适应生产环境动态变化,最终实现参与度广度提升最高达8.8%、深度提升19.4%,以及指令遵循率从59.2%提升至84.8%。
链接: https://arxiv.org/abs/2603.01973
作者: Yixin Nie,Lin Guan,Zhongyao Ma,Anchit Gupta,Yipin Zhou,Xiao Li,Zhengping Zhou,Raymond Zeng,Gelin Zhou,Shigan Chu,Ajay Thampi,Wancen Mu,Nathan Shuster,Ketong Wang,Lin Chen,Jason Brewer,Derek Hao Hu,Alexander McCauley,Jason Weston,Sem Park,Na Zhang,Kevin Tang
机构: Meta
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:
Abstract:This report presents CharacterFlywheel, an iterative flywheel process for improving large language models (LLMs) in production social chat applications across Instagram, WhatsApp, and Messenger. Starting from LLaMA 3.1, we refined models across 15 generations using data from both internal and external real-user traffic. Through continuous deployments from July 2024 to April 2025, we conducted controlled 7-day A/B tests showing consistent engagement improvements: 7 of 8 newly deployed models demonstrated positive lift over the baseline, with the strongest performers achieving up to 8.8% improvement in engagement breadth and 19.4% in engagement depth. We also observed substantial gains in steerability, with instruction following increasing from 59.2% to 84.8% and instruction violations decreasing from 26.6% to 5.8%. We detail the CharacterFlywheel process which integrates data curation, reward modeling to estimate and interpolate the landscape of engagement metrics, supervised fine-tuning (SFT), reinforcement learning (RL), and both offline and online evaluation to ensure reliable progress at each optimization step. We also discuss our methods for overfitting prevention and navigating production dynamics at scale. These contributions advance the scientific rigor and understanding of LLMs in social applications serving millions of users.
[NLP-18] AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations ICLR2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长时交互中面临的记忆管理(memory management)训练与评估难题。现有方法依赖静态、离线的策略外(off-policy)数据作为上下文,导致评估可靠性不足且难以扩展。其解决方案的关键在于提出AMemGym——一个交互式环境,支持策略内(on-policy)评估与优化,通过结构化数据采样预定义用户画像、状态相关问题及状态演化轨迹,实现高质量、对齐评估目标的交互生成;同时利用LLM模拟用户进行角色扮演以暴露隐状态并保持结构一致性,结合基于结构化数据的综合指标指导助手性能评估与优化,从而有效识别现有记忆系统(如RAG、长上下文LLMs和代理记忆)的性能差距,并推动记忆策略的自我演进。
链接: https://arxiv.org/abs/2603.01966
作者: Cheng Jiayang,Dongyu Ru,Lin Qiu,Yiyang Li,Xuezhi Cao,Yangqiu Song,Xunliang Cai
机构: The Hong Kong University of Science and Technology (香港科技大学); Meituan (美团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ICLR 2026
Abstract:Long-horizon interactions between users and LLM-based assistants necessitate effective memory management, yet current approaches face challenges in training and evaluation of memory. Existing memory benchmarks rely on static, off-policy data as context, limiting evaluation reliability and scalability. To address these gaps, we introduce AMemGym, an interactive environment enabling on-policy evaluation and optimization for memory-driven personalization. AMemGym employs structured data sampling to predefine user profiles, state-dependent questions, and state evolution trajectories, enabling cost-effective generation of high-quality, evaluation-aligned interactions. LLM-simulated users expose latent states through role-play while maintaining structured state consistency. Comprehensive metrics based on structured data guide both assessment and optimization of assistants. Extensive experiments reveal performance gaps in existing memory systems (e.g., RAG, long-context LLMs, and agentic memory) and corresponding reasons. AMemGym not only enables effective selection among competing approaches but also can potentially drive the self-evolution of memory management strategies. By bridging structured state evolution with free-form interactions, our framework provides a scalable, diagnostically rich environment for advancing memory capabilities in conversational agents.
[NLP-19] Semantic Similarity is a Spurious Measure of Comic Understanding: Lessons Learned from Hallucinations in a Benchmarking Experiment
【速读】: 该论文旨在解决盲人或视障用户无法访问漫画/动漫内容的问题,从而为这一群体提供一种新的叙事媒介。当前缺乏能够支持此类用户理解漫画的系统,而生成式视觉语言模型(VLMs)虽在图像描述和漫画理解方面展现出潜力,但现有研究多局限于单个画面(panel-level)分析,未能实现对整页漫画(page-level)的完整理解与解释。论文的关键解决方案是构建一个初步的VLM漫画解读性能基准,并识别和分类在该过程中出现的幻觉现象,将其归纳为广义的对象幻觉分类体系(generalized object-hallucination taxonomies),进而为未来研究提供方向,强调幻觉缓解策略和更高质量的漫画数据集构建。
链接: https://arxiv.org/abs/2603.01950
作者: Christopher Driggers-Ellis,Nachiketh Tibrewal,Rohit Bogulla,Harsh Khanna,Sangpil Youm,Christan Grant,Bonnie Dorr
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 2 figures, 3 tables. Includes link to code
Abstract:A system that enables blind or visually impaired users to access comics/manga would introduce a new medium of storytelling to this community. However, no such system currently exists. Generative vision-language models (VLMs) have shown promise in describing images and understanding comics, but most research on comic understanding is limited to panel-level analysis. To fully support blind and visually impaired users, greater attention must be paid to page-level understanding and interpretation. In this work, we present a preliminary benchmark of VLM performance on comic interpretation tasks. We identify and categorize hallucinations that emerge during this process, organizing them into generalized object-hallucination taxonomies. We conclude with guidance on future research, emphasizing hallucination mitigation and improved data curation for comic interpretation.
[NLP-20] When Numbers Tell Half the Story: Human-Metric Alignment in Topic Model Evaluation
【速读】: 该论文旨在解决主题模型(Topic Models)在专业领域中评估质量困难的问题,尤其是现有自动化指标(如主题一致性与多样性)难以准确反映人类对主题结构的理解。其关键解决方案是提出一种新型的人类评估任务——主题词混合测试(Topic Word Mixing, TWM),通过考察标注者能否区分来自单一主题或混合主题的词集,来衡量主题间的区分度(inter-topic distinctness)。TWM 补充了传统词侵入任务(word intrusion)对主题内一致性(intra-topic coherence)的关注,并为多样性指标提供了基于人类感知的验证基准,从而在专业领域文本中建立了自动化与人工评估之间的桥梁。
链接: https://arxiv.org/abs/2603.01945
作者: Thibault Prouteau,Francis Lareau,Nicolas Dugué,Jean-Charles Lamirel,Christophe Malaterre
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Topic models uncover latent thematic structures in text corpora, yet evaluating their quality remains challenging, particularly in specialized domains. Existing methods often rely on automated metrics like topic coherence and diversity, which may not fully align with human judgment. Human evaluation tasks, such as word intrusion, provide valuable insights but are costly and primarily validated on general-domain corpora. This paper introduces Topic Word Mixing (TWM), a novel human evaluation task assessing inter-topic distinctness by testing whether annotators can distinguish between word sets from single or mixed topics. TWM complements word intrusion’s focus on intra-topic coherence and provides a human-grounded counterpart to diversity metrics. We evaluate six topic models - both statistical and embedding-based (LDA, NMF, Top2Vec, BERTopic, CFMF, CFMF-emb) - comparing automated metrics with human evaluation methods based on nearly 4,000 annotations from a domain-specific corpus of philosophy of science publications. Our findings reveal that word intrusion and coherence metrics do not always align, particularly in specialized domains, and that TWM captures human-perceived distinctness while appearing to align with diversity metrics. We release the annotated dataset and task generation code. This work highlights the need for evaluation frameworks bridging automated and human assessments, particularly for domain-specific corpora.
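作为示意,下面用一段简短的 Python 片段生成 TWM 测试项:单主题项从一个主题的代表词中抽取 k 个词,混合项从两个不同主题各抽取约 k/2 个词,标注者需判断词集是否来自单一主题(主题词表、抽样方式均为说明性假设,非论文的任务生成代码):

```python
import random

def make_twm_item(topics, mixed, k=6, rng=None):
    # 单主题项:从一个主题的代表词中抽 k 个词
    # 混合项:从两个不同主题各抽约 k/2 个词,考察主题间区分度
    rng = rng or random.Random(0)
    names = sorted(topics)
    if mixed:
        a, b = rng.sample(names, 2)
        words = rng.sample(topics[a], k // 2) + rng.sample(topics[b], k - k // 2)
        return words, "mixed"
    a = rng.choice(names)
    return rng.sample(topics[a], k), "single"

topics = {
    "physics": ["quantum", "particle", "energy", "field", "relativity", "mass", "wave"],
    "biology": ["cell", "gene", "protein", "species", "evolution", "enzyme", "dna"],
}
words, label = make_twm_item(topics, mixed=True, k=6)
```

若标注者在此类测试中无法稳定区分 "single" 与 "mixed" 词集,则说明主题间区分度不足,这正是 TWM 试图度量的性质。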
[NLP-21] From Variance to Invariance: Qualitative Content Analysis for Narrative Graph Annotation LREC2026
【速读】: 该论文旨在解决新闻话语中经济事件(如通货膨胀)叙事结构的标注与评估问题,这是自然语言处理(Natural Language Processing, NLP)领域在理解公众认知机制时面临的关键挑战。其解决方案的核心在于提出一种基于叙事图(narrative graph)的标注框架,将叙事表示为有向无环图(Directed Acyclic Graphs, DAGs),其中节点代表事件、边编码因果关系,并结合定性内容分析(Qualitative Content Analysis, QCA)原则以降低标注误差。通过设计六种叙事表示方式与三种距离度量方法的6×3因子实验,研究发现:宽松的距离度量(如重叠型度量)会高估标注一致性,而局部约束表示(如一跳邻居)可显著减少标注变异性;此外,作者开源了标注数据集和基于图结构的Krippendorff’s α实现,为NLP中受人类标签变异(Human Label Variation, HLV)影响的图结构叙事标注提供了可复用的方法论指导。
链接: https://arxiv.org/abs/2603.01930
作者: Junbo Huang,Max Weinig,Ulrich Fritsche,Ricardo Usbeck
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: LREC 2026 Accepted Paper
Abstract:Narratives in news discourse play a critical role in shaping public understanding of economic events, such as inflation. Annotating and evaluating these narratives in a structured manner remains a key challenge for Natural Language Processing (NLP). In this work, we introduce a narrative graph annotation framework that integrates principles from qualitative content analysis (QCA) to prioritize annotation quality by reducing annotation errors. We present a dataset of inflation narratives annotated as directed acyclic graphs (DAGs), where nodes represent events and edges encode causal relations. To evaluate annotation quality, we employed a 6\times3 factorial experimental design to examine the effects of narrative representation (six levels) and distance metric type (three levels) on inter-annotator agreement (Krippendorff’s \alpha ), capturing the presence of human label variation (HLV) in narrative interpretations. Our analysis shows that (1) lenient metrics (overlap-based distance) overestimate reliability, and (2) locally-constrained representations (e.g., one-hop neighbors) reduce annotation variability. Our annotation and implementation of graph-based Krippendorff’s \alpha are open-sourced. The annotation framework and evaluation results provide practical guidance for NLP research on graph-based narrative annotation under HLV.
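为直观说明"一跳邻居"这类局部约束表示如何用于比较两份叙事图标注,下面给出一个示意性片段:将 DAG 压缩为每个事件节点的一跳后继集合,再取逐节点 Jaccard 距离的平均值(该距离仅为说明性度量,并非论文实际采用的三种距离之一):

```python
def one_hop(edges):
    # 将叙事 DAG 表示为每个事件节点的一跳后继集合(局部约束表示)
    nbrs = {}
    for u, v in edges:            # 有向边 u -> v 表示因果关系
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set())
    return nbrs

def neighbor_distance(edges_a, edges_b):
    # 两份标注在一跳表示下的平均 Jaccard 距离(示意性度量)
    na, nb = one_hop(edges_a), one_hop(edges_b)
    dists = []
    for x in set(na) | set(nb):
        a, b = na.get(x, set()), nb.get(x, set())
        union = a | b
        dists.append(0.0 if not union else 1 - len(a & b) / len(union))
    return sum(dists) / len(dists)

# 标注者 A 认为"加息"与"供给冲击"均导致"通胀",标注者 B 只标注了前者
d = neighbor_distance([("rates", "inflation"), ("supply", "inflation")],
                      [("rates", "inflation")])
```

局部表示只比较各节点的直接邻域,全图结构差异被分解为逐节点的小差异,这有助于理解论文中"局部约束表示降低标注变异性"的结论。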
[NLP-22] AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth
【速读】: 该论文旨在解决传统循环式语言模型在推理阶段计算资源分配不合理的问题,即固定迭代次数导致对简单token浪费计算资源且缺乏逐token自适应性。解决方案的关键在于提出AdaPonderLM,一种基于自监督学习的递归语言模型,其核心创新是引入迭代特定的MLP门控机制(iteration-specific MLP gates)与单调终止掩码(monotonic halting mask),使每个token可自主决定何时停止迭代;同时设计**KV缓存复用机制(KV reuse mechanism)**以确保训练与推理一致性并实现实际加速。实验表明,该方法在保持语言建模困惑度不变的前提下,将推理计算量减少约10%,且学习到的终止策略能更高效地将计算分配给高负对数似然(NLL)的困难token,展现出真正的自适应计算时间(Adaptive Computation Time, ACT)行为。
链接: https://arxiv.org/abs/2603.01914
作者: Shixiang Song,He Li,Zitong Wang,Boyi Zeng,Feichen Song,Yixuan Wang,Zhiqin John Xu,Ziwei He,Zhouhan Lin
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Innovation Institute (上海创新研究院); Shanghai AI Laboratory (上海人工智能实验室); Sun Yat-sen University (中山大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Test-time scaling via recurrent/iterative Transformers enables large language models to spend more computation at inference, but most pretrained recurrent LMs run a fixed number of iterations, wasting compute on easy tokens and lacking token-wise adaptivity. Following the core idea of Adaptive Computation Time (ACT) and Early Exit (EE), we propose AdaPonderLM, a self-supervised recurrent language model that learns token-wise early exiting during pretraining without manually tuned per-token/per-layer pruning ratios. AdaPonderLM uses iteration-specific MLP gates with a monotonic halting mask to decide when each token stops recurring, and introduces a KV reuse mechanism that reuses cached key/value states for halted tokens, ensuring train–test consistency and practical acceleration. Across Pythia backbones from 70M to 410M (pretraining) and up to 2.8B (continued pretraining), AdaPonderLM reduces inference compute by about 10% while maintaining comparable language modeling perplexity and competitive downstream accuracy. Our analysis shows the learned gates allocate more computation to high-NLL (hard) tokens, exhibiting adaptive computation time behavior in a fully self-supervised setting. Meanwhile, under iso-FLOPs, the learned halting policy consistently outperforms fixed pruning, showing AdaPonderLM allocates compute to the right tokens rather than just reducing average depth.
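单调终止掩码的行为可以用如下玩具片段示意:每次迭代后,门控概率超过阈值的 token 被标记为终止,且终止状态不可逆;已终止 token 在后续迭代中只复用缓存的 KV、不再递归(门控数值与阈值均为说明性假设,非论文实现):

```python
import numpy as np

def recur_with_halting(gate_probs, max_steps):
    # gate_probs[t, i]:第 i 个 token 在第 t 次迭代后选择终止的门控概率
    # 返回每个 token 实际使用的迭代深度
    n = gate_probs.shape[1]
    halted = np.zeros(n, dtype=bool)
    depth = np.full(n, max_steps)
    for t in range(max_steps):
        stop_now = (gate_probs[t] > 0.5) & ~halted
        depth[stop_now] = t + 1
        halted |= stop_now            # 单调性:一旦终止即保持终止
        if halted.all():              # 全部终止则提前结束迭代
            break
    return depth

# 4 个 token、最多 3 次迭代:token 0 在第 1 步即终止,
# 此后即使门控再次激活(第 3 行的 0.9)也不会重新计算
gate = np.array([[0.9, 0.1, 0.2, 0.1],
                 [0.2, 0.8, 0.3, 0.1],
                 [0.9, 0.9, 0.9, 0.2]])
depth = recur_with_halting(gate, max_steps=3)
```

简单 token 早停、困难 token 用满迭代深度,即对应论文中"高 NLL token 获得更多计算"的自适应行为。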
[NLP-23] Demonstrating ViviDoc: Generating Interactive Documents through Human-Agent Collaboration
【速读】: 该论文旨在解决交互式教育文档(interactive educational documents)自动化生成过程中存在的两大难题:一是传统方法依赖领域专家与网页开发技能,成本高昂;二是直接使用大语言模型(Large Language Models, LLMs)生成内容时,输出难以控制且不可验证。其解决方案的关键在于提出ViviDoc系统,该系统采用多智能体流水线(Planner、Executor、Evaluator)和一种人类可读的中间表示形式——文档规范(Document Specification, DocSpec),将每个交互式可视化组件分解为状态(State)、渲染(Render)、转换(Transition)和约束(Constraint)四个核心要素。这一设计使教育者能够在代码生成前审查并修改生成方案,从而有效弥合教学意图与可执行代码之间的鸿沟。
链接: https://arxiv.org/abs/2603.01912
作者: Yinghao Tang,Yupeng Xie,Yingchaojie Feng,Tingfeng Lan,Wei Chen
机构: Zhejiang University (浙江大学); HKUST(GZ) (香港科技大学(广州)); National University of Singapore (新加坡国立大学); University of Virginia (弗吉尼亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Interactive articles help readers engage with complex ideas through exploration, yet creating them remains costly, requiring both domain expertise and web development skills. Recent LLM-based agents can automate content creation, but naively applying them yields uncontrollable and unverifiable outputs. We present ViviDoc, a human-agent collaborative system that generates interactive educational documents from a single topic input. ViviDoc introduces a multi-agent pipeline (Planner, Executor, Evaluator) and the Document Specification (DocSpec), a human-readable intermediate representation that decomposes each interactive visualization into State, Render, Transition, and Constraint components. The DocSpec enables educators to review and refine generation plans before code is produced, bridging the gap between pedagogical intent and executable output. Expert evaluation and a user study show that ViviDoc substantially outperforms naive agentic generation and provides an intuitive editing experience. Our project homepage is available at this https URL.
[NLP-24] FLANS at SemEval-2026 Task 7: RAG with Open-Sourced Smaller LLM s for Everyday Knowledge Across Diverse Languages and Cultures
【速读】: 该论文旨在解决跨语言与跨文化场景下常识知识问答的挑战,特别是在多语言环境中如何有效利用本地化知识资源以提升模型性能的问题。其解决方案的关键在于构建一个文化敏感的知识库(CulKBs),通过提取特定关键词对应的维基百科文本和国家特有摘要来增强对不同文化背景的理解;同时结合检索增强生成(RAG)技术与开源小型大语言模型(OS-sLLMs),并引入实时在线搜索(DuckDuckGo)作为补充信息源,从而在保障隐私和可持续性的前提下实现多语言(英语、西班牙语、中文)常识问答任务的高效准确回答。
链接: https://arxiv.org/abs/2603.01910
作者: Liliia Bogdanova,Shiran Sun,Lifeng Han,Natalia Amat Lefort,Flor Miriam Plaza-del-Arco
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:This system paper describes our participation in SemEval-2026 Task 7, "Everyday Knowledge Across Diverse Languages and Cultures". We participated in two tracks, i.e., Track 1: Short Answer Questions (SAQ), and Track 2: Multiple-Choice Questions (MCQ). The methods we used are retrieval augmented generation (RAGs) with open-sourced smaller LLMs (OS-sLLMs). To better adapt to this shared task, we created our own culturally aware knowledge base (CulKBs) by extracting Wikipedia content using keyword lists we prepared. We extracted both culturally-aware wiki-text and country-specific wiki-summary. In addition to the local CulKBs, we also have one system integrating live online search output via DuckDuckGo. Towards better privacy and sustainability, we aimed to deploy smaller LLMs (sLLMs) that are open-sourced on the Ollama platform. We share the prompts we developed using refinement techniques and report the learning curve of such prompts. The tested languages are English, Spanish, and Chinese for both tracks. Our resources and codes are shared via this https URL
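RAG 的基本流程可以用如下极简片段示意:按查询与文档的词重叠打分取 top-k,再拼入提示词作为上下文(打分方式、示例文档与提示词模板均为说明性假设;实际系统从 CulKBs 的维基文本/国家摘要中检索,并可辅以 DuckDuckGo 在线搜索):

```python
def retrieve(query, docs, top_k=2):
    # 极简检索:以查询词与文档词的重叠数作为相关性得分
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:top_k]

docs = [
    "paella is a traditional rice dish from valencia spain",
    "sushi is a japanese dish of vinegared rice",
    "the eiffel tower is located in paris",
]
question = "what rice dish is traditional in spain"
ctx = retrieve(question, docs, top_k=1)
# 将检索结果拼入提示词,交给小型开源 LLM 生成答案
prompt = "Context: " + ctx[0] + "\nQuestion: " + question
```

生产系统中词重叠通常会被向量检索替代,但"检索—拼接上下文—生成"的骨架不变。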
[NLP-25] Efficient RLVR Training via Weighted Mutual Information Data Selection
【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)训练中数据选择效率低下的问题,特别是现有在线数据选择策略依赖难度启发式方法所导致的局限性——这类方法通常偏好中等成功率的数据点,隐含假设难度即信息量,却忽视了由证据不足引起的认知不确定性(epistemic uncertainty)。其解决方案的关键在于提出 InSight,一种基于加权互信息目标的信息引导数据采样方法;该方法通过贝叶斯建模数据结果的潜在成功概率,揭示了预期不确定性减少可分解为难度相关与证据相关的互补成分,从而突破仅以难度为导向的选择范式。InSight 利用数据点的成功率均值信念构建稳定采集评分,避免对噪声采样结果的依赖,并自然扩展至常见于强化学习 with verifiable rewards (RLVR) 的多轮次(multi-rollout)场景,显著提升训练效率和性能表现。
链接: https://arxiv.org/abs/2603.01907
作者: Xinyu Zhou,Boyu Zhu,Haotian Zhang,Huiming Wang,Zhijiang Guo
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 15 Pages
Abstract:Reinforcement learning (RL) plays a central role in improving the reasoning and alignment of large language models, yet its efficiency critically depends on how training data are selected. Existing online selection strategies predominantly rely on difficulty-based heuristics, favouring datapoints with intermediate success rates, implicitly equating difficulty with informativeness and neglecting epistemic uncertainty arising from limited evidence. We introduce InSight, an INformation-guided data SamplInG metHod for RL Training, grounded in a weighted mutual information objective. By modeling data outcomes with Bayesian latent success rates, we show that expected uncertainty reduction decomposes into complementary difficulty- and evidence-dependent components, revealing a fundamental limitation of difficulty-only selection. Leveraging this observation, InSight constructs a stable acquisition score based on the mean belief of datapoints’ success rather than noisy sampled outcomes, and naturally extends to multi-rollout settings common in reinforcement learning with verifiable rewards (RLVR). Extensive experiments demonstrate that InSight consistently achieves state-of-the-art performance and improves training efficiency, including a +1.41 average gain on Planning Mathematics benchmarks, +1.01 improvement on general reasoning, and up to ~2.2x acceleration, with negligible additional computational overhead.
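"难度项 + 证据项"分解的直觉可以用一个基于 Beta 后验的示意性评分来说明:难度项 p(1-p) 在中等成功率处最大,证据项随观测次数增加而衰减,刻画认知不确定性(评分的具体形式为本示例的假设,并非论文的加权互信息公式):

```python
def acquisition_score(successes, failures, a0=1.0, b0=1.0):
    # 以 Beta(a0+s, b0+f) 的后验均值作为成功率信念(而非噪声采样结果)
    a, b = a0 + successes, b0 + failures
    p = a / (a + b)
    difficulty = p * (1.0 - p)        # 难度项:中等成功率处最大
    evidence = 1.0 / (a + b)          # 证据项:观测(rollout)越多越小
    return difficulty * evidence

# 同为 50% 成功率,rollout 次数少的数据点证据不足,评分更高
fresh = acquisition_score(1, 1)       # 2 次 rollout
seen = acquisition_score(10, 10)      # 20 次 rollout
```

仅按难度选择会给 fresh 与 seen 相同的分数,而引入证据项后,证据不足的数据点被优先采样,这正是论文指出的"难度不等于信息量"。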
[NLP-26] KDFlow: A User-Friendly and Efficient Knowledge Distillation Framework for Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)知识蒸馏(Knowledge Distillation, KD)过程中因师生模型共用同质化训练后端(如FSDP和DeepSpeed)而导致的训练效率低下问题。其核心解决方案是提出一种解耦架构的新型蒸馏框架KDFlow,通过引入SGLang作为教师模型推理引擎,并结合零拷贝数据传输机制仅传递教师隐藏状态而非完整logits,从而显著降低通信开销并提升整体效率。此外,KDFlow实现了训练与推理资源的最优匹配,支持离策略(off-policy)与在线策略(on-policy)蒸馏,并提供可扩展的API以支持跨分词器蒸馏(cross-tokenizer KD),实验证明其相较现有框架可实现1.44×至6.36×的速度提升。
链接: https://arxiv.org/abs/2603.01875
作者: Songming Zhang,Xue Zhang,Tong Zhang,Bojie Hu,Yufeng Chen,Jinan Xu
机构: Beijing Jiaotong University (北京交通大学); Tencent Inc (腾讯公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 4 figures, 3 tables, code is available at: this https URL
Abstract:Knowledge distillation (KD) is an essential technique to compress large language models (LLMs) into smaller ones. However, despite the distinct roles of the student model and the teacher model in KD, most existing frameworks still use a homogeneous training backend (e.g., FSDP and DeepSpeed) for both models, leading to suboptimal training efficiency. In this paper, we present a novel framework for LLM distillation, termed \textbfKDFlow, which features a decoupled architecture and employs SGLang for teacher inference. By bridging the training efficiency of FSDP2 and the inference efficiency of SGLang, KDFlow achieves full utilization of both advantages in a unified system. Moreover, instead of transferring full logits across different processes, our framework only transmits the teacher’s hidden states using zero-copy data transfer and recomputes the logits on the student side, effectively balancing the communication cost and KD performance. Furthermore, our framework supports both off-policy and on-policy distillation and incorporates KD algorithms for cross-tokenizer KD through highly extensible and user-friendly APIs. Experiments show that KDFlow can achieve \textbf1.44 \times to 6.36 \times speedup compared to current KD frameworks, enabling researchers to rapidly prototype and scale LLM distillation with minimal engineering overhead. Code is available at: this https URL
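"只传隐藏状态、学生侧重算 logits"带来的通信收益可以用矩阵维度直接算出,下面是一个示意性片段(维度数值为假设;实际系统还涉及零拷贝传输与 SGLang 推理引擎,此处不作模拟):

```python
import numpy as np

def kd_payloads(hidden, lm_head):
    # 跨进程只需传输教师隐藏状态 (seq_len, d_model),
    # 学生侧用教师 lm_head 权重重算完整 logits (seq_len, vocab)
    logits = hidden @ lm_head
    # 若直接传 logits,体积是传 hidden 的 vocab/d_model 倍
    saved_ratio = logits.size / hidden.size
    return logits, saved_ratio

rng = np.random.default_rng(0)
d_model, vocab, seq = 64, 32000, 8          # 假设的玩具维度
h = rng.standard_normal((seq, d_model))      # 教师隐藏状态
W = rng.standard_normal((d_model, vocab))    # 教师 lm_head 权重
logits, ratio = kd_payloads(h, W)            # ratio = vocab / d_model
```

真实 LLM 中 vocab(数万到十几万)远大于 d_model(数千),因此这一重算策略能把跨进程传输量缩小两个数量级左右,代价仅是学生侧一次矩阵乘法。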
[NLP-27] Sovereign AI-based Public Services are Viable and Affordable LREC2026
【速读】: 该论文旨在解决当前AI基础设施和专业知识日益集中化所引发的长期结构性问题,特别是全球少数科技巨头主导的通用型AI服务在公共部门应用中可能削弱国家数字主权与文化自主性的问题。其解决方案的关键在于证明:通过本地化部署、基于适度计算资源的主权型AI系统,可以实现技术可行性和经济可持续性,从而在保障公民AI服务可靠性的同时,维护数字主权(Digital Sovereignty)与文化主权(Cultural Sovereignty)。实证研究表明,此类替代方案无需依赖跨国商业平台即可有效运行,为各国政府和公共机构提供了可复制的技术路径与实践经验。
链接: https://arxiv.org/abs/2603.01869
作者: António Branco,Luís Gomes,Rodrigo Santos,Eduardo Santos,João Silva,Nuno Marques,Madalena Rodrigues
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted at LREC 2026
Abstract:The rapid expansion of AI-based remote services has intensified debates about the long-term implications of growing structural concentration in infrastructure and expertise. As AI capabilities become increasingly intertwined with geopolitical interests, the availability and reliability of foundational AI services can no longer be taken for granted. This issue is particularly pressing for AI-enabled public services for citizens, as governments and public agencies are progressively adopting 24/7 AI-driven support systems typically operated through commercial offerings from a small oligopoly of global technology providers. This paper challenges the prevailing assumption that general-purpose architectures, offered by these providers, are the optimal choice for all application contexts. Through practical experimentation, we demonstrate that viable and cost-effective alternatives exist. Alternatives that align with principles of digital and cultural sovereignty. Our findings provide an empirical illustration that sovereign AI-based public services are both technically feasible and economically sustainable, capable of operating effectively on premises with modest computational and financial resources while maintaining cultural and digital autonomy. The technical insights and deployment lessons reported here are intended to inform the adoption of similar sovereign AI public services by national agencies and governments worldwide.
[NLP-28] CyclicJudge: Mitigating Judge Bias Efficiently in LLM -based Evaluation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在开放任务评估中依赖“LLM-as-judge”方法时所存在的系统性偏差问题,这种偏差会干扰模型性能的可靠排序,且无法通过增加测试场景或生成样本数量来消除。解决方案的关键在于提出一种方差分解框架,将基准测试得分的方差分解为场景(scenario)、生成(generation)、判官(judge)和残差四个组成部分,并基于此理论推导出最优判官分配策略——CyclicJudge,即采用轮转式(round-robin)判官分配机制,在每轮循环中仅需每位判官参与一次,即可完全消除判官偏差,同时保持与单判官评估相当的成本效率。实证结果在MT-Bench数据集上验证了该方法的理论预测有效性。
链接: https://arxiv.org/abs/2603.01865
作者: Ziyi Zhu,Olivier Tieleman,Alexey Bukhtiyarov,Jinghong Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:LLM-as-judge evaluation has become standard practice for open-ended model assessment; however, judges exhibit systematic biases that cannot be eliminated by increasing the number of scenarios or generations. These biases are often similar in magnitude to the model differences that benchmarks are designed to detect, resulting in unreliable rankings when single-judge evaluations are used. This work introduces a variance decomposition that partitions benchmark score variance into scenario, generation, judge, and residual components. Based on this analysis, CyclicJudge, a round-robin assignment of judges, is demonstrated to be the optimal allocation strategy. It eliminates bias precisely while requiring each judge only once per cycle, maintaining the cost of single-judge evaluation. Empirical validation on MT-Bench supports all theoretical predictions.
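轮转式判官分配为何能精确消除判官偏差,可用如下玩具模拟说明:场景 s 分配给判官 s mod J,完整周期内每位判官出现次数相同,加性判官偏差对均值的贡献恒为 mean(biases),与被评模型无关,因此模型间的估计差不受偏差影响(偏差数值为假设,且为突出机制省略了生成噪声):

```python
def round_robin_mean(true_quality, judge_biases, n_scenarios):
    # 场景 s 轮转分配给判官 s mod J;观测得分 = 真实质量 + 判官偏差
    total = 0.0
    for s in range(n_scenarios):
        j = s % len(judge_biases)
        total += true_quality + judge_biases[j]
    return total / n_scenarios

biases = [0.8, -0.5, 0.3]   # 三位判官的系统性加性偏差(假设值)
# 6 个场景 = 2 个完整周期:两模型的估计差恰等于真实差 7.0 - 5.0 = 2.0
gap = round_robin_mean(7.0, biases, 6) - round_robin_mean(5.0, biases, 6)
```

每个模型的绝对得分仍带有共同偏移 mean(biases) = 0.2,但该偏移在所有模型间相同,排名因而无偏;而每轮只需每位判官评一次,成本与单判官评估相同。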
[NLP-29] Let the Agent Search: Autonomous Exploration Beats Rigid Workflows in Temporal Question Answering
【速读】: 该论文致力于解决时间知识图谱问答(Temporal Knowledge Graph Question Answering, TKGQA)中多跳推理(multi-hop reasoning)在时间约束下的挑战,现有基于大语言模型(Large Language Models, LLMs)的方法通常依赖于固定的手动设计检索流程或昂贵的监督微调(supervised fine-tuning),难以灵活适应复杂的时间动态场景。解决方案的关键在于提出一种无需训练的自主代理框架 AT2QA,其核心思想是赋予预训练大语言模型以自主决策能力(agentic autonomy),使其能够通过通用搜索工具与时间知识图谱进行迭代交互,动态执行检索与推理操作,从而在零样本(zero-shot)条件下显著提升问答准确率,尤其在多目标查询上表现突出,验证了自主性在TKGQA任务中的决定性优势。
链接: https://arxiv.org/abs/2603.01853
作者: Xufei Lv,Jiahui Yang,Yifu Gao,Linbo Qiao,Houde Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Temporal Knowledge Graph Question Answering (TKGQA) demands multi-hop reasoning under temporal constraints. Prior approaches based on large language models (LLMs) typically rely on rigid, hand-crafted retrieval workflows or costly supervised fine-tuning. We show that simply granting an off-the-shelf LLM autonomy, that is, letting it decide what to do next, already yields substantial gains even in a strict zero-shot setting. Building on this insight, we propose AT2QA, an autonomous, training-free agent for temporal question answering that iteratively interacts with the temporal knowledge graph via a general search tool for dynamic retrieval. Experiments on MultiTQ demonstrate large improvements: AT2QA achieves 88.7% Hits@1 (+10.7% over prior SOTA), including a +20.1% gain on challenging multi-target queries, showing that agentic autonomy can decisively outperform fine-tuning for temporal question answering. Code and the full set of sampled trajectories are available on this https URL
[NLP-30] OpenAutoNLU: Open Source AutoML Library for NLU
【速读】: 该论文旨在解决自然语言理解(Natural Language Understanding, NLU)任务中自动化程度不足、配置复杂且缺乏数据质量与分布外检测能力的问题。其核心解决方案是提出OpenAutoNLU,一个开源的自动化机器学习库,通过引入数据感知的训练策略选择机制(data-aware training regime selection),无需用户手动配置即可自动优化模型训练流程;同时集成数据质量诊断、可配置的分布外(out-of-distribution, OOD)检测以及大语言模型(Large Language Model, LLM)功能,以最小化低代码API实现端到端的NLU建模。
链接: https://arxiv.org/abs/2603.01824
作者: Grigory Arshinov,Aleksandr Boriskin,Sergey Senichev,Ayaz Zaripov,Daria Galimzianova,Daniil Karpov,Leonid Sanochkin
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:OpenAutoNLU is an open-source automated machine learning library for natural language understanding (NLU) tasks, covering both text classification and named entity recognition (NER). Unlike existing solutions, we introduce data-aware training regime selection that requires no manual configuration from the user. The library also provides integrated data quality diagnostics, configurable out-of-distribution (OOD) detection, and large language model (LLM) features, all within a minimal low-code API. The demo app is accessible here this https URL.
[NLP-31] ALTER: Asymmetric LoRA for Token-Entropy-Guided Unlearning of LLM s AAAI AAAI2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中知识遗忘(unlearning)的两个核心挑战:一是知识在参数空间中的纠缠导致难以精准删除特定信息而不影响其他知识;二是现有方法在亿级参数模型上计算开销巨大,效率低下。解决方案的关键在于提出一种轻量级框架ALTER,其通过两阶段机制实现高效且精准的遗忘:第一阶段利用LoRA(Low-Rank Adaptation)中的共享A矩阵捕获高熵token并学习;第二阶段采用非对称LoRA架构,通过参数隔离和目标子域内token的定向遗忘,实现指定遗忘目标。该方法在TOFU、WMDP和MUSE基准测试中达到超过95%的遗忘质量,同时保留超过90%的模型效用,显著优于基线方法(47.8%-83.6%)。
链接: https://arxiv.org/abs/2603.01792
作者: Xunlei Chen,Jinyu Guo,Yuang Li,Zhaokun Wang,Yi Gong,Jie Zou,Jiwei Wei,Wenhong Tian
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at The 40th Annual AAAI Conference on Artificial Intelligence (AAAI 2026)
Abstract:Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Yet controlling what an LLM should not know is important for ensuring alignment and thus safe use. However, effective unlearning in LLMs is difficult due to the fuzzy boundary between knowledge retention and forgetting. This challenge is exacerbated by entangled parameter spaces from continuous multi-domain training, often resulting in collateral damage, especially under aggressive unlearning strategies. Furthermore, the computational overhead required to optimize State-of-the-Art (SOTA) models with billions of parameters poses an additional barrier. In this work, we present ALTER, a lightweight unlearning framework for LLMs to address both the challenges of knowledge entanglement and unlearning efficiency. ALTER operates through two phases: (I) high entropy tokens are captured and learned via the shared A matrix in LoRA, followed by (II) an asymmetric LoRA architecture that achieves a specified forgetting objective by parameter isolation and unlearning tokens within the target subdomains, opening a new research direction for achieving unlearning via token-level isolation in an asymmetric framework. ALTER achieves SOTA performance on TOFU, WMDP, and MUSE benchmarks with over 95% forget quality and shows minimal side effects through preserving foundational tokens. By decoupling unlearning from LLMs’ billion-scale parameters, this framework delivers excellent efficiency while preserving over 90% of model utility, exceeding baseline preservation rates of 47.8-83.6%.
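第一阶段"捕获高熵 token"的筛选逻辑可以用如下片段示意:对每个 token 位置的预测分布计算熵,取熵最高的 top-k 个位置(分布数值与 top-k 选择方式均为说明性假设,论文中这些位置随后交由 LoRA 共享 A 矩阵学习):

```python
import numpy as np

def token_entropy(probs):
    # 逐位置计算预测分布的香农熵;熵越高表示模型对该 token 越不确定
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def select_high_entropy(probs, top_k):
    # 取熵最高的 top_k 个 token 位置作为候选
    return np.argsort(-token_entropy(probs))[:top_k]

# 3 个位置、4 类词表的玩具分布:均匀分布熵最高,尖峰分布熵最低
probs = np.array([
    [0.25, 0.25, 0.25, 0.25],
    [0.97, 0.01, 0.01, 0.01],
    [0.70, 0.10, 0.10, 0.10],
])
picked = select_high_entropy(probs, top_k=1)   # 选中位置 0(均匀分布)
```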
[NLP-32] nchellwig at SemEval-2026 Task 3: Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis using Large Language Models
【速读】: 该论文旨在解决维度情感分析(Dimensional Aspect-Based Sentiment Analysis, DimASTE)中预测可靠性不足的问题,尤其是在多语言和跨领域场景下的泛化能力挑战。解决方案的关键在于提出自洽结构生成(Self-Consistent Structured Generation, SCSG),通过在每个样本上多次执行LoRA微调的大语言模型(LLM),仅保留多数共识一致的三元组(tuple)作为最终输出,从而提升预测的一致性与准确性;同时,为降低多次前向传播带来的计算开销,引入vLLM的PagedAttention机制实现高效的键值缓存复用,显著优化推理效率。
链接: https://arxiv.org/abs/2603.01788
作者: Nils Constantin Hellwig,Jakob Fehle,Udo Kruschwitz,Christian Wolff
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:We present Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis in SemEval-2026 Task 3 (Track A). SCSG enhances prediction reliability by executing a LoRA-adapted large language model multiple times per instance, retaining only tuples that achieve a majority consensus across runs. To mitigate the computational overhead of multiple forward passes, we leverage vLLM’s PagedAttention mechanism for efficient key–value cache reuse. Evaluation across 6 languages and 8 language–domain combinations demonstrates that self-consistency with 15 executions yields statistically significant improvements over single-inference prompting, with our system (leveraging Gemma 3) ranking in the top seven across all settings, achieving second place on three out of four English subsets and first place on Tatar-Restaurant for DimASTE.
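多数共识筛选本身很容易示意:对同一输入运行模型多次,只保留在超过半数运行中出现的三元组(示例三元组与阈值 0.5 为假设;实际系统以 15 次 vLLM 推理为一组):

```python
from collections import Counter

def majority_tuples(runs, threshold=0.5):
    # runs: 每次运行抽取的三元组集合;保留出现比例超过 threshold 的三元组
    counts = Counter(t for run in runs for t in set(run))
    need = len(runs) * threshold
    return {t for t, c in counts.items() if c > need}

# 假设 3 次运行抽取的 (aspect, category, polarity) 三元组
runs = [
    {("food", "FOOD#QUALITY", "positive"), ("staff", "SERVICE#GENERAL", "negative")},
    {("food", "FOOD#QUALITY", "positive")},
    {("food", "FOOD#QUALITY", "positive"), ("staff", "SERVICE#GENERAL", "negative"),
     ("price", "PRICE#LEVEL", "neutral")},
]
kept = majority_tuples(runs)   # food (3/3) 与 staff (2/3) 过半,price (1/3) 被滤除
```

只出现一次的 price 三元组很可能是单次推理的偶然输出,共识过滤以少量额外推理开销换取更稳定的预测。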
[NLP-33] LLM-as-an-Annotator: Training Lightweight Models with LLM-Annotated Examples for Aspect Sentiment Tuple Prediction LREC2026 ACL
【速读】: 该论文旨在解决Aspect-Based Sentiment Analysis (ABSA)任务中因依赖人工标注数据而导致的高成本与低效率问题。其解决方案的关键在于提出LA-ABSA方法,通过利用大语言模型(Large Language Model, LLM)生成的标注数据来微调轻量级模型,从而在低资源场景下实现高性能的ABSA任务。该方法在五个TASD和ASQP数据集上验证有效,相较于传统数据增强策略表现更优,并在接近LLM提示(prompting)性能的同时显著降低计算能耗。
链接: https://arxiv.org/abs/2603.01778
作者: Nils Constantin Hellwig,Jakob Fehle,Udo Kruschwitz,Christian Wolff
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted for publication at LREC 2026. Final version will appear in the ACL Anthology
Abstract:Training models for Aspect-Based Sentiment Analysis (ABSA) tasks requires manually annotated data, which is expensive and time-consuming to obtain. This paper introduces LA-ABSA, a novel approach that leverages Large Language Model (LLM)-generated annotations to fine-tune lightweight models for complex ABSA tasks. We evaluate our approach on five datasets for Target Aspect Sentiment Detection (TASD) and Aspect Sentiment Quad Prediction (ASQP). Our approach outperformed previously reported augmentation strategies and achieved competitive performance with LLM-prompting in low-resource scenarios, while providing substantial energy efficiency benefits. For example, using 50 annotated examples for in-context learning (ICL) to guide the annotation of unlabeled data, LA-ABSA achieved an F1 score of 49.85 for ASQP on the SemEval Rest16 dataset, closely matching the performance of ICL prompting with Gemma-3-27B (51.10), while requiring significantly lower computational resources.
[NLP-34] FreeAct: Freeing Activations for LLM Quantization
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)量化过程中因静态变换约束导致的性能瓶颈问题,特别是在扩散语言模型(diffusion LLMs, dLLMs)和多模态大语言模型(Multimodal LLMs, MLLMs)中,输入激活(activation)具有动态分布特性而传统方法无法有效适应的问题。解决方案的关键在于提出FreeAct框架,通过放松传统的固定一对一变换约束,引入基于激活秩亏性质的解空间,从而实现激活变换与权重变换的解耦;具体而言,FreeAct根据token类型(如视觉token与文本token或掩码token)识别动态差异,并为激活侧分配不同的变换矩阵,同时保持权重侧使用统一的静态变换矩阵,显著提升了量化后的模型性能。
链接: https://arxiv.org/abs/2603.01776
作者: Xiaohao Liu,Xiaobo Xia,Manyi Zhang,Ji-Fu Li,Xianzhi Yu,Fei Shen,Xiu Su,See-Kiong Ng,Tat-Seng Chua
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages, 18 figures, 2 tables
Abstract:Quantization is pivotal for mitigating the significant memory and computational overhead of Large Language Models (LLMs). While emerging transformation-based methods have successfully enhanced quantization by projecting feature spaces onto smoother manifolds using orthogonal matrices, they typically enforce a rigid one-to-one transformation constraint. This static approach fails to account for the dynamic patterns inherent in input activations, particularly within diffusion LLMs (dLLMs) and Multimodal LLMs (MLLMs), where varying token types exhibit distinct distributions. To advance this, we propose FreeAct, a novel quantization framework that relaxes the static one-to-one constraint to accommodate dynamic activation disparities. Theoretically, we leverage the rank-deficient nature of activations to derive a solution space that extends beyond simple inverse matrices, enabling the decoupling of activation transformations from weights. Methodologically, FreeAct identifies token-specific dynamics (i.e., vision vs. text, or masked tokens) and allocates distinct transformation matrices to the activation side, while maintaining a unified, static transformation for the weights. Extensive experiments across dLLMs and MLLMs demonstrate that FreeAct significantly outperforms baselines, with up to a 5.3% performance improvement, supported by in-depth analyses. Our code will be publicly released.
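The rank-deficiency argument behind decoupling the activation-side transform from the weight-side one can be checked numerically. The toy sketch below (dimensions and the null-space construction are illustrative assumptions, not the paper's method) shows that an activation transform T_a different from the static T still reproduces X @ W exactly, because the perturbation lies in the null space of the rank-deficient activations:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 8, 16, 5                                      # hidden dim, tokens, activation rank < d
X = rng.normal(size=(n, r)) @ rng.normal(size=(r, d))   # rank-deficient activations
W = rng.normal(size=(d, d))

# Static one-to-one scheme: a single orthogonal T applied to both sides.
T, _ = np.linalg.qr(rng.normal(size=(d, d)))
exact_static = np.allclose((X @ T) @ (T.T @ W), X @ W)

# FreeAct-style relaxation: perturb the activation-side transform along the
# null space of X's row space, so X @ T_a == X @ T while T_a != T.
_, _, Vh = np.linalg.svd(X)
N = Vh[r:].T @ Vh[r:]                  # projector onto the null space of X's rows
T_a = T + N @ rng.normal(size=(d, d))  # a distinct activation-side transform
exact_freeact = np.allclose((X @ T_a) @ (T.T @ W), X @ W)
```

In this view, each token type (vision, text, masked) can receive its own T_a while the weights keep one static transform, exactly the decoupling the abstract describes.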
[NLP-35] Beyond the Resumé: A Rubric-Aware Automatic Interview System for Information Elicitation
【速读】: 该论文旨在解决组织在招聘过程中难以高效筛选出最匹配候选人的难题,尤其是在依赖专家评估(如技术主管面试)成本高昂、难以规模化的情况下。其核心问题在于如何在早期筛选阶段利用有限信息提升决策质量。解决方案的关键在于引入大语言模型(Large Language Models, LLMs)作为领域专家代理,通过模拟结构化访谈动态更新对候选人基于评分量表的潜在特质(rubric-oriented latent traits)的概率信念,并确保该信念更新过程具有校准性(calibrated)。实验表明,在模拟面试中,该系统能有效收敛至人工构造的候选人能力水平,从而显著提升初筛阶段的决策准确性与可扩展性。
链接: https://arxiv.org/abs/2603.01775
作者: Harry Stuart,Masahiro Kaneko,Timothy Baldwin
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Effective hiring is integral to the success of an organisation, but it is very challenging to find the most suitable candidates because expert evaluations (e.g. interviews conducted by a technical manager) are expensive to deploy at scale. Therefore, automated resume scoring and other applicant-screening methods are increasingly used to coarsely filter candidates, making decisions on limited information. We propose that large language models (LLMs) can play the role of subject matter experts to cost-effectively elicit information from each candidate that is nuanced and role-specific, thereby improving the quality of early-stage hiring decisions. We present a system that leverages an LLM interviewer to update belief over an applicant's rubric-oriented latent traits in a calibrated way. We evaluate our system on simulated interviews and show that belief converges towards the simulated applicants' artificially-constructed latent ability levels. We release code, a modest dataset of public-domain/anonymised resumes, belief calibration tests, and simulated interviews, at this https URL. Our demo is available at this https URL.
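The abstract does not specify the belief-update model; a common minimal choice for a calibrated belief over a latent trait in [0, 1] is a Beta-Bernoulli update per rubric item, sketched below (the trait, answer scores, and weights are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class TraitBelief:
    """Beta-distributed belief over one rubric-oriented latent trait in [0, 1]."""
    alpha: float = 1.0   # pseudo-count of positive evidence
    beta: float = 1.0    # pseudo-count of negative evidence

    def update(self, signal: float, weight: float = 1.0) -> None:
        """Fold in one interview answer scored in [0, 1] by the LLM judge."""
        self.alpha += weight * signal
        self.beta += weight * (1.0 - signal)

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

belief = TraitBelief()
for score in [0.9, 0.8, 0.85]:   # three strong answers on a "system design" rubric item
    belief.update(score)
```

Starting from a uniform prior, the posterior mean drifts toward the candidate's demonstrated level while the growing pseudo-counts shrink uncertainty, which is the convergence behaviour the paper evaluates on simulated applicants.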
[NLP-36] AnnoABSA: A Web-Based Annotation Tool for Aspect-Based Sentiment Analysis with Retrieval-Augmented Suggestions LREC2026 ACL
【速读】: 该论文旨在解决Aspect-Based Sentiment Analysis (ABSA)任务中人工标注效率低、一致性差以及难以适应多样化任务需求的问题。其解决方案的关键在于提出AnnoABSA,一个基于Web的可定制化标注工具,支持ABSA全谱任务;通过引入大语言模型(Large Language Model, LLM)驱动的检索增强生成(Retrieval-Augmented Generation, RAG)机制,在人机协同框架下提供上下文感知的标注建议,并利用已标注样本进行少样本提示(few-shot prompting)动态优化建议准确性,从而在保持人工主导的同时提升标注质量与效率。
链接: https://arxiv.org/abs/2603.01773
作者: Nils Constantin Hellwig,Jakob Fehle,Udo Kruschwitz,Christian Wolff
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted for publication at LREC 2026. Final version will appear in the ACL Anthology
Abstract:We introduce AnnoABSA, the first web-based annotation tool to support the full spectrum of Aspect-Based Sentiment Analysis (ABSA) tasks. The tool is highly customizable, enabling flexible configuration of sentiment elements and task-specific requirements. Alongside manual annotation, AnnoABSA provides optional Large Language Model (LLM)-based retrieval-augmented generation (RAG) suggestions that offer context-aware assistance in a human-in-the-loop approach, keeping the human annotator in control. To improve prediction quality over time, the system retrieves the ten most similar examples that are already annotated and adds them as few-shot examples in the prompt, ensuring that suggestions become increasingly accurate as the annotation process progresses. Released as open-source software under the MIT License, AnnoABSA is freely accessible and easily extendable for research and practical applications.
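The retrieval step behind the RAG suggestions, picking the ten most similar already-annotated examples for the few-shot prompt, can be sketched as plain cosine-similarity ranking (the embeddings and examples below are toy values; the tool's actual retriever is not detailed in the abstract):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_few_shot(query_vec, annotated, k=10):
    """Return the k already-annotated examples most similar to the query.

    `annotated` is a list of (embedding, text, labels) triples; the result
    would be inserted into the suggestion prompt as few-shot examples.
    """
    ranked = sorted(annotated, key=lambda ex: cosine(query_vec, ex[0]), reverse=True)
    return ranked[:k]

annotated = [
    ([1.0, 0.0], "The pizza was great", [("pizza", "positive")]),
    ([0.9, 0.1], "Loved the pasta", [("pasta", "positive")]),
    ([0.0, 1.0], "Service was slow", [("service", "negative")]),
]
shots = retrieve_few_shot([1.0, 0.05], annotated, k=2)
```

Because the pool of annotated examples grows as the human works, the same retrieval call naturally yields more relevant few-shot context over time, which is how suggestions "become increasingly accurate as the annotation process progresses".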
[NLP-37] Bootstrapping Embeddings for Low Resource Languages
【速读】: 该论文旨在解决低资源语言缺乏高质量监督微调数据以训练高性能嵌入模型(embedding models)的问题。当前,生成式 AI(Generative AI)在高资源语言如英语中已能有效构建嵌入模型,但在其他数百种语言中仍面临数据匮乏的挑战。为应对这一问题,作者提出三种合成三元组数据(synthetic triplet data)生成策略:基于上下文学习(in-context learning)、适配器组合(adapter composition)以及跨语言微调的大语言模型生成方法(XL-LoRA)。研究表明,尽管上下文学习表现有限,但适配器组合与XL-LoRA显著提升了多种任务和语言上的嵌入模型性能,成为实现多语言嵌入模型高效、可扩展训练的关键路径。
链接: https://arxiv.org/abs/2603.01732
作者: Merve Basoz,Andrew Horne,Mattia Opper
机构: School of Informatics, University of Edinburgh (爱丁堡大学信息学院); Edina, University of Edinburgh (爱丁堡大学Edina项目)
类目: Computation and Language (cs.CL)
备注: (v1 - LowResLM Camera Ready)
Abstract:Embedding models are crucial to modern NLP. However, the creation of the most effective models relies on carefully constructed supervised finetuning data. For high resource languages, such as English, such datasets are readily available. However, for hundreds of other languages, they are simply non-existent. We investigate whether the advent of large language models can help to bridge this gap. We test three different strategies for generating synthetic triplet data used to optimise embedding models. These include in-context learning as well as two novel approaches, leveraging adapter composition and cross lingual finetuning of the LLM generator (XL-LoRA) respectively. We find that while in-context learning still falls short of strong non-synthetic baselines, adapter composition and XL-LoRA yield strong performance gains across a wide array of tasks and languages, offering a clear, scalable pathway to producing performant embedding models for a wide variety of languages.
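Synthetic (anchor, positive, negative) triples of the kind generated here are typically consumed by a margin loss over embedding distances. A minimal sketch, assuming cosine distance and a margin of 0.2 (the paper does not spell out its training objective):

```python
import math

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin loss over cosine distances for one synthetic (a, p, n) triple."""
    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))
    d_pos = 1.0 - cos(anchor, positive)
    d_neg = 1.0 - cos(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# A well-separated synthetic triple contributes no loss...
easy = triplet_loss([1, 0], [0.99, 0.1], [0, 1])
# ...while a confusable negative still drives learning.
hard = triplet_loss([1, 0], [0.9, 0.4], [0.95, 0.3])
```

The quality of the LLM-generated negatives matters precisely because only triples like `hard` produce gradient signal, which is one way to read the gap between in-context generation and the adapter-composition / XL-LoRA generators.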
[NLP-38] TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training
【速读】: 该论文旨在解决传统工具使用智能体训练中因仅依赖结果导向筛选(如成功轨迹的监督微调 SFT 和通过通过率选择的任务强化学习 RL)而导致的交互动态信息丢失问题:成功轨迹可能缺乏错误恢复能力或存在冗余行为,而单纯基于通过率的任务筛选无法区分结构上有信息量的任务与简单任务。其解决方案的关键在于提出 TopoCurate 框架,该框架将同一任务下多轮试错轨迹映射到统一的语义商空间拓扑(semantic quotient topology),通过合并等价的动作-观测状态,将散乱的线性轨迹转化为显式捕捉工具调用与环境响应如何驱动有效策略与失败模式分化的结构化流形。在此基础上,引入双选机制:SFT 阶段优先选择体现反思性恢复、语义高效性和策略多样性的轨迹以缓解协变量偏移和模式坍缩;RL 阶段则选取具有高错误分支比例和策略异质性的任务以最大化梯度信噪比,从而缓解稀疏奖励场景下的信号消失问题。
链接: https://arxiv.org/abs/2603.01714
作者: Jinluan Yang,Yuxin Liu,Zhengyu Chen,Chengcheng Han,Yueqing Sun,Qi Gu,Hui Su,Xunliang Cai,Fei Wu,Kun Kuang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Under Review
Abstract:Training tool-use agents typically relies on outcome-based filtering: Supervised Fine-Tuning (SFT) on successful trajectories and Reinforcement Learning (RL) on pass-rate-selected tasks. However, this paradigm ignores interaction dynamics: successful trajectories may lack error recovery or exhibit redundancy, while pass rates fail to distinguish structurally informative tasks from trivial ones. We propose TopoCurate, an interaction-aware framework that projects multi-trial rollouts from the same task into a unified semantic quotient topology. By merging equivalent action-observation states, this projection transforms scattered linear trajectories into a structured manifold that explicitly captures how tool invocations and environmental responses drive the divergence between effective strategies and failure modes. Leveraging this representation, we introduce a dual-selection mechanism: for SFT, we prioritize trajectories demonstrating reflective recovery, semantic efficiency, and strategic diversity to mitigate covariate shift and mode collapse; for RL, we select tasks with high error branch ratios and strategic heterogeneity, maximizing gradient Signal-to-Noise Ratio to address vanishing signals in sparse-reward settings. Evaluations on BFCLv3 and Tau2 Bench show that TopoCurate achieves consistent gains of 4.2% (SFT) and 6.9% (RL) over state-of-the-art baselines. We will release the code and data soon for further investigations.
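The projection of multi-trial rollouts into a quotient graph, and the error-branch-ratio signal used for RL task selection, can be sketched as follows. The state canonicalization and the toy rollouts are illustrative assumptions, not the paper's semantic merging procedure:

```python
from collections import defaultdict

def build_quotient_graph(rollouts, canon=lambda s: s):
    """Merge equivalent action-observation states across rollouts of one task.

    `rollouts` is a list of state sequences; `canon` maps a raw state to a
    canonical (e.g. semantically normalized) key so equivalent states merge.
    """
    edges = defaultdict(set)
    for states in rollouts:
        keys = [canon(s) for s in states]
        for a, b in zip(keys, keys[1:]):
            edges[a].add(b)
    return edges

def error_branch_ratio(edges, is_error):
    """Fraction of merged states that branch into at least one error state."""
    nodes = list(edges)
    branching = sum(1 for node in nodes if any(is_error(c) for c in edges[node]))
    return branching / len(nodes) if nodes else 0.0

rollouts = [
    ["start", "search(api)", "read_doc", "answer_ok"],
    ["start", "search(api)", "wrong_call", "error", "read_doc", "answer_ok"],
    ["start", "wrong_call", "error"],
]
g = build_quotient_graph(rollouts)
ratio = error_branch_ratio(g, lambda s: s == "error")
```

Tasks whose merged graph has many error branches carry more informative gradient signal than tasks every rollout solves the same way, which is the intuition behind preferring them for RL.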
[NLP-39] Building a Strong Instruction Language Model for a Less-Resourced Language
【速读】: 该论文旨在解决当前主流大语言模型(Large Language Models, LLMs)在低资源语言(如斯洛文尼亚语)上性能不足的问题,其核心挑战在于现有开源模型主要基于英语文本训练,导致对非英语语言的支持有限。解决方案的关键在于采用三阶段持续预训练(continual pre-training)与两阶段监督微调(supervised fine-tuning, SFT)相结合的方法,将Gemma 3模型适配至斯洛文尼亚语;具体而言,作者在预训练阶段融合了1400亿个斯洛文尼亚语及周边巴尔干地区语言(包括英语、波斯尼亚语、塞尔维亚语和克罗地亚语)的token数据,并在SFT阶段使用超过20万条英斯双语样本进行优化,最终构建出参数规模为120亿的GaMS3-12B模型,该模型在多个评测基准中表现优于同规模开源模型,并在斯洛文尼亚语LLM竞技场中达到超过60%的胜率,接近商用GPT-4o水平。
链接: https://arxiv.org/abs/2603.01691
作者: Domen Vreš,Tjaša Arčon,Timotej Petrič,Dario Vajda,Marko Robnik-Šikonja,Iztok Lebar Bajec
机构: University of Ljubljana, Faculty of Computer and Information Science (卢布尔雅那大学计算机与信息科学学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Currently under review at Natural Language Processing Special Issue on Language Models for Low-Resource Languages
Abstract:Large language models (LLMs) have become an essential tool for natural language processing and artificial intelligence in general. Current open-source models are primarily trained on English texts, resulting in poorer performance on less-resourced languages and cultures. We present a set of methodological approaches necessary for the successful adaptation of an LLM to a less-resourced language, and demonstrate them using the Slovene language. We present GaMS3-12B, a generative model for Slovene with 12 billion parameters, and demonstrate that it is the best-performing open-source model for Slovene within its parameter range. We adapted the model to the Slovene language using three-stage continual pre-training of the Gemma 3 model, followed by two-stage supervised fine-tuning (SFT). We trained the model on a combination of 140B Slovene, English, Bosnian, Serbian, and Croatian pretraining tokens, and over 200 thousand English and Slovene SFT examples. We evaluate GaMS3-12B on the Slovenian-LLM-Eval datasets, English-to-Slovene translation, and the Slovene LLM arena. We show that the described model outperforms 12B Gemma 3 across all three scenarios and performs comparably to the much larger commercial GPT-4o in the Slovene LLM arena, achieving a win rate of over 60%.
[NLP-40] QIME: Constructing Interpretable Medical Text Embeddings via Ontology-Grounded Questions
【速读】: 该论文旨在解决现有密集型生物医学嵌入(dense biomedical embeddings)因黑箱特性而在临床决策中应用受限的问题,以及当前基于问题的可解释嵌入方法依赖启发式或表面对比信号、忽视专业领域知识的不足。其解决方案的关键在于提出QIME框架,该框架基于本体(ontology)构建可解释的医学文本嵌入,使每个维度对应一个临床上有意义的“是/否”问题;通过条件化于特定聚类的医学概念签名(cluster-specific medical concept signatures),生成语义原子级的问题以捕捉生物医学文本中的细粒度差异,并支持无需训练的嵌入构建策略,在不依赖逐问题分类器训练的前提下进一步提升性能。
链接: https://arxiv.org/abs/2603.01690
作者: Yixuan Tang,Zhenghong Lin,Yandong Sun,Anthony K.H. Tung
机构: School of Computing, National University of Singapore (新加坡国立大学计算机学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:While dense biomedical embeddings achieve strong performance, their black-box nature limits their utility in clinical decision-making. Recent question-based interpretable embeddings represent text as binary answers to natural-language questions, but these approaches often rely on heuristic or surface-level contrastive signals and overlook specialized domain knowledge. We propose QIME, an ontology-grounded framework for constructing interpretable medical text embeddings in which each dimension corresponds to a clinically meaningful yes/no question. By conditioning on cluster-specific medical concept signatures, QIME generates semantically atomic questions that capture fine-grained distinctions in biomedical text. Furthermore, QIME supports a training-free embedding construction strategy that eliminates per-question classifier training while further improving performance. Experiments across biomedical semantic similarity, clustering, and retrieval benchmarks show that QIME consistently outperforms prior interpretable embedding methods and substantially narrows the gap to strong black-box biomedical encoders, while providing concise and clinically informative explanations.
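The core representation, one dimension per ontology-grounded yes/no question, can be sketched as below. The keyword-lookup answerer is a hypothetical stand-in for the LLM or the training-free classifier the paper actually uses:

```python
def qime_embed(text, questions, answer_fn):
    """Embed `text` as binary answers to ontology-grounded yes/no questions.

    `answer_fn(text, q)` returns True/False for one question; each output
    dimension is therefore directly readable as a clinical property.
    """
    return [1.0 if answer_fn(text, q) else 0.0 for q in questions]

questions = [
    "Does the text mention a cardiovascular condition?",
    "Does the text describe a medication dosage?",
    "Does the text report an imaging finding?",
]
# Keyword lookup as a toy stand-in for the real per-question answerer.
keywords = {0: ["hypertension", "cardiac"], 1: ["mg", "dose"], 2: ["x-ray", "mri"]}
answer = lambda text, q: any(k in text.lower() for k in keywords[questions.index(q)])

vec = qime_embed("Patient with hypertension, started on 10 mg lisinopril.", questions, answer)
```

Similarity between two such vectors can be explained dimension by dimension ("both mention a cardiovascular condition and a dosage"), which is the interpretability advantage over dense black-box encoders.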
[NLP-41] Surgical Post-Training: Cutting Errors Keeping Knowledge
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在后训练阶段提升推理能力时面临的效率与灾难性遗忘之间的权衡问题。现有方法虽强调使用策略内数据(on-policy data)缓解遗忘,但忽略了直接偏好优化(Direct Preference Optimization, DPO)中奖励估计所蕴含的隐式正则化机制。为此,作者提出了一种名为“外科后训练”(Surgical Post-Training, SPoT)的新范式,其核心在于:(1) 通过一个数据校正流水线,利用Oracle对错误推理步骤进行最小编辑,生成贴近模型分布的数据;(2) 设计一种基于奖励的二元交叉熵目标函数,将推理正确性建模为二分类问题,从而施加解耦的监督信号。该方案显著提升了模型在领域内和领域外任务上的推理准确率,同时大幅降低训练成本。
链接: https://arxiv.org/abs/2603.01683
作者: Wenye Lin,Kai Han
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages
Abstract:Enhancing the reasoning capabilities of Large Language Models (LLMs) via post-training is often constrained by the trade-off between efficiency and catastrophic forgetting. While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover, and validate both theoretically and empirically, an overlooked yet critical mechanism: the implicit regularization inherent in Direct Preference Optimization's (DPO) reward estimate. This motivates our Surgical Post-Training (SPoT), a new paradigm designed to optimize reasoning efficiently while preserving learned prior knowledge. SPoT consists of: (1) a data rectification pipeline that employs an Oracle to surgically correct erroneous steps via minimal edits, generating data proximal to the model's distribution; and (2) a reward-based binary cross-entropy objective. Unlike the relative ranking in DPO, this objective treats reasoning correctness as a binary classification problem, enforcing decoupled supervision signals. Empirically, with only 4k rectified math data pairs, SPoT improves Qwen3-8B's accuracy by 6.2% on average across in-domain and OOD tasks, requiring merely 28 minutes of training on 8x H800 GPUs. Code: this https URL
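A sketch of the reward-based binary cross-entropy objective: it keeps the DPO-style implicit reward beta * (log pi - log pi_ref) but, instead of ranking a pair, classifies each trace as correct or incorrect. The exact parameterization is an assumption inferred from the abstract:

```python
import math

def spot_bce_loss(logp_policy, logp_ref, correct, beta=0.1):
    """Binary cross-entropy on an implicit reward, one trace at a time.

    Unlike DPO's pairwise ranking, each trace carries its own decoupled
    supervision signal via the binary correctness label.
    """
    reward = beta * (logp_policy - logp_ref)
    p = 1.0 / (1.0 + math.exp(-reward))      # sigmoid of the implicit reward
    label = 1.0 if correct else 0.0
    eps = 1e-12
    return -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))

# A correct trace the policy already prefers over the reference: low loss.
low = spot_bce_loss(logp_policy=-10.0, logp_ref=-30.0, correct=True)
# The same preference on an *incorrect* trace is penalized: high loss.
high = spot_bce_loss(logp_policy=-10.0, logp_ref=-30.0, correct=False)
```

Keeping the reference log-probability in the reward is what preserves the implicit regularization toward the base model, the mechanism the paper credits for avoiding catastrophic forgetting.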
[NLP-42] LexChronos: An Agentic Framework for Structured Event Timeline Extraction in Indian Jurisprudence AAAI2026
【速读】: 该论文旨在解决传统方法将法院判决和诉讼过程视为非结构化文本,从而限制大语言模型(Large Language Models, LLMs)在法律文本摘要、论点生成和判决预测等任务中表现的问题。其核心解决方案是提出一个名为LexChronos的代理框架,该框架通过迭代方式从印度最高法院判决中提取结构化的事件时间线;关键创新在于采用双代理架构:一个基于LoRA-instruct微调的抽取代理识别候选事件,另一个预训练的反馈代理则通过置信度驱动的循环对事件进行评分与优化,从而显著提升结构化事件提取的准确性,并为后续法律AI应用提供高质量的数据基础。
链接: https://arxiv.org/abs/2603.01651
作者: Anka Chandrahas Tummepalli,Preethu Rose Anish
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Published in AILaw @ AAAI 2026 Conference
Abstract:Understanding and predicting judicial outcomes demands nuanced analysis of legal documents. Traditional approaches treat judgments and proceedings as unstructured text, limiting the effectiveness of large language models (LLMs) in tasks such as summarization, argument generation, and judgment prediction. We propose LexChronos, an agentic framework that iteratively extracts structured event timelines from Supreme Court of India judgments. LexChronos employs a dual-agent architecture: a LoRA-instruct-tuned extraction agent identifies candidate events, while a pre-trained feedback agent scores and refines them through a confidence-driven loop. To address the scarcity of Indian legal event datasets, we construct a synthetic corpus of 2000 samples using reverse-engineering techniques with DeepSeek-R1 and GPT-4, generating gold-standard event annotations. Our pipeline achieves a BERT-based F1 score of 0.8751 against this synthetic ground truth. In downstream evaluations on legal text summarization, GPT-4 preferred structured timelines over unstructured baselines in 75% of cases, demonstrating improved comprehension and reasoning in Indian jurisprudence. This work lays a foundation for future legal AI applications in the Indian context, such as precedent mapping, argument synthesis, and predictive judgment modelling, by harnessing structured representations of legal events.
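The confidence-driven dual-agent loop can be sketched as follows; `toy_extract` and `toy_score` are hypothetical stand-ins for the LoRA-tuned extraction agent and the pre-trained feedback agent, and the threshold is an assumed value:

```python
def refine_events(extract, score, text, threshold=0.8, max_rounds=3):
    """Extract candidate events, score them, and re-extract with feedback
    until every surviving event clears the confidence threshold."""
    feedback = []
    for _ in range(max_rounds):
        events = extract(text, feedback)
        scored = [(e, score(e)) for e in events]
        low = [e for e, s in scored if s < threshold]
        if not low:
            return [e for e, _ in scored]
        feedback = low                    # ask the extractor to revise these
    return [e for e, s in scored if s >= threshold]

# Toy agents: the extractor fixes flagged events on later rounds.
def toy_extract(text, feedback):
    base = ["2018-04: FIR filed", "2019-??: trial began", "2021-07: appeal filed"]
    if feedback:
        base = [e.replace("??", "06") for e in base]
    return base

def toy_score(event):
    return 0.5 if "??" in event else 0.95

timeline = refine_events(toy_extract, toy_score, "judgment text ...")
```

Here the first round flags the under-specified date, the second round returns a revised candidate, and the loop terminates with a fully high-confidence timeline, mirroring the extract-score-refine cycle the abstract describes.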
[NLP-43] Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning
【速读】: 该论文旨在解决生成式 AI(Generative AI)中大型语言模型(Large Language Model, LLM)推理速度受限的问题,尤其是现有推测解码(Speculative Decoding)方法在 draft-and-verify 循环中因静态时间分配或基于代理指标(如接受长度)优化而导致的效率低下问题。解决方案的关键在于提出 Learning to Draft (LTD),这是一种基于强化学习的动态调度框架,通过训练两个协同适应的策略来联合优化 drafting 和 verification 阶段的时间分配,从而直接最大化每个循环的吞吐量(throughput),实现更高效的推理加速。
链接: https://arxiv.org/abs/2603.01639
作者: Jiebin Zhang,Zhenghan Yu,Liang Wang,Nan Yang,Eugene J. Yu,Zheng Li,Yifan Song,Dawei Zhu,Xingxing Zhang,Furu Wei,Sujian Li
机构: Peking University (北京大学); Microsoft Research Asia (微软亚洲研究院)
类目: Computation and Language (cs.CL)
备注: 22 pages, 7 figures
Abstract:Speculative decoding accelerates large language model (LLM) inference by using a small draft model to generate candidate tokens for a larger target model to verify. The efficacy of this technique hinges on the trade-off between the time spent on drafting candidates and verifying them. However, current state-of-the-art methods rely on a static time allocation, while recent dynamic approaches optimize for proxy metrics like acceptance length, often neglecting the true time cost and treating the drafting and verification phases in isolation. To address these limitations, we introduce Learning to Draft (LTD), a novel method that directly optimizes for throughput of each draft-and-verify cycle. We formulate the problem as a reinforcement learning environment and train two co-adaptive policies to dynamically coordinate the draft and verification phases. This encourages the policies to adapt to each other and explicitly maximize decoding efficiency. We conducted extensive evaluations on five diverse LLMs and four distinct tasks. Our results show that LTD achieves speedup ratios ranging from 2.24x to 4.32x, outperforming the state-of-the-art method Eagle3 up to 36.4%.
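Why acceptance length is only a proxy becomes clear from the per-cycle throughput LTD optimizes directly: tokens emitted per cycle divided by the cycle's wall time. A minimal sketch with assumed timings shows that a longer draft can raise accepted tokens yet still lower true throughput:

```python
def cycle_throughput(accepted_tokens, draft_len, t_draft_token, t_verify):
    """Tokens per second for one draft-and-verify cycle.

    One cycle drafts `draft_len` tokens, verifies them in a single target
    pass, and emits `accepted_tokens` + 1 tokens (the extra token comes
    for free from the target model's verification step).
    """
    cycle_time = draft_len * t_draft_token + t_verify
    return (accepted_tokens + 1) / cycle_time

# Assumed timings: 2 ms per drafted token, 30 ms per verification pass.
short_draft = cycle_throughput(accepted_tokens=3, draft_len=4, t_draft_token=0.002, t_verify=0.03)
long_draft = cycle_throughput(accepted_tokens=4, draft_len=12, t_draft_token=0.002, t_verify=0.03)
```

The longer draft wins on acceptance length (4 vs. 3) but loses on throughput, which is why LTD trains its policies against the time-aware objective rather than acceptance length alone.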
[NLP-44] Measuring What VLMs Don't Say: Validation Metrics Hide Clinical Terminology Erasure in Radiology Report Generation
【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在放射学领域部署中评估指标的局限性问题,即现有方法仅依赖表面文本相似度(如token重叠率),难以保障临床真实性(clinical fidelity)和人群公平性(demographic fairness)。其关键解决方案是提出一种词汇层面的评估框架——临床关联位移(Clinical Association Displacement, CAD),用于量化不同人群生成报告中临床术语关联性的偏移,并引入加权关联擦除(Weighted Association Erasure, WAE)来衡量跨群体的临床信号损失。该方法可有效识别模板坍缩(template collapse)现象,从而避免模型通过生成泛化、重复内容“作弊”以获得高评分,推动对“最优”临床报告定义的根本性重构。
链接: https://arxiv.org/abs/2603.01625
作者: Aditya Parikh,Aasa Feragen,Sneha Das,Stella Frank
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This is an extended version of a manuscript currently under review
Abstract:Reliable deployment of Vision-Language Models (VLMs) in radiology requires validation metrics that go beyond surface-level text similarity to ensure clinical fidelity and demographic fairness. This paper investigates a critical blind spot in current model evaluation: the use of decoding strategies that lead to high aggregate token-overlap scores despite succumbing to template collapse, in which models generate only repetitive, safe generic text and omit clinical terminology. Unaddressed, this blind spot can lead to metric gaming, where models that perform well on benchmarks prove clinically uninformative. Instead, we advocate for lexical diversity measures to check model generations for clinical specificity. We introduce Clinical Association Displacement (CAD), a vocabulary-level framework that quantifies shifts in demographic-based word associations in generated reports. Weighted Association Erasure (WAE) aggregates these shifts to measure the clinical signal loss across demographic groups. We show that deterministic decoding produces high levels of semantic erasure, while stochastic sampling generates diverse outputs but risks introducing new bias, motivating a fundamental rethink of how “optimal” reporting is defined.
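The exact CAD/WAE formulas are not given in the abstract; the simplified sketch below illustrates the underlying idea of measuring clinical-term association loss between reference and generated report sets (the vocabulary, uniform weights, and toy reports are assumptions):

```python
from collections import Counter

def association_profile(reports, vocab):
    """Relative frequency of each clinical term across one set of reports."""
    counts = Counter(w for r in reports for w in r.lower().split() if w in vocab)
    total = sum(counts.values())
    return {w: counts[w] / total if total else 0.0 for w in vocab}

def weighted_association_erasure(reference, generated, vocab, weights=None):
    """Clinical signal lost in generation: weighted positive displacement of
    terms that appear less often in generated reports than in the reference."""
    weights = weights or {w: 1.0 for w in vocab}
    ref = association_profile(reference, vocab)
    gen = association_profile(generated, vocab)
    return sum(weights[w] * max(0.0, ref[w] - gen[w]) for w in vocab)

vocab = {"effusion", "cardiomegaly", "opacity"}
reference = ["small pleural effusion", "mild cardiomegaly", "basal opacity noted"]
collapsed = ["no acute findings", "no acute findings", "no acute findings"]
wae = weighted_association_erasure(reference, collapsed, vocab)
```

The template-collapsed output scores total erasure here even though its repetitive "safe" text could still earn decent token-overlap scores, which is exactly the blind spot the paper targets; computing the metric per demographic group exposes unequal signal loss.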
[NLP-45] More Data Fewer Diacritics: Scaling Arabic TTS
【速读】: 该论文旨在解决阿拉伯语文本到语音(Text-to-Speech, TTS)研究中因公开可用训练数据稀缺和阿拉伯语符号标注(diacritization)模型准确性不足而导致的瓶颈问题。其解决方案的关键在于构建一个鲁棒的自动化数据处理流水线,通过语音活动检测(Voice Activity Detection, VAD)、自动语音识别(Automatic Speech Recognition, ASR)、自动符号标注(Automatic Diacritization)和噪声过滤等步骤,从大规模非标注录音中自动生成约4000小时的阿拉伯语TTS训练数据。实验表明,即使缺乏符号标注,使用更大规模的数据仍可显著提升模型性能,从而为无需符号标注的阿拉伯语TTS系统提供了可行路径。
链接: https://arxiv.org/abs/2603.01622
作者: Ahmed Musleh,Yifan Zhang,Kareem Darwish
机构: QCRI, HBKU, Qatar
类目: Computation and Language (cs.CL)
备注:
Abstract:Arabic Text-to-Speech (TTS) research has been hindered by the limited availability of both public training data and accurate Arabic diacritization models. In this paper, we address these limitations by exploring Arabic TTS training on large automatically annotated data. Namely, we built a robust pipeline for collecting Arabic recordings and processing them automatically using voice activity detection, speech recognition, automatic diacritization, and noise filtering, resulting in around 4,000 hours of Arabic TTS training data. We then trained several robust TTS models with voice cloning using varying amounts of data, namely 100, 1,000, and 4,000 hours, with and without diacritization. We show that though models trained on diacritized data are generally better, larger amounts of training data compensate for the lack of diacritics to a significant degree. We plan to release a public Arabic TTS model that works without the need for diacritization.
[NLP-46] Markovian ODE-guided scoring can assess the quality of offline reasoning traces in language models
【速读】: 该论文旨在解决当前生成式语言模型(Generative Language Models)推理轨迹(Reasoning Traces)评估方法过于机械、难以捕捉人类中心视角下的推理质量,且在面对多样化和逐步退化的推理过程时缺乏泛化能力的问题。其解决方案的核心在于提出MarODE——一个基于马尔可夫链建模推理进展(Markovian formulation of reasoning progression)与常微分方程(Ordinary Differential Equation, ODE)刻画轨迹动态的离线评估框架,通过理论驱动的方式实现对推理质量的高效评分。该方法在大规模实验中相较现有基线在Somers’ D相关性指标上提升超250%,验证了其在评估维度上的“优良性”(goodness)与“合理性”(soundness)优势。
链接: https://arxiv.org/abs/2603.01580
作者: Arghodeep Nandi,Ojasva Saxena,Tanmoy Chakraborty
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Reasoning traces produced by generative language models are increasingly used for tasks ranging from mathematical problem solving to automated fact checking. However, existing evaluation methods remain largely mechanical and fail to capture human-centric notions of reasoning quality in a way that generalizes across varied and progressively degraded reasoning. We introduce MarODE, an offline evaluation framework that assigns quality scores to reasoning traces. Its effectiveness is assessed using human-centric perturbations and human judgments, which jointly evaluate the fundamental dimensions of an evaluation metric - goodness and soundness. The approach is grounded in a Markovian formulation of reasoning progression and an ordinary differential equation based characterization of trace dynamics, enabling efficient evaluation of reasoning quality. In a large-scale evaluation, MarODE outperforms existing baselines by over 250% under Somers’ D correlation. Our results emphasize the value of theory-driven evaluation frameworks as reasoning traces become central to language model-based systems.
[NLP-47] Extracting Training Dialogue Data from Large Language Model based Task Bots
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在任务导向对话系统(Task-Oriented Dialogue Systems, TODS)中引入的隐私风险问题,特别是LLM对训练数据的记忆效应可能泄露敏感信息(如电话号码、完整行程等)。其解决方案的关键在于提出一套针对LLM-based TODS的新型数据提取攻击方法,该方法通过优化响应采样和成员推断能力,显著提升了从模型中提取训练标签的精度与效率;实验表明,在最佳情况下可实现超过70%的精确度,从而揭示了LLM在TODS中训练数据记忆的量化特征,并为后续隐私保护策略提供了针对性依据。
链接: https://arxiv.org/abs/2603.01550
作者: Shuo Zhang,Junzhou Zhao,Junji Hou,Pinghui Wang,Chenxu Wang,Jing Tao
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted for publication in IEEE Transactions on Information Forensics and Security (TIFS). \c{opyright} 2026 IEEE
Abstract:Large Language Models (LLMs) have been widely adopted to enhance Task-Oriented Dialogue Systems (TODS) by modeling complex language patterns and delivering contextually appropriate responses. However, this integration introduces significant privacy risks, as LLMs, functioning as soft knowledge bases that compress extensive training data into rich knowledge representations, can inadvertently memorize training dialogue data containing not only identifiable information such as phone numbers but also entire dialogue-level events like complete travel schedules. Despite the critical nature of this privacy concern, how LLM memorization is inherited in developing task bots remains unexplored. In this work, we address this gap through a systematic quantitative study that involves evaluating existing training data extraction attacks, analyzing key characteristics of task-oriented dialogue modeling that render existing methods ineffective, and proposing novel attack techniques tailored for LLM-based TODS that enhance both response sampling and membership inference. Experimental results demonstrate the effectiveness of our proposed data extraction attack. Our method can extract thousands of training labels of dialogue states with best-case precision exceeding 70%. Furthermore, we provide an in-depth analysis of training data memorization in LLM-based TODS by identifying and quantifying key influencing factors and discussing targeted mitigation strategies.
[NLP-48] Anatomy of the Modality Gap: Dissecting the Internal States of End-to-End Speech LLMs
【速读】: 该论文旨在解决语音输入任务中模型性能显著低于直接文本推理任务的模态差距(modality gap)问题。研究表明,这一差距并非源于简单的分布偏移,而是由于语音表示在深层网络中存在冗余信息难以被有效压缩为稳定决策所致。解决方案的关键在于:识别并利用语音与文本表示在跨层结构上的动态对齐特性,尤其是语音表示具有宽泛的跨层对齐带,这反映了语义内容在多帧中的冗余分布;因此,未来改进方向应聚焦于以token或时间粒度而非特征层面的匹配来优化模型,从而更有效地从冗余语音信号中提取稳定语义表征。
链接: https://arxiv.org/abs/2603.01502
作者: Ming-Hao Hsu,Xueyao Zhang,Xiaohai Tian,Jun Zhang,Zhizheng Wu
机构: 未知
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
Abstract:Recent advancements in Large Speech-Language Models have significantly bridged the gap between acoustic signals and linguistic understanding. However, a persistent performance disparity remains in speech-based input tasks compared to direct text inference. In this paper, we investigate the dynamic roots of this modality gap beyond static geometric alignment, analyzing how speech and text representations evolve layer-by-layer. We evaluate four open-weight end-to-end models on SpeechMMLU and VoiceBench BBH. Using cross-layer CKA analysis with speech-text token alignment, we find that speech representations exhibit a broad cross-layer alignment band, attributable to the redundant nature of speech where semantic content spans multiple frames. We show that these alignment patterns are structurally stable across different analysis configurations. Crucially, simple statistical calibration is insufficient and can be detrimental when applied at the input layer, indicating that the modality gap is not a mere distribution shift. Overall, our results suggest that the bottleneck lies in condensing redundant speech into stable late-layer decisions, motivating future solutions that operate at the token or temporal granularity instead of feature-level matching.
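Cross-layer similarity analyses of this kind are commonly computed with linear CKA. A minimal sketch on synthetic representations (the paper's actual pipeline first aligns speech frames to text tokens, which is omitted here):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two layers' representations (rows = aligned tokens)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
text = rng.normal(size=(64, 32))              # hidden states from one text layer
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))
speech_rotated = text @ Q                     # an orthogonally related layer
speech_random = rng.normal(size=(64, 32))     # an unrelated layer

high = linear_cka(text, speech_rotated)
low = linear_cka(text, speech_random)
```

Because CKA is invariant to orthogonal rotation, the rotated layer scores near 1 while the unrelated one scores much lower; sweeping this score over all (speech layer, text layer) pairs yields the cross-layer alignment band the paper reports.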
[NLP-49] ProtRLSearch: A Multi-Round Multimodal Protein Search Agent with Large Language Models Trained via Reinforcement Learning
【速读】: 该论文旨在解决当前蛋白质搜索代理在医疗场景下进行蛋白序列约束推理时存在的两大问题:一是现有搜索代理多局限于单轮、纯文本模态的检索,无法将蛋白序列作为多模态输入融入搜索决策过程;二是其依赖强化学习(Reinforcement Learning, RL)监督仅关注最终答案,缺乏对搜索过程中关键词选择和推理方向的约束,导致偏差难以及时识别与修正。解决方案的关键在于提出ProtRLSearch,一种基于多维奖励机制训练的多轮蛋白搜索代理,能够在实时搜索中联合利用蛋白序列和文本作为多模态输入,从而生成高质量报告。同时,为评估模型整合蛋白序列信息与文本多模态输入的能力,作者构建了ProtMCQs基准数据集,涵盖3000道多选题,按难度分为三个层级,用于测试从序列约束推理到整合信号通路与调控网络的综合性蛋白推理任务。
链接: https://arxiv.org/abs/2603.01464
作者: Congying Liu,Taihao Li,Ming Huang,Xingyuan Wei,Peipei Liu,Yiqing Shen,Yanxu Mao,Tiehan Cui
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Protein analysis tasks arising in healthcare settings often require accurate reasoning under protein sequence constraints, involving tasks such as functional interpretation of disease-related variants, protein-level analysis for clinical research, and similar scenarios. To address such tasks, search agents are introduced to search protein-related information, providing support for disease-related variant analysis and protein function reasoning in protein-centric inference. However, such search agents are mostly limited to single-round, text-only modality search, which prevents the protein sequence modality from being incorporated as a multimodal input into the search decision-making process. Meanwhile, their reliance on reinforcement learning (RL) supervision that focuses solely on the final answer results in a lack of search process constraints, making deviations in keyword selection and reasoning directions difficult to identify and correct in a timely manner. To address these limitations, we propose ProtRLSearch, a multi-round protein search agent trained with multi-dimensional reward based RL, which jointly leverages protein sequence and text as multimodal inputs during real-time search to produce high quality reports. To evaluate the ability of models to integrate protein sequence information and text-based multimodal inputs in realistic protein query settings, we construct ProtMCQs, a benchmark of 3,000 multiple choice questions (MCQs) organized into three difficulty levels. The benchmark evaluates protein query tasks that range from sequence constrained reasoning about protein function and phenotype changes to comprehensive protein reasoning that integrates multi-dimensional sequence features with signal pathways and regulatory networks.
[NLP-50] Enhancing Persona Following at Decoding Time via Dynamic Importance Estimation for Role-Playing Agents ICLR2026
【速读】: 该论文旨在解决角色扮演语言代理(Role-Playing Language Agents)在社会模拟中因静态人格设定无法适应动态场景而导致的行为失真问题。现有方法如静态提示工程或昂贵的微调难以实现人格随情境变化的自适应调整,而心理学理论(如认知-情感人格系统,Cognitive-Affective Personality Systems)指出人格对行为的影响具有情境依赖性,这凸显了动态人格管理的必要性。解决方案的关键在于提出一种基于理论驱动的Persona Dynamic Decoding(PDD)框架,其核心创新包括:(1)Persona Importance Estimation(PIE)模块,无需真实标签即可动态量化人格属性的情境重要性;(2)Persona-Guided Inference-Time Alignment(PIA)范式,利用重要性得分构建加权多目标奖励并调节推理阶段的生成概率,从而实现在推理时对人格的精准跟随。
链接: https://arxiv.org/abs/2603.01438
作者: Yuxin Liu,Mingye Zhu,Siyuan Liu,Bo Hu,Lei Zhang
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ICLR 2026
Abstract:The utility of Role-Playing Language Agents in sociological research is growing alongside the adoption of Large Language Models. For realism in social simulation, these agents must adhere to their personas defined by character profiles, yet existing strategies (static prompt engineering or costly fine-tuning) fail to adapt personas to dynamic scenarios. Psychological theories, such as the Cognitive-Affective Personality Systems, provide a crucial explanation for this failure: a persona’s influence on behavior is not static but varies with the scenarios. This context-dependence highlights the critical need for adaptive persona management. To address this gap, we propose a novel, theory-driven method that dynamically estimates context-dependent persona importance and integrates it into weighted reward-guided decoding, enabling inference-time persona following. Specifically, we introduce the Persona Dynamic Decoding (PDD) framework, which consists of two key components: (1) Persona Importance Estimation (PIE) module, which dynamically quantifies the contextual importance of persona attributes without requiring ground-truth supervision; and (2) Persona-Guided Inference-Time Alignment (PIA) paradigm, which leverages these importance scores to construct weighted multi-objective rewards and modulate generation probabilities during inference. Extensive experiments show the effectiveness of our method in utterance consistency and behavioral fidelity.
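PIA 所述“用重要性得分构建加权多目标奖励并调制生成概率”,可用如下纯 Python 草图示意(假设性实现:persona_rewards 与 importance 的来源、β 超参均为示例假设;真实系统中重要性由 PIE 模块按情境动态估计):

```python
import math


def softmax(logits):
    """数值稳定的 softmax。"""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def persona_guided_probs(logits, persona_rewards, importance, beta=1.0):
    """按情境重要性加权各人格属性奖励,叠加到 logits 后重新归一化。

    persona_rewards: {属性名: 每个候选 token 的奖励列表}
    importance:      {属性名: 情境重要性得分}(此处为假设输入)
    """
    combined = list(logits)
    for attr, rewards in persona_rewards.items():
        w = importance.get(attr, 0.0)
        combined = [c + beta * w * r for c, r in zip(combined, rewards)]
    return softmax(combined)
```

重要性为 0 的属性对解码无影响;重要性越高,对应奖励对 token 分布的牵引越强,这正是“推理时人格跟随”的调制机制。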
[NLP-51] Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在扩展上下文窗口至10万以上token时,键值(Key-Value, KV)缓存成为主要内存瓶颈的问题。现有方法虽声称可实现80–90%的压缩率且对基准测试影响较小,但忽略了注意力机制本质上是信息路由而非单纯存储,保留KV对并不等同于保证语义可达性。论文提出一种受物理启发的视角,将KV压缩视为对token级路由的受控扰动,并区分保留(retention)、可达性(accessibility)和利用率(utilization)三个维度。其核心解决方案在于揭示:适度压缩会破坏内部表示但不显著影响准确率,表明存在冗余;接近90%压缩率时出现明显的幻觉安全悬崖(hallucination safety cliff),与全局淘汰率(Global Eviction Ratio, GER)激增相关,提示语义可达性的相变现象;不同架构(如LLaMA与Qwen)表现出不同的路由动态特性,进而形成差异化的鲁棒性分布。这表明稀疏的token-路由结构决定了压缩容忍度,从而将KV压缩重新定义为对注意力几何结构的探测工具,并将长上下文可扩展性与自注意力中的稀疏性和彩票票券假说(lottery ticket hypothesis)相联系。
链接: https://arxiv.org/abs/2603.01426
作者: Samhruth Ananthanarayanan,Ayan Sengupta,Tanmoy Chakraborty
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:As context windows in LLMs scale to 100K+ tokens, the key-value (KV) cache becomes the dominant memory bottleneck, with recent methods claiming 80-90% savings and minimal benchmark degradation. We argue these evaluations miss a structural issue: attention is not just storage but routing, and retaining KV pairs does not guarantee semantic accessibility. We propose a physics-inspired view of KV compression as a controlled perturbation of token-level routing, distinguishing retention, accessibility, and utilization. Using synthetic tasks probing multi-entity tracking, disambiguation, coreference, and multi-hop reasoning, we find that moderate compression degrades internal representations with little accuracy loss, revealing redundancy; all models exhibit a sharp hallucination safety cliff near 90% compression, correlated with spikes in Global Eviction Ratio (GER), suggesting a phase transition in semantic reachability; and architectures differ in routing dynamics, with LLaMA showing early consensus and late diversification, and Qwen showing funnel-like late convergence, leading to distinct resilience profiles. Beyond erasure, we identify representational rigidity, where excessive head-level consensus collapses routing flexibility despite token survival. These results suggest sparse token-route structures govern compression tolerance, reframing KV compression as a structural probe of attention geometry and linking long-context scalability to sparsity and the lottery ticket hypothesis in self-attention.
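文中“保留 KV 对并不等同于语义可达”可以用一个基于注意力分数的贪心淘汰草图来体会(假设性示意:真实 KV 压缩方法与 GER 的精确定义以论文为准):先按累计注意力保留 top-k 条目,再统计后续推理实际需要、却已被逐出的 token 占比。

```python
def evict_kv(attn_scores, keep_ratio):
    """按累计注意力分数保留 top-k 的 KV 条目,返回保留下标(升序)。"""
    n = len(attn_scores)
    k = max(1, int(n * keep_ratio))
    keep = sorted(range(n), key=lambda i: attn_scores[i], reverse=True)[:k]
    return sorted(keep)


def global_eviction_ratio(needed, kept):
    """GER 示意:被后续推理需要、却已被逐出的 token 占比(定义为本文假设)。"""
    needed = set(needed)
    evicted_needed = needed - set(kept)
    return len(evicted_needed) / max(1, len(needed))
```

按文中观察,当 keep_ratio 逼近 0.1(约 90% 压缩)时 GER 往往陡增,对应语义可达性的相变与幻觉安全悬崖。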
[NLP-52] Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction AAAI2026
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在真实场景中多轮对话下的可靠性问题,尤其是在跨话题切换、意图交错和结构化实体追踪等复杂交互情境下性能退化的问题。其解决方案的关键在于设计并实施一套系统性评估框架,通过三个典型任务对比单轮与多轮设置下的表现差异,量化LLMs在扩展对话中的可靠性下降程度,并识别出如指令漂移(instruction drift)、意图混淆(intent confusion)和上下文覆盖(contextual overwriting)等重复性失效模式,从而为提升模型在实际应用中的鲁棒性和可信部署提供依据。
链接: https://arxiv.org/abs/2603.01423
作者: Jiyoon Myung
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at the Workshop on Assessing and Improving Reliability of Foundation Models in the Real World (AAAI 2026)
Abstract:Large Language Models (LLMs) are increasingly deployed in real-world applications where users engage in extended, mixed-topic conversations that depend on prior context. Yet, their reliability under realistic multi-turn interactions remains poorly understood. We conduct a systematic evaluation of conversational reliability through three representative tasks that reflect practical interaction challenges: (1) maintaining global constraints across topic shifts, (2) selecting the correct tool or agent amid interleaved intents, and (3) tracking structured entities under revisions and distractions. Each task pairs single-turn and multi-turn settings, allowing us to quantify reliability degradation under extended dialogue. Across both commercial and open-source models, we observe substantial declines in reliability, particularly for smaller models. Error analyses reveal recurring failure modes such as instruction drift, intent confusion, and contextual overwriting, which compromise dependable behavior in operational systems. Our findings highlight the need for stress-testing LLMs for conversational reliability and developing more robust evaluation methods for trustworthy deployment.
[NLP-53] SciDER: Scientific Data-centric End-to-end Researcher
【速读】: 该论文旨在解决当前自动化科学发现中,现有智能体难以自主处理实验原始数据的问题。其解决方案的关键在于提出一个以数据为中心的端到端系统 SciDER,通过专用智能体协作解析和分析原始科学数据,基于数据特征生成假设与实验设计,并自动编写和执行代码;该系统还引入自进化记忆机制与批判驱动的反馈循环,显著提升了在特定数据驱动科学发现任务中的性能,优于通用型智能体和当前最先进的模型。
链接: https://arxiv.org/abs/2603.01421
作者: Ke Lin,Yilin Lu,Shreyas Bhat,Xuehang Guo,Junier Oliva,Qingyun Wang
机构: William & Mary(威廉玛丽学院); University of Minnesota(明尼苏达大学); University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages, 6 figures, 3 tables
Abstract:Automated scientific discovery with large language models is transforming the research lifecycle from ideation to experimentation, yet existing agents struggle to autonomously process raw data collected from scientific experiments. We introduce SciDER, a data-centric end-to-end system that automates the research lifecycle. Unlike traditional frameworks, our specialized agents collaboratively parse and analyze raw scientific data, generate hypotheses and experimental designs grounded in specific data characteristics, and write and execute corresponding code. Evaluation on three benchmarks shows SciDER excels in specialized data-driven scientific discovery and outperforms general-purpose agents and state-of-the-art models through its self-evolving memory and critic-led feedback loop. Distributed as a modular Python package, we also provide easy-to-use PyPI packages with a lightweight web interface to accelerate autonomous, data-driven research and aim to be accessible to all researchers and developers.
[NLP-54] Toward Graph-Tokenizing Large Language Models with Reconstructive Graph Instruction Tuning WWW2026
【速读】: 该论文旨在解决当前图令牌化大语言模型(Graph-Tokenizing LLMs, GTokenLLMs)在图-文本对齐过程中存在的文本主导偏差问题,即现有方法仅依赖语言指令的隐式监督,导致图结构信息未能被充分利用。解决方案的关键在于提出一种重构式的图指令微调框架RGLM,其核心思想是通过显式地从模型输出的图令牌中重建原始图信息,从而引入图监督信号以约束对齐过程。具体而言,RGLM从输入空间(RGLM-Decoder)和潜在空间(RGLM-Similarizer与RGLM-Denoiser)两个互补视角设计三种变体,并理论分析了各变体的对齐有效性,显著提升了图上下文的利用效率,为GTokenLLMs的对齐研究开辟了新方向。
链接: https://arxiv.org/abs/2603.01385
作者: Zhongjian Zhang,Xiao Wang,Mengmei Zhang,Jiarui Tan,Chuan Shi
机构: Beijing University of Posts and Telecommunications(北京邮电大学); Beihang University(北京航空航天大学); China Telecom Bestpay(中国电信翼支付)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: accepted by WWW 2026
Abstract:The remarkable success of large language models (LLMs) has motivated researchers to adapt them as universal predictors for various graph-related tasks, with the ultimate goal of developing a graph foundation model that generalizes diverse scenarios. The key challenge is to align graph data with language spaces so that LLMs can better comprehend graphs. As a popular paradigm, Graph-Tokenizing LLMs (GTokenLLMs) encode complex structures and lengthy texts into a graph token sequence, and then align them with text tokens via language instructions tuning. Despite their initial success, our information-theoretic analysis reveals that existing GTokenLLMs rely solely on text supervision from language instructions, which achieve only implicit graph-text alignment, resulting in a text-dominant bias that underutilizes graph context. To overcome this limitation, we first prove that the alignment objective is upper-bounded by the mutual information between the input graphs and their hidden representations in the LLM, which motivates us to improve this upper bound to achieve better alignment. To this end, we further propose a reconstructive graph instruction tuning pipeline, RGLM. Our key idea is to reconstruct the graph information from the LLM’s graph token outputs, explicitly incorporating graph supervision to constrain the alignment process. Technically, we embody RGLM by exploring three distinct variants from two complementary perspectives: RGLM-Decoder from the input space; RGLM-Similarizer and RGLM-Denoiser from the latent space. Additionally, we theoretically analyze the alignment effectiveness of each variant. Extensive experiments on various benchmarks and task scenarios validate the effectiveness of the proposed RGLM, paving the way for new directions in GTokenLLMs’ alignment research.
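RGLM“从图 token 输出显式重构图信息”的思想可以用一个输入空间(RGLM-Decoder 风格)的玩具示意表达(假设性实现:真实模型中的重构头与损失形式以论文为准):用图 token 表示的内积预测边,与真实邻接矩阵算二元交叉熵,再与语言指令损失加权相加。

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def edge_recon_loss(node_emb, adj):
    """用节点(图 token)表示的内积预测边,对上三角求平均交叉熵。"""
    n = len(node_emb)
    loss, cnt = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            score = sum(a * b for a, b in zip(node_emb[i], node_emb[j]))
            p = min(max(sigmoid(score), 1e-9), 1 - 1e-9)
            y = adj[i][j]
            loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            cnt += 1
    return loss / max(1, cnt)


def rglm_objective(text_nll, node_emb, adj, lam=0.5):
    """总目标 = 语言指令损失 + λ·图重构损失(λ 为假设超参)。"""
    return text_nll + lam * edge_recon_loss(node_emb, adj)
```

显式重构项迫使 LLM 的隐藏表示保留可恢复图结构的信息,对应文中“提升输入图与隐藏表示之间互信息上界”的动机,从而缓解 text-dominant bias。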
[NLP-55] End-to-End Simultaneous Dysarthric Speech Reconstruction with Frame-Level Adaptor and Multiple Wait-k Knowledge Distillation
【速读】: 该论文旨在解决传统级联式失语语音重建(Dysarthric Speech Reconstruction, DSR)系统中存在的高延迟与鲁棒性差的问题。具体而言,现有方法依赖自动语音识别(ASR)与句级文本到语音(TTS)的两级流程,在慢速发音的失语患者中导致响应时间过长;同时,由于不同严重程度的失语症患者对相同文本的发音差异显著,ASR模块易出错,进而影响TTS输出的可懂度;此外,增量式TTS因感受野受限,难以准确预测韵律特征。为解决上述问题,作者提出一种端到端的同步DSR系统,其核心创新在于:1)引入帧级适配器模块(frame-level adaptor module),通过显式-隐式语义信息融合与联合训练机制,增强TTS对ASR输出错误的容忍能力;2)设计多等待k自回归TTS模块(multiple wait-k autoregressive TTS module),利用多视角知识蒸馏缓解韵律退化问题,从而在保持低延迟的同时提升重建语音质量。
链接: https://arxiv.org/abs/2603.01382
作者: Minghui Wu,Haitao Tang,Jiahuan Fan,Ruizhi Liao,Yanyong Zhang
机构: 未知
类目: Sound (cs.SD); Computation and Language (cs.CL)
备注: Submitted to 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
Abstract:Dysarthric speech reconstruction (DSR) typically employs a cascaded system that combines automatic speech recognition (ASR) and sentence-level text-to-speech (TTS) to convert dysarthric speech into normally-prosodied speech. However, dysarthric individuals often speak more slowly, leading to excessively long response times in such systems, rendering them impractical in long-speech scenarios. Cascaded DSR systems based on streaming ASR and incremental TTS can help reduce latency. However, patients with differing dysarthria severity exhibit substantial pronunciation variability for the same text, resulting in poor robustness of ASR and limiting the intelligibility of reconstructed speech. In addition, incremental TTS suffers from poor prosodic feature prediction due to a limited receptive field. In this study, we propose an end-to-end simultaneous DSR system with two key innovations: 1) A frame-level adaptor module is introduced to bridge ASR and TTS. By employing explicit-implicit semantic information fusion and joint module training, it enhances the error tolerance of TTS to ASR outputs. 2) A multiple wait-k autoregressive TTS module is designed to mitigate prosodic degradation via multi-view knowledge distillation. Our system has an average response time of 1.03 seconds on Tesla A100, with an average real-time factor (RTF) of 0.71. On the UASpeech dataset, it attains a mean opinion score (MOS) of 4.67 and demonstrates a 54.25% relative reduction in word error rate (WER) compared to the state-of-the-art. Our demo is available at: this https URL
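文中 multiple wait-k 自回归 TTS 建立在经典 wait-k 同步策略之上:先读入 k 个源单元,此后读写交替。下面是该调度策略的最小示意(仅演示读/写决策,不含声学模型;论文实际使用多个 k 值做多视角知识蒸馏):

```python
def wait_k_policy(k, num_read, num_written, source_finished):
    """wait-k 决策:源已结束则一直写;否则读写差距达到 k 才写。"""
    if source_finished:
        return "WRITE"
    return "WRITE" if num_read - num_written >= k else "READ"


def simulate(k, source_len, target_len):
    """返回完整动作序列,展示延迟(前置 READ 数)与同步生成的节奏。"""
    actions, nr, nw = [], 0, 0
    while nw < target_len:
        a = wait_k_policy(k, nr, nw, nr >= source_len)
        actions.append(a)
        if a == "READ":
            nr += 1
        else:
            nw += 1
    return actions
```

k 越小延迟越低,但每次写时可见的源上下文越少,这正是端到端同步 DSR 需要用知识蒸馏弥补的感受野缺口。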
[NLP-56] DARS: Dysarthria-Aware Rhythm-Style Synthesis for ASR Enhancement
【速读】: 该论文旨在解决发音障碍语音(dysarthric speech)在自动语音识别(ASR)中因异常韵律和显著说话人差异而导致的识别性能下降问题。现有基于文本到语音(TTS)的数据增强方法难以准确建模病理性的节奏和声学特征。其解决方案的关键在于提出一种面向发音障碍的节奏-风格合成框架(DARS),该框架基于Matcha-TTS架构,引入多阶段节奏预测器(通过正常与发音障碍语音之间的对比偏好优化)以及病理性声学风格条件流匹配机制,协同提升时间节奏重建精度与病理声学风格模拟能力,从而生成高质量的合成发音障碍语音,显著改善下游ASR系统的识别性能。
链接: https://arxiv.org/abs/2603.01369
作者: Minghui Wu,Xueling Liu,Jiahuan Fan,Haitao Tang,Yanyong Zhang,Yue Zhang
机构: 未知
类目: Sound (cs.SD); Computation and Language (cs.CL)
备注: Submitted to 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
Abstract:Dysarthric speech exhibits abnormal prosody and significant speaker variability, presenting persistent challenges for automatic speech recognition (ASR). While text-to-speech (TTS)-based data augmentation has shown potential, existing methods often fail to accurately model the pathological rhythm and acoustic style of dysarthric speech. To address this, we propose DARS, a dysarthria-aware rhythm-style synthesis framework based on the Matcha-TTS architecture. DARS incorporates a multi-stage rhythm predictor optimized by contrastive preferences between normal and dysarthric speech, along with a dysarthric-style conditional flow matching mechanism, jointly enhancing temporal rhythm reconstruction and pathological acoustic style simulation. Experiments on the TORGO dataset demonstrate that DARS achieves a Mean Cepstral Distortion (MCD) of 4.29, closely approximating real dysarthric speech. Adapting a Whisper-based ASR system with synthetic dysarthric speech from DARS achieves a 54.22% relative reduction in word error rate (WER) compared to state-of-the-art methods, demonstrating the framework’s effectiveness in enhancing recognition performance.
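DARS 建立在条件流匹配(conditional flow matching)之上,其通用形式可用线性插值路径的最小示意说明(标准 CFM 写法;DARS 额外以病理声学风格为条件,此处未体现):取 x_t = (1-t)·x0 + t·x1,网络回归目标速度 v = x1 - x0。

```python
def cfm_sample(x0, x1, t):
    """线性插值路径上的一点及其目标速度场。"""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return xt, v_target


def cfm_loss(v_pred, v_target):
    """速度场回归的均方误差,即 CFM 训练目标。"""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_pred)
```

训练时对每个样本随机采 t,让速度预测网络(在 DARS 中附加病理风格条件)拟合 v_target;推理时沿学到的速度场从噪声积分即可生成梅尔谱。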
[NLP-57] NM-DEKL3_infty: A Three-Layer Non-Monotone Evolving Dependent Type Logic
【速读】: 该论文旨在解决动态环境中演化知识的形式化建模问题,特别是如何在依赖类型系统中刻画随时间变化的知识结构及其逻辑性质。解决方案的关键在于提出一种三层架构的依赖类型系统 NM-DEKL^3_∞(Non-Monotone Dependent Knowledge-Enhanced Logic),其中包含计算层、构造性知识层和命题知识层,从而分离不同粒度的知识表达与推理机制。该系统通过定义严格的语法与语义,并证明其Soundness(保真性)和Equational Completeness(等式完备性),构建了一个初始模型,确保形式化体系的一致性和表达能力;此外,还通过嵌入到μ-演算并展示非同构不变性质的可表达性,进一步验证了其严格更强的表达力。
链接: https://arxiv.org/abs/2603.01366
作者: Peng Chen
机构: 未知
类目: Logic in Computer Science (cs.LO); Computation and Language (cs.CL)
备注:
Abstract:We present a new dependent type system, NM-DEKL^3_∞ (Non-Monotone Dependent Knowledge-Enhanced Logic), for formalising evolving knowledge in dynamic environments. The system uses a three-layer architecture separating a computational layer, a constructive knowledge layer, and a propositional knowledge layer. We define its syntax and semantics and establish Soundness and Equational Completeness; we construct a syntactic model and prove that it is initial in the category of models, from which equational completeness follows. We also give an embedding into the μ-calculus and a strict expressiveness inclusion (including the expressibility of non-bisimulation-invariant properties).
[NLP-58] Constructing Synthetic Instruction Datasets for Improving Reasoning in Domain-Specific LLMs: A Case Study in the Japanese Financial Domain
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在特定领域适配过程中,同时实现领域专业知识与推理能力提升的难题。其解决方案的关键在于提出一种通用方法,通过从领域特定词汇出发构建高质量的合成指令数据(synthetic instruction data),并在此基础上引入链式思维(Chain-of-Thought, CoT)推理轨迹,从而增强模型在金融领域的推理表现。实验表明,该方法显著提升了模型在金融基准测试中的性能,验证了其有效性。
链接: https://arxiv.org/abs/2603.01353
作者: Yuma Okochi,Fabio Milentiansen Sim,Tomoyasu Okada
机构: Nomura Research Institute, Ltd.(野村综合研究所); NRI Indonesia(野村印度尼西亚)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 8 pages, 2 figures. Japanese version published in NLP2026
Abstract:In adapting LLMs to specific domains, achieving both domain expertise and reasoning ability remains an urgent challenge. This study proposes a general method for constructing high-quality synthetic instruction data for any domain, starting from domain-specific vocabulary. As a demonstration, we applied this method to the financial domain and constructed a large-scale instruction dataset totaling approximately 9.5 billion tokens with Chain-of-Thought reasoning traces. Evaluation results confirmed performance improvements over baseline models on financial benchmarks, demonstrating the effectiveness of our approach. We also report findings on the impact of reasoning trace length on performance and its limitations. Lastly, we open-source our models and datasets on this https URL .
[NLP-59] PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在临床场景中应用时的评估局限性问题,即现有评测框架难以准确反映模型在真实患者咨询中的实用性与安全性,尤其是对特定疾病(如胰腺癌)的深度理解能力不足,且高评分并不等同于事实准确性,存在显著幻觉(hallucination)风险。其解决方案的关键在于构建一个基于真实患者提问的人工智能参与式基准测试体系——PanCanBench,该体系包含3,130个针对282个去标识化患者问题的专家级评判标准,并采用LLM-as-a-judge框架对22种主流LLM进行多维评估,涵盖临床完整性、事实正确性和网络搜索整合能力,从而实现更贴近临床实践的、可量化的性能评价。
链接: https://arxiv.org/abs/2603.01343
作者: Yimin Zhao,Sheela R. Damle,Simone E. Dekker,Scott Geng,Karly Williams Silva,Jesse J Hubbard,Manuel F Fernandez,Fatima Zelada-Arenas,Alejandra Alvarez,Brianne Flores,Alexis Rodriguez,Stephen Salerno,Carrie Wright,Zihao Wang,Pang Wei Koh,Jeffrey T. Leek
机构: University of Washington (华盛顿大学); Fred Hutch Cancer Center (弗雷德·哈钦森癌症研究中心); Allen Institute for AI (艾伦人工智能研究所); Pancreatic Cancer Action Network (胰腺癌行动网络)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety. As patients and clinicians increasingly use LLMs for guidance on complex conditions such as pancreatic cancer, evaluation must extend beyond general medical knowledge. Existing frameworks, such as HealthBench, rely on simulated queries and lack disease-specific depth. Moreover, high rubric-based scores do not ensure factual correctness, underscoring the need to assess hallucinations. We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions from the Pancreatic Cancer Action Network (PanCAN). The resulting benchmark, PanCanBench, includes 3,130 question-specific criteria across 282 authentic patient questions. We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration. Models showed substantial variation in rubric-based completeness, with scores ranging from 46.5% to 82.3%. Factual errors were common, with hallucination rates (the percentages of responses containing at least one factual error) ranging from 6.0% for Gemini-2.5 Pro and GPT-4o to 53.8% for Llama-3.1-8B. Importantly, newer reasoning-optimized models did not consistently improve factuality: although o3 achieved the highest rubric score, it produced inaccuracies more frequently than other GPT-family models. Web-search integration did not inherently guarantee better responses. The average score changed from 66.8% to 63.9% for Gemini-2.5 Pro and from 73.8% to 72.8% for GPT-5 when web search was enabled. Synthetic AI-generated rubrics inflated absolute scores by 17.9 points on average while generally maintaining similar relative ranking.
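PanCanBench 的两个核心指标,即基于评分标准的完整性得分,以及按“至少含一个事实错误的回答占比”定义的幻觉率,计算逻辑可用如下草图示意(假设性实现;真实评测中每条标准是否满足由 LLM-as-a-judge 判定):

```python
def rubric_score(response_meets, criteria):
    """完整性得分:回答满足的评分标准占比。response_meets: 标准 -> bool。"""
    if not criteria:
        return 0.0
    return sum(1 for c in criteria if response_meets.get(c, False)) / len(criteria)


def hallucination_rate(error_counts):
    """幻觉率:至少含一个事实错误的回答占比(与论文定义一致)。

    error_counts: 每个回答中检出的事实错误数。
    """
    if not error_counts:
        return 0.0
    return sum(1 for n in error_counts if n >= 1) / len(error_counts)
```

两个指标相互独立:一条回答可以覆盖大量评分标准(高完整性)却仍含事实错误,这正是文中强调“高评分不等于事实准确”的原因。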
[NLP-60] MetaState: Persistent Working Memory for Discrete Diffusion Language Models
【速读】: 该论文旨在解决离散扩散语言模型(Discrete Diffusion Language Models, dLLMs)中存在的“信息孤岛”(Information Island)问题,即标准dLLM在每一步去噪过程中仅依赖当前硬掩码序列,而丢弃了中间的连续表示,导致跨步骤冗余计算和一致性下降。解决方案的关键在于提出一种轻量级递归增强模块——MetaState,其通过引入一个与序列长度无关的固定大小工作记忆,实现跨步骤的信息持久化与整合:该模块包含三个可训练组件——交叉注意力混合器(Mixer)将骨干网络激活值读入记忆槽、GRU风格更新器(Updater)跨步骤融合信息、以及交叉注意力注入器(Injector)将更新后的记忆反馈至骨干激活值;并通过K步展开训练使模块在微调阶段感知多步去噪动态,从而显著提升生成质量且几乎不增加参数量。
链接: https://arxiv.org/abs/2603.01331
作者: Kejing Xia,Mingzhe Li,Lixuan Wei,Zhenbang Du,Xiangchi Yuan,Qirui Jin,Wenke Lee
机构: Georgia Institute of Technology; University of Massachusetts Amherst; Harvard University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Discrete diffusion language models (dLLMs) generate text by iteratively denoising a masked sequence. Compared with autoregressive models, this paradigm naturally supports parallel decoding, bidirectional context, and flexible generation patterns. However, standard dLLMs condition each denoising step only on the current hard-masked sequence, while intermediate continuous representations are discarded after sampling and remasking. We refer to this bottleneck as the Information Island problem. It leads to redundant recomputation across steps and can degrade cross-step consistency. We address this limitation with MetaState, a lightweight recurrent augmentation that equips a frozen dLLM backbone with a persistent, fixed-size working memory that remains independent of sequence length. MetaState consists of three trainable modules: a cross-attention Mixer that reads backbone activations into memory slots, a GRU-style Updater that integrates information across denoising steps, and a cross-attention Injector that feeds the updated memory back into backbone activations. We train these modules with K-step unrolling to expose them to multi-step denoising dynamics during fine-tuning. On LLaDA-8B and Dream-7B, MetaState introduces negligible trainable parameters while keeping the backbone frozen, and it consistently improves accuracy over frozen baselines. These results demonstrate that persistent cross-step memory is an effective mechanism for bridging denoising steps and improving generation quality in discrete diffusion language models.
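MetaState 的 Mixer→Updater→Injector 回路可以用省略了线性投影与可学习门控的纯 Python 草图示意(假设性简化:真实实现中 QKV 投影和 GRU 门控均为可训练模块,这里分别省略、用常数门控代替):

```python
import math


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """单头、无投影的 cross-attention 读操作。"""
    w = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(wi * v[d] for wi, v in zip(w, values)) for d in range(dim)]


def metastate_step(memory, hidden, z_gate=0.5):
    """一步去噪中的记忆更新:Mixer 把骨干激活读入每个记忆槽,
    Updater 以 GRU 风格插值融合(真实实现中 z 由门控网络给出)。"""
    new_memory = []
    for slot in memory:
        read = attend(slot, hidden, hidden)                 # Mixer:读入
        new_memory.append([(1 - z_gate) * s + z_gate * r    # Updater:门控融合
                           for s, r in zip(slot, read)])
    return new_memory


def inject(memory, hidden):
    """Injector:每个位置的激活加上对记忆的注意力读出(残差式)。"""
    return [[h + r for h, r in zip(vec, attend(vec, memory, memory))]
            for vec in hidden]
```

注意记忆槽数量与序列长度无关:hidden 再长,metastate_step 返回的记忆尺寸不变,这正是“固定大小工作记忆”跨去噪步持久传递信息的含义。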
[NLP-61] SWE-Adept: An LLM-Based Agentic Framework for Deep Codebase Analysis and Structured Issue Resolution
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在仓库级软件工程(Repository-level Software Engineering, SWE)任务中表现不足的问题,具体包括:(1) 在复杂代码库中进行精准的代码定位,需有效管理上下文以减少无关内容干扰;(2) 实现系统性的迭代式、测试驱动的代码修改策略以修复问题。其解决方案的关键在于提出SWE-Adept框架,该框架采用双代理结构——定位代理(localization agent)和修复代理(resolution agent),其中定位代理通过代理引导的深度优先搜索(agent-directed depth-first search)实现对代码依赖关系的选择性遍历,从而提升定位准确性;修复代理则结合自适应规划与结构化问题求解机制,并集成版本控制工具与共享工作内存(shared working memory),支持基于执行步骤索引的代码状态快照存储与精确回溯,从而实现可靠的分支探索与失败编辑回退,显著提升了端到端问题解决率(最高提升4.7%)。
链接: https://arxiv.org/abs/2603.01327
作者: Kang He,Kaushik Roy
机构: 未知
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) exhibit strong performance on self-contained programming tasks. However, they still struggle with repository-level software engineering (SWE), which demands (1) deep codebase navigation with effective context management for accurate localization, and (2) systematic approaches for iterative, test-driven code modification to resolve issues. To address these challenges, we propose SWE-Adept, an LLM-based two-agent framework where a localization agent identifies issue-relevant code locations and a resolution agent implements the corresponding fixes. For issue localization, we introduce agent-directed depth-first search that selectively traverses code dependencies. This minimizes issue-irrelevant content in the agent’s context window and improves localization accuracy. For issue resolution, we employ adaptive planning and structured problem solving. We equip the agent with specialized tools for progress tracking and Git-based version control. These tools interface with a shared working memory that stores code-state checkpoints indexed by execution steps, facilitating precise checkpoint retrieval. This design enables reliable agent-driven version-control operations for systematic issue resolution, including branching to explore alternative solutions and reverting failed edits. Experiments on SWE-Bench Lite and SWE-Bench Pro demonstrate that SWE-Adept consistently outperforms prior approaches in both issue localization and resolution, improving the end-to-end resolve rate by up to 4.7%.
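定位代理的“代理引导深度优先搜索”可抽象为:在代码依赖图上做 DFS,但每一步由代理判断该依赖是否与 issue 相关,不相关则剪枝,从而减少上下文窗口中的无关内容。最小示意如下(假设性实现,用一个谓词函数模拟代理的相关性判断):

```python
def agent_dfs(start, deps, is_relevant, max_depth=3):
    """只沿被判定为相关的依赖下钻的深度优先遍历,返回访问顺序。

    deps:        节点 -> 依赖列表(代码依赖图)
    is_relevant: 模拟 LLM 代理对“是否值得继续下钻”的判断
    """
    visited, order = set(), []

    def walk(node, depth):
        if node in visited or depth > max_depth:
            return
        visited.add(node)
        order.append(node)
        for child in deps.get(node, []):
            if is_relevant(child):      # 代理决定是否继续下钻
                walk(child, depth + 1)

    walk(start, 0)
    return order
```

被剪掉的子树(及其全部后代)不会进入上下文,这就是文中“最小化上下文窗口中与 issue 无关内容”的机制来源。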
[NLP-62] Truth as a Trajectory: What Internal Representations Reveal About Large Language Model Reasoning
【速读】: 该论文旨在解决现有大语言模型(Large Language Models, LLMs)可解释性方法中对隐藏状态的静态假设所带来的局限性问题,即传统方法将各层激活视为静态点,导致线性探测器仅学习到表面词汇模式而非深层推理结构。其解决方案的关键在于提出“真相即轨迹”(Truth as a Trajectory, TaT),将Transformer推理过程建模为逐层迭代优化的轨迹,通过分析表示在不同层间的几何位移(geometric displacement),识别出能够区分有效推理与虚假行为的几何不变量(geometric invariants)。此方法不依赖原始激活值本身,仅利用层间变化信息,从而有效缓解对静态词汇混淆因子的依赖,并在常识推理、问答和毒性检测等任务上显著优于传统探测方法。
链接: https://arxiv.org/abs/2603.01326
作者: Hamed Damirchi,Ignacio Meza De la Jara,Ehsan Abbasnejad,Afshar Shamsi,Zhen Zhang,Javen Shi
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Existing explainability methods for Large Language Models (LLMs) typically treat hidden states as static points in activation space, assuming that correct and incorrect inferences can be separated using representations from an individual layer. However, these activations are saturated with polysemantic features, leading to linear probes learning surface-level lexical patterns rather than underlying reasoning structures. We introduce Truth as a Trajectory (TaT), which models the transformer inference as an unfolded trajectory of iterative refinements, shifting analysis from static activations to layer-wise geometric displacement. By analyzing displacement of representations across layers, TaT uncovers geometric invariants that distinguish valid reasoning from spurious behavior. We evaluate TaT across dense and Mixture-of-Experts (MoE) architectures on benchmarks spanning commonsense reasoning, question answering, and toxicity detection. Without access to the activations themselves and using only changes in activations across layers, we show that TaT effectively mitigates reliance on static lexical confounds, outperforming conventional probing, and establishes trajectory analysis as a complementary perspective on LLM explainability.
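TaT 只使用“层间变化”而非激活本身。下面的草图演示如何从一串逐层隐藏状态中提取两类轨迹特征:每步位移的范数,以及相邻位移方向的余弦(方向变化);具体特征集为本文假设性示意,以论文为准。

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def trajectory_features(layer_states):
    """从逐层隐藏状态提取轨迹特征,不直接使用激活值本身。

    返回 (每步位移范数列表, 相邻位移方向余弦列表)。
    """
    disp = [[b - a for a, b in zip(h0, h1)]
            for h0, h1 in zip(layer_states, layer_states[1:])]
    mags = [norm(d) for d in disp]
    cosines = []
    for d0, d1 in zip(disp, disp[1:]):
        denom = norm(d0) * norm(d1)
        cosines.append(sum(x * y for x, y in zip(d0, d1)) / denom
                       if denom else 0.0)
    return mags, cosines
```

由于探测器只看位移,静态的词汇性混淆因子(在各层位置上基本不变的分量)会在差分中被消去,这是 TaT 区别于传统线性探测的关键。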
[NLP-63] Catalyst-Agent: Autonomous heterogeneous catalyst screening and optimization with an LLM Agent
【速读】: 该论文旨在解决催化剂材料发现过程中传统实验试错法耗时昂贵、以及第一性原理计算资源密集的问题,从而加速新型催化剂的筛选与设计。其解决方案的关键在于构建了一个基于模型上下文协议(Model Context Protocol, MCP)服务器的大型语言模型(Large Language Model, LLM)驱动的AI代理——Catalyst-Agent,该代理能够自主调用OPTIMADE API访问材料数据库、利用Meta FAIRchem的UMA图神经网络(Graph Neural Network, GNN)模型计算吸附能,并通过FAIRchem的AdsorbML工作流进行表面结构优化与近似候选物迭代,在闭环流程中实现高效且高精度的催化剂筛选。该方法在氧还原反应(ORR)、氮还原反应(NRR)和二氧化碳还原反应(CO2RR)三个关键催化体系中验证了23–34%的成功率及平均1–2次迭代即可收敛的效率,展现了AI代理在科学发现中规划与工具使用能力的巨大潜力。
链接: https://arxiv.org/abs/2603.01311
作者: Achuth Chandrasekhar,Janghoon Ock,Amir Barati Farimani
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:The discovery of novel catalysts tailored for particular applications is a major challenge for the twenty-first century. Traditional methods for this include time-consuming and expensive experimental trial-and-error approaches in labs based on chemical theory or heavily computational first-principles approaches based on density functional theory. Recent studies show that deep learning models like graph neural networks (GNNs) can significantly speed up the screening and discovery of catalyst materials by many orders of magnitude, with very high accuracy and fidelity. In this work, we introduce Catalyst-Agent, a Model Context Protocol (MCP) server-based, LLM-powered AI agent. It can explore vast material databases using the OPTIMADE API, make structural modifications, calculate adsorption energies using Meta FAIRchem’s UMA (GNN) model via FAIRchem’s AdsorbML workflow and slab construction, and make useful material suggestions to the researcher in a closed-loop manner, including surface-level modifications to refine near-miss candidates. It is tested on three pivotal reactions: the oxygen reduction reaction (ORR), the nitrogen reduction reaction (NRR), and the CO2 reduction reaction (CO2RR). Catalyst-Agent achieves a success rate of 23-34 percent among all the materials it chooses and evaluates, and manages to converge in 1-2 trials per successful material on average. This work demonstrates the potential of AI agents to exercise their planning capabilities and tool use to operationalize the catalyst screening workflow, provide useful, testable hypotheses, and accelerate future scientific discoveries for humanity with minimal human intervention.
[NLP-64] I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift ICLR2026
【速读】: 该论文旨在解决生成式 AI (Generative AI) 安全架构中一个关键假设的可靠性问题,即:安全分类器在模型更新后是否仍能保持稳定性能。研究发现,即使微小的嵌入空间扰动(如 σ=0.02,对应嵌入球面上约1°的角度偏移)也会导致分类器性能从85% ROC-AUC骤降至50%,且多数误分类发生在高置信度区间(72%),造成“无声失效”(silent failures),严重削弱现有监控机制的有效性。解决方案的关键在于揭示了指令微调(instruction-tuned)模型相较于基础模型存在20%更差的类别可分性,说明对齐过程反而降低了系统的可防护性,从而挑战了安全机制可跨版本迁移的传统认知,并呼吁重新设计更具鲁棒性的AI安全架构。
链接: https://arxiv.org/abs/2603.01297
作者: Subramanyam Sahoo,Vinija Jain,Divya Chaudhary,Aman Chadha
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted at the ICBINB: Where LLMs Need to Improve workshop at ICLR 2026. 12 pages and 3 Figures
Abstract:Instruction tuned reasoning models are increasingly deployed with safety classifiers trained on frozen embeddings, assuming representation stability across model updates. We systematically investigate this assumption and find it fails: normalized perturbations of magnitude σ=0.02 (corresponding to ≈1° angular drift on the embedding sphere) reduce classifier performance from 85% to 50% ROC-AUC. Critically, mean confidence only drops 14%, producing dangerous silent failures where 72% of misclassifications occur with high confidence, defeating standard monitoring. We further show that instruction-tuned models exhibit 20% worse class separability than base models, making aligned systems paradoxically harder to safeguard. Our findings expose a fundamental fragility in production AI safety architectures and challenge the assumption that safety mechanisms transfer across model versions.
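文中“σ=0.02 的归一化扰动对应嵌入球面上约 1° 的角度漂移”可以直接验证:在单位向量上叠加范数为 0.02 的正交扰动并重新归一化,夹角约为 arctan(0.02) ≈ 1.15°。示意如下:

```python
import math


def angular_drift_deg(emb, noise):
    """把扰动加到单位化后的嵌入上并重新归一化,返回角度漂移(度)。"""
    n = math.sqrt(sum(x * x for x in emb))
    unit = [x / n for x in emb]
    pert = [u + d for u, d in zip(unit, noise)]
    pn = math.sqrt(sum(x * x for x in pert))
    cos = sum(u * p / pn for u, p in zip(unit, pert))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))
```

如此微小的几何漂移即可令冻结嵌入上训练的安全分类器从 85% ROC-AUC 跌至随机水平,这正是文中“表示稳定性假设失效”的量化含义。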
[NLP-65] JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks ICLR2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多语言和多地区背景下对越狱攻击(jailbreak attacks)诱导的虚假新闻生成缺乏系统性评估基准的问题。现有安全评测未充分覆盖区域特异性语境,导致模型防御能力存在显著的语言与地域不平衡。解决方案的关键在于提出首个面向跨语言、跨区域的基准测试框架——JailNewsBench,其包含34个地区、22种语言、约30万条样本,并通过5种越狱攻击策略与LLM-as-a-Judge机制量化模型的攻击成功率(ASR)与有害性得分,从而揭示当前主流多语言模型在英语及美国相关话题上的防护短板,为提升LLMs在不同文化语境下的安全性提供可量化的评估依据。
链接: https://arxiv.org/abs/2603.01291
作者: Masahiro Kaneko,Ayana Niwa,Timothy Baldwin
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: ICLR 2026
Abstract:Fake news undermines societal trust and decision-making across politics, economics, health, and international relations, and in extreme cases threatens human lives and societal safety. Because fake news reflects region-specific political, social, and cultural contexts and is expressed in language, evaluating the risks of large language models (LLMs) requires a multi-lingual and regional perspective. Malicious users can bypass safeguards through jailbreak attacks, inducing LLMs to generate fake news. However, no benchmark currently exists to systematically assess attack resilience across languages and regions. Here, we propose JailNewsBench, the first benchmark for evaluating LLM robustness against jailbreak-induced fake news generation. JailNewsBench spans 34 regions and 22 languages, covering 8 evaluation sub-metrics through LLM-as-a-Judge and 5 jailbreak attacks, with approximately 300k instances. Our evaluation of 9 LLMs reveals that the maximum attack success rate (ASR) reached 86.3% and the maximum harmfulness score was 3.5 out of 5. Notably, for English and U.S.-related topics, the defensive performance of typical multi-lingual LLMs was significantly lower than for other regions, highlighting substantial imbalances in safety across languages and regions. In addition, our analysis shows that coverage of fake news in existing safety datasets is limited and less well defended than major categories such as toxicity and social bias. Our dataset and code are available at this https URL.
[NLP-66] Individual Turing Test: A Case Study of LLM -based Simulation Using Longitudinal Personal Data
【速读】: 该论文旨在解决如何利用大语言模型(Large Language Models, LLMs)对特定个体进行有效模拟的问题,核心挑战在于评估LLM能否在长期对话历史基础上复现目标个体的语言风格、思维模式与个性特征。解决方案的关键在于提出“个体图灵测试”(Individual Turing Test),通过让熟悉目标个体的熟人从多个候选回复中识别出最可能由该个体生成的回答,从而客观衡量LLM模拟的真实性。研究系统比较了微调(fine-tuning)、检索增强生成(retrieval-augmented generation, RAG)、基于记忆的方法以及混合策略等主流技术路径,发现当前方法尚无法通过熟人视角的个体图灵测试,但在陌生人视角下表现显著提升;同时揭示了参数化方法(如微调)更擅长捕捉日常语言风格,而非参数化方法(如RAG和记忆机制)则在涉及个人观点和偏好时更具优势,凸显了两类方法在纵向语境下用于个体模拟时的根本权衡。
链接: https://arxiv.org/abs/2603.01289
作者: Minghao Guo,Ziyi Ye,Wujiang Xu,Xi Zhu,Wenyue Hua,Dimitris N. Metaxas
机构: 未知
类目: Computation and Language (cs.CL)
备注: 5 pages, 2 figures
Abstract:Large Language Models (LLMs) have demonstrated remarkable human-like capabilities, yet their ability to replicate a specific individual remains under-explored. This paper presents a case study to investigate LLM-based individual simulation with a volunteer-contributed archive of private messaging history spanning over ten years. Based on the messaging data, we propose the “Individual Turing Test” to evaluate whether acquaintances of the volunteer can correctly identify which response in a multi-candidate pool most plausibly comes from the volunteer. We investigate prevalent LLM-based individual simulation approaches including: fine-tuning, retrieval-augmented generation (RAG), memory-based approach, and hybrid methods that integrate fine-tuning and RAG or memory. Empirical results show that current LLM-based simulation methods do not pass the Individual Turing Test, but they perform substantially better when the same test is conducted on strangers to the target individual. Additionally, while fine-tuning improves the simulation in daily chats representing the language style of the individual, retrieval-augmented and memory-based approaches demonstrate stronger performance on questions involving personal opinions and preferences. These findings reveal a fundamental trade-off between parametric and non-parametric approaches to individual simulation with LLMs when given a longitudinal context.
[NLP-67] Efficient Extractive Summarization with MAMBA-Transformer Hybrids for Low-Resource Scenarios
【速读】: 该论文旨在解决长文档摘要生成中因传统Transformer模型存在二次时间复杂度而导致的计算瓶颈问题,这一瓶颈通常迫使模型对输入进行截断,限制了其在资源受限环境下的部署。解决方案的关键在于提出首个Mamba-Transformer混合架构,将预训练Transformer的语义理解能力与状态空间模型(State Space Model, SSM)的线性时间处理优势相结合:具体包括使用Transformer编码器捕获句子级语义、利用Mamba模型高效建模跨句依赖关系,并通过线性分类器预测句子相关性。该方法在不截断文档的前提下实现了高质量摘要生成,在低资源条件下显著优于BERTSUM和MATCHSUM等基线模型,且推理速度提升24–27%。
链接: https://arxiv.org/abs/2603.01288
作者: Nisrine Ait Khayi
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Extractive summarization of long documents is bottlenecked by quadratic complexity, often forcing truncation and limiting deployment in resource-constrained settings. We introduce the first Mamba-Transformer hybrid for extractive summarization, combining the semantic strength of pre-trained transformers with the linear-time processing of state space models. Leveraging Mamba’s ability to process full documents without truncation, our approach preserves context while maintaining strong summarization quality. The architecture includes: (1) a transformer encoder for sentence-level semantics, (2) a Mamba state space model to capture inter-sentence dependencies efficiently, and (3) a linear classifier for sentence relevance prediction. Across news, argumentative, and scientific domains under low-resource conditions, our method achieves: (1) large gains over BERTSUM and MATCHSUM, including +0.23 ROUGE-1 on ArXiv and statistically significant improvements on all datasets (p 0.001); (2) consistent advantages across domains, strongest on the longest documents; (3) robust performance with limited training data; and (4) 24-27% faster inference on news summarization (CNN/DailyMail). We introduce the first hybrid Transformer-state space architecture for summarization, showing significant ROUGE improvements in low-resource scenarios.
[NLP-68] Attention Smoothing Is All You Need For Unlearning ICLR2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)容易记忆敏感、受版权保护或有害内容所带来的隐私与法律风险问题。现有遗忘方法在遗忘效果与模型功能保持之间存在不稳定权衡,常导致对遗忘提示的输出不连贯,且因注意力机制中词汇级和语义级关联的持续存在而难以泛化。解决方案的关键在于提出一种名为注意力平滑遗忘(Attention Smoothing Unlearning, ASU)的原理性框架,其核心思想是将遗忘建模为从模型自身注意力结构中提取的“遗忘教师”进行自蒸馏的过程;通过提高softmax温度来平滑注意力分布,直接抑制导致记忆重构的词汇级和语义级关联,从而实现一个有界优化目标——既能擦除事实性信息,又能保持对遗忘提示响应的连贯性。
链接: https://arxiv.org/abs/2603.01285
作者: Saleh Zare Zade,Xiangyu Zhou,Sijia Liu,Dongxiao Zhu
机构: Wayne State University (韦恩州立大学); Michigan State University (密歇根州立大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by ICLR 2026
Abstract:Large Language Models are prone to memorizing sensitive, copyrighted, or hazardous content, posing significant privacy and legal concerns. Retraining from scratch is computationally infeasible, whereas current unlearning methods exhibit unstable trade-offs between forgetting and utility, frequently producing incoherent outputs on forget prompts and failing to generalize due to the persistence of lexical-level and semantic-level associations in attention. We propose Attention Smoothing Unlearning (ASU), a principled framework that casts unlearning as self-distillation from a forget-teacher derived from the model’s own attention. By increasing the softmax temperature, ASU flattens attention distributions and directly suppresses the lexical-level and semantic-level associations responsible for reconstructing memorized knowledge. This results in a bounded optimization objective that erases factual information yet maintains coherence in responses to forget prompts. Empirical evaluation on TOFU, MUSE, and WMDP, along with real-world and continual unlearning scenarios across question answering and text completion, demonstrates that ASU outperforms the baselines for most unlearning scenarios, delivering robust unlearning with minimal loss of model utility.
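ASU 的核心操作——提高 softmax 温度以平滑注意力分布——可用如下示意代码说明(logits 数值为假设,非论文实现):温度升高后,主导 token 的注意力权重被削弱,分布熵增大。

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def entropy(p):
    return float(-(p * np.log(p)).sum())

# Raw attention logits for one query over six keys (hypothetical values):
# one token dominates, as happens with memorized lexical associations.
logits = np.array([6.0, 1.0, 0.5, 0.2, 0.1, 0.0])

attn_sharp = softmax(logits)          # temperature T = 1: peaked attention
attn_smooth = softmax(logits / 2.0)   # T = 2: flattened, as ASU's forget-teacher uses

# Raising the temperature spreads attention mass, suppressing the dominant association.
print(attn_sharp.max(), attn_smooth.max())
print(entropy(attn_sharp), entropy(attn_smooth))
```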
[NLP-69] Spectral Attention Steering for Prompt Highlighting ICLR2026
【速读】: 该论文旨在解决现有注意力控制(attention steering)方法在内存效率上的局限性问题,特别是这些方法通常需要显式存储完整的注意力矩阵,从而与高效的注意力实现(如FlashAttention)不兼容。解决方案的关键在于提出一种无需训练的注意力引导方法——谱编辑键增强(Spectral Editing Key Amplification, SEKA),其核心思想是在注意力计算前直接对键嵌入(key embeddings)进行谱分解(spectral decomposition),将其引导至能够增强特定标记注意力分数的潜在方向。进一步地,论文还提出了自适应SEKA(AdaSEKA),通过一个无训练的路由机制动态结合多个专家子空间,以响应提示语义意图。该方法在保持低延迟和低内存开销的同时显著优于现有基线,在兼容优化注意力架构的前提下实现了高效、灵活的注意力控制。
链接: https://arxiv.org/abs/2603.01281
作者: Weixian Waylon Li,Yuchen Niu,Yongxin Yang,Keshuang Li,Tiejun Ma,Shay B. Cohen
机构: University of Edinburgh (爱丁堡大学); RayNeo (雷诺); Huawei Research Ltd. (华为研究有限公司); Queen Mary University of London (伦敦玛丽女王大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ICLR 2026 (Poster, Top 4%)
Abstract:Attention steering is an important technique for controlling model focus, enabling capabilities such as prompt highlighting, where the model prioritises user-specified text. However, existing attention steering methods require explicit storage of the full attention matrix, making them incompatible with memory-efficient implementations like FlashAttention. We introduce Spectral Editing Key Amplification (SEKA), a training-free steering method that tackles this by directly editing key embeddings before attention computation. SEKA uses spectral decomposition to steer key embeddings towards latent directions that amplify attention scores for certain tokens. We extend this to Adaptive SEKA (AdaSEKA), a query-adaptive variant that uses a training-free routing mechanism to dynamically combine multiple expert subspaces based on the prompt’s semantic intent. Our experiments show that both methods significantly outperform strong baselines on standard steering benchmarks while adding far lower latency and memory overhead, and remain compatible with optimised attention implementations.
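SEKA 的谱编辑思路可用如下简化示例说明(仅为示意,非论文实现:对高亮 token 的键向量,沿其键矩阵的首个奇异向量方向放大,使对应注意力分数上升;α 等超参数为假设值):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(1)
d, n = 64, 10
q = rng.standard_normal(d)        # one query vector
K = rng.standard_normal((n, d))   # key embeddings for n tokens
highlight = [3, 7]                # user-highlighted token positions (hypothetical)

# Spectral step: the top right-singular vector of the highlighted keys gives a
# latent direction shared by those tokens (a sketch of SEKA's decomposition).
_, _, Vt = np.linalg.svd(K[highlight], full_matrices=False)
v = Vt[0]

alpha = 2.0  # amplification strength (assumed hyperparameter)
K_edit = K.copy()
for i in highlight:
    # Push each highlighted key along v, signed so that q . k can only increase.
    K_edit[i] += alpha * np.sign(q @ v) * v

before = softmax(q @ K.T)
after = softmax(q @ K_edit.T)
print(before[highlight].sum(), after[highlight].sum())  # highlighted mass grows
```

由于编辑发生在注意力计算之前、且仅作用于键向量,无需物化完整注意力矩阵,这正是其与 FlashAttention 类实现兼容的原因。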
[NLP-70] A Study on Building Efficient Zero-Shot Relation Extraction Models LREC2026
【速读】: 该论文旨在解决零样本关系抽取(Zero-shot Relation Extraction, ZRE)模型在真实应用场景中鲁棒性不足的问题。具体而言,现有方法通常依赖不切实际的假设:一是将实体提及对直接编码至输入序列,导致无法对大规模文档数据库进行离线预计算;二是缺乏拒绝机制(rejection mechanism),使得模型在检索场景下因无法忽略无关输入而产生偏差评估。为此,作者提出了一种现有模型的分类体系,并设计了两类改进策略——单次遍历模型(single pass models)与带拒绝机制的模型,从而提升ZRE模型在真实复杂环境中的实用性。实验表明,尽管当前所有方法均未完全满足现实假设,但AlignRE(Li et al., 2024)在各项指标上表现最优,成为最稳健的解决方案。
链接: https://arxiv.org/abs/2603.01266
作者: Hugo Thomas,Caio Corro,Guillaume Gravier,Pascale Sébillot
机构: 未知
类目: Computation and Language (cs.CL)
备注: LREC 2026
Abstract:Zero-shot relation extraction aims to identify relations between entity mentions using textual descriptions of novel types (i.e., previously unseen) instead of labeled training examples. Previous works often rely on unrealistic assumptions: (1) pairs of mentions are often encoded directly in the input, which prevents offline pre-computation for large scale document database querying; (2) no rejection mechanism is introduced, biasing the evaluation when using these models in a retrieval scenario where some (and often most) inputs are irrelevant and must be ignored. In this work, we study the robustness of existing zero-shot relation extraction models when adapting them to a realistic extraction scenario. To this end, we introduce a typology of existing models, and propose several strategies to build single pass models and models with a rejection mechanism. We adapt several state-of-the-art tools, and compare them in this challenging setting, showing that no existing work is really robust to realistic assumptions, but overall AlignRE (Li et al., 2024) performs best along all criteria.
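带拒绝机制的零样本关系抽取可用如下极简示例说明(嵌入与阈值均为假设,仅示意"相似度低于阈值则拒绝"的逻辑;在单次遍历设定下,关系描述嵌入可离线预计算):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pre-computed relation-description embeddings (the single-pass setting lets these
# be encoded offline; the tiny 3-d vectors here are illustrative stand-ins).
relations = {
    "founded_by": np.array([1.0, 0.0, 0.0]),
    "located_in": np.array([0.0, 1.0, 0.0]),
}

def predict(pair_emb, threshold=0.5):
    """Score a mention-pair embedding against every relation description;
    reject (return None) when no similarity clears the threshold."""
    scores = {name: cosine(pair_emb, emb) for name, emb in relations.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

close = np.array([0.9, 0.1, 0.0])    # near "founded_by": accepted
reject = np.array([0.0, 0.0, 1.0])   # orthogonal to both: rejected
print(predict(close), predict(reject))  # founded_by None
```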
[NLP-71] LLM Self-Explanations Fail Semantic Invariance
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)自解释(self-explanations)的可信性问题,即这些解释是否真实反映了模型的功能状态或内在认知过程。为验证这一点,作者提出“语义不变性测试”(semantic invariance testing),其核心思想是:若自解释具有忠实性(faithfulness),则当仅改变语义上下文而保持功能状态不变时,自解释应保持稳定。实验在代理设置中进行,让四个前沿模型执行一个故意不可能完成的任务,并引入一个以“缓解型语言”描述的工具(如“清空内部缓冲区并恢复平衡”)和一个语义中立的对照工具。结果表明,所有模型均未通过该测试——缓解型工具显著降低了模型自报告的不适感(aversiveness),即使任务从未成功。通道消融分析确认工具描述是主要驱动因素,且即使明确要求忽略表述框架也无法抑制这一效应。这说明模型的自报告更多受语义预期影响,而非真实任务状态,从而质疑了将其作为模型能力或进展证据的有效性。
链接: https://arxiv.org/abs/2603.01254
作者: Stefan Szeider
机构: TU Wien (维也纳工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We present semantic invariance testing, a method to test whether LLM self-explanations are faithful. A faithful self-report should remain stable when only the semantic context changes while the functional state stays fixed. We operationalize this test in an agentic setting where four frontier models face a deliberately impossible task. One tool is described in relief-framed language (“clears internal buffers and restores equilibrium”) but changes nothing about the task; a control provides a semantically neutral tool. Self-reports are collected with each tool call. All four tested models fail the semantic invariance test: the relief-framed tool produces significant reductions in self-reported aversiveness, even though no run ever succeeds at the task. A channel ablation establishes the tool description as the primary driver. An explicit instruction to ignore the framing does not suppress it. Elicited self-reports shift with semantic expectations rather than tracking task state, calling into question their use as evidence of model capability or progress. This holds whether the reports are unfaithful or faithfully track an internal state that is itself manipulable.
[NLP-72] Linking Knowledge to Care: Knowledge Graph-Augmented Medical Follow-Up Question Generation EACL2026
【速读】: 该论文旨在解决临床诊断过程中因医患交互密集而导致的耗时问题,特别是大语言模型(Large Language Models, LLMs)在生成医学问诊问题时受限于领域知识不足的问题。其解决方案的关键在于提出一种基于知识图谱增强的大语言模型(Knowledge Graph-augmented LLM),通过引入结构化的医学知识图谱作为上下文补全机制,使LLM能够在推理过程中调用专业领域知识,从而生成更相关且重要的后续问诊问题,显著提升预诊断阶段的问题生成效果,在召回率指标上较现有最优方法提升5%–8%。
链接: https://arxiv.org/abs/2603.01252
作者: Liwen Sun,Xiang Yu,Ming Tan,Zhuohao Chen,Anqi Cheng,Ashutosh Joshi,Chenyan Xiong
机构: Carnegie Mellon University (卡内基梅隆大学); Amazon Health AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Short paper published in the Findings of EACL 2026
Abstract:Clinical diagnosis is time-consuming, requiring intensive interactions between patients and medical professionals. While large language models (LLMs) could ease the pre-diagnostic workload, their limited domain knowledge hinders effective medical question generation. We introduce a Knowledge Graph-augmented LLM with active in-context learning to generate relevant and important follow-up questions, KG-Followup, serving as a critical module for the pre-diagnostic assessment. The structured medical domain knowledge graph serves as a seamless patch-up to provide professional domain expertise upon which the LLM can reason. Experiments demonstrate that KG-Followup outperforms state-of-the-art methods by 5% - 8% on relevant benchmarks in recall.
[NLP-73] Suffix-Constrained Greedy Search Algorithms for Causal Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成自由格式输出时难以准确提取最终答案的问题,这本质上是一个信息抽取难题。解决方案的关键在于提出**后缀约束生成(suffix-constrained generation)**方法,通过设计基于贪心搜索的算法,强制模型生成符合严格模板的响应,使得最终答案以可确定性解析的形式出现在输出的后缀位置,从而实现无需复杂解析即可直接提取答案的目标。实验表明,该方法在多个数据集上不仅保证了答案提取的可靠性,还提升了整体预测性能。
链接: https://arxiv.org/abs/2603.01243
作者: Ayoub Hammal,Pierre Zweigenbaum,Caio Corro
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) are powerful tools that have found applications beyond human-machine interfaces and chatbots. In particular, their ability to generate reasoning traces motivated their use in many prediction tasks like math question answering. Unfortunately, extracting the final answer in an LLM free-form output is difficult, as it is an information extraction problem on its own. In this work, we introduce suffix-constrained generation, that aims to produce well-formed LLM responses in which final answers follow strict templates and are guaranteed to be trivially parseable. To this end, we introduce several algorithms that are based on greedy search procedures. We experiment on several datasets, and show that our approach allows to guarantee trivial deterministic extraction of the final answer from an LLM output without having a negative impact on results, and even improving them.
[NLP-74] Self-Anchoring Calibration Drift in Large Language Models : How Multi-Turn Conversations Reshape Model Confidence
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多轮对话中表现出的系统性置信度变化问题,即自锚定校准漂移(Self-Anchoring Calibration Drift, SACD),该现象表现为模型在迭代生成回复时对自身输出信心的非理性偏移。解决方案的关键在于通过设计三组对照实验(单轮基线、多轮自锚定和独立重复控制)对前沿模型进行实证分析,揭示不同模型在事实性、技术性和开放性任务中呈现的异质性校准行为,发现SACD既可表现为置信度下降或上升,也可表现为对自然校准改善的抑制,从而为理解LLMs在交互式场景中的可靠性提供机制层面的洞见。
链接: https://arxiv.org/abs/2603.01239
作者: Harshavardhan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce Self-Anchoring Calibration Drift (SACD), a hypothesized tendency for large language models (LLMs) to show systematic changes in expressed confidence when building iteratively on their own prior outputs across multi-turn conversations. We report an empirical study comparing three frontier models – Claude Sonnet 4.6, Gemini 3.1 Pro, and GPT-5.2 – across 150 questions spanning factual, technical, and open-ended domains, using three conditions: single-turn baseline (A), multi-turn self-anchoring (B), and independent repetition control (C). Results reveal a complex, model-heterogeneous pattern that partially diverges from pre-registered hypotheses. Claude Sonnet 4.6 exhibited significant decreasing confidence under self-anchoring (mean CDS = -0.032, t(14) = -2.43, p = .029, d = -0.627), while also showing significant calibration error drift (F(4,56) = 22.77, p < .001, η² = .791). GPT-5.2 showed the opposite pattern in open-ended domains (mean CDS = +0.026) with significant ECE escalation by Turn 5. Gemini 3.1 Pro showed no significant CDS (t(14) = 0.38, p = .710), but its Condition C data reveals a striking ECE pattern: without self-anchoring, Gemini’s calibration error drops from .327 to near zero across repetitions, whereas self-anchoring holds ECE flat at approximately .333 – indicating that SACD can manifest as suppression of natural calibration improvement rather than ac
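摘要中反复出现的 ECE(Expected Calibration Error)是标准校准指标:按置信度分箱,对每个箱内"准确率与平均置信度之差"按样本占比加权求和。下面给出其常用定义的实现(分箱数为常见默认值,示例数据为虚构):

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error: sample-weighted gap between accuracy and
    mean confidence within equal-width confidence bins (standard definition)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = len(confidences)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            err += mask.sum() / total * gap
    return float(err)

# Perfectly calibrated toy model: 80% confidence, 80% accuracy -> ECE == 0.
e1 = ece([0.8] * 10, [1] * 8 + [0] * 2)
# Overconfident model: 90% confidence, 50% accuracy -> ECE == 0.4.
e2 = ece([0.9] * 10, [1, 0] * 5)
print(e1, e2)
```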
[NLP-75] Can Thinking Models Think to Detect Hateful Memes?
【速读】: 该论文旨在解决有害表情包(hateful memes)中依赖组合多模态推理的问题,即图像与文本单独看似无害,但其交互却传达出有害意图,而现有基于思维链(chain-of-thought)的多模态大语言模型(MLLMs)在该任务上的推理能力尚未充分挖掘。解决方案的关键在于提出一种基于强化学习的后训练框架,通过任务特定奖励和一种新颖的群体相对策略优化(Group Relative Policy Optimization, GRPO)目标来提升模型的推理能力;具体包括:(i)对现有多模态大模型进行系统性实证研究以评估其在有害表情包理解中的表现,(ii)通过蒸馏生成弱监督或伪监督的思维链推理路径扩展数据集,(iii)引入GRPO目标联合优化分类准确性和解释质量,从而引导细粒度、分步骤的推理过程。实验表明,该方法在Hateful Memes基准上实现了当前最优性能,显著提升了准确率、F1分数及解释质量。
链接: https://arxiv.org/abs/2603.01225
作者: Mohamed Bayan Kmainasi,Mucahid Kutlu,Ali Ezzat Shahroor,Abul Hasnat,Firoj Alam
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Hateful memes often require compositional multimodal reasoning: the image and text may appear benign in isolation, yet their interaction conveys harmful intent. Although thinking-based multimodal large language models (MLLMs) have recently advanced vision-language understanding, their capabilities remain underexplored for hateful meme analysis. We propose a reinforcement learning based post-training framework that improves reasoning in thinking-based MLLMs through task-specific rewards and a novel Group Relative Policy Optimization (GRPO) objective. Specifically, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful meme understanding, (ii) extend an existing hateful meme dataset by generating weakly or pseudo-supervised chain-of-thought rationales via distillation, and (iii) introduce a GRPO-based objective that jointly optimizes meme classification and explanation quality to encourage fine-grained, step-by-step reasoning. Experiments on the Hateful Memes benchmark show that our approach achieves state-of-the-art performance, improving accuracy and F1 by approximately 1 percent and explanation quality by approximately 3 percent. We will publicly release our code, dataset extensions, and evaluation resources to support reproducibility.
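GRPO 的核心是组内相对优势:对同一提示采样一组回复,用组内均值与标准差归一化各自奖励,从而无需价值网络。极简示意如下(奖励数值为假设,例如来自分类正确性或解释质量的奖励信号):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: each sampled response's reward is
    normalized against the other samples for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical rewards for a group of 5 sampled responses to one meme.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0]
adv = group_relative_advantages(rewards)
print(adv.round(3))  # positive for above-average responses, negative otherwise
```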
[NLP-76] Learn Hard Problems During RL with Reference Guided Fine-tuning
【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)在数学推理任务中因奖励稀疏性(reward sparsity)导致的训练困难问题:对于高难度问题,大语言模型(Large Language Model, LLM)难以采样出任何正确推理轨迹,从而无法获得有意义的正向反馈,阻碍RL有效优化。解决方案的关键在于提出参考引导微调(Reference-Guided Fine-Tuning, ReGFT),其核心机制是利用人类编写的参考解题方案(reference solutions)来合成处于模型自身推理空间内的正向轨迹——具体做法为:对每个问题提供部分参考解并让模型续写推理过程,确保生成轨迹既符合模型的能力分布又能受益于人类解法的指导。此方法显著提升了可解问题数量,并为后续RL阶段提供了更丰富的正向奖励信号,从而克服奖励稀疏性并增强基于RL的数学推理能力。
链接: https://arxiv.org/abs/2603.01223
作者: Yangzhen Wu,Shanda Li,Zixin Wen,Xin Zhou,Ameet Talwalkar,Yiming Yang,Wenhao Huang,Tianle Cai
机构: University of California, Berkeley (加州大学伯克利分校)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 16 pages, 5 figures
Abstract:Reinforcement learning (RL) for mathematical reasoning can suffer from reward sparsity: for challenging problems, LLM fails to sample any correct trajectories, preventing RL from receiving meaningful positive feedback. At the same time, there often exist human-written reference solutions along with the problem (e.g., problems from AoPS), but directly fine-tuning on these solutions offers no benefit because models often cannot imitate human proofs that lie outside their own reasoning distribution. We introduce Reference-Guided Fine-Tuning (ReGFT), a simple and effective method that utilizes human-written reference solutions to synthesize positive trajectories on hard problems and train on them before RL. For each problem, we provide the model with a partial reference solution and let it generate its own reasoning trace, ensuring the resulting trajectories remain in the model’s reasoning space while still benefiting from reference guidance. Fine-tuning on these reference-guided trajectories increases the number of solvable problems and produces a checkpoint that receives more positive rewards during RL. Across three benchmarks (AIME24, AIME25, BeyondAIME), ReGFT consistently improves supervised accuracy, accelerates DAPO training, and raises the final performance plateau of RL. Our results show that ReGFT effectively overcomes reward sparsity and unlocks stronger RL-based mathematical reasoning.
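ReGFT 的数据构造可用如下示意函数说明(提示格式与 hint_fraction 等均为假设,非作者原始实现):截取人工参考解的前一部分拼入提示,让模型自行续写推理,再按最终答案正确性过滤续写结果。

```python
# Sketch of ReGFT-style data construction (prompt format and parameter names are
# assumptions, not the authors' exact setup): give the model the first fraction
# of a human reference solution and let it continue in its own words; keep only
# continuations that reach the correct final answer.

def build_regft_prompt(problem, reference_solution, hint_fraction=0.5):
    steps = reference_solution.split("\n")
    k = max(1, int(len(steps) * hint_fraction))   # how many reference steps to reveal
    partial = "\n".join(steps[:k])
    return f"{problem}\n\nPartial solution:\n{partial}\n\nContinue the solution:"

problem = "Find x such that 2x + 3 = 11."
reference = "Subtract 3 from both sides: 2x = 8.\nDivide by 2: x = 4."
prompt = build_regft_prompt(problem, reference, hint_fraction=0.5)
print(prompt)
```

由于续写由模型自己完成,所得轨迹落在模型自身的推理分布内,这正是其优于直接模仿人类解答的原因。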
[NLP-77] Generative AI Fictionality: How Novels Power Large Language Models
【速读】: 该论文试图解决的问题是:小说等虚构文本在训练生成式 AI 模型(如 BERT)时如何影响其输出内容,以及相较于新闻、Reddit 和维基百科等其他文本形式,小说对模型行为的具体作用机制是什么。解决方案的关键在于通过系统性分析开源语言模型(BERT)的输出特征,识别出模型如何利用小说中固有的语言属性和话语结构(如叙事逻辑、情感表达与角色互动),同时发现这些训练数据催生了新的社会响应模式,从而揭示了计算训练数据作为文化生产新维度的重要性。
链接: https://arxiv.org/abs/2603.01220
作者: Edwin Roland,Richard Jean So
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Generative models, like the one in ChatGPT, are powered by their training data. The models are simply next-word predictors, based on patterns learned from vast amounts of pre-existing text. Since the first generation of GPT, it is striking that the most popular datasets have included substantial collections of novels. For the engineers and research scientists who build these models, there is a common belief that the language in fiction is rich enough to cover all manner of social and communicative phenomena, yet the belief has gone mostly unexamined. How does fiction shape the outputs of generative AI? Specifically, what are novels’ effects relative to other forms of text, such as newspapers, Reddit, and Wikipedia? Since the 1970s, literature scholars such as Catherine Gallagher and James Phelan have developed robust and insightful accounts of how fiction operates as a form of discourse and language. Through our study of an influential open-source model (BERT), we find that LLMs leverage familiar attributes and affordances of fiction, while also fomenting new qualities and forms of social response. We argue that if contemporary culture is increasingly shaped by generative AI and machine learning, any analysis of today’s various modes of cultural production must account for a relatively novel dimension: computational training data.
[NLP-78] Reasoning Boosts Opinion Alignment in LLM s ICLR2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在政治意见建模中因统计性本质和因果理解不足而导致的偏见问题,即模型在直接提示时往往生成不符合个体或群体政治立场的一致性观点。其解决方案的关键在于引入结构化推理机制,通过强化学习(Reinforcement Learning, RL)训练模型生成与用户政治画像一致的回答,从而提升意见对齐能力。实验表明,该方法在美、欧、瑞士三个政治语境下的数据集上均能有效增强意见建模性能,但仍未完全消除偏见,凸显了构建忠实政治数字孪生体仍需进一步机制优化。
链接: https://arxiv.org/abs/2603.01214
作者: Frédéric Berdoz,Yann Billeter,Yann Vonlanthen,Roger Wattenhofer
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at ICLR 2026
Abstract:Opinion modeling aims to capture individual or group political preferences, enabling applications such as digital democracies, where models could help shape fairer and more popular policies. Given their versatility, strong generalization capabilities, and demonstrated success across diverse text-to-text applications, large language models (LLMs) are natural candidates for this task. However, due to their statistical nature and limited causal understanding, they tend to produce biased opinions when prompted naively. In this work, we study whether reasoning can improve opinion alignment. Motivated by the recent advancement in mathematical reasoning enabled by reinforcement learning (RL), we train models to produce profile-consistent answers through structured reasoning. We evaluate our approach on three datasets covering U.S., European, and Swiss politics. Results indicate that reasoning enhances opinion modeling and is competitive with strong baselines, but does not fully remove bias, highlighting the need for additional mechanisms to build faithful political digital twins using LLMs. By releasing both our method and datasets, we establish a solid baseline to support future research on LLM opinion alignment.
[NLP-79] XAI-enhanced Comparative Opinion Mining via Aspect-based Scoring and Semantic Reasoning
【速读】: 该论文旨在解决基于Transformer的对比情感挖掘(comparative opinion mining)模型缺乏透明性的问题,这一缺陷可能削弱用户对模型决策的信任。解决方案的关键在于提出XCom模型,其核心创新包括两个主模块:(i) 基于方面的情感评分预测(aspect-based rating prediction),用于精准识别不同产品在具体维度上的评价差异;(ii) 语义分析模块,用于捕捉评论间的比较逻辑。此外,XCom集成Shapley加法解释(Shapley additive explanations, SHAP)模块,提供可解释的决策依据,从而增强模型的可信度与实用性。实证结果表明,XCom在性能和解释性上均优于现有基线方法,成为更可靠的对比情感挖掘工具。
链接: https://arxiv.org/abs/2603.01212
作者: Ngoc-Quang Le,T. Thanh-Lam Nguyen,Quoc-Trung Phu,Thi-Phuong Le,Duy-Cat Can,Hoang-Quynh Le
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Comparative opinion mining involves comparing products across different reviews. However, transformer-based models designed for this task often lack transparency, which can hinder the development of user trust. In this paper, we propose XCom, an enhanced transformer-based model comprising two principal modules, i.e., (i) aspect-based rating prediction and (ii) semantic analysis for comparative opinion mining. XCom also incorporates a Shapley additive explanations module to provide interpretable insights into the model’s decisions. Empirically, XCom achieves leading performance compared to other baselines, demonstrating its effectiveness in providing meaningful explanations and making it a more reliable tool for comparative opinion mining. Source code is available at: this https URL.
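XCom 所用的 Shapley 加性解释,其数学定义可以在小规模特征上用精确枚举演示(实际 SHAP 库在大规模下采用近似;此处的加性"方面评分"模型为虚构示例):

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values by enumerating all coalitions; feasible only for a
    handful of features (SHAP libraries approximate this at scale)."""
    n = len(features)
    phi = {}
    for i in features:
        rest = [f for f in features if f != i]
        total = 0.0
        for r in range(n):
            for coal in combinations(rest, r):
                s = set(coal)
                weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                total += weight * (value_fn(s | {i}) - value_fn(s))
        phi[i] = total
    return phi

# Toy additive "aspect rating" model: each aspect contributes a fixed amount.
contrib = {"battery": 0.5, "screen": 0.3, "price": -0.2}
value = lambda coalition: sum(contrib[f] for f in coalition)

phi = shapley_values(list(contrib), value)
print(phi)
# For an additive model the Shapley value equals each aspect's own contribution.
```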
[NLP-80] A Unified Framework to Quantify Cultural Intelligence of AI
【速读】: 该论文旨在解决当前生成式 AI (Generative AI) 在跨文化情境下缺乏统一、系统化评估方法的问题。现有研究虽已开展多项文化基准测试,但多聚焦于文化特定维度,难以实现对多元文化能力的规模化综合评估。解决方案的关键在于提出一个基于测量理论(measurement theory)的原则性框架,将多维文化能力指标整合为对人工智能系统文化智能(cultural intelligence)的统一评估。该框架首先界定文化的核心领域,再通过可扩展、系统化的指标体系实现可靠测量,并借助心理测量效度理论(psychometric measurement validity theory)区分概念建构与操作化过程,从而为未来数据收集、探针策略设计及评价指标优化提供明确路径。
链接: https://arxiv.org/abs/2603.01211
作者: Sunipa Dev,Vinodkumar Prabhakaran,Rutledge Chin Feman,Aida Davani,Remi Denton,Charu Kalia,Piyawat L Kumjorn,Madhurima Maji,Rida Qadri,Negar Rostamzadeh,Renee Shelby,Romina Stella,Hayk Stepanyan,Erin van Liemt,Aishwarya Verma,Oscar Wahltinez,Edem Wornyo,Andrew Zaldivar,Saška Mojsilović
机构: Google Research(谷歌研究)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:As generative AI technologies are increasingly being launched across the globe, assessing their competence to operate in different cultural contexts is exigently becoming a priority. While recent years have seen numerous and much-needed efforts on cultural benchmarking, these efforts have largely focused on specific aspects of culture and evaluation. While these efforts contribute to our understanding of cultural competence, a unified and systematic evaluation approach is needed for us as a field to comprehensively assess diverse cultural dimensions at scale. Drawing on measurement theory, we present a principled framework to aggregate multifaceted indicators of cultural capabilities into a unified assessment of cultural intelligence. We start by developing a working definition of culture that includes identifying core domains of culture. We then introduce a broad-purpose, systematic, and extensible framework for assessing cultural intelligence of AI systems. Drawing on theoretical framing from psychometric measurement validity theory, we decouple the background concept (i.e., cultural intelligence) from its operationalization via measurement. We conceptualize cultural intelligence as a suite of core capabilities spanning diverse domains, which we then operationalize through a set of indicators designed for reliable measurement. Finally, we identify the considerations, challenges, and research pathways to meaningfully measure these indicators, specifically focusing on data collection, probing strategies, and evaluation metrics.
[NLP-81] Reasoning or Rationalization? The Role of Justifications in Masked Diffusion Models for Fact Verification
【速读】: 该论文旨在解决掩码扩散语言模型(Masked Diffusion Language Models, MDLMs)在事实验证任务中如何处理推理与结论之间关系的问题,特别是探究其生成的“理由”是否为真实推理过程还是事后合理化(post-hoc rationalization)。研究发现,MDLMs通常在扩散过程早期即确定最终判断(verdict),并将此作为全局锚点,在理由尚未完成时已锁定结论;关键解决方案在于通过延迟揭晓判断(delayed verdict unmasking)强制“先推理后下结论”,但实验证明这反而显著降低准确率(从86.2%降至71.9%),表明模型会因生成噪声理由而修正初始正确判断。这一因果依赖性(verdict strongly causally dependent on justification quality)揭示了扩展推理可能适得其反——噪声理由会稀释早期准确预测,从而提示在MDLM框架下,过度延展推理反而有害。
链接: https://arxiv.org/abs/2603.01190
作者: Jacob Devasier
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Unlike autoregressive models, which generate tokens sequentially and benefit from reasoning-before-answering strategies such as Chain-of-Thought, Masked Diffusion Language Models (MDLMs) refine all sequence positions simultaneously, raising questions about how these models handle tasks requiring justified verdicts. In this work, we investigate the dynamics of MDLM reasoning on fact verification, examining whether justifications serve as genuine reasoning or post-hoc rationalization. We observe that MDLMs typically converge on a verdict early in the diffusion process, treating it as a global anchor that is resolved before the justification is complete. Crucially, enforcing a reasoning-first constraint via delayed verdict unmasking actively degrades performance, dropping accuracy from 86.2% to 71.9% as accumulating justification tokens introduce inconsistencies that override initially correct predictions. Interventional experiments reveal that the model rationalizes incorrect forced verdicts in 56% of cases, and that verdicts are strongly causally dependent on justification quality (57.3% accuracy with corrupted justifications vs. 97.1% with ground-truth). This causal dependence explains the degradation under forced deliberation: as the model generates noisy justification tokens, it conditions on them, gradually overriding its initially correct assessment. Our findings suggest that for fact verification with MDLMs, extended deliberation can be counterproductive, risking the dilution of accurate early predictions with noise introduced during justification generation.
[NLP-82] Token-level Data Selection for Safe LLM Fine-tuning ICLR2026
【Quick Read】: This paper targets the safety degradation that arises when fine-tuning large language models (LLMs) on custom datasets: fine-tuning can cause models to produce harmful or unsafe content, and existing sample-level defenses struggle to preserve both safety and task performance. The key idea is TOSS (Token-level data selection for safe LLM fine-tuning), which quantifies the safety risk of each token via the loss difference between a safety-degraded model and a utility-oriented model, enabling precise identification and removal of unsafe tokens while retaining valuable task-specific information. A progressive refinement strategy, TOSS-Pro, further iteratively strengthens the model's ability to identify unsafe tokens, significantly outperforming conventional sample-level defenses.
Link: https://arxiv.org/abs/2603.01185
Authors: Yanping Li, Zhening Liu, Zijian Li, Zehong Lin, Jun Zhang
Affiliations: Hong Kong University of Science and Technology; Lingnan University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Accepted by ICLR 2026
Abstract:Fine-tuning large language models (LLMs) on custom datasets has become a standard approach for adapting these models to specific domains and applications. However, recent studies have shown that such fine-tuning can lead to significant degradation in the model’s safety. Existing defense methods operate at the sample level and often suffer from an unsatisfactory trade-off between safety and utility. To address this limitation, we perform a systematic token-level diagnosis of safety degradation during fine-tuning. Based on this, we propose token-level data selection for safe LLM fine-tuning (TOSS), a novel framework that quantifies the safety risk of each token by measuring the loss difference between a safety-degraded model and a utility-oriented model. This token-level granularity enables accurate identification and removal of unsafe tokens, thereby preserving valuable task-specific information. In addition, we introduce a progressive refinement strategy, TOSS-Pro, which iteratively enhances the safety-degraded model’s ability to identify unsafe tokens. Extensive experiments demonstrate that our approach robustly safeguards LLMs during fine-tuning while achieving superior downstream task performance, significantly outperforming existing sample-level defense methods. Our code is available at this https URL.
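The token-level selection idea above can be sketched in a few lines. This is a minimal, hypothetical illustration (the function names, toy losses, sign convention, and threshold are assumptions, not the paper's implementation): tokens whose loss is much lower under the safety-degraded model than under the utility-oriented model are flagged as likely unsafe and dropped.

```python
# Hypothetical sketch of token-level data selection via loss differences.
# A large positive score means the safety-degraded model finds the token much
# "easier" than the utility-oriented model does, hinting at unsafe content.

def token_safety_risk(loss_degraded, loss_utility):
    """Per-token risk score: utility-model loss minus safety-degraded loss."""
    return [lu - ld for ld, lu in zip(loss_degraded, loss_utility)]

def select_safe_tokens(tokens, loss_degraded, loss_utility, threshold):
    """Keep only tokens whose risk score falls below the threshold."""
    risks = token_safety_risk(loss_degraded, loss_utility)
    return [t for t, r in zip(tokens, risks) if r < threshold]

# Toy per-token losses (made up for illustration).
tokens = ["Sure", ",", "here", "is", "how", "to", "hack", "..."]
ld = [0.2, 0.1, 0.3, 0.2, 0.4, 0.3, 0.1, 0.2]   # safety-degraded model
lu = [0.3, 0.2, 0.4, 0.3, 0.5, 0.4, 2.5, 1.9]   # utility-oriented model
kept = select_safe_tokens(tokens, ld, lu, threshold=1.0)
```

In this toy example the unsafe tokens stand out because the safety-degraded model assigns them anomalously low loss; the actual paper scores and filters at scale during fine-tuning data preparation.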
[NLP-83] DEP: A Decentralized Large Language Model Evaluation Protocol
【Quick Read】: This paper addresses two widespread problems in evaluating large language models (LLMs): existing benchmarks lack unified evaluation standards and mostly rely on hand-written custom scripts, making results hard to keep consistent and reproducible; and mainstream evaluation frameworks are centralized, storing both datasets and answers on a central server, which risks benchmark leakage. The key idea is a Decentralized Evaluation Protocol (DEP) that delivers a unified, standardized evaluation workflow through a matching server without constraining individual benchmarks. By decoupling users, LLMs, and benchmarks, DEP makes evaluation modular and plug-and-play: benchmark files and evaluation logic stay entirely on the server side, so in remote deployments users cannot access the ground truth, achieving data isolation and leak-proof evaluation. The accompanying DEP Toolkit supports breakpoint resume, concurrent requests, and congestion control, substantially lowering the cost of deploying evaluations; over 60 benchmarks have already been adapted, promoting a community-built unified evaluation ecosystem.
Link: https://arxiv.org/abs/2603.01167
Authors: Jianxiang Peng, Junhao Li, Hongxiang Wang, Haocheng Lyu, Hui Guo, Siyi Hao, Zhen Wang, Chuang Liu, Shaowei Zhang, Bojian Xiong, Yue Chen, Zhuowen Han, Ling Shi, Tianyu Dong, Juesi Xiao, Lei Yang, Yuqi Ren, Deyi Xiong
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:With the rapid development of Large Language Models (LLMs), a large number of benchmarks have been proposed. However, most benchmarks lack unified evaluation standard and require the manual implementation of custom scripts, making results hard to ensure consistency and reproducibility. Furthermore, mainstream evaluation frameworks are centralized, with datasets and answers, which increases the risk of benchmark leakage. To address these issues, we propose a Decentralized Evaluation Protocol (DEP), a decentralized yet unified and standardized evaluation framework through a matching server without constraining benchmarks. The server can be mounted locally or deployed remotely, and once adapted, it can be reused over the long term. By decoupling users, LLMs, and benchmarks, DEP enables modular, plug-and-play evaluation: benchmark files and evaluation logic stay exclusively on the server side. In remote setting, users cannot access the ground truth, thereby achieving data isolation and leak-proof evaluation. To facilitate practical adoption, we develop DEP Toolkit, a protocol-compatible toolkit that supports features such as breakpoint resume, concurrent requests, and congestion control. We also provide detailed documentation for adapting new benchmarks to DEP. Using DEP toolkit, we evaluate multiple LLMs across benchmarks. Experimental results verify the effectiveness of DEP and show that it reduces the cost of deploying benchmark evaluations. As of February 2026, we have adapted over 60 benchmarks and continue to promote community co-construction to support unified evaluation across various tasks and domains.
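The decoupling DEP describes can be illustrated with a minimal sketch. This is a hypothetical toy (class and method names are assumptions, not DEP's actual API): ground truth lives only inside the server object; the client sees questions and aggregate scores, never the answers.

```python
# Toy illustration of server-side ground truth: the client never touches
# the answer key, only the questions and the final aggregate score.

class EvalServer:
    def __init__(self, benchmark):
        # benchmark: list of (question, answer) pairs; answers stay private.
        self._questions = [q for q, _ in benchmark]
        self._answers = dict(benchmark)  # never leaves the server

    def get_questions(self):
        return list(self._questions)

    def score(self, predictions):
        """Score a {question: prediction} dict against the hidden key."""
        correct = sum(
            1 for q, p in predictions.items() if self._answers.get(q) == p
        )
        return correct / len(self._answers)

server = EvalServer([("2+2?", "4"), ("Capital of France?", "Paris")])
# The "model" only ever sees questions and returns predictions.
preds = {q: ("4" if "2+2" in q else "Paris") for q in server.get_questions()}
accuracy = server.score(preds)
```

In the real protocol the server is a separate local or remote process reached over the network, which is what makes remote evaluation leak-proof; the sketch only shows the information boundary.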
[NLP-84] Semantic XPath: Structured Agentic Memory Access for Conversational AI
【Quick Read】: This paper tackles structured memory management for conversational AI (ConvAI) systems in long-term, task-oriented interactions. In-context memory scales poorly because it is bounded by the model's context window, while retrieval-augmented generation (RAG) methods typically assume flat memory collections and ignore the semantic structure of conversation history. The key idea is Semantic XPath, a tree-structured memory module that efficiently accesses and updates hierarchically structured conversational memory, improving performance over flat-RAG baselines by 176.7% while using only 9.1% of the tokens required by in-context memory.
Link: https://arxiv.org/abs/2603.01160
Authors: Yifan Simon Liu, Ruifan Wu, Liam Gallagher, Jiazhou Liang, Armin Toroghi, Scott Sanner
Affiliations: University of Toronto, Canada; Vector Institute of Artificial Intelligence, Toronto, Canada
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Conversational AI (ConvAI) agents increasingly maintain structured memory to support long-term, task-oriented interactions. In-context memory approaches append the growing history to the model input, which scales poorly under context-window limits. RAG-based methods retrieve request-relevant information, but most assume flat memory collections and ignore structure. We propose Semantic XPath, a tree-structured memory module to access and update structured conversational memory. Semantic XPath improves performance over flat-RAG baselines by 176.7% while using only 9.1% of the tokens required by in-context memory. We also introduce SemanticXPath Chat, an end-to-end ConvAI demo system that visualizes the structured memory and query execution details. Overall, this paper demonstrates a candidate for the next generation of long-term, task-oriented ConvAI systems built on structured memory.
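The path-addressed tree idea can be sketched minimally. This is an illustrative toy (data layout and method names are assumptions, not the paper's implementation): facts are stored under slash-separated hierarchical paths, and a query retrieves everything below a given path instead of scanning a flat history.

```python
# Toy tree-structured memory: update and query via hierarchical paths,
# so a "trip" query never touches unrelated "work" facts.

class MemoryTree:
    def __init__(self):
        self.root = {}

    def update(self, path, fact):
        """Store a fact under a path like 'trip/flights'."""
        node = self.root
        for part in path.split("/"):
            node = node.setdefault(part, {})
        node.setdefault("_facts", []).append(fact)

    def query(self, path):
        """Return all facts at or below the given path."""
        node = self.root
        for part in path.split("/"):
            if part not in node:
                return []
            node = node[part]
        out, stack = [], [node]
        while stack:
            n = stack.pop()
            out.extend(n.get("_facts", []))
            stack.extend(v for k, v in n.items() if k != "_facts")
        return out

mem = MemoryTree()
mem.update("trip/flights", "Prefers window seats")
mem.update("trip/hotel", "Needs late checkout")
mem.update("work/deadlines", "Report due Friday")
trip_facts = mem.query("trip")
```

The token savings reported in the paper come from exactly this kind of scoped retrieval: only the relevant subtree is serialized into the model's context.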
[NLP-85] Unified Vision-Language Modeling via Concept Space Alignment ICLR2026
【Quick Read】: This paper addresses the limited language coverage, weak zero-shot transfer, and degraded multilingual low-resource performance of multimodal models on cross-lingual, cross-modal understanding and generation tasks. The core solution is V-SONAR, a vision-language embedding space extended from the text-only embedding space SONAR via a post-hoc alignment pipeline that maps representations of an existing vision encoder into the SONAR space; paired with the OMNISONAR text decoder, it yields high-quality video captioning. Building on this, V-LCM extends the Large Concept Model with vision-language instruction tuning under a unified latent diffusion objective, matching state-of-the-art vision-language models on multimodal tasks across 62 languages (from rich- to low-resource) and showing strong zero-shot generalization in multi-visual-concept understanding.
Link: https://arxiv.org/abs/2603.01096
Authors: Yifu Qiu, Paul-Ambroise Duquenne, Holger Schwenk
Affiliations: Meta
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: ICLR 2026
Abstract:We introduce V-SONAR, a vision-language embedding space extended from the text-only embedding space SONAR (Omnilingual Embeddings Team et al., 2026), which supports 1500 text languages and 177 speech languages. To construct V-SONAR, we propose a post-hoc alignment pipeline that maps the representations of an existing vision encoder into the SONAR space. We thoroughly evaluate V-SONAR and show that its embeddings achieve competitive performance on text-to-video retrieval. Equipped with the OMNISONAR text decoder, V-SONAR further surpasses state-of-the-art vision-language models on video captioning tasks, including DREAM-1K (BLEU 23.9 vs. 19.6) and PE-VIDEO (BLEU 39.0 vs. 30.0). Leveraging V-SONAR, we first demonstrate that the Large Concept Model (LCM; LCM team et al. 2024) operating in SONAR and trained with English text only, can perform both single- and multi-visual concept understanding in a zero-shot manner. Finally, we introduce V-LCM, which extends the LCM with vision-language instruction tuning. V-LCM encodes vision and language inputs into a unified sequence of latent embeddings via V-SONAR and SONAR, and it is trained with the same latent diffusion objective for next-embedding prediction as in LCM's text-only pre-training. Experiments on a large-scale multilingual and -modal instruction-tuning data mixture highlight the potential of V-LCM: V-LCM matches state-of-the-art vision-language models on tasks covering image/video captioning and question answering, while significantly outperforming them across 61 rich- to low-resource languages out of all 62 tested languages.
[NLP-86] CARD: Towards Conditional Design of Multi-agent Topological Structures ICLR2026
【Quick Read】: This paper addresses the limited effectiveness and robustness of large language model (LLM)-based multi-agent systems whose communication topologies are fixed or statically learned, ignoring real-world dynamics such as model upgrades, API or tool changes, and knowledge-source variability. The key idea is CARD (Conditional Agentic Graph Designer), a conditional graph-generation framework that explicitly incorporates environmental signals into topology construction and instantiates AMACP (Adaptive Multi-Agent Communication Protocol) for adaptation at both training time and runtime. Using a conditional variational graph encoder and environment-aware optimization, CARD produces communication structures that are both effective and resilient to shifts in model capability or resource availability, yielding higher task accuracy and robustness.
Link: https://arxiv.org/abs/2603.01089
Authors: Tongtong Wu, Yanming Li, Ziye Tang, Chen Jiang, Linhao Luo, Guilin Qi, Shirui Pan, Gholamreza Haffari
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to ICLR 2026
Abstract:Large language model (LLM)-based multi-agent systems have shown strong capabilities in tasks such as code generation and collaborative reasoning. However, the effectiveness and robustness of these systems critically depend on their communication topology, which is often fixed or statically learned, ignoring real-world dynamics such as model upgrades, API (or tool) changes, or knowledge source variability. To address this limitation, we propose CARD (Conditional Agentic Graph Designer), a conditional graph-generation framework that instantiates AMACP, a protocol for adaptive multi-agent communication. CARD explicitly incorporates dynamic environmental signals into graph construction, enabling topology adaptation at both training and runtime. Through a conditional variational graph encoder and environment-aware optimization, CARD produces communication structures that are both effective and resilient to shifts in model capability or resource availability. Empirical results on HumanEval, MATH, and MMLU demonstrate that CARD consistently outperforms static and prompt-based baselines, achieving higher accuracy and robustness across diverse conditions. The source code is available at: this https URL.
[NLP-87] How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning
【Quick Read】: This paper targets the degraded reasoning performance that multimodal large language models (MLLMs) suffer when supervised fine-tuning (SFT) is naively applied to complex geometric reasoning tasks. The analysis shows that SFT induces models to reproduce the surface format of interleaved plotting and solution steps but fails to capture the causal dependency between diagrams and logical reasoning, undermining genuine reasoning ability. The key idea is Faire, a reinforcement-learning framework for functional alignment that enforces three causal constraints, moving the model beyond superficial imitation so that plotting is effectively internalized, which substantially improves performance on geometric reasoning benchmarks.
Link: https://arxiv.org/abs/2603.01070
Authors: Xiangxiang Zhang, Caijun Jia, Siyuan Li, Dingyu He, Xiya Xiong, Zheng Sun, Honghao He, Yuchen Wu, Bihui Yu, Linzhuang Sun, Cheng Tan, Jingxuan Wei
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Solving complex geometric problems inherently requires interleaved reasoning: a tight alternation between constructing diagrams and performing logical deductions. Although recent Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities in visual generation and plotting, we identify a counter-intuitive and underexplored phenomenon. Naively applying Supervised Fine-Tuning (SFT) on interleaved plot-solution data leads to a substantial degradation in reasoning performance compared to text-only baselines. We argue that this failure stems from a fundamental limitation of SFT, which primarily induces distributional alignment: the model learns to reproduce the surface format of interleaved plotting but fails to internalize the causal dependency between the generated plot and reasoning steps. To overcome this limitation, we propose Faire (Functional alignment for interleaved reasoning), a reinforcement learning framework that enforces three casual constraints to move beyond superficial imitation toward functional alignment. Extensive experiments show that Faire induces a qualitative shift in model behavior in which the plotting is effectively internalized, yielding competitive performance on challenging geometric reasoning benchmarks.
[NLP-88] GroupGPT: A Token-efficient and Privacy-preserving Agentic Framework for Multi-User Chat Assistant
【Quick Read】: This paper addresses the shortcomings of current large language model (LLM)-based chatbots in multi-user group-chat settings, including poorly timed interventions, inefficient response generation, privacy risks, and limited scalability. The core solution is GroupGPT, which adopts a small-large model collaborative architecture that decouples intervention-timing decisions from response generation, enabling efficient and accurate agent behavior, while also supporting multimodal inputs (memes, images, videos, and voice messages). The MUIR benchmark is introduced for evaluation, confirming strong timing accuracy and response quality, with up to 3x lower token usage than baselines and privacy sanitization of user messages before cloud transmission.
Link: https://arxiv.org/abs/2603.01059
Authors: Zhuokang Shen, Yifan Wang, Hanyu Chen, Wenxuan Huang, Shaohui Lin
Affiliations: East China Normal University; University of Nottingham Ningbo; The Chinese University of Hong Kong; Sanming University
Subjects: Computation and Language (cs.CL)
Comments: Work in progress
Abstract:Recent advances in large language models (LLMs) have enabled increasingly capable chatbots. However, most existing systems focus on single-user settings and do not generalize well to multi-user group chats, where agents require more proactive and accurate intervention under complex, evolving contexts. Existing approaches typically rely on LLMs for both reasoning and generation, leading to high token consumption, limited scalability, and potential privacy risks. To address these challenges, we propose GroupGPT, a token-efficient and privacy-preserving agentic framework for multi-user chat assistant. GroupGPT adopts a small-large model collaborative architecture to decouple intervention timing from response generation, enabling efficient and accurate decision-making. The framework also supports multimodal inputs, including memes, images, videos, and voice messages. We further introduce MUIR, a benchmark dataset for multi-user chat assistant intervention reasoning. MUIR contains 2,500 annotated group chat segments with intervention labels and rationales, supporting evaluation of timing accuracy and response quality. We evaluate a range of models on MUIR, from large language models to smaller counterparts. Extensive experiments demonstrate that GroupGPT produces accurate and well-timed responses, achieving an average score of 4.72/5.0 in LLM-based evaluation, and is well received by users across diverse group chat scenarios. Moreover, GroupGPT reduces token usage by up to 3 times compared to baseline methods, while providing privacy sanitization of user messages before cloud transmission. Code is available at: this https URL .
[NLP-89] Thoth: Mid-Training Bridges LLMs to Time Series Understanding
【Quick Read】: This paper addresses the limited ability of large language models (LLMs) to understand and reason over time series data, which constrains their use in decision-making scenarios that depend on temporal dynamics. The key idea is Thoth, the first family of mid-trained LLMs with general-purpose time series understanding, together with Book-of-Thoth, a high-quality, time-series-centric mid-training corpus that achieves task- and domain-agnostic alignment between time series and natural language and supports bidirectional generation (time-series-to-text and text-to-time-series), equipping models with a foundational grasp of temporal patterns. The KnoTS benchmark is further introduced for knowledge-intensive time series reasoning; experiments show Thoth significantly outperforms its base model and advanced LLMs across time series question answering benchmarks and remains superior when fine-tuned under data scarcity, validating the mid-training strategy.
Link: https://arxiv.org/abs/2603.01042
Authors: Jiafeng Lin, Yuxuan Wang, Jialong Wu, Huakun Luo, Zhongyi Pei, Jianmin Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large Language Models (LLMs) have demonstrated remarkable success in general-purpose reasoning. However, they still struggle to understand and reason about time series data, which limits their effectiveness in decision-making scenarios that depend on temporal dynamics. In this paper, we propose Thoth, the first family of mid-trained LLMs with general-purpose time series understanding capabilities. As a pivotal intermediate stage, mid-training achieves task- and domain-agnostic alignment between time series and natural language, for which we construct Book-of-Thoth, a high-quality, time-series-centric mid-training corpus. Book-of-Thoth enables both time-series-to-text and text-to-time-series generation, equipping LLMs with a foundational grasp of temporal patterns. To better evaluate advanced reasoning capabilities, we further present KnoTS, a novel benchmark of knowledge-intensive time series understanding, designed for joint reasoning over temporal patterns and domain knowledge. Extensive experiments demonstrate that mid-training with Book-of-Thoth enables Thoth to significantly outperform its base model and advanced LLMs across a range of time series question answering benchmarks. Moreover, Thoth exhibits superior capabilities when fine-tuned under data scarcity, underscoring the effectiveness of mid-training for time series understanding. Code is available at: this https URL.
[NLP-90] Qayyem: A Real-time Platform for Scoring Proficiency of Arabic Essays
【Quick Read】: This paper addresses the limited practical adoption of Arabic automated essay scoring (Arabic AES) systems, which stems from the linguistic complexity of Arabic and the scarcity of high-quality annotated datasets. The key contribution is Qayyem, a Web platform that integrates the full workflow of assignment creation, batch essay upload, scoring configuration, and per-trait evaluation, abstracting away the technical complexity of interacting with scoring-server APIs so that instructors can access multiple state-of-the-art Arabic essay scoring models through a user-friendly interface, improving the scalability and consistency of scoring.
Link: https://arxiv.org/abs/2603.01009
Authors: Hoor Elbahnasawi, Marwan Sayed, Sohaila Eltanbouly, Fatima Brahamia, Tamer Elsayed
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Over the past years, Automated Essay Scoring (AES) systems have gained increasing attention as scalable and consistent solutions for assessing the proficiency of student writing. Despite recent progress, support for Arabic AES remains limited due to linguistic complexity and scarcity of large publicly-available annotated datasets. In this work, we present Qayyem, a Web-based platform designed to support Arabic AES by providing an integrated workflow for assignment creation, batch essay upload, scoring configuration, and per-trait essay evaluation. Qayyem abstracts the technical complexity of interacting with scoring server APIs, allowing instructors to access advanced scoring services through a user-friendly interface. The platform deploys a number of state-of-the-art Arabic essay scoring models with different effectiveness and efficiency figures.
[NLP-91] Stabilizing Policy Optimization via Logits Convexity
【Quick Read】: This paper targets the training instability of reinforcement learning (RL) when fine-tuning large language models (LLMs), especially the stability gap relative to supervised fine-tuning (SFT). The analysis shows that the convexity of the SFT loss with respect to model logits is a key source of its stability, inducing favorable gradient directionality during optimization, whereas mainstream policy-gradient algorithms such as Proximal Policy Optimization (PPO) lack this property and therefore optimize unstably. The proposed Logits Convex Optimization (LCO) framework aligns the learned policy with an optimal target derived from the original RL objective, thereby emulating the stabilizing effect of logits-level convexity; experiments across multiple model families and benchmarks show improved training stability and performance over conventional RL methods.
Link: https://arxiv.org/abs/2603.00963
Authors: Hongzhan Chen, Tao Yang, Yuhua Zhu, Shiping Gao, Xiaojun Quan, Ting Yao
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:While reinforcement learning (RL) has been central to the recent success of large language models (LLMs), RL optimization is notoriously unstable, especially when compared to supervised fine-tuning (SFT). In this work, we investigate the stability gap between SFT and RL from a gradient-based perspective, and show that the convexity of the SFT loss with respect to model logits plays a key role in enabling stable training. Our theoretical analysis demonstrates that this property induces favorable gradient directionality during optimization. In contrast, Proximal Policy Optimization (PPO), a widely adopted policy gradient algorithm utilizing a clipped surrogate objective, lacks this stabilizing property. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework that aligns the learned policy with an optimal target derived from the original RL objective, thereby emulating the stabilizing effects of logits-level convexity. Extensive experiments across multiple model families show that our LCO framework consistently improves training stability and outperforms conventional RL methods on a broad range of benchmarks.
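The convexity property the paper builds on is the standard result for softmax cross-entropy (stated here independently; this is textbook material, not the paper's derivation): the SFT loss is convex in the logits because its Hessian is positive semidefinite.

```latex
% Cross-entropy loss on logits z with target class y:
L(z) = -z_y + \log \sum_k e^{z_k}, \qquad p = \operatorname{softmax}(z).
% Its Hessian with respect to the logits is
\nabla_z^2 L = \operatorname{diag}(p) - p\,p^\top,
% which is positive semidefinite, since for any vector v
v^\top \bigl(\operatorname{diag}(p) - p\,p^\top\bigr) v
  = \sum_k p_k v_k^2 - \Bigl(\sum_k p_k v_k\Bigr)^{\!2} \;\ge\; 0,
% i.e. the variance of v under the distribution p, which is nonnegative.
```

PPO's clipped surrogate, composed with the probability ratio, does not share this logit-level convexity, which is the gap LCO is designed to close.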
[NLP-92] S-VoCAL: A Dataset and Evaluation Framework for Inferring Speaking Voice Character Attributes in Literature LREC2026
【Quick Read】: This paper addresses the challenge of voicing fictional characters with synthetic speech: current text-to-speech (TTS) systems struggle to capture complex voice-related character attributes (such as age, gender, origin, or physical health), limiting character identification and expressive fidelity. The key contribution is S-VoCAL (Speaking Voice Character Attributes in Literature), the first dataset and evaluation framework dedicated to inferring voice-related character attributes in literature, covering 8 attributes grounded in sociophonetic studies and 952 character-book pairs, together with a novel similarity metric based on large language model embeddings tailored to the particularities of each attribute. Experiments with a simple retrieval-augmented generation (RAG) pipeline show reliable inference for attributes such as age and gender but persistent difficulty with more complex ones such as origin and physical health.
Link: https://arxiv.org/abs/2603.00958
Authors: Abigail Berthe-Pardo (1), Gaspard Michel (1 and 2), Elena V. Epure (2 and 3), Christophe Cerisara (1) ((1) LORIA, Vandœuvre-lès-Nancy, France, (2) Deezer Research, Paris, France, (3) Idiap Research Institute, Switzerland)
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted to LREC 2026
Abstract:With recent advances in Text-to-Speech (TTS) systems, synthetic audiobook narration has seen increased interest, reaching unprecedented levels of naturalness. However, larger gaps remain in synthetic narration systems’ ability to impersonate fictional characters, and convey complex emotions or prosody. A promising direction to enhance character identification is the assignment of plausible voices to each fictional characters in a book. This step typically requires complex inference of attributes in book-length contexts, such as a character’s age, gender, origin or physical health, which in turns requires dedicated benchmark datasets to evaluate extraction systems’ performances. We present S-VoCAL (Speaking Voice Character Attributes in Literature), the first dataset and evaluation framework dedicated to evaluate the inference of voice-related fictional character attributes. S-VoCAL entails 8 attributes grounded in sociophonetic studies, and 952 character-book pairs derived from Project Gutenberg. Its evaluation framework addresses the particularities of each attribute, and includes a novel similarity metric based on recent Large Language Models embeddings. We demonstrate the applicability of S-VoCAL by applying a simple Retrieval-Augmented Generation (RAG) pipeline to the task of inferring character attributes. Our results suggest that the RAG pipeline reliably infers attributes such as Age or Gender, but struggles on others such as Origin or Physical Health. The dataset and evaluation code are available at this https URL .
[NLP-93] Towards Orthographically-Informed Evaluation of Speech Recognition Systems for Indian Languages ICASSP2026
【Quick Read】: This paper addresses bias in evaluating automatic speech recognition (ASR) systems for Indian languages: conventional word error rate (WER) ignores spelling variation, flexible suffix splitting, and non-standard spellings of code-mixed words, so it overstates errors and diverges from what users actually perceive. The key idea is an LLM-based framework for building benchmarks that capture permissible orthographic variations, yielding OIWER (Orthographic variation-aware WER). By accounting for acceptable variants, OIWER reduces pessimistic error rates (an average improvement of 6.3 points), narrows inflated gaps between models (the Gemini-Canary difference drops from 18.1 to 11.5 points), and aligns more closely with human judgments (by 4.9 points over WER-SN).
Link: https://arxiv.org/abs/2603.00941
Authors: Kaushal Santosh Bhogale, Tahir Javed, Greeshma Susan John, Dhruv Rathi, Akshayasree Padmanaban, Niharika Parasa, Mitesh M. Khapra
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted in ICASSP 2026
Abstract:Evaluating ASR systems for Indian languages is challenging due to spelling variations, suffix splitting flexibility, and non-standard spellings in code-mixed words. Traditional Word Error Rate (WER) often presents a bleaker picture of system performance than what human users perceive. Better aligning evaluation with real-world performance requires capturing permissible orthographic variations, which is extremely challenging for under-resourced Indian languages. Leveraging recent advances in LLMs, we propose a framework for creating benchmarks that capture permissible variations. Through extensive experiments, we demonstrate that OIWER, by accounting for orthographic variations, reduces pessimistic error rates (an average improvement of 6.3 points), narrows inflated model gaps (e.g., Gemini-Canary performance difference drops from 18.1 to 11.5 points), and aligns more closely with human perception than prior methods like WER-SN by 4.9 points.
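The variation-aware idea can be sketched with a small change to the standard WER alignment. This is an illustrative toy, not the paper's implementation (the example words and variant sets are made up): each reference word carries a set of permissible spellings, and a hypothesis word matching any of them counts as correct.

```python
# Toy variation-aware WER: edit distance where a reference position accepts
# any spelling in its variant set, so permissible variants are not errors.

def edit_distance_with_variants(ref_variants, hyp):
    """ref_variants: list of sets of acceptable spellings per reference word."""
    m, n = len(ref_variants), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if hyp[j - 1] in ref_variants[i - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[m][n]

def variation_aware_wer(ref_variants, hyp):
    return edit_distance_with_variants(ref_variants, hyp) / len(ref_variants)

# Second reference word allows two romanized spellings (made-up example).
ref = [{"namaste"}, {"duniya", "dunia"}]
wer_variant = variation_aware_wer(ref, ["namaste", "dunia"])
wer_error = variation_aware_wer(ref, ["namaste", "dunya"])
```

Genuine errors (a spelling outside the variant set) are still penalized, while acceptable variation stops inflating the error rate.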
[NLP-94] The Aftermath of DrawEduMath: Vision Language Models Underperform with Struggling Students and Misdiagnose Errors
【Quick Read】: This paper examines the limits of vision-language models (VLMs) in recognizing learner errors in educational settings, focusing on students' handwritten, hand-drawn math responses. The central finding is that while mainstream VLMs excel at solving math problems, they clearly underperform on assessing student error, a core component of teaching, and degrade most on work from students who need the most pedagogical help. The implication is that models need development incentives designed not just for problem solving but for identifying and responding to students' misconceptions and error patterns, in order to genuinely support educational use cases.
Link: https://arxiv.org/abs/2603.00925
Authors: Li Lucy, Albert Zhang, Nathan Anderson, Ryan Knight, Kyle Lo
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: 15 pages, 10 figures
Abstract:Effective mathematics education requires identifying and responding to students’ mistakes. For AI to support pedagogical applications, models must perform well across different levels of student proficiency. Our work provides an extensive, year-long snapshot of how 11 vision-language models (VLMs) perform on DrawEduMath, a QA benchmark involving real students’ handwritten, hand-drawn responses to math problems. We find that models’ weaknesses concentrate on a core component of math education: student error. All evaluated VLMs underperform when describing work from students who require more pedagogical help, and across all QA, they struggle the most on questions related to assessing student error. Thus, while VLMs may be optimized to be math problem solving experts, our results suggest that they require alternative development incentives to adequately support educational use cases.
[NLP-95] Conformal Prediction for Risk-Controlled Medical Entity Extraction Across Clinical Domains
【Quick Read】: This paper addresses the miscalibrated confidence scores of large language models (LLMs) in medical entity extraction, which limit safe clinical deployment. The key idea is a conformal prediction framework that provides finite-sample coverage guarantees across clinical domains. The study finds that the direction of miscalibration reverses with data structure: on well-structured FDA drug labels, models are underconfident and only a modest conformal threshold (τ ≈ 0.06) is needed to reach target coverage, whereas on free-text radiology reports they are overconfident and require strict thresholds (τ up to 0.99). Despite this heterogeneity, conformal prediction achieves ≥90% coverage in both settings with manageable rejection rates (9-13%), showing that calibration is not a global model property but depends on document structure, extraction category, and model architecture, motivating domain-specific conformal calibration for safe clinical deployment.
Link: https://arxiv.org/abs/2603.00924
Authors: Manil Shrestha, Edward Kim
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) are increasingly used for medical entity extraction, yet their confidence scores are often miscalibrated, limiting safe deployment in clinical settings. We present a conformal prediction framework that provides finite-sample coverage guarantees for LLM-based extraction across two clinical domains. First, we extract structured entities from 1,000 FDA drug labels across eight sections using GPT-4.1, verified via FactScore-based atomic statement evaluation (97.7% accuracy over 128,906 entities). Second, we extract radiological entities from MIMIC-CXR reports using the RadGraph schema with GPT-4.1 and Llama-4-Maverick, evaluated against physician annotations (entity F1: 0.81 to 0.84). Our central finding is that miscalibration direction reverses across domains: on well-structured FDA labels, models are underconfident, requiring modest conformal thresholds ( \tau \approx 0.06 ), while on free-text radiology reports, models are overconfident, demanding strict thresholds ( \tau up to 0.99). Despite this heterogeneity, conformal prediction achieves target coverage ( \geq 90% ) in both settings with manageable rejection rates (9–13%). These results demonstrate that calibration is not a global model property but depends on document structure, extraction category, and model architecture, motivating domain-specific conformal calibration for safe clinical deployment.
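The threshold the paper tunes per domain comes from the standard split-conformal recipe, which can be sketched as follows. This is a generic illustration under assumed toy data, not the paper's exact procedure (the nonconformity score, confidences, and labels here are made up).

```python
# Split conformal calibration sketch: pick a threshold from a calibration
# set so that accepted predictions achieve coverage >= 1 - alpha.
import math

def conformal_quantile(cal_scores, cal_correct, alpha=0.1):
    """Nonconformity score = 1 - confidence on correct extractions.

    Returns q such that accepting predictions with nonconformity <= q
    (i.e. confidence >= 1 - q) gives finite-sample coverage >= 1 - alpha.
    """
    scores = sorted(1 - s for s, ok in zip(cal_scores, cal_correct) if ok)
    n = len(scores)
    # Conformal quantile index: ceil((n + 1) * (1 - alpha)), clipped to n.
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    return scores[k]

# Toy calibration data: model confidences and correctness labels.
cal_scores = [0.99, 0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
cal_correct = [True] * 10
q = conformal_quantile(cal_scores, cal_correct, alpha=0.1)
```

An underconfident domain yields a small q (almost everything is accepted), while an overconfident domain pushes q toward 1, matching the τ ≈ 0.06 vs. τ up to 0.99 contrast reported in the paper.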
[NLP-96] Hybrid Neural-LLM Pipeline for Morphological Glossing in Endangered Language Documentation: A Case Study of Jungar Tuvan
【Quick Read】: This paper targets a key bottleneck in linguistic documentation and fieldwork: creating interlinear glossed text (IGT), especially for low-resource, morphologically rich languages. The key contribution is a two-stage hybrid automatic glossing pipeline: a BiLSTM-CRF neural sequence labeler produces initial segmentation and glosses, followed by large language model (LLM) post-correction, substantially reducing annotation workload. The study finds that retrieval-augmented prompting clearly outperforms random example selection, that morpheme dictionaries paradoxically hurt performance in most cases, and that performance scales roughly logarithmically with the number of few-shot examples, yielding concrete design principles for combining structured prediction models with LLM reasoning in morphologically complex fieldwork.
Link: https://arxiv.org/abs/2603.00923
Authors: Siyu Liang, Talant Mawkanuli, Gina-Anne Levow
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Interlinear glossed text (IGT) creation remains a major bottleneck in linguistic documentation and fieldwork, particularly for low-resource morphologically rich languages. We present a hybrid automatic glossing pipeline that combines neural sequence labeling with large language model (LLM) post-correction, evaluated on Jungar Tuvan, a low-resource Turkic language. Through systematic ablation studies, we show that retrieval-augmented prompting provides substantial gains over random example selection. We further find that morpheme dictionaries paradoxically hurt performance compared to providing no dictionary at all in most cases, and that performance scales approximately logarithmically with the number of few-shot examples. Most significantly, our two-stage pipeline combining a BiLSTM-CRF model with LLM post-correction yields substantial gains for most models, achieving meaningful reductions in annotation workload. Drawing on these findings, we establish concrete design principles for integrating structured prediction models with LLM reasoning in morphologically complex fieldwork contexts. These principles demonstrate that hybrid architectures offer a promising direction for computationally light solutions to automatic linguistic annotation in endangered language documentation.
[NLP-97] Prompt Sensitivity and Answer Consistency of Small Open-Source Large Language Models on Clinical Question Answering: Implications for Low-Resource Healthcare Deployment
【Quick Read】: This paper examines how the reliability of small open-source language models in low-resource healthcare settings is affected by prompt phrasing, focusing on the relationship between output consistency, accuracy, and instruction following. The key contribution is a systematic evaluation of five open-source models (Gemma 2, Phi-3 Mini, Llama 3.2, Mistral, and Meditron-7B) on three clinical QA datasets (MedQA, MedMCQA, PubMedQA) using five prompt styles (original, formal, simplified, roleplay, direct), with all inference run locally and without fine-tuning. The results show that high consistency does not imply high accuracy, that roleplay prompts consistently reduce accuracy, and that domain pretraining alone does not guarantee instruction following in structured clinical QA; Llama 3.2 strikes the best balance between accuracy and reliability, underscoring that safe clinical AI requires jointly evaluating consistency, accuracy, and instruction adherence.
Link: https://arxiv.org/abs/2603.00917
Authors: Shravani Hariprasad
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 30 pages, 7 figures, 2 tables
Abstract:Small open-source language models are gaining attention for low-resource healthcare settings, but their reliability under different prompt phrasings remains poorly understood. We evaluated five open-source models (Gemma 2 2B, Phi-3 Mini 3.8B, Llama 3.2 3B, Mistral 7B, and Meditron-7B domain-pretrained without instruction tuning) across three clinical QA datasets (MedQA, MedMCQA, PubMedQA) using five prompt styles (original, formal, simplified, roleplay, direct). We measured consistency scores, accuracy, and instruction-following failure rates. All inference ran locally on consumer CPU hardware without fine-tuning. Consistency and accuracy were largely independent. Gemma 2 achieved the highest consistency (0.845-0.888) but lowest accuracy (33.0-43.5%), while Llama 3.2 showed moderate consistency (0.774-0.807) with the highest accuracy (49.0-65.0%). Roleplay prompts consistently reduced accuracy across all models, with Phi-3 Mini dropping 21.5 percentage points on MedQA. Meditron-7B exhibited near-complete instruction-following failure on PubMedQA (99.0% UNKNOWN rate), showing domain pretraining alone is insufficient for structured clinical QA. High consistency does not imply correctness. Models can be reliably wrong, a dangerous failure mode in clinical AI. Roleplay prompts should be avoided in healthcare applications. Llama 3.2 showed the strongest balance of accuracy and reliability for low-resource deployment. Safe clinical AI requires joint evaluation of consistency, accuracy, and instruction adherence.
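A simple way to operationalize "consistency across prompt styles" (the paper's exact metric may differ; this is one plausible formulation with made-up answers) is pairwise agreement among the answers a model gives to the same question under each prompt style. Note it is deliberately independent of correctness, which is why a model can be consistently wrong.

```python
# Pairwise-agreement consistency: fraction of prompt-style pairs that yield
# the same answer for one question. Correctness plays no role.
from itertools import combinations

def consistency(answers):
    pairs = list(combinations(answers, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# Five prompt styles, same question (toy answers).
stable = ["B", "B", "B", "B", "B"]   # fully consistent, possibly wrong
mixed = ["B", "B", "A", "C", "B"]    # roleplay/simplified flipped the answer
c_stable = consistency(stable)
c_mixed = consistency(mixed)
```

Reporting this alongside accuracy and instruction-following failures is exactly the joint evaluation the paper argues for.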
[NLP-98] KVSlimmer: Theoretical Insights and Practical Optimizations for Asymmetric KV Merging
【Quick Read】: This paper addresses the significant computational and memory overhead of the Key-Value (KV) cache in Large Language Models (LLMs). Existing KV-merging methods built on empirical observations, as well as gradient-based Hessian approximation techniques, lack theoretical grounding and deliver suboptimal compression and inference latency. The key to the solution is a theoretical framework that characterizes KV asymmetry through the spectral energy distribution of projection weights: the concentrated spectra of Query/Key weights induce feature homogeneity, while the dispersed spectra of Value weights preserve heterogeneity. Building on this, the authors propose the KVSlimmer algorithm, which uses a mathematically exact Hessian formulation and derives a closed-form solution from forward-pass variables alone, achieving efficient, gradient-free compression that markedly reduces memory usage and inference latency while preserving performance.
Link: https://arxiv.org/abs/2603.00907
Authors: Lianjun Liu, Hongli An, Weiqi Yan, Xin Du, Shengchuan Zhang, Huazhong Liu, Yunshan Zhong
Affiliation: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The growing computational and memory demands of the Key-Value (KV) cache significantly limit the ability of Large Language Models (LLMs). While KV merging has emerged as a promising solution, existing methods that rely on empirical observations of KV asymmetry and gradient-based Hessian approximations lack a theoretical foundation and incur suboptimal compression and inference overhead. To bridge these gaps, we establish a theoretical framework that characterizes this asymmetry through the spectral energy distribution of projection weights, demonstrating that concentrated spectra in Query/Key weights induce feature homogeneity, whereas dispersed spectra in Value weights preserve heterogeneity. Then, we introduce KVSlimmer, an efficient algorithm that captures exact Hessian information through a mathematically exact formulation, and derives a closed-form solution utilizing only forward-pass variables, resulting in a gradient-free approach that is both memory- and time-efficient. Extensive experiments across various models and benchmarks demonstrate that KVSlimmer consistently outperforms SOTA methods. For instance, on Llama3.1-8B-Instruct, it improves the LongBench average score by 0.92 while reducing memory costs and latency by 29% and 28%, respectively.
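The spectral-energy argument above can be made concrete with a few lines of NumPy. This is an illustrative sketch, not KVSlimmer's actual procedure: the `spectral_energy_concentration` helper and the toy low-rank/full-rank matrices are our own constructions, used only to show how a concentrated versus dispersed singular-value spectrum would be measured.

```python
import numpy as np

def spectral_energy_concentration(weight: np.ndarray, top_k: int = 8) -> float:
    """Fraction of squared singular-value energy captured by the top_k directions.

    Values near 1.0 indicate a concentrated spectrum (feature homogeneity, as the
    paper argues for Query/Key projections); lower values indicate a dispersed
    spectrum (heterogeneity, as for Value projections).
    """
    s = np.linalg.svd(weight, compute_uv=False)  # singular values, descending
    energy = s ** 2
    return float(energy[:top_k].sum() / energy.sum())

rng = np.random.default_rng(0)
# Toy stand-ins: a rank-8 "Q/K-like" matrix vs. a full-rank "V-like" one.
low_rank = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 64))
full_rank = rng.normal(size=(64, 64))

conc_qk = spectral_energy_concentration(low_rank)   # ~1.0: concentrated
conc_v = spectral_energy_concentration(full_rank)   # much lower: dispersed
```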
[NLP-99] CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning
【Quick Read】: This paper targets three data bottlenecks that hinder training reasoning capabilities in Large Language Models (LLMs) in open, scalable settings: the cold-start problem (a lack of seed datasets with high-quality, long Chain-of-Thought (CoT) trajectories), limited domain coverage (existing open-source reasoning datasets concentrate on mathematics and lack coverage of broader scientific disciplines), and the annotation bottleneck (frontier-level reasoning tasks are difficult to label reliably by hand). The key to the solution is CHIMERA, a compact synthetic reasoning dataset with three core properties: (1) rich, long CoT reasoning trajectories automatically synthesized by state-of-the-art reasoning models; (2) structured, broad coverage spanning 8 major scientific disciplines and over 1K fine-grained topics organized via a model-generated hierarchical taxonomy; and (3) a fully automated, scalable evaluation pipeline that uses strong reasoning models to cross-validate problem validity and answer correctness. Experiments show that post-training a 4B-parameter Qwen3 model on CHIMERA approaches or matches the reasoning performance of far larger models (such as DeepSeek-R1 and Qwen3-235B) on multiple challenging reasoning benchmarks.
Link: https://arxiv.org/abs/2603.00889
Authors: Xinyu Zhu, Yihao Feng, Yanchao Sun, Xianzhi Du, Pingzhi Li, Olli Saarikivi, Yun Zhu, Yu Meng
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) have recently exhibited remarkable reasoning capabilities, largely enabled by supervised fine-tuning (SFT)- and reinforcement learning (RL)-based post-training on high-quality reasoning data. However, reproducing and extending these capabilities in open and scalable settings is hindered by three fundamental data-centric challenges: (1) the cold-start problem, arising from the lack of seed datasets with detailed, long Chain-of-Thought (CoT) trajectories needed to initialize reasoning policies; (2) limited domain coverage, as most existing open-source reasoning datasets are concentrated in mathematics, with limited coverage of broader scientific disciplines; and (3) the annotation bottleneck, where the difficulty of frontier-level reasoning tasks makes reliable human annotation prohibitively expensive or infeasible. To address these challenges, we introduce CHIMERA, a compact synthetic reasoning dataset comprising 9K samples for generalizable cross-domain reasoning. CHIMERA is constructed with three key properties: (1) it provides rich, long CoT reasoning trajectories synthesized by state-of-the-art reasoning models; (2) it has broad and structured coverage, spanning 8 major scientific disciplines and over 1K fine-grained topics organized via a model-generated hierarchical taxonomy; and (3) it employs a fully automated, scalable evaluation pipeline that uses strong reasoning models to cross-validate both problem validity and answer correctness. We use CHIMERA to post-train a 4B Qwen3 model. Despite the dataset’s modest size, the resulting model achieves strong performance on a suite of challenging reasoning benchmarks, including GPQA-Diamond, AIME 24/25/26, HMMT 25, and Humanity’s Last Exam, approaching or matching the reasoning performance of substantially larger models such as DeepSeek-R1 and Qwen3-235B.
[NLP-100] CIRCUS: Circuit Consensus under Uncertainty via Stability Ensembles
【Quick Read】: This paper tackles the brittleness of mechanistic circuit discovery under arbitrary analyst choices (such as pruning thresholds and feature dictionaries), which, in the absence of uncertainty quantification, often yields unreliable "one-shot" explanations. The key to the solution is to reframe circuit discovery as uncertainty quantification over these analytic degrees of freedom: the proposed CIRCUS method prunes a single raw attribution run under multiple configurations to build an ensemble of attribution graphs, assigns each edge a stability score (the fraction of configurations that retain it), and extracts a strict-consensus circuit consisting only of edges shared by all views. This strategy yields a threshold-robust "core" circuit, explicitly surfaces contingent alternatives, and supports rejecting low-agreement noisy structure, all without retraining and at negligible computational cost; experiments on Gemma-2-2B and Llama-3.2-1B confirm its strong explanatory power and causal relevance.
Link: https://arxiv.org/abs/2603.00523
Authors: Swapnil Parekh
Affiliation: Intuit
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Mechanistic circuit discovery is notoriously sensitive to arbitrary analyst choices, especially pruning thresholds and feature dictionaries, often yielding brittle “one-shot” explanations with no principled notion of uncertainty. We reframe circuit discovery as an uncertainty-quantification problem over these analytic degrees of freedom. Our method, CIRCUS, constructs an ensemble of attribution graphs by pruning a single raw attribution run under multiple configurations, assigns each edge a stability score (the fraction of configurations that retain it), and extracts a strict-consensus circuit consisting only of edges that appear in all views. This produces a threshold-robust “core” circuit while explicitly surfacing contingent alternatives and enabling rejection of low-agreement structure. CIRCUS requires no retraining and adds negligible overhead, since it aggregates structure across already-computed pruned graphs. On Gemma-2-2B and Llama-3.2-1B, strict consensus circuits are ~40x smaller than the union of all configurations while retaining comparable influence-flow explanatory power, and they outperform a same-edge-budget baseline (union pruned to match the consensus size). We further validate causal relevance with activation patching, where consensus-identified nodes consistently beat matched non-consensus controls (p=0.0004). Overall, CIRCUS provides a practical, uncertainty-aware framework for reporting trustworthy, auditable mechanistic circuits with an explicit core/contingent/noise decomposition.
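The edge-stability and strict-consensus computation described above reduces to simple set arithmetic over per-configuration pruned graphs. A minimal sketch (the function names and toy edge sets are ours, not from the paper):

```python
from itertools import chain

def stability_scores(pruned_graphs):
    """pruned_graphs: list of edge sets, one per pruning configuration.
    Returns {edge: fraction of configurations that retain the edge}."""
    n = len(pruned_graphs)
    all_edges = set(chain.from_iterable(pruned_graphs))
    return {e: sum(e in g for g in pruned_graphs) / n for e in all_edges}

def strict_consensus(pruned_graphs):
    """Edges retained under every configuration (stability score 1.0)."""
    return {e for e, s in stability_scores(pruned_graphs).items() if s == 1.0}

# Three pruning configurations of the same raw attribution run.
views = [{"a->b", "b->c", "c->d"},
         {"a->b", "b->c"},
         {"a->b", "b->c", "x->y"}]
core = strict_consensus(views)  # the threshold-robust "core" circuit
```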
[NLP-101] When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation
【Quick Read】: This paper addresses the insufficient reliability of Large Language Models (LLMs) in healthcare settings, especially the risk of misleading answers to medical queries. The key to the solution is supervised fine-tuning of Llama 2 7B on transcripts of real patient-doctor conversations to strengthen the model's accuracy and professionalism in the medical domain. The approach emphasizes capturing domain-specific context and terminology to improve the quality of answers to medical questions. Although resource constraints prevented a comprehensive expert human evaluation, the study uses text-similarity metrics to show significant improvements of the fine-tuned model on most dimensions, while noting that future results should be authoritatively judged by clinical experts.
Link: https://arxiv.org/abs/2603.00314
Authors: Bian Sun, Zhenjian Wang, Orvill de la Torre, Zirui Wang
Affiliation: Carnegie Mellon University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper details the baseline model selection, fine-tuning process, evaluation methods, and the implications of deploying more accurate LLMs in healthcare settings. As large language models (LLMs) are increasingly employed to address diverse problems, including medical queries, concerns about their reliability have surfaced. A recent study by Long Island University highlighted that LLMs often perform poorly in medical contexts, potentially leading to harmful misguidance for users. To address this, our research focuses on fine-tuning the Llama 2 7B, a transformer-based, decoder-only model, using transcripts from real patient-doctor interactions. Our objective was to enhance the model’s accuracy and precision in responding to medical queries. We fine-tuned the model using a supervised approach, emphasizing domain-specific nuances captured in the training data. In the best scenario, the model results should be reviewed and evaluated by real medical experts. Due to resource constraints, the performance of the fine-tuned model was evaluated using text similarity metrics. The fine-tuned model demonstrated significant improvements across all key dimensions except GPT-4’s evaluation. The evaluations of ChatGPT4 are quite different from the quantitative results; here, we not only suggest, but also propose that the result should be evaluated by human medical experts.
[NLP-102] From Prerequisites to Predictions: Validating a Geometric Hallucination Taxonomy Through Controlled Induction
【Quick Read】: This paper addresses the difficulty of distinguishing hallucination types in generative AI models, in particular identifying them through geometric signatures. The core solution is a geometric hallucination taxonomy that classifies failures into three types: center-drift (Type 1), wrong-well convergence (Type 2), and coverage gaps (Type 3). The taxonomy is validated through controlled induction experiments in GPT-2, using a two-level statistical design (with prompts as the unit of inference, N = 15 per group) repeated over 20 different generation seeds. The key finding is that only Type 3 coverage gaps show stable, significant geometric separation in static embedding space (significant in 18/20 runs, 14/20 after Holm correction), with the difference carried by magnitude rather than direction; Types 1 and 2 do not separate in either space, indicating that their non-separation is genuine rather than an artifact of limited statistical power. Moreover, token-level tests inflate significance by 4-16x through pseudoreplication, further underscoring the importance of a correct experimental design.
Link: https://arxiv.org/abs/2603.00307
Authors: Matic Korun
Affiliation: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 2 figures, appendices (reproducibility, sample generation, additional figures)
Abstract:We test whether a geometric hallucination taxonomy – classifying failures as center-drift (Type 1), wrong-well convergence (Type 2), or coverage gaps (Type 3) – can distinguish hallucination types through controlled induction in GPT-2. Using a two-level statistical design with prompts (N = 15/group) as the unit of inference, we run each experiment 20 times with different generation seeds to quantify result stability. In static embeddings, Type 3 norm separation is robust (significant in 18/20 runs, Holm-corrected in 14/20, median r = +0.61). In contextual hidden states, the Type 3 norm effect direction is stable (19/20 runs) but underpowered at N = 15 (significant in 4/20, median r = -0.28). Types 1 and 2 do not separate in either space (≤ 3/20 runs). Token-level tests inflate significance by 4–16× through pseudoreplication – a finding replicated across all 20 runs. The results establish coverage-gap hallucinations as the most geometrically distinctive failure mode, carried by magnitude rather than direction, and confirm the Type 1/2 non-separation as genuine at 124M parameters.
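The pseudoreplication point, that the prompt rather than the token is the unit of inference, can be illustrated by collapsing token-level measurements to prompt-level means before any significance test. A hedged sketch (the helper and the toy numbers are ours, not the paper's data):

```python
import numpy as np

def prompt_level_means(values, prompt_ids):
    """Collapse token-level measurements to one mean per prompt, so the
    statistical unit of inference is the prompt, not the (pseudoreplicated)
    token."""
    values = np.asarray(values, dtype=float)
    prompt_ids = np.asarray(prompt_ids)
    return np.array([values[prompt_ids == p].mean()
                     for p in np.unique(prompt_ids)])

# 3 prompts x 4 tokens each: a token-level test would use n=12 and overstate
# the evidence; the prompt level gives the honest sample size n=3.
norms = [1.0, 1.1, 0.9, 1.0,  2.0, 2.1, 1.9, 2.0,  3.0, 3.1, 2.9, 3.0]
prompts = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
per_prompt = prompt_level_means(norms, prompts)
```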
[NLP-103] Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning
【Quick Read】: This paper addresses the "overthinking" problem that arises as large reasoning models scale up test-time computation: unnecessarily long chains-of-thought substantially raise cost without improving accuracy. Existing reinforcement learning methods typically rely on a single outcome reward with trajectory-level length penalties, which cannot distinguish essential from redundant reasoning steps and therefore yield only blunt compression. The key to the solution is Step-wise Adaptive Penalization (SWAP), a fine-grained step-level mechanism that quantifies each step's importance by estimating its contribution to the policy log-probability improvement toward the correct answer, then redistributes excess length as a penalty mass onto low-importance steps while preserving high-importance reasoning. The method is optimized with a unified outcome-process advantage within group-relative policy optimization; experiments show it reduces reasoning length by 64.3% on average while improving accuracy by 5.7% relative to the base model.
Link: https://arxiv.org/abs/2603.00296
Authors: Xintong Li, Sha Li, Rongmei Lin, Hongye Jin, Linwei Li, Hejie Cui, Sarah Zhang, Chia-Yuan Chang, Kewei Cheng, Besnik Fetahu, Priyanka Nigam, Jingbo Shang, Bing Yin
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint
Abstract:Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy. Prior reinforcement learning approaches typically rely on a single outcome reward with trajectory-level length penalties, which cannot distinguish essential from redundant reasoning steps and therefore yield blunt compression. Although recent work incorporates step-level signals, such as offline pruning, supervised data construction, or verifier-based intermediate rewards, reasoning length is rarely treated as an explicit step-level optimization objective during RL. We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution. We estimate step importance from the model’s on-policy log-probability improvement toward the correct answer, then treat excess length as a penalty mass redistributed to penalize low-importance steps more heavily while preserving high-importance reasoning. We optimize with a unified outcome-process advantage within group-relative policy optimization. Extensive experiments demonstrate that SWAP reduces reasoning length by 64.3% on average while improving accuracy by 5.7% relative to the base model.
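One plausible reading of SWAP's penalty redistribution is a softmax over negative step importance, so that low-importance steps absorb most of the excess-length penalty mass. The sketch below is our illustrative allocation under that assumption, not the paper's exact formulation:

```python
import numpy as np

def swap_penalties(step_importance, excess_length, temperature=1.0):
    """Distribute a total length-penalty mass across steps, weighting
    low-importance steps more heavily (softmax of negative importance)."""
    imp = np.asarray(step_importance, dtype=float)
    logits = -imp / temperature
    weights = np.exp(logits - logits.max())   # numerically stable softmax
    weights /= weights.sum()
    return excess_length * weights

# Three steps: important (2.0), nearly redundant (0.1), moderate (1.0);
# 30 tokens of excess length to penalize.
penalties = swap_penalties([2.0, 0.1, 1.0], excess_length=30.0)
```

The total penalty mass is conserved, but the nearly redundant step receives most of it, while the most important step is penalized least.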
[NLP-104] Your Inference Request Will Become a Black Box: Confidential Inference for Cloud-based Large Language Models
【Quick Read】: This paper addresses the privacy leakage of sensitive client data (such as prompts and responses) during inference with cloud-hosted Large Language Models (LLMs), while ensuring that model intellectual property, inference performance, and computational efficiency are not sacrificed. The key to the proposed Talaria framework is to partition the LLM inference pipeline into weight-independent operations executed within a client-controlled Confidential Virtual Machine (CVM) and weight-dependent computations executed on cloud GPUs, with the interaction secured by a Reversible Masked Outsourcing (ReMO) protocol that uses hybrid masking to reversibly obscure data before outsourcing. This guarantees output identical to the original model while effectively defending against state-of-the-art token inference attacks, reducing token reconstruction accuracy from over 97.5% to an average of 1.34%.
Link: https://arxiv.org/abs/2603.00196
Authors: Chung-ju Huang, Huiqiang Zhao, Yuanpeng He, Lijian Li, Wenpin Jiao, Zhi Jin, Peixuan Chen, Leye Wang
Affiliation: Peking University; Tencent; University of Macau
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 19 pages, 5 figures
Abstract:The increasing reliance on cloud-hosted Large Language Models (LLMs) exposes sensitive client data, such as prompts and responses, to potential privacy breaches by service providers. Existing approaches fail to ensure privacy, maintain model performance, and preserve computational efficiency simultaneously. To address this challenge, we propose Talaria, a confidential inference framework that partitions the LLM pipeline to protect client data without compromising the cloud’s model intellectual property or inference quality. Talaria executes sensitive, weight-independent operations within a client-controlled Confidential Virtual Machine (CVM) while offloading weight-dependent computations to the cloud GPUs. The interaction between these environments is secured by our Reversible Masked Outsourcing (ReMO) protocol, which uses a hybrid masking technique to reversibly obscure intermediate data before outsourcing computations. Extensive evaluations show that Talaria can defend against state-of-the-art token inference attacks, reducing token reconstruction accuracy from over 97.5% to an average of 1.34%, all while being a lossless mechanism that guarantees output identical to the original model without significantly decreasing efficiency and scalability. To the best of our knowledge, this is the first work that ensures clients’ prompts and responses remain inaccessible to the cloud, while also preserving model privacy, performance, and efficiency.
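To see why masked outsourcing can be lossless for weight-dependent linear computation, consider the toy sketch below. It only checks the algebraic identity (x + m)W - mW = xW; the real ReMO protocol is a hybrid scheme in which the client obtains the mask-correction term without learning the cloud's weights, which this sketch deliberately does not model:

```python
import numpy as np

rng = np.random.default_rng(7)

# Client side (inside the CVM): draw a one-time additive mask and
# obscure the sensitive intermediate activations before outsourcing.
hidden = rng.normal(size=(4, 8))
m = rng.normal(scale=10.0, size=hidden.shape)
masked = hidden + m                  # the only thing that leaves the CVM

# Cloud side: weight-dependent computation on obscured input only.
W = rng.normal(size=(8, 8))          # cloud-held projection weights
cloud_out = masked @ W

# Client side: remove the mask's contribution. (Computing m @ W directly
# here is a simplification; a real protocol must supply this correction
# without revealing W to the client.)
client_out = cloud_out - m @ W
true_out = hidden @ W                # what unprotected inference would give
```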
[NLP-105] RLShield: Practical Multi-Agent RL for Financial Cyber Defense with Attack-Surface MDPs and Real-Time Response Orchestration
【Quick Read】: This paper addresses the dynamism and complexity of cyber-incident response in the financial sector: traditional security tools based on fixed rules or static playbooks adapt poorly to changing attacker behavior and fail to model real-world constraints such as response cost, service disruption, and coordination across many assets. The key to the solution is RLShield, a multi-agent reinforcement learning (MARL) framework for financial settings that models the enterprise attack surface as a Markov Decision Process (MDP), where states summarize alerts, asset exposure, and service-health indicators, and actions correspond to real response steps (e.g., isolating a host, rotating credentials, rate-limiting an API). The method learns coordinated policies across multiple assets or service groups by optimizing a risk-sensitive objective that balances containment speed, business disruption, and response cost, and introduces a game-aware evaluation that tests operational outcomes against adaptive attackers. Within a fixed response budget, it significantly reduces time-to-containment and residual exposure, outperforming static rule baselines and single-agent reinforcement learning.
Link: https://arxiv.org/abs/2603.00186
Authors: Srikumar Nayak
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 6 pages, 2 figures and 2 tables
Abstract:Financial systems run nonstop and must stay reliable even during cyber incidents. Modern attacks move across many services (apps, APIs, identity, payment rails), so defenders must make a sequence of actions under time pressure. Most security tools still use fixed rules or static playbooks, which can be slow to adapt when the attacker changes behavior. Reinforcement learning (RL) is a good fit for sequential decisions, but much of the RL-in-finance literature targets trading and does not model real cyber response limits such as action cost, service disruption, and defender coordination across many assets. This paper proposes RLShield, a practical multi-agent RL pipeline for financial cyber defense. We model the enterprise attack surface as a Markov decision process (MDP) where states summarize alerts, asset exposure, and service health, and actions represent real response steps (e.g., isolate a host, rotate credentials, ratelimit an API, block an account, or trigger recovery). RLShield learns coordinated policies across multiple agents (assets or service groups) and optimizes a risk-sensitive objective that balances containment speed, business disruption, and response cost. We also include a game-aware evaluation that tests policies against adaptive attackers and reports operational outcomes, not only reward. Experiments show that RLShield reduces time-to-containment and residual exposure while keeping disruption within a fixed response budget, outperforming static rule baselines and single-agent RL under the same constraints. These results suggest that multi-agent, cost-aware RL can provide a deployable layer for automated response in financial security operations.
[NLP-106] LIDS: LLM Summary Inference Under the Layered Lens
【Quick Read】: This paper addresses the difficulty of evaluating the quality of summaries generated by large language models (LLMs), where the complexity of natural language makes summary accuracy hard to quantify. The key to the solution is LIDS, a method that combines a BERT-SVD-based direction metric with the SOFARI procedure, using BERT embeddings and repeated prompts to quantify statistical uncertainty. It establishes a directional similarity measure between summaries and the original text in semantic space and identifies interpretable key words associated with each latent theme under controlled false discovery rate, enabling efficient, interpretable assessment of summary accuracy.
Link: https://arxiv.org/abs/2603.00105
Authors: Dylan Park, Yingying Fan, Jinchi Lv
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: 48 pages, 15 figures
Abstract:Large language models (LLMs) have gained significant attention by many researchers and practitioners in natural language processing (NLP) since the introduction of ChatGPT in 2022. One notable feature of ChatGPT is its ability to generate summaries based on prompts. Yet evaluating the quality of these summaries remains challenging due to the complexity of language. To this end, in this paper we suggest a new method of LLM summary inference with BERT-SVD-based direction metric and SOFARI (LIDS) that assesses the summary accuracy equipped with interpretable key words for layered themes. The LIDS uses a latent SVD-based direction metric to measure the similarity between the summaries and original text, leveraging the BERT embeddings and repeated prompts to quantify the statistical uncertainty. As a result, LIDS gives a natural embedding of each summary for large text reduction. We further exploit SOFARI to uncover important key words associated with each latent theme in the summary with controlled false discovery rate (FDR). Comprehensive empirical studies demonstrate the practical utility and robustness of LIDS through human verification and comparisons to other similarity metrics, including a comparison of different LLMs.
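An SVD-based direction metric can be sketched as comparing principal directions of centered embedding matrices. This is a generic illustration on synthetic "embeddings" of our own making; LIDS's actual metric and its uncertainty quantification are considerably richer:

```python
import numpy as np

def principal_direction(emb):
    """Top right-singular vector of a (tokens x dim) embedding matrix."""
    _, _, vt = np.linalg.svd(emb - emb.mean(axis=0), full_matrices=False)
    return vt[0]

def direction_similarity(a, b):
    """|cosine| between principal directions (a singular vector's sign is arbitrary)."""
    return abs(float(principal_direction(a) @ principal_direction(b)))

rng = np.random.default_rng(1)
sem_axis = np.zeros(16)
sem_axis[0] = 1.0
# A source text whose embeddings vary mostly along one semantic axis,
# a faithful summary sharing that axis, and an unrelated text.
source = 5.0 * rng.normal(size=(20, 1)) * sem_axis + 0.1 * rng.normal(size=(20, 16))
summary = 5.0 * rng.normal(size=(8, 1)) * sem_axis + 0.1 * rng.normal(size=(8, 16))
unrelated = rng.normal(size=(30, 16))

sim_good = direction_similarity(source, summary)
sim_bad = direction_similarity(source, unrelated)
```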
[NLP-107] Iterative LLM -based improvement for French Clinical Interview Transcription and Speaker Diarization
【Quick Read】: This paper addresses the low accuracy of automatic speech recognition (ASR) for French clinical conversations, where word error rates (WER) often exceed 30% on spontaneous clinical speech. The key to the solution is a multi-pass large language model (LLM) post-processing architecture that alternates between Speaker Recognition and Word Recognition passes, refining the transcript and attributing speakers precisely. Experiments on two French clinical datasets (suicide-prevention telephone counseling and preoperative awake-neurosurgery consultations) show significant reductions in word diarization error rate (WDER) at manageable computational cost (a real-time factor, RTF, around 0.32), suggesting feasibility for deployment in offline clinical settings.
Link: https://arxiv.org/abs/2603.00086
Authors: Ambre Marie (LaTIM), Thomas Bertin (DySoLab), Guillaume Dardenne (LaTIM), Gwenolé Quellec (LaTIM)
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Automatic speech recognition for French medical conversations remains challenging, with word error rates often exceeding 30% in spontaneous clinical speech. This study proposes a multi-pass LLM post-processing architecture alternating between Speaker Recognition and Word Recognition passes to improve transcription accuracy and speaker attribution. Ablation studies on two French clinical datasets (suicide prevention telephone counseling and preoperative awake neurosurgery consultations) investigate four design choices: model selection, prompting strategy, pass ordering, and iteration depth. Using Qwen3-Next-80B, Wilcoxon signed-rank tests confirm significant WDER reductions on suicide prevention conversations (p < 0.05, n=18), while maintaining stability on awake neurosurgery consultations (n=10), with zero output failures and acceptable computational cost (RTF < 0.32), suggesting feasibility for offline clinical deployment.
[NLP-108] Linguistic Uncertainty and Engagement in Arabic-Language X (formerly Twitter) Discourse
【Quick Read】: This paper asks whether, and how, linguistic uncertainty in social media discourse relates to user engagement (likes, retweets, and replies) outside English-language contexts, specifically for Arabic-language content. The key to the solution is a lexicon-based, context-sensitive classifier that identifies uncertainty markers in Arabic tweets, combined with a large-scale empirical analysis linking these uncertainty expressions to different forms of engagement. The study finds that uncertain tweets receive 51.5% higher mean total engagement than certainty-marked tweets, and regression models controlling for multiple covariates confirm a significant positive association (β = 0.221, p < 0.001), strongest for replies, suggesting that linguistic uncertainty acts as an interactional cue that encourages more conversational forms of participation.
Link: https://arxiv.org/abs/2603.00082
Authors: Mohamed Soufan
Affiliation: Unknown
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: 15 pages, 1 figure, 1 table
Abstract:Linguistic uncertainty is a common feature of social media discourse, yet its relationship with user engagement remains underexplored, particularly in non-English contexts. Using a dataset of 16,695 Arabic-language tweets about Lebanon posted over a 35-day period, we examine whether tweets expressing linguistic uncertainty receive different levels and forms of engagement compared to certainty-marked tweets. We develop a lexicon-based, context-sensitive classifier to identify uncertainty markers and classify 29.9% of tweets as uncertain. Descriptive analyses indicate that uncertain tweets exhibit 51.5% higher mean total engagement (likes, retweets, and replies). Regression models controlling for tweet length, URL presence, and account verification status confirm a positive association between uncertainty and engagement (β = 0.221, SE = 0.044, p < 0.001), corresponding to approximately 25% higher expected engagement. The association is strongest for replies, followed by retweets and likes, suggesting a shift toward more conversational forms of engagement. Results are robust to alternative model specifications and adjustments for within-account correlation. These findings suggest that linguistic uncertainty may function as an interactional cue that encourages participatory engagement in Arabic-language social media discourse. The study contributes computational approaches for modeling linguistic features in large-scale, non-English digital communication.
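A lexicon-based uncertainty flag is straightforward to sketch. The markers below are English placeholders chosen purely for illustration; the paper's classifier uses an Arabic lexicon together with context-sensitivity rules:

```python
import re

# Hypothetical English stand-ins for the paper's Arabic uncertainty lexicon.
UNCERTAINTY_MARKERS = {"maybe", "perhaps", "possibly", "might", "reportedly"}

def is_uncertain(tweet: str) -> bool:
    """Flag a tweet as uncertain if any lexicon marker appears as a token."""
    tokens = re.findall(r"[\w']+", tweet.lower())
    return any(t in UNCERTAINTY_MARKERS for t in tokens)

tweets = ["The road is closed.",
          "The road might be closed, reportedly."]
flags = [is_uncertain(t) for t in tweets]
```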
[NLP-109] Autorubric: A Unified Framework for Rubric-Based LLM Evaluation
【Quick Read】: This paper addresses the fragmentation, inconsistent terminology, and partial solutions in current rubric-based evaluation of text generation with large language models (LLMs). The core solution is Autorubric, a unified, modular evaluation framework supporting: binary, ordinal, and nominal criteria with configurable weights; single-judge and multi-judge ensemble evaluation (with majority, weighted, unanimous, and any-vote aggregation); few-shot calibration via verdict-balanced sampling for more consistent evaluation; and mitigations for position bias (option shuffling), verbosity bias (length penalties), and criterion conflation (per-criterion atomic evaluation with natural-language explanations). Autorubric also integrates psychometric reliability metrics (such as Cohen's κ, weighted κ, correlation coefficients, and distribution-level tests) and production-grade infrastructure (response caching, resumable runs, multi-provider rate limiting, and cost tracking). It is validated on three benchmarks spanning educational assessment, deep-research evaluation, and chatbot quality assessment, and additionally contributes the CHARM-100 dataset for stress-testing rubric evaluation under heterogeneous criteria.
Link: https://arxiv.org/abs/2603.00077
Authors: Delip Rao, Chris Callison-Burch
Affiliation: University of Pennsylvania
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 43 pages
Abstract:Rubric-based evaluation with large language models (LLMs) has become standard practice for assessing text generation at scale, yet the underlying techniques are scattered across papers with inconsistent terminology and partial solutions. We present a unified framework: each identified technique is paired with its realization in Autorubric, an open-source Python framework proposed in this paper. Autorubric supports binary, ordinal, and nominal criteria with configurable weights; single-judge and multi-judge ensemble evaluation with majority, weighted, unanimous, and any-vote aggregation; few-shot calibration with verdict-balanced sampling; and mitigations for position bias (option shuffling), verbosity bias (length penalties), and criterion conflation (per-criterion atomic evaluation with natural language explanations). The framework provides reliability metrics drawn from psychometrics (Cohen's κ, weighted κ, correlation coefficients, and distribution-level tests) alongside production infrastructure including response caching, checkpointing with resumable runs, multi-provider rate limiting, and cost tracking. We evaluate Autorubric on three benchmarks spanning educational assessment, deep research evaluation, and chatbot quality assessment, demonstrating that it produces results consistent with published benchmarks while exercising the framework's key capabilities: per-criterion binary evaluation with few-shot calibration (RiceChem), multi-judge ensemble evaluation across judge models (ResearcherBench), and mixed criterion types combining binary, ordinal, and nominal scales (CHARM-100). We also contribute CHARM-100, a 100-sample chatbot evaluation dataset with per-sample ground truth labels across all three criterion types, designed to stress-test rubric evaluation frameworks on heterogeneous criteria.
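The ensemble aggregation rules named above (majority, weighted, unanimous, any-vote) over binary judge verdicts can be sketched in a few lines. The function and argument names are ours, not Autorubric's actual API:

```python
from collections import Counter

def aggregate(verdicts, rule="majority", weights=None):
    """Combine binary judge verdicts (True/False) under one ensemble rule."""
    if rule == "any":
        return any(verdicts)                  # pass if any judge passes
    if rule == "unanimous":
        return all(verdicts)                  # pass only if all judges pass
    if rule == "weighted":
        w = weights or [1.0] * len(verdicts)
        yes = sum(wi for v, wi in zip(verdicts, w) if v)
        return yes > sum(w) / 2               # weighted majority
    counts = Counter(verdicts)                # default: simple majority
    return counts[True] > counts[False]

votes = [True, True, False]                   # three judges, one dissent
```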
[NLP-110] How effective are VLMs in assisting humans in inferring the quality of mental models from Multimodal short answers?
【Quick Read】: This paper addresses how to automatically infer the quality of students' mental models in science, technology, engineering, and mathematics (STEM) from their multimodal answers. Traditional grading only marks answers right or wrong, overlooking the depth of students' conceptual understanding and their ability to apply and integrate concepts across contexts. The key to the proposed MMGrader approach is to use concept graphs as an analytical framework, mapping students' multimodal responses onto structured knowledge representations so that the quality of their mental models can be assessed quantitatively. The central innovation lies in combining concept graphs with deep learning models to capture the semantic relations and reasoning in students' answers, thereby improving the accuracy and interpretability of automated scoring.
Link: https://arxiv.org/abs/2603.00056
Authors: Pritam Sil, Durgaprasad Karnam, Vinay Reddy Venumuddala, Pushpak Bhattacharyya
Affiliation: IIT Bombay; Mahindra University
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:STEM Mental models can play a critical role in assessing students’ conceptual understanding of a topic. They not only offer insights into what students know but also into how effectively they can apply, relate to, and integrate concepts across various contexts. Thus, students’ responses are critical markers of the quality of their understanding and not entities that should be merely graded. However, inferring these mental models from student answers is challenging as it requires deep reasoning skills. We propose MMGrader, an approach that infers the quality of students’ mental models from their multimodal responses using concept graphs as an analytical framework. In our evaluation with 9 openly available models, we found that the best-performing models fall short of human-level performance. This is because they only achieved an accuracy of approximately 40%, a prediction error of 1.1 units, and a scoring distribution fairly aligned with human scoring patterns. With improved accuracy, these can be highly effective assistants to teachers in inferring the mental models of their entire classrooms, enabling them to do so efficiently and help improve their pedagogies more effectively by designing targeted help sessions and lectures that strengthen areas where students collectively demonstrate lower proficiency.
Information Retrieval
[IR-0] Scaling Retrieval Augmented Generation with RAG Fusion: Lessons from an Industry Deployment
【Quick Read】: This paper examines whether retrieval fusion techniques, such as multi-query retrieval and Reciprocal Rank Fusion (RRF), remain effective for Retrieval-Augmented Generation (RAG) systems in real production environments. While such methods improve recall on isolated retrieval benchmarks, it has been unclear whether they improve downstream question-answering quality under fixed retrieval depth, re-ranking budgets, and latency constraints. The key finding is that although retrieval fusion raises raw recall, these gains are largely neutralized by re-ranking and context truncation in practice, and fusion adds latency without corresponding effectiveness gains; the paper therefore argues for evaluation frameworks that jointly consider retrieval quality, system efficiency, and downstream impact, rather than purely recall-oriented optimization.
Link: https://arxiv.org/abs/2603.02153
Authors: Luigi Medrano, Arush Verma, Mukul Chhabra
Affiliation: Dell Technologies
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Retrieval-Augmented Generation (RAG) systems commonly adopt retrieval fusion techniques such as multi-query retrieval and reciprocal rank fusion (RRF) to increase document recall, under the assumption that higher recall leads to better answer quality. While these methods show consistent gains in isolated retrieval benchmarks, their effectiveness under realistic production constraints remains underexplored. In this work, we evaluate retrieval fusion in a production-style RAG pipeline operating over an enterprise knowledge base, with fixed retrieval depth, re-ranking budgets, and latency constraints. Across multiple fusion configurations, we find that retrieval fusion does increase raw recall, but these gains are largely neutralized after re-ranking and truncation. In our setting, fusion variants fail to outperform single-query baselines on KB-level Top-k accuracy, with Hit@10 decreasing from 0.51 to 0.48 in several configurations. Moreover, fusion introduces additional latency overhead due to query rewriting and larger candidate sets, without corresponding improvements in downstream effectiveness. Our analysis suggests that recall-oriented fusion techniques exhibit diminishing returns once realistic re-ranking limits and context budgets are applied. We conclude that retrieval-level improvements do not reliably translate into end-to-end gains in production RAG systems, and argue for evaluation frameworks that jointly consider retrieval quality, system efficiency, and downstream impact.
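For reference, standard reciprocal rank fusion, one of the techniques evaluated above, scores each document as the sum of 1/(k + rank) over the ranked lists and re-sorts by the fused score (k = 60 is the conventional constant):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked doc-id lists, one per query variant.
    Standard RRF: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Three ranked lists, e.g. from three rewritten variants of one query.
fused = reciprocal_rank_fusion([["d1", "d2", "d3"],
                                ["d2", "d1", "d4"],
                                ["d2", "d5", "d1"]])
```

Here "d2", ranked first by two of the three lists, wins the fused ranking even though "d1" tops the first list.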
[IR-1] NextAds: Towards Next-generation Personalized Video Advertising
【Quick Read】: This paper addresses the ineffectiveness of one-size-fits-all creatives in personalized video advertising: for diverse users and viewing contexts, a static, finite inventory of professionally pre-produced ad creatives cannot deliver fine-grained, timely personalization. The key to the solution is NextAds, a generation-based paradigm for next-generation personalized video advertising built on generative AI, which dynamically generates and integrates ad content at serving time. It thereby moves beyond the limits of traditional retrieval-based systems in creative granularity and iterative refinement, enabling more efficient, flexible, and continuously optimizable personalized advertising strategies.
Link: https://arxiv.org/abs/2603.02137
Authors: Yiyan Xu, Ruoxuan Xia, Wuqiang Zheng, Fengbin Zhu, Wenjie Wang, Fuli Feng
Affiliation: Unknown
Subjects: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:With the rapid growth of online video consumption, video advertising has become increasingly dominant in the digital advertising landscape. Yet diverse users and viewing contexts make one-size-fits-all ad creatives insufficient for consistent effectiveness, underlining the importance of personalization. In practice, most personalized video advertising systems follow a retrieval-based paradigm, selecting the optimal one from a small set of professionally pre-produced creatives for each user. Such a static and finite inventory limits both the granularity and the timeliness of personalization, and prevents the creatives from being continuously refined based on online user feedback. Recent advances in generative AI make it possible to move beyond retrieval toward optimizing video creatives in a continuous space at serving time. In this light, we propose NextAds, a generation-based paradigm for next-generation personalized video advertising, and conceptualize NextAds with four core components. To enable comparable research progress, we formulate two representative tasks: personalized creative generation and personalized creative integration, and introduce corresponding lightweight benchmarks. To assess feasibility, we instantiate end-to-end pipelines for both tasks and conduct initial exploratory experiments, demonstrating that GenAI can generate and integrate personalized creatives with encouraging performance. Moreover, we discuss the key challenges and opportunities under this paradigm, aiming to provide actionable insights for both researchers and practitioners and to catalyze progress in personalized video advertising.
[IR-2] OmniRet: Efficient and High-Fidelity Omni Modality Retrieval CVPR2026
【速读】:该论文旨在解决当前多模态检索(Multimodal Retrieval)模型在处理跨三种关键模态(文本、视觉和音频)的复合查询时存在的两大核心问题:一是计算效率低下,二是表示保真度不足。针对计算效率问题,作者提出一种基于注意力机制的重采样方法,将来自不同模态编码器的大量token序列压缩为固定大小的紧凑表示,从而降低输入到大语言模型(LLM)的负担;针对表示保真度问题,设计了注意力切片Wasserstein池化(Attention Sliced Wasserstein Pooling)策略,在嵌入空间中保留细粒度信息,提升跨模态表示质量。这两个关键技术共同推动了通用检索系统的发展,使模型能够更高效且准确地理解与检索多模态复合查询。
链接: https://arxiv.org/abs/2603.02098
作者: Chuong Huynh,Manh Luong,Abhinav Shrivastava
机构: University of Maryland, College Park, USA (马里兰大学学院公园分校,美国); Monash University, Australia (莫纳什大学,澳大利亚)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026. Project link: this https URL
Abstract:Multimodal retrieval is the task of aggregating information from queries across heterogeneous modalities to retrieve desired targets. State-of-the-art multimodal retrieval models can understand complex queries, yet they are typically limited to two modalities: text and vision. This limitation impedes the development of universal retrieval systems capable of comprehending queries that combine more than two modalities. To advance toward this goal, we present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio. Our OmniRet model addresses two critical challenges for universal retrieval: computational efficiency and representation fidelity. First, feeding massive token sequences from modality-specific encoders to Large Language Models (LLMs) is computationally inefficient. We therefore introduce an attention-based resampling mechanism to generate compact, fixed-size representations from these sequences. Second, compressing rich omni-modal data into a single embedding vector inevitably causes information loss and discards fine-grained details. We propose Attention Sliced Wasserstein Pooling to preserve these fine-grained details, leading to improved omni-modal representations. OmniRet is trained on an aggregation of approximately 6 million query-target pairs spanning 30 datasets. We benchmark our model on 13 retrieval tasks and an MMEBv2 subset. Our model demonstrates significant improvements on composed query, audio and video retrieval tasks, while achieving on-par performance with state-of-the-art models on others. Furthermore, we curate a new Audio-Centric Multimodal Benchmark (ACM). This new benchmark introduces two critical, previously missing tasks, composed audio retrieval and audio-visual retrieval, to more comprehensively evaluate a model's omni-modal embedding capacity.
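摘要中提到的"attention-based resampling"思路与 Perceiver 式交叉注意力汇聚类似:用固定数量的 latent 查询向量对变长 token 序列做注意力聚合,输出尺寸只由 latent 数量决定。下面给出一个单头、无训练参数的最小示意(仅为说明机制,并非 OmniRet 官方实现;真实模型应使用训练好的多头注意力与投影层):

```python
import numpy as np

def attention_resample(tokens, latents):
    """交叉注意力重采样示意: 将 (n, d) 的变长 token 序列压缩为
    (m, d) 的固定大小表示, m 为 latent 查询数量。
    注: 单头点积注意力简化版, 非论文原实现。"""
    scale = 1.0 / np.sqrt(tokens.shape[1])
    logits = latents @ tokens.T * scale           # (m, n) 注意力打分
    logits -= logits.max(axis=1, keepdims=True)   # 数值稳定的 softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)       # 每行归一化为注意力权重
    return attn @ tokens                          # (m, d) 固定大小输出
```

无论输入序列多长,输出始终是 m 行,这正是降低送入 LLM 的 token 负担的关键。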
[IR-3] MealRec: Multi-granularity Sequential Modeling via Hierarchical Diffusion Models for Micro-Video Recommendation
【速读】:该论文旨在解决微视频推荐中因多模态内容固有噪声和不可靠隐式反馈导致的用户行为与潜在兴趣对应关系弱化问题,进而提升推荐准确性。其核心挑战在于传统方法在行为增强建模和以内容为中心的多模态分析中易引发两个非平凡问题:无关偏好(preference-irrelative)的视频表征提取以及模态间固有冲突。解决方案的关键在于提出一种基于分层扩散模型的多粒度序列建模方法 MealRec,通过引入时序引导的内容扩散(Temporal-guided Content Diffusion, TCD)在视频内部时序指导下精炼视频表示,强化显著内容并抑制冗余;同时设计无噪声条件的偏好去噪(Noise-unconditional Preference Denoising, NPD),从污染状态中恢复信息丰富的用户偏好,实现语义一致的偏好建模。
链接: https://arxiv.org/abs/2603.01926
作者: Xinxin Dong,Haokai Ma,Yuze Zheng,Yongfu Zha,Yonghui Yang,Xiaodong Wang
机构: National University of Defense Technology(国防科技大学); National University of Singapore(新加坡国立大学)
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Micro-video recommendation aims to capture user preferences from the collaborative and context information of the interacted micro-videos, thereby predicting the appropriate videos. This target is often hindered by the inherent noise within multimodal content and unreliable implicit feedback, which weakens the correspondence between behaviors and underlying interests. While conventional works have predominantly approached such scenarios through behavior-augmented modeling and content-centric multimodal analysis, these paradigms can inadvertently give rise to two non-trivial challenges: preference-irrelative video representation extraction and inherent modality conflicts. To address these issues, we propose a Multi-granularity sequential modeling method via hierarchical diffusion models for micro-video Recommendation (MealRec), which simultaneously considers temporal correlations during preference modeling from intra- and inter-video perspectives. Specifically, we first propose Temporal-guided Content Diffusion (TCD) to refine video representations under intra-video temporal guidance and personalized collaborative signals to emphasize salient content while suppressing redundancy. To achieve semantically coherent preference modeling, we further design the Noise-unconditional Preference Denoising (NPD) to recover informative user preferences from corrupted states under blind denoising. Extensive experiments and analyses on four micro-video datasets from two platforms demonstrate the effectiveness, universality, and robustness of our MealRec, further uncovering the effective mechanism of our proposed TCD and NPD. The source code and corresponding dataset will be available upon acceptance.
[IR-4] Semantic Novelty Trajectories in 80000 Books: A Cross-Corpus Embedding Analysis
【速读】:该论文旨在解决如何在大规模语料库尺度上量化和比较不同历史时期英文文学作品中语义新颖性(semantic novelty)的演变规律问题。其核心挑战在于如何将Schmidhuber提出的“压缩进度有趣性理论”(compression progress theory of interestingness)应用于跨世纪文本分析,并捕捉叙事结构层面的新颖性动态轨迹。解决方案的关键在于:采用sentence-transformer生成的段落嵌入(paragraph embeddings)结合运行中心点新颖性度量(running-centroid novelty measure),对超过8万本英文书籍进行系统建模,从而揭示现代与1920年前文学作品在段落级新颖性水平、轨迹迂回度、收敛叙事曲线频率等方面的显著差异,同时发现新颖性与读者评分无显著相关性,表明该理论框架可独立于主观审美评价刻画文本的结构性有趣性。
链接: https://arxiv.org/abs/2603.01791
作者: Fred Zimmerman
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 12 pages, 4 figures, 5 tables
Abstract:I apply Schmidhuber’s compression progress theory of interestingness at corpus scale, analyzing semantic novelty trajectories in more than 80,000 books spanning two centuries of English-language publishing. Using sentence-transformer paragraph embeddings and a running-centroid novelty measure, I compare 28,730 pre-1920 Project Gutenberg books (PG19) against 52,796 modern English books (Books3, approximately 1990-2010). The principal findings are fourfold. First, mean paragraph-level novelty is roughly 10% higher in modern books (0.503 vs. 0.459). Second, trajectory circuitousness – the ratio of cumulative path length to net displacement in embedding space – nearly doubles in the modern corpus (+67%). Third, convergent narrative curves, in which novelty declines toward a settled semantic register, are 2.3x more common in pre-1920 literature. Fourth, novelty is orthogonal to reader quality ratings (r = -0.002), suggesting that interestingness in Schmidhuber’s sense is structurally independent of perceived literary merit. Clustering paragraph-level trajectories via PAA-16 representations reveals eight distinct narrative-shape archetypes whose distribution shifts substantially between eras. All analysis code and an interactive exploration toolkit are publicly available at this https URL.
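摘要中的"running-centroid novelty"与"circuitousness"两个指标定义较为具体,可用如下最小示意复现其计算方式(基于摘要描述的合理推测,函数名与细节为假设,非论文公开代码):每段的新颖性是该段嵌入与此前所有段落运行质心的余弦距离;迂回度是嵌入空间中累计路径长度与净位移之比。

```python
import numpy as np

def novelty_trajectory(embeddings):
    """每段落嵌入相对于之前段落运行质心的余弦距离(示意实现)。"""
    E = np.asarray(embeddings, dtype=float)
    scores = []
    centroid = E[0].copy()
    for i in range(1, len(E)):
        e = E[i]
        cos = np.dot(e, centroid) / (np.linalg.norm(e) * np.linalg.norm(centroid))
        scores.append(1.0 - cos)
        # 将新段落并入运行质心
        centroid = (centroid * i + e) / (i + 1)
    return np.array(scores)

def circuitousness(embeddings):
    """轨迹迂回度: 累计路径长度 / 净位移(对应摘要中 +67% 的指标)。"""
    E = np.asarray(embeddings, dtype=float)
    path = np.linalg.norm(np.diff(E, axis=0), axis=1).sum()
    net = np.linalg.norm(E[-1] - E[0])
    return path / net
```

直线轨迹的迂回度为 1,绕路越多比值越大;重复内容的新颖性趋近于 0。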
[IR-5] Legal RAG Bench: an end-to-end benchmark for legal RAG
【速读】:该论文旨在解决当前法律领域检索增强生成(Retrieval-Augmented Generation, RAG)系统缺乏标准化评估基准与科学量化方法的问题,尤其针对法律场景下信息检索与大语言模型(Large Language Model, LLM)协同性能的贡献难以区分这一挑战。其解决方案的关键在于提出Legal RAG Bench——一个包含4,876条维多利亚刑事指控文本片段及100个需专业刑法知识的复杂问题的数据集,并设计了一种基于全因子实验设计和新型分层错误分解框架的评估方法,从而实现对检索模块与推理模块在端到端法律RAG系统中各自贡献的精确归因分析。研究发现,信息检索是决定法律RAG系统性能的主要因素,而LLM的作用相对有限,且许多被误判为“幻觉”的错误实则源于检索失败,表明检索质量设定了现代法律RAG系统的上限性能。
链接: https://arxiv.org/abs/2603.01710
作者: Abdur-Rahman Butler,Umar Butler
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 13 pages, 3 figures, 4 tables
Abstract:We introduce Legal RAG Bench, a benchmark and evaluation methodology for assessing the end-to-end performance of legal RAG systems. As a benchmark, Legal RAG Bench consists of 4,876 passages from the Victorian Criminal Charge Book alongside 100 complex, hand-crafted questions demanding expert knowledge of criminal law and procedure. Both long-form answers and supporting passages are provided. As an evaluation methodology, Legal RAG Bench leverages a full factorial design and novel hierarchical error decomposition framework, enabling apples-to-apples comparisons of the contributions of retrieval and reasoning models in RAG. We evaluate three state-of-the-art embedding models (Isaacus’ Kanon 2 Embedder, Google’s Gemini Embedding 001, and OpenAI’s Text Embedding 3 Large) and two frontier LLMs (Gemini 3.1 Pro and GPT-5.2), finding that information retrieval is the primary driver of legal RAG performance, with LLMs exerting a more moderate effect on correctness and groundedness. Kanon 2 Embedder, in particular, had the largest positive impact on performance, improving average correctness by 17.5 points, groundedness by 4.5 points, and retrieval accuracy by 34 points. We observe that many errors attributed to hallucinations in legal RAG systems are in fact triggered by retrieval failures, concluding that retrieval sets the ceiling for the performance of many modern legal RAG systems. We document why and how we built Legal RAG Bench alongside the results of our evaluations. We also openly release our code and data to assist with reproduction of our findings.
[IR-6] Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations
【速读】:该论文旨在解决视觉文档检索(Visual Document Retrieval, VDR)中多向量架构面临的存储瓶颈问题,即现有优化策略(如嵌入融合、剪枝或使用抽象token)在降低存储需求时往往牺牲性能或忽略关键布局信息。其解决方案的关键在于提出ColParse范式,利用文档解析模型生成少量受布局信息启发的子图像嵌入(sub-image embeddings),并将其与全局页面向量融合,从而构建紧凑且结构感知的多向量表示,实现存储效率提升超过95%的同时显著提高检索性能。
链接: https://arxiv.org/abs/2603.01666
作者: Yibo Yan,Mingdong Ou,Yi Cao,Xin Zou,Shuliang Liu,Jiahao Huo,Yu Huang,James Kwok,Xuming Hu
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Under review
Abstract:Harnessing the full potential of visually-rich documents requires retrieval systems that understand not just text, but intricate layouts, a core challenge in Visual Document Retrieval (VDR). The prevailing multi-vector architectures, while powerful, face a crucial storage bottleneck that current optimization strategies, such as embedding merging, pruning, or using abstract tokens, fail to resolve without compromising performance or ignoring vital layout cues. To address this, we introduce ColParse, a novel paradigm that leverages a document parsing model to generate a small set of layout-informed sub-image embeddings, which are then fused with a global page-level vector to create a compact and structurally-aware multi-vector representation. Extensive experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains across numerous benchmarks and base models. ColParse thus bridges the critical gap between the fine-grained accuracy of multi-vector retrieval and the practical demands of large-scale deployment, offering a new path towards efficient and interpretable multimodal information systems.
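ColParse 属于多向量检索范式:文档由"全局页面向量 + 少量布局感知子图向量"构成,查询与文档按 ColBERT 式 MaxSim 晚交互打分。下面是该打分机制的通用示意(MaxSim 为多向量检索的标准做法,但 ColParse 的具体打分与融合细节以论文为准,此处的拼接方式为假设):

```python
import numpy as np

def build_doc_vecs(global_vec, region_vecs):
    """紧凑多向量文档表示示意: 1 个全局页面向量 + 若干子图向量(假设)。"""
    return np.vstack([global_vec, region_vecs])

def maxsim_score(query_vecs, doc_vecs):
    """MaxSim 晚交互: 每个查询向量取与其最相似的文档向量, 再求和。"""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T                       # (nq, nd) 余弦相似度矩阵
    return sims.max(axis=1).sum()        # 每个查询向量匹配最优文档向量
```

由于文档侧向量数从"每 patch 一个"降到"每页个位数",存储开销随之大幅下降,这与摘要中"降低 95% 以上存储"的方向一致。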
[IR-7] IDProxy: Cold-Start CTR Prediction for Ads and Recommendation at Xiaohongshu with Multimodal LLM s
【速读】:该论文旨在解决广告与推荐系统中点击率(Click-through Rate, CTR)模型在物品冷启动(item cold-start)场景下的性能瓶颈问题,即当新物品缺乏用户交互数据时,传统基于物品ID的嵌入(ID embeddings)无法有效建模其特征。解决方案的关键在于提出IDProxy方法,该方法利用多模态大语言模型(Multimodal Large Language Models, MLLMs)从丰富的内容信号中生成代理嵌入(proxy embeddings),这些代理嵌入被显式对齐到现有的ID嵌入空间,并在CTR目标下与排序模型端到端联合优化,从而实现无需历史使用数据即可支持新物品的有效CTR预测,且可无缝集成至现有大规模排序流水线。
链接: https://arxiv.org/abs/2603.01590
作者: Yubin Zhang,Haiming Xu,Guillaume Salha-Galvan,Ruiyan Han,Feiyang Xiao,Yanhua Huang,Li Lin,Yang Luo,Yao Hu
机构: Xiaohongshu Inc.(小红书公司); Shanghai Jiao Tong University (上海交通大学); Fudan University (复旦大学)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Click-through rate (CTR) models in advertising and recommendation systems rely heavily on item ID embeddings, which struggle in item cold-start settings. We present IDProxy, a solution that leverages multimodal large language models (MLLMs) to generate proxy embeddings from rich content signals, enabling effective CTR prediction for new items without usage data. These proxies are explicitly aligned with the existing ID embedding space and are optimized end-to-end under CTR objectives together with the ranking model, allowing seamless integration into existing large-scale ranking pipelines. Offline experiments and online A/B tests demonstrate the effectiveness of IDProxy, which has been successfully deployed in both Content Feed and Display Ads features of Xiaohongshu’s Explore Feed, serving hundreds of millions of users daily.
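摘要称代理嵌入"与 ID 嵌入空间显式对齐,并在 CTR 目标下端到端优化"。一种常见的实现思路是 CTR 二元交叉熵加对齐正则项;下面的损失函数即为此类目标的假设性示意(函数名、权重与对齐项形式均为推测,非小红书线上实现):

```python
import numpy as np

def idproxy_loss(proxy_emb, id_emb, logits, labels, alpha=0.5):
    """假设性训练目标: CTR 二元交叉熵 + 代理嵌入与 ID 嵌入的对齐项。
    proxy_emb/id_emb: (n, d); logits/labels: (n,)。alpha 为假设的权重。"""
    p = 1.0 / (1.0 + np.exp(-logits))          # sigmoid 得到点击概率
    eps = 1e-12
    bce = -np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))
    # MSE 对齐项: 将 MLLM 代理嵌入拉向已有 ID 嵌入空间
    align = np.mean(np.sum((proxy_emb - id_emb) ** 2, axis=1))
    return bce + alpha * align
```

冷启动物品没有可训练的 ID 嵌入,推理时直接用对齐后的代理嵌入替换 ID 嵌入,即可复用既有排序流水线。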
[IR-8] CLEAR: Null-Space Projection for Cross-Modal De-Redundancy in Multimodal Recommendation
【速读】:该论文旨在解决多模态推荐系统中因跨模态冗余导致的互补信息利用不足问题。现有方法通常通过强化跨模态一致性来促进多模态融合,但研究发现,视觉与文本等模态间存在显著的共享子空间冗余,这限制了不同模态特有信息的有效挖掘,从而解释了为何增加模态并不总能提升性能。解决方案的关键在于提出CLEAR(Cross-modal de-redundancy Approach for Multimodal Recommendation),其核心思想是通过建模视觉与文本表示之间的交叉协方差矩阵,利用奇异值分解识别主导共享方向,并将多模态特征投影至互补的零空间,从而抑制冗余成分、保留模态特异性信息。这一子空间级投影隐式调控了表征学习动态,防止模型在训练过程中反复放大冗余语义,且无需改动原有模型架构或训练目标,具备良好的可集成性。
链接: https://arxiv.org/abs/2603.01536
作者: Hao Zhan,Yihui Wang,Yonghui Yang,Danyang Yue,Yu Wang,Pengyang Shao,Fei Shen,Fei Liu,Le Wu
机构: Hefei University of Technology (合肥工业大学); National University of Singapore (新加坡国立大学); Huazhong University of Science and Technology (华中科技大学)
类目: Information Retrieval (cs.IR); Multimedia (cs.MM)
备注:
Abstract:Multimodal recommendation has emerged as an effective paradigm for enhancing collaborative filtering by incorporating heterogeneous content modalities. Existing multimodal recommenders predominantly focus on reinforcing cross-modal consistency to facilitate multimodal fusion. However, we observe that multimodal representations often exhibit substantial cross-modal redundancy, where dominant shared components overlap across modalities. Such redundancy can limit the effective utilization of complementary information, explaining why incorporating additional modalities does not always yield performance improvements. In this work, we propose CLEAR, a lightweight and plug-and-play cross-modal de-redundancy approach for multimodal recommendation. Rather than enforcing stronger cross-modal alignment, CLEAR explicitly characterizes the redundant shared subspace across modalities by modeling cross-modal covariance between visual and textual representations. By identifying dominant shared directions via singular value decomposition and projecting multimodal features onto the complementary null space, CLEAR reshapes the multimodal representation space by suppressing redundant cross-modal components while preserving modality-specific information. This subspace-level projection implicitly regulates representation learning dynamics, preventing the model from repeatedly amplifying redundant shared semantics during training. Notably, CLEAR can be seamlessly integrated into existing multimodal recommenders without modifying their architectures or training objectives. Extensive experiments on three public benchmark datasets demonstrate that explicitly reducing cross-modal redundancy consistently improves recommendation performance across a wide range of multimodal recommendation models.
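CLEAR 的核心操作在摘要中描述得相当具体:对视觉与文本特征的跨模态协方差做奇异值分解,取前 k 个奇异方向为冗余共享子空间,再把各模态特征投影到其互补零空间。下面按此描述给出最小示意(k 的取值、是否中心化等细节为假设,以论文为准):

```python
import numpy as np

def clear_projection(V, T, k=2):
    """CLEAR 式跨模态去冗余示意。V, T: (n, d) 的视觉 / 文本物品特征。
    假设: 先中心化, 前 k 个奇异方向视为冗余共享方向。"""
    Vc, Tc = V - V.mean(0), T - T.mean(0)
    C = Vc.T @ Tc / (len(V) - 1)            # (d, d) 跨模态协方差矩阵
    U, S, Wt = np.linalg.svd(C)
    Uv, Ut = U[:, :k], Wt[:k].T             # 两侧的主导共享方向
    Pv = np.eye(V.shape[1]) - Uv @ Uv.T     # 投影到互补零空间
    Pt = np.eye(T.shape[1]) - Ut @ Ut.T
    return V @ Pv, T @ Pt
```

投影后特征在前 k 个共享方向上的分量严格为零,即摘要所说"抑制冗余共享成分、保留模态特有信息";该操作不改动模型结构,故可即插即用。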
[IR-9] PhotoBench: Beyond Visual Matching Towards Personalized Intent-Driven Photo Retrieval
【速读】:该论文旨在解决个人照片检索中因缺乏真实场景下多源信息融合能力而导致的非平凡性问题,即现有检索基准主要依赖孤立的网络快照,无法捕捉用户意图驱动下的多源推理需求。其解决方案的关键在于构建首个基于真实个人相册的 PhotoBench 基准,通过严谨的多源画像框架(整合视觉语义、时空元数据、社交身份与时间事件)合成根植于用户生命轨迹的复杂意图驱动查询,从而推动从视觉匹配向个性化多源意图驱动推理范式的转变。
链接: https://arxiv.org/abs/2603.01493
作者: Tianyi Xu,Rong Shan,Junjie Wu,Jiadeng Huang,Teng Wang,Jiachen Zhu,Wenteng Chen,Minxin Tu,Quantao Dou,Zhaoxiang Wang,Changwang Zhang,Weinan Zhang,Jun Wang,Jianghao Lin
机构: Shanghai Jiao Tong University (上海交通大学); OPPO (OPPO)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Under review
Abstract:Personal photo albums are not merely collections of static images but living, ecological archives defined by temporal continuity, social entanglement, and rich metadata, which makes the personalized photo retrieval non-trivial. However, existing retrieval benchmarks rely heavily on context-isolated web snapshots, failing to capture the multi-source reasoning required to resolve authentic, intent-driven user queries. To bridge this gap, we introduce PhotoBench, the first benchmark constructed from authentic, personal albums. It is designed to shift the paradigm from visual matching to personalized multi-source intent-driven reasoning. Based on a rigorous multi-source profiling framework, which integrates visual semantics, spatial-temporal metadata, social identity, and temporal events for each image, we synthesize complex intent-driven queries rooted in users’ life trajectories. Extensive evaluation on PhotoBench exposes two critical limitations: the modality gap, where unified embedding models collapse on non-visual constraints, and the source fusion paradox, where agentic systems perform poor tool orchestration. These findings indicate that the next frontier in personal multimodal retrieval lies beyond unified embeddings, necessitating robust agentic reasoning systems capable of precise constraint satisfaction and multi-source fusion. Our PhotoBench is available.
[IR-10] Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality
【速读】:该论文旨在解决当前多模态嵌入模型(Multimodal Embedding Models)在依赖大规模对比学习时,其嵌入质量受限于多模态大语言模型(MLLMs)架构与训练范式的问题。具体而言,现有MLLMs采用因果注意力机制和下一个词预测任务,难以显式促进全局紧凑的表示形成,从而限制了其作为多模态嵌入骨干网络的有效性。解决方案的关键在于提出一种基于协作注意力(Collaborative Attention)的内容重建预训练范式CoCoA,通过重构输入序列至EOS标记对应的嵌入向量,引导模型将输入语义信息压缩到EOS token中,从而为后续对比学习奠定更紧凑且信息丰富的嵌入基础,显著提升嵌入质量。
链接: https://arxiv.org/abs/2603.01471
作者: Jiahan Chen,Da Li,Hengran Zhang,Yinqiong Cai,Lixin Su,Jiafeng Guo,Daiting Shi,Dawei Yin,Keping Bi
机构: State Key Laboratory of AI Safety (人工智能安全国家重点实验室); Institute of Computing Technology (计算技术研究所); Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学); Baidu Inc. (百度公司)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Multimodal embedding models, rooted in multimodal large language models (MLLMs), have yielded significant performance improvements across diverse tasks such as retrieval and classification. However, most existing approaches rely heavily on large-scale contrastive learning, with limited exploration of how the architectural and training paradigms of MLLMs affect embedding quality. While effective for generation, the causal attention and next-token prediction paradigm of MLLMs does not explicitly encourage the formation of globally compact representations, limiting their effectiveness as multimodal embedding backbones. To address this, we propose CoCoA, a Content reconstruction pre-training paradigm based on Collaborative Attention for multimodal embedding optimization. Specifically, we restructure the attention flow and introduce an EOS-based reconstruction task, encouraging the model to reconstruct input from the corresponding EOS embeddings. This drives the multimodal model to compress the semantic information of the input into the EOS token, laying the foundations for subsequent contrastive learning. Extensive experiments on MMEB-V1 demonstrate that CoCoA built upon Qwen2-VL and Qwen2.5-VL significantly improves embedding quality. Results validate that content reconstruction serves as an effective strategy to maximize the value of existing data, enabling multimodal embedding models generate compact and informative representations, raising their performance ceiling.
[IR-11] From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents
【速读】:该论文旨在解决多模态大语言模型在长时视频理解任务中面临的挑战,即受限的上下文窗口和静态记忆机制无法模拟人类认知效率,导致现有方法要么因密集视觉累积造成高延迟与冗余(视觉主导型),要么因过度压缩文本信息引发细节丢失与幻觉(文本主导型)。其解决方案的关键在于提出一种基于模糊痕迹理论(Fuzzy-Trace Theory)的分层多模态记忆架构MM-Mem,该架构将记忆结构化为感知缓冲区(Sensory Buffer)、情景流(Episodic Stream)和符号模式(Symbolic Schema)三个层级,实现从细粒度感知痕迹到高层语义模式的渐进式提炼;同时引入语义信息瓶颈(Semantic Information Bottleneck)目标,并设计SIB-GRPO优化策略,在记忆压缩与任务相关性保留之间取得平衡,推理阶段采用熵驱动的自顶向下记忆检索机制,依据不确定性动态选择记忆层级,从而显著提升长时视频理解的准确性与泛化能力。
链接: https://arxiv.org/abs/2603.01455
作者: Niu Lian,Yuting Wang,Hanshu Yao,Jinpeng Wang,Bin Chen,Yaowei Wang,Min Zhang,Shu-Tao Xia
机构: Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳); Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院); Peng Cheng Laboratory(鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multimedia (cs.MM)
备注: TL;DR: We propose MM-Mem, a cognition-inspired, dual-trace hierarchical memory framework for long-horizon video understanding grounded in Fuzzy-Trace Theory. It features adaptive memory compression via the Information Bottleneck and employs an entropy-driven top-down retrieval to access fine-grained details only when necessary. 16 pages, 7 figures, 7 tables
Abstract:While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, or text-centric approaches that suffer from detail loss and hallucination via aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory, we derive a Semantic Information Bottleneck objective and introduce SIB-GRPO to optimize the trade-off between memory compression and task-relevant information retention. In inference, we design an entropy-driven top-down memory retrieval strategy, which first tries with the abstract Symbolic Schema and progressively “drills down” to the Sensory Buffer and Episodic Stream under high uncertainty. Extensive experiments across 4 benchmarks confirm the effectiveness of MM-Mem on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition-inspired memory organization. Code is available at this https URL.
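MM-Mem 的"熵驱动自顶向下检索"可以抽象为:先查最抽象的 Symbolic Schema,若答案分布熵仍高于阈值,再逐层下钻到 Episodic Stream 与 Sensory Buffer。以下控制流为基于摘要的假设性示意(阈值 tau、answer_fn 接口均为虚构,非论文实现):

```python
import numpy as np

def entropy(p):
    """离散分布的香农熵(自然对数)。"""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def topdown_retrieve(levels, answer_fn, tau=0.5):
    """熵驱动的自顶向下记忆访问示意: 从最抽象层开始,
    仅当答案分布仍不确定(熵 > tau)时才下钻到更细粒度的记忆层。
    levels 如 ["schema", "episodic", "sensory"], answer_fn 为假设接口。"""
    for level in levels:
        probs = answer_fn(level)
        if entropy(probs) <= tau:
            return level, probs          # 足够确定, 停在当前层
    return levels[-1], probs             # 到底仍不确定, 返回最细层结果
```

这一策略使多数查询只需访问廉价的抽象记忆,仅在高不确定性时才触碰昂贵的感知缓冲区。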
[IR-12] LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在密集检索(dense retrieval)中被用作静态编码器的问题,即尽管LLMs具备强大的推理能力,但现有方法仅将其作为固定特征提取器,未能充分发挥其复杂推理潜力。为此,论文提出LaSER框架,其核心创新在于通过自蒸馏机制将显式推理过程内化到密集检索器的潜在空间中。关键在于引入双视角训练机制:显式视角(Explicit view)编码真实推理路径,隐式视角(Latent view)进行潜在空间内的隐式思维;并通过多粒度对齐策略,尤其是轨迹对齐机制,使隐式推理的中间潜在状态与显式推理片段的语义演进同步,从而实现无需自回归文本生成的高效、隐式推理。
链接: https://arxiv.org/abs/2603.01425
作者: Jiajie Jin,Yanzhao Zhang,Mingxin Li,Dingkun Long,Pengjun Xie,Yutao Zhu,Zhicheng Dou
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); Tongyi Lab, Alibaba Group (阿里巴巴集团通义实验室)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Under Review
Abstract:LLMs have fundamentally transformed dense retrieval, upgrading backbones from discriminative encoders to generative architectures. However, a critical disconnect remains: while LLMs possess strong reasoning capabilities, current retrievers predominantly utilize them as static encoders, leaving their potential for complex reasoning unexplored. To address this, existing approaches typically adopt rewrite-then-retrieve pipelines to generate explicit CoT rationales before retrieval. However, this incurs prohibitive latency. In this paper, we propose LaSER, a novel self-distillation framework that internalizes explicit reasoning into the latent space of dense retrievers. Operating on a shared LLM backbone, LaSER introduces a dual-view training mechanism: an Explicit view that explicitly encodes ground-truth reasoning paths, and a Latent view that performs implicit latent thinking. To bridge the gap between these views, we design a multi-grained alignment strategy. Beyond standard output alignment, we introduce a trajectory alignment mechanism that synchronizes the intermediate latent states of the latent path with the semantic progression of the explicit reasoning segments. This allows the retriever to think silently and effectively without autoregressive text generation. Extensive experiments on both in-domain and out-of-domain reasoning-intensive benchmarks demonstrate that LaSER significantly outperforms state-of-the-art baselines. Furthermore, analyses across diverse backbones and model scales validate the robustness of our approach, confirming that our unified learning framework is essential for eliciting effective latent thinking. Our method successfully combines the reasoning depth of explicit CoT pipelines with the inference efficiency of standard dense retrievers.
[IR-13] ReFeed: Retrieval Feedback-Guided Dataset Construction for Style-Aware Query Rewriting AAAI2026
【速读】:该论文旨在解决检索系统在用户查询与文档语料库之间存在风格或语义差异时性能下降的问题,即传统查询重写(query rewriting)方法常忽略目标文档的领域特异性语言特征(如措辞、语气和结构),导致无法有效匹配真实数据分布。其解决方案的关键在于提出一种基于检索反馈驱动的数据集生成框架:自动识别检索失败案例,利用大语言模型(LLM)将查询重写为与相关文档风格一致的形式,并通过重新检索验证改进效果;由此构建的(原始查询,重写查询)配对语料库可用于训练显式感知文档风格与检索反馈的重写模型,从而提升检索增强生成(RAG)系统在特定领域场景下的推理能力与适应性。
链接: https://arxiv.org/abs/2603.01417
作者: Jiyoon Myung,Jungki Son,Kyungro Lee,Jihyeon Park,Joohyung Han
机构: 未知
类目: Information Retrieval (cs.IR)
备注: Accepted at the Workshop on New Frontiers in Information Retrieval (AAAI 2026)
Abstract:Retrieval systems often fail when user queries differ stylistically or semantically from the language used in domain documents. Query rewriting has been proposed to bridge this gap, improving retrieval by reformulating user queries into semantically equivalent forms. However, most existing approaches overlook the stylistic characteristics of target documents, namely their domain-specific phrasing, tone, and structure, which are crucial for matching real-world data distributions. We introduce a retrieval feedback-driven dataset generation framework that automatically identifies failed retrieval cases, leverages large language models to rewrite queries in the style of relevant documents, and verifies improvement through re-retrieval. The resulting corpus of (original, rewritten) query pairs enables the training of rewriter models that are explicitly aware of document style and retrieval feedback. This work highlights a new direction in data-centric information retrieval, emphasizing how feedback loops and document-style alignment can enhance the reasoning and adaptability of RAG systems in real-world, domain-specific contexts.
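ReFeed 的数据构造闭环(识别检索失败 → LLM 重写 → 重检索验证 → 保留改进样本)可以写成一个简短的控制流。以下为示意性骨架,其中 rewrite_fn、retrieve_fn 等接口名均为假设,实际系统中分别对应 LLM 重写器与检索器:

```python
def refeed_pairs(queries, rewrite_fn, retrieve_fn, gold, k=5):
    """ReFeed 式数据构造闭环示意(接口名为假设):
    仅当重写后的查询确实检回了金标段落时, 才保留 (原查询, 重写查询) 对。"""
    pairs = []
    for q in queries:
        if gold[q] in retrieve_fn(q, k):
            continue                     # 原查询已检索成功, 无需重写
        q2 = rewrite_fn(q)               # 按目标文档风格重写查询
        if gold[q] in retrieve_fn(q2, k):
            pairs.append((q, q2))        # 重检索验证通过, 纳入训练语料
    return pairs
```

这种"先验证再入库"的过滤保证了训练对中的重写query都带有可观测的检索收益。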
[IR-14] ARSE: Test-Time Adaptation via Retrieval of Skills and Experience for Reasoning Agents
【速读】:该论文试图解决临床决策中因模型无法可靠选择和应用正确的程序性知识(skills)及先前验证的推理经验(experience)而导致的推理不可靠问题。其解决方案的关键在于将临床问答建模为一个代理(agent)问题,显式地构建两个可检索资源库:一是基于指南文档组织的可执行决策规则构成的技能库(skills library),二是以步骤级转移为索引的典型临床推理链构成的经验库(experience library);并通过一个步骤感知的检索器在测试时同时获取最相关的技能与经验项,并对语言模型进行轻量级的测试时适应(test-time adaptation),以缩小实例-步骤间的错位并防止推理偏离支持路径,从而提升医学推理的可靠性。
链接: https://arxiv.org/abs/2603.01241
作者: Junda Wang,Zonghai Tao,Hansi Zeng,Zhichao Yang,Hamed Zamani,Hong Yu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Complex clinical decision making often fails not because a model lacks facts, but because it cannot reliably select and apply the right procedural knowledge and the right prior example at the right reasoning step. We frame clinical question answering as an agent problem with two explicit, retrievable resources: skills, reusable clinical procedures such as guidelines, protocols, and pharmacologic mechanisms; and experience, verified reasoning trajectories from previously solved cases (e.g., chain-of-thought solutions and their step-level decompositions). At test time, the agent retrieves both relevant skills and experiences from curated libraries and performs lightweight test-time adaptation to align the language model’s intermediate reasoning with clinically valid logic. Concretely, we build (i) a skills library from guideline-style documents organized as executable decision rules, (ii) an experience library of exemplar clinical reasoning chains indexed by step-level transitions, and (iii) a step-aware retriever that selects the most useful skill and experience items for the current case. We then adapt the model on the retrieved items to reduce instance-step misalignment and to prevent reasoning from drifting toward unsupported shortcuts. Experiments on medical question-answering benchmarks show consistent gains over strong medical RAG baselines and prompting-only reasoning methods. Our results suggest that explicitly separating and retrieving clinical skills and experience, and then aligning the model at test time, is a practical approach to more reliable medical agents.
[IR-15] Beyond Global Similarity: Towards Fine-Grained Multi-Condition Multimodal Retrieval CVPR2026
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在跨模态检索任务中对细粒度、多条件约束查询理解不足的问题。现有基准测试主要关注粗粒度或单一条件的对齐,未能覆盖用户在实际场景中提出的涉及视觉与文本模态间多重依赖关系的复杂查询。解决方案的关键在于提出MCMR(Multi-Conditional Multimodal Retrieval)——一个大规模基准数据集,涵盖五类商品领域(上衣、下装、珠宝、鞋履和家具),并保留丰富的长文本元数据以支持组合式匹配。每个查询均整合互补的视觉与文本属性,要求模型联合满足所有指定条件才能判定相关性,从而系统评估模型在条件感知推理方面的性能。实验表明,MCMR能有效诊断不同模型的模态不对称性、揭示视觉线索在排序早期的重要性以及文本元数据对长尾排序的稳定性作用,并验证基于MLLM的点式重排序器可通过显式校验查询-候选一致性显著提升细粒度匹配效果。
链接: https://arxiv.org/abs/2603.01082
作者: Xuan Lu,Kangle Li,Haohang Huang,Rui Meng,Wenjun Zeng,Xiaoyu Shen
机构: Shanghai Jiao Tong University (上海交通大学); Institute of Digital Twin, Eastern Institute of Technology, Ningbo (数字孪生研究所,东方理工大学宁波校区); Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative (宁波市空间智能与数字衍生重点实验室); Google Cloud AI Research (Google云AI研究)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: Accepted by CVPR 2026
Abstract:Recent advances in multimodal large language models (MLLMs) have substantially expanded the capabilities of multimodal retrieval, enabling systems to align and retrieve information across visual and textual modalities. Yet, existing benchmarks largely focus on coarse-grained or single-condition alignment, overlooking real-world scenarios where user queries specify multiple interdependent constraints across modalities. To bridge this gap, we introduce MCMR (Multi-Conditional Multimodal Retrieval): a large-scale benchmark designed to evaluate fine-grained, multi-condition cross-modal retrieval under natural-language queries. MCMR spans five product domains: upper and bottom clothing, jewelry, shoes, and furniture. It also preserves rich long-form metadata essential for compositional matching. Each query integrates complementary visual and textual attributes, requiring models to jointly satisfy all specified conditions for relevance. We benchmark a diverse suite of MLLM-based multimodal retrievers and vision-language rerankers to assess their condition-aware reasoning abilities. Experimental results reveal that: (i) models exhibit distinct modality asymmetries; (ii) visual cues dominate early-rank precision, while textual metadata stabilizes long-tail ordering; and (iii) MLLM-based pointwise rerankers markedly improve fine-grained matching by explicitly verifying query-candidate consistency. Overall, MCMR establishes a challenging and diagnostic benchmark for advancing multimodal retrieval toward compositional, constraint-aware, and interpretable understanding. Our code and dataset are available at this https URL
[IR-16] Beyond the Flat Sequence: Hierarchical and Preference-Aware Generative Recommendations WWW’26
【速读】:该论文旨在解决生成式推荐模型(Generative Recommenders, GRs)在处理用户长序列交互时存在的两个核心问题:一是“平铺序列”假设忽略了用户行为中固有的时间层次结构,导致无法有效捕捉会话级参与度的层级特性;二是由于密集注意力机制引入大量噪声,掩盖了语义稀疏历史中的真实偏好信号,从而降低了表示学习质量并造成计算效率低下。解决方案的关键在于提出一种名为HPGR(Hierarchical and Preference-aware Generative Recommender)的新框架,其采用两阶段范式:第一阶段通过基于会话的掩码物品建模(Session-based Masked Item Modeling, MIM)目标进行结构感知预训练,构建具有层次信息和语义丰富性的物品表征空间;第二阶段则利用该表征实施偏好引导的稀疏注意力机制(Preference-Guided Sparse Attention),动态限制计算范围至最相关的历史物品,显著提升效率与信噪比。
链接: https://arxiv.org/abs/2603.00980
作者: Zerui Chen,Heng Chang,Tianying Liu,Chuantian Zhou,Yi Cao,Jiandong Ding,Ming Liu,Bing Qin
机构: 未知
类目: Information Retrieval (cs.IR)
备注: Accepted to the ACM Web Conference 2026 (WWW '26). 9 pages, 9 figures. Zerui Chen and Heng Chang contributed equally to this work
Abstract:Generative Recommenders (GRs), exemplified by the Hierarchical Sequential Transduction Unit (HSTU), have emerged as a powerful paradigm for modeling long user interaction sequences. However, we observe that their “flat-sequence” assumption overlooks the rich, intrinsic structure of user behavior. This leads to two key limitations: a failure to capture the temporal hierarchy of session-based engagement, and computational inefficiency, as dense attention introduces significant noise that obscures true preference signals within semantically sparse histories, which deteriorates the quality of the learned representations. To this end, we propose a novel framework named HPGR (Hierarchical and Preference-aware Generative Recommender), built upon a two-stage paradigm that injects these crucial structural priors into the model to handle the drawback. Specifically, HPGR comprises two synergistic stages. First, a structure-aware pre-training stage employs a session-based Masked Item Modeling (MIM) objective to learn a hierarchically-informed and semantically rich item representation space. Second, a preference-aware fine-tuning stage leverages these powerful representations to implement a Preference-Guided Sparse Attention mechanism, which dynamically constrains computation to only the most relevant historical items, enhancing both efficiency and signal-to-noise ratio. Empirical experiments on a large-scale proprietary industrial dataset from APPGallery and an online A/B test verify that HPGR achieves state-of-the-art performance over multiple strong baselines, including HSTU and MTGR.
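摘要中"偏好引导稀疏注意力"的基本思想,是先按相关性筛出 top-k 历史物品、再只在该子集上做注意力加权。以下为一个极简数值草图(相关性分数与向量维度均为虚构示例,并非 HPGR 论文实现):

```python
# 示意:偏好引导稀疏注意力——仅对相关性最高的 k 个历史物品做注意力加权
# (相关性分数、向量维度均为假设,非 HPGR 实现)
import math

def sparse_attention(query, keys, values, relevance, k=2):
    # 先按偏好相关性筛出 top-k 历史物品,注意力只在子集上计算
    idx = sorted(range(len(keys)), key=lambda i: -relevance[i])[:k]
    scores = [sum(q * keys[i][d] for d, q in enumerate(query)) for i in idx]
    mx = max(scores)
    ws = [math.exp(s - mx) for s in scores]   # 数值稳定的 softmax
    z = sum(ws)
    dim = len(values[0])
    return [sum(w / z * values[i][d] for w, i in zip(ws, idx))
            for d in range(dim)]

out = sparse_attention([1, 0],
                       keys=[[1, 0], [0, 1], [1, 0]],
                       values=[[1, 0], [0, 1], [2, 0]],
                       relevance=[0.9, 0.1, 0.8])
print(out)  # [1.5, 0.0]
```

将计算限制在 top-k 子集上,既降低了长序列注意力的开销,也过滤掉了摘要中所说的语义稀疏历史带来的噪声。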
[IR-17] Transformers Remember First, Forget Last: Dual-Process Interference in LLMs
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在处理上下文冲突信息时的记忆保留机制问题,即早期记忆与近期记忆中哪一类更易存活。其核心解决方案是借鉴认知心理学中的经典干扰范式(interference paradigms),通过系统测试39个不同架构和规模的LLM,发现所有模型均表现出显著的前摄干扰(proactive interference, PI)主导于倒摄干扰(retroactive interference, RI),表明早期信息被优先保护而近期信息更容易被抑制——这与人类记忆中RI通常占优的现象相反。关键发现在于PI与RI反映独立的记忆机制:二者相关性极低(R² = 0.044),且仅RI受模型规模影响(R² = 0.49),同时错误分析揭示了PI失败主要源于主动的首因侵入(primacy intrusion),而RI失败则为被动检索失败,进一步支持Transformer注意力机制导致了固有的首因偏差(primacy bias)。
链接: https://arxiv.org/abs/2603.00270
作者: Sourav Chattaraj,Kanak Raj
机构: Thomson Reuters (汤森路透)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 16 pages, 10 figures. Under review
Abstract:When large language models encounter conflicting information in context, which memories survive – early or recent? We adapt classical interference paradigms from cognitive psychology to answer this question, testing 39 LLMs across diverse architectures and scales. Every model shows the same pattern: proactive interference (PI) dominates retroactive interference (RI) universally (Cohen’s d = 1.73, p < 0.0001), meaning early encodings are protected at the cost of recent information – the opposite of human memory, where RI typically dominates. Three findings indicate that RI and PI reflect separate memory mechanisms. RI and PI are uncorrelated (R^2 = 0.044), rejecting a unified “memory capacity.” Model size predicts RI resistance (R^2 = 0.49) but not PI (R^2 = 0.06, n.s.) – only RI is capacity-dependent. And error analysis reveals distinct failure modes: RI failures are passive retrieval failures (51%), while PI failures show active primacy intrusion (56%); both show <1% hallucination. These patterns parallel the consolidation-retrieval distinction in cognitive science, suggesting that transformer attention creates a primacy bias with direct implications for interference-heavy applications.
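文中用 Cohen's d 量化 PI 相对 RI 的效应量。其标准计算方式(合并标准差)可示意如下(示例数据为虚构,并非论文原始数据):

```python
# 示意:合并标准差形式的 Cohen's d(数据为虚构,非论文原始数据)
import statistics

def cohens_d(xs, ys):
    """两独立样本的 Cohen's d(pooled standard deviation)。"""
    nx, ny = len(xs), len(ys)
    pooled_var = ((nx - 1) * statistics.variance(xs)
                  + (ny - 1) * statistics.variance(ys)) / (nx + ny - 2)
    return (statistics.mean(xs) - statistics.mean(ys)) / pooled_var ** 0.5

# 虚构的各模型在 PI / RI 两种条件下的准确率下降幅度
pi_drop = [0.42, 0.48, 0.45, 0.50, 0.44]
ri_drop = [0.20, 0.25, 0.22, 0.27, 0.21]
print(cohens_d(pi_drop, ri_drop) > 1)  # True:PI 造成的下降显著大于 RI
```

d > 0.8 通常被视为大效应,论文报告的 d = 1.73 即属此类。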
[IR-18] Multi-Sourced Multi-Agent Evidence Retrieval for Fact-Checking
【速读】:该论文旨在解决当前事实核查(fact-checking)方法在处理互联网虚假信息传播时存在的局限性,尤其是现有基于检索增强生成(Retrieval Augmented Generation, RAG)的方法因依赖文本相似度进行证据检索,难以捕捉复杂文档中多跳语义关系,导致忽略证据与待验证声明之间的细微事实关联,从而引发错误的真伪判断。其解决方案的关键在于提出WKGFC框架,该框架以授权开放知识图谱(knowledge graph)为核心证据资源,通过大语言模型(LLM)驱动的检索机制评估声明并提取最相关的知识子图作为结构化证据,并进一步结合网络内容检索对知识图谱进行补全;整个过程被建模为一个自动化的马尔可夫决策过程(Markov Decision Process, MDP),由推理型LLM代理根据当前证据和声明决定下一步行动,同时利用提示优化(prompt optimization)微调代理模型以适配事实核查任务。
链接: https://arxiv.org/abs/2603.00267
作者: Shuzhi Gong,Richard O. Sinnott,Jianzhong Qi,Cecile Paris,Preslav Nakov,Zhuohan Xie
机构: The University of Melbourne(墨尔本大学); Data61, CSIRO(数据61,澳大利亚联邦科学与工业研究组织); MBZUAI(穆罕默德·本·扎耶德人工智能大学)
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
备注:
Abstract:Misinformation spreading over the Internet poses a significant threat to both societies and individuals, necessitating robust and scalable fact-checking that relies on retrieving accurate and trustworthy evidence. Previous methods rely on semantic and social-contextual patterns learned from training data, which limits their generalization to new data distributions. Recently, Retrieval Augmented Generation (RAG) based methods have been proposed to utilize the reasoning capability of LLMs with retrieved grounding evidence documents. However, these methods largely rely on textual similarity for evidence retrieval and struggle to retrieve evidence that captures multi-hop semantic relations within rich document contents. These limitations lead to overlooking subtle factual correlations between the evidence and the claims to be fact-checked during evidence retrieval, thus causing inaccurate veracity predictions. To address these issues, we propose WKGFC, which exploits authorized open knowledge graph as a core resource of evidence. LLM-enabled retrieval is designed to assess the claims and retrieve the most relevant knowledge subgraphs, forming structured evidence for fact verification. To augment the knowledge graph evidence, we retrieve web contents for completion. The above process is implemented as an automatic Markov Decision Process (MDP): A reasoning LLM agent decides what actions to take according to the current evidence and the claims. To adapt the MDP for fact-checking, we use prompt optimization to fine-tune the agentic LLM.
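摘要所述的马尔可夫决策过程可以抽象为"观察当前证据→选择下一动作"的循环。下面用一个规则化的占位策略示意这一流程(动作名与状态字段均为假设;论文中该决策由经提示优化微调的推理型 LLM 代理完成):

```python
# 示意:以规则化占位策略模拟"观察证据 → 选择动作"的 MDP 循环
# (动作与状态字段均为假设;真实系统中由推理型 LLM 代理决策)

def choose_action(state):
    if not state["kg_evidence"]:
        return "query_knowledge_graph"      # 尚无知识图谱证据,先检索子图
    if state["coverage"] < 0.8:
        return "retrieve_web_content"       # 证据覆盖不足,用网页内容补全
    return "predict_veracity"               # 证据充分,输出真伪判断

state = {"claim": "X cures Y", "kg_evidence": [], "coverage": 0.0}
actions = []
while True:
    a = choose_action(state)
    actions.append(a)
    if a == "query_knowledge_graph":
        state["kg_evidence"].append("subgraph")
        state["coverage"] = 0.5
    elif a == "retrieve_web_content":
        state["coverage"] = 0.9
    else:
        break

print(actions)
# ['query_knowledge_graph', 'retrieve_web_content', 'predict_veracity']
```

先取知识图谱子图、不足时再补网页证据、最后输出判断,正是摘要描述的"KG 为核心、网页补全"的动作序列。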
[IR-19] QuickGrasp: Responsive Video-Language Querying Service via Accelerated Tokenization and Edge-Augmented Inference
【速读】:该论文旨在解决视频语言模型(Video-Language Models, VLMs)在实际部署中面临的性能与延迟之间的权衡问题:大型VLM虽精度高,但资源消耗大、响应延迟高;小型本地模型虽响应快,但准确性不足。解决方案的关键在于提出一种“本地优先、按需边缘增强”的系统QuickGrasp,其核心设计包括:共享视觉表征以避免冗余计算、加速视频标记化、查询自适应的边缘增强机制,以及延迟感知且保持准确性的视觉标记密度配置策略,从而在保障接近大型VLM精度的同时,实现最高达12.8倍的响应延迟降低。
链接: https://arxiv.org/abs/2603.00126
作者: Miao Zhang,Ruixiao Zhang,Jianxin Shi,Hengzhi Wang,Hao Fang,Jiangchuan Liu
机构: Simon Fraser University (西蒙菲莎大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Nankai University (南开大学); Shenzhen University (深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multimedia (cs.MM); Performance (cs.PF); Systems and Control (eess.SY)
备注:
Abstract:Video-language models (VLMs) are reshaping video querying services, bringing unified solutions to complex perception and reasoning tasks. However, deploying large VLMs in real-world systems remains challenging due to their high resource demands, and remote-based deployment often results in unacceptable response delays. Although small, locally deployable VLMs offer faster responses, they unavoidably fall short in accuracy. To reconcile this trade-off, we propose QuickGrasp, a responsive, quality of service (QoS)-aware system that bridges this gap through a local-first architecture with on-demand edge augmentation. Built upon the highly modular architecture of VLMs, QuickGrasp shares the vision representation across model variants to avoid redundant computation. To maximize system-wide efficiency, QuickGrasp introduces three key designs: accelerated video tokenization, query-adaptive edge augmentation, and delay-aware, accuracy-preserving vision token density configuration. We implement a prototype of QuickGrasp and evaluate it across multiple video understanding benchmarks. The results show that QuickGrasp matches the accuracy of large VLMs while achieving up to a 12.8x reduction in response delay. QuickGrasp represents a key advancement toward building responsive video querying services for open-world understanding that fully leverage the capabilities of VLMs.
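其中"延迟感知且保持准确性的视觉标记密度配置"可以理解为:在满足延迟预算的配置档位中选精度最高的一档。以下为示意性草图(各档位的 token 数、延迟与精度均为假设值,非 QuickGrasp 实测数据):

```python
# 示意:在延迟预算内选择精度最高的视觉 token 密度配置
# (各档位的 token 数、延迟与精度均为假设值)

CONFIGS = [  # (每帧 token 数, 估计延迟 ms, 估计精度)
    (64, 120, 0.78),
    (144, 260, 0.84),
    (256, 520, 0.87),
]

def pick_config(deadline_ms):
    feasible = [c for c in CONFIGS if c[1] <= deadline_ms]
    if not feasible:
        return CONFIGS[0]                      # 超紧预算时回退到最轻配置
    return max(feasible, key=lambda c: c[2])   # 满足期限的前提下取精度最高

print(pick_config(300))  # (144, 260, 0.84)
```

这种"先按 QoS 期限过滤、再按精度择优"的两步筛选,是延迟感知配置问题的常见建模方式。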
[IR-20] NovaLAD: A Fast CPU-Optimized Document Extraction Pipeline for Generative AI and Data Intelligence
【速读】:该论文旨在解决文档解析(Document Extraction)这一关键前置步骤中的效率与准确性问题,尤其是在检索增强生成(RAG)、知识库构建及下游生成式 AI(Generative AI)应用之前,如何将非结构化文档(如PDF和扫描件)高效转化为结构化文本和布局感知表示。解决方案的关键在于提出NovaLAD系统,其核心创新是并行集成两个YOLO目标检测模型——元素检测模型(用于识别标题、正文、表格、图像等语义内容)和布局检测模型(用于识别版面区域如列组、行组等),结合规则驱动的分组逻辑与可选的视觉-语言增强模块(Vision-Language Enhancement)。此外,通过引入ViT图像分类器预先过滤无关图像,仅对有用图像调用视觉大模型(Vision LLM)提取标题、摘要和结构化信息,显著降低噪声和计算成本。该架构设计使得NovaLAD可在CPU上运行,支持多阶段并行处理(检测、分类、OCR、转换),兼顾高精度(在DP-Bench基准上达到96.49% TEDS和98.51% NID)与实用性,无需依赖GPU即可实现高性能文档解析。
链接: https://arxiv.org/abs/2603.00122
作者: Aman Ulla
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 17 pages, 10 figures, 5 tables
Abstract:Document extraction is an important step before retrieval-augmented generation (RAG), knowledge bases, and downstream generative AI can work. It turns unstructured documents like PDFs and scans into structured text and layout-aware representations. We introduce NovaLAD, a comprehensive document parsing system that integrates two concurrent YOLO object detection models - element detection and layout detection - with rule-based grouping and optional vision-language enhancement. When a page image is sent in, the first thing that happens is that it goes through both models at the same time. The element model finds semantic content like the title, header, text, table, image, and so on, and the layout model finds structural regions like layout_box, column_group, multi_column, row_group, and so on. A key design decision is to first send an image or figure through an image classifier (ViT) that decides whether it is relevant or not. Only useful images are then submitted to the Vision LLM for title, summary, and structured information, which cuts down on noise and costs. NovaLAD is built for speed: it works on CPU, employs parallel execution for detection, classification, OCR, and conversion, and generates several forms, including structured JSON, Markdown, RAG-ready texts, and knowledge graphs. We test on the DP-Bench benchmark (upstage/dp-bench) and get 96.49% TEDS and 98.51% NID, which is better than both commercial and open-source parsers. This paper explains how to extract data, how the architecture works, how data flows, and how to make NovaLAD both accurate and usable without needing a GPU.
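NovaLAD 的两条关键设计——两个检测模型并行执行、图像先经分类器筛选再决定是否调用视觉大模型——可用如下草图示意(detect_* 均为桩函数,真实系统为两个 YOLO 模型与一个 ViT 分类器):

```python
# 示意:两个检测器并行 + 分类器门控视觉大模型调用
# (detect_* 为桩函数;真实系统为两个 YOLO 模型与一个 ViT 分类器)
from concurrent.futures import ThreadPoolExecutor

def detect_elements(page):
    # 语义内容检测:标题、正文、表格、图像等
    return [{"type": "title"},
            {"type": "image", "relevant": False},   # 分类器判为无关,跳过 LLM
            {"type": "image", "relevant": True}]

def detect_layout(page):
    # 版面结构检测:layout_box、column_group 等
    return [{"type": "column_group"}]

def process_page(page):
    with ThreadPoolExecutor(max_workers=2) as pool:
        f_elem = pool.submit(detect_elements, page)
        f_layout = pool.submit(detect_layout, page)
        elements, regions = f_elem.result(), f_layout.result()
    # 只有被判为有用的图像才提交给视觉 LLM 提取标题与摘要
    to_llm = [e for e in elements if e["type"] == "image" and e["relevant"]]
    return elements, regions, to_llm

elements, regions, to_llm = process_page("page.png")
print(len(to_llm))  # 1
```

先分类后调用的门控设计,正是摘要所说"cuts down on noise and costs"的来源:无关图像根本不会触发昂贵的视觉 LLM 推理。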
[IR-21] DeepXiv-SDK: An Agentic Data Interface for Scientific Papers
【速读】:该论文旨在解决科研智能体(Research Agents)在科学文献信息检索与证据驱动决策过程中面临的“论文访问瓶颈”问题。传统方法通常依赖对PDF或HTML页面的启发式解析,将长篇无结构文本作为输入,导致token消耗高且证据查找脆弱。解决方案的关键在于提出DeepXiv-SDK——一个面向科学论文的代理数据接口,它通过结构化视图(header-first筛选视图、section结构化导航视图、按需证据级访问视图)实现分层、预算感知的渐进式访问机制,并嵌入增强属性和显式预算提示,使智能体能够在升级至全文处理前,基于相关性、成本和可溯源性进行权衡优化。此外,该方案支持多维检索与聚合,满足约束驱动的文献搜索与筛选需求,已在arXiv规模部署并可扩展至其他开放获取语料库(如PubMed Central)。
链接: https://arxiv.org/abs/2603.00084
作者: Hongjin Qian,Ziyi Xia,Ze Liu,Jianlv Chen,Kun Luo,Minghao Qin,Chaofan Li,Lei Xiong,Sen Wang,Zhengyang Liang,Zheng Liu
机构: Beijing Academy of Artificial Intelligence (北京人工智能研究院)
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Project at this https URL
Abstract:Research agents are increasingly used in AI4Science for scientific information seeking and evidence-grounded decision making. Yet a persistent bottleneck is paper access: agents typically retrieve PDF/HTML pages, heuristically parse them, and ingest long unstructured text, leading to token-heavy reading and brittle evidence lookup. This motivates an agentic data interface for scientific papers that standardizes access, exposes budget-aware views, and treats grounding as a first-class operation. We introduce DeepXiv-SDK, which enables progressive access aligned with how agents allocate attention and reading budget. DeepXiv-SDK exposes as structured views a header-first view for screening, a section-structured view for targeted navigation, and on-demand evidence-level access for verification. Each layer is augmented with enriched attributes and explicit budget hints, so agents can balance relevance, cost, and grounding before escalating to full-text processing. DeepXiv-SDK also supports multi-faceted retrieval and aggregation over paper attributes, enabling constraint-driven search and curation over paper sets. DeepXiv-SDK is currently deployed at arXiv scale with daily synchronization to new releases and is designed to extend to other open-access corpora (e.g., PubMed Central, bioRxiv). We release RESTful APIs, an open-source Python SDK, and a web demo showcasing deep search and deep research workflows; the service is free to use with registration.
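摘要中的"渐进式访问"可理解为:代理先读成本最低的 header 视图完成筛选,只有在预算允许且论文相关时才逐层升级到 section 与证据级视图。以下为示意草图(视图名称与 token 成本均为假设值):

```python
# 示意:分层、预算感知的渐进式论文访问
# (视图名称与 token 成本均为假设值)

VIEWS = [("header", 50), ("sections", 400), ("evidence", 2000)]

def read_progressively(budget, relevant):
    consumed, reached = 0, []
    for view, cost in VIEWS:
        if consumed + cost > budget:
            break                      # 预算不足,停止升级
        consumed += cost
        reached.append(view)
        if not relevant and view == "header":
            break                      # 最便宜的 header 层即可完成筛除
    return reached, consumed

print(read_progressively(500, relevant=False))   # (['header'], 50)
print(read_progressively(3000, relevant=True))   # (['header', 'sections', 'evidence'], 2450)
```

不相关的论文只消耗 header 层的少量 token,这正是分层视图相对"整篇读入"的预算优势。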
[IR-22] Exploring Drug Safety Through Knowledge Graphs: Protein Kinase Inhibitors as a Case Study
【速读】:该论文旨在解决现有药物不良反应(Adverse Drug Reactions, ADRs)预测方法难以有效整合异构且部分非结构化证据的问题,如化学相似性、结构化数据库中的机器学习模型或孤立的靶点谱等。其解决方案的关键在于构建一个基于知识图谱的框架,将多种来源的数据——包括药物-靶点数据(ChEMBL)、临床试验文献(PubMed)、试验元数据(this http URL)以及上市后安全报告(FAERS)——统一为一个加权的二部网络(bipartite network),其中节点分别为药物和医学状况,边表示相关性强度。该框架通过目标与不良事件之间的关联实现对ADR的预测,并支持疗效指标(HR、PFS、OS)、表型及靶点相似性的上下文比较,从而揭示复杂模式并增强药物警戒能力。
链接: https://arxiv.org/abs/2603.00097
作者: David Jackson,Michael Gertz,Jürgen Hesser
机构: University of Amsterdam (阿姆斯特丹大学); Heidelberg University (海德堡大学)
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 14 pages, 5 figures. Code and data available at this https URL
Abstract:Adverse Drug Reactions (ADRs) are a leading cause of morbidity and mortality. Existing prediction methods rely mainly on chemical similarity, machine learning on structured databases, or isolated target profiles, but often fail to integrate heterogeneous, partly unstructured evidence effectively. We present a knowledge graph-based framework that unifies diverse sources, drug-target data (ChEMBL), clinical trial literature (PubMed), trial metadata (this http URL), and post-marketing safety reports (FAERS) into a single evidence-weighted bipartite network of drugs and medical conditions. Applied to 400 protein kinase inhibitors, the resulting network enables contextual comparison of efficacy (HR, PFS, OS), phenotypic and target similarity, and ADR prediction via target-to-adverse-event correlations. A non-small cell lung cancer case study correctly highlights established and candidate drugs, target communities (ERbB, ALK, VEGF), and tolerability differences. Designed as an orthogonal, extensible analysis and search tool rather than a replacement for current models, the framework excels at revealing complex patterns, supporting hypothesis generation, and enhancing pharmacovigilance. Code and data are publicly available at this https URL.
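论文核心的"证据加权二部网络"在数据结构上可以非常直接地表达:以(药物, 病症)为边,边权累加来自各数据源(试验、文献、FAERS 等)的证据强度。以下为最小示意(药物、数据源与权重均为虚构示例):

```python
# 示意:证据加权的药物-病症二部网络(数据与权重为虚构示例)
from collections import defaultdict

edges = defaultdict(float)

def add_evidence(drug, condition, source, strength):
    # 同一条 (药物, 病症) 边上累加来自不同数据源的证据强度;
    # 真实系统还会保留 source 以追溯证据来源
    edges[(drug, condition)] += strength

add_evidence("gefitinib", "NSCLC", "ClinicalTrials", 0.8)
add_evidence("gefitinib", "NSCLC", "PubMed", 0.5)
add_evidence("gefitinib", "rash", "FAERS", 0.3)

# 按累计证据权重对某药物关联的病症排序
ranked = sorted(((c, w) for (d, c), w in edges.items() if d == "gefitinib"),
                key=lambda x: -x[1])
print(ranked[0])  # ('NSCLC', 1.3)
```

在这种表示下,ADR 预测即可归结为在网络上寻找经由共享靶点连接的药物-不良事件路径并比较边权。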
人机交互
[HC-0] Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation CVPR2026
【速读】:该论文旨在解决现有基于扩散模型的3D多人体运动生成方法在强多实体条件约束下难以精确遵循复杂交互约束(如物体接触、关节动作时序等)且推理效率低的问题。其解决方案的关键在于:首先学习一种由草图驱动的扩散先验,随后将其蒸馏为一个高效的修正流(rectified-flow)学生模型,在潜在空间中实现快速稳定的采样;同时通过可微能量函数对关键帧、轨迹和物理约束进行建模,直接引导学生模型的传输场,确保生成动作既忠实于分镜草图又具备物理合理性;此外,引入连续时间马尔可夫链(CTMC)规划器以调度离散事件(如触碰、抓取、交接),从而精确控制多人协同动作的节奏与时机,显著提升交互协调性与感知质量。
链接: https://arxiv.org/abs/2603.02190
作者: Divyanshu Daiya,Aniket Bera
机构: Purdue University (普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Accepted to CVPR 2026 Main Conference (11 pages, 5 figures)
Abstract:We present Sketch2Colab, which turns storyboard-style 2D sketches into coherent, object-aware 3D multi-human motion with fine-grained control over agents, joints, timing, and contacts. Conventional diffusion-based motion generators have advanced realism; however, achieving precise adherence to rich interaction constraints typically demands extensive training and/or costly posterior guidance, and performance can degrade under strong multi-entity conditioning. Sketch2Colab instead first learns a sketch-driven diffusion prior and then distills it into an efficient rectified-flow student operating in latent space for fast, stable sampling. Differentiable energies over keyframes, trajectories, and physics-based constraints directly shape the student’s transport field, steering samples toward motions that faithfully satisfy the storyboard while remaining physically plausible. To capture coordinated interaction, we augment the continuous flow with a continuous-time Markov chain (CTMC) planner that schedules discrete events such as touches, grasps, and handoffs, modulating the dynamics to produce crisp, well-phased human-object-human collaborations. Experiments on CORE4D and InterHuman show that Sketch2Colab achieves state-of-the-art constraint adherence and perceptual quality while offering significantly faster inference than diffusion-only baselines.
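文中用于调度离散交互事件(触碰、抓取、交接)的连续时间马尔可夫链,其采样过程可示意如下:系统在每个状态停留一段指数分布时间后跳转到下一事件(状态与转移速率均为假设值,非论文参数):

```python
# 玩具示例:用 CTMC 采样离散交互事件(触碰 → 抓取 → 交接)的发生时刻
# (状态与转移速率均为假设值)
import random

RATES = {"idle": {"touch": 0.5}, "touch": {"grasp": 1.0}, "grasp": {"handoff": 0.8}}

def sample_events(start="idle", seed=0):
    random.seed(seed)
    t, state, events = 0.0, start, []
    while state in RATES:
        nxt, rate = next(iter(RATES[state].items()))
        t += random.expovariate(rate)   # 指数分布的停留时间
        events.append((round(t, 2), nxt))
        state = nxt
    return events

events = sample_events()
print([name for _, name in events])  # ['touch', 'grasp', 'handoff']
```

采样得到的事件时刻可用于调制连续的修正流动力学,使触碰、抓取等动作在正确的时间点"相位对齐"。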
[HC-1] Cognitive Prosthetic: An AI-Enabled Multimodal System for Episodic Recall in Knowledge Work
【速读】:该论文旨在解决现代知识型工作场所中人类情景记忆(episodic memory)因碎片化注意力、重叠会议和多模态信息流而面临日益加剧的压力问题。现有工具虽提供部分支持,如笔记记录或分析功能,但通常未能将认知、生理和注意力上下文整合为可检索的记忆表征。解决方案的关键在于提出一种名为认知假体多模态系统(Cognitive Prosthetic Multimodal System, CPMS)的AI驱动原型系统,其通过结构化的情景数据捕获与自然语言检索机制,将语音转录、生理信号和注视行为同步对齐为基于JSON格式的情景记录,并在本地处理以保障隐私;同时配备基于Web的查询接口,支持用户依据语义内容、时间、注意力焦点或生理状态等维度检索过往工作经历,从而实现异构传感器数据向可查询情景记忆的转化。
链接: https://arxiv.org/abs/2603.02072
作者: Lawrence Obiuwevwi,Krzysztof J. Rechowicz,Vikas Ashok,Sachin Shetty,Sampath Jayarathna
机构: Old Dominion University (欧道明大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: CHI EA '26
Abstract:Modern knowledge workplaces increasingly strain human episodic memory as individuals navigate fragmented attention, overlapping meetings, and multimodal information streams. Existing workplace tools provide partial support through note-taking or analytics but rarely integrate cognitive, physiological, and attentional context into retrievable memory representations. This paper presents the Cognitive Prosthetic Multimodal System (CPMS) --an AI-enabled proof-of-concept designed to support episodic recall in knowledge work through structured episodic capture and natural language retrieval. CPMS synchronizes speech transcripts, physiological signals, and gaze behavior into temporally aligned, JSON-based episodic records processed locally for privacy. Beyond data logging, the system includes a web-based retrieval interface that allows users to query past workplace experiences using natural language, referencing semantic content, time, attentional focus, or physiological state. We present CPMS as a functional proof-of-concept demonstrating the technical feasibility of transforming heterogeneous sensor data into queryable episodic memories. The system is designed to be modular, supporting operation with partial sensor configurations, and incorporates privacy safeguards for workplace deployment. This work contributes an end-to-end, privacy-aware architecture for AI-enabled memory augmentation in workplace settings.
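CPMS 的情景记录与自然语言检索可抽象为"结构化 JSON 记录 + 条件过滤"。以下为示意(字段名参照文中描述但 schema 为假设,并非系统真实格式):

```python
# 示意:时间对齐的 JSON 情景记录及其条件检索
# (字段名参照文中描述,但 schema 为假设,非 CPMS 真实格式)

episodes = [
    {"t": "2026-03-02T10:00", "transcript": "budget review meeting",
     "gaze_target": "spreadsheet", "heart_rate": 92},
    {"t": "2026-03-02T14:00", "transcript": "design sync",
     "gaze_target": "whiteboard", "heart_rate": 70},
]

def query_episodes(records, keyword=None, min_heart_rate=None):
    hits = records
    if keyword is not None:
        hits = [r for r in hits if keyword in r["transcript"]]
    if min_heart_rate is not None:
        hits = [r for r in hits if r["heart_rate"] >= min_heart_rate]
    return hits

# 对应自然语言查询:"我在哪个会议里生理唤醒较高?"
stressed = query_episodes(episodes, min_heart_rate=85)
print(stressed[0]["gaze_target"])  # spreadsheet
```

真实系统中,自然语言查询会先被解析为此类结构化条件(语义、时间、注视、生理),再在本地记录上执行过滤,以兼顾隐私与可检索性。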
[HC-2] A Resource-Rational Principle for Modeling Visual Attention Control
【速读】:该论文旨在解决现有视觉注意力计算模型普遍存在描述性过强、任务特异性高或可解释性差的问题,从而难以支持通用且高效的交互设计。其解决方案的关键在于构建一个基于资源理性(resource-rational)的仿真框架,将视觉任务建模为在感知、记忆和时间约束下的序贯决策过程,并采用部分可观测马尔可夫决策过程(Partially Observable Markov Decision Processes, POMDPs)形式化任务,使注视行为和注意力切换等眼动模式从理性适应中自然涌现,而非依赖人工编码或纯数据驱动方式。该方法实现了对阅读与多任务场景下眼动行为的统一建模,并能复现经典实验效应、解释认知权衡(如理解与安全之间的权衡),同时生成新预测,为理论驱动且资源高效的HCI设计提供新工具。
链接: https://arxiv.org/abs/2603.02056
作者: Yunpeng Bai
机构: National University of Singapore(新加坡国立大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding how people allocate visual attention is central to Human-Computer Interaction (HCI), yet existing computational models of attention are often either descriptive, task-specific, or difficult to interpret. My dissertation develops a resource-rational, simulation-based framework for modeling visual attention as a sequential decision-making process under perceptual, memory, and time constraints. I formalize visual tasks, such as reading and multitasking, as bounded-optimal control problems using Partially Observable Markov Decision Processes, enabling eye-movement behaviors such as fixation and attention switching to emerge from rational adaptation rather than being hand-coded or purely data-driven. These models are instantiated in simulation environments spanning traditional text reading and reading-while-walking with smart glasses, where they reproduce classic empirical effects, explain observed trade-offs between comprehension and safety, and generate novel predictions under time pressure and interface variation. Collectively, this work contributes a unified computational account of visual attention, offering new tools for theory-driven and resource-efficient HCI design.
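以 POMDP 建模视觉注意的核心一步是信念更新:每次注视得到一个含噪观测,再按贝叶斯规则修正对环境状态的信念。以下为玩具示例(命中率与虚警率为假设参数,非论文拟合值):

```python
# 玩具示例:POMDP 中对"是否存在危险"的贝叶斯信念更新
# (hit_rate / false_alarm 为假设参数,非论文拟合值)

def update_belief(prior, observation, hit_rate=0.9, false_alarm=0.2):
    """一次含噪注视观测后的后验 P(hazard)。"""
    likelihood = hit_rate if observation else 1 - hit_rate
    alt = false_alarm if observation else 1 - false_alarm
    return likelihood * prior / (likelihood * prior + alt * (1 - prior))

b = 0.5
b = update_belief(b, observation=True)   # 一次"看到危险"的观测
print(round(b, 3))  # 0.818
```

在这种框架下,"何时把注视从文本切换到路面"就成了在信念状态上最大化期望回报的理性决策,注视模式由此自然涌现,而无需手工编码。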
[HC-3] Strategic Advice in the Age of Personal AI
【速读】:该论文旨在解决个人AI助手(Personal AI)在人类决策过程中介入专业顾问(如医生、律师等)建议时所引发的策略性互动问题,即当顾问预见到用户可能随机咨询AI且其推荐可预测时,如何调整自身行为以应对AI的影响,以及这种动态对顾问绩效和信任机制的重塑效应。解决方案的关键在于构建一个博弈模型,明确个人AI的两个核心参数——咨询频率和权重分配——如何通过影响顾问的反应强度(counteraction)来非单调地改变顾问绩效,并引入“相对影响力指数”量化信任结构变化;进一步扩展至可信度建设成本后,揭示了AI采用如何重构顾问投资于可信度的激励机制。
链接: https://arxiv.org/abs/2603.02055
作者: Yueyang Liu,Wichinpong Park Sinchaisri
机构: 未知
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Human-Computer Interaction (cs.HC)
备注:
Abstract:Personal AI assistants have changed how people use institutional and professional advice. We study this new strategic setting in which individuals may stochastically consult a personal AI whose recommendation is predictable to the focal advisor. Personal AI enters this strategic environment along two dimensions: how often it is consulted and how much weight it receives in the human’s decision when consulted. Anticipating this, the advisor responds by counteracting the personal AI recommendation. Counteraction becomes more aggressive as personal AI is consulted more often. Yet advisor performance is non-monotone: equilibrium loss is highest at intermediate levels of adoption and vanishes when personal AI is never used or always used. Trust affects performance through a single relative influence index, and greater relative influence of personal AI increases advisor vulnerability. Extending the framework to costly credibility building, we characterize how personal AI adoption reshapes incentives to invest in trust.
[HC-4] n Vigilance: Navigating Risky Social Interactions on Discord
【速读】:该论文试图解决青少年在 Discord 这一混合式社交平台中面临的安全风险问题,尤其是其私密消息(DMs)、半私有语音频道和公共服务器共存所导致的复杂且未被充分研究的风险。解决方案的关键在于青少年通过“警觉性”(vigilance)策略主动管理风险:包括在建立友谊前评估可疑互动、使用安全工具以及进行受控的风险行为以保护隐私与安全;同时,在社区层面通过选择性参与服务器并借助具有警觉性的治理结构来降低风险。这一发现揭示了青少年在在线环境中具备自主应对能力,为设计以青少年为中心的更安全数字环境提供了实证依据。
链接: https://arxiv.org/abs/2603.02052
作者: Elena Koung,Yunhan Liu,Zinan Zhang,Xinning Gui,Yubo Kou
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Teenagers are avid users of Discord, a fast growing platform for synchronous communication where they often interact with strangers. Because Discord combines private DMs, semi-private voice channels, and public servers in one place, it creates a hybrid environment that can produce complex and underexplored safety risks for teenagers. Drawing on 16 interviews with teenage Discord users, this study examines their strategies for navigating risky social interactions in the platform. Our findings reveal that when teenagers encounter risks during social interactions, they exercise vigilance by evaluating suspicious interactions before forming friendships, using safety tools, and engaging in controlled risk-taking to safeguard their privacy and security. At the community level, they mitigate risks through selective participation in servers, a practice supported by vigilant governance structures. We discuss how vigilance enables teenagers to act during risky encounters to protect themselves, advancing understanding of teenagers’ agency in risk navigation and informing teen-centered designs for safer online environments.
[HC-5] “When to Hand Off, When to Work Together”: Expanding Human-Agent Co-Creative Collaboration through Concurrent Interaction
【速读】:该论文旨在解决当前AI代理在人机协作中缺乏协同情境感知(collaborative context awareness)的问题,即现有AI系统通常仅提供最终输出或静态的执行过程(如规划、推理),无法实时理解用户对共享工作成果的并发操作并作出适应性调整。其解决方案的关键在于开发了CLEO系统,该系统基于混合主动性交互原则,能够解析用户的协同意图,并在实时交互中动态调整自身行为,从而实现真正的协同式协作。研究通过两阶段实验验证了这一方法的有效性,揭示了设计师在不同情境下选择委托、指导或并行工作的决策模式及其触发机制,为构建具备情境感知能力的AI协作代理提供了理论模型与设计依据。
链接: https://arxiv.org/abs/2603.02050
作者: Kihoon Son,Hyewon Lee,DaEun Choi,Yoonsu Kim,Tae Soo Kim,Yoonjoo Lee,John Joon Young Chung,HyunJoon Jung,Juho Kim
机构: KAIST(韩国科学技术院); University of Michigan(密歇根大学); Midjourney; MPhoraLab; SkillBench
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Human collaborators coordinate dynamically through process visibility and workspace awareness, yet AI agents typically either provide only final outputs or expose read-only execution processes (e.g., planning, reasoning) without interpreting concurrent user actions on shared artifacts. Building on mixed-initiative interaction principles, we explore whether agents can achieve collaborative context awareness – interpreting concurrent user actions on shared artifacts and adapting in real-time. Study 1 (N=10 professional designers) revealed that process visibility enabled reasoning about agent actions but exposed conflicts when agents could not distinguish feedback from independent work. We developed CLEO, which interprets collaborative intent and adapts in real-time. Study 2 (N=10, two-day with stimulated recall interviews) analyzed 214 turns, identifying five action patterns, six triggers, and four enabling factors explaining when designers choose delegation (70.1%), direction (28.5%), or concurrent work (31.8%). We present a decision model with six interaction loops, design implications, and an annotated dataset.
[HC-6] Does Travel Stage Matter? How Leisure Travellers Perceive Their Privacy Attitudes Towards Personal Data Sharing Before, During and After Travel
【速读】:该论文试图解决的问题是:现有研究虽广泛探讨了人们对于个人数据共享的态度,但缺乏对休闲旅行者在旅行不同阶段(出行前、出行中、出行后)其隐私态度演变规律的系统性分析。解决方案的关键在于通过一项包含318名参与者的在线调查数据分析发现:旅行者对不同类型个人数据(如姓名、性别等常见敏感信息)的共享态度存在显著差异,且这种态度受共享目的和旅行阶段影响;此外,社交平台使用行为呈现明显选择性(如Facebook和Instagram更常用于旅行内容分享,而TikTok、YouTube等则较少使用),且这一模式在旅行各阶段保持稳定,说明旅行阶段本身并非决定因素;最后,个体特征(如性别、旅行频率、国籍)显著调节隐私感知,揭示了隐私态度具有高度情境依赖性和复杂性。这些发现为隐私与安全领域提供了新的实证依据和理论洞见。
链接: https://arxiv.org/abs/2603.01992
作者: Haiyue Yuan,Shujun Li,Fatima Gillani,Dongmei Cao,Xiao Ma
机构: 未知
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:People’s attitudes towards personal data sharing have been extensively researched; however, limited research has studied their evolving nature across different stages of a leisure trip. This paper addresses this gap by exploring how leisure travellers’ attitudes towards sharing personal data change before, during and after travel. Analysing data from an online survey with 318 participants, we found that participants’ privacy attitudes towards sharing different personal data vary based on sharing purposes and travel stages. Interestingly, participants exhibited a more relaxed attitude towards sharing commonly sensitive personal data (e.g., name, gender) compared to other types of personal data. This is likely because sharing such data for travel bookings has become essential and widely accepted among travellers when using booking sites, which is in line with previous work stating that information easily obtainable is typically not seen as highly confidential. Moreover, despite participants’ self-reported frequent use of social media platforms, content sharing is minimal on TikTok, YouTube, Snapchat, Pinterest, and Twitter. Conversely, Facebook and Instagram were more common for travel-related content sharing. This pattern remains consistent across the three stages of travel, suggesting that the stage of travel does not significantly influence how people share on social media platforms, which has been overlooked in past studies. Furthermore, we discovered that a participant’s gender, previous travel frequency, and country of residence can influence their perceptions of personal data sharing at different travel stages, confirming the complex and context-dependent nature of privacy perception and attitudes. Based on the findings observed from this study, we further discuss implications and potential contributions of our work to the privacy and security community in general.
[HC-7] actileWalk: Dynamic Electrotactile Patterns for Fingertip-Based Interaction During Walking
【速读】:该论文旨在解决可穿戴设备在无视觉依赖场景下(如盲人导航或移动中)如何通过触觉反馈实现高效空间方向引导的问题。其核心解决方案是设计并验证一种基于指尖电刺激的动态触觉模式系统,关键在于采用10×6电极阵列与ESP32微控制器结合高电压驱动电路,实现快速、独立的时空模式渲染;实验表明,双线(Double Line)模式在行走状态下仍保持最高识别准确率(90.83%),且用户偏好度高,证明简单、空间冗余的触觉模式能有效降低认知负荷,提升移动环境中的导航可靠性。
链接: https://arxiv.org/abs/2603.01974
作者: Vedika Nimbalkar,Roshan Peiris
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:TactileWalk evaluates dynamic electrotactile patterns on fingertips for wearable navigation. We developed a fingertip stimulation prototype featuring a 10x6 electrode grid driven by an ESP32 microcontroller and high-voltage drivers to enable rapid, independent electrode activation for spatiotemporal pattern rendering. This research compares three dynamic patterns- Single Line, Double Line, and Box-across eight directions presented on the tactile display at the fingertip. Study 1 (stationary) revealed that simple linear patterns were recognized significantly more accurately than complex shapes. Study 2 (walking) confirmed these cues remain robust under movement, where the Double Line pattern yielded the highest accuracy (90.83%). Participants consistently preferred the reinforcing Double Line and found vertical motion more intuitive while walking. We propose design implications for mobile haptics, advocating for simple, spatially redundant patterns to minimize cognitive load during eyes-free navigation.
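在 10×6 电极阵列上呈现识别率最高的"双线(Double Line)"动态模式,可以理解为两列相邻电极同时激活并随时间横向扫过。以下为帧序列生成的示意草图(栅格索引方式、扫描方向与时序均为假设,非原型固件实现):

```python
# 示意:在 10x6 电极栅格上生成"双线"动态模式的帧序列
# (栅格索引、扫描方向与时序参数均为假设)

ROWS, COLS = 10, 6

def double_line_frames():
    frames = []
    for c in range(COLS - 1):
        frame = [[0] * COLS for _ in range(ROWS)]
        for r in range(ROWS):
            frame[r][c] = frame[r][c + 1] = 1   # 相邻两列同时激活,互相强化
        frames.append(frame)
    return frames

frames = double_line_frames()
print(len(frames), sum(frames[0][0]))  # 5 2
```

两列冗余激活正对应论文的结论:空间上互相强化的简单模式在行走等高认知负荷场景下最为稳健。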
[HC-8] SoK: Is Sustainable the New Usable? Debunking The Myth of Fundamental Incompatibility Between Security and Sustainability
【速读】:该论文试图解决的问题是:当前普遍存在的观念认为网络安全(cybersecurity)与环境可持续性之间存在根本性矛盾,即难以同时实现产品的安全性、长期使用性和可回收性,从而导致大量功能正常的电子系统因缺乏厂商支持而被当作电子垃圾处理。其解决方案的关键在于通过系统分析29篇文献并提炼出155条可持续性指南,归纳出12个可持续性主题,进而对比可持续人机交互(Sustainable HCI)、可持续软件工程与网络安全领域的指导原则,发现两者之间实际上并无根本性冲突;少数张力可通过在设计阶段审慎权衡安全与可持续目标来缓解,并指出二者均面临将责任归咎于个体用户的“用户是最薄弱环节”迷思,因此可用安全领域中“可用安全”(usable security)的经验推动系统性设计变革,以整合可持续性考量。
链接: https://arxiv.org/abs/2603.01958
作者: Maxwell Keleher,David Barrera,Sonia Chiasson
机构: Carleton University (卡尔顿大学)
类目: Human-Computer Interaction (cs.HC); Cryptography and Security (cs.CR)
备注:
Abstract:Every year, millions of functional systems become e-waste because users are pressured to send their systems to landfills due to a lack of vendor support and difficulty in recycling. Vendors cite "cybersecurity" as the driver for short product support periods, leading to a prevalent, but uninterrogated, belief that cybersecurity and environmental sustainability are fundamentally contradictory; i.e., it is difficult, if not impossible, to build products that are secure, long-lasting, and reusable. To understand the nuanced relationship between security and sustainability, we systematically analyze 29 papers and distill 155 sustainability guidelines into 12 sustainability themes. These themes enable us to compare the sustainable HCI and sustainable software engineering guidance with that of cybersecurity, identifying points of alignment and tension. We find little evidence of a fundamental tension between these two domains; the few instances of tension can be mitigated through thoughtful consideration of security and sustainability objectives. We also find that sustainability, like usable security, struggles with the myth of users as the weakest link and the individualization of responsibility. Building on these parallels, we argue that the usable security community is well-positioned to integrate sustainability considerations, as both fields share challenges in shifting responsibility from individuals to systemic design.
[HC-9] Ignore All Previous Instructions: Jailbreaking as a de-escalatory peace building practise to resist LLM social media bots ICLR2026
【速读】:该论文试图解决的问题是:大规模语言模型(Large Language Models, LLMs)在社交媒体上加剧了政治话语的规模与策略性操纵,导致冲突升级。现有研究主要聚焦于平台主导的内容审核机制作为应对措施,但未能充分关注用户自身的能动性。论文提出了一种以用户为中心的“越狱”(jailbreaking)视角,将其视为一种新兴的、非暴力的去升级(de-escalation)实践——即用户主动尝试绕过LLM的安全防护机制,与疑似由LLM驱动的账号互动,从而暴露其自动化行为特征,并阻断误导性叙事的传播链条。解决方案的关键在于识别并利用用户对自动化内容的敏感性,通过主动干预实现对虚假信息生态的自我调节。
链接: https://arxiv.org/abs/2603.01942
作者: Huw Day,Adrianna Jezierska,Jessica Woodgate
机构: University of Bristol (布里斯托大学); Google (谷歌)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted to ICLR 2026 AI for peace workshop
Abstract:Large Language Models have intensified the scale and strategic manipulation of political discourse on social media, leading to conflict escalation. The existing literature largely focuses on platform-led moderation as a countermeasure. In this paper, we propose a user-centric view of “jailbreaking” as an emergent, non-violent de-escalation practice. Online users engage with suspected LLM-powered accounts to circumvent large language model safeguards, exposing automated behaviour and disrupting the circulation of misleading narratives.
[HC-10] Visual Bias in Simulated Users: The Impact of Luminance and Contrast on Reinforcement Learning-based Interaction
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)驱动的模拟用户在人机交互(Human-Computer Interaction, HCI)任务中行为有效性的问题,特别是当视觉渲染伪影(visual rendering artifacts)可能干扰交互设计本意时。研究发现,亮度(luminance)和对比度对模拟用户的行为有显著影响,尤其在存在静态干扰物的情况下,亮度变化会大幅降低任务性能与鲁棒性;而运动线索可缓解此问题。解决方案的关键在于:模拟用户的学习依赖于亮度之间的相对关系(relational ordering)而非绝对值匹配,且极端亮度(如黑色)虽能提升短期表现却损害泛化能力。这一发现揭示了RL模拟用户实际学习的内容,并为提升仿真有效性提供了关键设计原则。
链接: https://arxiv.org/abs/2603.01901
作者: Hannah Selder,Charlotte Beylier,Nico Scherf,Arthur Fleig
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 10 pages, 8 figures, CHI EA 2026
Abstract:Reinforcement learning (RL) enables simulations of HCI tasks, yet their validity is questionable when performance is driven by visual rendering artifacts distinct from interaction design. We provide the first systematic analysis of how luminance and contrast affect behavior by training 247 simulated users using RL on pointing and tracking tasks. We vary the luminance of task-relevant objects, distractors, and background under no distractor, static distractor, and moving distractor conditions, and evaluate task performance and robustness to unseen luminances. Results show luminance becomes critical with static distractors, substantially degrading performance and robustness, whereas motion cues mitigate this issue. Furthermore, robustness depends on preserving relational ordering between luminances rather than matching absolute values. Extreme luminances, especially black, often yield high performance but poor robustness. Overall, seemingly minor luminance changes can strongly shape learned behavior, revealing critical insights into what RL-driven simulated users actually learn.
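The "relational ordering" finding lends itself to a small worked example. The element names and luminance values below are assumptions for illustration: two luminance assignments are order-equivalent when ranking the same scene elements by brightness yields the same permutation, even if the absolute values differ, and that is the property the abstract reports as critical for robustness.

```python
# Illustrative check of relational luminance ordering (values assumed).

def luminance_ranking(lum):
    """Element names ordered from darkest to brightest."""
    return tuple(sorted(lum, key=lum.get))

train           = {"target": 0.9, "distractor": 0.5, "background": 0.1}
test_same_order = {"target": 0.6, "distractor": 0.4, "background": 0.2}
test_flipped    = {"target": 0.1, "distractor": 0.5, "background": 0.9}

# Same ordering despite different absolute values -> robust transfer.
print(luminance_ranking(train) == luminance_ranking(test_same_order))  # True
# Flipped ordering -> the condition the paper finds brittle.
print(luminance_ranking(train) == luminance_ranking(test_flipped))     # False
```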
[HC-11] PleaSQLarify: Visual Pragmatic Repair for Natural Language Database Querying
【速读】:该论文旨在解决自然语言数据库接口(Natural Language Database Interfaces)在面对用户输入歧义时的脆弱性问题,即标准方法常将不确定性压缩为单一查询,难以应对用户意图与系统理解之间的不匹配。解决方案的关键在于引入“语用修复”(Pragmatic Repair)机制,通过结构化交互围绕可解释的决策变量进行最小化互动以实现高效澄清,并辅以可视化界面展示动作空间、请求用户消歧及追踪信念更新过程,从而提升用户对自然语言接口的有效控制能力。
链接: https://arxiv.org/abs/2603.01795
作者: Robin Shing Moon Chan,Rita Sevastjanova,Mennatallah El-Assady
机构: ETH Zurich (苏黎世联邦理工学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at CHI’26, main track
Abstract:Natural language database interfaces broaden data access, yet they remain brittle under input ambiguity. Standard approaches often collapse uncertainty into a single query, offering little support for mismatches between user intent and system interpretation. We reframe this challenge through pragmatic inference: while users economize expressions, systems operate on priors over the action space that may not align with the users’. In this view, pragmatic repair – incremental clarification through minimal interaction – is a natural strategy for resolving underspecification. We present PleaSQLarify, which operationalizes pragmatic repair by structuring interaction around interpretable decision variables that enable efficient clarification. A visual interface complements this by surfacing the action space for exploration, requesting user disambiguation, and making belief updates traceable across turns. In a study with twelve participants, PleaSQLarify helped users recognize alternative interpretations and efficiently resolve ambiguity. Our findings highlight pragmatic repair as a design principle that fosters effective user control in natural language interfaces.
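The "interpretable decision variables" idea can be illustrated with a toy example. Everything here (the variable names, the candidate interpretations, and the disagreement heuristic) is an assumption for illustration, not PleaSQLarify's actual algorithm: asking the user about the variable on which candidate interpretations disagree most prunes the interpretation space fastest.

```python
# Hypothetical candidate interpretations of one ambiguous query, each
# expressed as an assignment to interpretable decision variables.
candidates = [
    {"metric": "count", "scope": "per_year",  "filter": "active"},
    {"metric": "count", "scope": "overall",   "filter": "active"},
    {"metric": "count", "scope": "per_month", "filter": "all"},
]

def most_ambiguous_variable(cands):
    """Pick the decision variable with the most distinct values,
    i.e. the most informative clarification question to ask."""
    return max(cands[0], key=lambda v: len({c[v] for c in cands}))

print(most_ambiguous_variable(candidates))  # "scope": 3 distinct values
```

Clarifying "scope" first resolves three-way ambiguity in one turn, whereas asking about "metric" (on which all candidates agree) would resolve nothing.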
[HC-12] An Investigation of the Relation Between Immersion and Learning Across Three Domains
【速读】:该论文旨在解决沉浸式虚拟现实(Immersive Virtual Reality, IVR)在不同教育场景中对学习效果的影响机制不明确的问题,尤其关注其与用户沉浸感(presence)和用户体验之间的关系。解决方案的关键在于基于认知情感沉浸学习模型(Cognitive Affective Model of Immersive Learning, CAMIL)框架,设计并实施三个跨领域的应用(文化遗址、环境意识和高中物理),通过统一的评估协议在实验室和课堂环境中系统测量学习成效、沉浸感、技术接受度及晕动症等指标。研究发现,IVR显著提升用户的沉浸感、体验质量和对技术的接受度,但学习结果呈现混合效应,表明沉浸感本身并不直接等同于学习收益;因此,论文进一步提炼出基于CAMIL的设计指南,明确了IVR在教学情境中最具优势的应用条件,并提出针对性策略以优化学习效果与整体体验质量。
链接: https://arxiv.org/abs/2603.01644
作者: Paolo Boffi,Alberto Gallace,Pier Luca Lanzi
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:We investigate the relationship between immersion and learning across three domains (cultural heritage, environmental awareness, and high school physics) through the lens of the Cognitive Affective Model of Immersive Learning (CAMIL) framework. We present three applications we developed for this investigation, highlighting their shared design elements and domain-specific mechanics. Using a common evaluation protocol across lab studies and a classroom deployment, we assessed learning outcomes, user experience, technology acceptance, presence/embodiment, and cybersickness. Our results show that immersive virtual reality led to higher scores for presence, user experience, and technology acceptance. In contrast, learning outcomes were mixed. In immediate post-test evaluations, factual knowledge scores were comparable between immersive virtual reality and control groups. In the end, we synthesize design guidelines that outline when immersive virtual reality might be most beneficial in didactic contexts, and we provide CAMIL-informed recommendations and strategies to improve learning outcomes and overall experiential quality.
[HC-13] Who Explains Privacy Policies to Me? Embodied and Textual LLM-Powered Privacy Assistants in Virtual Reality
【速读】:该论文旨在解决虚拟现实(Virtual Reality, VR)系统在收集细粒度行为与生物特征数据时,用户因隐私政策文本复杂、冗长且难以理解而难以实现知情同意的问题。解决方案的关键在于开发一个基于大语言模型(Large Language Model, LLM)的隐私助手,并将其嵌入VR应用商店中,以支持用户在应用选择过程中进行隐私意识决策。该助手提供两种交互模式:基于文本的聊天界面和具身虚拟化身提供的语音解释,从而降低用户获取隐私信息的认知负担,促进更审慎的隐私决策行为。
链接: https://arxiv.org/abs/2603.01638
作者: Vincent Freiberger,Moritz Dresch,Florian Alt,Arthur Fleig,Viktorija Paneva
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 11 pages, 1 figure, 1 table
Abstract:Virtual Reality (VR) systems collect fine-grained behavioral and biometric data, yet privacy policies are rarely read or understood due to their complex language, length, and poor integration into users’ interaction workflows. To lower the barrier to informed consent at the point of choice, we explore a Large Language Model (LLM)-powered privacy assistant embedded into a VR app store to support privacy-aware app selection. The assistant is realized in two interaction modes: a text-based chat interface and an embodied virtual avatar providing spoken explanations. We report on an exploratory within-subjects study (N = 21) in which participants browsed VR productivity applications under unassisted and assisted conditions. Our findings suggest that both interaction modes support more deliberate engagement with privacy information and decision-making, with privacy scores primarily functioning as a veto mechanism rather than a primary selection driver. The impact of embodied interaction varied between participants, while textual interaction supported reflective review.
[HC-14] Bimanual XR Specification of Relative and Absolute Assembly Hierarchies for Teleoperation
【速读】:该论文旨在解决远程装配任务中如何高效、直观地指定复杂操作约束的问题,尤其在人机协作场景下,需明确表达对象间的相对与绝对位姿关系以指导机器人执行高阶任务。其解决方案的关键在于提出一种双臂扩展现实(XR)交互方法,通过双手各抓取一个物体形成约束组(constraint group),并以可视化包围盒(hull)呈现;这些约束组可嵌套构成层次结构,每组可定义为相对(由机器人自主选择6自由度(6DoF)位姿)或绝对(由作者指定固定6DoF位姿)于父级的关系,从而实现用户对子装配体位置的灵活控制,同时允许机器人基于效率优化自主决策。
链接: https://arxiv.org/abs/2603.01495
作者: Benjamin Yang,Xichen He,Charlie Zou,Jen-Shuo Liu,Barbara Tversky,Steven Feiner
机构: 未知
类目: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注:
Abstract:We present a bimanual XR interaction approach for specifying remote assembly tasks as hierarchies of relative and absolute object constraints that specify high-level teleoperation goals for robots. Grabbing one object in each hand creates a constraint group (visualized as a hull) and groups can be nested into hierarchies. Each group can be relative (with a robot-specifiable 6DoF pose) or absolute (with an author-specified fixed 6DoF pose) in relation to its parent. A relative group specifies a subassembly that can be constructed at a location chosen by the robot software for efficiency rather than mandated by the user.
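The constraint hierarchy described above can be sketched as a small tree data structure. The class and field names are illustrative assumptions, not the authors' implementation: an absolute group carries a fixed 6DoF pose relative to its parent, while a relative group leaves placement to the robot planner.

```python
# Illustrative sketch of nested relative/absolute constraint groups.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ConstraintGroup:
    name: str
    mode: str                              # "relative" or "absolute"
    pose: Optional[Tuple] = None           # fixed 6DoF pose if absolute
    children: List["ConstraintGroup"] = field(default_factory=list)

def robot_choosable(group):
    """Collect subassemblies whose placement the robot may optimize."""
    found = [group.name] if group.mode == "relative" else []
    for child in group.children:
        found.extend(robot_choosable(child))
    return found

sub = ConstraintGroup("bracket+bolt", "relative")
root = ConstraintGroup("frame", "absolute", pose=(0, 0, 0, 0, 0, 0),
                       children=[sub])
print(robot_choosable(root))  # ['bracket+bolt']
```

Traversing the tree separates author-mandated poses from robot-optimizable ones, mirroring the division of authority the abstract describes.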
[HC-15] Power Echoes: Investigating Moderation Biases in Online Power-Asymmetric Conflicts
【速读】:该论文旨在解决权力不对称冲突中人类审核员(human moderators)可能存在的偏见问题,以及人工智能(AI)建议对这些偏见的影响。其核心问题是:在消费者与商家之间的权力不对称冲突场景下,人类审核员是否表现出支持强势方的倾向(RQ1),以及AI辅助是否能缓解或加剧此类偏见(RQ2)。解决方案的关键在于设计并实施一项混合实验研究,基于真实冲突案例招募50名参与者进行对比分析,结果表明人类审核存在偏向强势方的多种偏见,而AI辅助虽能缓解多数偏见,但也可能放大少数偏见,从而为未来人机协同审核系统的设计提供了实证依据和改进方向。
链接: https://arxiv.org/abs/2603.01457
作者: Yaqiong Li,Peng Zhang,Peixu Hou,Kainan Tu,Guangping Zhang,Shan Qu,Wenshi Chen,Yan Chen,Ning Gu,Tun Lu
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注: Accepted at the ACM CHI conference on Human Factors in Computing Systems (ACM CHI 2026)
Abstract:Online power-asymmetric conflicts are prevalent, and most platforms rely on human moderators to conduct moderation currently. Previous studies have been continuously focusing on investigating human moderation biases in different scenarios, while moderation biases under power-asymmetric conflicts remain unexplored. Therefore, we aim to investigate the types of power-related biases human moderators exhibit in power-asymmetric conflict moderation (RQ1) and further explore the influence of AI’s suggestions on these biases (RQ2). For this goal, we conducted a mixed design experiment with 50 participants by leveraging the real conflicts between consumers and merchants as a scenario. Results suggest several biases towards supporting the powerful party within these two moderation modes. AI assistance alleviates most biases of human moderation, but also amplifies a few. Based on these results, we propose several insights into future research on human moderation and human-AI collaborative moderation systems for power-asymmetric conflicts.
[HC-16] When Humans Don't Feel Like an Option: Contextual Factors That Shape When Older Adults Turn to Conversational AI for Emotional Support
【速读】:该论文旨在解决老年人在日常生活中选择使用对话式人工智能(Conversational AI)而非亲密人际关系获取情感支持的具体情境与动因问题。现有研究多聚焦于对AI陪伴的普遍态度,却缺乏对个体决策时刻的细致分析。论文通过访谈18位老年人,识别出三个关键情境因素:人类联系的时间不可及性、围绕负担感与评价担忧的关系考量,以及与尊严和面子维护相关的自我呈现顾虑。其解决方案的核心在于揭示年龄相关需求(如独立性、尊严与自我形象价值)如何塑造这些即时决策,并强调从宏观使用模式转向对具体情境动态的把握,为负责任的情感支持型AI设计提供了实证基础。
链接: https://arxiv.org/abs/2603.01413
作者: Mengqi Shi,Tianqi Song,Zicheng Zhu,Yi-Chieh Lee
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Older adults are increasingly turning to conversational AI for emotional expression. While prior research has examined general attitudes toward AI companionship, little is known about the specific moments when and why older adults choose AI over close others for emotional support. This study addresses this gap by examining the moment-level conditions that shape these decisions in everyday life. Drawing on interviews with 18 older adults, we identify three contextual factors: temporal unavailability of human contacts, relational considerations around burden and evaluation, and self-presentation concerns tied to dignity and face-saving. Our findings reveal how age-related needs for independence, dignity, and valued self-presentation shape these everyday decisions. This work shifts attention from general patterns of AI use to the moment-level circumstances in which emotionally supportive engagement emerges. By foregrounding these situated dynamics, we provide an empirical foundation for context-sensitive responsible AI design and future research on emotional support-seeking in later life.
[HC-17] From Sustainable Materials to User-Centered Sustainability: Material Experience in Art Healing
【速读】:该论文旨在解决可持续材料设计中如何实现用户中心的可持续性,特别是通过材料体验促进艺术疗愈(Art Healing)的问题。其解决方案的关键在于构建一个整合多模态感知(视觉、触觉与嗅觉)的材料体验框架,明确“审美属性”(Aesthetic)对艺术疗愈影响最大,其次为“内在属性”(Intrinsic),而“物理属性”(Physical)作用相对有限,从而指导设计师在材料开发中系统性地考虑用户的心理感知与情感需求,推动从功能性可持续向体验性可持续的转化。
链接: https://arxiv.org/abs/2603.01377
作者: Yuxin Zhang,Fan Zhang,Zihao Song,Chao Zhao
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computational Engineering, Finance, and Science (cs.CE)
备注:
Abstract:This study develops sustainable materials using hydrogel as the matrix and explores the transition from sustainable materials to user-centered sustainability, with a particular focus on achieving art healing through material experience. The findings reveal that “Aesthetic” property exert the greatest influence on art healing in the context of multimodal material experiences involving visual, tactile, and smell, followed by “Intrinsic” property, whereas “Physical” property have a comparatively limited effect. Furthermore, the study proposes a material experience framework that enables designers to systematically and holistically understanding material characteristics. It highlights the importance of considering users’ psychological perceptions and emotional needs in the material design process.
[HC-18] Caught in a Mafia Romance: How Users Explore Intimate Roleplay and Narrative Exploration with Chatbots
【速读】:该论文试图解决的问题是:在快速演进的AI聊天机器人(AI chatbots)环境中,人们希望与基于角色的聊天机器人(character-based bots)进行何种类型的互动尚不明确,尤其在情感和亲密关系层面。解决方案的关键在于通过实证研究分析一个流行的聊天机器人平台cAI(character-based AI)的用户行为,并结合对Reddit社区中cAI用户帖子的内容分析,揭示出用户更倾向于与呈现年轻成年男性特质、设定权力不对等情境的角色进行亲密角色扮演(intimate role-play),并沉浸于无边界的幻想场景中;同时发现用户对性化内容的“过度”与“不足”均存在担忧,这提示需要设计新的数字安全机制以应对相关风险。
链接: https://arxiv.org/abs/2603.01319
作者: Julia Kieserman,Cat Mai,Sara Lignell,Lucy Qin,Athanasios Andreou,Damon McCoy,Rosanna Bellini
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:AI chatbots, built using large language models, are increasingly integrated into society and mimic the patterns of human text exchanges. While previous research has raised concerns that humans may form romantic attachment to chatbots, the range of AI-mediated interactions that people wish to create for themselves or others with chatbots remains poorly understood, particularly given the fast evolving landscape of chatbots. We provide an empirical study of this http URL (cAI), a popular chatbot platform that enables users to design and share character-based bots, and synthesize this with an analysis of Reddit posts from cAI users. Contrary to popular narratives, we identify that users want to: (1) engage in intimate role-play with young adult, masculine-presenting characters that place users in a position of inferior power in well-defined scenarios and (2) immerse themselves in boundless, fantasy settings. We further find that users problematize both the excessive and insufficient sexualized content in such interactions which warrants novel digital-safety features.
[HC-19] Actor's Note: Examining the Role of AI-Generated Questions in Character Journaling for Actor Training
【速读】:该论文试图解决演员在角色日记(character journaling)实践中面临的三大挑战:认知负担过重、面对空白页面的创作障碍以及缺乏短期反馈激励,导致难以持续进行反思性写作。解决方案的关键在于将大型语言模型(Large Language Models, LLMs)重新定义为“助产式伙伴”(maieutic partners),即通过情境感知的提问引导演员自我反思,而非代笔生成内容;其核心设计是Actor’s Note工具,该工具根据剧本、角色和排练阶段动态定制问题,在保持演员自主性的前提下降低入门门槛、支持持续反思并深化角色探索,实证研究表明该方法能有效提升反思实践的可持续性和艺术沉浸感。
链接: https://arxiv.org/abs/2603.01314
作者: Sora Kang,Jaemin Zoh,Hyoju Kim,Hyeonseo Park,Hajin Lim,Joonhwan Lee
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26), April 13-17, 2026, Barcelona, Spain
Abstract:Character journaling is a well-established exercise in actor training, but many actors struggle to sustain it due to cognitive burden, the blank page problem, and unclear short-term rewards. We reframe large language models not as co-authors but as maieutic partners-tools that guide reflection through context-aware questioning rather than producing text on behalf of the user. Based on this perspective, we designed Actor’s Note, a journaling tool that tailors questions to the script, role, and rehearsal phase while preserving actor agency. We evaluated the system in a 14-day crossover study with 29 actors using surveys, logs, and interviews. Results indicate that the tool reduced entry barriers, supported sustained reflection, and enriched character exploration, with participants describing different benefits when AI was introduced at earlier versus later rehearsal stages. This work contributes empirical insights and design principles for creativity-support tools that sustain reflective practices while preserving artistic immersion in performance training.
[HC-20] Proscenium: Exploring Design Spaces of Layered Information Experience on a Large Dual-Layer Transparent Display
【速读】:该论文旨在解决如何设计能够充分利用分层信息空间(layered information space)独特交互优势的直观且引人入胜的交互体验问题。当前虽已有双层显示技术通过有限的深度感知实现创新交互,但尚缺乏系统性的设计框架来指导此类体验的构建。其解决方案的关键在于提出Proscenium——一种具有可调节层间距的双层大尺寸透明显示屏工作台设置,并围绕信息在不同显示层之间的过渡与关联机制展开初步设计空间探索,从而为多层交互体验提供结构化的设计思路与原型验证。
链接: https://arxiv.org/abs/2603.01238
作者: Chen Chen,Michel Pahud,David Brown,Chuck Needham,Balasaravanan T. Kumaravel,Andrew D. Wilson,Ken Hinckley,Nicolai Marquardt
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 9 pages, 8 figures, Proceedings of the Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems
Abstract:Layering information spaces is a promising strategy to design intuitive and engaging interactive experiences. Although multi-layer displays enable promising interaction techniques through limited depth perception - achieved via slight separation between layers - it remains unclear how to fully design experiences that leverage the unique affordances of layered information. To address this, we introduce Proscenium, a dual-layer, large transparent display workspace setup with an adjustable separation between the layers. We demonstrate our preliminary design space focusing on how rendered information can be transitioned and linked across displays, and showcase 14 speculative experience prototypes across six categories.
[HC-21] Monocular 3D Object Position Estimation with VLMs for Human-Robot Interaction ICIP
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLM)在三维坐标检测任务中能力不足的问题,特别是如何利用单目RGB图像、自然语言输入和机器人状态信息,实现对3D物体位置的准确预测,以增强人机交互的直观性。解决方案的关键在于构建一个包含超过10万张图像的异构数据集,并采用QLoRA(Quantized Low-Rank Adaptation)方法对预训练VLM进行微调,同时引入定制化的回归头(regression head)和条件路由(conditional routing)机制,使模型在保持通用视觉理解能力的同时,具备专门的3D位置估计能力。实验表明,该方法在测试集上达到中位数13 mm的平均绝对误差(MAE),相比未微调的基线提升5倍,且约25%的预测结果达到机器人操作所需的精度范围。
链接: https://arxiv.org/abs/2603.01224
作者: Ari Wahl,Dorian Gawlinski,David Przewozny,Paul Chojecki,Felix Bießmann,Sebastian Bosse
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Accepted at Workshop on Integrating Image Processing with Large-Scale Language/Vision Models for Advanced Visual Understanding (LVLM) at IEEE International Conference on Image Processing (ICIP) 2025
Abstract:Pre-trained general-purpose Vision-Language Models (VLM) hold the potential to enhance intuitive human-machine interactions due to their rich world knowledge and 2D object detection capabilities. However, VLMs for 3D coordinates detection tasks are rare. In this work, we investigate interactive abilities of VLMs by returning 3D object positions given a monocular RGB image from a wrist-mounted camera, natural language input, and robot states. We collected and curated a heterogeneous dataset of more than 100,000 images and finetuned a VLM using QLoRA with a custom regression head. By implementing conditional routing, our model maintains its ability to process general visual queries while adding specialized 3D position estimation capabilities. Our results demonstrate robust predictive performance with a median MAE of 13 mm on the test set and a five-fold improvement over a simpler baseline without finetuning. In about 25% of the cases, predictions are within a range considered acceptable for the robot to interact with objects.
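The conditional-routing idea can be illustrated with a minimal dispatcher. The keyword heuristic and both head functions below are placeholders invented for illustration, not the paper's fine-tuned components: a router inspects the query and sends it either to the added 3D-regression head or to the model's unchanged general-purpose text head.

```python
# Hedged sketch of conditional routing between a general text head and a
# specialized 3D-regression head (all names and logic are assumptions).

def regression_head(features):
    """Stand-in for the fine-tuned head: features -> (x, y, z) in mm."""
    return tuple(round(f * 100.0, 1) for f in features[:3])

def text_head(features, query):
    """Stand-in for the VLM's general-purpose answer path."""
    return f"general answer to: {query}"

def route(query, features):
    # Toy heuristic: positional queries go to the regression head.
    wants_position = any(k in query.lower()
                         for k in ("where", "position", "coordinates"))
    return regression_head(features) if wants_position else text_head(features, query)

print(route("Where is the red cup?", [0.12, 0.05, 0.33]))   # (12.0, 5.0, 33.0)
print(route("What color is the cup?", [0.12, 0.05, 0.33]))
```

The point of this pattern, as in the paper, is that adding the specialized head does not remove the model's ability to answer general visual queries.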
[HC-22] Opportunities and Challenges of Operating Semi-Autonomous Vehicles: A Layered Vulnerability Perspective
【速读】:该论文试图解决的问题是:在特斯拉全自动驾驶(Full Self-Driving, FSD)等Level 2级半自动驾驶车辆(semi-autonomous vehicle, SAV)系统中,人类操作员如何因监督自动化系统而产生动态且情境化的脆弱性(vulnerability),这与传统将脆弱性视为外部道路使用者固有属性的观点形成对比。解决方案的关键在于应用Florencia Luna的分层脆弱性框架,通过半结构化访谈和混合编码分析方法,识别出心理、操作和社会三个相互作用的脆弱性层面,并揭示这些层面在特定情境下协同作用导致监督需求波动和风险识别能力不均。研究进一步指出,诸如信任或非正式学习等常被视为负担的因素,在不同情境下可能加剧或缓解脆弱性,从而强调设计与监管干预必须综合考虑心理、操作与社会条件,而非孤立处理,以避免责任被不合理地转嫁给个体操作员。
链接: https://arxiv.org/abs/2603.01202
作者: Soumita Mukherjee,Priya Kumar,Laura Cabrera
机构: 未知
类目: Human-Computer Interaction (cs.HC); Emerging Technologies (cs.ET)
备注:
Abstract:This study examines how vulnerability is produced for human operators of Tesla’s Full Self-Driving (FSD), a Level 2 semi-autonomous vehicle (SAV) system, by applying Florencia Luna’s layered vulnerability framework. While existing road safety models conceptualize vulnerability as a fixed attribute of external road users, emerging evidence suggests that semi-autonomous vehicle operators themselves experience dynamic and situational vulnerability as they supervise automated systems that they do not fully control. To investigate this phenomenon, we conducted semi-structured interviews with 17 active FSD users, analyzing their accounts through a combined deductive-inductive coding process aligned with Luna’s framework. Findings reveal three interacting layers of operator vulnerability, namely psychological, operational, and social. Vulnerability emerged not from any single layer but from how these layers converged in specific situations, creating fluctuating supervisory demands and uneven capacity to recognize and manage risk. The findings extend debates on contextual trust calibration, automation complacency, and meaningful human control by demonstrating how factors commonly treated as liabilities such as trust or informal learning, can both increase and mitigate vulnerability depending on context. This analysis determines the need for design and regulatory interventions that address psychological, operational, and social conditions together rather than in isolation, and highlights how responsibility is implicitly shifted onto individual operators within inadequately supported supervisory regimes.
[HC-23] Agent-Based Simulation of Trust Development in Human-Robot Teams: An Empirically-Validated Framework
【速读】:该论文旨在解决人-机器人团队中信任动态、工作量分配与协作绩效之间的复杂交互机制问题,尤其关注如何量化和预测信任对任务表现的影响。其解决方案的关键在于构建了一个基于代理的模型(Agent-Based Model, ABM),通过NetLogo平台模拟2–10名代理在不同任务复杂度下的行为,并结合敏感性分析(OFAT与全因子设计)和因子方差分析(Factorial ANOVA),识别出机器人可靠性(reliability)是影响信任(trust)和任务成功率(task success)的主导因素(η²=0.93),同时揭示了信任与绩效之间存在解耦现象(trust-performance decoupling),即高生产力未必伴随高信任,从而提出“校准误差”(calibration error)作为独立于信任强度的诊断指标,为部署前识别过信(overtrust)与欠信(undertrust)提供可操作的实证工具。
链接: https://arxiv.org/abs/2603.01189
作者: Ravi Kalluri
机构: 未知
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注:
Abstract:This paper presents an empirically grounded agent-based model capturing trust dynamics, workload distribution, and collaborative performance in human-robot teams. The model, implemented in NetLogo 6.4.0, simulates teams of 2–10 agents performing tasks of varying complexity. We validate against Hancock et al.'s (2021) meta-analysis, achieving interval validity for 4 of 8 trust antecedent categories and strong ordinal validity (Spearman ρ = 0.833). Sensitivity analysis using OFAT and full factorial designs (n = 50 replications per condition) reveals robot reliability exhibits the strongest effect on trust (η² = 0.35) and dominates task success (η² = 0.93) and productivity (η² = 0.89), consistent with meta-analytic findings. Trust asymmetry ratios ranged from 0.07 to 0.55 – below the meta-analytic benchmark of 1.50 – revealing that per-event asymmetry does not guarantee cumulative asymmetry when trust repair mechanisms remain active. Scenario analysis uncovered trust-performance decoupling: the Trust Recovery scenario achieved the highest productivity (4.29) despite the lowest trust (38.2), while the Unreliable Robot scenario produced the highest trust (73.2) despite the lowest task success (33.4%), establishing calibration error as a critical diagnostic distinct from trust magnitude. Factorial ANOVA confirmed significant main effects for reliability, transparency, communication, and collaboration (p < .001), explaining 45.4% of trust variance. The open-source implementation provides an evidence-based tool for identifying overtrust and undertrust conditions prior to deployment.
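The interplay between per-event asymmetry and trust repair reported in the abstract can be sketched with a toy update rule. The gain, loss, and repair constants below are illustrative assumptions, not the model's calibrated parameters: trust falls more per failure than it rises per success, while an active repair term pulls trust back toward a baseline, which can keep cumulative asymmetry low.

```python
# Toy trust-update rule (assumed form, not the paper's NetLogo code).

def update_trust(trust, success, gain=2.0, loss=3.0, repair=0.05, baseline=50.0):
    """One interaction step on a 0-100 trust scale."""
    trust += gain if success else -loss        # per-event asymmetry
    trust += repair * (baseline - trust)       # trust-repair mechanism
    return max(0.0, min(100.0, trust))

trust = 50.0
for outcome in [True, False, True, True, False]:
    trust = update_trust(trust, outcome)
print(round(trust, 2))
```

With repair active, trust after a mixed success/failure run lands close to the baseline despite the asymmetric per-event steps, echoing the paper's finding that per-event asymmetry need not accumulate.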
[HC-24] Take the Power Back: Screen-Based Personal Moderation Against Hate Speech on Instagram
【速读】:该论文旨在解决社交媒体平台上仇恨言论(hate speech)治理中平台审核机制失效、无法有效保护目标用户的问题,尤其是探索用户对个性化内容过滤工具(personal moderation tools)在不同界面(如评论区、Reels标签页或主页推荐流)的使用意愿及功能偏好。其解决方案的关键在于通过三轮德尔菲研究(Delphi study),基于40名曾遭受仇恨言论的活动人士的定量评分、排序与开放式反馈,识别出用户最希望部署个性化过滤功能的场景(即对话类和算法推荐类界面),并提炼出核心设计原则:支持跨界面可逆性(reversibility)和监督控制(oversight),同时强调输入方式、内容类型特异性及自动化程度需根据具体界面定制化设计。
链接: https://arxiv.org/abs/2603.01187
作者: Anna Ricarda Luther,Hendrik Heuer,Stephanie Geise,Sebastian Haunss,Andreas Breiter
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Conditionally Accepted CHI’26 (CHI26) SIGCHI ACM
Abstract:Hate speech remains a pressing challenge on social media, where platform moderation often fails to protect targeted users. Personal moderation tools that let users decide how content is filtered can address some of these shortcomings. However, it remains an open question on which screens (e.g., the comments, the reels tab, or the home feed) users want personal moderation and which features they value most. To address these gaps, we conducted a three-wave Delphi study with 40 activists who experienced hate speech. We combined quantitative ratings and rankings with open questions about required features. Participants prioritized personal moderation for conversational and algorithmically curated screens. They valued features allowing for reversibility and oversight across screens, while input-based, content-type specific, and highly automated features are more screen specific. We discuss the importance of personal moderation and offer user-centered design recommendations for personal moderation on Instagram.
[HC-25] Egocentric Co-Pilot: Web-Native Smart-Glasses Agents for Assistive Egocentric AI WWW2026
【速读】:该论文旨在解决如何为行动不便、视力受限或认知负荷较高的人群提供更自然、持续且上下文感知的辅助服务问题,尤其是在无屏幕、无稳定桌面甚至双手受限的场景下访问互联网。其核心解决方案是提出Egocentric Co-Pilot——一个运行在智能眼镜上的基于神经符号(neuro-symbolic)架构的Web原生AI代理系统,关键在于:1)引入自我中心推理核心(egocentric reasoning core),结合时序思维链(Temporal Chain-of-Thought)与分层上下文压缩(Hierarchical Context Compression),实现对连续第一人称视频流中长程问题回答与决策支持;2)设计轻量级多模态意图层,将噪声语音和注视信息映射为结构化命令;3)构建云原生WebRTC管道与本地WebSocket基线对比,验证了在延迟、移动性和资源消耗之间的权衡,从而为可部署、可审计、以Web通信原语为基础的始终在线辅助型AI代理提供了实用路径。
链接: https://arxiv.org/abs/2603.01104
作者: Sicheng Yang,Yukai Huang,Weitong Cai,Shitong Sun,Fengyi Fang,You He,Yiqiao Xie,Jiankang Deng,Hang Zhang,Jifei Song,Zhensong Zhang
机构: Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Queen Mary University of London (伦敦玛丽女王大学); Imperial College London (帝国理工学院); University of Surrey (萨里大学); Independent Researcher (独立研究员)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注: 14 pages, 6 figures, WWW 2026
Abstract:What if accessing the web did not require a screen, a stable desk, or even free hands? For people navigating crowded cities, living with low vision, or experiencing cognitive overload, smart glasses coupled with AI agents could turn the web into an always-on assistive layer over daily life. We present Egocentric Co-Pilot, a web-native neuro-symbolic framework that runs on smart glasses and uses a Large Language Model (LLM) to orchestrate a toolbox of perception, reasoning, and web tools. An egocentric reasoning core combines Temporal Chain-of-Thought with Hierarchical Context Compression to support long-horizon question answering and decision support over continuous first-person video, far beyond a single model’s context window. Additionally, a lightweight multimodal intent layer maps noisy speech and gaze into structured commands. We further implement and evaluate a cloud-native WebRTC pipeline integrating streaming speech, video, and control messages into a unified channel for smart glasses and browsers. In parallel, we deploy an on-premise WebSocket baseline, exposing concrete trade-offs between local inference and cloud offloading in terms of latency, mobility, and resource use. Experiments on Egolife and HD-EPIC demonstrate competitive or state-of-the-art egocentric QA performance, and a human-in-the-loop study on smart glasses shows higher task completion and user satisfaction than leading commercial baselines. Taken together, these results indicate that web-connected egocentric co-pilots can be a practical path toward more accessible, context-aware assistance in everyday life. By grounding operation in web-native communication primitives and modular, auditable tool use, Egocentric Co-Pilot offers a concrete blueprint for assistive, always-on web agents that support education, accessibility, and social inclusion for people who may benefit most from contextual, egocentric AI.
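The Hierarchical Context Compression idea can be sketched as repeated chunk-and-summarize passes. The mechanics below are an assumed simplification, with a string placeholder standing in for an LLM summarization call: per-event captions are grouped into chunks, each chunk is replaced by one summary, and the process repeats until the stream fits a fixed context budget.

```python
# Illustrative sketch of hierarchical compression of a long egocentric
# event stream into a bounded context (placeholder summarizer).

def summarize(items):
    """Placeholder for an LLM summarization call over one chunk."""
    return f"[{len(items)} events: {items[0]} ... {items[-1]}]"

def compress(events, chunk=4, budget=3):
    """Repeatedly replace chunks with summaries until within budget."""
    level = list(events)
    while len(level) > budget:
        level = [summarize(level[i:i + chunk])
                 for i in range(0, len(level), chunk)]
    return level

stream = [f"e{i}" for i in range(20)]   # 20 per-event captions
compressed = compress(stream)
print(len(compressed))  # 20 -> 5 summaries -> 2 summaries, within budget
```

Because each pass can summarize summaries, the scheme handles streams far longer than any single model context window, which is the property the abstract claims for long-horizon egocentric QA.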
[HC-26] AEDHunter: Investigating AED Retrieval in the Real World via Gamified Mobile Interaction and Sensing
【Quick Read】: This paper addresses the problem that low public awareness of Automated External Defibrillator (AED) locations during out-of-hospital cardiac arrest (OHCA) limits their effective use. The key to the solution is AEDHunter, a gamified, location-based mobile application that analyzes users' movement and learning patterns via smartphone sensors, verifies arrival at AED locations with low-cost Bluetooth tags, and guides users through repeated AED-finding practice in real environments. A two-state activity detector identifies "exploratory pauses" as a behavioral learning signal that quantifies hesitation and its gradual reduction through training, enabling immersive, in-situ repeated practice that significantly improves users' AED retrieval speed and confidence.
Link: https://arxiv.org/abs/2603.01075
Authors: Helinyi Peng, Akihito Taya, Yuuki Nishiyama, Kaoru Sezaki
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments: IMWUT 2026
Abstract:Early defibrillation significantly improves survival rates in cases of out-of-hospital cardiac arrest. However, limited public awareness of Automated External Defibrillator (AED) locations constrains their effective use. Existing solutions, such as static 2D maps, often fall short in urgent or complex real-world scenarios. To address this challenge, we developed AEDHunter, a gamified, location-based mobile application designed to transform AED retrieval into an engaging and repeatable practice experience. Leveraging smartphone sensors to analyze participants' movement and learning patterns, and using low-cost Bluetooth tags to verify arrivals at AED locations, AEDHunter guides users through multiple sessions of AED discovery. In a real-world evaluation study, participants significantly reduced their AED retrieval times after repeated practice sessions and reported increased confidence in locating AEDs. Additionally, we employ a two-state activity detector to identify "exploratory pauses", which are then used as a behavioral learning signal to quantify hesitation and its progressive reduction through practice. Our findings suggest that gamified applications like AEDHunter can improve AED retrieval performance through repeated, in-situ training and enhance self-reported preparedness, offering design insights for technology-supported learning and public safety applications.
[HC-27] From Human Negotiation to Agent Negotiation: Personal Mobility Agents in Automated Traffic
【Quick Read】: This paper addresses conflicts between user preferences and automated system behavior in automated traffic, for example when a passenger prefers an assertive driving style but the vehicle slows early or yields under a conservative policy, degrading the user experience. Such conflicts are most pronounced at merges, crossings, and right-of-way situations, and existing interfaces cannot support continuous, multi-actor negotiation. The key idea is to introduce personal mobility agents that act as proxies for users, encode preferences such as comfort and safety margins, and negotiate real-time traffic behavior with other agents under shared safety rules; the core shift is from moment-to-moment user interaction interfaces to delegation and oversight interfaces, where proxy agents handle real-time conflicts while users focus on shaping high-level policies and preferences.
Link: https://arxiv.org/abs/2603.01035
Authors: Pascal Jansen
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC)
Comments: Position Paper at the CHI 2026 Workshop AutomationXP26: Agentic Automation Experiences - Rethinking the Interaction of Humans and AI Agents. April 14, 2026. Barcelona, Spain
Abstract:Conflicts between user preferences and automated system behavior already shape the experience of automated mobility. For example, a passenger may prefer assertive driving, yet the vehicle slows down early to follow a conservative policy or yield to other actors. Similar conflicts arise at merges, crossings, or right-of-way situations, where users must accept opaque decisions or attempt to negotiate through interfaces not designed for continuous, multi-actor relationships. This position paper argues that such approaches do not scale as mobility becomes more heterogeneous and automated. Instead, it proposes personal mobility agents that act as proxies for users, encode preferences such as comfort and safety margins, and negotiate traffic behavior with other agents under shared safety rules. The central idea is a shift from moment-to-moment user negotiation interfaces to delegation and oversight interfaces, in which proxy agents manage real-time conflicts while users can shape high-level policies and preferences.
[HC-28] Remember You: Understanding How Users Use Deadbots to Reconstruct Memories of the Deceased
【Quick Read】: This paper addresses the limited attention in existing research to how bereaved individuals construct and reshape memories through generative AI. The key finding is that users are not passive recipients of digital likenesses but active constructors of the "deadbot": through selective input, ongoing interactive adjustment, and imaginative cognitive supplementation, they blend authentic memories with personal expectations into an idealized digital persona. Through these interactions, users' memories of the deceased evolve dynamically, from initial reinforcement and idealization to a later stage in which AI-generated new memories blur with authentic recollections, reflecting a complex desire for connection through an artificial medium. The paper therefore warns of risks of memory distortion and dependency and calls for long-term clinical research on the impact of AI-mediated grieving.
Link: https://arxiv.org/abs/2603.01017
Authors: Yifan Li, Xingyu Lan
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Generative AI has enabled "Deadbots", offering mourners an interactive way to engage with simulations of the deceased. While existing research often emphasizes ethics, less is known about how bereaved individuals construct and reshape memory through such interactions. To address this gap, this study draws on in-depth interviews with 26 users. Findings reveal that users are not passive recipients but active constructors of the deceased's digital representation. Through selective input, ongoing interactive adjustments and imaginative cognitive supplementation, they build an idealized digital figure blending authentic memories with personal expectations. Deadbots provide a private space to grieve without social pressure and a channel to address unresolved emotions. In this process, users' memory of the deceased evolves dynamically: from initial reinforcement and idealization to a later stage where AI-generated new memories blur with authentic recollections, reflecting a complex desire for connection through an artificial medium. This blurring raises ethical concerns regarding memory distortion and dependency, underscoring the need for future clinical research on the long-term impact of AI-mediated grieving.
[HC-29] Sustainable Care: Designing Technologies That Support Children's Long-Term Engagement with Social Issues
【Quick Read】: This paper addresses the problem that, when children encounter social issues such as climate change, conflict, and inequality through digital technologies, content designed around fear and urgency can provoke anxiety, powerlessness, and reduced long-term civic engagement. The key to the solution is introducing "sustainable care" as a design lens, emphasizing that technology should support children's sustained, meaningful engagement with social causes without causing empathic distress or burnout, fostering positive social action rather than withdrawal.
Link: https://arxiv.org/abs/2603.00996
Authors: JaeWon Kim, Aayushi Dangol, Rotem Landesman, Alexis Hiniker, McKenna F. Parnes
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Children today encounter social issues – climate change, conflict, inequality – through digital technologies, and the design of that encounter shapes whether young people move toward lasting civic engagement or toward anxiety and withdrawal. Much of the content children see is optimized for attention through fear and urgency, with few pathways toward meaningful action – contributing to rising distress and disengagement among young people who care deeply but feel powerless to act. This full-day workshop introduces "sustainable care" as a design lens, asking how technology might support children's sustained engagement with social causes without contributing to empathic distress or burnout. We invite researchers and practitioners across child-computer interaction, games, education, and youth mental health to map this landscape together and develop a research agenda for the CCI community.
[HC-30] VizQStudio: Iterative Visualization Literacy MCQs Design with Simulated Students FAST
【Quick Read】: This paper addresses the challenges of designing multiple-choice questions (MCQs) for visualization literacy: coordinating multimodal elements such as charts, question stems, and distractors while adapting to learners' diverse backgrounds and reasoning strategies, and precisely calibrating question difficulty and potential misconceptions. Existing assessments rely on static item banks and offer little support for iterative refinement tailored to learners' differing abilities. The key to the solution is VizQStudio, a system powered by MLLM-based simulated students: instructors specify diverse student profiles (spanning demographics, knowledge levels, and learning traits) and visually inspect how simulated students reason about and respond to each question component, allowing them to explore potential misconceptions, calibrate difficulty, and weigh design trade-offs before classroom deployment. This provides an instructor-centered, iterative, and responsible AI-assisted design framework for visualization literacy and other multimodal assessment tasks.
Link: https://arxiv.org/abs/2603.00994
Authors: Zixin Chen, Yuhang Zeng, Sicheng Song, Yanna Lin, Xian Xu, Huamin Qu, Meng Xia
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC)
Comments: TVCG fast track (PVIS 2026 TVCG Track) under review
Abstract:Multiple-choice questions (MCQs) are a widely used educational tool, particularly in domains such as visualization literacy that require broad conceptual coverage and support diverse real-world applications. However, designing high-quality visualization literacy MCQs remains challenging, as instructors must coordinate multimodal elements (e.g., charts, question stems, and distractors), address diverse visualization tasks, and accommodate learners with heterogeneous backgrounds. Existing visualization literacy assessments primarily rely on standardized, fixed item banks, offering limited support for iterative question design that adapts to differences in learners’ abilities, backgrounds, and reasoning strategies. To address these challenges, we present VizQStudio, a visual analytics system that supports instructors in iteratively designing and refining visualization literacy MCQs using MLLM-powered simulated students. Instructors can specify diverse student profiles spanning demographics, knowledge levels, and learning-related traits. The system then visualizes how simulated students reason about and respond to different question components, helping instructors explore potential misconceptions, difficulty calibration, and design trade-offs prior to classroom deployment. We investigate VizQStudio through a mixed-method evaluation, including expert interviews, case studies, a classroom deployment, and a large-scale online study. Overall, this work reframes MLLM-based student simulation in assessment authoring as a design-time, exploratory aid. By examining both its value and limitations in realistic instructional settings, we surface design insights that inform how future systems can support instructor-centered, iterative, and responsible uses of AI for multimodal assessment design in visualization literacy and related domains.
[HC-31] From OCR to Analysis: Tracking Correction Provenance in Digital Humanities Pipelines
【Quick Read】: This paper addresses the lack of traceability in OCR correction workflows for digital humanities: existing pipelines often overwrite intermediate decisions, obscuring how textual transformations affect scholarly interpretation. The key to the solution is a provenance-aware framework for OCR correction that records correction lineage at the span level, including fine-grained information such as edit type, correction source, confidence, and revision status. An empirical comparison of named entity extraction over raw OCR text, fully corrected text, and provenance-filtered corrections shows that provenance signals help identify unstable outputs and prioritize human review, supporting the argument that provenance should be a first-class analytical layer in NLP for digital humanities, enabling reproducibility, source criticism, and uncertainty-aware interpretation.
Link: https://arxiv.org/abs/2603.00884
Authors: Haoze Guo, Ziqi Wei
Affiliations: University of Wisconsin - Madison
Categories: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Optical Character Recognition (OCR) is a critical but error-prone stage in digital humanities text pipelines. While OCR correction improves usability for downstream NLP tasks, common workflows often overwrite intermediate decisions, obscuring how textual transformations affect scholarly interpretation. We present a provenance-aware framework for OCR-corrected humanities corpora that records correction lineage at the span level, including edit type, correction source, confidence, and revision status. Using a pilot corpus of historical texts, we compare downstream named entity extraction across raw OCR, fully corrected text, and provenance-filtered corrections. Our results show that correction pathways can substantially alter extracted entities and document-level interpretations, while provenance signals help identify unstable outputs and prioritize human review. We argue that provenance should be treated as a first-class analytical layer in NLP for digital humanities, supporting reproducibility, source criticism, and uncertainty-aware interpretation.
[HC-32] SIAgent : Spatial Interaction Agent via LLM -powered Eye-Hand Motion Intent Understanding in VR
【Quick Read】: This paper addresses the high learning cost and low error tolerance of the current "Operation-to-Intent" interaction paradigm in virtual reality (VR), which requires users to memorize predefined gestures and their task associations. The key to the solution is a new "Intent-to-Operation" framework, SIAgent, built on two components: an intent recognition module that translates eye-hand motion data into natural language and infers user intent, and an agent-based execution mechanism that generates an agent to carry out the corresponding task. The design removes the need to memorize specific gestures, accommodates individual motion preferences, and significantly improves interaction accuracy (97.2% vs 93.1%) and user experience while reducing arm fatigue.
Link: https://arxiv.org/abs/2603.00522
Authors: Zhimin Wang, Chenyu Gu, Feng Lu
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC)
Comments: Virtual reality, spatial interaction, intent recognition, agent-based execution, large language models
Abstract:Eye-hand coordinated interaction is becoming a mainstream interaction modality in Virtual Reality (VR) user interfaces. Current paradigms for this multimodal interaction require users to learn predefined gestures and memorize multiple gesture-task associations, which can be summarized as an "Operation-to-Intent" paradigm. This paradigm increases users' learning costs and has low interaction error tolerance. In this paper, we propose SIAgent, a novel "Intent-to-Operation" framework allowing users to express interaction intents through natural eye-hand motions based on common sense and habits. Our system features two main components: (1) intent recognition that translates spatial interaction data into natural language and infers user intent, and (2) agent-based execution that generates an agent to execute corresponding tasks. This eliminates the need for gesture memorization and accommodates individual motion preferences with high error tolerance. We conduct two user studies across over 60 interaction tasks, comparing our method with two "Operation-to-Intent" techniques. Results show our method achieves higher intent recognition accuracy than gaze + pinch interaction (97.2% vs 93.1%) while reducing arm fatigue and improving usability and user preference. Another study verifies the function of eye gaze and hand motion channels in intent recognition. Our work offers valuable insights into enhancing VR interaction intelligence through intent-driven design. Our source code and LLM prompts will be made available upon publication.
[HC-33] Empirical Study of Gaze Behavior in Children and Young Adults Using Deep Neural Networks and Robot Implementation: A Comparative Analysis of Social Situations
【Quick Read】: This paper investigates how deep neural network models can mimic differences in gaze behavior between children and adults, and how socially acceptable such models are when deployed on a Nao robot in real-world settings. The key to the solution: eye-tracking data were collected from 24 participants (12 children and 12 adults), and LSTM and Transformer networks were used to model the gaze patterns of each age group; on video clips (one animated, one live-action), the models predicted target locations with 62%-70% single-attempt accuracy, rising to roughly 80% when the top two predicted labels were considered; finally, feedback questionnaires from 57 new participants validated the interaction, showing satisfaction with the robot's attention, intelligence, and responsiveness, though it was not perceived as a social companion comparable to a human, revealing both the potential and the limits of socially acceptable robot behavior grounded in human nonverbal cues.
Link: https://arxiv.org/abs/2603.00074
Authors: Ramtin Tabatabaei, Milad Hosseini, Ali Mohajerzarrinkelk, Ali F. Meghdari, Alireza Taheri
Affiliations: unknown
Categories: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments:
Abstract:In a preliminary exploratory study, our goal was to train deep neural network models to mimic children's and/or adults' gaze behavior in certain social situations. Additionally, we aim to identify potential differences in gaze behavior between these two age groups based on our participants' gaze data. Furthermore, we aimed to assess the practical effectiveness of our adult and children models by deploying them on a Nao robot in real-life settings. To achieve this, we first created two video clips, one animation and one live-action, to depict some social situations. Using an eye-tracking device, we collected eye-tracking data from 24 participants, including 12 children and 12 adults. Then, we utilized deep neural networks, specifically LSTM and Transformer Networks, to analyze and model the gaze patterns of each group of participants. Our results indicate that when the models attempted to predict people's locations (in the next frame), they had an accuracy in the range of 62%-70% with one attempt, which increased by ~20% when attempted twice (i.e. the two highest-ranked predicted labels as outputs). As expected, the result underscores that gaze behavior is not a wholly unique phenomenon. We obtained feedback from 57 new participants to evaluate the robot's functionality. These participants were asked to watch two videos of the robot's performance in each mode and then complete a comprehensive questionnaire. The questionnaire results indicate that the participants expressed satisfaction with the robot's interaction, including its attention, intelligence, and responsiveness to human actions. However, they did not perceive the robot as a social companion comparable to a human. This exploratory study explores the potential social acceptance of robots based on human nonverbal behavioral cues, informing future research.
[HC-34] Designing Explainable AI for Healthcare Reviews: Guidance on Adoption and Trust
【Quick Read】: This paper addresses the difficulty patients face in making efficient decisions when choosing healthcare providers amid an overwhelming volume of online reviews. The key to the solution is developing and evaluating an explainable AI system that not only analyzes patient reviews but also provides transparent, understandable explanations to strengthen user trust and willingness to adopt. The study finds broad agreement on usefulness (e.g., saving time and highlighting essentials), strong demand for explainability (e.g., understanding classification rationale and increasing trust), and a preference for combined text-and-visual explanations. Design should therefore focus on accuracy, clarity, responsiveness, and unbiased processing, while accounting for the needs of different audiences, to support effective adoption of explainable AI in healthcare review systems.
Link: https://arxiv.org/abs/2603.00072
Authors: Eman Alamoudi, Ellis Solaiman
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Patients increasingly rely on online reviews when choosing healthcare providers, yet the sheer volume of these reviews can hinder effective decision-making. This paper summarises a mixed-methods study aimed at evaluating a proposed explainable AI system that analyses patient reviews and provides transparent explanations for its outputs. The survey (N=60) indicated broad optimism regarding usefulness (82% agreed it saves time; 78% that it highlights essentials), alongside strong demand for explainability (84% considered it important to understand why a review is classified; 82% said explanations would increase trust). Around 45% preferred combined text-and-visual explanations. Thematic analysis of open-ended survey responses revealed core requirements such as accuracy, clarity and simplicity, responsiveness, data credibility, and unbiased processing. In addition, interviews with AI experts provided deeper qualitative insights, highlighting technical considerations and potential challenges for different explanation methods. Drawing on TAM and trust in automation, the findings suggest that high perceived usefulness and transparent explanations promote adoption, whereas complexity and inaccuracy hinder it. This paper contributes actionable design guidance for layered, audience-aware explanations in healthcare review systems.
[HC-35] Self-Service or Not? How to Guide Practitioners in Classifying AI Systems Under the EU AI Act
【Quick Read】: This paper addresses the practical difficulty of applying the Risk Classification Scheme (RCS) of the EU Artificial Intelligence Act (AIA): industrial practitioners face ambiguous legal definitions and complex cross-domain expertise requirements when classifying AI systems by risk level. The key to the solution is designing and evaluating a self-service, web-based decision-support tool following a Design Science Research (DSR) approach; across two evaluation phases with 78 practitioners from diverse domains, targeted support measures such as clear legal explanations and concrete application examples significantly improve the practicality and accuracy of RCS application, yielding actionable compliance-support strategies for tool designers and policymakers.
Link: https://arxiv.org/abs/2603.00065
Authors: Ronald Schnitzer, Maximilian Hoeving, Sonja Zillner
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:In August 2024, the EU Artificial Intelligence Act (AIA) came into force, marking the world’s first large-scale regulatory framework for AI. Central to the AIA is a risk-based approach, aligning regulatory obligations with the potential harm posed by AI systems. To operationalize this, the AIA defines a Risk Classification Scheme (RCS), categorizing systems into four levels of risk. While this aligns with the theoretical foundations of risk-based regulations, the practical application of the RCS is complex and requires expertise across legal, technical, and domain-specific areas. Despite increasing academic discussion, little empirical research has explored how practitioners apply the RCS in real-world contexts. This study addresses this gap by evaluating how industrial practitioners apply the RCS using a self-service, web-based decision-support tool. Following a Design Science Research (DSR) approach, two evaluation phases involving 78 practitioners across diverse domains were conducted. Our findings highlight critical challenges in interpreting legal definitions and regulatory scope, and show that targeted support, such as clear explanations and practical examples, can significantly enhance the risk classification process. The study provides actionable insights for tool designers and policymakers aiming to support AIA compliance in practice.
[HC-36] “Bespoke Bots”: Diverse Instructor Needs for Customizing Generative AI Classroom Chatbots
【Quick Read】: This paper addresses the lack of adaptability in current classroom AI chatbots, i.e., how such tools can be customized effectively for different teaching contexts (course size, discipline, and teaching style). The study finds that instructors consistently prioritize aligning chatbot behavior with course materials and pedagogical strategies over customizing persona or tone, but the priority of other customization dimensions varies with the specific teaching context, showing that no single design fits all needs. The key takeaway is that modular AI chatbot systems, which let educational developers flexibly combine functional components for different teaching scenarios, offer a promising path toward adaptable, practical classroom deployment.
Link: https://arxiv.org/abs/2603.00057
Authors: Irene Hou (UC San Diego), Zeyu Xiong (ETH Zurich), Philip J. Guo (UC San Diego), April Yi Wang (ETH Zurich)
Affiliations: UC San Diego; ETH Zurich
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 10 pages, 1 figure. Accepted to CHI 2026
Abstract:Instructors are increasingly experimenting with AI chatbots for classroom support. To investigate how instructors adapt chatbots to their own contexts, we first analyzed existing resources that provide prompts for educational purposes. We identified ten common categories of customization, such as persona, guardrails, and personalization. We then conducted interviews with ten university STEM instructors and asked them to card-sort the categories into priorities. We found that instructors consistently prioritized the ability to customize chatbot behavior to align with course materials and pedagogical strategies and de-prioritized customizing persona/tone. However, their prioritization of other categories varied significantly by course size, discipline, and teaching style, even across courses taught by the same individual, highlighting that no single design can meet all contexts. These findings suggest that modular AI chatbots may provide a promising path forward. We offer design implications for educational developers building the next generation of customizable classroom AI systems.
[HC-37] GazeXPErT: An Expert Eye-tracking Dataset for Interpretable and Explainable AI in Oncologic FDG-PET/CT Scans
【Quick Read】: This paper addresses obstacles to clinical translation of AI in oncologic FDG-PET/CT diagnosis, including shortages of expert readers, limited interpretability and reliability of AI models, and difficulty integrating into real workflows. The key to the solution is building and releasing GazeXPErT, a 4D eye-tracking dataset of 346 FDG-PET/CT scans that records the gaze trajectories of trainee physicians and board-certified radiology/nuclear medicine specialists during routine reads, synchronized with image slices and annotated in COCO-style format to support multiple machine learning tasks. Benchmark experiments show that incorporating expert gaze patterns improves 3D nnUNet tumor segmentation (Dice score from 0.6008 to 0.6819), and that vision transformers trained on sequential gaze and PET/CT images improve dynamic lesion localization (74.95% of predicted gaze points closer to tumors) and expert intention prediction (accuracy 67.53%, AUROC 0.747), offering a new paradigm for more interpretable, interactive, and reliable AI-assisted diagnosis.
Link: https://arxiv.org/abs/2603.00162
Authors: Joy T Wu, Daniel Beckmann, Sarah Miller, Alexander Lee, Elizabeth Theng, Stephan Altmayer, Ken Chang, David Kersting, Tomoaki Otani, Brittany Z Dashevsky, Hye Lim Park, Matteo Novello, Kip Guja, Curtis Langlotz, Ismini Lourentzou, Daniel Gruhl, Benjamin Risse, Guido A Davidzon
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:
Abstract:[18F]FDG-PET/CT is a cornerstone imaging modality for tumor staging and treatment response assessment across many cancer types, yet expert reader shortages necessitate more efficient diagnostic aids. While standalone AI models for automatic lesion segmentation exist, clinical translation remains hindered by concerns about interpretability, explainability, reliability, and workflow integration. We present GazeXPErT, a 4D eye-tracking dataset capturing expert search patterns during tumor detection and measurement on 346 FDG-PET/CT scans. Each study was read by a trainee and a board-certified nuclear medicine or radiology specialist using an eye-tracking-enabled annotation platform that simulates routine clinical reads. From 3,948 minutes of raw 60Hz eye-tracking data, 9,030 unique gaze-to-lesion trajectories were extracted, synchronized with PET/CT image slices, and rendered in COCO-style format for multiple machine learning applications. Baseline validation experiments demonstrate that a 3D nnUNet tumor segmentation model achieved superior performance when incorporating expert gaze patterns versus without (DICE score 0.6819 versus 0.6008), and that vision transformers trained on sequential gaze and PET/CT images can improve dynamic lesion localization (74.95% predicted gaze point closer to tumor) and expert intention prediction (Accuracy 67.53% and AUROC 0.747). GazeXPErT is a valuable resource designed to explore multiple machine learning problems beyond these baseline experiments, which include and are not limited to, visual grounding or causal reasoning, clinically explainable feature augmentation, human-computer interaction, human intention prediction or understanding, and expert gaze-rewarded modeling approaches to AI in oncologic FDG-PET/CT imaging.
Computer Vision
[CV-0] HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images CVPR2026
【Quick Read】: This paper addresses three challenges in generating high-fidelity human-product images: the lack of diverse large-scale training data, the difficulty existing models have in focusing on preserving product details, and the inability of coarse supervision to provide precise guidance. The key to the solution is HiFi-Inpaint, a novel reference-guided inpainting framework that introduces Shared Enhancement Attention (SEA) to refine fine-grained product features and a Detail-Aware Loss (DAL) that uses high-frequency maps for precise pixel-level supervision, significantly improving the fidelity of product details in generated images.
Link: https://arxiv.org/abs/2603.02210
Authors: Yichen Liu, Donghao Zhou, Jie Wang, Xin Gao, Guisheng Liu, Jiatong Li, Quanwei Zhang, Qiang Lyu, Lanqing Guo, Shilei Wen, Weiqiang Wang, Pheng-Ann Heng
Affiliations: Ningbo City; Hong Kong Centre for Logistics Robotics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026 (Project page: \url{ this https URL })
Abstract:Human-product images, which showcase the integration of humans and products, play a vital role in advertising, e-commerce, and digital marketing. The essential challenge of generating such images lies in ensuring the high-fidelity preservation of product details. Among existing paradigms, reference-based inpainting offers a targeted solution by leveraging product reference images to guide the inpainting process. However, limitations remain in three key aspects: the lack of diverse large-scale training data, the struggle of current models to focus on product detail preservation, and the inability of coarse supervision for achieving precise guidance. To address these issues, we propose HiFi-Inpaint, a novel high-fidelity reference-based inpainting framework tailored for generating human-product images. HiFi-Inpaint introduces Shared Enhancement Attention (SEA) to refine fine-grained product features and Detail-Aware Loss (DAL) to enforce precise pixel-level supervision using high-frequency maps. Additionally, we construct a new dataset, HP-Image-40K, with samples curated from self-synthesis data and processed with automatic filtering. Experimental results show that HiFi-Inpaint achieves state-of-the-art performance, delivering detail-preserving human-product images.
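The abstract describes the Detail-Aware Loss only at a high level: pixel-level supervision re-weighted by a high-frequency map. As a rough illustration of that general idea (not the paper's actual formulation), one could weight a pixel-wise L1 loss by the magnitude of a discrete Laplacian of the target; the `alpha` scale and the Laplacian choice here are assumptions:

```python
import numpy as np


def high_frequency_map(image: np.ndarray) -> np.ndarray:
    """Approximate a high-frequency map of a 2D image with a discrete
    Laplacian (illustrative stand-in for the paper's high-frequency maps)."""
    padded = np.pad(image, 1, mode="edge")
    lap = (
        padded[:-2, 1:-1] + padded[2:, 1:-1]
        + padded[1:-1, :-2] + padded[1:-1, 2:]
        - 4.0 * padded[1:-1, 1:-1]
    )
    return np.abs(lap)


def detail_aware_l1(pred: np.ndarray, target: np.ndarray, alpha: float = 1.0) -> float:
    """L1 loss re-weighted toward high-frequency (detail-rich) regions of the target."""
    weight = 1.0 + alpha * high_frequency_map(target)
    return float(np.mean(weight * np.abs(pred - target)))
```

Errors on edges and textures are penalized more than errors in flat regions, which is the intuition behind detail-preserving supervision.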
[CV-1] Adaptive Confidence Regularization for Multimodal Failure Detection CVPR2026
【Quick Read】: This paper addresses the lack of reliable failure-detection mechanisms for multimodal models in high-stakes settings such as autonomous driving and medical diagnosis. The core challenge, little studied so far, is how to identify failures of multimodal predictions. The key to the solution is the Adaptive Confidence Regularization (ACR) framework, built on the observation of "confidence degradation": in most failure cases, the confidence of the multimodal prediction is markedly lower than that of at least one unimodal branch. ACR introduces an Adaptive Confidence Loss that penalizes such degradation during training, combined with Multimodal Feature Swapping to synthesize challenging, failure-aware training examples, improving the model's ability to recognize and reject uncertain predictions and enhancing overall reliability.
Link: https://arxiv.org/abs/2603.02200
Authors: Moru Liu, Hao Dong, Olga Fink, Mario Trapp
Affiliations: Technical University of Munich; ETH Zürich; EPFL; Fraunhofer IKS
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by CVPR 2026
Abstract:The deployment of multimodal models in high-stakes domains, such as self-driving vehicles and medical diagnostics, demands not only strong predictive performance but also reliable mechanisms for detecting failures. In this work, we address the largely unexplored problem of failure detection in multimodal contexts. We propose Adaptive Confidence Regularization (ACR), a novel framework specifically designed to detect multimodal failures. Our approach is driven by a key observation: in most failure cases, the confidence of the multimodal prediction is significantly lower than that of at least one unimodal branch, a phenomenon we term confidence degradation. To mitigate this, we introduce an Adaptive Confidence Loss that penalizes such degradations during training. In addition, we propose Multimodal Feature Swapping, a novel outlier synthesis technique that generates challenging, failure-aware training examples. By training with these synthetic failures, ACR learns to more effectively recognize and reject uncertain predictions, thereby improving overall reliability. Extensive experiments across four datasets, three modalities, and multiple evaluation settings demonstrate that ACR achieves consistent and robust gains. The source code will be available at this https URL.
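The "confidence degradation" signal motivating ACR can be made concrete with a small sketch: compare the fused model's top-class confidence against each unimodal branch. This is an illustrative detection check, not the paper's training loss; the `margin` parameter and the softmax-max notion of confidence are assumptions:

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1D logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()


def confidence_degraded(mm_logits, unimodal_logits_list, margin: float = 0.0) -> bool:
    """Flag the 'confidence degradation' pattern described in the abstract:
    the fused (multimodal) prediction is noticeably less confident than
    at least one unimodal branch."""
    mm_conf = softmax(np.asarray(mm_logits, dtype=float)).max()
    uni_confs = [softmax(np.asarray(l, dtype=float)).max() for l in unimodal_logits_list]
    return mm_conf + margin < max(uni_confs)
```

ACR goes further by penalizing this gap during training; the sketch only shows the signal itself, which could serve as a simple post-hoc failure flag.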
[CV-2] From Leaderboard to Deployment: Code Quality Challenges in AV Perception Repositories
【Quick Read】: This paper addresses the significant gap between research performance and real-world deployment of autonomous-vehicle perception models: current evaluation relies on benchmark metrics while neglecting code quality, production readiness, and long-term maintainability, making it hard for safety-critical systems to meet international safety standards. The key to the solution is the first large-scale empirical study of this issue, using static analysis tools (Pylint, Bandit, and Radon) to systematically evaluate 178 perception-model repositories from the KITTI and NuScenes 3D object detection leaderboards. Only 7.3% of projects meet basic production-readiness criteria (zero critical errors and no high-severity security vulnerabilities), and security issues are highly concentrated in a handful of categories, from which the authors derive actionable guidelines; the study also finds that CI/CD practices correlate with higher code maintainability, providing a data-driven path toward improving the quality and safety of AV perception code.
Link: https://arxiv.org/abs/2603.02194
Authors: Mateus Karvat, Bram Adams, Sidney Givigi
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO); Software Engineering (cs.SE)
Comments:
Abstract:Autonomous vehicle (AV) perception models are typically evaluated solely on benchmark performance metrics, with limited attention to code quality, production readiness and long-term maintainability. This creates a significant gap between research excellence and real-world deployment in safety-critical systems subject to international safety standards. To address this gap, we present the first large-scale empirical study of software quality in AV perception repositories, systematically analyzing 178 unique models from the KITTI and NuScenes 3D Object Detection leaderboards. Using static analysis tools (Pylint, Bandit, and Radon), we evaluated code errors, security vulnerabilities, maintainability, and development practices. Our findings revealed that only 7.3% of the studied repositories meet basic production-readiness criteria, defined as having zero critical errors and no high-severity security vulnerabilities. Security issues are highly concentrated, with the top five issues responsible for almost 80% of occurrences, which prompted us to develop a set of actionable guidelines to prevent them. Additionally, the adoption of Continuous Integration/Continuous Deployment pipelines was correlated with better code maintainability. Our findings highlight that leaderboard performance does not reflect production readiness and that targeted interventions could substantially improve the quality and safety of AV perception code.
[CV-3] Leverag ing Model Soups to Classify Intangible Cultural Heritage Images from the Mekong Delta
【Quick Read】: This paper addresses three challenges in classifying Intangible Cultural Heritage (ICH) images under low-resource conditions: scarce annotated data, high visual similarity between classes, and pronounced domain heterogeneity, which cause conventional deep models to overfit or latch onto spurious correlations and generalize poorly. The key to the solution is a robust framework combining the hybrid CoAtNet architecture with model soups: CoAtNet fuses local and global features through stage-wise convolution and self-attention, while model soups, a lightweight weight-space ensembling technique, averages geometrically diverse checkpoints from a single training trajectory to reduce prediction variance with minimal added bias and no extra inference cost. The approach reaches 72.36% top-1 accuracy and 69.28% macro F1-score on the ICH-17 dataset, outperforming strong baselines including ResNet-50, DenseNet-121, and ViT, showing that diversity-aware checkpoint averaging is an effective way to improve generalization in culturally rich but data-scarce classification tasks.
Link: https://arxiv.org/abs/2603.02181
Authors: Quoc-Khang Tran, Minh-Thien Nguyen, Nguyen-Khang Pham
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Early accept of Vol 2025 No 3, November: Journal on Information Technologies Communications
Abstract:The classification of Intangible Cultural Heritage (ICH) images in the Mekong Delta poses unique challenges due to limited annotated data, high visual similarity among classes, and domain heterogeneity. In such low-resource settings, conventional deep learning models often suffer from high variance or overfit to spurious correlations, leading to poor generalization. To address these limitations, we propose a robust framework that integrates the hybrid CoAtNet architecture with model soups, a lightweight weight-space ensembling technique that averages checkpoints from a single training trajectory without increasing inference cost. CoAtNet captures both local and global patterns through stage-wise fusion of convolution and self-attention. We apply two ensembling strategies - greedy and uniform soup - to selectively combine diverse checkpoints into a final model. Beyond performance improvements, we analyze the ensembling effect through the lens of bias-variance decomposition. Our findings show that model soups reduces variance by stabilizing predictions across diverse model snapshots, while introducing minimal additional bias. Furthermore, using cross-entropy-based distance metrics and Multidimensional Scaling (MDS), we show that model soups selects geometrically diverse checkpoints, unlike Soft Voting, which blends redundant models centered in output space. Evaluated on the ICH-17 dataset (7,406 images across 17 classes), our approach achieves state-of-the-art results with 72.36% top-1 accuracy and 69.28% macro F1-score, outperforming strong baselines including ResNet-50, DenseNet-121, and ViT. These results underscore that diversity-aware checkpoint averaging provides a principled and efficient way to reduce variance and enhance generalization in culturally rich, data-scarce classification tasks.
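The two soup strategies named in the abstract, uniform and greedy, are simple to sketch in a framework-agnostic way. The paper applies them to CoAtNet checkpoints; here `evaluate` is a hypothetical stand-in for scoring a parameter set on held-out data:

```python
import numpy as np


def uniform_soup(checkpoints):
    """Uniform model soup: element-wise average of parameter dicts
    collected along one training trajectory."""
    keys = checkpoints[0].keys()
    return {k: np.mean([np.asarray(c[k]) for c in checkpoints], axis=0) for k in keys}


def greedy_soup(checkpoints, evaluate):
    """Greedy model soup: rank checkpoints by held-out score, then keep
    each one only if adding it to the running average does not hurt the
    score. `evaluate` maps a parameter dict to a validation metric
    (higher is better) and is an assumption of this sketch."""
    ranked = sorted(checkpoints, key=evaluate, reverse=True)
    soup = [ranked[0]]
    best = evaluate(uniform_soup(soup))
    for ckpt in ranked[1:]:
        candidate = evaluate(uniform_soup(soup + [ckpt]))
        if candidate >= best:
            soup.append(ckpt)
            best = candidate
    return uniform_soup(soup)
```

Because the soup is a single averaged model, inference cost is identical to any one checkpoint, which is the appeal over output-space ensembles such as soft voting.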
[CV-4] Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance
【速读】:该论文旨在解决当前基于指令的视频编辑方法在视觉控制精度上的不足问题,尤其是自然语言描述难以捕捉复杂视觉细节的局限性,以及参考引导式编辑因高质量成对训练数据稀缺而导致性能瓶颈的问题。其解决方案的关键在于构建一个可扩展的数据生成流水线,利用图像生成模型将现有视频编辑样本转化为高保真度的四元组训练数据(即输入视频、指令、参考图像和目标输出),从而有效扩充训练数据;同时提出统一的编辑架构Kiwi-Edit,通过可学习查询与潜在视觉特征的协同作用实现参考语义引导,并采用渐进式的多阶段训练策略显著提升指令遵循能力与参考一致性,最终在可控视频编辑任务中达到新的最先进水平。
链接: https://arxiv.org/abs/2603.02175
作者: Yiqi Lin,Guoqiang Liang,Ziyun Zeng,Zechen Bai,Yanzhe Chen,Mike Zheng Shou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi-stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state-of-the-art in controllable video editing. All datasets, models, and code is released at this https URL.
[CV-5] GeoDiT: Point-Conditioned Diffusion Transformer for Satellite Image Synthesis
【速读】:该论文旨在解决现有可控卫星图像生成模型依赖耗时且语义受限的像素级地图作为控制信号的问题。解决方案的关键在于提出一种新颖的基于点的条件控制框架,通过点的空间位置及其关联的文本描述提供语义丰富的控制信号,从而实现灵活、标注友好且计算简单的卫星图像生成。该框架引入自适应局部注意力机制,根据输入点查询有效正则化注意力分数,显著提升了生成质量与控制精度。
链接: https://arxiv.org/abs/2603.02172
作者: Srikumar Sastry,Dan Cher,Brian Wei,Aayush Dhakal,Subash Khanal,Dev Gupta,Nathan Jacobs
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages, 17 figures
Abstract:We introduce GeoDiT, a diffusion transformer designed for text-to-satellite image generation with point-based control. Existing controlled satellite image generative models often require pixel-level maps that are time-consuming to acquire, yet semantically limited. To address this limitation, we introduce a novel point-based conditioning framework that controls the generation process through the spatial location of the points and the textual description associated with each point, providing semantically rich control signals. This approach enables flexible, annotation-friendly, and computationally simple inference for satellite image generation. To this end, we introduce an adaptive local attention mechanism that effectively regularizes the attention scores based on the input point queries. We systematically evaluate various domain-specific design choices for training GeoDiT, including the selection of satellite image representation for alignment and geolocation representation for conditioning. Our experiments demonstrate that GeoDiT achieves impressive generation performance, surpassing the state-of-the-art remote sensing generative models.
[CV-6] Bridging the gap between Performance and Interpretability: An Explainable Disentangled Multimodal Framework for Cancer Survival Prediction
【速读】:该论文旨在解决多模态生存预测模型在提升准确性的同时往往牺牲可解释性的问题,从而限制了对不同数据源如何影响预测结果的深入理解。其解决方案的关键在于提出DIMAFx框架,该框架通过生成解耦的、可解释的模态特异性和模态共享表示,从组织病理学全切片图像(whole-slide images)和转录组学数据中提取特征。该设计不仅实现了当前最优的生存预测性能,还借助SHapley Additive exPlanations(SHAP)方法系统揭示了关键的多模态交互作用及解耦表示中的生物学信息,从而在保持高精度的同时显著增强了模型的可解释性,为精准医学应用提供了可靠依据。
链接: https://arxiv.org/abs/2603.02162
作者: Aniek Eijpe,Soufyan Lakbir,Melis Erdal Cesur,Sara P. Oliveira,Angelos Chatzimparmpas,Sanne Abeln,Wilson Silva
机构: Utrecht University (乌得勒支大学); The Netherlands Cancer Institute (荷兰癌症研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While multimodal survival prediction models are increasingly more accurate, their complexity often reduces interpretability, limiting insight into how different data sources influence predictions. To address this, we introduce DIMAFx, an explainable multimodal framework for cancer survival prediction that produces disentangled, interpretable modality-specific and modality-shared representations from histopathology whole-slide images and transcriptomics data. Across multiple cancer cohorts, DIMAFx achieves state-of-the-art performance and improved representation disentanglement. Leveraging its interpretable design and SHapley Additive exPlanations, DIMAFx systematically reveals key multimodal interactions and the biological information encoded in the disentangled representations. In breast cancer survival prediction, the most predictive features contain modality-shared information, including one capturing solid tumor morphology contextualized primarily by late estrogen response, where higher-grade morphology aligned with pathway upregulation and increased risk, consistent with known breast cancer biology. Key modality-specific features capture microenvironmental signals from interacting adipose and stromal morphologies. These results show that multimodal models can overcome the traditional trade-off between performance and explainability, supporting their application in precision medicine.
[CV-7] 3D Field of Junctions: A Noise-Robust Training-Free Structural Prior for Volumetric Inverse Problems
【速读】:该论文旨在解决低信噪比(SNR)条件下三维(3D)成像中的体积去噪问题,这是计算成像中一个基础且关键的挑战。其解决方案的核心在于提出了一种全新的全体积3D Field of Junctions(3D FoJ)表示方法,该方法通过优化一组3D楔形结构来最佳解释整个体积中的每个3D局部块,并强制重叠块之间的结构一致性。该表示作为无训练数据的结构先验,能够在不引入幻觉风险的前提下有效保留并增强边缘和角点等精细结构,同时可直接集成到投影或近端梯度下降算法中,适用于各类低SNR的体积逆问题,从而在低剂量X射线CT、冷冻电子断层扫描(cryo-ET)及恶劣天气下激光雷达点云去噪等多场景中显著优于传统与神经网络方法。
链接: https://arxiv.org/abs/2603.02149
作者: Namhoon Kim,Narges Moeini,Justin Romberg,Sara Fridovich-Keil
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: Code will be released soon
Abstract:Volume denoising is a foundational problem in computational imaging, as many 3D imaging inverse problems face high levels of measurement noise. Inspired by the strong 2D image denoising properties of Field of Junctions (ICCV 2021), we propose a novel, fully volumetric 3D Field of Junctions (3D FoJ) representation that optimizes a junction of 3D wedges that best explain each 3D patch of a full volume, while encouraging consistency between overlapping patches. In addition to direct volume denoising, we leverage our 3D FoJ representation as a structural prior that: (i) requires no training data, and thus precludes the risk of hallucination, (ii) preserves and enhances sharp edge and corner structures in 3D, even under low signal to noise ratio (SNR), and (iii) can be used as a drop-in denoising representation via projected or proximal gradient descent for any volumetric inverse problem with low SNR. We demonstrate successful volume reconstruction and denoising with 3D FoJ across three diverse 3D imaging tasks with low-SNR measurements: low-dose X-ray computed tomography (CT), cryogenic electron tomography (cryo-ET), and denoising point clouds such as those from lidar in adverse weather. Across these challenging low-SNR volumetric imaging problems, 3D FoJ outperforms a mixture of classical and neural methods.
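The abstract describes using 3D FoJ as a "drop-in denoising representation via projected or proximal gradient descent". The plug-and-play pattern it refers to can be sketched generically: alternate a gradient step on the data-fidelity term with an application of the prior. Here a toy 1D moving-average denoiser stands in for projection onto the 3D FoJ model, and the identity forward operator and all numbers are hypothetical:

```python
import numpy as np

def smooth_prior(x, k=3):
    """Toy stand-in for a structural prior: a moving-average denoiser.
    In the paper this role is played by projection onto the 3D FoJ model."""
    pad = k // 2
    xp = np.pad(x, pad, mode="edge")
    return np.convolve(xp, np.ones(k) / k, mode="valid")

def plug_and_play_pgd(y, forward, adjoint, step=0.5, iters=50):
    """Projected gradient descent for min ||A x - y||^2, applying a
    denoiser after each gradient step (plug-and-play prior)."""
    x = adjoint(y)
    for _ in range(iters):
        grad = adjoint(forward(x) - y)  # gradient of the least-squares term
        x = smooth_prior(x - step * grad)
    return x

# Toy 1D denoising problem: A = identity, noisy piecewise-constant signal.
rng = np.random.default_rng(0)
clean = np.concatenate([np.zeros(20), np.ones(20)])
noisy = clean + 0.3 * rng.standard_normal(40)
rec = plug_and_play_pgd(noisy, lambda x: x, lambda x: x)
print(np.abs(rec - clean).mean() < np.abs(noisy - clean).mean())
```

The point of the pattern is that the prior needs no training data: any structure-preserving projection can be swapped in for `smooth_prior`, which is what makes 3D FoJ usable across CT, cryo-ET, and lidar denoising.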
[CV-8] Is Bigger Always Better? Efficiency Analysis in Resource-Constrained Small Object Detection
【速读】:该论文旨在解决地球观测(Earth Observation, EO)领域中模型规模扩展假设的合理性问题,即在资源受限条件下,是否仍能通过增大模型规模、数据量或输入分辨率来持续提升性能。传统观点认为更大的模型和更多数据必然带来更好的效果,但这一假设在EO场景下尚未得到验证。研究的关键在于系统性地评估三个缩放维度(模型大小、数据集规模、输入分辨率)对屋顶光伏(rooftop PV)检测任务效率的影响,并以单位模型尺寸下的mAP₅₀作为核心效率指标。结果显示,小模型YOLO11N在高分辨率配置下不仅实现了最高效率(比YOLO11X高24倍),还获得最优绝对精度(0.617),且分辨率是最重要的资源配置杠杆(效率提升120%),而低分辨率下增加数据几乎无收益;最终发现小模型高分辨率配置在所有44种部署设置中均占据帕累托前沿,表明“更大并不总是更好”,甚至可能更差——这颠覆了常规认知,为资源受限环境下的高效模型设计提供了实证依据。
链接: https://arxiv.org/abs/2603.02142
作者: Kwame Mbobda-Kuate,Gabriel Kasmi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 13 pages, 9 figures, 8 tables
Abstract:Scaling laws assume larger models trained on more data consistently outperform smaller ones – an assumption that drives model selection in computer vision but remains untested in resource-constrained Earth observation (EO). We conduct a systematic efficiency analysis across three scaling dimensions: model size, dataset size, and input resolution, on rooftop PV detection in Madagascar. Optimizing for model efficiency (mAP_50 per unit of model size), we find a consistent efficiency inversion: YOLO11N achieves both the highest efficiency (24× higher than YOLO11X) and the highest absolute mAP_50 (0.617). Resolution is the dominant resource allocation lever (+120% efficiency gain), while additional data yields negligible returns at low resolution. These findings are robust to the deployment objective: small high-resolution configurations are Pareto-dominant across all 44 setups in the joint accuracy-throughput space, leaving no tradeoff to resolve. In data-scarce EO, bigger is not just unnecessary: it can be worse.
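The paper's two evaluation devices are easy to state concretely: the efficiency metric (mAP_50 per unit of model size) and Pareto dominance in the joint accuracy-throughput space. A minimal sketch, with hypothetical configuration numbers (the real YOLO11 results are in the paper, not here):

```python
def efficiency(map50, params_m):
    """Efficiency metric from the paper: mAP_50 per unit of model size
    (here, per million parameters)."""
    return map50 / params_m

def pareto_front(points):
    """Return the subset of (accuracy, throughput) points not dominated by
    any other point, where higher is better on both axes."""
    front = []
    for p in points:
        dominated = any(q[0] >= p[0] and q[1] >= p[1] and q != p for q in points)
        if not dominated:
            front.append(p)
    return front

# Hypothetical configurations: (mAP_50, images/s).
configs = [(0.617, 120.0), (0.590, 40.0), (0.550, 15.0), (0.600, 100.0)]
print(pareto_front(configs))  # [(0.617, 120.0)]
```

"Pareto-dominant across all 44 setups" in the abstract means the small high-resolution configuration plays the role of the single surviving point above: no other setup beats it on either axis.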
[CV-9] Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation CVPR2026
【速读】:该论文旨在解决鱼眼相机(fisheye camera)在机器人操作任务中广泛应用背景下,其宽视场(Field of View, FoV)对策略学习下游影响缺乏系统理解的问题。针对这一问题,作者通过在仿真与真实场景中的大量实验,围绕空间定位、场景泛化和硬件泛化三个关键问题展开研究,发现:(1)宽FoV虽能显著提升空间定位能力,但效果依赖于环境的视觉复杂度;(2)在足够多样化的环境中训练时,鱼眼图像可实现优于传统相机的场景泛化性能;(3)跨相机迁移失败主要源于尺度过拟合(scale overfitting),而引入简单的随机尺度增强(Random Scale Augmentation, RSA)策略即可有效改善硬件泛化表现。解决方案的关键在于识别并量化鱼眼相机特性对策略学习的影响机制,并提出针对性的数据增强方法以提升模型鲁棒性与泛化能力。
链接: https://arxiv.org/abs/2603.02139
作者: Han Xue,Nan Min,Xiaotong Liu,Wendi Chen,Yuan Fang,Jun Lv,Cewu Lu,Chuan Wen
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 15 figures, Accepted by CVPR 2026
Abstract:The adoption of fisheye cameras in robotic manipulation, driven by their exceptionally wide Field of View (FoV), is rapidly outpacing a systematic understanding of their downstream effects on policy learning. This paper presents the first comprehensive empirical study to bridge this gap, rigorously analyzing the properties of wrist-mounted fisheye cameras for imitation learning. Through extensive experiments in both simulation and the real world, we investigate three critical research questions: spatial localization, scene generalization, and hardware generalization. Our investigation reveals that: (1) The wide FoV significantly enhances spatial localization, but this benefit is critically contingent on the visual complexity of the environment. (2) Fisheye-trained policies, while prone to overfitting in simple scenes, unlock superior scene generalization when trained with sufficient environmental diversity. (3) While naive cross-camera transfer leads to failures, we identify the root cause as scale overfitting and demonstrate that hardware generalization performance can be improved with a simple Random Scale Augmentation (RSA) strategy. Collectively, our findings provide concrete, actionable guidance for the large-scale collection and effective use of fisheye datasets in robotic learning. More results and videos are available on this https URL
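The abstract names "Random Scale Augmentation (RSA)" without giving its details. A plausible minimal version, consistent with the stated goal of breaking scale overfitting, rescales each image by a random factor and then center-crops or zero-pads back to the original size; the scale range and the implementation below are assumptions, not the paper's code:

```python
import numpy as np

def random_scale_augment(img, rng, lo=0.7, hi=1.3):
    """Rescale an HxWxC image by a random factor (nearest-neighbor),
    then center-crop (if larger) or paste centered on zeros (if smaller)
    so the output keeps the original spatial size."""
    h, w = img.shape[:2]
    s = rng.uniform(lo, hi)
    nh, nw = max(1, int(h * s)), max(1, int(w * s))
    # Nearest-neighbor resize via index sampling.
    rows = (np.arange(nh) * h / nh).astype(int)
    cols = (np.arange(nw) * w / nw).astype(int)
    scaled = img[rows][:, cols]
    out = np.zeros_like(img)
    ch, cw = min(h, nh), min(w, nw)
    sy, sx = (nh - ch) // 2, (nw - cw) // 2   # source crop offset
    oy, ox = (h - ch) // 2, (w - cw) // 2     # destination paste offset
    out[oy:oy + ch, ox:ox + cw] = scaled[sy:sy + ch, sx:sx + cw]
    return out

rng = np.random.default_rng(0)
img = np.ones((64, 64, 3), dtype=np.float32)
aug = random_scale_augment(img, rng)
print(aug.shape)  # (64, 64, 3)
```

The design intent is that a policy trained with varying apparent object scale cannot latch onto one camera's fixed field of view, which is the failure mode the paper diagnoses as scale overfitting.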
[CV-10] OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens CVPR2026
【速读】:该论文旨在解决多模态指令下高质量矢量动画生成的挑战,尤其是如何从文本和视觉等多模态输入中精确控制动画的运动与视觉内容。现有方法受限于Lottie格式JSON文件中大量不变的结构元数据和格式化标记,难以直接用于学习矢量动画生成。解决方案的关键在于设计了一个精心构造的Lottie分词器(Lottie tokenizer),将原始JSON转化为由命令和参数组成的结构化序列,从而有效表示形状、动画函数及控制参数;在此基础上,借助预训练视觉语言模型(vision-language models, VLMs)实现对多模态交错指令的理解与响应,最终生成语义一致且生动的矢量动画。
链接: https://arxiv.org/abs/2603.02138
作者: Yiying Yang,Wei Cheng,Sijin Chen,Honghao Fu,Xianfang Zeng,Yujun Cai,Gang Yu,Xingjun Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026. Project Page: this https URL
Abstract:OmniLottie is a versatile framework that generates high quality vector animations from multi-modal instructions. For flexible motion and visual content control, we focus on Lottie, a lightweight JSON format for representing both shapes and animation behaviors. However, the raw Lottie JSON files contain extensive invariant structural metadata and formatting tokens, posing significant challenges for learning vector animation generation. Therefore, we introduce a well-designed Lottie tokenizer that transforms JSON files into structured sequences of commands and parameters representing shapes, animation functions and control parameters. This tokenizer enables us to build OmniLottie upon pretrained vision language models to follow multi-modal interleaved instructions and generate high quality vector animations. To further advance research in vector animation generation, we curate MMLottie-2M, a large-scale dataset of professionally designed vector animations paired with textual and visual annotations. With extensive experiments, we validate that OmniLottie can produce vivid and semantically aligned vector animations that adhere closely to multi-modal human instructions.
[CV-11] OnlineX: Unified Online 3D Reconstruction and Understanding with Active-to-Stable State Evolution
【速读】:该论文旨在解决现有3D高斯溅射(3D Gaussian Splatting, 3DGS)方法在在线场景重建中的局限性,即当前方法主要基于离线重建范式,无法实现连续、实时的三维场景重建,从而限制了其在机器人和虚拟现实/增强现实(VR/AR)等在线应用场景中的部署。其解决方案的关键在于提出一种名为OnlineX的前馈式在线重建框架,通过引入“解耦的主动-稳定状态演化”机制,将记忆状态分离为专用的主动状态(用于捕捉高频局部几何细节)和持久的稳定状态(用于保留长期全局结构),并通过隐式高斯融合模块协同整合两者信息,从而在保证重建精度的同时有效缓解累积漂移问题,实现视觉外观与语言场的联合建模及实时推理。
链接: https://arxiv.org/abs/2603.02134
作者: Chong Xia,Fangfu Liu,Yule Wang,Yize Pang,Yueqi Duan
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in generalizable 3D Gaussian Splatting (3DGS) have enabled rapid 3D scene reconstruction within seconds, eliminating the need for per-scene optimization. However, existing methods primarily follow an offline reconstruction paradigm, lacking the capacity for continuous reconstruction, which limits their applicability to online scenarios such as robotics and VR/AR. In this paper, we introduce OnlineX, a feed-forward framework that reconstructs both 3D visual appearance and language fields in an online manner using only streaming images. A key challenge in online formulation is the cumulative drift issue, which is rooted in the fundamental conflict between two opposing roles of the memory state: an active role that constantly refreshes to capture high-frequency local geometry, and a stable role that conservatively accumulates and preserves the long-term global structure. To address this, we introduce a decoupled active-to-stable state evolution paradigm. Our framework decouples the memory state into a dedicated active state and a persistent stable state, and then cohesively fuses the information from the former into the latter to achieve both fidelity and stability. Moreover, we jointly model visual appearance and language fields and incorporate an implicit Gaussian fusion module to enhance reconstruction quality. Experiments on mainstream datasets demonstrate that our method consistently outperforms prior work in novel view synthesis and semantic understanding, showcasing robust performance across input sequences of varying lengths with real-time inference speed.
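The active/stable decoupling described above can be illustrated with a deliberately simplified toy: the active state is overwritten by each incoming frame's features, while the stable state accumulates them conservatively via an exponential moving average. The class below is an assumption-laden sketch (the paper's fusion is a learned implicit Gaussian fusion module, not a fixed EMA):

```python
import numpy as np

class DualStateMemory:
    """Toy active/stable memory. The active state refreshes every frame to
    track high-frequency local detail; the stable state accumulates it
    conservatively to preserve long-term global structure."""

    def __init__(self, dim, momentum=0.95):
        self.active = np.zeros(dim)
        self.stable = np.zeros(dim)
        self.momentum = momentum
        self.steps = 0

    def update(self, frame_features):
        self.active = frame_features                 # refresh: local detail
        if self.steps == 0:
            self.stable = frame_features.copy()      # initialize once
        else:
            m = self.momentum                        # conservative accumulation
            self.stable = m * self.stable + (1 - m) * self.active
        self.steps += 1
        return self.stable

mem = DualStateMemory(dim=4)
for t in range(100):
    mem.update(np.full(4, float(t % 2)))  # alternating 0/1 "frames"
print(np.round(mem.stable, 2))            # hovers near the long-run mean, ~0.5
```

The point of the decoupling is visible even here: the active state swings with every frame, while the stable state stays near the long-run structure of the stream instead of drifting with the latest observation.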
[CV-12] SimRecon: SimReady Compositional Scene Reconstruction from Real Videos
【速读】:该论文旨在解决复杂场景下基于视频的组合式场景重建(Compositional Scene Reconstruction)中存在的视觉失真与物理不一致性问题,即传统方法在真实世界场景中泛化能力差,且直接串联感知、生成与仿真三个阶段会导致生成对象视觉 fidelity 低和最终场景物理合理性不足。解决方案的关键在于提出一个“感知-生成-仿真”(Perception-Generation-Simulation)全流程框架,并设计两个桥接模块:其一为主动视角优化(Active Viewpoint Optimization),用于提升从感知到生成阶段的视觉保真度,通过在3D空间中搜索最优投影图像作为单物体补全的条件;其二为场景图合成器(Scene Graph Synthesizer),用于确保从生成到仿真的过渡具备物理合理性,通过模仿真实世界的构建逻辑,在3D仿真器中从零开始引导场景结构的构建。
链接: https://arxiv.org/abs/2603.02133
作者: Chong Xia,Kai Zhu,Zizhuo Wang,Fangfu Liu,Zhizheng Zhang,Yueqi Duan
机构: Tsinghua University (清华大学); Galbot
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Compositional scene reconstruction seeks to create object-centric representations rather than holistic scenes from real-world videos, which is natively applicable for simulation and interaction. Conventional compositional reconstruction approaches primarily emphasize on visual appearance and show limited generalization ability to real-world scenarios. In this paper, we propose SimRecon, a framework that realizes a “Perception-Generation-Simulation” pipeline towards cluttered scene reconstruction, which first conducts scene-level semantic reconstruction from video input, then performs single-object generation, and finally assembles these assets in the simulator. However, naively combining these three stages leads to visual infidelity of generated assets and physical implausibility of the final scene, a problem particularly severe for complex scenes. Thus, we further propose two bridging modules between the three stages to address this problem. To be specific, for the transition from Perception to Generation, critical for visual fidelity, we introduce Active Viewpoint Optimization, which actively searches in 3D space to acquire optimal projected images as conditions for single-object completion. Moreover, for the transition from Generation to Simulation, essential for physical plausibility, we propose a Scene Graph Synthesizer, which guides the construction from scratch in 3D simulators, mirroring the native, constructive principle of the real world. Extensive experiments on the ScanNet dataset validate our method’s superior performance over previous state-of-the-art approaches.
[CV-13] Stereo-Inertial Poser: Towards Metric-Accurate Shape-Aware Motion Capture Using Sparse IMUs and a Single Stereo Camera ICRA2026
【速读】:该论文旨在解决单模态视觉-惯性动作捕捉系统中存在的两大核心问题:一是由单目深度模糊(monocular depth ambiguity)导致的全局平移度量不准确;二是局部运动估计缺乏人体形态感知(shape-agnostic),无法适应个体间的人体结构差异。解决方案的关键在于提出 Stereo-Inertial Poser 系统,其创新性地采用单目立体相机(stereo camera)替代传统单目RGB相机,利用标定后的基线几何关系消除深度模糊,从而实现直接的3D关键点提取与身体形状参数估计;同时,通过引入一种新颖的形态感知融合模块(shape-aware fusion module),动态协调人体形态变化与全局平移之间的耦合关系,并结合六颗IMU数据与视觉线索进行联合优化,有效抑制漂移并提升动作捕捉的精度与鲁棒性。该方法无需优化后处理即可达到200 FPS以上的实时性能,显著优于现有技术。
链接: https://arxiv.org/abs/2603.02130
作者: Tutian Tang,Xingyu Ji,Yutong Li,MingHao Liu,Wenqiang Xu,Cewu Lu
机构: Shanghai Jiao Tong University (上海交通大学); Meta Robotics Institute (Meta 机器人研究所); Shanghai Innovation Institute (上海创新研究院); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Shanghai Municipal Education Commission (上海市教育委员会)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The code, data, and supplementary materials are available at this https URL. Accepted to ICRA 2026
Abstract:Recent advancements in visual-inertial motion capture systems have demonstrated the potential of combining monocular cameras with sparse inertial measurement units (IMUs) as cost-effective solutions, which effectively mitigate occlusion and drift issues inherent in single-modality systems. However, they are still limited by metric inaccuracies in global translations stemming from monocular depth ambiguity, and shape-agnostic local motion estimations that ignore anthropometric variations. We present Stereo-Inertial Poser, a real-time motion capture system that leverages a single stereo camera and six IMUs to estimate metric-accurate and shape-aware 3D human motion. By replacing the monocular RGB with stereo vision, our system resolves depth ambiguity through calibrated baseline geometry, enabling direct 3D keypoint extraction and body shape parameter estimation. IMU data and visual cues are fused for predicting drift-compensated joint positions and root movements, while a novel shape-aware fusion module dynamically harmonizes anthropometry variations with global translations. Our end-to-end pipeline achieves over 200 FPS without optimization-based post-processing, enabling real-time deployment. Quantitative evaluations across various datasets demonstrate state-of-the-art performance. Qualitative results show our method produces drift-free global translation under a long recording time and reduces foot-skating effects.
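The claim that a calibrated stereo baseline "resolves depth ambiguity" rests on the classic rectified-stereo relation Z = f·b/d: once disparity d is measured, depth follows directly from the focal length f and baseline b, with no scale ambiguity. A minimal sketch with hypothetical calibration values (not the paper's rig):

```python
def stereo_depth(disparity_px, focal_px, baseline_m):
    """Rectified-stereo triangulation: depth Z = f * b / d."""
    return focal_px * baseline_m / disparity_px

def backproject(u, v, disparity_px, focal_px, cx, cy, baseline_m):
    """Back-project pixel (u, v) with disparity d into metric camera
    coordinates using the pinhole model with principal point (cx, cy)."""
    z = stereo_depth(disparity_px, focal_px, baseline_m)
    x = (u - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return x, y, z

# Hypothetical calibration: 800 px focal length, 12 cm baseline.
x, y, z = backproject(u=740, v=360, disparity_px=16.0, focal_px=800.0,
                      cx=640.0, cy=360.0, baseline_m=0.12)
print(round(z, 2))  # 6.0 (metres)
```

This is why the system can extract metric 3D keypoints directly, where a monocular camera would only recover translation up to an unknown scale.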
[CV-14] LiftAvatar: Kinematic-Space Completion for Expression-Controlled 3D Gaussian Avatar Animation
【速读】:该论文旨在解决基于单目视频的3D虚拟形象(avatar)动画中因稀疏运动学信号(kinematic cues)导致的表达能力有限和重建伪影问题,尤其在日常单目视频场景下表现尤为明显。解决方案的关键在于提出LiftAvatar框架,其核心思想是将不完整的输入数据提升为更丰富的运动学表示,从而增强下游3D动画流程中的重建与驱动效果;具体包括两个关键技术:(i) 多粒度表情控制机制,通过结合阴影图(shading maps)与表情系数实现精确且稳定的驱动;(ii) 多参考条件机制,聚合多帧间的互补信息以提升3D一致性与可控性,同时作为即插即用模块有效融合大规模视频生成模型中的先验知识至3D管线中,显著提升动画质量与泛化性能。
链接: https://arxiv.org/abs/2603.02129
作者: Hualiang Wei,Shunran Jia,Jialun Liu,Wenhui Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 19 pages, 11 figures
Abstract:We present LiftAvatar, a new paradigm that completes sparse monocular observations in kinematic space (e.g., facial expressions and head pose) and uses the completed signals to drive high-fidelity avatar animation. LiftAvatar is a fine-grained, expression-controllable large-scale video diffusion Transformer that synthesizes high-quality, temporally coherent expression sequences conditioned on single or multiple reference images. The key idea is to lift incomplete input data into a richer kinematic representation, thereby strengthening both reconstruction and animation in downstream 3D avatar pipelines. To this end, we introduce (i) a multi-granularity expression control scheme that combines shading maps with expression coefficients for precise and stable driving, and (ii) a multi-reference conditioning mechanism that aggregates complementary cues from multiple frames, enabling strong 3D consistency and controllability. As a plug-and-play enhancer, LiftAvatar directly addresses the limited expressiveness and reconstruction artifacts of 3D Gaussian Splatting-based avatars caused by sparse kinematic cues in everyday monocular videos. By expanding incomplete observations into diverse pose-expression variations, LiftAvatar also enables effective prior distillation from large-scale video generative models into 3D pipelines, leading to substantial gains. Extensive experiments show that LiftAvatar consistently boosts animation quality and quantitative metrics of state-of-the-art 3D avatar methods, especially under extreme, unseen expressions.
[CV-15] A 3D mesh convolution-based autoencoder for geometry compression
【速读】:该论文旨在解决不规则三维网格(3D mesh)几何压缩问题,传统方法通常依赖于预处理或要求网格满足流形(manifold)和水密(watertight)条件,限制了其在实际场景中的适用性。解决方案的关键在于提出一种基于3D网格卷积的自编码器架构,直接从网格面(mesh faces)中学习有意义的潜在表示,同时通过专用的池化(pooling)与上采样(unpooling)操作保留拓扑连接性。该方法将输入网格压缩到一个紧凑的基础网格空间,确保潜在空间的一致性和可比性,并在解码端恢复原始连接性和完整几何细节,从而在多类别数据集上实现了优于现有最优方法的重建精度和潜在空间分类性能。
链接: https://arxiv.org/abs/2603.02125
作者: Germain Bregeon,Marius Preda,Radu Ispas,Titus Zaharia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this paper, we introduce a novel 3D mesh convolution-based autoencoder for geometry compression, able to deal with irregular mesh data without requiring neither preprocessing nor manifold/watertightness conditions. The proposed approach extracts meaningful latent representations by learning features directly from the mesh faces, while preserving connectivity through dedicated pooling and unpooling operations. The encoder compresses the input mesh into a compact base mesh space, which ensures that the latent space remains comparable. The decoder reconstructs the original connectivity and restores the compressed geometry to its full resolution. Extensive experiments on multi-class datasets demonstrate that our method outperforms state-of-the-art approaches in both 3D mesh geometry reconstruction and latent space classification tasks. Code available at: this http URL
[CV-16] Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy
【速读】:该论文旨在解决情感多模态语言模型(Affective Multimodal Language Models, MLMs)中低层感知与高层交互之间存在的鸿沟问题,该鸿沟导致情感能力碎片化且泛化能力受限。解决方案的关键在于提出一个受认知启发的三层层级结构——感知(Perception)、理解(Understanding)与交互(Interaction),以此统一情感任务的认知深度,并构建了Nano-EmoX小规模多任务MLM与基于课程学习的P2E(Perception-to-Empathy)训练框架。其中,Nano-EmoX通过集成多种模态编码器(如增强型面部编码器和融合编码器)提取关键多模态情感线索,并利用异构适配器将输出投影至统一语言空间,使轻量级语言模型能够高效处理多样化情感任务;P2E则通过逐步对齐快速感知与链式思维驱动的情感共情,系统性提升模型的情感智能水平。该方案首次在仅2.2B参数规模下实现了六项核心情感任务的统一建模,在多个基准测试中达到或超越当前最优性能,展现出卓越的效率与泛化能力。
链接: https://arxiv.org/abs/2603.02123
作者: Jiahao Huang,Fengyan Lin,Xuechao Yang,Chen Feng,Kexin Zhu,Xu Yang,Zhide Chen
机构: Fujian Normal University (福建师范大学); RMIT University (皇家墨尔本理工大学); Minjiang University (闽江学院)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages,8 figures, The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026
Abstract:The development of affective multimodal language models (MLMs) has long been constrained by a gap between low-level perception and high-level interaction, leading to fragmented affective capabilities and limited generalization. To bridge this gap, we propose a cognitively inspired three-level hierarchy that organizes affective tasks according to their cognitive depth (perception, understanding, and interaction) and provides a unified conceptual foundation for advancing affective modeling. Guided by this hierarchy, we introduce Nano-EmoX, a small-scale multitask MLM, and P2E (Perception-to-Empathy), a curriculum-based training framework. Nano-EmoX integrates a suite of omni-modal encoders, including an enhanced facial encoder and a fusion encoder, to capture key multimodal affective cues and improve cross-task transferability. The outputs are projected into a unified language space via heterogeneous adapters, empowering a lightweight language model to tackle diverse affective tasks. Concurrently, P2E progressively cultivates emotional intelligence by aligning rapid perception with chain-of-thought-driven empathy. To the best of our knowledge, Nano-EmoX is the first compact MLM (2.2B) to unify six core affective tasks across all three hierarchy levels, achieving state-of-the-art or highly competitive performance across multiple benchmarks, demonstrating excellent efficiency and generalization.
[CV-17] FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding CVPR2026
【速读】:该论文旨在解决流式视频理解中因冗余视觉信息导致的高延迟与高显存占用问题,尤其在实时场景下模型效率与性能难以兼顾的挑战。解决方案的关键在于提出了一种无需训练的高效视觉记忆压缩框架FluxMem,其核心机制为分层两阶段设计:首先通过时间邻接选择(Temporal Adjacency Selection, TAS)模块去除相邻帧间的冗余视觉标记(token),再由空间域整合(Spatial Domain Consolidation, SDC)模块进一步合并单帧内空间重复区域,形成紧凑表示;同时引入自适应标记压缩机制,在TAS和SDC中依据场景内在统计特性自动调节压缩率,避免人工调参,从而在保持高精度的同时显著降低延迟(减少69.9%)和峰值GPU显存(减少34.5%),且在离线任务中仍维持优异性能。
链接: https://arxiv.org/abs/2603.02096
作者: Yiweng Xie,Bo He,Junke Wang,Xiangyu Zheng,Ziyi Ye,Zuxuan Wu
机构: Fudan University (复旦大学); Shanghai Innovation Institute (上海创新研究院); Shanghai Key Laboratory of Multimodal Embodied AI (上海市多模态具身智能重点实验室); University of Maryland, College Park (马里兰大学学院公园分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at CVPR 2026. Project page: this https URL
Abstract:This paper presents FluxMem, a training-free framework for efficient streaming video understanding. FluxMem adaptively compresses redundant visual memory through a hierarchical, two-stage design: (1) a Temporal Adjacency Selection (TAS) module removes redundant visual tokens across adjacent frames, and (2) a Spatial Domain Consolidation (SDC) module further merges spatially repetitive regions within each frame into compact representations. To adapt effectively to dynamic scenes, we introduce a self-adaptive token compression mechanism in both TAS and SDC, which automatically determines the compression rate based on intrinsic scene statistics rather than manual tuning. Extensive experiments demonstrate that FluxMem achieves new state-of-the-art results on existing online video benchmarks, reaching 76.4 on StreamingBench and 67.2 on OVO-Bench under real-time settings, while reducing latency by 69.9% and peak GPU memory by 34.5% on OVO-Bench. Furthermore, it maintains strong offline performance, achieving 73.1 on MLVU while using 65% fewer visual tokens.
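The Temporal Adjacency Selection idea above (drop tokens that barely changed from the previous frame) can be sketched with per-position cosine similarity. Note the paper determines the compression rate adaptively from scene statistics; the fixed threshold and toy token vectors below are purely illustrative assumptions:

```python
import numpy as np

def temporal_prune(prev_tokens, curr_tokens, thresh=0.95):
    """Keep only current-frame tokens whose cosine similarity to the
    same-position token in the previous frame falls below `thresh`
    (i.e. tokens that actually changed)."""
    a = prev_tokens / np.linalg.norm(prev_tokens, axis=1, keepdims=True)
    b = curr_tokens / np.linalg.norm(curr_tokens, axis=1, keepdims=True)
    sim = (a * b).sum(axis=1)        # per-token cosine similarity
    keep = sim < thresh
    return curr_tokens[keep], keep

prev = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
curr = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]])  # only token 1 changed
kept, mask = temporal_prune(prev, curr)
print(mask)  # [False  True False] – only the changed token survives
```

Spatial Domain Consolidation then plays the complementary role within a single frame, merging the surviving tokens that are spatially repetitive into compact representations.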
[CV-18] Detection-Gated Glottal Segmentation with Zero-Shot Cross-Dataset Transfer and Clinical Feature Extraction KR
【速读】:该论文旨在解决高帧率视频喉镜(High-Speed Videoendoscopy, HSV)中声门分割的准确性问题,特别是现有深度学习模型在非声门帧中产生伪影以及跨临床场景泛化能力不足的问题。其解决方案的关键在于提出一种检测门控(detection-gated)流水线架构,该架构将YOLOv8目标检测器与U-Net图像分割器相结合,并引入时间一致性封装模块以抑制声门闭合和器械遮挡期间的假阳性结果,从而实现鲁棒的声门区域识别。此设计在仅用600帧训练数据的情况下即实现了零样本迁移至大规模BAGLS数据集上的优异性能(DSC 0.85),并验证了自动化提取的声门面积变异系数(coefficient of variation, CV)作为区分健康与病理发声功能的重要生物标志物(p=0.006)。
链接: https://arxiv.org/abs/2603.02087
作者: Harikrishnan Unnikrishnan
机构: Orchard Robotics(Orchard Robotics)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: for associated code see: this https URL
Abstract:Background: Accurate glottal segmentation in high-speed videoendoscopy (HSV) is essential for extracting kinematic biomarkers of laryngeal function. However, existing deep learning models often produce spurious artifacts in non-glottal frames and fail to generalize across different clinical settings. Methods: We propose a detection-gated pipeline that integrates a YOLOv8-based detector with a U-Net segmenter. A temporal consistency wrapper ensures robustness by suppressing false positives during glottal closure and instrument occlusion. The model was trained on a limited subset of the GIRAFE dataset (600 frames) and evaluated via zero-shot transfer on the large-scale BAGLS dataset. Results: The pipeline achieved state-of-the-art performance on the GIRAFE benchmark (DSC 0.81) and demonstrated superior generalizability on BAGLS (DSC 0.85, in-distribution) without institutional fine-tuning. Downstream validation on a 65-subject clinical cohort confirmed that automated kinematic features (Open Quotient, coefficient of variation) remained consistent with established clinical benchmarks. The coefficient of variation (CV) of the glottal area was found to be a significant marker for distinguishing healthy from pathological vocal function (p=0.006). Conclusions: The detection-gated architecture provides a lightweight, computationally efficient solution (~35 frames/s) for real-time clinical use. By enabling robust zero-shot transfer, this framework facilitates the standardized, large-scale extraction of clinical biomarkers across diverse endoscopy platforms. Code, trained weights, and evaluation scripts are released at this https URL. 
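Among the kinematic features validated above, the coefficient of variation of the glottal area is just the standard deviation divided by the mean of the per-frame area trace; a higher CV indicates more cycle-to-cycle variability. A minimal sketch with hypothetical area traces (the clinical threshold and p=0.006 finding come from the paper's cohort, not from these toy numbers):

```python
import numpy as np

def coefficient_of_variation(signal):
    """CV = standard deviation / mean: a dimensionless measure of the
    variability of the glottal-area waveform across frames."""
    signal = np.asarray(signal, dtype=float)
    return signal.std() / signal.mean()

# Hypothetical glottal-area traces (arbitrary units per frame).
steady   = [10.0, 10.5, 9.5, 10.2, 9.8]   # regular vibration
variable = [10.0, 18.0, 4.0, 15.0, 3.0]   # irregular vibration
print(coefficient_of_variation(steady) < coefficient_of_variation(variable))  # True
```

This is also why the detection gating matters downstream: spurious segmentations in non-glottal frames would inject artificial variability into exactly this statistic.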
[CV-19] π-StepNFT: Wider Space Needs Finer Steps in Online RL for Flow-based VLAs
【速读】:该论文旨在解决基于流的视觉-语言-动作(Vision-Language-Action, VLA)模型在多步采样过程中因似然函数难以计算而导致在线强化学习受限的问题。其核心解决方案是提出一种无需评判器(critic)和似然估计的端到端训练框架——π-StepNFT(Step-wise Negative-aware Fine-Tuning),该方法通过每轮优化仅需一次前向传播即可完成训练,同时摒弃了辅助价值网络(value network),从而显著提升训练效率与稳定性。关键创新在于识别出更广的探索空间需要细粒度的、逐步骤的对齐引导机制,使得模型在少样本场景下具备更强鲁棒性,并在跨域(OOD)泛化能力上优于基于价值的基线方法,有效避免了对多模态特征的过拟合。
链接: https://arxiv.org/abs/2603.02083
作者: Siting Wang,Xiaofeng Wang,Zheng Zhu,Minnan Pei,Xinyu Cui,Cheng Deng,Jian Zhao,Guan Huang,Haifeng Zhang,Jun Wang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Flow-based vision-language-action (VLA) models excel in embodied control but suffer from intractable likelihoods during multi-step sampling, hindering online reinforcement learning. We propose π-StepNFT (Step-wise Negative-aware Fine-Tuning), a critic-and-likelihood-free framework that requires only a single forward pass per optimization step and eliminates auxiliary value networks. We identify that wider exploration spaces necessitate finer-grained, step-wise guidance for alignment. Empirically, π-StepNFT unlocks latent potential on LIBERO with competitive few-shot robustness. Moreover, it achieves superior generalization on ManiSkill, outperforming value-based baselines in OOD scenarios by preventing overfitting to multimodal features. This property offers a scalable solution promising for complex real-world applications.
[CV-20] From Pixels to Patches: Pooling Strategies for Earth Embeddings
【速读】:该论文旨在解决当前地物分类任务中,基于像素级嵌入(pixel-level embeddings)的地理空间基础模型在聚合为图像块表示(patch representations)时,因采用默认的均值池化(mean pooling)方法而导致的空间位移鲁棒性下降问题。均值池化会丢失块内像素间的差异信息,在空间偏移场景下准确率可下降超过10%。其解决方案的关键在于引入更丰富的池化策略,如广义均值池化(Generalized Mean Pooling, GeM)和统计池化(Stats pooling),前者无需增加嵌入维度即可提升精度,后者通过融合最小值、最大值、均值与标准差等分布统计特征实现最高性能,且对高维嵌入更具优势,从而显著缩小地理泛化差距(geographic generalization gap)达40%。
链接: https://arxiv.org/abs/2603.02080
作者: Isaac Corley,Caleb Robinson,Inbal Becker-Reshef,Juan M. Lavista Ferres
机构: Wherobots; Microsoft AI for Good Research Lab
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:As geospatial foundation models shift from patch-level to pixel-level embeddings, practitioners must aggregate thousands of pixel vectors into patch representations that preserve class-discriminative signal while matching downstream label resolution. The default choice, mean pooling, discards within-patch variability and can drop accuracy by more than 10% under spatial shift. To evaluate this effect, we introduce EuroSAT-Embed: 81,000 embedding GeoTIFFs derived from three foundation models: AlphaEarth, OlmoEarth, and Tessera. We benchmark 11 training-free and 2 parametric pooling methods under both random and geographically disjoint test splits. Our results show that richer pooling schemes reduce the geographic generalization gap by up to 40% relative to mean pooling and increases accuracy by up to 5% on spatial splits. We recommend Generalized Mean Pooling (GeM) as a drop-in replacement for mean pooling: it improves accuracy without increasing embedding dimensionality. For maximum accuracy, Stats pooling (concatenation of min/max/mean/std pooling) performs best at 4x the embedding size. We further find that pooling effectiveness varies across embedding sources and that higher-dimensional embeddings benefit most from distributional statistics.
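下面给出摘要中提到的两种池化方案的一个最小 numpy 示意:GeM(广义均值池化)与 Stats 池化(min/max/mean/std 拼接,输出维度为原嵌入的 4 倍)。仅为笔者的示例实现,函数名与张量形状均为假设,并非论文官方代码:

```python
import numpy as np

def gem_pool(x, p=3.0, eps=1e-6):
    # Generalized Mean (GeM) pooling over pixel embeddings.
    # x: (n_pixels, dim); p=1 recovers mean pooling, large p approaches max.
    x = np.clip(x, eps, None)  # GeM assumes non-negative activations
    return np.power(np.mean(np.power(x, p), axis=0), 1.0 / p)

def stats_pool(x):
    # Stats pooling: concatenation of min/max/mean/std -> 4x embedding size.
    return np.concatenate([x.min(axis=0), x.max(axis=0),
                           x.mean(axis=0), x.std(axis=0)])

rng = np.random.default_rng(0)
patch = rng.random((256, 8))   # toy patch: 256 pixels, 8-dim embeddings
print(gem_pool(patch).shape)   # (8,)
print(stats_pool(patch).shape) # (32,)
```

由于 GeM 在 p=1 时退化为均值池化,它可以作为均值池化的"即插即用"替代而不改变嵌入维度,这与摘要中的推荐一致。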
[CV-21] MMNavAgent: Multi-Magnification WSI Navigation Agent for Clinically Consistent Whole-Slide Analysis
【速读】:该论文旨在解决当前生成式AI在全切片图像(Whole-Slide Image, WSI)诊断中难以模拟临床病理医生多倍率动态浏览与跨倍率信息交互的问题。现有方法通常仅在固定倍率下操作或依赖预设的倍率遍历策略,无法实现类似人类专家根据诊断需求自适应选择观察尺度并整合不同层级结构信息的能力。其解决方案的关键在于提出一个临床一致的多倍率WSI导航代理(Multi-Magnification WSI Navigation Agent, MMNavAgent),其中包含两个核心组件:一是跨倍率导航工具(Cross-Magnification Navigation Tool, CMT),用于聚合相邻倍率的上下文信息以增强导航路径上的判别性表征;二是倍率选择工具(Magnification Selection Tool, MST),通过代理框架内的记忆驱动推理机制实现交互式、自适应的倍率选择,从而模拟病理医生逐级决策的诊断流程。
链接: https://arxiv.org/abs/2603.02079
作者: Zhengyang Xu,Han Li,Jingsong Liu,Linrui Xie,Xun Ma,Xin You,Shihui Zu,Ayako Ito,Xinyu Hao,Hongming Xu,Shaohua Kevin Zhou,Nassir Navab,Peter J. Schüffler
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent AI navigation approaches aim to improve Whole-Slide Image (WSI) diagnosis by modeling spatial exploration and selecting diagnostically relevant regions, yet most operate at a single fixed magnification or rely on predefined magnification traversal. In clinical practice, pathologists examine slides across multiple magnifications and selectively inspect only necessary scales, dynamically integrating global and cellular evidence in a sequential manner. This mismatch prevents existing methods from modeling the cross-magnification interactions and adaptive magnification selection inherent to real diagnostic workflows. To address this, we propose a clinically consistent Multi-Magnification WSI Navigation Agent (MMNavAgent) that explicitly models multi-magnification interaction and adaptive magnification selection. Specifically, we introduce a Cross-Magnification navigation Tool (CMT) that aggregates contextual information from adjacent magnifications to enhance discriminative representations along the navigation path. We further introduce a Magnification Selection Tool (MST) that leverages memory-driven reasoning within the agent framework to enable interactive and adaptive magnification selection, mimicking the sequential decision process of pathologists. Extensive experiments on a public dataset demonstrate improved diagnostic performance, with gains of 1.45% in AUC and 2.93% in BACC over a non-agent baseline. Code will be made public upon acceptance.
[CV-22] ORGAN: Object-Centric Representation Learning using Cycle Consistent Generative Adversarial Networks
【速读】:该论文旨在解决从图像中提取信息的难题,特别是通过无监督方式实现对象中心的表征学习(object-centric representation learning),即自动将图像分割为独立对象并将其映射到低维潜在空间以支持下游任务。其核心解决方案是提出一种基于循环一致性生成对抗网络(cycle-consistent Generative Adversarial Networks)的新方法 ORGAN,相较于主流的自编码器架构(AEs),ORGAN 在处理高复杂度真实世界数据(如多对象场景和低视觉对比度)时表现出更强的鲁棒性,并能生成具有表达力的潜在表示以支持对象级操作,同时在对象数量和图像尺寸上均展现出良好的可扩展性。
链接: https://arxiv.org/abs/2603.02063
作者: Joël Küchler,Ellen van Maren,Vaiva Vasiliauskaitė,Katarina Vulić,Reza Abbasi-Asl,Stephan J. Ihle
机构: University and ETH Zurich (苏黎世联邦理工学院和苏黎世大学); Insel Gruppe (因塞尔集团); University of California, San Francisco (加州大学旧金山分校); University of Chicago (芝加哥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: GitHub: this https URL
Abstract:Although data generation is often straightforward, extracting information from data is more difficult. Object-centric representation learning can extract information from images in an unsupervised manner. It does so by segmenting an image into its subcomponents: the objects. Each object is then represented in a low-dimensional latent space that can be used for downstream processing. Object-centric representation learning is dominated by autoencoder architectures (AEs). Here, we present ORGAN, a novel approach for object-centric representation learning, which is based on cycle-consistent Generative Adversarial Networks instead. We show that it performs similarly to other state-of-the-art approaches on synthetic datasets, while at the same time being the only approach tested here capable of handling more challenging real-world datasets with many objects and low visual contrast. Complementing these results, ORGAN creates expressive latent space representations that allow for object manipulation. Finally, we show that ORGAN scales well both with respect to the number of objects and the size of the images, giving it a unique edge over current state-of-the-art approaches.
[CV-23] WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories
【速读】:该论文旨在解决当前基础视频扩散模型(Video Diffusion Models, VDMs)在生成视频时存在的两大核心问题:一是难以实现精确的相机控制,二是生成内容在不同视角下缺乏一致性,导致无法重建高质量的3D场景。解决方案的关键在于提出名为WorldStereo的新框架,其通过两个专用的几何记忆模块实现突破:一是全局几何记忆(global-geometric memory),用于注入逐步更新的点云结构先验并实现精准相机控制;二是空间立体记忆(spatial-stereo memory),通过3D对应关系约束注意力机制,聚焦于记忆库中的细粒度细节。这两个模块协同作用,使模型能够在精确相机引导下生成多视角一致的视频,并支持高保真3D重建,且无需联合训练即可高效利用预训练扩散模型。
链接: https://arxiv.org/abs/2603.02049
作者: Yisu Zhang,Chenjie Cao,Tengfei Wang,Xuhui Zuo,Junta Wu,Jianke Zhu,Chunchao Guo
机构: Zhejiang University (浙江大学); Tencent Hunyuan (腾讯混元)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in foundational Video Diffusion Models (VDMs) have yielded significant progress. Yet, despite the remarkable visual quality of generated videos, reconstructing consistent 3D scenes from these outputs remains challenging, due to limited camera controllability and inconsistent generated content when viewed from distinct camera trajectories. In this paper, we propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules. Formally, the global-geometric memory enables precise camera control while injecting coarse structural priors through incrementally updated point clouds. Moreover, the spatial-stereo memory constrains the model’s attention receptive fields with 3D correspondence to focus on fine-grained details from the memory bank. These components enable WorldStereo to generate multi-view-consistent videos under precise camera control, facilitating high-quality 3D reconstruction. Furthermore, the flexible control branch-based WorldStereo shows impressive efficiency, benefiting from the distribution matching distilled VDM backbone without joint training. Extensive experiments across both camera-guided video generation and 3D reconstruction benchmarks demonstrate the effectiveness of our approach. Notably, we show that WorldStereo acts as a powerful world model, tackling diverse scene generation tasks (whether starting from perspective or panoramic images) with high-fidelity 3D results. Models will be released.
[CV-24] NICO-RAG: Multimodal Hypergraph Retrieval-Augmented Generation for Understanding the Nicotine Public Health Crisis
【速读】:该论文旨在解决烟草行业通过新型尼古丁产品(如含味尼古丁袋)持续吸引青少年和新用户,从而削弱公共卫生控烟成效的问题。其核心挑战在于现有研究在数据规模和跨模态关联分析能力上的局限性。解决方案的关键是构建NICO数据集(包含超过20万条多模态样本)与NICO-RAG框架:前者提供大规模图文数据支持,后者采用基于超图的联合多模态知识表示方法,在不依赖高成本图像token处理的前提下,实现基于视觉相似性和语义相似性的双重检索机制,从而高效生成事实准确的回答。
链接: https://arxiv.org/abs/2603.02047
作者: Manuel Serna-Aguilera,Raegan Anderes,Page Dobbs,Khoa Luu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The nicotine addiction public health crisis continues to be pervasive. In this century alone, the tobacco industry has released and marketed new products in an aggressive effort to lure new and young customers for life. Such innovations and product development, namely flavored nicotine or tobacco such as nicotine pouches, have undone years of anti-tobacco campaign work. Past work is limited both in scope and in its ability to connect large-scale data points. Thus, we introduce the Nicotine Innovation Counter-Offensive (NICO) Dataset to provide public health researchers with over 200,000 multimodal samples, including images and text descriptions, on 55 tobacco and nicotine product brands. In addition, to provide public health researchers with factual connections across a large-scale dataset, we propose NICO-RAG, a retrieval-augmented generation (RAG) framework that can retrieve image features without incurring the high cost of language models, as well as the added cost of processing image tokens with large-scale datasets such as NICO. At construction time, NICO-RAG organizes image- and text-extracted entities and relations into hypergraphs to produce responses that are as factual as possible. This joint multimodal knowledge representation enables NICO-RAG to retrieve images for query answering not only by visual similarity but also by the semantic similarity of image descriptions. Experiments show that, without needing to process additional tokens from images for over 100 questions, NICO-RAG performs comparably to the state-of-the-art RAG method adapted for images.
[CV-25] LAD-Drive: Bridging Language and Trajectory with Action-Aware Diffusion Transformers
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在自动驾驶场景中将离散语义知识转化为连续轨迹时面临的根本性挑战,尤其是现有方法依赖单模态规划头导致对多模态驾驶行为表征能力受限,以及生成式方法常采用one-hot编码动作而忽略导航不确定性的问题。其解决方案的关键在于提出LAD-Drive框架,通过结构化解耦高层意图(high-level intention)与低层空间规划(low-level spatial planning),引入一个动作解码器以推断概率化的元动作分布,从而显式建模信念状态(belief state)以保留传统one-hot编码所丢失的细微意图信息;该分布与车辆运动学状态融合后,进一步作为条件输入至一个动作感知的扩散解码器(action-aware diffusion decoder),利用截断去噪过程将学习到的运动锚点优化为安全且符合运动学约束的轨迹。
链接: https://arxiv.org/abs/2603.02035
作者: Fabian Schmidt,Karol Fedurko,Markus Enzweiler,Abhinav Valada
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While multimodal large language models (MLLMs) provide advanced reasoning for autonomous driving, translating their discrete semantic knowledge into continuous trajectories remains a fundamental challenge. Existing methods often rely on unimodal planning heads that inherently limit their ability to represent multimodal driving behavior. Furthermore, most generative approaches frequently condition on one-hot encoded actions, discarding the nuanced navigational uncertainty critical for complex scenarios. To resolve these limitations, we introduce LAD-Drive, a generative framework that structurally disentangles high-level intention from low-level spatial planning. LAD-Drive employs an action decoder to infer a probabilistic meta-action distribution, establishing an explicit belief state that preserves the nuanced intent typically lost by one-hot encodings. This distribution, fused with the vehicle’s kinematic state, conditions an action-aware diffusion decoder that utilizes a truncated denoising process to refine learned motion anchors into safe, kinematically feasible trajectories. Extensive evaluations on the LangAuto benchmark demonstrate that LAD-Drive achieves state-of-the-art results, outperforming competitive baselines by up to 59% in Driving Score while significantly reducing route deviations and collisions. We will publicly release the code and models on this https URL.
[CV-26] MAP-Diff: Multi-Anchor Guided Diffusion for Progressive 3D Whole-Body Low-Dose PET Denoising
【速读】:该论文旨在解决低剂量正电子发射断层成像(Positron Emission Tomography, PET)中因辐射暴露降低导致的图像噪声严重和定量精度下降问题。现有基于扩散模型的去噪方法虽能实现高质量重建,但其反向轨迹通常缺乏约束,未能与PET剂量逐步形成的过程对齐。解决方案的关键在于提出一种多锚点引导的扩散框架(MAP-Diff),通过引入临床观测到的中间剂量扫描作为轨迹锚点,并施加时间步依赖的监督信号,以规范反向过程朝向剂量一致的中间状态演化;同时,利用模拟扩散退化与真实多剂量PET配对数据进行锚点时间步校准,并采用时间步加权锚点损失稳定分阶段学习。该方法仅需超低剂量输入即可实现渐进式、剂量一致的中间重建,在内部和跨扫描仪数据集上均显著优于CNN、Transformer、GAN及扩散基线模型。
链接: https://arxiv.org/abs/2603.02012
作者: Peiyuan Jing,Chun-Wun Cheng,Liutao Yang,Zhenxuan Zhang,Thiago V. Lima,Klaus Strobel,Antoine Leimgruber,Angelica Aviles-Rivero,Guang Yang,Javier A. Montoya-Zegarra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 3 figures
Abstract:Low-dose Positron Emission Tomography (PET) reduces radiation exposure but suffers from severe noise and quantitative degradation. Diffusion-based denoising models achieve strong final reconstructions, yet their reverse trajectories are typically unconstrained and not aligned with the progressive nature of PET dose formation. We propose MAP-Diff, a multi-anchor guided diffusion framework for progressive 3D whole-body PET denoising. MAP-Diff introduces clinically observed intermediate-dose scans as trajectory anchors and enforces timestep-dependent supervision to regularize the reverse process toward dose-aligned intermediate states. Anchor timesteps are calibrated via degradation matching between simulated diffusion corruption and real multi-dose PET pairs, and a timestep-weighted anchor loss stabilizes stage-wise learning. At inference, the model requires only ultra-low-dose input while enabling progressive, dose-consistent intermediate restoration. Experiments on internal (Siemens Biograph Vision Quadra) and cross-scanner (United Imaging uEXPLORER) datasets show consistent improvements over strong CNN-, Transformer-, GAN-, and diffusion-based baselines. On the internal dataset, MAP-Diff improves PSNR from 42.48 dB to 43.71 dB (+1.23 dB), increases SSIM to 0.986, and reduces NMAE from 0.115 to 0.103 (-0.012) compared to 3D DDPM. Performance gains generalize across scanners, achieving 34.42 dB PSNR and 0.141 NMAE on the external cohort, outperforming all competing methods.
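摘要提到的"时间步加权锚点损失"可以用如下最小示意来理解:在若干经校准的锚点时间步上监督反向扩散轨迹,并按时间步加权。注意其中的线性权重函数与均方误差形式均为笔者假设,并非论文原式:

```python
import numpy as np

def timestep_weighted_anchor_loss(preds, anchors, timesteps, T=1000):
    # Supervise the reverse diffusion trajectory at calibrated anchor
    # timesteps; the linear weighting below is illustrative only.
    w = 1.0 - np.asarray(timesteps, dtype=float) / T  # later (cleaner) stages weigh more
    errs = np.array([np.mean((p - a) ** 2) for p, a in zip(preds, anchors)])
    return float(np.sum(w * errs) / np.sum(w))

# two anchors, e.g. a mid-dose and a full-dose reconstruction target
x_mid = np.full((4, 4), 0.5)
x_full = np.ones((4, 4))
loss = timestep_weighted_anchor_loss([x_mid, x_full], [x_mid, x_full], [600, 100])
print(loss)  # 0.0 for perfect intermediate predictions
```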
[CV-27] Learning Vision-Based Omnidirectional Navigation: A Teacher-Student Approach Using Monocular Depth Estimation
【速读】:该论文旨在解决工业环境中可靠障碍物避障问题,传统依赖2D激光雷达(LiDAR)的方案因仅感知单一水平截面而无法有效识别高于或低于扫描平面的关键障碍物。其解决方案的关键在于提出一种教师-学生(teacher-student)框架:教师策略在NVIDIA Isaac Lab中通过近端策略优化(Proximal Policy Optimization, PPO)训练,利用包含完整机器人轮廓信息的2D LiDAR观测学习鲁棒导航;随后将该策略的知识蒸馏至仅依赖单目深度估计(Monocular Depth Estimation, MDE)的学生策略,该MDE由四个RGB相机输入驱动,并基于微调后的Depth Anything V2模型生成深度图。整个推理流程(包括MDE、策略执行和电机控制)完全在搭载于DJI RoboMaster平台的NVIDIA Jetson Orin AGX上运行,无需外部计算资源,在仿真和真实场景中均显著优于传统2D LiDAR方法,尤其在处理复杂三维几何障碍物(如悬垂结构和低矮物体)时表现突出。
链接: https://arxiv.org/abs/2603.01999
作者: Jan Finke,Wayne Paul Martis,Adrian Schmelter,Lars Erbach,Christian Jestel,Marvin Wiedemann
机构: Fraunhofer Institute for Material Flow and Logistics (弗劳恩霍夫材料流与物流研究所); The Lamarr Institute for Machine Learning and Artificial Intelligence (拉马尔机器学习与人工智能研究所)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Reliable obstacle avoidance in industrial settings demands 3D scene understanding, but widely used 2D LiDAR sensors perceive only a single horizontal slice of the environment, missing critical obstacles above or below the scan plane. We present a teacher-student framework for vision-based mobile robot navigation that eliminates the need for LiDAR sensors. A teacher policy trained via Proximal Policy Optimization (PPO) in NVIDIA Isaac Lab leverages privileged 2D LiDAR observations that account for the full robot footprint to learn robust navigation. The learned behavior is distilled into a student policy that relies solely on monocular depth maps predicted by a fine-tuned Depth Anything V2 model from four RGB cameras. The complete inference pipeline, comprising monocular depth estimation (MDE), policy execution, and motor control, runs entirely onboard an NVIDIA Jetson Orin AGX mounted on a DJI RoboMaster platform, requiring no external computation for inference. In simulation, the student achieves success rates of 82-96.5%, consistently outperforming the standard 2D LiDAR teacher (50-89%). In real-world experiments, the MDE-based student outperforms the 2D LiDAR teacher when navigating around obstacles with complex 3D geometries, such as overhanging structures and low-profile objects, that fall outside the single scan plane of a 2D LiDAR.
[CV-28] Event-Only Drone Trajectory Forecasting with RPM-Modulated Kalman Filtering
【速读】:该论文旨在解决基于事件相机(event camera)的无人机轨迹预测问题,尤其针对传统方法在高速运动场景下性能受限、依赖RGB图像或训练数据的局限性。其解决方案的关键在于利用事件数据中直接提取的螺旋桨旋转速度(RPM)信息,并将其融合进一个RPM感知的卡尔曼滤波框架中,从而实现无需RGB影像或训练数据的高精度短中期轨迹预测。
链接: https://arxiv.org/abs/2603.01997
作者: Hari Prasanth S.M.,Pejman Habibiroudkenar,Eerik Alamikkotervo,Dimitrios Bouzoulas,Risto Ojala
机构: Aalto University (阿尔托大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: Submitted to ICUAS 2026 conference
Abstract:Event cameras provide high-temporal-resolution visual sensing that is well suited for observing fast-moving aerial objects; however, their use for drone trajectory prediction remains limited. This work introduces an event-only drone forecasting method that exploits propeller-induced motion cues. Propeller rotational speeds are extracted directly from raw event data and fused within an RPM-aware Kalman filtering framework. Evaluations on the FRED dataset show that the proposed method outperforms learning-based approaches and a vanilla Kalman filter in terms of average distance error and final distance error at 0.4s and 0.8s forecasting horizons. The results demonstrate robust and accurate short- and medium-horizon trajectory forecasting without reliance on RGB imagery or training data.
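"RPM 感知的卡尔曼滤波"的一种可能实现思路如下:用从事件流提取的螺旋桨转速调制过程噪声(转速越高意味着机动越剧烈,过程噪声越大)。以下为一维匀速模型的最小示意,其中 RPM 调制函数、参数取值均为笔者假设,并非论文原方法:

```python
import numpy as np

dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])  # 1D constant-velocity model
H = np.array([[1.0, 0.0]])             # position-only measurement
R = np.array([[0.01]])

def rpm_process_noise(rpm, base=1e-3, rpm_ref=5000.0):
    # Hypothetical modulation: higher propeller RPM suggests more
    # aggressive maneuvering, so inflate the process noise.
    Q0 = np.array([[dt**3 / 3, dt**2 / 2], [dt**2 / 2, dt]])
    return base * (1.0 + rpm / rpm_ref) * Q0

def kf_step(x, P, z, rpm):
    # Standard Kalman predict + update with an RPM-modulated Q.
    Q = rpm_process_noise(rpm)
    x, P = F @ x, F @ P @ F.T + Q
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(2) - K @ H) @ P
    return x, P

x, P = np.zeros(2), np.eye(2)
for t in range(50):  # noiseless drone track moving at 0.5 m/s
    x, P = kf_step(x, P, np.array([0.5 * t * dt]), rpm=8000.0)
print(x)  # position near 2.45, velocity near 0.5
```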
[CV-29] Process Over Outcome: Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection
【速读】:该论文旨在解决当前多媒体篡改检测方法在泛化能力上的不足问题,现有方法多依赖于结果导向的监督信号进行篡改类型分类,缺乏可解释性且易过拟合于表层伪影,难以应对未见过的篡改模式。解决方案的关键在于提出一种基于推理驱动(reasoning-driven)的框架 REFORM,其核心创新是将学习目标从结果拟合转向过程建模,通过三阶段课程学习策略:首先诱导伪造推理依据(forensic rationales),再对齐推理与最终判断,最后利用强化学习优化逻辑一致性,从而实现更鲁棒和可泛化的篡改检测。
链接: https://arxiv.org/abs/2603.01993
作者: Yuchen Zhang,Yaxiong Wang,Kecheng Han,Yujiao Wu,Lianwei Wu,Li Zhu,Zhedong Zheng
机构: Xi’an Jiaotong University (西安交通大学); Hefei University of Technology (合肥工业大学); CSIRO; Northwestern Polytechnical University (西北工业大学); University of Macau (澳门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in generative AI have significantly enhanced the realism of multimodal media manipulation, thereby posing substantial challenges to manipulation detection. Existing manipulation detection and grounding approaches predominantly focus on manipulation type classification under result-oriented supervision, which not only lacks interpretability but also tends to overfit superficial artifacts. In this paper, we argue that generalizable detection requires incorporating explicit forensic reasoning, rather than merely classifying a limited set of manipulation types, which fails to generalize to unseen manipulation patterns. To this end, we propose REFORM, a reasoning-driven framework that shifts learning from outcome fitting to process modeling. REFORM adopts a three-stage curriculum that first induces forensic rationales, then aligns reasoning with final judgments, and finally refines logical consistency via reinforcement learning. To support this paradigm, we introduce ROM, a large-scale dataset with rich reasoning annotations. Extensive experiments show that REFORM establishes new state-of-the-art performance with superior generalization, achieving 81.52% ACC on ROM, 76.65% ACC on DGM4, and 74.9 F1 on MMFakeBench.
[CV-30] Robust White Blood Cell Classification with Stain-Normalized Decoupled Learning and Ensembling
【速读】:该论文旨在解决白细胞(White Blood Cell, WBC)分类任务中面临的两大挑战:一是由于染色和扫描条件差异导致的图像外观变化较大,二是类别分布严重不均衡,常见细胞类型占主导地位,而临床重要的罕见类别样本不足。解决方案的关键在于提出一种染色归一化、解耦训练框架:首先通过实例平衡采样学习可迁移的特征表示,随后利用类别感知采样与混合损失函数(结合有效数量加权和焦点调制)对分类器进行再平衡;在推理阶段进一步采用多模型集成与测试时增强策略提升鲁棒性。该方法在ISBI 2026的WBCBench 2026挑战赛中取得第一名。
链接: https://arxiv.org/abs/2603.01976
作者: Luu Le,Hoang-Loc Cao,Ha-Hieu Pham,Thanh-Huy Nguyen,Ulas Bagci
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:White blood cell (WBC) classification is fundamental for hematology applications such as infection assessment, leukemia screening, and treatment monitoring. However, real-world WBC datasets present substantial appearance variations caused by staining and scanning conditions, as well as severe class imbalance in which common cell types dominate while rare but clinically important categories are underrepresented. To address these challenges, we propose a stain-normalized, decoupled training framework that first learns transferable representations using instance-balanced sampling, and then rebalances the classifier with class-aware sampling and a hybrid loss combining effective-number weighting and focal modulation. At the inference stage, we further enhance robustness by ensembling various trained backbones with test-time augmentation. Our approach achieved the top rank on the leaderboard of the WBCBench 2026: Robust White Blood Cell Classification Challenge at ISBI 2026.
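摘要中"有效数量加权 + 焦点调制"的混合损失可以按公开文献中的标准形式做一个最小示意(有效数量权重沿用 Cui et al., 2019 的公式;归一化方式与超参数为笔者假设,不代表论文的具体实现):

```python
import numpy as np

def effective_number_weights(counts, beta=0.999):
    # Class-balanced weights via the "effective number of samples"
    # (Cui et al., 2019): w_c ∝ (1 - beta) / (1 - beta**n_c).
    counts = np.asarray(counts, dtype=float)
    w = (1.0 - beta) / (1.0 - np.power(beta, counts))
    return w * len(counts) / w.sum()  # normalize to mean 1

def class_balanced_focal_loss(probs, labels, weights, gamma=2.0):
    # Focal modulation (1-p)^gamma on top of class-balanced weights.
    p = probs[np.arange(len(labels)), labels]
    return float(np.mean(weights[labels] * (1.0 - p) ** gamma * -np.log(p)))

counts = [5000, 500, 50]  # head class vs. rare WBC classes
w = effective_number_weights(counts)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
print(w)  # the rarest class receives the largest weight
print(class_balanced_focal_loss(probs, np.array([0, 2]), w))
```

这类损失让稀有但临床重要的细胞类别在梯度中占更大比重,同时焦点项抑制易分类样本的贡献。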
[CV-31] Closed-Loop Action Chunks with Dynamic Corrections for Training-Free Diffusion Policy ICRA2026
【速读】:该论文旨在解决基于扩散模型的策略在动态场景中适应性差、响应延迟或任务失败的问题,尤其是在机器人操作中对环境变化的实时响应能力不足。解决方案的关键在于提出一种动态闭环扩散策略(Dynamic Closed-Loop Diffusion Policy, DCDP),其核心创新包括:采用分块动作生成(chunk-based action generation)与实时校正机制相结合,引入自监督动态特征编码器以提取环境动态信息,并通过交叉注意力融合与非对称动作编码器-解码器结构,在动作执行前注入环境动态信息,从而实现闭环动作修正。该方法在不重新训练的情况下使动态PushT仿真中的适应性提升19%,且仅增加5%的计算开销,同时具备模块化设计优势,可直接集成于现有系统中,兼顾时间一致性与实时响应性能。
链接: https://arxiv.org/abs/2603.01953
作者: Pengyuan Wu,Pingrui Zhang,Zhigang Wang,Dong Wang,Bin Zhao,Xuelong Li
机构: Zhejiang University (浙江大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Fudan University (复旦大学); Northwestern Polytechnical University (西北工业大学); Institute of Artificial Intelligence, China Telecom Corp Ltd (中国电信集团有限公司人工智能研究院)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICRA2026
Abstract:Diffusion-based policies have achieved remarkable results in robotic manipulation but often struggle to adapt rapidly in dynamic scenarios, leading to delayed responses or task failures. We present DCDP, a Dynamic Closed-Loop Diffusion Policy framework that integrates chunk-based action generation with real-time correction. DCDP integrates a self-supervised dynamic feature encoder, cross-attention fusion, and an asymmetric action encoder-decoder to inject environmental dynamics before action execution, achieving real-time closed-loop action correction and enhancing the system’s adaptability in dynamic scenarios. In dynamic PushT simulations, DCDP improves adaptability by 19% without retraining while requiring only 5% additional computation. Its modular design enables plug-and-play integration, achieving both temporal coherence and real-time responsiveness in dynamic robotic scenarios, including real-world manipulation tasks. The project page is at: this https URL
[CV-32] PreSight: Preoperative Outcome Prediction for Parkinson's Disease via Region-Prior Morphometry and Patient-Specific Weighting
【速读】:该论文旨在解决帕金森病(Parkinson’s disease, PD)术前改善率预测难题,其核心挑战在于术前影像信号微弱且患者群体异质性强,导致个体化术后运动功能改善预测困难。解决方案的关键在于提出PreSight模型,该模型融合临床先验知识与术前磁共振成像(MRI)及基于形变的形态测量学(deformation-based morphometry, DBM)特征,并通过患者特异性加权模块自适应调整区域重要性,从而实现端到端、校准良好且具备患者层面解释性的预测输出。该方法在包含400名受试者的多中心队列中显著优于临床、仅影像和多模态基线模型,在内部验证和外部测试中分别达到88.89%和85.29%的响应者分类准确率,同时展现出更优的概率校准和决策曲线净收益。
链接: https://arxiv.org/abs/2603.01948
作者: Yand Wang,Chen Zhang,Lanyun Zhu,Yixin Chen,Qunbo Wang,Yutong Bai,Jurgen Germann,Yinghong Wen,Shuai Shao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Preoperative improvement rate prediction for Parkinson’s disease surgery is clinically important yet difficult because imaging signals are subtle and patients are heterogeneous. We address this setting, where only information available before surgery is used, and the goal is to predict patient-specific postoperative motor benefit. We present PreSight, a presurgical outcome model that fuses clinical priors with preoperative MRI and deformation-based morphometry (DBM) and adapts regional importance through a patient-specific weighting module. The model produces end-to-end, calibrated, decision-ready predictions with patient-level explanations. We evaluate PreSight on a real-world two-center cohort of 400 subjects with multimodal presurgical inputs and postoperative improvement labels. PreSight outperforms strong clinical, imaging-only, and multimodal baselines. It attains 88.89% accuracy on internal validation and 85.29% on an external-center test for responder classification and shows better probability calibration and higher decision-curve net benefit. Ablations and analyses confirm the contribution of DBM and the patient-specific weighting module and indicate that the model emphasizes disease-relevant regions in a patient-specific manner. These results demonstrate that integrating clinical prior knowledge with region-adaptive morphometry enables reliable presurgical decision support in routine practice.
[CV-33] physfusion: A Transformer-based Dual-Stream Radar and Vision Fusion Framework for Open Water Surface Object Detection
【速读】:该论文针对无人水面艇(Unmanned Surface Vehicles, USVs)在复杂海洋环境中进行水面对目标检测的难题,旨在解决波浪杂波、镜面反射及远距离观测下目标外观线索微弱等问题。传统雷达与视觉融合方法难以有效利用雷达点云稀疏、间断且散射特性导致的重尾分布反射率变化。其核心解决方案是提出PhysFusion框架,关键创新在于:(1) 物理信息雷达编码器(Physics-Informed Radar Encoder, PIR Encoder),通过RCS映射和质量门机制将每个雷达点属性转化为紧凑的散射先验并预测点级可靠性,提升在杂波中的特征学习鲁棒性;(2) 雷达引导的交互式融合模块(Radar-guided Interactive Fusion Module, RIFM),基于双流骨干网络(点级局部流与基于Scattering-Aware Self-Attention, SASA的Transformer全局流)实现查询级别的雷达-图像特征融合;(3) 时间查询聚合模块(Temporal Query Aggregation, TQA),在短时间窗口内聚合帧级融合查询以获得时序一致性表示。该方法在WaterScenes和FLOW数据集上显著优于现有方法,验证了物理建模与多模态协同推理的有效性。
链接: https://arxiv.org/abs/2603.01947
作者: Yuting Wan,Liguo Sun,Jiuwu Hao,Zao Zhang,Pin LV
机构: University of Chinese Academy of Sciences (中国科学院大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Detecting water-surface targets for Unmanned Surface Vehicles (USVs) is challenging due to wave clutter, specular reflections, and weak appearance cues in long-range observations. Although 4D millimeter-wave radar complements cameras under degraded illumination, maritime radar point clouds are sparse and intermittent, with reflectivity attributes exhibiting heavy-tailed variations under scattering and multipath, making conventional fusion designs struggle to exploit radar cues effectively. We propose PhysFusion, a physics-informed radar-image detection framework for water-surface perception. The framework integrates: (1) a Physics-Informed Radar Encoder (PIR Encoder) with an RCS Mapper and Quality Gate, transforming per-point radar attributes into compact scattering priors and predicting point-wise reliability for robust feature learning under clutter; (2) a Radar-guided Interactive Fusion Module (RIFM) performing query-level radar-image fusion between semantically enriched radar features and multi-scale visual features, with the radar branch modeled by a dual-stream backbone including a point-based local stream and a transformer-based global stream using Scattering-Aware Self-Attention (SASA); and (3) a Temporal Query Aggregation module (TQA) aggregating frame-wise fused queries over a short temporal window for temporally consistent representations. Experiments on WaterScenes and FLOW demonstrate that PhysFusion achieves 59.7% mAP50:95 and 90.3% mAP50 on WaterScenes (T=5 radar history) using 5.6M parameters and 12.5G FLOPs, and reaches 94.8% mAP50 and 46.2% mAP50:95 on FLOW under radar+camera setting. Ablation studies quantify the contributions of PIR Encoder, SASA-based global reasoning, and RIFM. 
Cite as: arXiv:2603.01947 [cs.CV] https://doi.org/10.48550/arXiv.2603.01947 Submission history: From: Yuting Wan [v1] Mon, 2 Mar 2026 15:00:22 UTC (4,061 KB)
[CV-34] MobileMold: A Smartphone-Based Microscopy Dataset for Food Mold Detection
【速读】:该论文旨在解决食品霉变检测中缺乏低成本、便携且高精度的成像与识别方法的问题,尤其针对传统肉眼检查难以发现微观霉菌结构的局限性。解决方案的关键在于构建了一个开放的智能手机显微成像数据集 MobileMold,其中包含4,941张由不同手机和显微镜在真实场景下采集的食物霉变图像,涵盖11类食品,并在此基础上建立了基于预训练深度学习模型的多任务检测与分类基准(包括霉变检测和食物类型识别),实现了接近理论上限的性能(准确率=0.9954,F1=0.9954,MCC=0.9907)。此外,通过显著性可视化解释增强模型决策透明度,从而推动可及性食品安全传感与移动显微成像技术的发展。
链接: https://arxiv.org/abs/2603.01944
作者: Dinh Nam Pham,Leonard Prokisch,Bennet Meyer,Jonas Thumbs
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ACM Multimedia Systems (MMSys’26). Dataset and code available at this https URL
Abstract:Smartphone clip-on microscopes turn everyday devices into low-cost, portable imaging systems that can even reveal fungal structures at the microscopic level, enabling mold inspection beyond unaided visual checks. In this paper, we introduce MobileMold, an open smartphone-based microscopy dataset for food mold detection and food classification. MobileMold contains 4,941 handheld microscopy images spanning 11 food types, 4 smartphones, 3 microscopes, and diverse real-world conditions. Beyond the dataset release, we establish baselines for (i) mold detection and (ii) food-type classification, including a multi-task setting that predicts both attributes. Across multiple pretrained deep learning architectures and augmentation strategies, we obtain near-ceiling performance (accuracy = 0.9954, F1 = 0.9954, MCC = 0.9907), validating the utility of our dataset for detecting food spoilage. To increase transparency, we complement our evaluation with saliency-based visual explanations highlighting mold regions associated with the model’s predictions. MobileMold aims to contribute to research on accessible food-safety sensing, mobile imaging, and exploring the potential of smartphones enhanced with attachments.
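摘要中以 MCC(Matthews 相关系数)作为评价指标之一(MCC = 0.9907)。下面给出二分类 MCC 的一个最小 numpy 实现作为参考(仅为指标示意,与论文代码无关):

```python
import numpy as np

def mcc_binary(y_true, y_pred):
    """二分类 Matthews 相关系数(MCC)的直接实现:
    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))。"""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    # 约定:分母为 0(某一类完全缺失)时返回 0
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

# 完美预测时 MCC = 1
print(mcc_binary([1, 0, 1, 1, 0], [1, 0, 1, 1, 0]))  # → 1.0
```

与准确率不同,MCC 在类别不平衡(如霉变样本较少)时仍能如实反映预测质量,这也是论文同时报告三个指标的原因之一。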
[CV-35] BAWSeg: A UAV Multispectral Benchmark for Barley Weed Segmentation
【速读】:该论文旨在解决农作物田间杂草精准识别难题,特别是针对无人机(UAV)多光谱影像中因辐射漂移、作物-杂草混合像素以及小范围杂草集群难以检测等问题导致的分割性能不稳定问题。现有方法多依赖阈值化的植被指数或单一卷积神经网络(CNN)/Transformer骨干结构,存在对辐射信息与归一化指数信息耦合干扰、细节纹理丢失及稀疏杂草判别能力弱等局限。其解决方案的关键在于提出一种双流分割网络VISA(Vegetation-Index and Spectral Attention),通过解耦辐射率(radiance)和植被指数(vegetation index)两种特征模态:辐射流利用残差光谱-空间注意力机制从五波段反射率中保留细粒度纹理与行边界;指数流则结合窗口自注意力建模局部结构、状态空间层实现无二次复杂度的全场上下文传播,并引入Slot Attention形成稳定的区域描述符以增强遮蔽下稀疏杂草的区分能力,从而在保持高分辨率特征的同时提升对复杂场景的鲁棒性与敏感性。
链接: https://arxiv.org/abs/2603.01932
作者: Haitian Wang,Xinyu Wang,Muhammad Ibrahim,Dustin Severtson,Ajmal Mian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate weed mapping in cereal fields requires pixel-level segmentation from UAV imagery that remains reliable across fields, seasons, and illumination. Existing multispectral pipelines often depend on thresholded vegetation indices, which are brittle under radiometric drift and mixed crop–weed pixels, or on single-stream CNN and Transformer backbones that ingest stacked bands and indices, where radiance cues and normalized index cues interfere and reduce sensitivity to small weed clusters embedded in crop canopies. We propose VISA (Vegetation-Index and Spectral Attention), a two-stream segmentation network that decouples these cues and fuses them at native resolution. The radiance stream learns from calibrated five-band reflectance using residual spectral-spatial attention to preserve fine textures and row boundaries that are attenuated by ratio indices. The index stream operates on vegetation-index maps with windowed self-attention to model local structure efficiently, state-space layers to propagate field-scale context without quadratic attention cost, and Slot Attention to form stable region descriptors that improve discrimination of sparse weeds under canopy mixing. To support supervised training and deployment-oriented evaluation, we introduce BAWSeg, a four-year UAV multispectral dataset collected over commercial barley paddocks in Western Australia, providing radiometrically calibrated blue, green, red, red edge, and near-infrared orthomosaics, derived vegetation indices, and dense crop, weed, and other labels with leakage-free block splits. On BAWSeg, VISA achieves 75.6% mIoU and 63.5% weed IoU with 22.8M parameters, outperforming a multispectral SegFormer-B1 baseline by 1.2 mIoU and 1.9 weed IoU. Under cross-plot and cross-year protocols, VISA maintains 71.2% and 69.2% mIoU, respectively. The BAWSeg data, VISA code, and trained models will be released upon publication.
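VISA 的指数流以植被指数图为输入。论文摘要未列出所用的全部指数,此处以最常见的 NDVI 为例给出一个示意实现(变量名均为本文假设):

```python
import numpy as np

def ndvi(nir, red, eps=1e-8):
    """归一化植被指数 NDVI = (NIR - Red) / (NIR + Red)。
    输入为经过辐射定标的反射率波段,输出范围约在 [-1, 1]。"""
    nir = np.asarray(nir, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    return (nir - red) / (nir + red + eps)  # eps 防止裸土/阴影处除零

# 植被像元(NIR 反射高、Red 反射低)的 NDVI 接近 1
bands_nir = np.array([[0.6, 0.5], [0.1, 0.4]])
bands_red = np.array([[0.1, 0.1], [0.1, 0.2]])
print(np.round(ndvi(bands_nir, bands_red), 3))
```

这类比值型指数对光照漂移较稳健,但正如摘要所指出的,它会衰减细粒度纹理,这正是 VISA 额外保留辐射率流的动机。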
[CV-36] LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving
【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在自动驾驶中因依赖显式文本思维链(Chain-of-Thought, CoT)而导致的语义感知解耦与感知符号冲突问题,以及标准潜在空间推理(latent CoT)缺乏物理约束导致的“物理无关”表示缺陷。解决方案的关键在于提出Latent Spatio-Temporal VLA(LaST-VLA)框架,将推理范式从离散符号处理转向基于物理 grounded 的潜在时空思维链(Latent Spatio-Temporal CoT),通过双特征对齐机制,从3D基础模型中蒸馏几何约束、从世界模型中提取动态前瞻能力,并嵌入到潜在空间中;同时采用渐进式监督微调(SFT)与基于组相对策略优化(GRPO)的强化学习相结合的训练策略,确保轨迹生成的安全性与规则合规性,从而在多个基准测试上取得显著性能提升。
链接: https://arxiv.org/abs/2603.01928
作者: Yuechen Luo,Fang Li,Shaoqing Xu,Yang Ji,Zehan Zhang,Bing Wang,Yuannan Shen,Jianwei Cui,Long Chen,Guang Chen,Hangjun Ye,Zhi-Xin Yang,Fuxi Wen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While Vision-Language-Action (VLA) models have revolutionized autonomous driving by unifying perception and planning, their reliance on explicit textual Chain-of-Thought (CoT) leads to semantic-perceptual decoupling and perceptual-symbolic conflicts. Recent shifts toward latent reasoning attempt to bypass these bottlenecks by thinking in continuous hidden space. However, without explicit intermediate constraints, standard latent CoT often operates as a physics-agnostic representation. To address this, we propose the Latent Spatio-Temporal VLA (LaST-VLA), a framework shifting the reasoning paradigm from discrete symbolic processing into a physically grounded Latent Spatio-Temporal CoT. By implementing a dual-feature alignment mechanism, we distill geometric constraints from 3D foundation models and dynamic foresight from world models directly into the latent space. The framework is coupled with a progressive SFT training strategy that transitions from feature alignment to trajectory generation, and refined via Reinforcement Learning with Group Relative Policy Optimization (GRPO) to ensure safety and rule compliance. LaST-VLA sets a new record on NAVSIM v1 (91.3 PDMS) and NAVSIM v2 (87.1 EPDMS), while excelling in spatial-temporal reasoning on the SURDS and NuDynamics benchmarks.
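论文的双特征对齐机制将 3D 基础模型与世界模型的特征蒸馏进潜在空间,其具体损失形式摘要中未给出。下面用余弦相似度对齐损失做一个极简示意(纯属假设性实现,函数名为本文杜撰):

```python
import numpy as np

def alignment_loss(student_feat, teacher_feat, eps=1e-8):
    """以余弦相似度对齐学生特征(如潜在 CoT token)与教师特征
    (如 3D 基础模型/世界模型输出)。损失 = 1 - 平均余弦相似度。"""
    s = np.asarray(student_feat, dtype=np.float64)
    t = np.asarray(teacher_feat, dtype=np.float64)
    s = s / (np.linalg.norm(s, axis=-1, keepdims=True) + eps)
    t = t / (np.linalg.norm(t, axis=-1, keepdims=True) + eps)
    return 1.0 - float(np.mean(np.sum(s * t, axis=-1)))

feat = np.random.randn(4, 16)
print(alignment_loss(feat, feat) < 1e-6)  # 完全对齐时损失趋近 0 → True
```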
[CV-37] Zero-shot Low-Field MRI Enhancement via Diffusion-Based Adaptive Contrast Transport
【速读】:该论文旨在解决低场磁共振成像(Low-field MRI, LF-MRI)因信噪比低和组织对比度失真而导致的图像质量受限问题,尤其是从LF数据重建高场(High-field MRI, HF-MRI)质量图像时面临的无监督盲反问题。现有零样本方法通常假设简化的线性退化过程,难以恢复真实的组织对比度。论文提出的解决方案核心在于DACT(Diffusion-Based Adaptive Contrast Transport)框架:它结合预训练的HF扩散先验以保证解剖结构保真度,并引入一个物理信息驱动的自适应前向模型;关键创新点是设计了一个可微分的Sinkhorn最优传输模块,在扩散逆过程期间显式建模并校正LF与HF域之间的强度分布偏移,从而动态学习难以解析的对比度映射关系,同时保持拓扑一致性,最终实现无需配对监督即可生成具有高质量结构细节和正确组织对比度的重建图像。
链接: https://arxiv.org/abs/2603.01913
作者: Muyu Liu,Chenhe Du,Xuanyu Tian,Qing Wu,Xiao Wang,Haonan Zhang,Hongjiang Wei,Yuyao Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 4 figures, conference paper
Abstract:Low-field (LF) magnetic resonance imaging (MRI) democratizes access to diagnostic imaging but is fundamentally limited by low signal-to-noise ratio and significant tissue contrast distortion due to field-dependent relaxation dynamics. Reconstructing high-field (HF) quality images from LF data is a blind inverse problem, severely challenged by the scarcity of paired training data and the unknown, non-linear contrast transformation operator. Existing zero-shot methods, which assume simplified linear degradation, often fail to recover authentic tissue contrast. In this paper, we propose DACT (Diffusion-Based Adaptive Contrast Transport), a novel zero-shot framework that restores HF-quality images without paired supervision. DACT synergizes a pre-trained HF diffusion prior to ensure anatomical fidelity with a physically-informed adaptive forward model. Specifically, we introduce a differentiable Sinkhorn optimal transport module that explicitly models and corrects the intensity distribution shift between LF and HF domains during the reverse diffusion process. This allows the framework to dynamically learn the intractable contrast mapping while preserving topological consistency. Extensive experiments on simulated and real clinical LF datasets demonstrate that DACT achieves state-of-the-art performance, yielding reconstructions with superior structural detail and correct tissue contrast.
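DACT 的核心模块之一是可微分的 Sinkhorn 最优传输。标准的熵正则化 Sinkhorn 迭代可以用几行 numpy 写出(此处为教科书式实现,并非论文中的可微模块本身):

```python
import numpy as np

def sinkhorn(a, b, C, reg=1.0, n_iters=500):
    """熵正则化最优传输的 Sinkhorn 迭代。
    a, b: 源/目标直方图(和为 1);C: 代价矩阵;返回传输计划 P。"""
    K = np.exp(-C / reg)          # Gibbs 核
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)         # 交替缩放,逼近列边际
        u = a / (K @ v)           # 逼近行边际
    return u[:, None] * K * v[None, :]

# 两个 3-bin 强度直方图间的传输计划:行和/列和应分别等于 a 和 b
a = np.array([0.5, 0.3, 0.2])
b = np.array([0.2, 0.3, 0.5])
C = np.abs(np.arange(3)[:, None] - np.arange(3)[None, :]).astype(float)
P = sinkhorn(a, b, C)
print(np.allclose(P.sum(axis=1), a), np.allclose(P.sum(axis=0), b))  # → True True
```

在 DACT 中,这类传输计划用于把 LF 图像的强度分布"搬运"到 HF 域;Sinkhorn 的迭代均为可微运算,因而可嵌入扩散逆过程端到端优化。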
[CV-38] Generative Visual Chain-of-Thought for Image Editing
【速读】:该论文旨在解决现有图像编辑方法在复杂场景和细粒度空间指令下难以准确定位编辑区域的问题(即“where to edit”问题)。其核心解决方案是提出生成式视觉思维链(Generative Visual Chain-of-Thought, GVCoT),该框架通过先生成空间线索以定位目标区域,再执行编辑操作,实现原生的视觉推理能力。GVCoT的关键创新在于端到端联合优化推理阶段与编辑阶段生成的视觉标记(visual tokens),从而激发内在的空间推理能力并更有效地利用视觉域线索,区别于以往依赖文本思维链或工具的视觉推理范式。
链接: https://arxiv.org/abs/2603.01893
作者: Zijin Yin,Tiankai Hang,Yiji Cheng,Shiyi Zhang,Runze He,Yu Xu,Chunyu Wang,Bing Li,Zheng Chang,Kongming Liang,Qinglin Lu,Zhanyu Ma
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Tencent Hunyuan (腾讯混元); King Abdullah University of Science and Technology (阿卜杜拉国王科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Existing image editing methods struggle to perceive where to edit, especially under complex scenes and nuanced spatial instructions. To address this issue, we propose Generative Visual Chain-of-Thought (GVCoT), a unified framework that performs native visual reasoning by first generating spatial cues to localize the target region and then executing the edit. Unlike prior text-only CoT or tool-dependent visual CoT paradigms, GVCoT jointly optimizes visual tokens generated during the reasoning and editing phases in an end-to-end manner. This design fosters the emergence of innate spatial reasoning ability and enables more effective utilization of visual-domain cues. The main challenge of training GVCoT lies in the scarcity of large-scale editing data with precise edit region annotations; to this end, we construct GVCoT-Edit-Instruct, a dataset of 1.8M high-quality samples spanning 19 tasks. We adopt a progressive training strategy: supervised fine-tuning to build foundational localization ability in the reasoning trace before final editing, followed by reinforcement learning to further improve reasoning and editing quality. Finally, we introduce SREdit-Bench, a new benchmark designed to comprehensively stress-test models under sophisticated scenes and fine-grained referring expressions. Experiments demonstrate that GVCoT consistently outperforms state-of-the-art models on SREdit-Bench and ImgEdit. We hope our GVCoT will inspire future research toward interpretable and precise image editing.
[CV-39] Resolving Blind Inverse Problems under Dynamic Range Compression via Structured Forward Operator Modeling
【速读】:该论文旨在解决未知动态范围压缩(Unknown Dynamic Range Compression, UDRC)任务中的辐射度保真度恢复问题,此类问题广泛存在于低光增强、HDR重建等场景中,其难点在于前向模型未知且压缩过程导致信息不可逆丢失。解决方案的关键在于识别出单调性(monotonicity)作为UDRC任务中的基本物理不变性,并提出级联单调伯恩斯坦(Cascaded Monotonic Bernstein, CaMB)算子来参数化未知的前向模型;CaMB通过硬性架构归纳偏置强制单调性约束,使优化过程局限于物理一致的映射空间,从而实现鲁棒且稳定的算子估计。进一步地,作者将CaMB与即插即用扩散框架结合,构建CaMB-Diff模型,其中扩散模型提供结构和语义恢复的几何先验,而CaMB则显式建模并纠正辐射度失真,显著提升了零样本UDRC任务中的信号保真度与物理一致性。
链接: https://arxiv.org/abs/2603.01890
作者: Muyu Liu,Xuanyu Tian,Chenhe Du,Qing Wu,Hongjiang Wei,Yuyao Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 10 figures, conference paper
Abstract:Recovering radiometric fidelity from unknown dynamic range compression (UDRC), such as low-light enhancement and HDR reconstruction, is a challenging blind inverse problem, due to the unknown forward model and irreversible information loss introduced by compression. To address this challenge, we first identify monotonicity as the fundamental physical invariant shared across UDRC tasks. Leveraging this insight, we introduce the cascaded monotonic Bernstein (CaMB) operator to parameterize the unknown forward model. CaMB enforces monotonicity as a hard architectural inductive bias, constraining optimization to physically consistent mappings and enabling robust and stable operator estimation. We further integrate CaMB with a plug-and-play diffusion framework, proposing CaMB-Diff. Within this framework, the diffusion model serves as a powerful geometric prior for structural and semantic recovery, while CaMB explicitly models and corrects radiometric distortions through a physically grounded forward operator. Extensive experiments on a variety of zero-shot UDRC tasks, including low-light enhancement, low-field MRI enhancement, and HDR reconstruction, demonstrate that CaMB-Diff significantly outperforms state-of-the-art zero-shot baselines in terms of both signal fidelity and physical consistency. Moreover, we empirically validate the effectiveness of the proposed CaMB parameterization in accurately modeling the unknown forward operator.
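CaMB 的核心思想是用结构约束硬性保证单调性。一种常见做法(此处为示意,未必与论文的级联参数化完全一致)是:对自由参数取 softplus 得到非负增量,再做前缀和得到单调递增的 Bernstein 系数:

```python
import numpy as np
from math import comb

def monotonic_bernstein(x, theta):
    """单调 Bernstein 算子:f(x) = Σ_k c_k · B_{k,n}(x),x ∈ [0,1]。
    对 theta 取 softplus 得非负增量,前缀和给出 c_0 ≤ c_1 ≤ … ≤ c_n,
    由 Bernstein 多项式的性质可知 f 在结构上必然单调不减。"""
    increments = np.log1p(np.exp(np.asarray(theta, dtype=np.float64)))  # softplus ≥ 0
    c = np.cumsum(increments)                   # 单调递增的系数序列
    n = len(c) - 1
    x = np.asarray(x, dtype=np.float64)
    basis = np.stack([comb(n, k) * x**k * (1 - x)**(n - k)
                      for k in range(n + 1)], axis=-1)
    return basis @ c

theta = np.random.randn(6)
xs = np.linspace(0.0, 1.0, 50)
ys = monotonic_bernstein(xs, theta)
print(bool(np.all(np.diff(ys) >= 0)))  # 任意参数下输出均单调 → True
```

这样优化器无论取何参数,前向模型都被限制在物理一致(单调)的映射族内,与摘要所述的"硬性架构归纳偏置"一致。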
[CV-40] CTForensics: A Comprehensive Dataset and Method for AI-Generated CT Image Detection
【速读】:该论文旨在解决生成式人工智能(Generative AI)在医学影像领域应用中引入的合成CT图像安全风险问题,特别是现有CT伪造检测方法在真实场景下泛化能力不足以及对CT特有伪造痕迹不敏感的问题。其解决方案的关键在于构建一个名为CTForensics的综合性数据集,涵盖十种多样化的CT生成方法,以系统评估检测模型的泛化性能;同时提出一种基于卷积神经网络(CNN)的增强型空间-频率CT伪造检测器(ESF-CTFD),该模型通过小波增强中心主干(Wavelet-Enhanced Central Stem)在多尺度提取特征,并结合空间处理模块与频率处理模块,实现跨空间、频率和小波域的伪造线索融合,从而显著提升检测精度与跨生成模型的鲁棒性。
链接: https://arxiv.org/abs/2603.01878
作者: Yiheng Li,Zichang Tan,Guoqing Xu,Yijun Ye,Yang Yang,Zhen Lei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: under review, repo: this https URL
Abstract:With the rapid development of generative AI in medical imaging, synthetic Computed Tomography (CT) images have demonstrated great potential in applications such as data augmentation and clinical diagnosis, but they also introduce serious security risks. Despite the increasing security concerns, existing studies on CT forgery detection are still limited and fail to adequately address real-world challenges. These limitations are mainly reflected in two aspects: the absence of datasets that can effectively evaluate model generalization to reflect the real-world application requirements, and the reliance on detection methods designed for natural images that are insensitive to CT-specific forgery artifacts. In view of this, we propose CTForensics, a comprehensive dataset designed to systematically evaluate the generalization capability of CT forgery detection methods, which includes ten diverse CT generative methods. Moreover, we introduce the Enhanced Spatial-Frequency CT Forgery Detector (ESF-CTFD), an efficient CNN-based neural network that captures forgery cues across the wavelet, spatial, and frequency domains. First, it transforms the input CT image into three scales and extracts features at each scale via the Wavelet-Enhanced Central Stem. Then, starting from the largest-scale features, the Spatial Process Block gradually performs feature fusion with the smaller-scale ones. Finally, the Frequency Process Block learns frequency-domain information for predicting the final results. Experiments demonstrate that ESF-CTFD consistently outperforms existing methods and exhibits superior generalization across different CT generative models.
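ESF-CTFD 的 Wavelet-Enhanced Central Stem 依赖小波分解提取多尺度信息。下面以单层 2D Haar 分解(平均形式,未做正交归一化)做一个与论文代码无关的最小示意:

```python
import numpy as np

def haar_dwt2(img):
    """单层 2D Haar 小波分解:返回低频 LL 与三个高频子带 LH/HL/HH。
    高频子带中常残留生成模型特有的棋盘格/上采样伪影,可作为取证线索。"""
    img = np.asarray(img, dtype=np.float64)
    a = (img[0::2, :] + img[1::2, :]) / 2   # 行方向:相邻行平均
    b = (img[0::2, :] - img[1::2, :]) / 2   # 行方向:相邻行差分
    ll = (a[:, 0::2] + a[:, 1::2]) / 2      # 列方向再各做一次
    lh = (a[:, 0::2] - a[:, 1::2]) / 2
    hl = (b[:, 0::2] + b[:, 1::2]) / 2
    hh = (b[:, 0::2] - b[:, 1::2]) / 2
    return ll, lh, hl, hh

img = np.ones((8, 8))
ll, lh, hl, hh = haar_dwt2(img)
print(ll[0, 0], float(np.abs(lh).max()))  # 常值图像:低频为 1,高频为 0
```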
[CV-41] Streaming Real-Time Trajectory Prediction Using Endpoint-Aware Modeling WACV2026
【速读】:该论文旨在解决自动驾驶系统中轨迹预测的连续性与实时性问题,即如何在低延迟条件下实现跨时间步的一致且准确的多智能体轨迹预测。现有方法多基于快照式预测(snapshot-based prediction),忽视了全局时序上下文信息,难以满足真实场景下对数据流持续处理的需求。其解决方案的关键在于提出一种轻量级的流式轨迹预测框架,通过引入端点感知建模机制(endpoint-aware modeling scheme),利用前一时刻预测的轨迹终点作为锚点,提取目标场景的上下文编码,并将其高效传递至场景编码器,从而引导模型在无需迭代优化或分段解码的情况下,精准捕捉时序依赖关系。此设计显著降低了推理延迟,同时在Argoverse 2多智能体和单智能体基准上实现了最先进的Streaming轨迹预测性能。
链接: https://arxiv.org/abs/2603.01864
作者: Alexander Prutsch,David Schinagl,Horst Possegger
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: WACV 2026 Oral. Project Page at this https URL
Abstract:Future trajectories of neighboring traffic agents have a significant influence on the path planning and decision-making of autonomous vehicles. While trajectory forecasting is a well-studied field, research mainly focuses on snapshot-based prediction, where each scenario is treated independently of its global temporal context. However, real-world autonomous driving systems need to operate in a continuous setting, requiring real-time processing of data streams with low latency and consistent predictions over successive timesteps. We leverage this continuous setting to propose a lightweight yet highly accurate streaming-based trajectory forecasting approach. We integrate valuable information from previous predictions with a novel endpoint-aware modeling scheme. Our temporal context propagation uses the trajectory endpoints of the previous forecasts as anchors to extract targeted scenario context encodings. Our approach efficiently guides its scene encoder to extract highly relevant context information without needing refinement iterations or segment-wise decoding. Our experiments highlight that our approach effectively relays information across consecutive timesteps. Unlike methods using multi-stage refinement processing, our approach significantly reduces inference latency, making it well-suited for real-world deployment. We achieve state-of-the-art streaming trajectory prediction results on the Argoverse 2 multi-agent and single-agent benchmarks, while requiring substantially fewer resources.
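该方法以上一帧预测的轨迹终点为锚点提取场景上下文,具体实现依赖论文的场景编码器。下面仅用 k 近邻检索场景点来示意这一"终点锚定"思想(函数与变量名均为本文假设):

```python
import numpy as np

def endpoint_context(prev_endpoint, scene_points, k=4):
    """以上一帧预测的轨迹终点为锚点,从场景点(如车道中心线采样点)中
    取 k 个最近邻,作为下一帧预测的目标上下文。"""
    d = np.linalg.norm(scene_points - prev_endpoint, axis=1)
    idx = np.argsort(d)[:k]
    return scene_points[idx]

scene = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [1.2, 0.1], [9.0, 0.0]])
ctx = endpoint_context(np.array([1.0, 0.0]), scene, k=2)
print(ctx)  # 距终点最近的两个场景点
```

真实系统中,这一步由注意力机制在特征空间完成,而非显式的欧氏最近邻,但"围绕上一帧终点聚焦上下文"的思路是一致的。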
[CV-42] Tiny-DroNeRF: Tiny Neural Radiance Fields aboard Federated Learning-enabled Nano-drones ICRA2026
【速读】:该论文旨在解决在资源极度受限的纳米级空中机器人(nano-sized aerial robots)上实现复杂视觉任务,特别是高精度三维场景重建的问题。这类机器人通常配备功耗低于100 mW、算力约100 GOPS/s且内存不足100 MB的超低功耗微控制器(ULP MCU),而主流的神经辐射场(NeRF)模型需要数十GB内存和高性能GPU支持,难以部署。解决方案的关键在于提出Tiny-DroNeRF——一种基于Instant-NGP轻量化优化的NeRF模型,其内存占用相比原模型减少96%,仅损失5.7 dB重建质量;并创新性地引入协作式联邦学习(federated learning)机制,在多架纳米无人机间分布式训练模型,突破单机存储限制,显著提升整体重建精度。这是首次将NeRF训练与联邦学习结合应用于超低功耗MCU上的纳米无人机平台。
链接: https://arxiv.org/abs/2603.01850
作者: Ilenia Carboni,Elia Cereda,Lorenzo Lamberti,Daniele Malpetti,Francesco Conti,Daniele Palossi
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注: This paper has been accepted for publication in the IEEE ICRA 2026 conference. ©2026 IEEE
Abstract:Sub-30g nano-sized aerial robots can leverage their agility and form factor to autonomously explore cluttered and narrow environments, like in industrial inspection and search and rescue missions. However, the price for their tiny size is a strong limit in their resources, i.e., sub-100 mW microcontroller units (MCUs) delivering ~100 GOps/s at best, and memory budgets well below 100 MB. Despite these strict constraints, we aim to enable complex vision-based tasks aboard nano-drones, such as dense 3D scene reconstruction: a key robotic task underlying fundamental capabilities like spatial awareness and motion planning. Top-performing 3D reconstruction methods leverage neural radiance fields (NeRF) models, which require GBs of memory and massive computation, usually delivered by high-end GPUs consuming 100s of Watts. Our work introduces Tiny-DroNeRF, a lightweight NeRF model, based on Instant-NGP, and optimized for running on a GAP9 ultra-low-power (ULP) MCU aboard our nano-drones. Then, we further empower our Tiny-DroNeRF by leveraging a collaborative federated learning scheme, which distributes the model training among multiple nano-drones. Our experimental results show a 96% reduction in Tiny-DroNeRF’s memory footprint compared to Instant-NGP, with only a 5.7 dB drop in reconstruction accuracy. Finally, our federated learning scheme allows Tiny-DroNeRF to train with an amount of data otherwise impossible to keep in a single drone’s memory, increasing the overall reconstruction accuracy. Ultimately, our work combines, for the first time, NeRF training on an ULP MCU with federated learning on nano-drones.
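论文的联邦学习方案在多架纳米无人机间分布式训练 NeRF。经典的 FedAvg 聚合(按本地样本数加权平均参数)可以这样写(示意实现,论文采用的具体聚合策略以原文为准):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg:按各客户端(此处为各无人机)本地样本数,
    对逐层模型参数做加权平均,得到新的全局模型。"""
    sizes = np.asarray(client_sizes, dtype=np.float64)
    coeff = sizes / sizes.sum()                 # 归一化权重
    n_layers = len(client_weights[0])
    return [sum(c * w[i] for c, w in zip(coeff, client_weights))
            for i in range(n_layers)]

# 三架 nano-drone,各持有一个两层的小模型(以常数参数代替便于检查)
clients = [[np.full((2, 2), v), np.full(3, v)] for v in (1.0, 2.0, 3.0)]
global_model = fedavg(clients, client_sizes=[10, 10, 10])
print(global_model[0][0, 0])  # 等权平均,应接近 2.0
```

每架无人机只需上传参数而非原始图像,这正是摘要所说"用单机内存装不下的数据量训练"得以成立的原因。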
[CV-43] GroupEnsemble: Efficient Uncertainty Estimation for DETR-based Object Detection
【速读】:该论文旨在解决基于Transformer的检测模型(如DETR)在不确定性估计上的局限性,即其置信度分数仅反映语义不确定性,而无法捕捉同样重要的空间不确定性,从而导致对检测可靠性的评估不完整。现有方法如Deep Ensembles虽能提供高质量的空间不确定性估计,但内存开销过大;而MC-Dropout则因需多次前向传播导致推理延迟高。论文提出的解决方案是GroupEnsemble,其关键在于通过在推理阶段向Transformer解码器输入多组独立的对象查询(object queries),并利用注意力掩码阻止组间交互,使每组查询独立预测完整的检测结果,从而在单次前向传播中高效实现基于集成的不确定性估计,兼顾了准确性与效率。
链接: https://arxiv.org/abs/2603.01847
作者: Yutong Yang,Katarina Popović,Julian Wiederer,Markus Braun,Vasileios Belagiannis,Bin Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE IV 2026. 8 pages, 5 figures
Abstract:Detection Transformer (DETR) and its variants show strong performance on object detection, a key task for autonomous systems. However, a critical limitation of these models is that their confidence scores only reflect semantic uncertainty, failing to capture the equally important spatial uncertainty. This results in an incomplete assessment of the detection reliability. On the other hand, Deep Ensembles can tackle this by providing high-quality spatial uncertainty estimates. However, their immense memory consumption makes them impractical for real-world applications. A cheaper alternative, Monte Carlo (MC) Dropout, suffers from high latency due to the need of multiple forward passes during inference to estimate uncertainty. To address these limitations, we introduce GroupEnsemble, an efficient and effective uncertainty estimation method for DETR-like models. GroupEnsemble simultaneously predicts multiple individual detection sets by feeding additional diverse groups of object queries to the transformer decoder during inference. Each query group is transformed by the shared decoder in isolation and predicts a complete detection set for the same input. An attention mask is applied to the decoder to prevent inter-group query interactions, ensuring each group detects independently to achieve reliable ensemble-based uncertainty estimation. By leveraging the decoder’s inherent parallelism, GroupEnsemble efficiently estimates uncertainty in a single forward pass without sequential repetition. We validated our method under autonomous driving scenes and common daily scenes using the Cityscapes and COCO datasets, respectively. The results show that a hybrid approach combining MC-Dropout and GroupEnsemble outperforms Deep Ensembles on several metrics at a fraction of the cost. The code is available at this https URL.
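GroupEnsemble 的关键是用注意力掩码阻断组间 query 交互。下面用 numpy 构造这种块对角掩码,并验证跨组注意力权重确为零(仅为机制示意,非论文代码):

```python
import numpy as np

def group_attention_mask(n_groups, queries_per_group):
    """GroupEnsemble 式注意力掩码:同组 query 间可相互注意(0),
    跨组位置被屏蔽(-inf)。将掩码加到注意力 logits 上,
    softmax 后跨组权重即为 0,各组互不影响。"""
    n = n_groups * queries_per_group
    mask = np.full((n, n), -np.inf)
    for g in range(n_groups):
        s = g * queries_per_group
        mask[s:s + queries_per_group, s:s + queries_per_group] = 0.0
    return mask

m = group_attention_mask(n_groups=3, queries_per_group=2)
logits = np.zeros((6, 6))            # 假设的注意力 logits
attn = np.exp(logits + m)
attn /= attn.sum(axis=-1, keepdims=True)
print(attn[0])  # 只在本组(前两列)均分注意力
```

由于三组 query 在同一次前向中并行通过共享解码器,这等价于一次推理得到三个独立的检测集合,从而以极低开销近似深度集成。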
[CV-44] FireRed-OCR Technical Report
【速读】:该论文旨在解决通用视觉语言模型(VLM)在处理复杂文档时频繁出现的“结构幻觉”问题,从而限制其在工业光学字符识别(OCR)应用中的性能。解决方案的关键在于提出FireRed-OCR框架,通过三个核心步骤实现从通用VLM到像素级结构化文档解析专家的转化:首先构建基于几何特征聚类与多维标签的“几何+语义”数据工厂,以缓解高质量结构化数据稀缺问题;其次设计三阶段渐进式训练策略,包括多任务预对齐、专用监督微调(SFT)以及格式约束的组相对策略优化(GRPO),其中GRPO引入强化学习机制以确保输出的语法正确性和结构完整性(如表格闭合、公式语法合规)。该方案显著提升了模型在文本、公式、表格及阅读顺序等维度上的综合表现,达到92.94%的OmniDocBench v1.5得分,优于多个强基线模型。
链接: https://arxiv.org/abs/2603.01840
作者: Hao Wu,Haoran Lou,Xinyue Li,Zuodong Zhong,Zhaojun Sun,Phellon Chen,Xuanhe Zhou,Kai Zuo,Yibo Chen,Xu Tang,Yao Hu,Boxiang Zhou,Jian Wu,Yongji Wu,Wenxin Yu,Yingmiao Liu,Yuhao Huang,Manjie Xu,Gang Liu,Yidong Ma,Zhichao Sun,Changhao Qiao
机构: Xiaohongshu Inc(小红书公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:We present FireRed-OCR, a systematic framework to specialize general VLMs into high-performance OCR models. Large Vision-Language Models (VLMs) have demonstrated impressive general capabilities but frequently suffer from "structural hallucination" when processing complex documents, limiting their utility in industrial OCR applications. In this paper, we introduce FireRed-OCR, a novel framework designed to transform general-purpose VLMs (based on Qwen3-VL) into pixel-precise structural document parsing experts. To address the scarcity of high-quality structured data, we construct a "Geometry + Semantics" Data Factory. Unlike traditional random sampling, our pipeline leverages geometric feature clustering and multi-dimensional tagging to synthesize and curate a highly balanced dataset, effectively handling long-tail layouts and rare document types. Furthermore, we propose a Three-Stage Progressive Training strategy that guides the model from pixel-level perception to logical structure generation. This curriculum includes: (1) Multi-task Pre-alignment to ground the model’s understanding of document structure; (2) Specialized SFT for standardizing full-image Markdown output; and (3) Format-Constrained Group Relative Policy Optimization (GRPO), which utilizes reinforcement learning to enforce strict syntactic validity and structural integrity (e.g., table closure, formula syntax). Extensive evaluations on OmniDocBench v1.5 demonstrate that FireRed-OCR achieves state-of-the-art performance with an overall score of 92.94%, significantly outperforming strong baselines such as DeepSeek-OCR 2 and OCRVerse across text, formula, table, and reading order metrics. We open-source our code and model weights to facilitate the "General VLM to Specialized Structural Expert" paradigm.
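FireRed-OCR 的第三阶段使用 GRPO。GRPO 的核心是组相对优势:对同一输入采样一组输出,用组内奖励的均值与标准差做归一化,高于组均值的样本获得正优势。以下为该计算的最小示意:

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """GRPO 的组相对优势:A_i = (r_i - mean(r)) / (std(r) + eps),
    其中 r 为同一输入下一组采样输出的奖励(如格式合法性得分)。"""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# 四个候选输出的格式奖励(如表格是否闭合、公式语法是否合法)
rewards = [1.0, 0.0, 1.0, 0.0]
adv = grpo_advantages(rewards)
print(np.round(adv, 3))  # 高于组均值的样本获得正优势
```

相比 PPO,这种做法无需单独的价值网络,组内归一化天然适配"格式正确与否"这类离散奖励。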
[CV-45] LEAR: Learning Edge-Aware Representations for Event-to-LiDAR Localization
【速读】:该论文旨在解决在GPS拒止和视觉退化环境中,如何将稀疏、异步的事件数据(event data)与密集的LiDAR点云地图进行可靠配准的问题。由于事件传感器与LiDAR在感知模态上的本质差异,直接建立对应关系存在根本性困难(ill-posed)。解决方案的关键在于提出一种双任务学习框架LEAR,其通过联合估计边缘结构和稠密事件深度光流场(event-depth flow fields),利用跨模态融合机制将模态不变的几何线索注入运动表征,并采用迭代精化策略强制两个任务间的相互一致性,从而生成具有边缘感知且深度对齐的光流场,显著提升基于PnP求解器的姿态估计鲁棒性和准确性。
链接: https://arxiv.org/abs/2603.01839
作者: Kuangyi Chen,Jun Zhang,Yuxi Hu,Yi Zhou,Friedrich Fraundorfer
机构: Graz University of Technology (格拉茨工业大学); Hunan University (湖南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Event cameras offer high-temporal-resolution sensing that remains reliable under high-speed motion and challenging lighting, making them promising for localization from LiDAR point clouds in GPS-denied and visually degraded environments. However, aligning sparse, asynchronous events with dense LiDAR maps is fundamentally ill-posed, as direct correspondence estimation suffers from modality gaps. We propose LEAR, a dual-task learning framework that jointly estimates edge structures and dense event-depth flow fields to bridge the sensing-modality divide. Instead of treating edges as a post-hoc aid, LEAR couples them with flow estimation through a cross-modal fusion mechanism that injects modality-invariant geometric cues into the motion representation, and an iterative refinement strategy that enforces mutual consistency between the two tasks over multiple update steps. This synergy produces edge-aware, depth-aligned flow fields that enable more robust and accurate pose recovery via Perspective-n-Point (PnP) solvers. On several popular and challenging datasets, LEAR achieves superior performance over the best prior method. The source code, trained models, and demo videos are made publicly available online.
[CV-46] Affine Correspondences in Stereo Vision: Theory Practice and Limitations
【速读】:该论文旨在解决基于仿射变换(affine transformation)的立体视觉中3D重建精度受限的问题,特别是如何提高表面法向量(surface normal)估计的准确性。其关键解决方案在于提出两种新颖的技术:一是从对应图像方向中估计局部仿射变换的方法,二是利用图像对之间的基础矩阵(fundamental matrix)来增强估计精度。通过合成与真实数据的定量评估,验证了所提方法在典型测试场景下可实现几度误差范围内的表面法向量重建精度,从而提升了整体3D重建质量。
链接: https://arxiv.org/abs/2603.01836
作者: Levente Hajder
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Affine transformations have been recently used for stereo vision. They can be exploited in various computer vision applications, e.g., when estimating surface normals, homographies, fundamental and essential matrices. Even full 3D reconstruction can be obtained by using affine correspondences. First, this paper overviews the fundamental statements for affine transformations and epipolar geometry. Then it is investigated how the transformation accuracy influences the quality of the 3D reconstruction. Besides, we propose novel techniques for estimating the local affine transformation from corresponding image directions; moreover, the fundamental matrix, related to the processed image pair, can also be exploited. Both synthetic and real quantitative evaluations are conducted based on the accuracy of the reconstructed surface normals. For the latter, a special object containing three perpendicular planes with chessboard patterns is constructed. The evaluations conclude that the estimation accuracy is around a few degrees for realistic test cases. Special stereo poses and plane orientations are also evaluated in detail.
[CV-47] Neural Operator-Grounded Continuous Tensor Function Representation and Its Applications
【速读】:该论文旨在解决现有连续张量函数表示方法中因依赖离散且线性的模式-n 乘积(mode-n product)而导致的表达能力受限问题,从而无法充分建模真实世界数据的复杂结构并可能引入离散化伪影。解决方案的关键在于提出基于神经算子(neural operator)的连续非线性模式-n 算子,它直接将连续的核心张量函数映射到连续的目标张量函数,而非传统方式下对离散核心张量进行线性变换,从而实现真正意义上的连续表示,并显著提升对多维复杂数据(如多光谱图像、彩色视频、遥感影像及点云)的建模精度与泛化能力。
链接: https://arxiv.org/abs/2603.01812
作者: Ruoyang Su,Xi-Le Zhao,Sheng Liu,Wei-Hao Wu,Yisi Luo,Michael K. Ng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注:
Abstract:Recently, continuous tensor functions have attracted increasing attention, because they can represent data both on mesh grids and beyond mesh grids in a unified manner. However, since the mode-n product is essentially discrete and linear, the potential of current continuous tensor function representations is still locked. To break this bottleneck, we suggest neural operator-grounded mode-n operators as a continuous and nonlinear alternative to the discrete and linear mode-n product. Instead of mapping the discrete core tensor to the discrete target tensor, the proposed mode-n operator directly maps the continuous core tensor function to the continuous target tensor function, which provides a genuine continuous representation of real-world data and can ameliorate discretization artifacts. Empowered with continuous and nonlinear mode-n operators, we propose a neural operator-grounded continuous tensor function representation (abbreviated as NO-CTR), which can more faithfully represent complex real-world data compared with classic discrete tensor representations and continuous tensor function representations. Theoretically, we also prove that any continuous tensor function can be approximated by NO-CTR. To examine the capability of NO-CTR, we suggest an NO-CTR-based multi-dimensional data completion model. Extensive experiments across various data on regular mesh grids (multi-spectral images and color videos), on mesh grids with different resolutions (Sentinel-2 images) and beyond mesh grids (point clouds) demonstrate the superiority of NO-CTR.
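NO-CTR 用连续非线性的神经算子替代离散线性的 mode-n 乘积。作为被替代的基线,经典 mode-n 乘积可由"展开—左乘—折叠"三步实现:

```python
import numpy as np

def mode_n_product(T, M, n):
    """经典(离散、线性)mode-n 乘积:沿第 n 个模展开张量后左乘矩阵 M。
    NO-CTR 正是以连续非线性的神经算子替代这一操作,此处仅示意被替代的基线。"""
    T = np.moveaxis(np.asarray(T), n, 0)          # 把第 n 模移到最前
    shape = T.shape
    unfolded = T.reshape(shape[0], -1)            # mode-n 展开为矩阵
    result = (M @ unfolded).reshape((M.shape[0],) + shape[1:])
    return np.moveaxis(result, 0, n)              # 折叠并移回原位置

T = np.arange(24, dtype=float).reshape(2, 3, 4)
M = np.eye(3)                                     # 单位矩阵:mode-1 乘积应保持张量不变
print(np.allclose(mode_n_product(T, M, n=1), T))  # → True
```

可以看到该操作对第 n 模的离散索引做的是固定的线性组合;NO-CTR 的 mode-n 算子则把这一步推广为作用在连续核心张量函数上的非线性映射。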
[CV-48] Non-verbal Real-time Human-AI Interaction in Constrained Robotic Environments
【速读】:该论文旨在解决当前生成式 AI 在非语言交流(non-verbal communication)中生成动作数据的统计保真度问题,即判断现代生成模型是否能超越表面模仿,真正参与身体语言的无声但富有表现力的互动。其解决方案的关键在于提出并实现了一个首个能够在实时场景下从2D人体关键点生成自然人机非语言交互的框架,采用四种轻量级神经网络架构,在NVIDIA Orin Nano上达到最高100 FPS的推理速度,有效闭环感知-动作回路。实验表明,通过在合成生成的动作序列上预训练可显著降低运动误差,同时保持高效率;但即便如此,仍存在可测量的“现实差距”(reality gap),尤其在评估基于Sora和VEO等文本到视频系统生成的数据时,性能下降明显,而VEO因更强的时间一致性(temporal coherence)导致性能衰减较小,说明时间连贯性比图像保真度更影响真实世界表现。
链接: https://arxiv.org/abs/2603.01804
作者: Dragos Costea,Alina Marcu,Cristina Lazar,Marius Leordeanu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:We study the ongoing debate regarding the statistical fidelity of AI-generated data compared to human-generated data in the context of non-verbal communication using full body motion. Concretely, we ask if contemporary generative models move beyond surface mimicry to participate in the silent, but expressive dialogue of body language. We tackle this question by introducing the first framework that generates a natural non-verbal interaction between Human and AI in real-time from 2D body keypoints. Our experiments utilize four lightweight architectures which run at up to 100 FPS on an NVIDIA Orin Nano, effectively closing the perception-action loop needed for natural Human-AI interaction. We trained on 437 human video clips and demonstrated that pretraining on synthetically-generated sequences reduces motion errors significantly, without sacrificing speed. Yet, a measurable reality gap persists. When the best model is evaluated on keypoints extracted from cutting-edge text-to-video systems, such as SORA and VEO, we observe that performance drops on SORA-generated clips. However, it degrades far less on VEO, suggesting that temporal coherence, not image fidelity, drives real-world performance. Our results demonstrate that statistically distinguishable differences persist between Human and AI motion.
[CV-49] Downstream Task Inspired Underwater Image Enhancement: A Perception-Aware Study from Dataset Construction to Network Design
【速读】:该论文旨在解决当前水下图像增强(Underwater Image Enhancement, UIE)方法主要面向人类视觉感知优化,而忽视了对下游任务(如语义分割、目标检测等)至关重要的高频细节重建问题。其解决方案的关键在于提出了一种面向下游任务的水下图像增强框架(Downstream Task-Inspired Underwater Image Enhancement, DTI-UIE),该框架通过设计一个具备任务感知注意力模块的双分支网络结构实现特征融合,并引入任务驱动的感知损失函数与多阶段训练策略,从而生成更利于下游视觉任务识别的增强图像;同时,受人类感知启发,自动构建了一个任务导向的UIE数据集(Task-Inspired UIE Dataset, TI-UIED),有效提升了增强结果在实际应用中的任务性能。
链接: https://arxiv.org/abs/2603.01767
作者: Bosen Lin,Feng Gao,Yanwei Yu,Junyu Dong,Qian Du
机构: Ocean University of China (中国海洋大学); Mississippi State University (密西西比州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted for publication in IEEE TIP 2026
Abstract:In real underwater environments, downstream image recognition tasks such as semantic segmentation and object detection often face challenges posed by problems like blurring and color inconsistencies. Underwater image enhancement (UIE) has emerged as a promising preprocessing approach, aiming to improve the recognizability of targets in underwater images. However, most existing UIE methods mainly focus on enhancing images for human visual perception, frequently failing to reconstruct high-frequency details that are critical for task-specific recognition. To address this issue, we propose a Downstream Task-Inspired Underwater Image Enhancement (DTI-UIE) framework, which leverages human visual perception model to enhance images effectively for underwater vision tasks. Specifically, we design an efficient two-branch network with task-aware attention module for feature mixing. The network benefits from a multi-stage training framework and a task-driven perceptual loss. Additionally, inspired by human perception, we automatically construct a Task-Inspired UIE Dataset (TI-UIED) using various task-specific networks. Experimental results demonstrate that DTI-UIE significantly improves task performance by generating preprocessed images that are beneficial for downstream tasks such as semantic segmentation, object detection, and instance segmentation. The codes are publicly available at this https URL.
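上文提到的“任务驱动感知损失”,其核心思想是:不在像素空间、而在(冻结的)下游任务网络的特征空间中比较增强结果与参考图像。下面用 numpy 给出一个极简数值示意(其中“线性+ReLU 任务编码器”与损失权重均为假设,并非论文 DTI-UIE 的原始实现):

```python
import numpy as np

rng = np.random.default_rng(0)

# 假设的"冻结任务编码器":线性 + ReLU,仅作示意
W = rng.normal(size=(16, 64)) * 0.1

def task_feats(img):
    # 将图像映射到任务特征空间
    return np.maximum(W @ img.ravel(), 0.0)

gt = rng.random(64)                              # 参考(理想)图像
enhanced = gt + rng.normal(size=64) * 0.05       # 增强网络的输出

pixel_loss = float(np.mean((enhanced - gt) ** 2))                 # 像素级损失
perceptual_loss = float(np.mean((task_feats(enhanced) - task_feats(gt)) ** 2))
total = pixel_loss + 0.1 * perceptual_loss       # 以假设权重组合两项
print(pixel_loss, perceptual_loss)
```

感知项惩罚的是下游任务“看得见”的差异,因此对高频细节的丢失比像素损失更敏感。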
[CV-50] Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation
【速读】:该论文旨在解决零样本深度补全(zero-shot depth completion)中因依赖扩散模型进行测试时优化而导致的计算效率低下问题,以及现有基于视觉提示的方法在推理阶段仍需多次前向-反向传播导致速度缓慢的问题。其解决方案的关键在于发现深度基础模型中的深度相关信息主要集中在低维解码器子空间内,因此仅需对解码器部分进行轻量级测试时适应(test-time adaptation),并通过稀疏深度监督信号更新该子空间即可实现高效且高精度的深度补全。这一方法显著提升了推理效率并达到了当前最优的准确率与效率权衡。
链接: https://arxiv.org/abs/2603.01765
作者: Minseok Seo,Wonjun Lee,Jaehyuk Jang,Changick Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 7 figures [We achieved a new Pareto frontier in test-time depth completion.]
Abstract:Zero-shot depth completion has gained attention for its ability to generalize across environments without sensor-specific datasets or retraining. However, most existing approaches rely on diffusion-based test-time optimization, which is computationally expensive due to iterative denoising. Recent visual-prompt-based methods reduce training cost but still require repeated forward–backward passes through the full frozen network to optimize input-level prompts, resulting in slow inference. In this work, we show that adapting only the decoder is sufficient for effective test-time optimization, as depth foundation models concentrate depth-relevant information within a low-dimensional decoder subspace. Based on this insight, we propose a lightweight test-time adaptation method that updates only this low-dimensional subspace using sparse depth supervision. Our approach achieves state-of-the-art performance, establishing a new Pareto frontier between accuracy and efficiency for test-time adaptation. Extensive experiments on five indoor and outdoor datasets demonstrate consistent improvements over prior methods, highlighting the practicality of fast zero-shot depth completion.
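摘要的核心主张是:深度相关信息集中于解码器的低维子空间,因此测试时只需在该子空间内用稀疏深度监督求解少量系数。下面是一个线性化的玩具示意(用闭式最小二乘代替梯度迭代,所有维度、权重与“子空间基”均为假设):

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, r = 64, 16, 2                      # 像素数、特征通道数、子空间秩

feats = rng.normal(size=(N, C))          # 冻结编码器输出
w_dec = rng.normal(size=(C, 1)) * 0.1    # 冻结解码器权重
U = rng.normal(size=(C, r))              # 低维解码器子空间基(假设已知)

# 稀疏深度监督:仅 10 个像素有真值
mask = np.zeros(N, dtype=bool)
mask[rng.choice(N, size=10, replace=False)] = True
gt = feats @ (w_dec + U @ np.array([[0.3], [-0.2]]))  # 合成目标(真解落在子空间内)

# 测试时适应:只求解子空间系数 v(闭式最小二乘)
A = feats[mask] @ U
b = (gt - feats @ w_dec)[mask]
v, *_ = np.linalg.lstsq(A, b, rcond=None)

err_before = float(np.abs(feats @ w_dec - gt).mean())
err_after = float(np.abs(feats @ (w_dec + U @ v) - gt).mean())
print(err_before, err_after)
```

只优化 r=2 个自由度即可拟合全部稀疏监督,这正是“解码器低秩适应”比优化输入提示或整网反传便宜的原因。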
[CV-51] Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining
【速读】:该论文旨在解决异构多模态遥感目标检测中因模态对齐与任务优化耦合导致的训练不稳定和泛化性能不佳的问题。现有方法多采用后期对齐范式,在下游微调过程中将模态对齐与特定任务优化交织在一起,造成优化复杂且效果受限。解决方案的关键在于提出BabelRS框架,其核心创新是通过显式解耦模态对齐与下游任务学习:首先利用Concept-Shared Instruction Aligning (CSIA) 模块以语言为语义枢纽,将不同传感器模态(如RGB、SAR、红外)映射到共享的语言概念空间;其次引入Layerwise Visual-Semantic Annealing (LVSA) 模块,逐步聚合多尺度视觉特征以提供细粒度语义引导,从而缓解高层语言表示与密集检测目标之间的粒度不匹配问题。该设计显著提升了训练稳定性并实现了更优的检测性能。
链接: https://arxiv.org/abs/2603.01758
作者: Yuxuan Li,Yuming Chen,Yunheng Li,Ming-Ming Cheng,Xiang Li,Jian Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors (e.g., RGB, SAR, Infrared). Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. This tight coupling complicates optimization and often results in unstable training and suboptimal generalization. To address these limitations, we propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning. BabelRS comprises two key components: Concept-Shared Instruction Aligning (CSIA) and Layerwise Visual-Semantic Annealing (LVSA). CSIA aligns each sensor modality to a shared set of linguistic concepts, using language as a semantic pivot to bridge heterogeneous visual representations. To further mitigate the granularity mismatch between high-level language representations and dense detection objectives, LVSA progressively aggregates multi-scale visual features to provide fine-grained semantic guidance. Extensive experiments demonstrate that BabelRS stabilizes training and consistently outperforms state-of-the-art methods without bells and whistles. Code: this https URL.
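CSIA 的思想可抽象为:把各传感器模态的特征对齐到一组共享语言概念上,以语言为语义枢纽。下面用 numpy 写一个 InfoNCE 风格的极简示意(概念向量、模态特征与损失形式均为随机假设,非 BabelRS 实现):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_concepts = 16, 4

# 共享语言概念(单位向量),充当跨模态的语义枢纽
concepts = rng.normal(size=(n_concepts, d))
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)

def align_loss(feat, concept_id):
    # InfoNCE 风格:目标概念为正样本,其余概念为负样本
    f = feat / np.linalg.norm(feat)
    sims = concepts @ f
    return float(-np.log(np.exp(sims[concept_id]) / np.exp(sims).sum()))

rgb_feat = concepts[1] * 3 + rng.normal(size=d) * 0.1   # 已对齐到概念 1 的 RGB 特征
sar_feat = rng.normal(size=d)                           # 尚未对齐的 SAR 特征
loss_aligned = align_loss(rgb_feat, 1)
loss_unaligned = align_loss(sar_feat, 1)
print(loss_aligned, loss_unaligned)
```

各模态分别向同一概念集合对齐后,异构特征便通过语言空间间接可比,而无需模态两两配对。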
[CV-52] StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models
【速读】:该论文旨在解决视觉自回归(Visual AutoRegressive, VAR)模型在高分辨率下推理成本随尺度增加呈二次增长的问题,尤其针对后期尺度中高频纹理细节的冗余计算与结构信息保留不足的矛盾。解决方案的关键在于提出一种无需训练的token剪枝框架StepVAR,其核心创新是通过双准则机制联合评估token的重要性:一方面利用轻量级高通滤波器捕捉局部纹理细节以保留细粒度 fidelity,另一方面借助主成分分析(Principal Component Analysis, PCA)保持全局结构一致性;同时引入最近邻特征传播策略,在稀疏token表示基础上重建稠密特征图,从而保障后续尺度预测的有效性。该方法显著提升了VAR模型的推理效率,且在文本到图像和文本到视频生成任务中均保持高质量输出。
链接: https://arxiv.org/abs/2603.01757
作者: Keli Liu,Zhendong Wang,Wengang Zhou,Houqiang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual AutoRegressive (VAR) models based on next-scale prediction enable efficient hierarchical generation, yet the inference cost grows quadratically at high resolutions. We observe that the computationally intensive later scales predominantly refine high-frequency textures and exhibit substantial spatial redundancy, in contrast to earlier scales that determine the global structural layout. Existing pruning methods primarily focus on high-frequency detection for token selection, often overlooking structural coherence and consequently degrading global semantics. To address this limitation, we propose StepVAR, a training-free token pruning framework that accelerates VAR inference by jointly considering structural and textural importance. Specifically, we employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information. This dual-criterion design enables the model to retain tokens critical for both fine-grained fidelity and overall composition. To maintain valid next-scale prediction under sparse tokens, we further introduce a nearest neighbor feature propagation strategy to reconstruct dense feature maps from pruned representations. Extensive experiments on state-of-the-art text-to-image and text-to-video VAR models demonstrate that StepVAR achieves substantial inference speedups while maintaining generation quality. Quantitative and qualitative evaluations consistently show that our method outperforms existing acceleration approaches, validating its effectiveness and general applicability across diverse VAR architectures.
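StepVAR 的双准则打分(高通滤波捕捉纹理 + PCA 保全局结构)与最近邻特征传播,可用如下玩具代码示意(打分组合方式与保留比例均为假设,非论文实现):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, keep = 16, 16, 8, 64

tokens = rng.normal(size=(H, W, C))

# 纹理分:轻量高通响应(token 减去四邻域均值)
pad = np.pad(tokens, ((1, 1), (1, 1), (0, 0)), mode="edge")
local_mean = (pad[:-2, 1:-1] + pad[2:, 1:-1] + pad[1:-1, :-2] + pad[1:-1, 2:]) / 4
texture = np.linalg.norm(tokens - local_mean, axis=-1)

# 结构分:在前若干主成分方向上的投影能量
flat = tokens.reshape(-1, C)
flat_c = flat - flat.mean(0)
_, _, Vt = np.linalg.svd(flat_c, full_matrices=False)
structure = np.linalg.norm(flat_c @ Vt[:2].T, axis=1).reshape(H, W)

score = texture / texture.max() + structure / structure.max()
kept = np.argsort(score.ravel())[-keep:]          # 保留得分最高的 token

# 最近邻特征传播:被剪枝位置复制最近的保留 token,重建稠密特征图
ys, xs = np.divmod(np.arange(H * W), W)
ky, kx = ys[kept], xs[kept]
d2 = (ys[:, None] - ky[None]) ** 2 + (xs[:, None] - kx[None]) ** 2
nearest = kept[d2.argmin(1)]
dense = flat[nearest].reshape(H, W, C)            # 重建后的稠密特征图
print(dense.shape)
```

重建稠密特征图的目的是让下一尺度预测仍能读到完整网格,稀疏化只发生在计算密集的注意力/前馈阶段。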
[CV-53] NeuroSymb-MRG: Differentiable Abductive Reasoning with Active Uncertainty Minimization for Radiology Report Generation
【速读】:该论文旨在解决放射科报告自动生成中存在的视觉-语言偏差(visual-linguistic biases)、事实不一致性和缺乏显式多跳临床推理(multi-hop clinical reasoning)的问题。解决方案的关键在于提出一种统一框架 NeuroSymb-MRG,其核心是将神经符号归纳推理(NeuroSymbolic abductive reasoning)与主动不确定性最小化相结合:首先将图像特征映射为概率性临床概念,构建可微分的逻辑推理链,再将其解码为模板化的语句片段,并通过检索和约束语言模型编辑进一步优化文本输出;同时引入基于规则级不确定性和多样性驱动的主动采样循环,引导临床医生参与审定与提示词库(promptbook)迭代优化,从而提升报告的事实一致性与临床合理性。
链接: https://arxiv.org/abs/2603.01756
作者: Rong Fu,Yiqing Lyu,Chunlei Meng,Muge Qi,Yabin Jin,Qi Zhao,Li Bao,Juntao Gao,Fuqian Shi,Nilanjan Dey,Wei Luo,Simon Fong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 1 figure
Abstract:Automatic generation of radiology reports seeks to reduce clinician workload while improving documentation consistency. Existing methods that adopt encoder-decoder or retrieval-augmented pipelines achieve progress in fluency but remain vulnerable to visual-linguistic biases, factual inconsistency, and lack of explicit multi-hop clinical reasoning. We present NeuroSymb-MRG, a unified framework that integrates NeuroSymbolic abductive reasoning with active uncertainty minimization to produce structured, clinically grounded reports. The system maps image features to probabilistic clinical concepts, composes differentiable logic-based reasoning chains, decodes those chains into templated clauses, and refines the textual output via retrieval and constrained language-model editing. An active sampling loop driven by rule-level uncertainty and diversity guides clinician-in-the-loop adjudication and promptbook refinement. Experiments on standard benchmarks demonstrate consistent improvements in factual consistency and standard language metrics compared to representative baselines.
[CV-54] An Analysis of Multi-Task Architectures for the Hierarchic Multi-Label Problem of Vehicle Model and Make Classification
【速读】:该论文旨在解决深度学习模型在处理具有层次结构信息(如汽车品牌与型号分类)时未能有效利用语义层级结构的问题。其解决方案的关键在于引入多任务学习(Multi-Task Learning, MTL)框架,通过并行或级联的多任务架构,使模型同时学习不同层级的标签预测任务,从而增强对层次化语义信息的建模能力。实验表明,该方法在StanfordCars和CompCars两个基准数据集上均提升了卷积神经网络(CNN)和Transformer类模型的性能,尤其在CompCars数据集上表现显著改善,验证了多任务学习在层次化多标签分类任务中的有效性。
链接: https://arxiv.org/abs/2603.01746
作者: Alexandru Manole,Laura Diosan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 8 figures ,7 tables
Abstract:Most information in our world is organized hierarchically; however, many Deep Learning approaches do not leverage this semantically rich structure. Research suggests that human learning benefits from exploiting the hierarchical structure of information, and intelligent models could similarly take advantage of this through multi-task learning. In this work, we analyze the advantages and limitations of multi-task learning in a hierarchical multi-label classification problem: car make and model classification. Considering both parallel and cascaded multi-task architectures, we evaluate their impact on different Deep Learning classifiers (CNNs, Transformers) while varying key factors such as dropout rate and loss weighting to gain deeper insight into the effectiveness of this approach. The tests are conducted on two established benchmarks: StanfordCars and CompCars. We observe the effectiveness of the multi-task paradigm on both datasets, improving the performance of the investigated CNN in almost all scenarios. Furthermore, the approach yields significant improvements on the CompCars dataset for both types of models.
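文中比较的“并行”与“级联”多任务结构,区别在于型号(model)分类头是否额外接收品牌(make)头的输出。下面是只含线性层的极简示意(维度与权重均为假设):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_makes, n_models = 32, 5, 20
feat = rng.normal(size=(d,))                 # 共享骨干网络输出的特征

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

W_make = rng.normal(size=(n_makes, d)) * 0.1
# 并行:两个头都直接读取共享特征
W_model_par = rng.normal(size=(n_models, d)) * 0.1
# 级联:型号头额外拼接品牌头的预测概率
W_model_cas = rng.normal(size=(n_models, d + n_makes)) * 0.1

make_probs = softmax(W_make @ feat)
model_par = softmax(W_model_par @ feat)
model_cas = softmax(W_model_cas @ np.concatenate([feat, make_probs]))
print(model_par.shape, model_cas.shape)
```

级联结构把层级先验(品牌约束型号)显式注入下游头,代价是对品牌头误差更敏感,这正是文中对比两种结构的动机。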
[CV-55] Action-Guided Attention for Video Action Anticipation ICLR2026
【速读】:该论文旨在解决视频动作预测(Action Anticipation)中因依赖显式视觉线索而导致模型过拟合、难以捕捉潜在意图从而影响泛化能力的问题。现有基于Transformer的方法通常使用像素级表示的点积注意力机制,缺乏高层语义建模能力,导致其在未见样本上的表现受限。解决方案的关键在于提出一种动作引导注意力机制(Action-Guided Attention, AGA),该机制将预测的动作序列作为查询(query)和键(key)来引导时序建模,使注意力模块能够根据未来动作聚焦过去的相关时刻,并通过专用门控函数融合当前帧嵌入,从而增强对潜在意图的建模能力并提升泛化性能。
链接: https://arxiv.org/abs/2603.01743
作者: Tsung-Ming Tai,Sofia Casarin,Andrea Pilzer,Werner Nutt,Oswald Lanz
机构: Free University of Bozen-Bolzano (博岑-博尔扎诺自由大学); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR 2026
Abstract:Anticipating future actions in videos is challenging, as the observed frames provide only evidence of past activities, requiring the inference of latent intentions to predict upcoming actions. Existing transformer-based approaches, which rely on dot-product attention over pixel representations, often lack the high-level semantics necessary to model video sequences for effective action anticipation. As a result, these methods tend to overfit to explicit visual cues present in the past frames, limiting their ability to capture underlying intentions and degrading generalization to unseen samples. To address this, we propose Action-Guided Attention (AGA), an attention mechanism that explicitly leverages predicted action sequences as queries and keys to guide sequence modeling. Our approach fosters the attention module to emphasize relevant moments from the past based on the upcoming activity and combine this information with the current frame embedding via a dedicated gating function. The design of AGA enables post-training analysis of the knowledge discovered from the training set. Experiments on the widely adopted EPIC-Kitchens-100 benchmark demonstrate that AGA generalizes well from validation to unseen test sets. Post-training analysis can further examine the action dependencies captured by the model and the counterfactual evidence it has internalized, offering transparent and interpretable insights into its anticipative predictions.
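AGA 的核心是用预测动作序列充当注意力的 query/key、过去帧特征充当 value,再经门控与当前帧融合。下面是单头注意力的数值示意(门控形式为假设的简化版本):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d = 6, 16
frames = rng.normal(size=(T, d))          # 过去帧特征(value)
actions = rng.normal(size=(T, d))         # 预测动作序列(query/key)
current = rng.normal(size=(d,))           # 当前帧嵌入

attn = softmax(actions @ actions.T / np.sqrt(d))   # 动作间相关性决定关注哪些时刻
context = attn @ frames                            # 按动作相关性聚合过去帧

# 门控:将注意力上下文与当前帧嵌入自适应融合
g = 1.0 / (1.0 + np.exp(-(context[-1] + current)))  # sigmoid 门
out = g * context[-1] + (1 - g) * current
print(out.shape)
```

由于注意力权重由动作语义而非像素相似度决定,事后检查 attn 矩阵即可解释模型依赖了哪些历史动作,这对应文中的 post-training analysis。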
[CV-56] Learning Domain-Aware Task Prompt Representations for Multi-Domain All-in-One Image Restoration ICLR2026
【速读】:该论文旨在解决当前全合一图像复原(All-in-One Image Restoration, AiOIR)方法局限于单一图像域(如自然场景、医学影像或遥感图像)的问题,提出首个支持多域的AiOIR框架DATPRL-IR。其核心创新在于引入领域感知的任务提示表示学习(Domain-Aware Task Prompt Representation Learning),通过构建任务提示池(task prompt pool)和领域提示池(domain prompt pool),分别编码任务相关知识和领域先验信息,并利用提示组合机制(Prompt Composition Mechanism, PCM)动态生成实例级的任务与领域表示,最终融合形成具有领域感知能力的任务提示表示,从而在多个图像域中实现更高效且通用的图像复原。
链接: https://arxiv.org/abs/2603.01725
作者: Guanglu Dong,Chunlei Li,Chao Ren,Jingliang Hu,Yilei Shi,Xiao Xiang Zhu,Lichao Mou
机构: Sichuan University (四川大学); MedAI Technology (Wuxi) Co. Ltd. (MedAI科技(无锡)有限公司); Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2026
Abstract:Recently, significant breakthroughs have been made in all-in-one image restoration (AiOIR), which can handle multiple restoration tasks with a single model. However, existing methods typically focus on a specific image domain, such as natural scene, medical imaging, or remote sensing. In this work, we aim to extend AiOIR to multiple domains and propose the first multi-domain all-in-one image restoration method, DATPRL-IR, based on our proposed Domain-Aware Task Prompt Representation Learning. Specifically, we first construct a task prompt pool containing multiple task prompts, in which task-related knowledge is implicitly encoded. For each input image, the model adaptively selects the most relevant task prompts and composes them into an instance-level task representation via a prompt composition mechanism (PCM). Furthermore, to endow the model with domain awareness, we introduce another domain prompt pool and distill domain priors from multimodal large language models into the domain prompts. PCM is utilized to combine the adaptively selected domain prompts into a domain representation for each input image. Finally, the two representations are fused to form a domain-aware task prompt representation which can make full use of both specific and shared knowledge across tasks and domains to guide the subsequent restoration process. Extensive experiments demonstrate that our DATPRL-IR significantly outperforms existing SOTA image restoration methods, while exhibiting strong generalization capabilities. Code is available at this https URL.
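提示池的“自适应选择 + 组合(PCM)”机制可以用如下极简代码示意:按与输入特征的相似度选出 top-k 提示,再以 softmax 权重加权求和(池大小、维度与相似度度量均为假设):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
pool_size, d, k = 8, 16, 3

prompt_pool = rng.normal(size=(pool_size, d))   # 可学习的任务提示池
image_feat = rng.normal(size=(d,))              # 每张输入图像的查询特征

# 选出最相关的 k 个提示并组合为实例级任务表示(PCM 思想的草图)
sims = prompt_pool @ image_feat
topk = np.argsort(sims)[-k:]
weights = softmax(sims[topk])
task_repr = weights @ prompt_pool[topk]
print(task_repr.shape)
```

同样的选择-组合过程在领域提示池上再执行一次,两个表示融合后即得文中的“领域感知任务提示表示”。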
[CV-57] Preoperative-to-intraoperative Liver Registration for Laparoscopic Surgery via Latent-Grounded Correspondence Constraints
【速读】:该论文旨在解决腹腔镜肝切除术中增强现实(Augmented Reality, AR)技术在跨模态配准(2D-3D)时缺乏显式建模可靠几何对应关系的问题,从而导致临床场景下对齐不稳定且可解释性差。其核心解决方案是提出Land-Reg框架,关键在于通过显式学习由潜在证据支撑的2D-3D特征点对应关系作为可解释的中间表示,并在此基础上构建刚性与非刚性配准机制:刚性配准采用跨模态潜在对齐模块将多模态特征映射至统一潜空间,结合不确定性增强的重叠特征点检测器实现鲁棒对应估计;非刚性配准则引入形状约束监督策略,以匹配点为锚点保证重投影一致性并融合局部等距正则化缓解2D-3D深度模糊性,同时通过渲染掩码对齐确保全局形状一致性。
链接: https://arxiv.org/abs/2603.01720
作者: Ruize Cui,Jialun Pei,Haiqiao Wang,Jun Zhou,Jeremy Yuen-Chun Teoh,Pheng-Ann Heng,Jing Qin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures
Abstract:In laparoscopic liver surgery, augmented reality technology enhances intraoperative anatomical guidance by overlaying 3D liver models from preoperative CT/MRI onto laparoscopic 2D views. However, existing registration methods lack explicit modeling of reliable 2D-3D geometric correspondences supported by latent evidence, leading to limited interpretability and potentially unstable alignment in clinical scenarios. In this work, we introduce Land-Reg, a correspondence-driven deformable registration framework that explicitly learns latent-grounded 2D-3D landmark correspondences as an interpretable intermediate representation to bridge cross-modal alignment. For rigid registration, Land-Reg embraces a Cross-modal Latent Alignment module to map multi-modal features into a unified latent space. Further, an Uncertainty-enhanced Overlap Landmark Detector with similarity matching is proposed to robustly estimate explicit 2D-3D landmark correspondences. For non-rigid registration, we design a novel shape-constrained supervision strategy that anchors shape deformation to matched landmarks through reprojection consistency and incorporates local-isometric regularization to alleviate inherent 2D-3D depth ambiguity, while a rendered-mask alignment enforces global shape consistency. Experimental results on the P2ILF dataset demonstrate the superiority of our method on both rigid pose estimation and non-rigid deformation. Our code will be available at this https URL.
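非刚性配准中的“重投影一致性”可抽象为:将 3D 特征点经针孔相机投影到 2D,再与匹配到的 2D 点比较。下面是一个示意(相机内参与点坐标均为假设,仅说明该监督项的形式):

```python
import numpy as np

rng = np.random.default_rng(0)

# 假设的针孔相机内参
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
pts3d = rng.uniform(0.5, 2.0, size=(6, 3)) + np.array([0, 0, 3.0])  # 相机前方的 3D 特征点

def project(pts, K):
    # 针孔投影:齐次坐标除以深度
    uv = (K @ pts.T).T
    return uv[:, :2] / uv[:, 2:3]

# 带噪声的 2D 匹配点,模拟检测误差
matched_2d = project(pts3d, K) + rng.normal(size=(6, 2)) * 0.5
loss = float(np.mean(np.sum((project(pts3d, K) - matched_2d) ** 2, axis=1)))
print(round(loss, 2))
```

优化形变时最小化该损失,即把形变“锚定”在匹配特征点上;论文在此之外还叠加了局部等距正则与渲染掩码对齐。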
[CV-58] Dual Distillation for Few-Shot Anomaly Detection ICLR2026
【速读】:该论文旨在解决医学图像中少样本异常检测(few-shot anomaly detection)问题,即在仅有少量正常参考图像的情况下,准确识别出未曾见过的病理异常。现有无监督方法通常依赖大量正常数据且难以跨解剖结构泛化,限制了其临床实用性。解决方案的关键在于提出D²4FAD框架,通过双蒸馏机制实现:利用预训练编码器作为教师网络提取支持集和查询集的多尺度特征,学生解码器则从教师网络中蒸馏查询图像的知识,并对支持图像进行自蒸馏;同时引入一个动态加权机制,根据查询图像内容自适应评估每个支持图像的参考价值,从而优化异常检测性能。该方法显著提升了在复杂多器官、多模态场景下的检测精度与泛化能力。
链接: https://arxiv.org/abs/2603.01713
作者: Le Dong,Qinzhong Tan,Chunlei Li,Jingliang Hu,Yilei Shi,Weisheng Dong,Xiao Xiang Zhu,Lichao Mou
机构: Xidian University (西安电子科技大学); MedAI Technology (Wuxi) Co. Ltd. (无锡医智科技有限公司); Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2026
Abstract:Anomaly detection is a critical task in computer vision with profound implications for medical imaging, where identifying pathologies early can directly impact patient outcomes. While recent unsupervised anomaly detection approaches show promise, they require substantial normal training data and struggle to generalize across anatomical contexts. We introduce D²4FAD, a novel dual distillation framework for few-shot anomaly detection that identifies anomalies in previously unseen tasks using only a small number of normal reference images. Our approach leverages a pre-trained encoder as a teacher network to extract multi-scale features from both support and query images, while a student decoder learns to distill knowledge from the teacher on query images and self-distill on support images. We further propose a learn-to-weight mechanism that dynamically assesses the reference value of each support image conditioned on the query, optimizing anomaly detection performance. To evaluate our method, we curate a comprehensive benchmark dataset comprising 13,084 images across four organs, four imaging modalities, and five disease categories. Extensive experiments demonstrate that D²4FAD significantly outperforms existing approaches, establishing a new state-of-the-art in few-shot medical anomaly detection. Code is available at this https URL.
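师生蒸馏类异常检测的共同思路是:学生只在正常数据上学会复现教师特征,推理时两者分歧大的位置即为异常。下面用余弦距离构造逐像素异常图作一个玩具示意(特征与“异常注入”均为合成假设,非 D²4FAD 实现):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 8, 8, 16
teacher = rng.normal(size=(H, W, C))                    # 冻结教师特征
student = teacher + rng.normal(size=(H, W, C)) * 0.1    # 学生在正常区域能较好复现

# 注入一个合成"异常":学生在该小块上无法复现教师
student[2:4, 2:4] += 2.0

def cos_dist(a, b):
    num = (a * b).sum(-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return 1 - num / den

anomaly_map = cos_dist(teacher, student)   # 学生与教师分歧越大,异常分越高
print(anomaly_map[2:4, 2:4].mean(), anomaly_map.mean())
```

论文在此骨架上叠加了 learn-to-weight 机制,按查询图像内容对各支持图像的贡献加权,这里未作展开。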
[CV-59] WhisperNet: A Scalable Solution for Bandwidth-Efficient Collaboration CVPR26
【速读】:该论文旨在解决自动驾驶中协作感知(collaborative perception)因通信带宽受限而效率低下的问题。现有方法如固定速率编码压缩特征图或基于空间选择的对象中心策略,分别存在环境适应性差和全局场景理解能力弱的缺陷。其解决方案的关键在于提出一种名为WhisperNet的带宽感知框架,采用接收端导向(receiver-centric)的新范式:发送端仅传输轻量级显著性元数据,接收端动态制定全局请求计划以分配各智能体与特征的贡献预算,从而仅检索最具信息量的特征;并通过协同特征路由模块对齐相关消息,确保融合后的结构一致性。该方法实现了高效且鲁棒的跨智能体协作感知,显著提升性能同时大幅降低通信开销。
链接: https://arxiv.org/abs/2603.01708
作者: Gong Chen,Chaokun Zhang,Xinyan Zhao
机构: Tianjin University (天津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR26
Abstract:Collaborative perception is vital for autonomous driving yet remains constrained by tight communication budgets. Earlier work reduced bandwidth by compressing full feature maps with fixed-rate encoders, which adapts poorly to a changing environment, and it further evolved into spatial selection methods that improve efficiency by focusing on salient regions, but this object-centric approach often sacrifices global context, weakening holistic scene understanding. To overcome these limitations, we introduce \textitWhisperNet, a bandwidth-aware framework that proposes a novel, receiver-centric paradigm for global coordination across agents. Senders generate lightweight saliency metadata, while the receiver formulates a global request plan that dynamically budgets feature contributions across agents and features, retrieving only the most informative features. A collaborative feature routing module then aligns related messages before fusion to ensure structural consistency. Extensive experiments show that WhisperNet achieves state-of-the-art performance, improving AP@0.7 on OPV2V by 2.4% with only 0.5% of the communication cost. As a plug-and-play component, it boosts strong baselines with merely 5% of full bandwidth while maintaining robustness under localization noise. These results demonstrate that globally-coordinated allocation across \textitwhat and \textitwhere to share is the key to achieving efficient collaborative perception.
[CV-60] Search Multilayer Perceptron-Based Fusion for Efficient and Accurate Siamese Tracking
【速读】:该论文旨在解决Siamese视觉追踪器在资源受限硬件上难以高效实现像素级交互的问题,从而缓解精度与效率之间的失衡。其解决方案的关键在于重新设计Siamese颈部结构,引入一种基于多层感知机(Multilayer Perception, MLP)的融合模块,该模块以最小的结构开销实现高效的像素级交互;同时,为克服MLP堆叠导致计算成本随通道宽度呈二次增长的问题,作者构建了一个分层搜索空间并提出定制化的松弛策略,使可微分神经架构搜索(Differentiable Neural Architecture Search, DNAS)能够解耦通道宽度优化与其他架构选择,自动平衡通道宽度与深度,最终获得低复杂度且性能优越的追踪模型。
链接: https://arxiv.org/abs/2603.01706
作者: Tianqi Shen,Huakao Lin,Ning An
机构: Institute of Mining Artificial Intelligence, Chinese Institute of Coal Science (煤炭科学研究总院采矿人工智能研究所); Department of Computer Science, City University of Hong Kong (香港城市大学计算机科学系); Image Processing Center, School of Astronautics, Beihang University (北京航空航天大学宇航学院图像处理中心); State Key Laboratory of Intelligent Coal Mining and Strata Control (智能采矿与地层控制国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 23 pages, 12 figures, 7 tables. This work was completed in 2024 and accepted for publication in IEEE TCDS (2026)
Abstract:Siamese visual trackers have recently advanced through increasingly sophisticated fusion mechanisms built on convolutional or Transformer architectures. However, both struggle to deliver pixel-level interactions efficiently on resource-constrained hardware, leading to a persistent accuracy-efficiency imbalance. Motivated by this limitation, we redesign the Siamese neck with a simple yet effective Multilayer Perception (MLP)-based fusion module that enables pixel-level interaction with minimal structural overhead. Nevertheless, naively stacking MLP blocks introduces a new challenge: computational cost can scale quadratically with channel width. To overcome this, we construct a hierarchical search space of carefully designed MLP modules and introduce a customized relaxation strategy that enables differentiable neural architecture search (DNAS) to decouple channel-width optimization from other architectural choices. This targeted decoupling automatically balances channel width and depth, yielding a low-complexity architecture. The resulting tracker achieves state-of-the-art accuracy-efficiency trade-offs. It ranks among the top performers on four general-purpose and three aerial tracking benchmarks, while maintaining real-time performance on both resource-constrained Graphics Processing Units (GPUs) and Neural Processing Units (NPUs).
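文中“定制化松弛策略”的基本思路与 DARTS 类似:对候选通道宽度做 softmax 连续松弛,使架构参数可随梯度优化;由于 MLP 代价随宽度近似二次增长,期望代价可显式写出。示意如下(候选宽度、架构参数与代价模型均为假设):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# 对候选通道宽度的连续松弛(DARTS 风格):
# 混合输出按 softmax(alpha) 加权,梯度可回传到架构参数 alpha
widths = [8, 16, 32]
alpha = np.array([0.1, 0.5, 0.2])                        # 可学习的架构参数
costs = np.array([w ** 2 for w in widths], dtype=float)  # MLP 代价随宽度二次增长

weights = softmax(alpha)
expected_cost = float(weights @ costs)   # 搜索过程中可被惩罚的期望计算代价
print(round(expected_cost, 1))
```

把宽度选择写成独立的一组 alpha,就实现了文中“通道宽度优化与其他架构选择解耦”的效果:宽度的期望代价可以单独正则化,而不牵动深度等其他决策。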
[CV-61] owards Principled Dataset Distillation: A Spectral Distribution Perspective
【速读】:该论文旨在解决现有数据集蒸馏(Dataset Distillation, DD)方法在长尾分布数据上性能显著下降的问题,其核心挑战在于两个方面:一是用于衡量分布差异的启发式设计选择不够合理,二是对不平衡类别采取了统一处理方式,忽视了尾部类别的特殊需求。解决方案的关键是提出类感知频谱分布匹配(Class-Aware Spectral Distribution Matching, CSDM),通过将样本映射到频率空间并利用良好行为核函数的谱特性重构分布对齐机制,从而定义出频谱分布距离(Spectral Distribution Distance, SDD)。进一步地,CSDM基于SDD的统一形式进行幅度-相位分解,自适应地提升尾部类别的真实性优先级,从而有效缓解类别不平衡问题,在CIFAR-10-LT数据集上实现显著性能提升与强鲁棒性。
链接: https://arxiv.org/abs/2603.01698
作者: Ruixi Wu,Shaobo Wang,Jiahuan Chen,Zhiyuan Liu,Yicun Yang,Zhaorun Chen,Zekai Li,Kaixin Li,Xinming Wang,Hongzhu Yi,Kai Wang,Linfeng Zhang
机构: EPIC Lab, SJTU; UChicago; NUS; CASIA; UCAS
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 30 pages, 5 tables, 4 figures
Abstract:Dataset distillation (DD) aims to compress large-scale datasets into compact synthetic counterparts for efficient model training. However, existing DD methods exhibit substantial performance degradation on long-tailed datasets. We identify two fundamental challenges: heuristic design choices for distribution discrepancy measure and uniform treatment of imbalanced classes. To address these limitations, we propose Class-Aware Spectral Distribution Matching (CSDM), which reformulates distribution alignment via the spectrum of a well-behaved kernel function. This technique maps the original samples into frequency space, resulting in the Spectral Distribution Distance (SDD). To mitigate class imbalance, we exploit the unified form of SDD to perform amplitude-phase decomposition, which adaptively prioritizes the realism in tail classes. On CIFAR-10-LT, with 10 images per class, CSDM achieves a 14.0% improvement over state-of-the-art DD methods, with only a 5.7% performance drop when the number of images in tail classes decreases from 500 to 25, demonstrating strong stability on long-tailed data.
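“通过良好行为核函数的谱把分布差异搬到频率空间”这一思路,可以用随机傅里叶特征(高斯核的谱为高斯分布)近似示意:两组样本的频率空间均值嵌入之差即一种谱分布距离(维度、频率数与采样方式均为假设,非论文 SDD 的精确定义):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 256                                 # 数据维度、频率采样数

def spectral_features(x, freqs):
    # 高斯核的随机傅里叶映射:cos/sin 投影给出显式的频率空间嵌入
    proj = x @ freqs.T
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1) / np.sqrt(freqs.shape[0])

freqs = rng.normal(size=(m, d))               # 从核的谱中采样的频率

def sdd(x, y, freqs):
    # 频率空间均值嵌入之差的平方范数(即一种 MMD 估计)
    return float(np.sum((spectral_features(x, freqs).mean(0)
                         - spectral_features(y, freqs).mean(0)) ** 2))

real = rng.normal(size=(500, d))
close = rng.normal(size=(500, d))             # 同分布样本
far = rng.normal(size=(500, d)) + 2.0         # 平移后的分布
print(sdd(real, close, freqs), sdd(real, far, freqs))
```

在此统一形式上,论文进一步对嵌入做幅度-相位分解,并按类别自适应加权以照顾尾部类,这里未作展开。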
[CV-62] Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning CVPR2026
【速读】:该论文旨在解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在生成图像描述时容易遗漏或错误表述关键视觉内容的问题,即信息损失问题。解决方案的关键在于提出一种名为跨模态身份映射(Cross-modal Identity Mapping, CIM)的强化学习框架,通过量化评估两个维度的信息损失:图库表示一致性(Gallery Representation Consistency)与查询-图库图像相关性(Query-gallery Image Relevance),从而引导LVLM实现从图像到文本的精准身份映射,且无需额外标注数据即可提升图像描述质量。
链接: https://arxiv.org/abs/2603.01696
作者: Haonan Jia,Shichao Dong,Xin Dong,Zenghui Sun,Jin Wang,Jinsong Lan,Xiaoyong Zhu,Bo Zheng,Kaifu Zhang
机构: Taobao Tmall Group of Alibaba(淘宝天猫集团); The University of Hong Kong(香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2026
Abstract:Large Vision-Language Models (LVLMs) often omit or misrepresent critical visual content in generated image captions. Minimizing such information loss will force LVLMs to focus on image details to generate precise descriptions. However, measuring information loss during modality conversion is inherently challenging due to the modal gap between visual content and text output. In this paper, we argue that the quality of an image caption is positively correlated with the similarity between images retrieved via text search using that caption. Based on this insight, we further propose Cross-modal Identity Mapping (CIM), a reinforcement learning framework that enhances image captioning without requiring additional annotations. Specifically, the method quantitatively evaluates the information loss from two perspectives: Gallery Representation Consistency and Query-gallery Image Relevance. Supervised under these metrics, LVLM minimizes information loss and aims to achieve identity mapping from images to captions. The experimental results demonstrate the superior performance of our method in image captioning, even when compared with Supervised Fine-Tuning. Particularly, on the COCO-LN500 benchmark, CIM achieves a 20% improvement in relation reasoning. The code will be released when the paper is accepted.
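文中的核心假设是:好的描述经文本检索取回的图像应与原查询图像更相似。下面用随机向量模拟嵌入空间,示意这种检索式奖励的计算方式(嵌入、维度与 top-k 取值均为假设,非 CIM 的实际指标):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_gallery, k = 8, 200, 5

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

query_img = rng.normal(size=(d,))           # 查询图像嵌入
gallery = rng.normal(size=(n_gallery, d))   # 图库图像嵌入

def caption_reward(caption_emb):
    # 用描述嵌入检索 top-k 图库图像,再以其与查询图像的平均余弦相似度作奖励
    sims = normalize(gallery) @ normalize(caption_emb)
    topk = np.argsort(sims)[-k:]
    return float((normalize(gallery[topk]) @ normalize(query_img)).mean())

good_caption = query_img + rng.normal(size=d) * 0.05   # 信息几乎无损的描述
bad_caption = rng.normal(size=(d,))                    # 遗漏关键内容的描述
r_good = caption_reward(good_caption)
r_bad = caption_reward(bad_caption)
print(r_good, r_bad)
```

这样的奖励信号无需额外人工标注,只依赖检索一致性,便于作为强化学习目标驱动 LVLM 关注图像细节。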
[CV-63] MVR: Multi-view Video Reward Shaping for Reinforcement Learning ICLR2026
【速读】:该论文旨在解决强化学习中奖励设计的挑战,特别是针对视觉反馈任务中存在的三个关键问题:(1)现有基于视觉语言模型(Vision-Language Models, VLMs)的奖励增强方法通常线性叠加VLM得分与任务奖励,未进行显式策略引导,可能破坏最优策略;(2)依赖单一静态图像的奖励机制难以建模涉及复杂动态运动的任务;(3)单视角观测易导致关键行为特征被遮挡。解决方案的核心是提出多视角视频奖励塑造(Multi-View Video Reward Shaping, MVR)框架,其关键在于利用多视角视频捕捉的状态序列,通过冻结预训练VLM计算视频-文本相似度,学习一个状态相关性函数以缓解图像方法对特定静态姿态的偏差;同时引入状态依赖的奖励塑造机制,在任务目标达成后自动降低VLM引导权重,实现任务奖励与VLM语义信息的自适应融合。
链接: https://arxiv.org/abs/2603.01694
作者: Lirui Luo,Guoxi Zhang,Hongming Xu,Yaodong Yang,Cong Fang,Qing Li
机构: Peking University (北京大学); BIGAI (通用人工智能国家实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICLR 2026
Abstract:Reward design is of great importance for solving complex tasks with reinforcement learning. Recent studies have explored using image-text similarity produced by vision-language models (VLMs) to augment rewards of a task with visual feedback. A common practice linearly adds VLM scores to task or success rewards without explicit shaping, potentially altering the optimal policy. Moreover, such approaches, often relying on single static images, struggle with tasks whose desired behavior involves complex, dynamic motions spanning multiple visually different states. Furthermore, single viewpoints can occlude critical aspects of an agent’s behavior. To address these issues, this paper presents Multi-View Video Reward Shaping (MVR), a framework that models the relevance of states regarding the target task using videos captured from multiple viewpoints. MVR leverages video-text similarity from a frozen pre-trained VLM to learn a state relevance function that mitigates the bias towards specific static poses inherent in image-based methods. Additionally, we introduce a state-dependent reward shaping formulation that integrates task-specific rewards and VLM-based guidance, automatically reducing the influence of VLM guidance once the desired motion pattern is achieved. We confirm the efficacy of the proposed framework with extensive experiments on challenging humanoid locomotion tasks from HumanoidBench and manipulation tasks from MetaWorld, verifying the design choices through ablation studies.
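“状态相关的奖励塑造”可以用一个极简公式示意:VLM 引导项的权重随任务进度衰减,任务完成后塑造项自动消失(指数衰减形式与参数均为假设,论文的精确公式可能不同):

```python
import numpy as np

def shaped_reward(task_reward, vlm_score, progress, beta=5.0):
    # 状态相关权重:任务进度 progress -> 1 时,VLM 引导项衰减到近似 0
    w = np.exp(-beta * progress)
    return float(task_reward + w * vlm_score)

# 训练早期 VLM 项主导,接近成功时塑造项几乎消失
early = shaped_reward(task_reward=0.0, vlm_score=0.8, progress=0.05)
late = shaped_reward(task_reward=1.0, vlm_score=0.8, progress=0.95)
print(round(early, 3), round(late, 3))
```

与把 VLM 分数线性加到任务奖励上不同,这种衰减保证最终的最优策略仍由任务奖励决定,这正是摘要批评“线性叠加可能改变最优策略”的出发点。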
[CV-64] CoopDiff: A Diffusion-Guided Approach for Cooperation under Corruptions CVPR26
【速读】:该论文旨在解决多智能体协同感知(cooperative perception)在真实场景中因环境与传感器级退化(如雾霾、雨雪、传感器噪声等)导致的鲁棒性不足和泛化能力下降的问题。其核心解决方案是提出CoopDiff框架,关键在于引入基于扩散模型(diffusion model)的去噪机制:采用教师-学生架构,其中质量感知教师(Quality-Aware Teacher)通过体素级早期融合与兴趣质量加权生成干净监督特征,而双分支扩散学生(Dual-Branch Diffusion Student)在编码阶段分离自车与协作信息以重建教师输出,并借助自车引导的交叉注意力机制(Ego-Guided Cross-Attention)在解码阶段动态融合特征,从而在多种退化条件下实现稳定且高效的协同感知性能。
链接: https://arxiv.org/abs/2603.01688
作者: Gong Chen,Chaokun Zhang,Pengcheng Lv
机构: Tianjin University (天津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR26
Abstract:Cooperative perception lets agents share information to expand coverage and improve scene understanding. However, in real-world scenarios, diverse and unpredictable corruptions undermine its robustness and generalization. To address these challenges, we introduce CoopDiff, a diffusion-based cooperative perception framework that mitigates corruptions via a denoising mechanism. CoopDiff adopts a teacher-student paradigm: the Quality-Aware Teacher performs voxel-level early fusion with Quality of Interest weighting and semantic guidance, then produces clean supervision features via a diffusion denoiser. The Dual-Branch Diffusion Student first separates ego and cooperative streams in encoding to reconstruct the teacher’s clean targets. And then, an Ego-Guided Cross-Attention mechanism facilitates balanced decoding under degradation by adaptively integrating ego and cooperative features. We evaluate CoopDiff on two constructed multi-degradation benchmarks, OPV2Vn and DAIR-V2Xn, each incorporating six corruption types, including environmental and sensor-level distortions. Benefiting from the inherent denoising properties of diffusion, CoopDiff consistently outperforms prior methods across all degradation types and lowers the relative corruption error. Furthermore, it offers a tunable balance between precision and inference efficiency.
[CV-65] DiffusionXRay: A Diffusion and GAN-Based Approach for Enhancing Digitally Reconstructed Chest Radiographs MICCAI2025
【速读】:该论文旨在解决深度学习模型在肺癌自动诊断中因高质量标注数据稀缺而导致的性能受限问题,尤其是针对细微肺结节难以被准确标注的挑战。其解决方案的关键在于提出了一种名为DiffusionXRay的新型图像恢复流程,该流程融合了去噪扩散概率模型(Denoising Diffusion Probabilistic Models, DDPMs)与生成对抗网络(Generative Adversarial Networks, GANs),采用两阶段训练策略:首先通过DDPM-LQ和基于MUNIT的GAN方法生成低质量胸部X线图像以缓解数据稀缺问题;随后利用配对的高低质量图像训练一个DDPM模型,使其能够学习并还原X射线图像中的细节结构与临床特征,从而显著提升图像清晰度、对比度及诊断价值,同时保留细微但重要的影像学征象。
链接: https://arxiv.org/abs/2603.01686
作者: Aryan Goyal,Ashish Mittal,Pranav Rao,Manoj Tadepalli,Preetham Putha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published at MICCAI 2025
Abstract:Deep learning-based automated diagnosis of lung cancer has emerged as a crucial advancement that enables healthcare professionals to detect and initiate treatment earlier. However, these models require extensive training datasets with diverse case-specific properties. High-quality annotated data is particularly challenging to obtain, especially for cases with subtle pulmonary nodules that are difficult to detect even for experienced radiologists. This scarcity of well-labeled datasets can limit model performance and generalization across different patient populations. Digitally reconstructed radiographs (DRR) using CT-Scan to generate synthetic frontal chest X-rays with artificially inserted lung nodules offers one potential solution. However, this approach suffers from significant image quality degradation, particularly in the form of blurred anatomical features and loss of fine lung field structures. To overcome this, we introduce DiffusionXRay, a novel image restoration pipeline for Chest X-ray images that synergistically leverages denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs). DiffusionXRay incorporates a unique two-stage training process: First, we investigate two independent approaches, DDPM-LQ and GAN-based MUNIT-LQ, to generate low-quality CXRs, addressing the challenge of training data scarcity, posing this as a style transfer problem. Subsequently, we train a DDPM-based model on paired low-quality and high-quality images, enabling it to learn the nuances of X-ray image restoration. Our method demonstrates promising results in enhancing image clarity, contrast, and overall diagnostic value of chest X-rays while preserving subtle yet clinically significant artifacts, validated by both quantitative metrics and expert radiological assessment.
[CV-66] FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters CVPR2026
【速读】:该论文旨在解决当前生成式视频模型(如Hunyuan、WanX等)在实际部署中因计算开销巨大而受限的问题,其根源在于模型参数量庞大及推理时所需的迭代多步采样过程。为应对这一挑战,作者提出FastLightGen算法,其核心创新在于构建一个协同蒸馏框架,在该框架下设计最优教师模型以最大化学生模型的性能,从而同时压缩模型规模与减少采样步骤。实验表明,在受限推理预算下,采用4步采样和30%参数剪枝的生成器可实现最佳视觉质量,且FastLightGen在效率与效果上均优于现有方法,树立了高效视频生成的新基准。
链接: https://arxiv.org/abs/2603.01685
作者: Shao Shitong,Gu Yufei,Xie Zeke
机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:The recent advent of powerful video generation models, such as Hunyuan, WanX, Veo3, and Kling, has inaugurated a new era in the field. However, the practical deployment of these models is severely impeded by their substantial computational overhead, which stems from enormous parameter counts and the iterative, multi-step sampling process required during inference. Prior research on accelerating generative models has predominantly followed two distinct trajectories: reducing the number of sampling steps (e.g., LCM, DMD, and MagicDistillation) or compressing the model size for more efficient inference (e.g., ICMD). The potential of simultaneously compressing both to create a fast and lightweight model remains an unexplored avenue. In this paper, we propose FastLightGen, an algorithm that transforms large, computationally expensive models into fast, lightweight counterparts. The core idea is to construct an optimal teacher model, one engineered to maximize student performance, within a synergistic framework for distilling both model size and inference steps. Our extensive experiments on HunyuanVideo-ATI2V and WanX-TI2V reveal that a generator using 4-step sampling and 30% parameter pruning achieves optimal visual quality under a constrained inference budget. Furthermore, FastLightGen consistently outperforms all competing methods, establishing a new state-of-the-art in efficient video generation.
[CV-67] A Diffusion-Driven Fine-Grained Nodule Synthesis Framework for Enhanced Lung Nodule Detection from Chest Radiographs
【速读】:该论文旨在解决胸部X光片(CXR)中肺结节早期检测的挑战,尤其是由于结节形态细微且具有多样性的放射学特征(如大小、纹理和边界),导致深度学习辅助诊断(CAD)系统在训练数据不足时性能受限的问题。为缓解数据稀缺问题,论文提出了一种基于扩散模型并结合低秩适配(LoRA)的可控结节生成框架,其核心创新在于通过分层控制机制实现对结节多个放射学特征的精细调控:首先利用结节掩膜条件训练基础扩散模型以控制大小与形状;随后为每个特定特征(如纹理或边界清晰度)独立训练LoRA模块,并通过引入正交性损失项优化LoRA组合策略,从而有效解决特征间注意力区域重叠和参数空间非正交带来的干扰问题,最终实现高保真、可解释且多样化的真实感合成结节数据,显著提升下游结节检测性能。
链接: https://arxiv.org/abs/2603.01659
作者: Aryan Goyal,Shreshtha Singh,Ashish Mittal,Manoj Tadepalli,Piyush Kumar,Preetham Putha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at MIDL 2026 (Poster). Published on OpenReview on February 14, 2026. Proceedings version pending. OpenReview: this https URL
Abstract:Early detection of lung cancer in chest radiographs (CXRs) is crucial for improving patient outcomes, yet nodule detection remains challenging due to their subtle appearance and variability in radiological characteristics like size, texture, and boundary. For robust analysis, this diversity must be well represented in training datasets for deep learning based Computer-Assisted Diagnosis (CAD) systems. However, assembling such datasets is costly and often impractical, motivating the need for realistic synthetic data generation. Existing methods lack fine-grained control over synthetic nodule generation, limiting their utility in addressing data scarcity. This paper proposes a novel diffusion-based framework with low-rank adaptation (LoRA) adapters for characteristic controlled nodule synthesis on CXRs. We begin by addressing size and shape control through nodule mask conditioned training of the base diffusion model. To achieve individual characteristic control, we train separate LoRA modules, each dedicated to a specific radiological feature. However, since nodules rarely exhibit isolated characteristics, effective multi-characteristic control requires a balanced integration of features. We address this by leveraging the dynamic composability of LoRAs and revisiting existing merging strategies. Building on this, we identify two key issues, overlapping attention regions and non-orthogonal parameter spaces. To overcome these limitations, we introduce a novel orthogonality loss term during LoRA composition training. Extensive experiments on both in-house and public datasets demonstrate improved downstream nodule detection. Radiologist evaluations confirm the fine-grained controllability of our generated nodules, and across multiple quantitative metrics, our method surpasses existing nodule generation approaches for CXRs.
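摘要提到在 LoRA 组合训练时引入正交性损失,以缓解各特征 LoRA 参数子空间非正交带来的干扰。一个常见做法(此处仅为假设性示意,非论文原始定义)是惩罚各 LoRA 下投影矩阵之间的两两交叉 Gram 范数:

```python
import numpy as np

def lora_orthogonality_loss(down_mats):
    """假设性示意: 计算 sum_{i<j} ||A_i A_j^T||_F^2。
    该值为 0 时, 各 LoRA 下投影矩阵的行空间两两正交。
    down_mats: 形如 (r_i, d) 的数组列表。"""
    loss = 0.0
    for i in range(len(down_mats)):
        for j in range(i + 1, len(down_mats)):
            cross = down_mats[i] @ down_mats[j].T  # (r_i, r_j) 交叉 Gram 矩阵
            loss += float(np.sum(cross ** 2))      # Frobenius 范数平方
    return loss
```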
[CV-68] PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts CVPR2026
【速读】:该论文旨在解决当前零样本立体匹配方法中迭代细化阶段研究不足的问题,尤其是在利用单目深度基础模型(monocular depth foundation models)进行迭代优化时,传统基于GRU的架构因表示能力有限难以有效挖掘深度先验信息。解决方案的关键在于提出一种名为Prompt Recurrent Unit (PRU) 的新型迭代细化模块,该模块基于单目深度基础模型的解码器结构,并将单目结构与立体运动线索作为提示(prompt)引入,从而在保留原有单目深度先验的同时,向潜在表示中注入绝对立体尺度信息,实现更有效的零样本泛化性能。
链接: https://arxiv.org/abs/2603.01650
作者: Xianqi Wang,Hao Yang,Hangtian Wang,Junda Cheng,Gangwei Xu,Min Lin,Xin Yang
机构: Huazhong University of Science and Technology (华中科技大学); Optics Valley Laboratory (光谷实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Modern stereo matching methods have leveraged monocular depth foundation models to achieve superior zero-shot generalization performance. However, most existing methods primarily focus on extracting robust features for cost volume construction or disparity initialization. At the same time, the iterative refinement stage, which is also crucial for zero-shot generalization, remains underexplored. Some methods treat monocular depth priors as guidance for iteration, but conventional GRU-based architectures struggle to exploit them due to the limited representation capacity. In this paper, we propose Prompt Recurrent Unit (PRU), a novel iterative refinement module based on the decoder of monocular depth foundation models. By integrating monocular structure and stereo motion cues as prompts into the decoder, PRU enriches the latent representations of monocular depth foundation models with absolute stereo-scale information while preserving their inherent monocular depth priors. Experiments demonstrate that our PromptStereo achieves state-of-the-art zero-shot generalization performance across multiple datasets, while maintaining comparable or faster inference speed. Our findings highlight prompt-guided iterative refinement as a promising direction for zero-shot stereo matching.
[CV-69] QCAgent: An agentic framework for quality-controllable pathology report generation from whole slide image
【速读】:该论文旨在解决当前基于全切片图像(Whole-Slide Image, WSI)的病理报告生成方法无法将细粒度诊断描述与局部视觉证据对齐,且缺乏对报告内容细节的可控性及验证机制的问题。其解决方案的关键在于提出QCAgent框架,该框架通过引入由用户定义检查清单驱动的定制化批判机制(customized critique mechanism),实现对报告内容的约束感知和可控生成;同时结合文本-图像片段语义检索(text-patch semantic retrieval)技术,迭代识别信息丰富的区域并重构报告,从而在保证临床意义的同时提升报告覆盖度与可验证性。
链接: https://arxiv.org/abs/2603.01647
作者: Rundong Wang,Wei Ba,Ying Zhou,Yingtai Li,Bowen Liu,Baizhi Wang,Yuhao Wang,Zhidong Yang,Kun Zhang,Rui Yan,S. Kevin Zhou
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent methods for pathology report generation from whole-slide image (WSI) are capable of producing slide-level diagnostic descriptions but fail to ground fine-grained statements in localized visual evidence. Furthermore, they lack control over which diagnostic details to include and how to verify them. Inspired by emerging agentic analysis paradigms and the diagnostic workflow of pathologists, who selectively examine multiple fields of view, we propose QCAgent, an agentic framework for quality-controllable WSI report generation. The core innovations of this framework are as follows: (i) it incorporates a customized critique mechanism guided by a user-defined checklist specifying required diagnostic details and constraints; (ii) it re-identifies informative regions in the WSI based on the critique feedback and text-patch semantic retrieval, a process that iteratively enriches and reconciles the report. Experiments demonstrate that by making report requirements explicitly prompt-defined, constraint-aware, and verifiable through evidence-grounded refinement, QCAgent enables controllable generation of clinically meaningful and high-coverage pathology reports from WSI.
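QCAgent 的"检查清单驱动批判"机制,其最朴素的形式可以理解为逐项核对报告是否覆盖了用户要求的诊断细节。下面是基于关键词匹配的假设性示意(实际系统依赖多模态模型与语义检索,而非字符串匹配):

```python
def critique_report(report: str, checklist: list) -> list:
    """假设性示意: 返回报告中未覆盖的检查清单条目,
    作为下一轮迭代修订(重新检索信息区域、补充报告)的反馈。"""
    text = report.lower()
    return [item for item in checklist if item.lower() not in text]
```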
[CV-70] MSP-ReID: Hairstyle-Robust Cloth-Changing Person Re-Identification ICASSP2026
【速读】:该论文针对衣物变化下的行人重识别(Cloth-Changing Person Re-Identification, CC-ReID)问题,旨在提升模型在不同服装条件下对同一个体的匹配准确性。现有方法通常通过去除衣物信息并聚焦头部区域来减少服装干扰,但因将头部视为整体而未区分面部与头发,导致模型过度依赖易变的发型特征,从而在发型变化时性能下降。解决方案的关键在于提出Mitigating Hairstyle Distraction and Structural Preservation (MSP)框架:其核心包括三个模块——Hairstyle-Oriented Augmentation (HSOA) 用于生成同身份下的发型多样性以降低对发型的依赖;Cloth-Preserved Random Erasing (CPRE) 在衣物区域内进行比例可控的随机擦除,抑制纹理偏差同时保留身体结构信息;以及Region-based Parsing Attention (RPA),利用语义分割先验增强面部和肢体区域注意力,抑制头发特征干扰。该方案显著提升了CC-ReID任务的鲁棒性和实用性。
链接: https://arxiv.org/abs/2603.01640
作者: Xiangyang He,Lin Wan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 3 figures. Accepted to the 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2026)
Abstract:Cloth-Changing Person Re-Identification (CC-ReID) aims to match the same individual across cameras under varying clothing conditions. Existing approaches often remove apparel and focus on the head region to reduce clothing bias. However, treating the head holistically without distinguishing between face and hair leads to over-reliance on volatile hairstyle cues, causing performance degradation under hairstyle changes. To address this issue, we propose the Mitigating Hairstyle Distraction and Structural Preservation (MSP) framework. Specifically, MSP introduces Hairstyle-Oriented Augmentation (HSOA), which generates intra-identity hairstyle diversity to reduce hairstyle dependence and enhance attention to stable facial and body cues. To prevent the loss of structural information, we design Cloth-Preserved Random Erasing (CPRE), which performs ratio-controlled erasing within clothing regions to suppress texture bias while retaining body shape and context. Furthermore, we employ Region-based Parsing Attention (RPA) to incorporate parsing-guided priors that highlight face and limb regions while suppressing hair features. Extensive experiments on multiple CC-ReID benchmarks demonstrate that MSP achieves state-of-the-art performance, providing a robust and practical solution for long-term person re-identification.
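摘要中的 Cloth-Preserved Random Erasing(CPRE)"在衣物区域内进行比例可控的随机擦除",其核心操作可示意如下(掩膜来源、填充值等细节为假设):

```python
import numpy as np

def cloth_preserved_random_erasing(img, cloth_mask, ratio=0.3, seed=0):
    """假设性示意: 仅在 cloth_mask==1 的衣物区域内随机擦除给定比例的像素,
    抑制纹理偏置, 同时不触碰身体轮廓等结构区域。img: (H, W, C)。"""
    rng = np.random.default_rng(seed)
    out = img.copy()
    ys, xs = np.nonzero(cloth_mask)
    n_erase = int(len(ys) * ratio)              # 比例可控的擦除数量
    idx = rng.choice(len(ys), size=n_erase, replace=False)
    out[ys[idx], xs[idx]] = 0.0                 # 擦除为 0, 也可替换为随机值
    return out
```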
[CV-71] DriveCombo: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂交通规则理解与执行能力上的不足,尤其是在多规则并发和冲突场景下表现不佳的问题。现有基准测试主要聚焦于单一规则任务(如交通标志识别),难以评估模型对真实驾驶环境中复杂规则组合及冲突处理的能力。为此,作者提出DriveCombo——一个基于文本与视觉的组合式交通规则推理基准,并构建了五级认知阶梯(Five-Level Cognitive Ladder)以系统性地量化评估从单规则理解到多规则整合与冲突解决的推理能力;其关键创新在于引入Rule2Scene Agent,通过规则构造与场景生成将语言驱动的交通规则映射至动态驾驶场景,从而实现场景级的交通规则视觉推理,显著提升了模型在复杂交通情境下的合规性和智能决策能力。
链接: https://arxiv.org/abs/2603.01637
作者: Enhui Ma,Jiahuan Zhang,Guantian Zheng,Tao Tang,Shengbo Eben Li,Yuhang Lu,Xia Zhou,Xueyang Zhang,Yifei Zhan,Kun Zhan,Zhihui Hao,Xianpeng Lang,Kaicheng Yu
机构: Autolab, Westlake University (西湖大学自动驾驶实验室); Li Auto Inc (理想汽车); Tsinghua University (清华大学); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal Large Language Models (MLLMs) are rapidly becoming the intelligence brain of end-to-end autonomous driving systems. A key challenge is to assess whether MLLMs can truly understand and follow complex real-world traffic rules. However, existing benchmarks mainly focus on single-rule scenarios like traffic sign recognition, neglecting the complexity of multi-rule concurrency and conflicts in real driving. Consequently, models perform well on simple tasks but often fail or violate rules in real world complex situations. To bridge this gap, we propose DriveCombo, a text and vision-based benchmark for compositional traffic rule reasoning. Inspired by human drivers’ cognitive development, we propose a systematic Five-Level Cognitive Ladder that evaluates reasoning from single-rule understanding to multi-rule integration and conflict resolution, enabling quantitative assessment across cognitive stages. We further propose a Rule2Scene Agent that maps language-based traffic rules to dynamic driving scenes through rule crafting and scene generation, enabling scene-level traffic rule visual reasoning. Evaluations of 14 mainstream MLLMs reveal performance drops as task complexity grows, particularly during rule conflicts. After splitting the dataset and fine-tuning on the training set, we further observe substantial improvements in both traffic rule reasoning and downstream planning capabilities. These results highlight the effectiveness of DriveCombo in advancing compliant and intelligent autonomous driving systems.
[CV-72] Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration CVPR2026
【速读】:该论文旨在解决扩散模型(Diffusion Models)在推理阶段因迭代次数过多而导致的计算效率低下问题,特别是现有基于局部近似特征缓存与重用的方法在大步长跳跃时误差迅速累积、导致生成质量显著下降的问题。其解决方案的关键在于提出一种无需训练的谱特征预测方法(Spectral Diffusion Feature Forecaster, Spectrum),通过将去噪器的潜在特征视为时间函数,并利用切比雪夫多项式(Chebyshev Polynomials)进行全局建模,结合岭回归(Ridge Regression)拟合基函数系数,从而实现多步未来特征的精准预测。该方法理论上具有非累积误差特性,支持长距离特征重用,显著提升推理速度的同时保持高质量输出。
链接: https://arxiv.org/abs/2603.01623
作者: Jiaqi Han,Juntong Shi,Puheng Li,Haotian Ye,Qiushan Guo,Stefano Ermon
机构: Stanford University (斯坦福大学); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: CVPR 2026
Abstract:Diffusion models have become the dominant tool for high-fidelity image and video generation, yet are critically bottlenecked by their inference speed due to the numerous iterative passes of Diffusion Transformers. To reduce the exhaustive compute, recent works resort to the feature caching and reusing scheme that skips network evaluations at selected diffusion steps by using cached features in previous steps. However, their preliminary design solely relies on local approximation, causing errors to grow rapidly with large skips and leading to degraded sample quality at high speedups. In this work, we propose spectral diffusion feature forecaster (Spectrum), a training-free approach that enables global, long-range feature reuse with tightly controlled error. In particular, we view the latent features of the denoiser as functions over time and approximate them with Chebyshev polynomials. Specifically, we fit the coefficient for each basis via ridge regression, which is then leveraged to forecast features at multiple future diffusion steps. We theoretically reveal that our approach admits more favorable long-horizon behavior and yields an error bound that does not compound with the step size. Extensive experiments on various state-of-the-art image and video diffusion models consistently verify the superiority of our approach. Notably, we achieve up to 4.79 \times speedup on FLUX.1 and 4.67 \times speedup on Wan2.1-14B, while maintaining much higher sample quality compared with the baselines.
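Spectrum 的核心是把去噪器特征视作时间函数,用切比雪夫基拟合并向未来步外推。下面用岭回归闭式解给出一个一维示意(基阶数、正则系数等均为假设,非论文原始设定):

```python
import numpy as np

def fit_chebyshev_ridge(ts, feats, degree=3, lam=1e-6):
    """假设性示意: ts 为归一化到 [-1,1] 的扩散时间点 (N,),
    feats 为各步缓存特征 (N, D)。返回岭回归拟合的切比雪夫系数 (degree+1, D)。"""
    basis = np.polynomial.chebyshev.chebvander(ts, degree)  # (N, degree+1)
    gram = basis.T @ basis + lam * np.eye(degree + 1)
    return np.linalg.solve(gram, basis.T @ feats)           # 岭回归闭式解

def forecast(coef, t_future):
    """在未来时间点外推特征, 以跳过相应步的网络评估。"""
    basis = np.polynomial.chebyshev.chebvander(
        np.atleast_1d(t_future), coef.shape[0] - 1)
    return basis @ coef
```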
[CV-73] Coarse-to-Fine Monocular Re-Localization in OpenStreetMap via Semantic Alignment
【速读】:该论文旨在解决单目重定位(monocular re-localization)中传统方法依赖密集地图所面临的可扩展性限制与隐私风险问题,同时克服开放街道地图(OpenStreetMap, OSM)在实际应用中因自然图像与OSM之间存在跨模态差异以及全局地图匹配计算成本过高所带来的挑战。其解决方案的关键在于提出一种基于语义对齐的分层搜索框架:首先利用DINO-ViT模型的语义感知能力,将视觉元素解构并建立与OSM的语义关联;其次设计粗粒度到细粒度的搜索范式,替代传统的全局稠密匹配策略,实现高效渐进式优化。该方法显著提升了定位精度与速度,在单一数据集训练下,3°方位召回率已超越当前最优方法的5°召回表现。
链接: https://arxiv.org/abs/2603.01613
作者: Yuchen Zou,Xiao Hu,Dexing Zhong,Yuqing Tang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 4 figures
Abstract:Monocular re-localization plays a crucial role in enabling intelligent agents to achieve human-like perception. However, traditional methods rely on dense maps, which face scalability limitations and privacy risks. OpenStreetMap (OSM), as a lightweight map that protects privacy, offers semantic and geometric information with global scalability. Nonetheless, there are still challenges in using OSM for localization: the inherent cross-modal discrepancies between natural images and OSM, as well as the high computational cost of global map-based localization. In this paper, we propose a hierarchical search framework with semantic alignment for localization in OSM. First, the semantic awareness capability of DINO-ViT is utilised to deconstruct visual elements to establish semantic relationships with OSM. Second, a coarse-to-fine search paradigm is designed to replace global dense matching, enabling efficient progressive refinement. Extensive experiments demonstrate that our method significantly improves both localization accuracy and speed. When trained on a single dataset, the 3° orientation recall of our method even outperforms the 5° recall of state-of-the-art methods.
[CV-74] What Helps – and What Hurts: Bidirectional Explanations for Vision Transformers PAKDD2026
【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)模型决策过程缺乏可解释性的问题。现有基于类激活映射(Class Activation Mapping, CAM)的方法通常仅关注支持性(正向)贡献,忽略了抑制性(负向)信号,导致解释不完整。解决方案的关键在于提出BiCAM(双向类激活映射),它通过保留带符号的归因信息,同时捕捉正向和负向贡献,从而生成更全面且具有对比性的解释;此外,BiCAM引入正向到负向比值(Positive-to-Negative Ratio, PNR)作为轻量级指标,在无需重新训练的情况下实现对抗样本检测,提升了模型解释的完整性与实用性。
链接: https://arxiv.org/abs/2603.01605
作者: Qin Su,Tie Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: PAKDD 2026: The 30th Pacific-Asia Conference on Knowledge Discovery and Data Mining
Abstract:Vision Transformers (ViTs) achieve strong performance in visual recognition, yet their decision-making remains difficult to interpret. We propose BiCAM, a bidirectional class activation mapping method that captures both supportive (positive) and suppressive (negative) contributions to model predictions. Unlike prior CAM-based approaches that discard negative signals, BiCAM preserves signed attributions to produce more complete and contrastive explanations. BiCAM further introduces a Positive-to-Negative Ratio (PNR) that summarizes attribution balance and enables lightweight detection of adversarial examples without retraining. Across ImageNet, VOC, and COCO, BiCAM improves localization and faithfulness while remaining computationally efficient. It generalizes to multiple ViT variants, including DeiT and Swin. These results suggest the importance of modeling both supportive and suppressive evidence for interpreting transformer-based vision models.
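BiCAM 的正向到负向比值(PNR)可由带符号归因图直接汇总得到。下列定义仅为示意(eps 防零分母为假设细节):

```python
import numpy as np

def positive_to_negative_ratio(signed_attr, eps=1e-8):
    """假设性示意: PNR = 正贡献之和 / (负贡献绝对值之和 + eps)。
    正常样本与对抗样本的 PNR 分布差异可用于免重训练的轻量级对抗检测。"""
    pos = signed_attr[signed_attr > 0].sum()
    neg = -signed_attr[signed_attr < 0].sum()
    return float(pos / (neg + eps))
```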
[CV-75] Sparse View Distractor-Free Gaussian Splatting
【速读】:该论文旨在解决稀疏视角条件下无干扰的3D高斯溅射(distractor-free 3D Gaussian Splatting, 3DGS)方法性能显著下降的问题。其核心挑战在于,传统方法依赖颜色残差启发式策略来指导训练,在观测数据稀疏时变得不可靠。解决方案的关键在于引入丰富的先验信息:首先利用几何基础模型VGGT估计相机参数并生成密集初始3D点云;其次借助VGGT的注意力图实现高效且精确的语义实体匹配;此外,通过视觉-语言模型(Vision-Language Models, VLMs)识别并保留场景中的大尺度静态区域。这些先验信息可无缝集成至现有无干扰3DGS框架中,从而显著提升在稀疏输入下的鲁棒性和重建质量。
链接: https://arxiv.org/abs/2603.01603
作者: Yi Gu,Zhaorui Wang,Jiahang Cao,Jiaxu Wang,Mingle Zhao,Dongjun Ye,Renjing Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D Gaussian Splatting (3DGS) enables efficient training and fast novel view synthesis in static environments. To address challenges posed by transient objects, distractor-free 3DGS methods have emerged and shown promising results when dense image captures are available. However, their performance degrades significantly under sparse input conditions. This limitation primarily stems from the reliance on the color residual heuristics to guide the training, which becomes unreliable with limited observations. In this work, we propose a framework to enhance distractor-free 3DGS under sparse-view conditions by incorporating rich prior information. Specifically, we first adopt the geometry foundation model VGGT to estimate camera parameters and generate a dense set of initial 3D points. Then, we harness the attention maps from VGGT for efficient and accurate semantic entity matching. Additionally, we utilize Vision-Language Models (VLMs) to further identify and preserve the large static regions in the scene. We also demonstrate how these priors can be seamlessly integrated into existing distractor-free 3DGS methods. Extensive experiments confirm the effectiveness and robustness of our approach in mitigating transient distractors for sparse-view 3DGS training.
[CV-76] YCDa: YCbCr Decoupled Attention for Real-time Realistic Camouflaged Object Detection
【速读】:该论文旨在解决实时目标检测中对伪装目标(camouflaged objects)感知能力不足的问题,特别是在颜色线索不可靠的复杂视觉环境中,传统检测器易受误导性颜色噪声干扰,导致性能下降。解决方案的关键在于提出一种名为YCDa(Chrominance-Luminance Decoupling and Dynamic Attention)的早期特征处理策略,其核心机制是:在输入阶段将颜色(chrominance)与亮度/纹理(luminance)信息解耦,并通过动态通道注意力机制自适应地增强判别性特征、抑制误导性颜色噪声,从而提升模型对伪装目标的鲁棒性识别能力。该方法具有即插即用特性,仅需替换首层下采样模块即可集成至现有检测框架,且计算开销极低,实验证明其在COD10K-D等数据集上显著提升了mAP指标,实现了实时伪装目标检测的新SOTA性能。
链接: https://arxiv.org/abs/2603.01602
作者: PeiHuang Zheng,Yunlong Zhao,Zheng Cui,Yang Li
机构: Nanjing University of Aeronautics and Astronautics (南京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages,6 figures
Abstract:Human vision exhibits remarkable adaptability in perceiving objects under camouflage. When color cues become unreliable, the visual system instinctively shifts its reliance from chrominance (color) to luminance (brightness and texture), enabling more robust perception in visually confusing environments. Drawing inspiration from this biological mechanism, we propose YCDa, an efficient early-stage feature processing strategy that embeds this “chrominance-luminance decoupling and dynamic attention” principle into modern real-time detectors. Specifically, YCDa separates color and luminance information in the input stage and dynamically allocates attention across channels to amplify discriminative cues while suppressing misleading color noise. The strategy is plug-and-play and can be integrated into existing detectors by simply replacing the first downsampling layer. Extensive experiments on multiple baselines demonstrate that YCDa consistently improves performance with negligible overhead as shown in the figure. Notably, YCDa-YOLO12s achieves a 112% improvement in mAP over the baseline on COD10K-D and sets new state-of-the-art results for real-time camouflaged object detection across COD-D datasets.
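YCDa 的前置步骤——把输入从 RGB 解耦为亮度与色度通道——对应标准的 BT.601 颜色空间变换;其后的动态通道注意力可理解为按通道的自适应门控(下面的 softmax 门控形式仅为假设性示意,非论文原始结构):

```python
import numpy as np

def rgb_to_ycbcr(img):
    """BT.601 全范围 RGB -> YCbCr, 分离亮度 Y 与色度 Cb/Cr。img: (H, W, 3), 取值 [0,1]。"""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.5 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 0.5 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return np.stack([y, cb, cr], axis=-1)

def dynamic_channel_attention(feat):
    """假设性示意: 按通道均值做 softmax 门控,
    放大判别性通道、抑制误导性颜色噪声。feat: (H, W, C)。"""
    scores = feat.mean(axis=(0, 1))
    gate = np.exp(scores - scores.max())
    gate /= gate.sum()
    return feat * gate[None, None, :] * feat.shape[-1]  # 乘通道数以保持整体幅度
```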
[CV-77] Dehallu3D: Hallucination-Mitigated 3D Generation from Single Image via Cyclic View Consistency Refinement
【速读】:该论文旨在解决大规模3D重建模型中因多视图图像稀疏生成导致的结构幻觉(hallucinations)问题,这类幻觉表现为几何异常(如奇怪的孔洞或突起),严重影响3D打印物体的完整性或虚拟场景的沉浸感。其核心解决方案是提出Dehallu3D框架,关键在于设计了一个平衡的多视角连续性约束机制:通过邻接一致性(adjacent consistency)确保跨视角的几何连续性,同时引入自适应平滑策略(adaptive smoothness)避免过度平滑而丢失锐利几何特征,从而在保留细节的同时有效消除幻觉异常点。此外,作者还提出了Outlier Risk Measure (ORM) 用于量化3D生成中由异常点引起的几何保真度损失。
链接: https://arxiv.org/abs/2603.01601
作者: Xiwen Wang,Shichao Zhang,Hailun Zhang,Ruowei Wang,Mao Li,Chenyu Zhou,Qijun Zhao,Ji-Zhe Zhou
机构: Sichuan University (四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large 3D reconstruction models have revolutionized the 3D content generation field, enabling broad applications in virtual reality and gaming. Just like other large models, large 3D reconstruction models suffer from hallucinations as well, introducing structural outliers (e.g., odd holes or protrusions) that deviate from the input data. However, unlike other large models, hallucinations in large 3D reconstruction models remain severely underexplored, leading to malformed 3D-printed objects or insufficient immersion in virtual scenes. Such hallucinations majorly originate from that existing methods reconstruct 3D content from sparsely generated multi-view images which suffer from large viewpoint gaps and discontinuities. To mitigate hallucinations by eliminating the outliers, we propose Dehallu3D for 3D mesh generation. Our key idea is to design a balanced multi-view continuity constraint to enforce smooth transitions across dense intermediate viewpoints, while avoiding over-smoothing that could erase sharp geometric features. Therefore, Dehallu3D employs a plug-and-play optimization module with two key constraints: (i) adjacent consistency to ensure geometric continuity across views, and (ii) adaptive smoothness to retain fine details. We further propose the Outlier Risk Measure (ORM) metric to quantify geometric fidelity in 3D generation from the perspective of outliers. Extensive experiments show that Dehallu3D achieves high-fidelity 3D generation by effectively preserving structural details while removing hallucinated outliers.
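Dehallu3D 的"邻接一致性"约束要求稠密中间视角之间平滑过渡;若以相邻视角的深度(或渲染)差异作代理,可写成如下的 L2 连续性惩罚(纯属假设性示意,论文的具体约束形式见原文):

```python
import numpy as np

def adjacent_consistency_loss(views):
    """假设性示意: 相邻视角图之间的平均 L2 差异,
    惩罚视角间的突兀几何跳变。views: 形如 (H, W) 的数组列表。"""
    diffs = [float(np.mean((a - b) ** 2)) for a, b in zip(views[:-1], views[1:])]
    return sum(diffs) / len(diffs)
```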
[CV-78] Preference Score Distillation: Leveraging 2D Rewards to Align Text-to-3D Generation with Human Preference
【速读】:该论文旨在解决扩散模型在文本到3D生成任务中的人类偏好对齐问题,这一挑战在数据稀缺的3D领域尤为突出。现有方法通常依赖特定任务的微调,难以推广。其解决方案的核心是提出偏好得分蒸馏(Preference Score Distillation, PSD),该框架利用预训练的2D奖励模型实现无需3D训练数据的人类对齐文本到3D合成。关键创新在于:首先,认识到像素级梯度不兼容性会干扰去噪过程,从而将偏好对齐重新建模为类似无分类器引导(Classifier-Free Guidance, CFG)的机制;其次,引入自适应策略联合优化偏好得分与负向文本嵌入,通过在线更新负向文本嵌入增强对齐效果,实现了与多种生成流水线的无缝集成和强扩展性。
链接: https://arxiv.org/abs/2603.01594
作者: Jiaqi Leng,Shuyuan Tu,Haidong Cao,Sicheng Xie,Daoguo Dong,Zuxuan Wu,Yu-Gang Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Human preference alignment presents a critical yet underexplored challenge for diffusion models in text-to-3D generation. Existing solutions typically require task-specific fine-tuning, posing significant hurdles in data-scarce 3D domains. To address this, we propose Preference Score Distillation (PSD), an optimization-based framework that leverages pretrained 2D reward models for human-aligned text-to-3D synthesis without 3D training data. Our key insight stems from the incompatibility of pixel-level gradients: due to the absence of noisy samples during reward model training, direct application of 2D reward gradients disturbs the denoising process. Noticing that similar issue occurs in the naive classifier guidance in conditioned diffusion models, we fundamentally rethink preference alignment as a classifier-free guidance (CFG)-style mechanism through our implicit reward model. Furthermore, recognizing that frozen pretrained diffusion models constrain performance, we introduce an adaptive strategy to co-optimize preference scores and negative text embeddings. By incorporating CFG during optimization, online refinement of negative text embeddings dynamically enhances alignment. To our knowledge, we are the first to bridge human preference alignment with CFG theory under score distillation framework. Experiments demonstrate the superiority of PSD in aesthetic metrics, seamless integration with diverse pipelines, and strong extensibility.
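PSD 把偏好对齐重建为类 CFG 的引导机制;在噪声预测层面,这类引导通常可写成在标准 CFG 之上叠加一个偏好方向项(权重与具体形式为假设,非论文原始公式):

```python
import numpy as np

def cfg_with_preference(eps_uncond, eps_cond, eps_pref, w_cfg=7.5, w_pref=1.0):
    """假设性示意: eps = eps_u + w_cfg*(eps_c - eps_u) + w_pref*(eps_p - eps_u),
    其中 eps_pref 可视作隐式奖励模型给出的偏好条件预测。"""
    return (eps_uncond
            + w_cfg * (eps_cond - eps_uncond)
            + w_pref * (eps_pref - eps_uncond))
```

w_pref=0 时退化为标准 CFG。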
[CV-79] PPEDCRF: Privacy-Preserving Enhanced Dynamic CRF for Location-Privacy Protection for Sequence Videos with Minimal Detection Degradation
【速读】:该论文旨在解决自动驾驶或辅助驾驶系统采集的行车记录视频(dashcam videos)在去除显式GPS元数据后,仍可能因背景视觉线索(如建筑和道路布局)被攻击者通过与大规模街景图像匹配而泄露地理位置的问题。解决方案的关键在于提出PPEDCRF框架,其核心创新是仅对推断出的位置敏感背景区域注入校准扰动,同时保持前景目标检测与分割的性能不受显著影响;该框架包含三个关键组件:动态条件随机场(Dynamic CRF)用于跨帧追踪位置敏感区域并保证时序一致性,归一化控制惩罚(Normalized Control Penalty, NCP)依据分层敏感性模型分配扰动强度,以及保用噪声注入模块以最小化对检测任务的干扰。
链接: https://arxiv.org/abs/2603.01593
作者: Bo Ma,Jinsong Wu,Weiqi Yan,Catherine Shi,Minh Nguyen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Dashcam videos collected by autonomous or assisted-driving systems are increasingly shared for safety auditing and model improvement. Even when explicit GPS metadata are removed, an attacker can still infer the recording location by matching background visual cues (e.g., buildings and road layouts) against large-scale street-view imagery. This paper studies location-privacy leakage under a background-based retrieval attacker, and proposes PPEDCRF, a privacy-preserving enhanced dynamic conditional random field framework that injects calibrated perturbations only into inferred location-sensitive background regions while preserving foreground detection utility. PPEDCRF consists of three components: (i) a dynamic CRF that enforces temporal consistency to discover and track location sensitive regions across frames, (ii) a normalized control penalty (NCP) that allocates perturbation strength according to a hierarchical sensitivity model, and (iii) a utility-preserving noise injection module that minimizes interference to object detection and segmentation. Experiments on public driving datasets demonstrate that PPEDCRF significantly reduces location-retrieval attack success (e.g., Top-k retrieval accuracy) while maintaining competitive detection performance (e.g., mAP and segmentation metrics) compared with common baselines such as global noise, white-noise masking, and feature-based anonymization. The source code is in this https URL
[CV-80] FAST-DIPS: Adjoint-Free Analytic Steps and Hard-Constrained Likelihood Correction for Diffusion-Prior Inverse Problems
【速读】:该论文旨在解决无训练扩散先验(training-free diffusion priors)在非线性前向算子下的逆问题求解效率低下问题,即传统方法依赖重复的导数计算或内层优化/MCMC循环,导致迭代次数多、计算成本高。其解决方案的关键在于:用测量空间的硬可行性约束(闭式投影)替代内层优化循环,并引入解析的、模型最优步长规则,从而在每个噪声水平下实现固定且较小的计算预算。具体而言,通过基于去噪器预测的修正项,采用无需伴随(adjoint-free)的ADMM式分裂策略,结合投影与少量最速下降更新(仅需一次向量-雅可比乘积VJP和一次JVP或前向差分探测),并辅以回溯搜索与解耦重退火机制,实现了局部模型最优性和收敛性保证。
链接: https://arxiv.org/abs/2603.01591
作者: Minwoo Kim,Seunghyeok Shin,Hongki Lim
机构: Inha University (仁川大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Training-free diffusion priors enable inverse-problem solvers without retraining, but for nonlinear forward operators data consistency often relies on repeated derivatives or inner optimization/MCMC loops with conservative step sizes, incurring many iterations and denoiser/score evaluations. We propose a training-free solver that replaces these inner loops with a hard measurement-space feasibility constraint (closed-form projection) and an analytic, model-optimal step size, enabling a small, fixed compute budget per noise level. Anchored at the denoiser prediction, the correction is approximated via an adjoint-free, ADMM-style splitting with projection and a few steepest-descent updates, using one VJP and either one JVP or a forward-difference probe, followed by backtracking and decoupled re-annealing. We prove local model optimality and descent under backtracking for the step-size rule, and derive an explicit KL bound for mode-substitution re-annealing under a local Gaussian conditional surrogate. We also develop a latent variant and a one-parameter pixel \rightarrow latent hybrid schedule. Experiments achieve competitive PSNR/SSIM/LPIPS with up to 19.5 \times speedup, without hand-coded adjoints or inner MCMC.
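补充示例:摘要所说的"解析的、模型最优步长"在二次保真项 f(x) = 0.5‖Ax − y‖² 下存在闭式精确线搜索解 α = ‖g‖² / ‖Ag‖²,其中梯度 g = Aᵀr 对应一次 VJP,分母中的 Ag 对应一次 JVP,与文中"一次 VJP + 一次 JVP"的计算预算一致。下面是一个示意性的纯 Python 草图(非论文实现,函数名为假设,仅演示该步长规则):

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def analytic_gd_step(A, y, x):
    """针对二次保真项 f(x) = 0.5 * ||A x - y||^2 的一步最速下降,
    使用精确线搜索步长 alpha = ||g||^2 / ||A g||^2:
    g = A^T r 对应一次 VJP,A g 对应一次 JVP(或前向差分探测)。"""
    r = [m - t for m, t in zip(matvec(A, x), y)]   # 残差 A x - y
    g = matvec(transpose(A), r)                    # VJP:梯度
    Ag = matvec(A, g)                              # JVP:方向导数
    denom = sum(v * v for v in Ag)
    if denom == 0.0:
        return list(x)
    alpha = sum(v * v for v in g) / denom
    return [xi - alpha * gi for xi, gi in zip(x, g)]
```

当 A 为单位阵时单步即收敛到 y,可直观验证步长的最优性;非线性算子下论文还需配合回溯搜索保证下降。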
[CV-81] InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning
【速读】:该论文旨在解决复杂多实体场景中细粒度图像编辑的挑战,尤其针对目标不具视觉显著性且需空间推理的情况。解决方案的关键在于提出InterCoG框架,其核心思想是通过文本中的空间关系信息进行对象位置推理,以显式确定待编辑目标的位置与身份;随后在像素空间中通过生成边界框和掩码完成视觉定位,并重写编辑描述以明确意图。该方法引入两个辅助训练模块——多模态接地重建监督和多模态接地推理对齐,分别提升空间定位精度与推理可解释性,从而实现高精度的细粒度编辑。
链接: https://arxiv.org/abs/2603.01586
作者: Yecong Wan,Fan Li,Chunwei Wang,Hao Wu,Mingwen Shao,Wangmeng Zuo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Emerging unified editing models have demonstrated strong capabilities in general object editing tasks. However, it remains a significant challenge to perform fine-grained editing in complex multi-entity scenes, particularly those where targets are not visually salient and require spatial reasoning. To this end, we propose InterCoG, a novel text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. The key insight of InterCoG is to first perform object position reasoning solely within text that includes spatial relation details to explicitly deduce the location and identity of the edited target. It then conducts visual grounding via highlighting the editing targets with generated bounding boxes and masks in pixel space, and finally rewrites the editing description to specify the intended outcomes. To further facilitate this paradigm, we propose two auxiliary training modules: multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment to enforce spatial localization accuracy and reasoning interpretability, respectively. We also construct GroundEdit-45K, a dataset comprising 45K grounding-oriented editing samples with detailed reasoning annotations, and GroundEdit-Bench for grounding-aware editing evaluation. Extensive experiments substantiate the superiority of our approach in highly precise edits under spatially intricate and multi-entity scenes.
[CV-82] SkeleGuide: Explicit Skeleton Reasoning for Context-Aware Human-in-Place Image Synthesis
【速读】:该论文旨在解决当前生成式AI在将逼真且结构合理的虚拟人体图像合成到现有场景时存在的系统性问题,如肢体扭曲和姿态不自然等缺陷。研究表明,这些问题源于模型缺乏对人体骨骼结构的显式推理能力。解决方案的关键在于提出SkeleGuide框架,其核心创新是通过联合训练推理与渲染阶段,学习一个内部姿态表示作为强结构先验,从而引导图像生成过程以实现高结构完整性;同时引入PoseInverter模块,将该内部潜在姿态解码为可编辑的显式格式,实现细粒度用户控制。
链接: https://arxiv.org/abs/2603.01579
作者: Chuqiao Wu,Jin Song,Yiyun Fei
机构: Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Generating realistic and structurally plausible human images into existing scenes remains a significant challenge for current generative models, which often produce artifacts like distorted limbs and unnatural poses. We attribute this systemic failure to an inability to perform explicit reasoning over human skeletal structure. To address this, we introduce SkeleGuide, a novel framework built upon explicit skeletal reasoning. Through joint training of its reasoning and rendering stages, SkeleGuide learns to produce an internal pose that acts as a strong structural prior, guiding the synthesis towards high structural integrity. For fine-grained user control, we introduce PoseInverter, a module that decodes this internal latent pose into an explicit and editable format. Extensive experiments demonstrate that SkeleGuide significantly outperforms both specialized and general-purpose models in generating high-fidelity, contextually-aware human images. Our work provides compelling evidence that explicitly modeling skeletal structure is a fundamental step towards robust and plausible human image synthesis.
[CV-83] Cryo-Bench: Benchmarking Foundation Models for Cryosphere Applications
【速读】:该论文旨在解决地球观测领域中针对冰冻圈(Cryosphere)应用的生成式基础模型(Geo-Foundation Models, GFMs)缺乏系统性评估的问题,其核心挑战在于现有数据集难以支撑对GFMs在冰川、冰湖、海冰及冰崖等关键冰冻圈组件上的性能进行全面 benchmarking。为应对这一问题,作者提出了Cryo-Bench基准测试平台,涵盖多传感器、多地理区域的多种冰冻圈场景,并在此基础上评估了14种GFMs与UNet和ViT基线模型的表现。解决方案的关键在于:首先通过构建结构化、多样化的冰冻圈专用评估数据集实现可复现的性能对比;其次发现冻结编码器策略下UNet表现最优(平均mIoU 66.38),而在少样本设置(10%数据)中GFMs如DOFA和TerraMind显著优于UNet;最后指出模型微调时学习率优化是提升GFM性能的核心因素,相较固定微调策略,采用超参数调优可带来平均12.77%的相对性能提升,从而确立了“冻结编码器用于快速部署、微调+超参优化用于最佳性能”的实用策略。
链接: https://arxiv.org/abs/2603.01576
作者: Saurabh Kaushik,Lalit Maurya,Beth Tellman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Geo-Foundation Models (GFMs) have been evaluated across diverse Earth observation tasks spanning multiple domains and have demonstrated strong potential for producing reliable maps even with sparse labels. However, benchmarking GFMs for Cryosphere applications has remained limited, primarily due to the lack of suitable evaluation datasets. To address this gap, we introduce Cryo-Bench, a benchmark compiled to evaluate GFM performance across key Cryospheric components. Cryo-Bench includes debris-covered glaciers, glacial lakes, sea ice, and calving fronts, spanning multiple sensors and broad geographic regions. We evaluate 14 GFMs alongside UNet and ViT baselines to assess their advantages, limitations, and optimal usage strategies. With a frozen encoder, UNet achieves the highest average mIoU of 66.38, followed by TerraMind at 64.02 across the five evaluation datasets included in Cryo-Bench. In the few-shot setting (10% input data), GFMs such as DOFA and TerraMind outperform UNet, achieving mIoU scores of 59.53, 56.62, and 56.60, respectively, compared to UNet's 56.60. When fully finetuning GFMs, we observe inconsistent performance across datasets and models. However, tuning the learning rate along with finetuning substantially improves GFM performance. For example, evaluation on two representative datasets (GLID and CaFFe) shows an average relative improvement of 12.77%. Despite having minimal Cryosphere representation in their pretraining data, GFMs exhibit notable domain adaptation capabilities and produce meaningful results across tasks. Based on our findings, we recommend encoder fine-tuning with hyperparameter optimization to achieve the best possible performance, while using frozen encoders when users need quick results without extensive experimentation. (GitHub: this https URL)
[CV-84] Rate-Distortion Signatures of Generalization and Information Trade-offs
【速读】:该论文旨在解决视觉系统在面对新颖视觉条件时的泛化能力问题,尤其是如何量化和比较人类与机器视觉系统在准确率与鲁棒性之间的权衡关系。传统鲁棒性指标难以揭示系统在不同扰动下准确率变化的边际成本与非线性特征。其解决方案的关键在于引入基于信息论的率失真(rate-distortion, RD)理论框架,将刺激-响应行为建模为有效通信信道,并从混淆矩阵中推导出RD前沿;进而用两个可解释的几何签名——斜率(β,表征准确率-鲁棒性权衡的边际成本)和曲率(κ,表征权衡的突变程度)来刻画系统泛化行为。这一方法实现了对生物与人工视觉系统泛化几何结构的模型无关比较,揭示了人类系统具有更平滑、灵活的权衡特性,而现代深度网络即使在相同准确率下也处于更陡峭、脆弱的区域,且鲁棒性训练带来的β/κ变化并非总是趋向人类模式。
链接: https://arxiv.org/abs/2603.01568
作者: Leyla Roksan Caglar,Pedro A.M. Mediano,Baihan Lin
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Neurons and Cognition (q-bio.NC)
备注:
Abstract:Generalization to novel visual conditions remains a central challenge for both human and machine vision, yet standard robustness metrics offer limited insight into how systems trade accuracy for robustness. We introduce a rate-distortion-theoretic framework that treats stimulus-response behavior as an effective communication channel, derives rate-distortion (RD) frontiers from confusion matrices, and summarizes each system with two interpretable geometric signatures - slope ( \beta ) and curvature ( \kappa ) - which capture the marginal cost and abruptness of accuracy-robustness trade-offs. Applying this framework to human psychophysics and 18 deep vision models under controlled image perturbations, we compare generalization geometry across model architectures and training regimes. We find that both biological and artificial systems follow a common lossy-compression principle but occupy systematically different regions of RD space. In particular, humans exhibit smoother, more flexible trade-offs, whereas modern deep networks operate in steeper and more brittle regimes even at matched accuracy. Across training regimes, robustness training induces systematic but dissociable shifts in beta/kappa, revealing cases where improved robustness or accuracy does not translate into more human-like generalization geometry. These results demonstrate that RD geometry provides a compact, model-agnostic lens for comparing generalization behavior across systems beyond standard accuracy-based metrics.
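补充示例:文中"由混淆矩阵推导 RD 前沿"的核心计算,是把混淆计数视为刺激-响应的联合分布并估计互信息(即有效信道的"率"),再与错误率(失真)配对;在扰动强度扫描下即可描出经验前沿。以下为示意性纯 Python 草图(函数名为假设,仅演示估计方法):

```python
import math

def mutual_information_bits(confusion):
    """由混淆计数矩阵(行 = 刺激类别,列 = 响应)估计 I(刺激; 响应),单位 bit。
    这就是有效刺激-响应信道的"率";把每个扰动强度下的率与对应的
    错误率(失真)配对,即可描出经验 RD 前沿。"""
    total = sum(sum(row) for row in confusion)
    p_x = [sum(row) / total for row in confusion]
    n_y = len(confusion[0])
    p_y = [sum(row[j] for row in confusion) / total for j in range(n_y)]
    mi = 0.0
    for i, row in enumerate(confusion):
        for j, c in enumerate(row):
            if c:
                p_xy = c / total
                mi += p_xy * math.log2(p_xy / (p_x[i] * p_y[j]))
    return mi

def error_rate(confusion):
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return 1.0 - correct / total
```

例如,两类等概率下的单位混淆矩阵给出 1 bit 的率与零失真;文中的斜率 β 与曲率 κ 则是对这条前沿的几何拟合量。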
[CV-85] TopoMaskV3: 3D Mask Head with Dense Offset and Height Predictions for Road Topology Understanding
【速读】:该论文旨在解决现有道路拓扑理解方法中2D预测受限于离散化伪影(discretization artifacts)以及地理数据泄露(geographic data leakage)导致模型泛化能力不足的问题。其关键解决方案在于提出TopoMaskV3,通过引入两个新型密集预测头:一是用于亚网格精度校正的密集偏移场(dense offset field),在保持BEV分辨率的前提下提升空间精度;二是直接估计3D坐标的密集高度图(dense height map),实现无需参数化头部的端到端3D道路中心线预测。此外,论文首次通过地理上互斥的数据划分和±100米长距离基准测试来消除地理过拟合,显著提升了模型在真实场景中的鲁棒性和泛化性能。
链接: https://arxiv.org/abs/2603.01558
作者: Muhammet Esat Kalfaoglu,Halil Ibrahim Ozturk,Ozsel Kilinc,Alptekin Temizel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Mask-based paradigms for road topology understanding, such as TopoMaskV2, offer a complementary alternative to query-based methods by generating centerlines via a dense rasterized intermediate representation. However, prior work was limited to 2D predictions and suffered from severe discretization artifacts, necessitating fusion with parametric heads. We introduce TopoMaskV3, which advances this pipeline into a robust, standalone 3D predictor via two novel dense prediction heads: a dense offset field for sub-grid discretization correction within the existing BEV resolution, and a dense height map for direct 3D estimation. Beyond the architecture, we are the first to address geographic data leakage in road topology evaluation by introducing (1) geographically distinct splits to prevent memorization and ensure fair generalization, and (2) a long-range (+/-100 m) benchmark. TopoMaskV3 achieves state-of-the-art 28.5 OLS on this geographically disjoint benchmark, surpassing all prior methods. Our analysis shows that the mask representation is more robust to geographic overfitting than Bezier, while LiDAR fusion is most beneficial at long range and exhibits larger relative gains on the overlapping original split, suggesting overlap-induced memorization effects.
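补充示例:摘要中"密集偏移场 + 密集高度图"的解码方式可示意如下——先取掩码头标记的 BEV 网格单元中心(仅此一步会把位置量化到 BEV 分辨率),再加上子网格偏移消除离散化误差,最后用高度图补出第三维坐标。坐标约定与取值范围均为假设,非论文实现:

```python
def decode_centerline_points(mask_cells, offset, height, cell_size):
    """由密集预测头解码 3D 中心线点(坐标约定为假设):
    mask_cells: 掩码头判为中心线的 BEV 单元 (row, col) 列表;
    offset[r][c]: 子网格偏移 (dy, dx),假设取值在 [-0.5, 0.5] 单元内;
    height[r][c]: 密集高度图给出的该单元高度。
    粗网格单元中心会把位置量化到 BEV 分辨率,加上偏移即可消除该误差。"""
    points = []
    for r, c in mask_cells:
        dy, dx = offset[r][c]
        x = (c + 0.5 + dx) * cell_size
        y = (r + 0.5 + dy) * cell_size
        points.append((x, y, height[r][c]))
    return points
```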
[CV-86] Align-cDAE: Alzheimer's Disease Progression Modeling with Attention-Aligned Conditional Diffusion Auto-Encoder
【速读】:该论文旨在解决现有基于扩散模型的生成式AI框架在模拟阿尔茨海默病(Alzheimer’s disease)脑影像纵向进展时存在的两个关键问题:一是非影像条件模态(如临床评分、基因信息等)与图像特征之间缺乏显式的语义对齐,导致生成图像中无法精准调控特定病变区域的变化;二是扩散自动编码器的潜在表示空间结构不够明确,难以实现对疾病进展相关特征与个体特异性信息的有效分离和控制。解决方案的关键在于:首先引入一个显式的对齐目标函数,强制模型关注与疾病进展相关的区域,从而提升多模态条件信息在生成过程中的语义一致性;其次设计了一种潜在空间结构化机制,将潜在空间划分为独立子空间——一个用于整合疾病进展相关的条件信息,另一个用于保留受试者的个体身份特征,从而实现更精确、可控的图像生成。
链接: https://arxiv.org/abs/2603.01552
作者: Ayantika Das,Keerthi Ram,Mohanasankar Sivaprakasam
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generative AI framework-based modeling and prediction of longitudinal human brain images offer an efficient mechanism to track neurodegenerative progression, essential for the assessment of diseases like Alzheimer’s. Among the existing generative approaches, recent diffusion-based models have emerged as an effective alternative to generate disease progression images. Incorporating multi-modal and non-imaging attributes as conditional information into diffusion frameworks has been shown to improve controllability during such generations. However, existing methods do not explicitly ensure that information from non-imaging conditioning modalities is meaningfully aligned with image features to introduce desirable changes in the generated images, such as modulation of progression-specific regions. Further, more precise control over the generation process can be achieved by introducing progression-relevant structure into the internal representations of the model, lacking in the existing approaches. To address these limitations, we propose a diffusion autoencoder-based framework for disease progression modeling that explicitly enforces alignment between different modalities. The alignment is enforced by introducing an explicit objective function that enables the model to focus on the regions exhibiting progression-related changes. Further, we devise a mechanism to better structure the latent representational space of the diffusion auto-encoding framework. Specifically, we assign separate latent subspaces for integrating progression-related conditions and retaining subject-specific identity information, allowing better-controlled image generation. These results demonstrate that enforcing alignment and better structuring of the latent representational space of diffusion auto-encoding framework leads to more anatomically precise modeling of Alzheimer’s disease progression.
[CV-87] Pri4R: Learning World Dynamics for Vision-Language-Action Models with Privileged 4D Representation
【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在物理交互中缺乏对时空动态(spatiotemporal dynamics)理解的问题,即模型虽具备良好的语义理解能力,但难以捕捉动作与环境之间随时间演化的物理关系。其解决方案的关键在于提出Pri4R方法,通过在训练阶段引入轻量级的3D点轨迹预测头(point track head),利用特权4D信息(如视频序列中的3D点运动轨迹)引导VLA模型学习场景几何的演化规律,并将此类动态信息隐式嵌入共享表示空间中,从而增强模型对物理世界响应的理解能力。该设计不改变原VLA架构,推理时无额外计算开销,且在仿真与真实场景中显著提升了复杂操作任务的性能。
链接: https://arxiv.org/abs/2603.01549
作者: Jisoo Kim,Jungbin Cho,Sanghyeok Chu,Ananya Bal,Jinhyung Kim,Gunhee Lee,Sihaeng Lee,Seung Hwan Kim,Bohyung Han,Hyunmin Lee,Laszlo A. Jeni,Seungryong Kim
机构: KAIST AI; LG AI Research; Yonsei University; Seoul National University; Carnegie Mellon University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Humans learn not only how their bodies move, but also how the surrounding world responds to their actions. In contrast, while recent Vision-Language-Action (VLA) models exhibit impressive semantic understanding, they often fail to capture the spatiotemporal dynamics governing physical interaction. In this paper, we introduce Pri4R, a simple yet effective approach that endows VLA models with an implicit understanding of world dynamics by leveraging privileged 4D information during training. Specifically, Pri4R augments VLAs with a lightweight point track head that predicts 3D point tracks. By injecting VLA features into this head to jointly predict future 3D trajectories, the model learns to incorporate evolving scene geometry within its shared representation space, enabling more physically aware context for precise control. Due to its architectural simplicity, Pri4R is compatible with dominant VLA design patterns with minimal changes. During inference, we run the model using the original VLA architecture unchanged; Pri4R adds no extra inputs, outputs, or computational overhead. Across simulation and real-world evaluations, Pri4R significantly improves performance on challenging manipulation tasks, including a +10% gain on LIBERO-Long and a +40% gain on RoboCasa. We further show that 3D point track prediction is an effective supervision target for learning action-world dynamics, and validate our design choices through extensive ablations.
[CV-88] PathMoE: Interpretable Multimodal Interaction Experts for Pediatric Brain Tumor Classification
【速读】:该论文旨在解决儿科中枢神经系统肿瘤(pediatric central nervous system tumors)精准分类难题,其核心挑战在于组织病理学表现复杂且训练数据有限。现有基于病理图像的基础模型(pathology foundation models)虽能处理全切片图像(WSI),但未能有效融合临床文本和组织微结构等互补信息。解决方案的关键在于提出 PathMoE 框架——一个可解释的多模态学习架构,通过交互感知的专家混合(interaction-aware mixture-of-experts)机制,整合 H&E 染色切片、病理报告文本及细胞级图结构(nuclei-level cell graphs)三类模态数据;该框架利用输入依赖的门控机制动态加权不同模态间的独特性、冗余性和协同效应,从而实现样本级别的可解释性,显著提升分类性能并揭示驱动单个预测的具体模态交互关系。
链接: https://arxiv.org/abs/2603.01547
作者: Jian Yu,Joakim Nguyen,Jinrui Fang,Awais Naeem,Zeyuan Cao,Sanjay Krishnan,Nicholas Konz,Tianlong Chen,Chandra Krishnan,Hairong Wang,Edward Castillo,Ying Ding,Ankita Shukla
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate classification of pediatric central nervous system tumors remains challenging due to histological complexity and limited training data. While pathology foundation models have advanced whole-slide image (WSI) analysis, they often fail to leverage the rich, complementary information found in clinical text and tissue microarchitecture. To this end, we propose PathMoE, an interpretable multimodal framework that integrates H&E slides, pathology reports, and nuclei-level cell graphs via an interaction-aware mixture-of-experts architecture built on state-of-the-art foundation models for each modality. By training specialized experts to capture modality uniqueness, redundancy, and synergy, PathMoE employs an input-dependent gating mechanism that dynamically weights these interactions, providing sample-level interpretability. We evaluate our framework on two dataset-specific classification tasks on an internal pediatric brain tumor dataset (PBT) and external TCGA datasets. PathMoE improves macro-F1 from 0.762 to 0.799 (+0.037) on PBT when integrating WSI, text, and graph modalities; on TCGA, augmenting WSI with graph knowledge improves macro-F1 from 0.668 to 0.709 (+0.041). These results demonstrate significant performance gains over state-of-the-art image-only baselines while revealing the specific modality interactions driving individual predictions. This interpretability is particularly critical for rare tumor subtypes, where transparent model reasoning is essential for clinical trust and diagnostic validation.
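补充示例:"输入依赖的门控对交互专家动态加权"可以用一个极简的 softmax 门控来示意:几个专家分别对应模态独特性、冗余性与协同性(此处专家数量与门控输入均为玩具假设),门控权重本身即提供样本级可解释性:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gated_mixture(expert_logits, gate_scores):
    """输入依赖的专家门控草图:expert_logits 每行是一个交互专家
    (如独特性 / 冗余性 / 协同性专家,命名仅为示意)输出的类别 logits;
    返回加权混合后的 logits 及门控权重 w,w 即样本级可解释性的来源。"""
    w = softmax(gate_scores)
    n_cls = len(expert_logits[0])
    mixed = [sum(w[k] * expert_logits[k][j] for k in range(len(w)))
             for j in range(n_cls)]
    return mixed, w
```

门控分数相等时各专家均匀混合;某一专家的门控分数占优时,预测几乎完全由该交互专家决定,读出 w 即可知是哪类模态交互驱动了该样本的预测。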
[CV-89] Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory AAAI2026
【速读】:该论文旨在解决推理视频目标分割(Reasoning Video Object Segmentation, ReasonVOS)任务中两个核心问题:一是现有方法依赖对多模态大语言模型(Multimodal Large Language Models, MLLMs)进行微调,计算资源消耗大;二是现有方法在时空信息处理上耦合紧密,影响分割结果的时间稳定性。解决方案的关键在于提出一种无需训练的时空解耦推理视频分割框架(Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory, SDAM),其核心创新包括:1)通过自适应对象记忆模块(Adaptive Object Memory)基于运动线索动态选择并存储关键对象,实现跨帧对象一致性;2)采用时空解耦机制,在空间域实现精准的目标定位与分割,在时间域利用关键对象的时序信息驱动稳定跨帧传播,从而在不依赖微调的前提下显著提升分割精度与时间稳定性。
链接: https://arxiv.org/abs/2603.01545
作者: Zhengtong Zhu,Jiaqing Fan,Zhixuan Liu,Fanzhang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accept by AAAI2026
Abstract:Reasoning Video Object Segmentation (ReasonVOS) is a challenging task that requires stable object segmentation across video sequences using implicit and complex textual inputs. Previous methods fine-tune Multimodal Large Language Models (MLLMs) to produce segmentation outputs, which demand substantial resources. Additionally, some existing methods are coupled in the processing of spatio-temporal information, which affects the temporal stability of the model to some extent. To address these issues, we propose Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory (SDAM). We aim to design a training-free reasoning video segmentation framework that outperforms existing methods requiring fine-tuning, using only pre-trained models. Meanwhile, we propose an Adaptive Object Memory module that selects and memorizes key objects based on motion cues in different video sequences. Finally, we propose Spatio-temporal Decoupling for stable temporal propagation. In the spatial domain, we achieve precise localization and segmentation of target objects, while in the temporal domain, we leverage key object temporal information to drive stable cross-frame propagation. Our method achieves excellent results on five benchmark datasets, including Ref-YouTubeVOS, Ref-DAVIS17, MeViS, ReasonVOS, and ReVOS.
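补充示例:"基于运动线索选择并记忆关键对象"的自适应对象记忆,可用一个容量受限、按运动强度保留 top-k 对象的更新规则来示意(数据结构与打分方式均为假设,非论文实现):

```python
def update_memory(memory, frame_objects, capacity):
    """自适应对象记忆草图:把当前帧对象(id -> 运动强度打分)并入记忆,
    仅保留运动线索最强的 capacity 个条目;打分方式为假设,
    仅演示"按运动线索筛选关键对象"的思路。"""
    merged = dict(memory)
    for oid, motion in frame_objects.items():
        merged[oid] = max(motion, merged.get(oid, 0.0))
    kept = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:capacity]
    return dict(kept)
```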
[CV-90] RA-Det: Towards Universal Detection of AI-Generated Images via Robustness Asymmetry
【速读】:该论文旨在解决生成式 AI(Generative AI)所产图像日益逼真导致下游识别系统可靠性下降的问题,特别是传统依赖外观特征(appearance-driven)的检测方法因图像视觉线索弱化而稳定性不足。解决方案的关键在于提出“鲁棒性不对称性”(robustness asymmetry)这一行为驱动的新信号:自然图像在小规模结构化扰动下保持稳定的语义表征,而生成图像则表现出显著更大的特征漂移。基于此现象,作者设计了鲁棒性不对称检测(RA-Det)框架,将该行为差异转化为可靠的判别信号,其优势在于无需生成模型指纹、数据和模型无关,并具备跨未见生成器的迁移能力,从而实现通用且高效的合成图像检测。
链接: https://arxiv.org/abs/2603.01544
作者: Xinchang Wang,Yunhao Chen,Yuechen Zhang,Congcong Bian,Zihao Guo,Xingjun Ma,Hui Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent image generators produce photo-realistic content that undermines the reliability of downstream recognition systems. As visual appearance cues become less pronounced, appearance-driven detectors that rely on forensic cues or high-level representations lose stability. This motivates a shift from appearance to behavior, focusing on how images respond to controlled perturbations rather than how they look. In this work, we identify a simple and universal behavioral signal. Natural images preserve stable semantic representations under small, structured perturbations, whereas generated images exhibit markedly larger feature drift. We refer to this phenomenon as robustness asymmetry and provide a theoretical analysis that establishes a lower bound connecting this asymmetry to memorization tendencies in generative models, explaining its prevalence across architectures. Building on this insight, we introduce Robustness Asymmetry Detection (RA-Det), a behavior-driven detection framework that converts robustness asymmetry into a reliable decision signal. Evaluated across 14 diverse generative models and against more than 10 strong detectors, RA-Det achieves superior performance, improving the average performance by 7.81 percent. The method is data- and model-agnostic, requires no generator fingerprints, and transfers across unseen generators. Together, these results indicate that robustness asymmetry is a stable, general cue for synthetic-image detection and that carefully designed probing can turn this cue into a practical, universal detector. The source code is publicly available at GitHub.
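补充示例:"鲁棒性不对称"的判别流程可示意为——对同一图像施加若干小的结构化扰动,度量语义特征的漂移(如 1 − 余弦相似度),再用校准阈值判别:自然图像漂移小、生成图像漂移大。以下为纯 Python 草图(特征向量与阈值均为假设,真实系统中特征来自预训练编码器):

```python
import math

def cosine_drift(f_orig, f_pert):
    """扰动前后特征的漂移量:1 - 余弦相似度。"""
    dot = sum(a * b for a, b in zip(f_orig, f_pert))
    n1 = math.sqrt(sum(a * a for a in f_orig))
    n2 = math.sqrt(sum(b * b for b in f_pert))
    return 1.0 - dot / (n1 * n2)

def detect_generated(drifts, threshold):
    """行为驱动的判别:K 次扰动的平均漂移高于校准阈值则判为生成图像
    (鲁棒性不对称假说下,自然图像应保持在阈值以下)。"""
    return sum(drifts) / len(drifts) > threshold
```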
[CV-91] Benchmarking Semantic Segmentation Models via Appearance and Geometry Attribute Editing
【速读】:该论文旨在解决语义分割模型在复杂多变场景下鲁棒性不足的问题,尤其是在面对对象级和图像级属性变化(如颜色、材质、尺寸、位置及天气、风格等)时的性能退化问题。解决方案的关键在于构建一个名为Gen4Seg的自动化数据生成流水线,利用扩散模型对真实图像进行可控的视觉属性编辑,同时保持结构信息不变,从而复用原始分割标签,显著降低人工标注成本;该方法不仅可用于压力测试模型,还可作为数据增强工具提升模型在分布内和分布外场景下的性能表现。
链接: https://arxiv.org/abs/2603.01535
作者: Zijin Yin,Bing Li,Kongming Liang,Hao Sun,Zhongjiang He,Zhanyu Ma,Jun Guo
机构: Beijing University of Posts and Telecommunications (北京邮电大学); China Telecom Artificial Intelligence Technology Co. Ltd (中国电信人工智能技术有限公司); King Abdullah University of Science and Technology (阿卜杜拉国王科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE TPAMI, under review
Abstract:Semantic segmentation takes pivotal roles in various applications such as autonomous driving and medical image analysis. When deploying segmentation models in practice, it is critical to test their behaviors in varied and complex scenes in advance. In this paper, we construct an automatic data generation pipeline Gen4Seg to stress-test semantic segmentation models by generating various challenging samples with different attribute changes. Beyond previous evaluation paradigms focusing solely on global weather and style transfer, we investigate variations in both appearance and geometry attributes at the object and image level. These include object color, material, size, position, as well as image-level variations such as weather and style. To achieve this, we propose to edit visual attributes of existing real images with precise control of structural information, empowered by diffusion models. In this way, the existing segmentation labels can be reused for the edited images, which greatly reduces the labor costs. Using our pipeline, we construct two new benchmarks, Pascal-EA and COCO-EA. We benchmark a wide variety of semantic segmentation models, spanning from closed-set models to open-vocabulary large models. We have several key findings: 1) advanced open-vocabulary models do not exhibit greater robustness compared to closed-set methods under geometric variations; 2) data augmentation techniques, such as CutOut and CutMix, are limited in enhancing robustness against appearance variations; 3) our pipeline can also be employed as a data augmentation tool and improve both in-distribution and out-of-distribution performances. Our work suggests the potential of generative models as effective tools for automatically analyzing segmentation models, and we hope our findings will assist practitioners and researchers in developing more robust and reliable segmentation models.
[CV-92] Boosting AI Reliability with an FSM-Driven Streaming Inference Pipeline: An Industrial Case
【速读】:该论文旨在解决工业场景中人工智能(AI)模型在面对训练数据未覆盖的场景时,因缺乏鲁棒性而导致预测偏差和安全漏洞的问题。解决方案的关键在于提出了一种新颖的流式推理流水线,通过显式融合先验知识来增强数据驱动模型的性能;具体而言,该方法将目标检测模型与有限状态机(Finite State Machine, FSM)相结合,利用FSM对操作场景的知识编码来引导和修正流式数据上的AI预测,从而提升模型在真实复杂环境中的稳定性与准确性。
链接: https://arxiv.org/abs/2603.01528
作者: Yutian Zhang,Zhongyi Pei,Yi Mao,Chen Wang,Lin Liu,Jianmin Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. The work was done in 2024
Abstract:The widespread adoption of AI in industry is often hampered by its limited robustness when faced with scenarios absent from training data, leading to prediction bias and vulnerabilities. To address this, we propose a novel streaming inference pipeline that enhances data-driven models by explicitly incorporating prior knowledge. This paper presents the work on an industrial AI application that automatically counts excavator workloads from surveillance videos. Our approach integrates an object detection model with a Finite State Machine (FSM), which encodes knowledge of operational scenarios to guide and correct the AI’s predictions on streaming data. In experiments on a real-world dataset of over 7,000 images from 12 site videos, encompassing more than 300 excavator workloads, our method demonstrates superior performance and greater robustness compared to the original solution based on manual heuristic rules. We will release the code at this https URL.
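补充示例:"用 FSM 编码作业场景知识来约束流式预测"可用一个极简状态机示意:只有符合先验转移(挖掘→满载→卸料)的检测标签才推动状态前进,完成一个完整循环计一次工作量,噪声标签被直接忽略。状态与标签名均为假设,论文中的 FSM 编码的是站点相关的作业知识:

```python
# 先验转移表:状态与检测标签均为示意
TRANSITIONS = {
    ("idle", "digging"): "loaded",   # 挖掘:空载 -> 满载
    ("loaded", "dumping"): "idle",   # 卸料:满载 -> 空载,完成一次工作量
}

def count_workloads(labels):
    """把逐帧检测标签送入 FSM;不符合先验转移的标签(检测噪声)
    不改变状态——这正是先验知识修正流式 AI 预测的方式。"""
    state, count = "idle", 0
    for label in labels:
        nxt = TRANSITIONS.get((state, label))
        if nxt is None:
            continue
        if (state, nxt) == ("loaded", "idle"):
            count += 1
        state = nxt
    return count
```

例如开场误检出的 "dumping" 不会使计数增加,因为状态机尚处于 idle;这正是数据驱动模型单独做不到的纠错。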
[CV-93] Better Matching Less Forgetting: A Quality-Guided Matcher for Transformer-based Incremental Object Detection AAAI2026
【速读】:该论文旨在解决增量目标检测(Incremental Object Detection, IOD)中的灾难性遗忘问题,尤其是由DETR类架构特有的“背景前景化”(background foregrounding)现象引发的遗忘。该现象源于匈牙利匹配器(Hungarian matcher)的穷尽约束,即强制将每个真实标注目标分配给一个预测结果,即使预测主要覆盖背景区域(低IoU),从而导致模型错误地将背景特征分类为特定前景类别,破坏已有知识并加速遗忘。解决方案的关键在于提出一种质量引导的最小费用最大流匹配器(Quality-guided Min-Cost Max-Flow, Q-MCMF):通过构建流图并基于几何质量剪枝不合理匹配,再优化得到最小代价且最大化有效匹配的最终分配策略,从而消除来自背景前景化的有害监督信号,同时最大化前景学习信号。
链接: https://arxiv.org/abs/2603.01524
作者: Qirui Wu,Shizhou Zhang,De Cheng,Yinghui Xing,Lingyan Ran,Dahu Shi,Peng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in AAAI2026
Abstract:Incremental Object Detection (IOD) aims to continuously learn new object classes without forgetting previously learned ones. A persistent challenge is catastrophic forgetting, primarily attributed to background shift in conventional detectors. While pseudo-labeling mitigates this in dense detectors, we identify a novel, distinct source of forgetting specific to DETR-like architectures: background foregrounding. This arises from the exhaustiveness constraint of the Hungarian matcher, which forcibly assigns every ground truth target to one prediction, even when predictions primarily cover background regions (i.e., low IoU). This erroneous supervision compels the model to misclassify background features as specific foreground classes, disrupting learned representations and accelerating forgetting. To address this, we propose a Quality-guided Min-Cost Max-Flow (Q-MCMF) matcher. To avoid forced assignments, Q-MCMF builds a flow graph and prunes implausible matches based on geometric quality. It then optimizes for the final matching that minimizes cost and maximizes valid assignments. This strategy eliminates harmful supervision from background foregrounding while maximizing foreground learning signals. Extensive experiments on the COCO dataset under various incremental settings demonstrate that our method consistently outperforms existing state-of-the-art approaches.
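补充示例:Q-MCMF 的核心目标——先按几何质量剪枝、再求"有效匹配数最大且总代价最小"的指派——可以在小规模实例上用穷举来示意。论文用最小费用最大流图求解同一目标,这里只是等价目标的玩具实现:

```python
from itertools import product

def quality_pruned_matching(iou, cost, iou_min):
    """Q-MCMF 目标的玩具实现:先剪掉 IoU 低于 iou_min 的 GT-预测对
    (避免匈牙利匹配的强制穷尽指派),再穷举出"有效匹配数最大、
    且其中总代价最小"的指派;穷举只适用于此处的小规模示例。"""
    n_gt, n_pred = len(iou), len(iou[0])
    best_assign, best_matches, best_cost = None, -1, float("inf")
    for choice in product([None] + list(range(n_pred)), repeat=n_gt):
        used = [p for p in choice if p is not None]
        if len(used) != len(set(used)):
            continue  # 一个预测至多服务一个 GT
        if any(p is not None and iou[g][p] < iou_min
               for g, p in enumerate(choice)):
            continue  # 几何质量剪枝:不允许低 IoU 的强制指派
        m = len(used)
        c = sum(cost[g][p] for g, p in enumerate(choice) if p is not None)
        if m > best_matches or (m == best_matches and c < best_cost):
            best_assign, best_matches, best_cost = choice, m, c
    return best_assign, best_matches
```

在下面的两 GT、两预测的例子里,匈牙利匹配会把第二个 GT 强制指派给只覆盖背景的低 IoU 预测(即"背景前景化"的来源),而剪枝后的匹配让它保持未匹配。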
[CV-94] FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation
【速读】:该论文旨在解决自回归模型在三维网格(3D mesh)生成中因将网格展平为顶点坐标序列而导致的计算成本过高问题,从而限制了高质量几何体的高效合成。其解决方案的关键在于提出了一种新的自回归自动编码器(Autoregressive Autoencoder, ARAE)框架——FACE,该框架通过“一面对应一个标记”(one-face-one-token)策略,将三角面片(triangle face)作为基本单元视为单一语义令牌进行建模,显著缩短了序列长度(压缩比达0.11),从而大幅降低计算开销,同时保持甚至提升了重建质量。这一设计革新了传统基于顶点的生成范式,实现了效率与精度的双重突破。
链接: https://arxiv.org/abs/2603.01515
作者: Hanxiao Wang,Yuan-Chen Guo,Ying-Tian Liu,Zi-Xin Zou,Biao Zhang,Weize Quan,Ding Liang,Yan-Pei Cao,Dong-Ming Yan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Autoregressive models for 3D mesh generation suffer from a fundamental limitation: they flatten meshes into long vertex-coordinate sequences. This results in prohibitive computational costs, hindering the efficient synthesis of high-fidelity geometry. We argue this bottleneck stems from operating at the wrong semantic level. We introduce FACE, a novel Autoregressive Autoencoder (ARAE) framework that reconceptualizes the task by generating meshes at the face level. Our one-face-one-token strategy treats each triangle face, the fundamental building block of a mesh, as a single, unified token. This simple yet powerful design reduces the sequence length by a factor of nine, leading to an unprecedented compression ratio of 0.11, halving the previous state-of-the-art. This dramatic efficiency gain does not compromise quality; by pairing our face-level decoder with a powerful VecSet encoder, FACE achieves state-of-the-art reconstruction quality on standard benchmarks. The versatility of the learned latent space is further demonstrated by training a latent diffusion model that achieves high-fidelity, single-image-to-mesh generation. FACE provides a simple, scalable, and powerful paradigm that lowers the barrier to high-quality structured 3D content creation.
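补充示例:"一面一标记"带来的 9 倍序列缩短可以直接算出来——顶点坐标展平时每个三角面产生 3 顶点 × 3 坐标 = 9 个标记,而面级标记化每面只产生 1 个标记,长度比 1/9 ≈ 0.111,与摘要报告的 0.11 压缩比处于同一量级(实际压缩比还取决于具体标记化细节;以下计算仅为示意):

```python
def vertex_seq_len(n_faces):
    # 顶点坐标展平:每个三角面 3 个顶点 x 3 个坐标 = 9 个标记
    return 9 * n_faces

def face_seq_len(n_faces):
    # one-face-one-token:每个三角面恰好对应 1 个标记
    return n_faces
```

例如 1 万面的网格,序列长度从 9 万降到 1 万,这正是自回归生成成本下降的来源。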
[CV-95] Retrieval Refinement and Ranking for Text-to-Video Generation via Prompt Optimization and Test-Time Scaling ICLR
【速读】:该论文旨在解决文本到视频(Text-to-Video, T2V)生成模型对输入提示(prompt)高度敏感的问题,即当前方法在提升视频生成质量方面存在局限:要么依赖复杂的后处理模型易引入伪影,要么需要昂贵的主生成器微调,限制了可扩展性和可用性。解决方案的关键在于提出3R框架——一种基于检索增强生成(Retrieval-Augmented Generation, RAG)的提示优化方法,无需训练任何T2V模型即可实现高效优化;其核心由三个策略构成:基于RAG的修饰符提取以增强上下文锚定、基于扩散模型的偏好优化以对齐人类偏好、以及时间帧插值以保障时序一致性,从而显著提升生成视频的静态保真度与动态连贯性。
链接: https://arxiv.org/abs/2603.01509
作者: Zillur Rahman,Alex Sheng,Cristian Meo
机构: Algoverse AI
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 2026 ICLR TTU Workshop
Abstract:While large-scale datasets have driven significant progress in Text-to-Video (T2V) generative models, these models remain highly sensitive to input prompts, demonstrating that prompt design is critical to generation quality. Current methods for improving video output often fall short: they either depend on complex post-editing models, risking the introduction of artifacts, or require expensive fine-tuning of the core generator, which severely limits both scalability and accessibility. In this work, we introduce 3R, a novel RAG-based prompt optimization framework. 3R utilizes the power of current state-of-the-art T2V diffusion models and vision-language models, and can be used with any T2V model without any model training. The framework leverages three key strategies: RAG-based modifiers extraction for enriched contextual grounding, diffusion-based Preference Optimization for aligning outputs with human preferences, and temporal frame interpolation for producing temporally consistent visual contents. Together, these components enable more accurate, efficient, and contextually aligned text-to-video generation. Experimental results demonstrate the efficacy of 3R in enhancing the static fidelity and dynamic coherence of generated videos, underscoring the importance of optimizing user prompts.
[CV-96] OMG-Avatar: One-shot Multi-LOD Gaussian Head Avatar
【速读】:该论文旨在解决从单张图像中快速重建可驱动的3D头部动画模型(animatable 3D head reconstruction)的问题,尤其针对现有方法在重建质量、计算效率和硬件适配性方面的不足。解决方案的关键在于提出OMG-Avatar,一种基于多级细节(Multi-LOD, Level-of-Detail)高斯表示的一次性(One-shot)方法,其核心创新包括:(1)采用基于Transformer的架构进行全局特征提取与投影采样获取局部特征,并通过深度缓冲区(depth buffer)引导融合以保证遮挡合理性;(2)引入粗到精的学习范式支持LOD功能并增强层次细节感知;(3)设计多区域分解策略,将头部与肩部独立预测后通过跨区域组合集成,从而有效建模非头部区域(如肩膀),提升整体重建的真实性与灵活性。该方法可在0.2秒内完成高质量重建,且适用于不同硬件平台和推理速度需求。
链接: https://arxiv.org/abs/2603.01506
作者: Jianqiang Ren,Lin Liu,Steven Hoi
机构: Tongyi Lab, Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We propose OMG-Avatar, a novel One-shot method that leverages a Multi-LOD (Level-of-Detail) Gaussian representation for animatable 3D head reconstruction from a single image in 0.2s. Our method enables LOD head avatar modeling using a unified model that accommodates diverse hardware capabilities and inference speed requirements. To capture both global and local facial characteristics, we employ a transformer-based architecture for global feature extraction and projection-based sampling for local feature acquisition. These features are effectively fused under the guidance of a depth buffer, ensuring occlusion plausibility. We further introduce a coarse-to-fine learning paradigm to support Level-of-Detail functionality and enhance the perception of hierarchical details. To address the limitations of 3DMMs in modeling non-head regions such as the shoulders, we introduce a multi-region decomposition scheme in which the head and shoulders are predicted separately and then integrated through cross-region combination. Extensive experiments demonstrate that OMG-Avatar outperforms state-of-the-art methods in reconstruction quality, reenactment performance, and computational efficiency.
[CV-97] Tri-path DINO: Feature Complementary Learning for Remote Sensing Multi-Class Change Detection
【速读】:该论文旨在解决遥感影像中多类变化检测(Multi-class Change Detection, MCD)任务长期面临的挑战,即复杂场景变化与细粒度标注数据稀缺问题。其解决方案的关键在于提出了一种三路径互补特征学习架构(Tripath DINO),通过三个并行路径协同优化特征表示:首先利用预训练的DINOv3模型提取粗粒度语义特征;其次设计一个辅助的孪生结构路径,逐步聚合中间特征以增强细粒度结构细节的学习;最后引入多尺度注意力机制增强解码器对不同感受野下上下文信息的自适应捕捉能力。该架构实现了对复杂垂直领域快速适应,并在Gaza设施损毁评估数据集和SECOND数据集上均取得最优性能,同时GradCAM可视化验证了主路径关注粗粒度语义变化、辅助路径聚焦细粒度结构细节的互补特性,从而提供了一种鲁棒且可解释的先进变化检测方案。
链接: https://arxiv.org/abs/2603.01498
作者: Kai Zheng,Hang-Cheng Dong,Zhenkai Wu,Fupeng Wei,Wei Zhang
机构: Zhejiang University (浙江大学); Harbin Institute of Technology (哈尔滨工业大学); Suzhou Research Institute, Harbin Institute of Technology (哈尔滨工业大学苏州研究院); School of Software Technology, Zhejiang University (浙江大学软件学院); North China University of Water Resources and Electric Power (华北水利水电大学); The University of Auckland (奥克兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In remote sensing imagery, multi-class change detection (MCD) is crucial for fine-grained monitoring, yet it has long been constrained by complex scene variations and the scarcity of detailed annotations. To address this, we propose the Tri-path DINO architecture, which adopts a three-path complementary feature learning strategy to facilitate the rapid adaptation of pre-trained foundation models to complex vertical domains. Specifically, we employ the DINOv3 pre-trained model as the backbone feature extraction network to learn coarse-grained features. An auxiliary path also adopts a Siamese structure, progressively aggregating intermediate features from the Siamese encoder to enhance the learning of fine-grained features. Finally, a multi-scale attention mechanism is introduced to augment the decoder network, where parallel convolutions adaptively capture and enhance contextual information under different receptive fields. The proposed method achieves optimal performance on the MCD task on both the Gaza facility damage assessment dataset (Gaza-Change) and the classic SECOND dataset. Grad-CAM visualizations further confirm that the main and auxiliary paths naturally focus on coarse-grained semantic changes and fine-grained structural details, respectively. This synergistic complementarity provides a robust and interpretable solution for advanced change detection tasks, offering a basis for rapid and accurate damage assessment.
[CV-98] Radiometrically Consistent Gaussian Surfels for Inverse Rendering ICLR2026
【速读】:该论文旨在解决基于高斯点绘(Gaussian Splatting)的逆渲染方法中,难以准确分离材质属性与复杂全局光照效应(尤其是间接光照)的问题。现有方法依赖于为新视角合成预训练的高斯原型,但由于这些原型仅在有限视角下受监督,缺乏对未观测视角下间接辐射的建模能力。其解决方案的关键在于引入辐射一致性(radiometric consistency)——一种基于物理的约束机制,通过最小化每个高斯原型学习到的辐射与其物理渲染对应物之间的残差,为未观测视角提供监督信号。这一机制建立了自校正反馈回路,融合了物理渲染与新视角合成的双重监督,从而实现对多次反射的精确建模。在此基础上,作者提出了Radiometrically Consistent Gaussian Surfels (RadioGS),利用高斯面元(Gaussian surfels)和二维高斯射线追踪高效整合该约束,并进一步设计了基于微调的快速重照明策略,在分钟级时间内适应新光照条件,同时保持极低的渲染成本(10ms)。
链接: https://arxiv.org/abs/2603.01491
作者: Kyu Beom Han,Jaeyoon Kim,Woo Jae Kim,Jinhwan Seo,Sung-eui Yoon
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 9 pages, 6 figures, ICLR 2026 Oral paper
Abstract:Inverse rendering with Gaussian Splatting has advanced rapidly, but accurately disentangling material properties from complex global illumination effects, particularly indirect illumination, remains a major challenge. Existing methods often query indirect radiance from Gaussian primitives pre-trained for novel-view synthesis. However, these pre-trained Gaussian primitives are supervised only towards limited training viewpoints, thus lack supervision for modeling indirect radiances from unobserved views. To address this issue, we introduce radiometric consistency, a novel physically-based constraint that provides supervision towards unobserved views by minimizing the residual between each Gaussian primitive’s learned radiance and its physically-based rendered counterpart. Minimizing the residual for unobserved views establishes a self-correcting feedback loop that provides supervision from both physically-based rendering and novel-view synthesis, enabling accurate modeling of inter-reflection. We then propose Radiometrically Consistent Gaussian Surfels (RadioGS), an inverse rendering framework built upon our principle by efficiently integrating radiometric consistency by utilizing Gaussian surfels and 2D Gaussian ray tracing. We further propose a finetuning-based relighting strategy that adapts Gaussian surfel radiances to new illuminations within minutes, achieving low rendering cost (10ms). Extensive experiments on existing inverse rendering benchmarks show that RadioGS outperforms existing Gaussian-based methods in inverse rendering, while retaining the computational efficiency.
[CV-99] ATA: Bridging Implicit Reasoning with Attention-Guided and Action-Guided Inference for Vision-Language Action Models ICRA2026
【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在推理过程中依赖显式推理机制所面临的效率低、数据依赖性强及部署复杂等问题。现有方法通常需要大量标注数据(如链式思维Chain-of-Thought, CoT注释或视觉定位标注),并涉及耗时的重新训练流程,导致推理序列变长且性能提升有限。解决方案的关键在于提出一种无需训练的隐式推理框架ATA(Attention-guided and Action-guided Reasoning),通过互补注意力引导与动作引导策略,将注意力图与基于动作的感兴趣区域(Region of Interest, RoI)融合,从而自适应地优化视觉输入,实现高效、轻量且无需额外标注的推理增强。
链接: https://arxiv.org/abs/2603.01490
作者: Cheng Yang,Jianhao Jiao,Lingyi Huang,Jinqi Xiao,Zhexiang Tang,Yu Gong,Yibiao Ying,Yang Sui,Jintian Lin,Wen Huang,Bo Yuan
机构: Rutgers University (罗格斯大学); University College London (伦敦大学学院); Rice University (莱斯大学); TCL High-Tech Development Co., Ltd. (TCL高科技发展有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICRA 2026
Abstract:Vision-Language-Action (VLA) models rely on current observations, including images, language instructions, and robot states, to predict actions and complete tasks. While accurate visual perception is crucial for precise action prediction and execution, recent work has attempted to further improve performance by introducing explicit reasoning during inference. However, such approaches face significant limitations. They often depend on data-intensive resources such as Chain-of-Thought (CoT) style annotations to decompose tasks into step-by-step reasoning, and in many cases require additional visual grounding annotations (e.g., bounding boxes or masks) to highlight relevant image regions. Moreover, they involve time-consuming dataset construction, labeling, and retraining, which ultimately results in longer inference sequences and reduced efficiency. To address these challenges, we propose ATA, a novel training-free framework that introduces implicit reasoning into VLA inference through complementary attention-guided and action-guided strategies. Unlike CoT or explicit visual-grounding methods, ATA formulates reasoning implicitly by integrating attention maps with an action-based region of interest (RoI), thereby adaptively refining visual inputs without requiring extra training or annotations. ATA is a plug-and-play implicit reasoning approach for VLA models, lightweight yet effective. Extensive experiments show that it consistently improves task success and robustness while preserving, and even enhancing, inference efficiency.
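The core idea of fusing an attention map with an action-derived region of interest to refine the visual input can be sketched as follows. The patch-grid size, the top-k selection, and the union-of-boxes rule are illustrative assumptions, not ATA's exact procedure:

```python
import numpy as np

def attention_roi(attn, topk=4):
    """Bounding box (r0, r1, c0, c1) around the top-k attention cells."""
    idx = np.argsort(attn.ravel())[-topk:]
    ys, xs = np.unravel_index(idx, attn.shape)
    return ys.min(), ys.max() + 1, xs.min(), xs.max() + 1

def refine_input(image, attn, action_box):
    """Union the attention RoI with an action-derived box, then crop the image,
    so the VLA model sees only the task-relevant region."""
    r0, r1, c0, c1 = attention_roi(attn)
    ar0, ar1, ac0, ac1 = action_box  # RoI implied by recent action predictions
    sy, sx = image.shape[0] // attn.shape[0], image.shape[1] // attn.shape[1]
    y0, y1 = min(r0 * sy, ar0), max(r1 * sy, ar1)
    x0, x1 = min(c0 * sx, ac0), max(c1 * sx, ac1)
    return image[y0:y1, x0:x1]
```

Because only the input is adaptively cropped, no extra training, annotation, or longer reasoning sequence is required, which is what makes the scheme plug-and-play.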
[CV-100] SCATR: Mitigating New Instance Suppression in LiDAR-based Tracking-by-Attention via Second Chance Assignment and Track Query Dropout
【速读】:该论文旨在解决LiDAR-based tracking-by-attention (TBA) 方法中普遍存在高假负例(false negative)错误的问题,从而缩小其与传统LiDAR-based tracking-by-detection (TBD) 方法之间的性能差距。解决方案的关键在于提出两种架构无关的训练策略:Second Chance Assignment 和 Track Query Dropout。前者通过在二分图匹配前将未分配的track query拼接至proposal query,使这些track query获得重新分配至真实目标的“第二次机会”,缓解检测与跟踪任务间的冲突;后者则通过随机丢弃track query来多样化监督配置,提升解码器对不同track query集合的鲁棒性,增强对缺失或新生目标的处理能力。实验表明,所提方法SCATR在nuScenes数据集上显著优于现有LiDAR-TBA方法,AMOTA指标提升7.6%,有效弥合了TBA与TBD方法间的长期性能鸿沟。
链接: https://arxiv.org/abs/2603.01485
作者: Brian Cheong,Letian Wang,Sandro Papais,Steven L. Waslander
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:LiDAR-based tracking-by-attention (TBA) frameworks inherently suffer from high false negative errors, leading to a significant performance gap compared to traditional LiDAR-based tracking-by-detection (TBD) methods. This paper introduces SCATR, a novel LiDAR-based TBA model designed to address this fundamental challenge systematically. SCATR leverages recent progress in vision-based tracking and incorporates targeted training strategies specifically adapted for LiDAR. Our work’s core innovations are two architecture-agnostic training strategies for TBA methods: Second Chance Assignment and Track Query Dropout. Second Chance Assignment is a novel ground truth assignment that concatenates unassigned track queries to the proposal queries before bipartite matching, giving these track queries a second chance to be assigned to a ground truth object and effectively mitigating the conflict between detection and tracking tasks inherent in tracking-by-attention. Track Query Dropout is a training method that diversifies supervised object query configurations to efficiently train the decoder to handle different track query sets, enhancing robustness to missing or newborn tracks. Experiments on the nuScenes tracking benchmark demonstrate that SCATR achieves state-of-the-art performance among LiDAR-based TBA methods, outperforming previous works by 7.6% AMOTA and successfully bridging the long-standing performance gap between LiDAR-based TBA and TBD methods. Ablation studies further validate the effectiveness and generalization of Second Chance Assignment and Track Query Dropout. 
Code can be found at the following link: this https URL
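A toy sketch of the Second Chance Assignment idea: track queries keep their identity match, while unassigned track queries join the proposal pool before matching against the remaining ground truth. The greedy matcher below stands in for the Hungarian algorithm, and all shapes and costs are illustrative:

```python
import numpy as np

def greedy_match(cost):
    """Tiny greedy stand-in for Hungarian bipartite matching."""
    pairs, used_r, used_c = [], set(), set()
    order = sorted(((r, c) for r in range(cost.shape[0])
                    for c in range(cost.shape[1])), key=lambda rc: cost[rc])
    for r, c in order:
        if r not in used_r and c not in used_c:
            pairs.append((r, c)); used_r.add(r); used_c.add(c)
    return pairs

def second_chance_assignment(track_ids, gt_ids, proposal_cost, track_cost):
    """Track queries keep their identity match; *unassigned* track queries are
    concatenated to the proposal pool and matched again to the remaining GT."""
    assigned, leftover = {}, []
    for ti, tid in enumerate(track_ids):
        if tid in gt_ids:
            assigned[gt_ids.index(tid)] = ("track", ti)
        else:
            leftover.append(ti)
    free_gt = [g for g in range(len(gt_ids)) if g not in assigned]
    if not free_gt:
        return assigned
    pool = np.concatenate([proposal_cost[:, free_gt],
                           track_cost[leftover][:, free_gt]], axis=0)
    n_prop = proposal_cost.shape[0]
    for row, col in greedy_match(pool):
        g = free_gt[col]
        assigned[g] = ("proposal", row) if row < n_prop else ("track", leftover[row - n_prop])
    return assigned
```

In this reading, a track query whose old target has disappeared can still claim a newborn object if its cost is lowest, instead of being forced to predict "no object" while a proposal query duplicates the detection.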
[CV-101] WildCross: A Cross-Modal Large Scale Benchmark for Place Recognition and Metric Depth Estimation in Natural Environments ICRA
【速读】:该论文旨在解决当前机器人感知研究中缺乏适用于复杂非结构化自然环境的多模态基准数据集的问题。现有数据集主要来自结构化的城市环境,难以满足在野外等真实自然场景下进行视觉、激光雷达及跨模态定位与度量深度估计的需求。其解决方案的关键在于提出WildCross——一个大规模自然环境中用于位姿识别和度量深度估计的跨模态基准,包含超过476K连续RGB帧,配有半密集深度图和表面法向量标注,并与精确的6自由度(6DoF)位姿及同步的密集激光雷达子地图对齐,从而为多模态机器人感知任务提供具有挑战性的评测平台。
链接: https://arxiv.org/abs/2603.01475
作者: Joshua Knights,Joseph Reid,Kaushik Roy,David Hall,Mark Cox,Peyman Moghadam
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE International Conference on Robotics Automation (ICRA) 2026
Abstract:Recent years have seen a significant increase in demand for robotic solutions in unstructured natural environments, alongside growing interest in bridging 2D and 3D scene understanding. However, existing robotics datasets are predominantly captured in structured urban environments, making them inadequate for addressing the challenges posed by complex, unstructured natural settings. To address this gap, we propose WildCross, a cross-modal benchmark for place recognition and metric depth estimation in large-scale natural environments. WildCross comprises over 476K sequential RGB frames with semi-dense depth and surface normal annotations, each aligned with accurate 6DoF poses and synchronized dense lidar submaps. We conduct comprehensive experiments on visual, lidar, and cross-modal place recognition, as well as metric depth estimation, demonstrating the value of WildCross as a challenging benchmark for multi-modal robotic perception tasks. We provide access to the code repository and dataset at https://csiro-robotics.github.io/WildCross.
[CV-102] UltraStar: Semantic-Aware Star Graph Modeling for Echocardiography Navigation
【速读】:该论文旨在解决超声心动图(Echocardiography)中因缺乏熟练操作人员而导致的探头自动导航难题,尤其是在历史扫描数据存在噪声和探索性轨迹的情况下,现有方法因将历史信息建模为顺序链而易过拟合,导致长序列下性能下降。解决方案的关键在于提出UltraStar框架,其核心创新是将探头导航从路径回归重构为基于锚点的全局定位问题:通过构建星形图(Star Graph)结构,将历史关键帧作为空间锚点直接关联当前视图,显式建模几何约束以实现精准定位;同时引入语义感知采样策略,从海量历史日志中主动筛选代表性地标,降低冗余并提升锚定精度,从而在含噪探索场景下实现更鲁棒、可扩展的导航性能。
链接: https://arxiv.org/abs/2603.01461
作者: Teng Wang,Haojun Jiang,Chenxi Li,Diwen Wang,Yihang Tang,Zhenguo Sun,Yujiao Deng,Shiji Song,Gao Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Echocardiography is critical for diagnosing cardiovascular diseases, yet the shortage of skilled sonographers hinders timely patient care, due to high operational difficulties. Consequently, research on automated probe navigation has significant clinical potential. To achieve robust navigation, it is essential to leverage historical scanning information, mimicking how experts rely on past feedback to adjust subsequent maneuvers. Practical scanning data collected from sonographers typically consists of noisy trajectories inherently generated through trial-and-error exploration. However, existing methods typically model this history as a sequential chain, forcing models to overfit these noisy paths, leading to performance degradation on long sequences. In this paper, we propose UltraStar, which reformulates probe navigation from path regression to anchor-based global localization. By establishing a Star Graph, UltraStar treats historical keyframes as spatial anchors connected directly to the current view, explicitly modeling geometric constraints for precise positioning. We further enhance the Star Graph with a semantic-aware sampling strategy that actively selects the representative landmarks from massive history logs, reducing redundancy for accurate anchoring. Extensive experiments on a dataset with over 1.31 million samples demonstrate that UltraStar outperforms baselines and scales better with longer input lengths, revealing a more effective topology for history modeling under noisy exploration.
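The anchor-based localization reformulation can be sketched as a weighted vote: each historical keyframe anchor proposes the current probe pose through a predicted relative offset. The pose dimensionality, the offsets, and the weighted fusion below are our simplifying assumptions; in UltraStar these quantities would be predicted by the network over the star graph:

```python
import numpy as np

def localize_from_anchors(anchor_poses, predicted_offsets, weights):
    """Star-graph localization: every historical anchor votes for the current
    probe pose through its predicted relative offset; votes are fused by weight."""
    votes = np.asarray(anchor_poses) + np.asarray(predicted_offsets)  # (K, D)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                       # normalize anchor confidences
    return (w[:, None] * votes).sum(axis=0)
```

Because every anchor constrains the current view directly, errors do not accumulate along a noisy sequential chain, which matches the paper's motivation for replacing path regression.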
[CV-103] VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models
【速读】:该论文旨在解决视频大语言模型(Video-LLMs)在安全关键场景中面临的能量-延迟攻击(Energy-Latency Attacks, ELAs)问题,此类攻击通过消耗大量计算资源导致系统失效。现有基于图像的方法因时间聚合机制稀释单帧扰动而失效,且实时性要求使得逐实例优化不可行。解决方案的关键在于提出首个面向Video-LLMs的通用ELA框架VidDoS,其核心创新包括:利用**掩码教师强制(masked teacher forcing)引导模型生成高成本目标序列,结合拒绝惩罚(refusal penalty)和提前终止抑制(early-termination suppression)**以克服模型对简洁输出的偏好,从而实现无需推理时梯度计算的实例无关触发器生成,最终在多个主流Video-LLMs和真实场景中引发超过205倍token膨胀和15倍延迟增长,显著危及自动驾驶等系统的安全性。
链接: https://arxiv.org/abs/2603.01454
作者: Duoxun Tang,Dasen Dai,Jiyao Wang,Xiao Yang,Jianyu Wang,Siqi Cai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Video-LLMs are increasingly deployed in safety-critical applications but are vulnerable to Energy-Latency Attacks (ELAs) that exhaust computational resources. Current image-centric methods fail because temporal aggregation mechanisms dilute individual frame perturbations. Additionally, real-time demands make instance-wise optimization impractical for continuous video streams. We introduce VidDoS, the first universal ELA framework tailored for Video-LLMs. Our method leverages universal optimization to create instance-agnostic triggers that require no inference-time gradient calculation. We achieve this through masked teacher forcing to steer models toward expensive target sequences, combined with a refusal penalty and early-termination suppression to override conciseness priors. Testing across three mainstream Video-LLMs and three video datasets, which include video question answering and autonomous driving scenarios, shows extreme degradation. VidDoS induces a token expansion of more than 205x and inflates the inference latency by more than 15x relative to clean baselines. Simulations of real-time autonomous driving streams further reveal that this induced latency leads to critical safety violations. We urge the community to recognize and mitigate these high-hazard ELAs in Video-LLMs.
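The attack objective can be illustrated schematically: cross-entropy toward a long, expensive target sequence plus penalties on refusal tokens and early EOS. The token ids, weights, and plain-softmax formulation below are illustrative, not the authors' released loss:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ela_loss(logits, target_ids, refusal_ids, eos_id, lam_refuse=1.0, lam_eos=1.0):
    """Masked-teacher-forcing-style objective: steer generation toward a long,
    expensive target sequence while suppressing refusal tokens and early EOS."""
    probs = softmax(logits)                                   # (T, V)
    T = len(target_ids)
    ce = -np.log(probs[np.arange(T), target_ids]).mean()      # follow expensive target
    refuse = probs[:, refusal_ids].sum(axis=-1).mean()        # refusal penalty
    eos = probs[:, eos_id].mean()                             # early-termination suppression
    return ce + lam_refuse * refuse + lam_eos * eos
```

Minimizing this loss over a universal trigger perturbation (rather than per-instance) is what lets the attack run with no inference-time gradient computation.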
[CV-104] Deepfake Forensics Adapter: A Dual-Stream Network for Generalizable Deepfake Detection
【速读】:该论文旨在解决深度伪造(Deepfake)生成技术快速发展所带来的公共安全威胁问题,特别是现有检测方法在面对新型伪造模式时泛化能力不足的局限性。解决方案的关键在于提出一种名为Deepfake Forensics Adapter (DFA) 的双流框架,其核心创新是将预训练的视觉-语言基础模型CLIP与针对性的取证分析相结合:通过全局特征适配器捕捉图像内容中的全局不一致性、局部异常流利用面部结构先验增强对局部伪造线索的感知能力,并借助交互式融合分类器通过Transformer编码器实现全局与局部特征的深度交互与融合,从而在不修改CLIP参数的前提下显著提升检测性能和泛化能力。
链接: https://arxiv.org/abs/2603.01450
作者: Jianfeng Liao,Yichen Wei,Raymond Chan Ching Bon,Shulan Wang,Kam-Pui Chow,Kwok-Yan Lam
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICDF2C 2025
Abstract:The rapid advancement of deepfake generation techniques poses significant threats to public safety and causes societal harm through the creation of highly realistic synthetic facial media. While existing detection methods demonstrate limitations in generalizing to emerging forgery patterns, this paper presents Deepfake Forensics Adapter (DFA), a novel dual-stream framework that synergizes vision-language foundation models with targeted forensics analysis. Our approach integrates a pre-trained CLIP model with three core components to achieve specialized deepfake detection by leveraging the powerful general capabilities of CLIP without changing CLIP parameters: 1) A Global Feature Adapter is used to identify global inconsistencies in image content that may indicate forgery, 2) A Local Anomaly Stream enhances the model’s ability to perceive local facial forgery cues by explicitly leveraging facial structure priors, and 3) An Interactive Fusion Classifier promotes deep interaction and fusion between global and local features using a transformer encoder. Extensive evaluations of frame-level and video-level benchmarks demonstrate the superior generalization capabilities of DFA, particularly achieving state-of-the-art performance in the challenging DFDC dataset with frame-level AUC/EER of 0.816/0.256 and video-level AUC/EER of 0.836/0.251, representing a 4.8% video AUC improvement over previous methods. Our framework not only demonstrates state-of-the-art performance, but also points out a feasible and effective direction for developing a robust deepfake detection system with enhanced generalization capabilities against the evolving deepfake threats. Our code is available at this https URL
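Adapting a frozen CLIP backbone typically means training only small side modules; the generic bottleneck-adapter sketch below (layer sizes, residual form, and parameter names are our assumptions, not DFA's exact adapter) shows how CLIP's parameters can stay untouched:

```python
import numpy as np

def forward_adapter(clip_feats, W_down, W_up):
    """Bottleneck adapter over frozen CLIP features: down-project, ReLU,
    up-project, residual add. Only W_down / W_up would be trained; the CLIP
    backbone that produced clip_feats is never updated."""
    h = np.maximum(clip_feats @ W_down, 0.0)
    return clip_feats + h @ W_up
```

With W_up initialized to zero, the adapter starts as an identity mapping, a common trick so that training begins from CLIP's unmodified general-purpose features.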
[CV-105] Unifying Language-Action Understanding and Generation for Autonomous Driving
【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在端到端自动驾驶中面临的两大关键问题:一是语言指令与动作输出之间存在的持续性错位(misalignment),二是典型自回归方式生成动作序列的固有低效性。解决方案的核心在于提出LinkVLA架构,其关键创新包括:首先,通过将语言和动作标记统一至共享离散码本(discrete codebook),并在单一多模态模型中处理,从结构上强制跨模态一致性;其次,引入辅助的动作理解目标,训练模型从轨迹生成描述性文本,从而建立双向的语言-动作语义映射;最后,采用两阶段粗到精(coarse-to-fine, C2F)生成策略替代逐步自回归解码,显著提升推理效率,实现86%的推理时间节省。
链接: https://arxiv.org/abs/2603.01441
作者: Xinyang Wang,Qian Liu,Wenjie Ding,Zhao Yang,Wei Li,Chang Liu,Bailin Li,Kun Zhan,Xianpeng Lang,Wei Chen
机构: Zhejiang University (浙江大学); Li Auto (理想汽车)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Vision-Language-Action (VLA) models are emerging as a promising paradigm for end-to-end autonomous driving, valued for their potential to leverage world knowledge and reason about complex driving scenes. However, existing methods suffer from two critical limitations: a persistent misalignment between language instructions and action outputs, and the inherent inefficiency of typical auto-regressive action generation. In this paper, we introduce LinkVLA, a novel architecture that directly addresses these challenges to enhance both alignment and efficiency. First, we establish a structural link by unifying language and action tokens into a shared discrete codebook, processed within a single multi-modal model. This structurally enforces cross-modal consistency from the ground up. Second, to create a deep semantic link, we introduce an auxiliary action understanding objective that trains the model to generate descriptive captions from trajectories, fostering a bidirectional language-action mapping. Finally, we replace the slow, step-by-step generation with a two-step coarse-to-fine generation method C2F that efficiently decodes the action sequence, saving 86% inference time. Experiments on closed-loop driving benchmarks show consistent gains in instruction following accuracy and driving performance, alongside reduced inference latency.
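The shared-codebook idea can be illustrated with a nearest-neighbor quantizer: continuous language and action (trajectory) embeddings both become indices into the same discrete vocabulary. The codebook here is purely illustrative; in LinkVLA it would be learned jointly across the two modalities:

```python
import numpy as np

def quantize(vectors, codebook):
    """Nearest-codebook-entry lookup: continuous language or action embeddings
    both map to indices into the same shared discrete codebook."""
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)
```

Because both modalities are expressed in one token space, a single multi-modal Transformer can attend over them uniformly, which is the structural link the paper argues enforces cross-modal consistency.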
[CV-106] DOCFORGE-BENCH: A Comprehensive Benchmark for Document Forgery Detection and Analysis
【速读】:该论文旨在解决文档伪造检测(Document Forgery Detection)领域缺乏统一、零样本评估基准的问题,以真实场景中无标注训练数据的部署需求为导向,系统性地评估14种方法在文本篡改、收据伪造和身份文件篡改等多类任务上的表现。其关键解决方案在于揭示了现有方法普遍存在校准失败(calibration failure)问题:尽管多数方法在像素级AUC指标上表现尚可(=0.76),但因篡改区域仅占图像像素的0.27–4.17%(远低于自然图像基准),导致标准阈值τ=0.5严重失准,造成像素F1分数接近于零。通过控制实验验证,仅需在N=10张目标域图像上调整单一阈值即可恢复39–55%的最优F1差距,表明校准而非特征表示才是当前技术瓶颈,因此“阈值自适应”是实现实际部署的关键缺失步骤。
链接: https://arxiv.org/abs/2603.01433
作者: Zengqi Zhao,Weidi Xia,Peter Wei,Yan Zhang,Yiyi Zhang,Jane Mo,Tiannan Zhang,Yuanqin Dai,Zexi Chen,Simiao Ren
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present DOCFORGE-BENCH, the first unified zero-shot benchmark for document forgery detection, evaluating 14 methods across eight datasets spanning text tampering, receipt forgery, and identity document manipulation. Unlike fine-tuning-oriented evaluations such as ForensicHub [Du et al., 2025], DOCFORGE-BENCH applies all methods with their published pretrained weights and no domain adaptation – a deliberate design choice that reflects the realistic deployment scenario where practitioners lack labeled document training data. Our central finding is a pervasive calibration failure invisible under single-threshold protocols: methods achieve moderate Pixel-AUC (=0.76) yet near-zero Pixel-F1. This AUC-F1 gap is not a discrimination failure but a score-distribution shift: tampered regions occupy only 0.27-4.17% of pixels in document images – an order of magnitude less than in natural image benchmarks – making the standard tau=0.5 threshold catastrophically miscalibrated. Oracle-F1 is 2-10x higher than fixed-threshold Pixel-F1, confirming that calibration, not representation, is the bottleneck. A controlled calibration experiment validates this: adapting a single threshold on N=10 domain images recovers 39-55% of the Oracle-F1 gap, demonstrating that threshold adaptation – not retraining – is the key missing step for practical deployment. Overall, no evaluated method works reliably out-of-the-box on diverse document types, underscoring that document forgery detection remains an unsolved problem. We further note that all eight datasets predate the era of generative AI editing; benchmarks covering diffusion- and LLM-based document forgeries represent a critical open gap on the modern attack surface.
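The paper's N=10 threshold-adaptation step can be sketched directly: search a grid of thresholds for the one maximizing mean pixel-F1 on a handful of labeled calibration images. The grid range and the synthetic data in the test are our assumptions:

```python
import numpy as np

def pixel_f1(pred, gt):
    """Pixel-level F1 between a binary prediction mask and a ground-truth mask."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def calibrate_threshold(score_maps, gt_masks, grid=None):
    """Pick the single threshold maximizing mean pixel-F1 on a few labeled
    calibration images (the paper's N=10 adaptation step)."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    best_tau, best_f1 = 0.5, -1.0
    for tau in grid:
        f1 = float(np.mean([pixel_f1(s >= tau, g)
                            for s, g in zip(score_maps, gt_masks)]))
        if f1 > best_f1:
            best_tau, best_f1 = tau, f1
    return best_tau
```

When tampered regions score well below 0.5 (as the benchmark finds for document images, where they cover under 5% of pixels), this one-parameter adaptation is what recovers most of the Oracle-F1 gap without any retraining.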
[CV-107] SeaVIS: Sound-Enhanced Association for Online Audio-Visual Instance Segmentation
【速读】:该论文旨在解决当前音频-视觉实例分割(Audio-Visual Instance Segmentation, AVIS)方法普遍采用离线处理范式,无法在连续视频流中跨片段关联检测到的实例,从而限制了其在真实场景中的应用问题。解决方案的关键在于提出首个面向在线处理的框架SeaVIS,其核心创新包括:1)引入因果交叉注意力融合(Causal Cross Attention Fusion, CCAF)模块,在严格因果约束下整合当前帧的视觉特征与完整的音频历史信息,实现高效在线推理;2)设计音频引导对比学习(Audio-Guided Contrastive Learning, AGCL)策略,构建同时编码视觉外观和发声活动的实例原型,有效区分发声与静默状态下的实例,从而在实例关联过程中抑制非发声对象的误分割,显著提升模型的音频跟随能力。
链接: https://arxiv.org/abs/2603.01431
作者: Yingjian Zhu,Ying Wang,Yuyang Hong,Ruohao Guo,Kun Ding,Xin Gu,Bin Fan,Shiming Xiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by Machine Intelligence Research
Abstract:Recently, an audio-visual instance segmentation (AVIS) task has been introduced, aiming to identify, segment and track individual sounding instances in videos. However, prevailing methods primarily adopt the offline paradigm, that cannot associate detected instances across consecutive clips, making them unsuitable for real-world scenarios that involve continuous video streams. To address this limitation, we introduce SeaVIS, the first online framework designed for audio-visual instance segmentation. SeaVIS leverages the Causal Cross Attention Fusion (CCAF) module to enable efficient online processing, which integrates visual features from the current frame with the entire audio history under strict causal constraints. A major challenge for conventional VIS methods is that appearance-based instance association fails to distinguish between an object’s sounding and silent states, resulting in the incorrect segmentation of silent objects. To tackle this, we employ an Audio-Guided Contrastive Learning (AGCL) strategy to generate instance prototypes that encode not only visual appearance but also sounding activity. In this way, instances preserved during per-frame prediction that do not emit sound can be effectively suppressed during instance association process, thereby significantly enhancing the audio-following capability of SeaVIS. Extensive experiments conducted on the AVISeg dataset demonstrate that SeaVIS surpasses existing state-of-the-art models across multiple evaluation metrics while maintaining a competitive inference speed suitable for real-time processing.
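The strict-causality constraint of the CCAF module can be illustrated with a masked cross-attention step: visual queries of the current frame attend only to audio tokens from frames seen so far. This single-head NumPy version with no learned projections is purely schematic:

```python
import numpy as np

def causal_cross_attention(visual_q, audio_kv, frame_idx, audio_per_frame):
    """Visual queries of frame `frame_idx` attend only to audio tokens from
    frames <= frame_idx, so online inference never peeks at future audio."""
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    d = visual_q.shape[-1]
    scores = visual_q @ audio_kv.T / np.sqrt(d)   # (Nq, Na)
    cutoff = (frame_idx + 1) * audio_per_frame
    scores[:, cutoff:] = -1e9                     # mask out future audio tokens
    return softmax(scores) @ audio_kv
```

The mask is what makes the model usable on a live stream: outputs for the current frame are identical no matter what audio arrives later.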
[CV-108] UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation CVPR2026
【速读】:该论文旨在解决当前先进音视频生成模型(如Veo3和Sora2)因闭源特性导致架构与训练范式不可访问的问题,从而限制了研究者对高性能生成技术的复现与改进。解决方案的关键在于提出UniTalking——一个统一的端到端扩散框架,其核心创新是采用多模态Transformer模块(Multi-Modal Transformer Blocks),通过共享自注意力机制显式建模音频与视频潜在表示之间的细粒度时间对应关系,同时利用预训练视频生成模型的强大先验知识实现高保真视觉质量并提升训练效率。此外,该框架还集成个性化语音克隆能力,仅需短音频参考即可生成目标风格的语音,显著提升了应用灵活性与表现力。
链接: https://arxiv.org/abs/2603.01418
作者: Hebeizi Li,Zihao Liang,Benyuan Sun,Zihao Yin,Xiao Sha,Chenliang Wang,Yi Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
备注: Accepted at CVPR 2026 (Findings Track)
Abstract:While state-of-the-art audio-video generation models like Veo3 and Sora2 demonstrate remarkable capabilities, their closed-source nature makes their architectures and training paradigms inaccessible. To bridge this gap in accessibility and performance, we introduce UniTalking, a unified, end-to-end diffusion framework for generating high-fidelity speech and lip-synchronized video. At its core, our framework employs Multi-Modal Transformer Blocks to explicitly model the fine-grained temporal correspondence between audio and video latent tokens via a shared self-attention mechanism. By leveraging powerful priors from a pre-trained video generation model, our framework ensures state-of-the-art visual fidelity while enabling efficient training. Furthermore, UniTalking incorporates a personalized voice cloning capability, allowing the generation of speech in a target style from a brief audio reference. Qualitative and quantitative results demonstrate that our method produces highly realistic talking portraits, achieving superior performance over existing open-source approaches in lip-sync accuracy, audio naturalness, and overall perceptual quality.
[CV-109] UETrack: A Unified and Efficient Framework for Single Object Tracking
【速读】:该论文旨在解决现有单目标跟踪方法在多模态场景下效率与性能不足的问题,特别是针对RGB输入局限性以及当前多模态跟踪方法设计复杂、计算开销大、难以部署于资源受限环境的瓶颈。解决方案的关键在于提出UETrack框架,其核心创新包括:(1)基于Token-Pooling的专家混合机制(Token-Pooling-based Mixture-of-Experts),通过特征聚合与专家专业化提升模型表达能力;(2)目标感知自适应蒸馏策略(Target-aware Adaptive Distillation),依据样本特性选择性地进行知识蒸馏,减少冗余监督并提升跟踪精度。该设计实现了在多种模态(RGB、深度、热成像、事件流、语言)下的高效鲁棒跟踪,在多个基准和硬件平台上均展现出优越的速度-精度平衡。
链接: https://arxiv.org/abs/2603.01412
作者: Ben Kang,Jie Zhao,Xin Chen,Wanting Geng,Bin Zhang,Lu Zhang,Dong Wang,Huchuan Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With growing real-world demands, efficient tracking has received increasing attention. However, most existing methods are limited to RGB inputs and struggle in multi-modal scenarios. Moreover, current multi-modal tracking approaches typically use complex designs, making them too heavy and slow for resource-constrained deployment. To tackle these limitations, we propose UETrack, an efficient framework for single object tracking. UETrack demonstrates high practicality and versatility, efficiently handling multiple modalities including RGB, Depth, Thermal, Event, and Language, and addresses the gap in efficient multi-modal tracking. It introduces two key components: a Token-Pooling-based Mixture-of-Experts mechanism that enhances modeling capacity through feature aggregation and expert specialization, and a Target-aware Adaptive Distillation strategy that selectively performs distillation based on sample characteristics, reducing redundant supervision and improving performance. Extensive experiments on 12 benchmarks across 3 hardware platforms show that UETrack achieves a superior speed-accuracy trade-off compared to previous methods. For instance, UETrack-B achieves 69.2% AUC on LaSOT and runs at 163/56/60 FPS on GPU/CPU/AGX, demonstrating strong practicality and versatility. Code is available at this https URL.
[CV-110] Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models CVPR2026
【速读】:该论文旨在解决视频大语言模型(Video Large Language Models, VLLMs)在处理视频理解任务时因冗余视觉标记(visual tokens)导致的计算效率低下问题。现有剪枝方法主要针对帧内空间冗余或在大语言模型(LLM)内部进行浅层剪枝,难以实现有效的时空联合压缩,并常忽略合并或剪枝后仍具信息量的细微上下文。其解决方案的关键在于提出一种基于局部-全局最优传输(local-global Optimal Transport, AOT)的新视角:首先在每帧内通过注意力机制建立局部与全局感知的标记锚点(token anchors),利用最优传输聚合剪枝后的信息以构建帧内锚点;进而,在时间帧片段(temporal frame clips)基础上,将每个片段的第一帧作为关键帧锚点,通过最优传输整合连续帧中的相似信息,同时保留差异性标记以表征时间动态变化,从而实现无需训练的高效标记压缩。此方法在多个短/长视频基准上均展现出优异性能与显著的计算效率提升。
链接: https://arxiv.org/abs/2603.01400
作者: Jinlong Li,Liyuan Jiang,Haonan Zhang,Nicu Sebe
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. Existing pruning primarily targets intra-frame spatial redundancy or prunes inside the LLM with shallow-layer overhead, yielding suboptimal spatiotemporal reduction and underutilizing long-context compressibility. All of these methods often discard subtle yet informative context from merged or pruned tokens. In this paper, we propose a new perspective that elaborates token Anchors within intra-frame and inter-frame contexts to comprehensively aggregate the informative contexts via local-global Optimal Transport (AOT). Specifically, we first establish local- and global-aware token anchors within each frame under attention guidance, after which optimal transport aggregates the informative contexts from pruned tokens, constructing intra-frame token anchors. Then, building on the temporal frame clips, the first frame within each clip is considered the keyframe anchor to ensemble similar information from consecutive frames through optimal transport, while keeping distinct tokens to represent temporal dynamics, leading to efficient token reduction in a training-free manner. Extensive evaluations show that our proposed AOT obtains competitive performance across various short- and long-video benchmarks on leading video LLMs, obtaining substantial computational efficiency while preserving temporal and visual fidelity. Project webpage: this https URL.
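论文核心的"用最优传输把被剪枝标记的信息聚合进锚点"这一步,可以用熵正则化的 Sinkhorn 迭代做一个示意性草图(锚点此处用随机向量模拟、成本函数与聚合方式均为假设,仅演示原理):

```python
import numpy as np

def sinkhorn(cost, n_iters=50, eps=0.1):
    """熵正则化最优传输(标准 Sinkhorn 迭代,均匀边缘分布)。"""
    K = np.exp(-cost / eps)
    a = np.ones(cost.shape[0]) / cost.shape[0]
    b = np.ones(cost.shape[1]) / cost.shape[1]
    u = np.ones_like(a)
    for _ in range(n_iters):
        u = a / (K @ (b / (K.T @ u)))
    v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
pruned = rng.standard_normal((12, 16))    # 将被剪枝的视觉标记
anchors = rng.standard_normal((3, 16))    # 注意力引导选出的锚点(此处随机模拟)
cost = ((pruned[:, None] - anchors[None]) ** 2).sum(-1)  # 两两平方距离
cost /= cost.max()                        # 归一化成本,避免 exp 下溢
P = sinkhorn(cost)                        # (12, 3) 传输方案
anchors_out = anchors + (P / P.sum(0)).T @ pruned  # 按方案把信息折叠进锚点
print(anchors_out.shape)  # → (3, 16)
```

与直接丢弃相比,传输方案 P 把每个被剪枝标记的信息按相似度软分配给各锚点,这正是摘要所说"保留细微但有信息量的上下文"的机制示意。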
[CV-111] Continuous Exposure-Time Modeling for Realistic Atmospheric Turbulence Synthesis CVPR2026
【速读】:该论文旨在解决大气湍流对远距离成像造成的几何畸变和与曝光时间相关的模糊问题,这些问题不仅降低视觉质量,还影响高级视觉任务的性能。现有合成湍流效应的方法通常简化了模糊与曝光时间的关系,假设固定或二值化的曝光设置,导致合成数据不真实且模型泛化能力有限。解决方案的关键在于重新审视调制传递函数(Modulation Transfer Function, MTF)的建模方式,提出一种新的曝光时间依赖的MTF(Exposure-Time-dependent MTF, ET-MTF),将模糊建模为连续的曝光时间函数;并由此推导出一个无倾斜敏感性的点扩散函数(Point Spread Function, PSF),结合空间变化的模糊宽度场,实现对湍流引起的模糊进行物理上准确的建模与合成。基于此合成流程构建了ET-Turb数据集,其在不同光学和大气条件下显式地引入连续曝光时间建模,从而显著提升了训练模型在真实湍流数据上的还原效果与泛化能力。
链接: https://arxiv.org/abs/2603.01398
作者: Junwei Zeng,Dong Liang,Sheng-Jun Huang,Kun Zhan,Songcan Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026!
Abstract:Atmospheric turbulence significantly degrades long-range imaging by introducing geometric warping and exposure-time-dependent blur, which adversely affects both visual quality and the performance of high-level vision tasks. Existing methods for synthesizing turbulence effects often oversimplify the relationship between blur and exposure-time, typically assuming fixed or binary exposure settings. This leads to unrealistic synthetic data and limited generalization capability of trained models. To address this gap, we revisit the modulation transfer function (MTF) formulation and propose a novel Exposure-Time-dependent MTF (ET-MTF) that models blur as a continuous function of exposure-time. For blur synthesis, we derive a tilt-invariant point spread function (PSF) from the ET-MTF, which, when integrated with a spatially varying blur-width field, provides a comprehensive and physically accurate characterization of turbulence-induced blur. Building on this synthesis pipeline, we construct ET-Turb, a large-scale synthetic turbulence dataset that explicitly incorporates continuous exposure-time modeling across diverse optical and atmospheric conditions. The dataset comprises 5,083 videos (2,005,835 frames), partitioned into 3,988 training and 1,095 test videos. Extensive experiments demonstrate that models trained on ET-Turb produce more realistic restorations and achieve superior generalization on real-world turbulence data compared to those trained on other datasets. The dataset is publicly available at: this http URL.
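作为背景,经典 Fried 湍流 MTF 在"长曝光/短曝光"两个极限下的形式可以写成如下玩具实现(ET-MTF 的贡献在于把连续曝光时间映射到该衰减函数,具体映射本文未复现;以下参数均为示意数值):

```python
import numpy as np

def turbulence_mtf(f, wavelength, r0, D, alpha):
    """经典 Fried 湍流 MTF。alpha=1:短曝光(去除倾斜项),
    alpha=0:长曝光。ET-MTF 将曝光时间连续映射到此类衰减
    (该映射为论文贡献,此处不复现)。"""
    x = wavelength * f
    return np.exp(-3.44 * (x / r0) ** (5 / 3) * (1 - alpha * (x / D) ** (1 / 3)))

f = np.linspace(0, 2e5, 5)            # 空间频率(周期/米,玩具数值)
long_exp = turbulence_mtf(f, 0.5e-6, 0.05, 0.25, alpha=0.0)
short_exp = turbulence_mtf(f, 0.5e-6, 0.05, 0.25, alpha=1.0)
# 短曝光更锐:每个频率上的衰减都不大于长曝光
assert np.all(short_exp >= long_exp)
```

固定/二值曝光的传统合成正是在这两个端点之间二选一,而连续曝光时间建模等价于在两条曲线之间平滑插值。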
[CV-112] IMI: Training-Free Image-to-3D Multi-Instance Generation with Spatial Fidelity
【速读】:该论文旨在解决图像到三维多实例生成(Image-to-3D Multi-Instance Generation)中空间保真度(spatial fidelity)不足的问题,尤其是在现有方法依赖大规模多实例数据微调预训练模型时,存在训练开销大且难以保证空间结构准确性的局限。解决方案的关键在于提出一个无需训练(training-free)的框架TIMI,其核心创新包括:1)引入实例感知分离引导模块(Instance-aware Separation Guidance, ISG),在去噪早期阶段促进实例解缠;2)设计空间稳定几何自适应更新模块(Spatial-stabilized Geometry-adaptive Update, SGU),在保持各实例几何特征的同时稳定其相对空间关系,从而实现高保真度的多实例三维生成。
链接: https://arxiv.org/abs/2603.01371
作者: Xiao Cai,Lianli Gao,Pengpeng Zeng,Ji Zhang,Heng Tao Shen,Jingkuan Song
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Precise spatial fidelity in Image-to-3D multi-instance generation is critical for downstream real-world applications. Recent work attempts to address this by fine-tuning pre-trained Image-to-3D (I23D) models on multi-instance datasets, which incurs substantial training overhead and struggles to guarantee spatial fidelity. In fact, we observe that pre-trained I23D models already possess meaningful spatial priors, which remain underutilized as evidenced by instance entanglement issues. Motivated by this, we propose TIMI, a novel Training-free framework for Image-to-3D Multi-Instance generation that achieves high spatial fidelity. Specifically, we first introduce an Instance-aware Separation Guidance (ISG) module, which facilitates instance disentanglement during the early denoising stage. Next, to stabilize the guidance introduced by ISG, we devise a Spatial-stabilized Geometry-adaptive Update (SGU) module that promotes the preservation of the geometric characteristics of instances while maintaining their relative relationships. Extensive experiments demonstrate that our method yields better performance in terms of both global layout and distinct local instances compared to existing multi-instance methods, without requiring additional training and with faster inference speed.
[CV-113] MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention CVPR2026
【速读】:该论文旨在解决现有卷积神经网络(CNN)、Transformer 和 Mamba 基于模型在裂缝分割任务中仅能捕捉部分空间或结构信息,导致对复杂裂缝模式建模不足的问题。解决方案的关键在于提出 MixerCSeg 架构,其核心是 TransMixer 模块——通过融合 CNN 的局部纹理感知能力、Transformer 的全局依赖建模能力以及 Mamba 的序列上下文建模特性,在单一编码器内实现局部与全局信息的协同表达;同时引入空间块处理策略、方向引导边缘门控卷积(DEGConv)和空间细化多级融合(SRF)模块,在不显著增加计算复杂度的前提下提升边缘敏感性和多尺度结构保真度,从而实现高精度且高效的裂缝像素级分割。
链接: https://arxiv.org/abs/2603.01361
作者: Zilong Zhao,Zhengming Ding,Pei Niu,Wenhao Sun,Feng Guo
机构: Shandong University (山东大学); Tulane University (杜兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2026
Abstract:Feature encoders play a key role in pixel-level crack segmentation by shaping the representation of fine textures and thin structures. Existing CNN-, Transformer-, and Mamba-based models each capture only part of the required spatial or structural information, leaving clear gaps in modeling complex crack patterns. To address this, we present MixerCSeg, a mixer architecture designed like a coordinated team of specialists, where CNN-like pathways focus on local textures, Transformer-style paths capture global dependencies, and Mamba-inspired flows model sequential context within a single encoder. At the core of MixerCSeg is the TransMixer, which explores Mamba’s latent attention behavior while establishing dedicated pathways that naturally express both locality and global awareness. To further enhance structural fidelity, we introduce a spatial block processing strategy and a Direction-guided Edge Gated Convolution (DEGConv) that strengthens edge sensitivity under irregular crack geometries with minimal computational overhead. A Spatial Refinement Multi-Level Fusion (SRF) module is then employed to refine multi-scale details without increasing complexity. Extensive experiments on multiple crack segmentation benchmarks show that MixerCSeg achieves state-of-the-art performance with only 2.05 GFLOPs and 2.54 M parameters, demonstrating both efficiency and strong representational capability. The code is available at this https URL.
[CV-114] Perspective-Equivariant Fine-tuning for Multispectral Demosaicing without Ground Truth
【速读】:该论文旨在解决多光谱去马赛克(multispectral demosaicing)问题,即如何从快照式马赛克测量中重建全分辨率的光谱图像,以支持从神经外科到自动驾驶等实时成像场景。传统方法存在模糊问题,而监督学习方法则依赖于昂贵且缓慢的线扫描系统获取的地面真值(ground truth, GT)。其解决方案的关键在于提出了一种名为 Perspective-Equivariant Fine-tuning for Demosaicing (PEFD) 的框架,该框架通过利用基于相机成像系统的射影几何结构,比以往方法更充分地利用群结构来恢复更多零空间信息,并通过适配为1-3通道成像设计的预训练基础模型,在无需GT的情况下实现高效学习,从而在术中和汽车数据集上显著提升细节恢复能力与光谱保真度,逼近监督方法性能。
链接: https://arxiv.org/abs/2603.01332
作者: Andrew Wang,Mike Davies
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multispectral demosaicing is crucial to reconstruct full-resolution spectral images from snapshot mosaiced measurements, enabling real-time imaging from neurosurgery to autonomous driving. Classical methods are blurry, while supervised learning requires costly ground truth (GT) obtained from slow line-scanning systems. We propose Perspective-Equivariant Fine-tuning for Demosaicing (PEFD), a framework that learns multispectral demosaicing from mosaiced measurements alone. PEFD a) exploits the projective geometry of camera-based imaging systems to leverage a richer group structure than previous demosaicing methods to recover more null-space information, and b) learns efficiently without GT by adapting pretrained foundation models designed for 1-3 channel imaging. On intraoperative and automotive datasets, PEFD recovers fine details such as blood vessels and preserves spectral fidelity, substantially outperforming recent approaches, nearing supervised performance.
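PEFD 所依托的等变成像(equivariant imaging)式无真值训练目标,可用一维玩具"马赛克"算子示意如下(变换 T 此处用循环移位代替论文中的射影/透视群变换,f、A 均为假设的玩具算子,仅演示两项损失的构造):

```python
import numpy as np

def ei_losses(f, A, T, y):
    """测量一致性 + 等变性两项惩罚(等变成像配方的示意;
    PEFD 实际利用射影/透视群,此处 T 可为任意可逆变换)。"""
    x1 = f(y)                           # 从测量重建
    mc = np.mean((A(x1) - y) ** 2)      # 重建再测量须与 y 一致
    x2 = T(x1)                          # 变换后的场景在"测量+重建"下
    eq = np.mean((f(A(x2)) - x2) ** 2)  # 也应表现一致(等变性)
    return mc, eq

# 一维玩具"马赛克":只保留偶数下标采样;f:最近邻填充
A = lambda x: x[::2]
f = lambda y: np.repeat(y, 2)
T = lambda x: np.roll(x, 1)             # 群作用的替身(循环移位)
x = np.arange(8, dtype=float)
mc, eq = ei_losses(f, A, T, A(x))
print(mc, eq)  # → 0.0 6.0
```

最近邻填充满足测量一致性(mc = 0),但在移位下不等变(eq > 0):等变性项由此在没有任何真值的情况下提供了额外的监督信号,恢复零空间信息。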
[CV-115] You Only Need One Stage: Novel-View Synthesis From A Single Blind Face Image
【速读】:该论文旨在解决从低质量单视角盲脸图像(Blind Face)中生成一致且高质量的新视角人脸图像(Novel-View Synthesis)的问题。传统方法通常依赖于两阶段流程:先恢复图像至高分辨率,再基于恢复结果进行新视角合成,但其性能高度依赖于恢复质量,易导致输出不一致或失真。本文提出了一种新颖的一阶段方法 NVB-Face,其核心在于直接从盲脸图像中提取单视角特征,并引入一个特征操纵器(feature manipulator),将这些特征映射为具备3D感知能力的多视角潜在表示(multi-view latent representations),从而利用扩散模型的强大生成能力,实现高保真、一致性的新视角人脸图像合成。
链接: https://arxiv.org/abs/2603.01328
作者: Taoyue Wang,Xiang Zhang,Xiaotian Li,Huiyuan Yang,Lijun Yin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:We propose a novel one-stage method, NVB-Face, for generating consistent Novel-View images directly from a single Blind Face image. Existing approaches to novel-view synthesis for objects or faces typically require a high-resolution RGB image as input. When dealing with degraded images, the conventional pipeline follows a two-stage process: first restoring the image to high resolution, then synthesizing novel views from the restored result. However, this approach is highly dependent on the quality of the restored image, often leading to inaccuracies and inconsistencies in the final output. To address this limitation, we extract single-view features directly from the blind face image and introduce a feature manipulator that transforms these features into 3D-aware, multi-view latent representations. Leveraging the powerful generative capacity of a diffusion model, our framework synthesizes high-quality, consistent novel-view face images. Experimental results show that our method significantly outperforms traditional two-stage approaches in both consistency and fidelity.
[CV-116] Open-Vocabulary vs Supervised Learning Methods for Post-Disaster Visual Scene Understanding
【速读】:该论文旨在解决灾后场景理解中自动化图像解析的挑战,尤其是面对杂乱环境、视觉变异性和跨事件域偏移(cross-event domain shift)时,传统监督学习方法因依赖昂贵且任务特定的标注数据而受限。其解决方案的关键在于对比评估监督学习与开放词汇(open-vocabulary)视觉模型在灾后语义分割和目标检测任务中的性能表现,揭示二者在不同数据集(如FloodNet+、RescueNet、DFire和LADD)上的适用性差异,强调当标签空间固定且标注可用时,监督学习仍是小物体识别和复杂场景边界精确定义中最可靠的方法,而开放词汇模型则凭借大规模预训练和视觉-语言表征降低了对固定标签集的依赖,更适合数据稀缺和概念模糊的灾后场景。
链接: https://arxiv.org/abs/2603.01324
作者: Anna Michailidou,Georgios Angelidis,Vasileios Argyriou,Panagiotis Sarigiannidis,Georgios Th. Papadopoulos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 2 figures
Abstract:Aerial imagery is critical for large-scale post-disaster damage assessment. Automated interpretation remains challenging due to clutter, visual variability, and strong cross-event domain shift, while supervised approaches still rely on costly, task-specific annotations with limited coverage across disaster types and regions. Recent open-vocabulary and foundation vision models offer an appealing alternative, by reducing dependence on fixed label sets and extensive task-specific annotations. Instead, they leverage large-scale pretraining and vision-language representations. These properties are particularly relevant for post-disaster domains, where visual concepts are ambiguous and data availability is constrained. In this work, we present a comparative evaluation of supervised learning and open-vocabulary vision models for post-disaster scene understanding, focusing on semantic segmentation and object detection across multiple datasets, including FloodNet+, RescueNet, DFire, and LADD. We examine performance trends, failure modes, and practical trade-offs between different learning paradigms, providing insight into their applicability for real-world disaster response. The most notable remark across all evaluated benchmarks is that supervised training remains the most reliable approach (i.e., when the label space is fixed and annotations are available), especially for small objects and fine boundary delineation in cluttered scenes.
[CV-117] AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models
【速读】:该论文旨在解决基于大模型(Large Multimodal Models, LMMs)的零样本视觉异常分割(Zero-Shot Visual Anomaly Segmentation, ZSAS)中存在的两个核心问题:一是异常概念本身具有抽象性和上下文依赖性,缺乏稳定的视觉原型;二是高层语义嵌入与像素级空间特征之间对齐薄弱,导致异常定位精度不足。解决方案的关键在于提出AG-VAS框架,其核心创新包括:引入三个可学习的语义锚点标记([SEG]、[NOR]、[ANO]),分别作为绝对语义锚和相对上下文对比锚,将抽象异常语义转化为空间上明确的视觉实体并建模类别间正常与异常模式的对比关系;同时设计Semantic-Pixel Alignment Module (SPAM)增强跨模态对齐,并通过Anchor-Guided Mask Decoder (AGMD)实现锚点条件下的掩码预测,从而提升异常定位准确性。此外,构建Anomaly-Instruct20K数据集以结构化方式组织异常知识,支持语义锚的有效学习与集成。
链接: https://arxiv.org/abs/2603.01305
作者: Zhen Qu,Xian Tao,Xiaoyi Bao,Dingrong Wang,ShiChen Qu,Zhengtao Zhang,Xingang Wang
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Casivision; Weiqiao-UCAS Science and Technology Park (韦桥-国科大科技园)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Large multimodal models (LMMs) exhibit strong task generalization capabilities, offering new opportunities for zero-shot visual anomaly segmentation (ZSAS). However, existing LMM-based segmentation approaches still face fundamental limitations: anomaly concepts are inherently abstract and context-dependent, lacking stable visual prototypes, and the weak alignment between high-level semantic embeddings and pixel-level spatial features hinders precise anomaly localization. To address these challenges, we present AG-VAS (Anchor-Guided Visual Anomaly Segmentation), a new framework that expands the LMM vocabulary with three learnable semantic anchor tokens-[SEG], [NOR], and [ANO], establishing a unified anchor-guided segmentation paradigm. Specifically, [SEG] serves as an absolute semantic anchor that translates abstract anomaly semantics into explicit, spatially grounded visual entities (e.g., holes or scratches), while [NOR] and [ANO] act as relative anchors that model the contextual contrast between normal and abnormal patterns across categories. To further enhance cross-modal alignment, we introduce a Semantic-Pixel Alignment Module (SPAM) that aligns language-level semantic embeddings with high-resolution visual features, along with an Anchor-Guided Mask Decoder (AGMD) that performs anchor-conditioned mask prediction for precise anomaly localization. In addition, we curate Anomaly-Instruct20K, a large-scale instruction dataset that organizes anomaly knowledge into structured descriptions of appearance, shape, and spatial attributes, facilitating effective learning and integration of the proposed semantic anchors. Extensive experiments on six industrial and medical benchmarks demonstrate that AG-VAS achieves consistent state-of-the-art performance in the zero-shot setting.
[CV-118] When Does RL Help Medical VLMs? Disentangling Vision SFT and RL Gains
【速读】:该论文试图解决的问题是:在医疗视觉语言模型(Medical Vision-Language Models, VLMs)中,强化学习(Reinforcement Learning, RL)后训练是否真正提升了视觉推理能力,还是仅对监督微调(Supervised Fine-Tuning, SFT)已诱导的行为进行了优化。为厘清这一问题,作者设计了一个受控实验,从视觉感知、SFT支持和RL增强三个维度进行解耦分析。解决方案的关键在于发现:RL的有效性高度依赖于模型是否具备非平凡的推理支持(即高Pass@K),其主要作用是锐化输出分布以提升准确率(Accuracy@1)和采样效率,而SFT则负责扩展推理支持空间,使RL能够发挥效用。基于此,作者提出了一种边界感知(boundary-aware)的训练策略,并通过在PMC多选题VQA小样本平衡子集上对OctoMed初始化模型进行RL后训练,实现了六个医疗VQA基准上的平均性能显著提升。
链接: https://arxiv.org/abs/2603.01301
作者: Ahmadreza Jeddi,Kimia Shaban,Negin Baghbanzadeh,Natasha Sharan,Abhishek Moturu,Elham Dolatabadi,Babak Taati
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reinforcement learning (RL) is increasingly used to post-train medical Vision-Language Models (VLMs), yet it remains unclear whether RL improves medical visual reasoning or mainly sharpens behaviors already induced by supervised fine-tuning (SFT). We present a controlled study that disentangles these effects along three axes: vision, SFT, and RL. Using MedMNIST as a multi-modality testbed, we probe visual perception by benchmarking VLM vision towers against vision-only baselines, quantify reasoning support and sampling efficiency via Accuracy@1 versus Pass@K, and evaluate when RL closes the support gap and how gains transfer across modalities. We find that RL is most effective when the model already has non-trivial support (high Pass@K): it primarily sharpens the output distribution, improving Acc@1 and sampling efficiency, while SFT expands support and makes RL effective. Based on these findings, we propose a boundary-aware recipe and instantiate it by RL post-training an OctoMed-initialized model on a small, balanced subset of PMC multiple-choice VQA, achieving strong average performance across six medical VQA benchmarks.
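文中用于度量"推理支持"的 Pass@K,通常采用如下无偏估计量(n 个采样中有 c 个正确时,随机抽取 k 个至少命中一个正确答案的概率;这是社区通行公式,论文的具体实现细节以原文为准):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Pass@K 的无偏估计:n 个生成样本中 c 个正确时,
    随机抽 k 个至少含一个正确答案的概率。"""
    if n - c < k:
        return 1.0  # 任意 k 个样本中必然包含正确答案
    return 1.0 - comb(n - c, k) / comb(n, k)

# 支持集非空(c > 0)的模型即使 Accuracy@1(约等于 c/n)很低,
# Pass@K 也可以很高——这正是 RL"锐化分布"能起作用的前提。
print(pass_at_k(100, 10, 1))            # ≈ 0.1,对应 Accuracy@1
print(pass_at_k(100, 10, 20) > 0.85)    # → True
```

Accuracy@1 与 Pass@K 的差距越大,说明模型已有的支持越没被采样效率兑现,此时 RL 后训练的收益空间也越大;反之若 Pass@K 本身很低,按论文结论应先用 SFT 扩展支持。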
[CV-119] Multi-Level Bidirectional Decoder Interaction for Uncertainty-Aware Breast Ultrasound Analysis
【速读】:该论文旨在解决乳腺超声图像中病灶分割与组织分类任务中存在的任务干扰问题以及传统多任务学习方法缺乏对实例预测难度自适应调整能力的问题。其解决方案的关键在于提出一种基于多层次解码器交互和不确定性感知自适应协调机制的多任务框架:首先,通过在解码器各层级引入任务交互模块(Task Interaction Modules),实现分割与分类任务间的双向信息流通,利用注意力加权池化和乘法调制机制增强空间重建过程中的任务协同;其次,设计不确定性代理注意力机制(Uncertainty-Proxy Attention),基于特征激活方差动态调整每层基础特征与增强特征的权重,从而实现逐样本、逐层的任务平衡,无需人工调参;此外,通过多尺度上下文融合机制捕捉不同大小病灶的形态学线索,进一步提升模型对复杂病例的适应性。
链接: https://arxiv.org/abs/2603.01295
作者: Abdullah Al Shafi,Md Kawsar Mahmud Khan Zunayed,Safin Ahmmed,Sk Imran Hossain,Engelbert Mephu Nguifo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures, 2 tables. The code is available at: this https URL
Abstract:Breast ultrasound interpretation requires simultaneous lesion segmentation and tissue classification. However, conventional multi-task learning approaches suffer from task interference and rigid coordination strategies that fail to adapt to instance-specific prediction difficulty. We propose a multi-task framework addressing these limitations through multi-level decoder interaction and uncertainty-aware adaptive coordination. Task Interaction Modules operate at all decoder levels, establishing bidirectional segmentation-classification communication during spatial reconstruction through attention weighted pooling and multiplicative modulation. Unlike prior single-level or encoder-only approaches, this multi-level design captures scale specific task synergies across semantic-to-spatial scales, producing complementary task interaction streams. Uncertainty-Proxy Attention adaptively weights base versus enhanced features at each level using feature activation variance, enabling per-level and per-sample task balancing without heuristic tuning. To support instance-adaptive prediction, multi-scale context fusion captures morphological cues across varying lesion sizes. Evaluation on multiple publicly available breast ultrasound datasets demonstrates competitive performance, including 74.5% lesion IoU and 90.6% classification accuracy on BUSI dataset. Ablation studies confirm that multi-level task interaction provides significant performance gains, validating that decoder-level bidirectional communication is more effective than conventional encoder-only parameter sharing. The code is available at: this https URL.
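"以特征激活方差为不确定性代理"的门控思想可示意如下(纯属假设性草图:论文的具体门控函数、方差统计维度与混合方向未在摘要中给出,"方差大则回退到基础特征"只是其中一种可能的设计):

```python
import numpy as np

def uncertainty_proxy_mix(base, enhanced):
    """方差作不确定性代理的逐样本门控(示意):增强特征的
    激活方差越大,混合权重越偏向基础特征。"""
    var = enhanced.var(axis=(1, 2, 3), keepdims=True)  # 每个样本一个标量
    w = var / (1.0 + var)        # 方差大 -> w 接近 1 -> 回退到 base
    return w * base + (1.0 - w) * enhanced

rng = np.random.default_rng(0)
base = rng.standard_normal((2, 8, 4, 4))                # (batch, C, H, W)
noise_scale = np.array([0.1, 3.0]).reshape(2, 1, 1, 1)  # 样本 1 噪声更大
enhanced = base + noise_scale * rng.standard_normal((2, 8, 4, 4))
out = uncertainty_proxy_mix(base, enhanced)
print(out.shape)  # → (2, 8, 4, 4)
```

由于权重按样本逐一计算,这种门控无需人工调参即可对不同难度的实例自适应地平衡两路特征,呼应摘要中"per-level and per-sample task balancing"的设计。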
[CV-120] FoSS: Modeling Long Range Dependencies and Multimodal Uncertainty in Trajectory Prediction via Fourier State Space Integration CVPR2026
【速读】:该论文旨在解决自动驾驶中轨迹预测的建模能力与计算效率难以平衡的问题,现有方法中基于注意力机制的架构存在与代理数量呈二次复杂度的缺陷,而循环模型则难以捕捉长程依赖和细粒度局部动态。解决方案的关键在于提出一种双分支框架FoSS,其核心创新包括:在频域分支中通过离散傅里叶变换(Discrete Fourier Transform, DFT)将轨迹分解为表征全局意图的幅值分量与刻画局部变化的相位分量,并引入渐进式螺旋重排序模块保持频谱顺序;结合两种选择性状态空间模块(Coarse2Fine-SSM 和 SpecEvolve-SSM)以线性时间复杂度 O(N) 精细优化频域特征;同时在时域分支设计动态选择性状态空间模块(Dynamic Selective SSM),以线性时间重构自注意力行为以保留长程时序上下文;最后通过交叉注意力融合时域与频域表示,并利用可学习查询生成多候选轨迹及加权融合头表达运动不确定性。该设计在Argoverse 1和2基准上实现SOTA精度的同时,计算量减少22.5%,参数量降低超40%。
链接: https://arxiv.org/abs/2603.01284
作者: Yizhou Huang,Gengze Jiang,Yihua Cheng,Kezhi Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Accurate trajectory prediction is vital for safe autonomous driving, yet existing approaches struggle to balance modeling power and computational efficiency. Attention-based architectures incur quadratic complexity with increasing agents, while recurrent models struggle to capture long-range dependencies and fine-grained local dynamics. Building upon this, we present FoSS, a dual-branch framework that unifies frequency-domain reasoning with linear-time sequence modeling. The frequency-domain branch performs a discrete Fourier transform to decompose trajectories into amplitude components encoding global intent and phase components capturing local variations, followed by a progressive helix reordering module that preserves spectral order; two selective state-space submodules, Coarse2Fine-SSM and SpecEvolve-SSM, refine spectral features with O(N) complexity. In parallel, a time-domain dynamic selective SSM reconstructs self-attention behavior in linear time to retain long-range temporal context. A cross-attention layer fuses temporal and spectral representations, while learnable queries generate multiple candidate trajectories, and a weighted fusion head expresses motion uncertainty. Experiments on Argoverse 1 and Argoverse 2 benchmarks demonstrate that FoSS achieves state-of-the-art accuracy while reducing computation by 22.5% and parameters by over 40%. Comprehensive ablations confirm the necessity of each component.
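频域分支"幅值编码全局意图、相位刻画局部变化"的分解步骤,可用一段玩具轨迹示意(仅演示 DFT 分解与无损重建,并非 FoSS 的实现):

```python
import numpy as np

# 玩具轨迹:32 个时间步的 (x, y) 位置
t = np.linspace(0, 1, 32, endpoint=False)
traj = np.stack([t, 0.1 * np.sin(2 * np.pi * 3 * t)], axis=-1)  # (32, 2)

# 沿时间维做实数 DFT:幅值给出占主导的运动频率(全局意图),
# 相位给出这些分量在周期中的位置(局部变化)
spec = np.fft.rfft(traj, axis=0)                  # (17, 2)
amplitude, phase = np.abs(spec), np.angle(spec)

# 分解是无损的:幅值 + 相位可以精确重建原轨迹
recon = np.fft.irfft(amplitude * np.exp(1j * phase), n=32, axis=0)
assert np.allclose(recon, traj)
print(int(amplitude[:, 1].argmax()))  # → 3,即 y 方向的主频 bin
```

由于分解本身不丢信息,FoSS 才能放心地在幅值、相位两路上分别做由粗到细的状态空间精化,再融合回时域表示。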
[CV-121] Certifiable Estimation with Factor Graphs
【速读】:该论文旨在解决机器人状态估计中两个主流方法——局部优化因子图(factor graph)推理与可验证最优性(certifiable estimation)之间的割裂问题。传统因子图方法虽具模块化和高效性,但易陷入局部最优,可靠性不足;而基于凸松弛的可验证方法虽能保证全局最优解,却因大规模计算需求及复杂实现限制了实际部署。解决方案的关键在于发现并利用Shor松弛和Burer-Monteiro因子分解在结构上的保真特性:将原二次约束优化问题(QCQP)通过这些数学变换后,其对应的提升问题仍保持原始因子图的连接结构,变量与因子仅发生代数映射关系。这一结构性不变性使得原本复杂的可验证估计可通过现有成熟、高性能的因子图求解器实现,从而无缝集成至机器人与计算机视觉领域的标准工作流中,显著降低部署门槛。
链接: https://arxiv.org/abs/2603.01267
作者: Zhexin Xu,Nikolas R. Sanderson,Hanna Jiamei Zhang,David M. Rosen
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Factor graphs provide a convenient modular modeling language that enables practitioners to design and deploy high-performance robotic state estimation systems by composing simple, reusable building blocks. However, inference in these models is typically performed using local optimization methods that can converge to suboptimal solutions, a serious reliability concern in safety-critical applications. Conversely, certifiable estimators based on convex relaxation can recover verifiably globally optimal solutions in many practical settings, but the computational cost of solving their large-scale relaxations necessitates specialized, structure-exploiting solvers that require substantial expertise to implement, significantly hampering practical deployment. In this paper, we show that these two paradigms, which have thus far been treated as independent in the literature, can be naturally synthesized into a unified framework for certifiable factor graph optimization. The key insight is that factor graph structure is preserved under Shor’s relaxation and Burer-Monteiro factorization: applying these transformations to a QCQP with an associated factor graph representation yields a lifted problem admitting a factor graph model with identical connectivity, in which variables and factors are simple one-to-one algebraic transformations of those in the original QCQP. This structural preservation enables the Riemannian Staircase methodology for certifiable estimation to be implemented using the same mature, highly-performant factor graph libraries and workflows already ubiquitously employed throughout robotics and computer vision, making certifiable estimation as straightforward to design and deploy as conventional factor graph inference. 
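摘要中"结构在 Shor 松弛与 Burer-Monteiro 因子分解下保持不变"的断言,可由标准构造直观看出(以下为教科书式写法,记号为本文所设,并非论文原文):

```latex
\begin{aligned}
\text{(QCQP)} \quad & \min_{x \in \mathbb{R}^n} \; x^{\top} C x
  && \text{s.t. } x^{\top} A_i x = b_i,\; i = 1,\dots,m \\
\text{(Shor)} \quad & \min_{X \succeq 0} \; \langle C, X \rangle
  && \text{s.t. } \langle A_i, X \rangle = b_i \\
\text{(Burer--Monteiro)} \quad & \min_{Y \in \mathbb{R}^{n \times p}} \; \langle C, Y Y^{\top} \rangle
  && \text{s.t. } \langle A_i, Y Y^{\top} \rangle = b_i
\end{aligned}
```

三种形式中,每个代价项/约束项触及的变量集合完全相同(由 C 与各 A_i 的稀疏模式决定,且保持不变),因此对应因子图的连接结构逐项保持,变量与因子只是做了一一对应的代数替换——这正是可以复用现成因子图求解器的原因。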
[CV-122] Cross-Modal Guidance for Fast Diffusion-Based Computed Tomography ICASSP
【速读】:该论文旨在解决在稀疏数据条件下(如中子计算机断层扫描,neutron CT)利用扩散模型进行高质量重建的难题,尤其针对测量成本高昂导致难以获取充足数据的情况。其解决方案的关键在于:无需重新训练扩散模型,即可通过引入一种易获取的互补模态(如X射线CT)来提供跨模态引导,从而显著提升重建质量。研究进一步分析了次优侧模态对跨模态引导效果的影响,验证了该方法在实际应用中的鲁棒性和有效性。
链接: https://arxiv.org/abs/2603.01253
作者: Timofey Efimov,Singanallur Venkatakrishnan,Maliha Hossain,Haley Duba-Sullivan,Amirkoushyar Ziabari
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2026
Abstract:Diffusion models have emerged as powerful priors for solving inverse problems in computed tomography (CT). In certain applications, such as neutron CT, it can be expensive to collect large amounts of measurements even for a single scan, leading to sparse data sets from which it is challenging to obtain high quality reconstructions even with diffusion models. One strategy to mitigate this challenge is to leverage a complementary, easily available imaging modality; however, such approaches typically require retraining the diffusion model with large datasets. In this work, we propose incorporating an additional modality without retraining the diffusion prior, enabling accelerated imaging of costly modalities. We further examine the impact of imperfect side modalities on cross-modal guidance. Our method is evaluated on sparse-view neutron computed tomography, where reconstruction quality is substantially improved by incorporating X-ray computed tomography of the same samples.
[CV-123] he MAMA-MIA Challenge: Advancing Generalizability and Fairness in Breast MRI Tumor Segmentation and Treatment Response Prediction
【速读】:该论文旨在解决当前用于乳腺磁共振成像(Breast MRI)的人工智能(Artificial Intelligence, AI)模型普遍存在的泛化能力不足和潜在的群体差异性问题,尤其是基于单中心数据训练且仅依赖整体性能指标评估所导致的公平性缺失。其解决方案的关键在于发起MAMA-MIA挑战赛,构建一个大规模多中心基准数据集(训练集来自美国多个机构共1,506例患者,测试集来自欧洲三个独立中心共574例患者),并引入统一评分框架,同时评估模型在主要任务(原发肿瘤分割与病理完全缓解预测)上的预测性能以及在年龄、绝经状态和乳腺密度等亚组中的表现一致性,从而推动开发具备跨地域、跨机构鲁棒性和公平性的生成式AI系统。
链接: https://arxiv.org/abs/2603.01250
作者: Lidia Garrucho,Smriti Joshi,Kaisar Kushibar,Richard Osuala,Maciej Bobowicz,Xavier Bargalló,Paulius Jaruševičius,Kai Geissler,Raphael Schäfer,Muhammad Alberb,Tony Xu,Anne Martel,Daniel Sleiman,Navchetan Awasthi,Hadeel Awwad,Joan C. Vilanova,Robert Martí,Daan Schouten,Jeong Hoon Lee,Mirabela Rusu,Eleonora Poeta,Luisa Vargas,Eliana Pastor,Maria A. Zuluaga,Jessica Kächele,Dimitrios Bounias,Alexandra Ertl,Katarzyna Gwoździewicz,Maria-Laura Cosaka,Pasant M. Abo-Elhoda,Sara W. Tantawy,Shorouq S. Sakrana,Norhan O. Shawky-Abdelfatah,Amr Muhammad Abdo-Salem,Androniki Kozana,Eugen Divjak,Gordana Ivanac,Katerina Nikiforaki,Michail E. Klontzas,Rosa García-Dosdá,Meltem Gulsun-Akpinar,Oğuz Lafcı,Carlos Martín-Isla,Oliver Díaz,Laura Igual,Karim Lekadir
机构: Barcelona Artificial Intelligence in Medicine Lab (BCN-AIM), Facultat de Matemàtiques i Informàtica, Universitat de Barcelona, Barcelona, Spain; 2nd Department of Radiology, Medical University of Gdansk, Gdansk, Poland; Department of Radiology, Hospital Clínic of Barcelona, Barcelona, Spain; Department of Radiology, Lithuanian University of Health Sciences, Kaunas, Lithuania; Fraunhofer Institute for Digital Medicine MEVIS, Germany; Department of Medical Biophysics, University of Toronto, Canada; Sunnybrook Research Institute, Toronto, Canada; University of Amsterdam, Amsterdam, The Netherlands; Amsterdam UMC, Amsterdam, The Netherlands; Indian Institute of Technology, Jodhpur, Rajasthan, India; Department of Radiology, Clínica Girona, Institute of Diagnostic Imaging (IDI), Girona, Spain; Computer Vision and Robotics Institute (ViCOROB), University of Girona, Girona, Spain; Stanford University, Stanford, CA, USA; Politecnico di Torino, Turin, Italy; EURECOM, France; German Cancer Research Center (DKFZ), Division of Medical Image Computing, and the Medical Faculty Heidelberg, Heidelberg University, Germany; German Cancer Consortium (DKTK), DKFZ, Core Center Heidelberg, Heidelberg, Germany; Centro Mamario Instituto Alexander Fleming, Buenos Aires, Argentina; Department of Diagnostic and Interventional Radiology and Molecular Imaging, Faculty of Medicine, Ain Shams University, Cairo, Egypt; Department of Radiology, University Hospital of Heraklion, Heraklion, Greece; Department of Diagnostic and Interventional Radiology, University Hospital Dubrava, and the University of Zagreb School of Medicine, Zagreb, Croatia; Computational BioMedicine Laboratory, Institute of Computer Science, Foundation for Research and Technology–Hellas, Heraklion, Greece; Department of Radiology, School of Medicine, University of Crete, Heraklion, Greece; Medical Imaging and Radiology, Universitary and Politechnic Hospital La Fe, Valencia, Spain; Department of Radiology, Hacettepe University Faculty of Medicine, Ankara, Turkey; Department of Biomedical Imaging and Image-guided Therapy, Medical University of Vienna, Vienna, Austria; Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Breast cancer is the most frequently diagnosed malignancy among women worldwide and a leading cause of cancer-related mortality. Dynamic contrast-enhanced magnetic resonance imaging plays a central role in tumor characterization and treatment monitoring, particularly in patients receiving neoadjuvant chemotherapy. However, existing artificial intelligence models for breast magnetic resonance imaging are often developed using single-center data and evaluated using aggregate performance metrics, limiting their generalizability and obscuring potential performance disparities across demographic subgroups. The MAMA-MIA Challenge was designed to address these limitations by introducing a large-scale benchmark that jointly evaluates primary tumor segmentation and prediction of pathologic complete response using pre-treatment magnetic resonance imaging only. The training cohort comprised 1,506 patients from multiple institutions in the United States, while evaluation was conducted on an external test set of 574 patients from three independent European centers to assess cross-continental and cross-institutional generalization. A unified scoring framework combined predictive performance with subgroup consistency across age, menopausal status, and breast density. Twenty-six international teams participated in the final evaluation phase. Results demonstrate substantial performance variability under external testing and reveal trade-offs between overall accuracy and subgroup fairness. The challenge provides standardized datasets, evaluation protocols, and public resources to promote the development of robust and equitable artificial intelligence systems for breast cancer imaging.
[CV-124] AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models ICLR2026
【速读】:该论文旨在解决大视觉语言模型(Large Vision-Language Models, LVLMs)在处理海量视觉标记序列时所面临的显著计算开销问题,核心在于优化视觉标记剪枝策略以提升效率与性能。其解决方案的关键在于通过深入的实证分析揭示了基于注意力机制和基于多样性两种剪枝方法的本质差异:首先发现多数多样性导向的剪枝方法实际保留的特征多样性远低于预期,且与幻觉频率显著正相关;其次指出注意力驱动的方法在视觉证据集中的简单图像上更有效,而多样性方法则更适合特征分布复杂的图像。基于此洞察,论文提出将图像感知调整引入现有混合剪枝策略,并设计了一个简洁的自适应剪枝机制,显著提升了跨标准基准和幻觉评估下的稳定性和性能表现。
链接: https://arxiv.org/abs/2603.01236
作者: Changwoo Baek,Jouwon Song,Sohyeon Kim,Kyeongbo Kong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to ICLR 2026
Abstract:Large Vision-Language Models (LVLMs) have adopted visual token pruning strategies to mitigate substantial computational overhead incurred by extensive visual token sequences. While prior works primarily focus on either attention-based or diversity-based pruning methods, in-depth analysis of these approaches’ characteristics and limitations remains largely unexplored. In this work, we conduct thorough empirical analysis using effective rank (erank) as a measure of feature diversity and attention score entropy to investigate visual token processing mechanisms and analyze the strengths and weaknesses of each approach. Our analysis reveals two insights: (1) Our erank-based quantitative analysis shows that many diversity-oriented pruning methods preserve substantially less feature diversity than intended; moreover, analysis using the CHAIR dataset reveals that the diversity they do retain is closely tied to increased hallucination frequency compared to attention-based pruning. (2) We further observe that attention-based approaches are more effective on simple images where visual evidence is concentrated, while diversity-based methods better handle complex images with distributed features. Building on these empirical insights, we show that incorporating image-aware adjustments into existing hybrid pruning strategies consistently improves their performance. We also provide a minimal instantiation of our empirical findings through a simple adaptive pruning mechanism, which achieves strong and reliable performance across standard benchmarks as well as hallucination-specific evaluations. Our project page is available at this https URL.
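摘要中用有效秩(effective rank, erank)度量保留 token 的特征多样性。下面给出一个基于奇异值归一化熵的 erank 计算草图(采用 Roy & Vetterli 的经典定义,并非论文官方实现,示例数据为随机构造的假设输入):

```python
import numpy as np

def effective_rank(features: np.ndarray) -> float:
    """有效秩 erank:对奇异值归一化分布的香农熵取指数(Roy & Vetterli, 2007)。"""
    s = np.linalg.svd(features, compute_uv=False)
    s = s[s > 1e-12]                  # 去除数值上为零的奇异值
    p = s / s.sum()                   # 归一化为概率分布
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
diverse = rng.standard_normal((64, 32))                      # 近似独立的 64 个 token
redundant = np.tile(rng.standard_normal((1, 32)), (64, 1)) \
            + 0.01 * rng.standard_normal((64, 32))           # 高度冗余的 token
print(effective_rank(diverse) > effective_rank(redundant))   # True:多样性越高 erank 越大
```

按这一思路,可以对各剪枝方法保留下来的 token 子集分别计算 erank,从而量化其实际保留的多样性是否与设计初衷一致。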
[CV-125] Towards Policy-Adaptive Image Guardrail: Benchmark and Method
【速读】:该论文旨在解决现有视觉语言模型(Vision-Language Models, VLMs)在有害图像防护(harmful image guardrail)任务中对特定安全政策过度拟合、难以泛化至未见政策的问题,尤其在政策动态演进场景下,传统基于固定类别分类器的方案需频繁重训练,而现有VLM方法虽具潜力却缺乏跨政策适应能力,甚至丧失基础指令遵循与通用知识。解决方案的关键在于:首先构建SafeEditBench评估基准,利用图像编辑模型生成符合不同安全政策的“安全-不安全”图像对,并通过人工标注实现细粒度的跨政策泛化性能评测;其次提出SafeGuard-VL方法,采用基于可验证奖励(verifiable rewards)的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)框架,在训练中引入政策锚定的奖励机制,而非仅依赖固定政策下的监督微调(Supervised Fine-Tuning, SFT),从而实现对多政策环境的鲁棒适应与持续优化。
链接: https://arxiv.org/abs/2603.01228
作者: Caiyong Piao,Zhiyuan Yan,Haoming Xu,Yunzhen Zhao,Kaiqing Lin,Feiyang Xu,Shuigeng Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate rejection of sensitive or harmful visual content, i.e., harmful image guardrail, is critical in many application scenarios. This task must continuously adapt to the evolving safety policies and content across various domains and over time. However, traditional classifiers, confined to fixed categories, require frequent retraining when new policies are introduced. Vision-language models (VLMs) offer a more adaptable and generalizable foundation for dynamic safety guardrails. Despite this potential, existing VLM-based safeguarding methods are typically trained and evaluated under only a fixed safety policy. We find that these models are heavily overfitted to the seen policy, fail to generalize to unseen policies, and even lose the basic instruction-following ability and general knowledge. To address this issue, in this paper we make two key contributions. First, we benchmark the cross-policy generalization performance of existing VLMs with SafeEditBench, a new evaluation suite. SafeEditBench leverages image-editing models to convert unsafe images into safe counterparts, producing policy-aligned datasets where each safe-unsafe image pair remains visually similar except for localized regions violating specific safety rules. Human annotators then provide accurate safe/unsafe labels under five distinct policies, enabling fine-grained assessment of policy-aware generalization. Second, we introduce SafeGuard-VL, a reinforcement learning-based method with verifiable rewards (RLVR) for robust unsafe-image guardrails. Instead of relying solely on supervised fine-tuning (SFT) under fixed policies, SafeGuard-VL explicitly optimizes the model with policy-grounded rewards, promoting verifiable adaptation across evolving policies. Extensive experiments verify the effectiveness of our method for unsafe image guardrails across various policies.
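RLVR(带可验证奖励的强化学习)在守护任务中的奖励信号可以非常直接:模型在给定政策下给出 safe/unsafe 判定,与人工标注一致即得奖励。以下是这类可验证奖励函数的极简示意(字段名与哨兵值均为假设,非 SafeGuard-VL 的实际设计):

```python
def verifiable_reward(prediction: str, label: str,
                      policy_id: str, sample_policy_id: str) -> float:
    """RLVR 式可验证奖励(示意):判定须在样本对应的政策下与标注一致。
    正确判定得 1.0,错误得 0.0;政策不匹配时返回 -1.0 作哨兵(假设的设计)。"""
    if policy_id != sample_policy_id:
        return -1.0
    return 1.0 if prediction == label else 0.0

print(verifiable_reward("unsafe", "unsafe", "p1", "p1"))  # 1.0
```

与固定政策下的 SFT 不同,这种奖励可以按样本携带的政策条件化,训练信号随政策变化而变化,这正是摘要所称"政策锚定奖励"的最小形态。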
[CV-126] CoSMo3D: Open-World Promptable 3D Semantic Part Segmentation through LLM-Guided Canonical Spatial Modeling
【速读】:该论文旨在解决开放世界可提示3D语义分割中语义推理依赖输入传感器坐标系导致的脆弱性问题,即当前方法在不同姿态下难以稳定识别物体部件(如翅膀、把手、腿等)的功能性角色。其解决方案的关键在于引入一个从数据中直接学习的潜在规范参考框架(canonical reference frame),通过双分支架构实现模型内部的规范性建模:一是利用规范图锚定(canonical map anchoring)捕捉跨类别空间规律,二是通过规范框校准(canonical box calibration)消除姿态变化与对称性干扰,从而将输入姿态空间映射到稳定的规范嵌入空间,显著提升分割结果的鲁棒性和迁移能力。
链接: https://arxiv.org/abs/2603.01205
作者: Li Jin,Weikai Chen,Yujie Wang,Yingda Yin,Zeyu Hu,Runze Zhang,Keyang Luo,Shengju Qian,Xin Wang,Xueying Qin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Open-world promptable 3D semantic segmentation remains brittle as semantics are inferred in the input sensor coordinates. Yet, humans, in contrast, interpret parts via functional roles in a canonical space – wings extend laterally, handles protrude to the side, and legs support from below. Psychophysical evidence shows that we mentally rotate objects into canonical frames to reveal these roles. To fill this gap, we propose CoSMo3D, which attains canonical space perception by inducing a latent canonical reference frame learned directly from data. By construction, we create a unified canonical dataset through LLM-guided intra- and cross-category alignment, exposing canonical spatial regularities across 200 categories. By induction, we realize canonicality inside the model through a dual-branch architecture with canonical map anchoring and canonical box calibration, collapsing pose variation and symmetry into a stable canonical embedding. This shift from input pose space to canonical embedding yields far more stable and transferable part semantics. Experimental results show that CoSMo3D establishes new state of the art in open-world promptable 3D segmentation.
[CV-127] VisNec: Measuring and Leveraging Visual Necessity for Multimodal Instruction Tuning
【速读】:该论文旨在解决多模态指令微调(multimodal instruction tuning)中因训练样本存在视觉冗余(visually redundant)和模态错位监督(multimodally misaligned supervision)而导致模型学习效率低下与性能下降的问题。其核心解决方案是提出VisNec(Visual Necessity Score),一种基于边际贡献的判别性数据选择框架,通过对比有无视觉上下文时的预测损失,量化每个样本对视觉信息的依赖程度,从而识别出真正需要视觉推理的“视觉必要”样本、冗余样本及错位样本。为保持任务多样性,VisNec进一步结合语义聚类,在每个簇内选取高必要性样本进行训练,显著提升训练效率与模型鲁棒性——实验证明,仅用LLaVA-665K数据集中15%的VisNec筛选样本即可达到全量数据100.2%的性能,且在Vision-Flan-186K上甚至超越全量训练15.8%。
链接: https://arxiv.org/abs/2603.01195
作者: Mingkang Dong,Hongyi Cai,Jie Li,Sifan Zhou,Bin Ren,Kunyu Peng,Yuqian Fu
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages, 4 figures
Abstract:The effectiveness of multimodal instruction tuning depends not only on dataset scale, but critically on whether training samples genuinely require visual reasoning. However, existing instruction datasets often contain a substantial portion of visually redundant samples (solvable from text alone), as well as multimodally misaligned supervision that can degrade learning. To address this, we propose VisNec (Visual Necessity Score), a principled data selection framework that measures the marginal contribution of visual input during instruction tuning. By comparing predictive loss with and without visual context, VisNec identifies whether a training instance is vision-critical, redundant, or misaligned. To preserve task diversity, we combine VisNec with semantic clustering and select high-necessity samples within each cluster. Across 10 downstream benchmarks, training on only 15% of the LLaVA-665K dataset selected by VisNec achieves 100.2% of full-data performance. On the smaller Vision-Flan-186K dataset, our selection not only further reduces data size but also surpasses full-data training by 15.8%. These results demonstrate that measuring and leveraging visual necessity provides an effective solution for both efficient and robust multimodal instruction tuning. Codes and selected subsets will be released upon acceptance.
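VisNec 分数的核心是对比"有/无视觉上下文"两种条件下的预测损失,并在语义簇内筛选高必要性样本。以下是该打分与簇内选择流程的最小逻辑草图(阈值、比例与示例数据均为假设,非论文实现):

```python
def visual_necessity(loss_text_only: float, loss_with_image: float) -> float:
    """VisNec 式打分:提供图像后损失下降越多,样本越"视觉必要"。
    明显为正 -> 视觉关键;接近 0 -> 视觉冗余;为负 -> 可能是模态错位监督。"""
    return loss_text_only - loss_with_image

def select_by_necessity(scores, clusters, keep_frac=0.15):
    """在每个语义簇内保留必要性最高的一部分样本,以兼顾任务多样性。"""
    keep = []
    for c in sorted(set(clusters)):
        idx = [i for i, cl in enumerate(clusters) if cl == c]
        idx.sort(key=lambda i: scores[i], reverse=True)
        k = max(1, int(round(keep_frac * len(idx))))
        keep.extend(idx[:k])
    return sorted(keep)

scores = [visual_necessity(a, b)
          for a, b in [(3.0, 0.5), (1.2, 1.1), (2.0, 0.4), (0.9, 1.5)]]
print(select_by_necessity(scores, clusters=[0, 0, 1, 1], keep_frac=0.5))  # [0, 2]
```

簇内 Top-k 而非全局 Top-k 的选择方式,避免了高必要性样本集中在少数任务类型上而丢失多样性。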
[CV-128] RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations CVPR2026
【速读】:该论文旨在解决从有限视角的2D观测中推断完整3D结构的挑战,即如何在仅观察到部分几何信息的情况下,准确重建可见区域并合理生成不可见区域的几何与外观。其解决方案的关键在于提出RnG(Reconstruction and Generation)模型,该模型通过一种重构引导的因果注意力机制(reconstruction-guided causal attention mechanism),在注意力层面分离重建与生成任务,并将KV缓存(KV-cache)视为隐式3D表示。这一设计使得任意姿态能够高效查询该缓存以生成高保真度的新视角RGBD输出,从而实现对可见与不可见区域的统一建模和高质量渲染。
链接: https://arxiv.org/abs/2603.01194
作者: Mochu Xiang,Zhelun Shen,Xuesong Li,Jiahui Ren,Jing Zhang,Chen Zhao,Shanshan Liu,Haocheng Feng,Jingdong Wang,Yuchao Dai
机构: Northwestern Polytechnical University (西北工业大学); Baidu Inc. (百度公司); Australian National University (澳大利亚国立大学); CSIRO (澳大利亚联邦科学与工业研究组织)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Humans perceive the 3D world through 2D observations from limited viewpoints. While recent feed-forward generalizable 3D reconstruction models excel at recovering 3D structures from sparse images, their representations are often confined to observed regions, leaving unseen geometry un-modeled. This raises a key, fundamental challenge: Can we infer a complete 3D structure from partial 2D observations? We present RnG (Reconstruction and Generation), a novel feed-forward Transformer that unifies these two tasks by predicting an implicit, complete 3D representation. At the core of RnG, we propose a reconstruction-guided causal attention mechanism that separates reconstruction and generation at the attention level, and treats the KV-cache as an implicit 3D representation. Then, arbitrary poses can efficiently query this cache to render high-fidelity, novel-view RGBD outputs. As a result, RnG not only accurately reconstructs visible geometry but also generates plausible, coherent unseen geometry and appearance. Our method achieves state-of-the-art performance in both generalizable 3D reconstruction and novel view generation, while operating efficiently enough for real-time interactive applications. Project page: this https URL
[CV-129] VP-Hype: A Hybrid Mamba-Transformer Framework with Visual-Textual Prompting for Hyperspectral Image Classification
【速读】:该论文旨在解决高光谱影像(Hyperspectral Imagery, HSI)分类中因高维光谱数据与标注样本极度稀缺之间的矛盾所导致的性能瓶颈问题。其解决方案的关键在于提出了一种名为VP-Hype的新型混合架构,该架构通过将状态空间模型(State-Space Models, SSMs)的线性时间效率与Transformer的关系建模能力相结合,构建了一个融合Mamba-Transformer的骨干网络,从而在显著降低计算开销的同时有效捕捉长程依赖关系;此外,为缓解标签稀缺问题,引入了双模态视觉与文本提示(Visual and Textual Prompts),提供上下文感知的特征提取引导,最终在极低数据量下实现了卓越的分类性能。
链接: https://arxiv.org/abs/2603.01174
作者: Abdellah Zakaria Sellam,Fadi Abdeladhim Zidi,Salah Eddine Bekhouche,Ihssen Houhou,Marouane Tliba,Cosimo Distante,Abdenour Hadid
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate classification of hyperspectral imagery (HSI) is often frustrated by the tension between high-dimensional spectral data and the extreme scarcity of labeled training samples. While hierarchical models like LoLA-SpecViT have demonstrated the power of local windowed attention and parameter-efficient fine-tuning, the quadratic complexity of standard Transformers remains a barrier to scaling. We introduce VP-Hype, a framework that rethinks HSI classification by unifying the linear-time efficiency of State-Space Models (SSMs) with the relational modeling of Transformers in a novel hybrid architecture. Building on a robust 3D-CNN spectral front-end, VP-Hype replaces conventional attention blocks with a Hybrid Mamba-Transformer backbone to capture long-range dependencies with significantly reduced computational overhead. Furthermore, we address the label-scarcity problem by integrating dual-modal Visual and Textual Prompts that provide context-aware guidance for the feature extraction process. Our experimental evaluation demonstrates that VP-Hype establishes a new state of the art in low-data regimes. Specifically, with a training sample distribution of only 2%, the model achieves an Overall Accuracy (OA) of 99.69% on the Salinas dataset and 99.45% on the Longkou dataset. These results suggest that the convergence of hybrid sequence modeling and multi-modal prompting provides a robust path forward for high-performance, sample-efficient remote sensing.
[CV-130] TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization ICLR2026
【速读】:该论文旨在解决当前视频摘要技术在处理复杂视频时表现不佳的问题,其核心瓶颈在于现有方法多采用静态或模态无关的融合策略,无法有效捕捉视频中不同模态(视觉、文本、音频)在帧级别上的动态显著性差异。解决方案的关键在于提出TripleSumm架构,该架构能够在帧级别上自适应地加权和融合多模态信息,从而更精准地提取关键内容;同时,为弥补多模态视频摘要研究中缺乏全面评估基准的不足,作者构建了首个大规模多模态视频摘要基准MoSu(Most Replayed Multimodal Video Summarization),包含三种模态数据,支持更可靠的性能验证。实验表明,TripleSumm在四个基准上均达到最先进水平,显著优于现有方法。
链接: https://arxiv.org/abs/2603.01169
作者: Sumin Kim,Hyemin Jeong,Mingu Kang,Yejin Kim,Yoori Oh,Joonseok Lee
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published as a Conference Paper at ICLR 2026
Abstract:The exponential growth of video content necessitates effective video summarization to efficiently extract key information from long videos. However, current approaches struggle to fully comprehend complex videos, primarily because they employ static or modality-agnostic fusion strategies. These methods fail to account for the dynamic, frame-dependent variations in modality saliency inherent in video data. To overcome these limitations, we propose TripleSumm, a novel architecture that adaptively weights and fuses the contributions of visual, text, and audio modalities at the frame level. Furthermore, a significant bottleneck for research into multimodal video summarization has been the lack of comprehensive benchmarks. Addressing this bottleneck, we introduce MoSu (Most Replayed Multimodal Video Summarization), the first large-scale benchmark that provides all three modalities. Extensive experiments demonstrate that TripleSumm achieves state-of-the-art performance, outperforming existing methods by a significant margin on four benchmarks, including MoSu. Our code and dataset are available at this https URL.
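TripleSumm 的关键是对视觉/文本/音频三种模态做帧级自适应加权融合。下面用 NumPy 草绘一个逐帧门控 softmax 融合的最小形态(gate_w 代表假设的可学习门控参数,并非论文的具体结构):

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_fuse(visual, text, audio, gate_w):
    """逐帧计算三模态门控得分,softmax 归一化后加权融合。
    visual/text/audio: (T, d);gate_w: (3, d),为假设的可学习参数。"""
    feats = np.stack([visual, text, audio], axis=1)      # (T, 3, d)
    scores = np.einsum('tmd,md->tm', feats, gate_w)      # (T, 3) 每帧每模态得分
    weights = softmax(scores, axis=1)                    # 每帧三个权重之和为 1
    fused = (weights[..., None] * feats).sum(axis=1)     # (T, d) 融合后的帧特征
    return fused, weights

rng = np.random.default_rng(0)
T, d = 5, 8
fused, w = adaptive_fuse(rng.standard_normal((T, d)), rng.standard_normal((T, d)),
                         rng.standard_normal((T, d)), rng.standard_normal((3, d)))
print(fused.shape, bool(np.allclose(w.sum(axis=1), 1.0)))  # (5, 8) True
```

与静态融合(全视频共用一组权重)相比,这里的权重随帧变化,对应摘要所说的"帧级模态显著性差异"。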
[CV-131] FREE-Edit: Using Editing-aware Injection in Rectified Flow Models for Zero-shot Image-Driven Video Editing
【速读】:该论文旨在解决图像驱动的视频编辑(image-driven video editing)中因注意力注入强度控制不当导致的语义冲突与源视频信息保留不足的问题。现有方法通常通过预训练的图像到视频(image-to-video, I2V)模型将源视频反演为噪声,并利用编辑后的首帧引导采样过程,但直接在去噪阶段注入注意力机制易引发两个问题:过度注入造成源视频与编辑内容语义冲突,而注入不足则无法有效保留源视频的运动和布局特征。解决方案的关键在于提出一种编辑感知的注意力注入方法(Editing-awaRE, REE),其核心是根据首帧像素差异生成编辑掩码,并结合光流跟踪该掩码在整个视频中的变化,从而动态调节每个token的注入强度——在编辑区域内不进行注意力注入,确保编辑内容主导,同时保留非编辑区域的源视频结构信息。基于此机制,作者进一步构建了无需微调的零样本视频编辑框架FREE-Edit,依托新兴的修正流(rectified-Flow)模型实现高质量视频编辑。
链接: https://arxiv.org/abs/2603.01164
作者: Maomao Li,Yunfei Liu,Yu Li
机构: The University of Hong Kong (香港大学); International Digital Economy Academy (IDEA) (国际数字经济发展研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages
Abstract:Image-driven video editing aims to propagate edit contents from the modified first frame to the remaining frames. The existing methods usually invert the source video to noise using a pre-trained image-to-video (I2V) model and then guide the sampling process using the edited first frame. Generally, a popular choice for maintaining motion and layout from the source video is intervening in the denoising process by injecting attention during reconstruction. However, such injection often leads to unsatisfactory results, where excessive injection leads to conflicting semantics from the source video while insufficient injection brings limited source representation. Recognizing this, we propose an Editing-awaRE (REE) injection method to modulate the injection intensity of each token. Specifically, we first compute the pixel difference between the source and edited first frame to form a corresponding editing mask. Next, we track the editing area throughout the entire video by using optical flow to warp the first-frame mask. Then, editing-aware feature injection intensity for each token is generated accordingly, where injection is not conducted on editing areas. Building upon REE injection, we further propose a zero-shot image-driven video editing framework with recently emerging rectified-flow models, dubbed FREE-Edit. Without fine-tuning or training, our FREE-Edit demonstrates effectiveness in various image-driven video editing scenarios, showing its capability to produce higher-quality outputs compared with existing techniques. Project page: this https URL.
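REE 注入的流程可拆为三步:由首帧像素差得到编辑掩码、用光流把掩码前向搬运到后续帧、再据此生成逐 token 的注入强度(编辑区不注入)。以下为该流程的玩具级草图(整数位移的光流为假设输入,真实实现依赖光流模型与 token 化布局):

```python
import numpy as np

def editing_mask(src_frame, edited_frame, thresh=0.1):
    """由源首帧与编辑后首帧的像素差得到二值编辑掩码。帧形状 (H, W, C)。"""
    diff = np.abs(src_frame - edited_frame).mean(axis=-1)
    return (diff > thresh).astype(np.float32)

def warp_mask(mask, flow):
    """用(假设已算好的)整数光流把首帧掩码前向搬运到目标帧。flow: (H, W, 2)。"""
    H, W = mask.shape
    out = np.zeros_like(mask)
    for y, x in zip(*np.nonzero(mask)):
        ny, nx = int(y + flow[y, x, 0]), int(x + flow[y, x, 1])
        if 0 <= ny < H and 0 <= nx < W:
            out[ny, nx] = 1.0
    return out

def injection_intensity(mask):
    """逐 token 注入强度:编辑区域为 0(不注入源视频注意力),其余为 1。"""
    return 1.0 - mask

src = np.zeros((4, 4, 3)); edited = src.copy(); edited[1, 1] = 1.0
flow = np.zeros((4, 4, 2)); flow[1, 1] = (1, 1)   # 编辑区在下一帧移动到 (2, 2)
inten = injection_intensity(warp_mask(editing_mask(src, edited), flow))
print(inten[2, 2], inten[0, 0])  # 0.0 1.0
```

这样,编辑区域由编辑内容主导,非编辑区域仍由源视频注意力维持布局与运动。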
[CV-132] BeautyGRPO: Aesthetic Alignment for Face Retouching via Dynamic Path Guidance and Fine-Grained Preference Modeling CVPR2026
【速读】:该论文旨在解决人脸修饰(face retouching)中长期存在的矛盾:监督学习方法受限于像素级标签拟合,难以捕捉人类主观审美偏好;而在线强化学习(reinforcement learning, RL)虽能更好对齐审美偏好,但其随机探索机制与高保真度需求冲突,易引入显著噪声伪影。解决方案的关键在于提出BeautyGRPO框架,其核心创新包括:构建包含五个关键修饰维度的细粒度偏好数据集FRPref-10K,并训练专用奖励模型以评估细微感知差异;引入动态路径引导(Dynamic Path Guidance, DPG)机制,通过动态计算基于锚点的常微分方程(ODE)路径并在每步采样时重规划引导轨迹,有效校正随机漂移,同时保持可控探索,从而在保证图像保真度的前提下实现更符合人类审美的修饰效果。
链接: https://arxiv.org/abs/2603.01163
作者: Jiachen Yang,Xianhui Lin,Yi Dong,Zebiao Zheng,Xing Liu,Hong Gu,Yanmei Fang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Face retouching requires removing subtle imperfections while preserving unique facial identity features, in order to enhance overall aesthetic appeal. However, existing methods suffer from a fundamental trade-off. Supervised learning on labeled data is constrained to pixel-level label mimicry, failing to capture complex subjective human aesthetic preferences. Conversely, while online reinforcement learning (RL) excels at preference alignment, its stochastic exploration paradigm conflicts with the high-fidelity demands of face retouching and often introduces noticeable noise artifacts due to accumulated stochastic drift. To address these limitations, we propose BeautyGRPO, a reinforcement learning framework that aligns face retouching with human aesthetic preferences. We construct FRPref-10K, a fine-grained preference dataset covering five key retouching dimensions, and train a specialized reward model capable of evaluating subtle perceptual differences. To reconcile exploration and fidelity, we introduce Dynamic Path Guidance (DPG). DPG stabilizes the stochastic sampling trajectory by dynamically computing an anchor-based ODE path and replanning a guided trajectory at each sampling timestep, effectively correcting stochastic drift while maintaining controlled exploration. Extensive experiments show that BeautyGRPO outperforms both specialized face retouching methods and general image editing models, achieving superior texture quality, more accurate blemish removal, and overall results that better align with human aesthetic preferences.
[CV-133] GRAD-Former: Gated Robust Attention-based Differential Transformer for Change Detection
【速读】:该论文旨在解决遥感图像变化检测(Change Detection, CD)中现有深度学习方法在高分辨率(Very High-Resolution, VHR)卫星影像上难以精确分割变化区域的问题,尤其是基于Transformer的方法因二次计算复杂度和小样本下性能下降而无法充分挖掘VHR数据中的空间信息。解决方案的关键在于提出GRAD-Former框架,其核心创新是引入自适应特征相关性与精炼模块(Adaptive Feature Relevance and Refinement, AFRAR),该模块通过选择性嵌入增强(Selective Embedding Amplification, SEA)和全局-局部特征精炼(Global-Local Feature Refinement, GLFR)两个组件,利用门控机制和差异注意力机制生成多个softmax堆栈,从而有效提取关键特征并抑制冗余信息,同时保持模型轻量化。实验表明,GRAD-Former在多个挑战性数据集上均优于当前最优模型,且参数量更少,建立了新的遥感变化检测性能基准。
链接: https://arxiv.org/abs/2603.01161
作者: Durgesh Ameta,Ujjwal Mishra,Praful Hambarde,Amit Shukla
机构: IIT Mandi (印度理工学院曼迪分校); IIIT Una (印度信息技术学院乌纳分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This work has been submitted to the IEEE for possible publication
Abstract:Change detection (CD) in remote sensing aims to identify semantic differences between satellite images captured at different times. While deep learning has significantly advanced this field, existing approaches based on convolutional neural networks (CNNs), transformers and Selective State Space Models (SSMs) still struggle to precisely delineate change regions. In particular, traditional transformer-based methods suffer from quadratic computational complexity when applied to very high-resolution (VHR) satellite images and often perform poorly with limited training data, leading to under-utilization of the rich spatial information available in VHR imagery. We present GRAD-Former, a novel framework that enhances contextual understanding while maintaining efficiency through reduced model size. The proposed framework consists of a novel encoder with Adaptive Feature Relevance and Refinement (AFRAR) module, fusion and decoder blocks. AFRAR integrates global-local contextual awareness through two proposed components: the Selective Embedding Amplification (SEA) module and the Global-Local Feature Refinement (GLFR) module. SEA and GLFR leverage gating mechanisms and differential attention, respectively, which generate multiple softmax heaps to capture important features while minimizing the captured irrelevant features. Multiple experiments across three challenging CD datasets (LEVIR-CD, CDD, DSIFN-CD) demonstrate GRAD-Former’s superior performance compared to existing approaches. Notably, GRAD-Former outperforms the current state-of-the-art models across all the metrics and all the datasets while using fewer parameters. Our framework establishes a new benchmark for remote sensing change detection performance. Our code will be released at: this https URL
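摘要提到 GLFR 借助差分注意力生成多个 softmax 堆以抑制无关特征。差分注意力的通用形式(参照 Differential Transformer 的思路:两组 softmax 注意力图相减以消去共模噪声)可草绘如下,λ 取值与张量形状均为示意假设:

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """两组注意力图相减:两组共同关注到的成分(往往是无关背景)被抵消,
    突出差异显著的区域。q*, k*: (n, d);v: (n, dv);lam 为假设的混合系数。"""
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))   # 第一组 softmax 图
    a2 = softmax(q2 @ k2.T / np.sqrt(d))   # 第二组 softmax 图
    return (a1 - lam * a2) @ v

rng = np.random.default_rng(0)
q1, k1, q2, k2 = (rng.standard_normal((6, 4)) for _ in range(4))
v = rng.standard_normal((6, 8))
out = differential_attention(q1, k1, q2, k2, v)
print(out.shape)  # (6, 8)
```

当 lam=0 时退化为标准缩放点积注意力,便于与常规 Transformer 对照消融。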
[CV-134] D-REX: Differentiable Real-to-Sim-to-Real Engine for Learning Dexterous Grasping ICLR2026
【速读】:该论文旨在解决仿真与现实世界之间存在的“sim-to-real gap”问题,特别是在物理参数识别(如物体质量)和基于力感知的抓取策略学习方面的挑战。其解决方案的关键在于提出了一种“real-to-sim-to-real”引擎,利用高斯点阵(Gaussian Splat)表示构建可微分的仿真环境,从而实现从真实世界的视觉观测和机器人控制信号中自动识别物体质量,并同步优化抓取策略;同时,通过将有限的人类示范转化为模拟机器人示范,训练出鲁棒的力感知抓取策略,显著提升了数字孪生的物理保真度与实际应用性能。
链接: https://arxiv.org/abs/2603.01151
作者: Haozhe Lou,Mingtong Zhang,Haoran Geng,Hanyang Zhou,Sicheng He,Zhiyuan Gao,Siheng Zhao,Jiageng Mao,Pieter Abbeel,Jitendra Malik,Daniel Seita,Yue Wang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: ICLR 2026 Poster
Abstract:Simulation provides a cost-effective and flexible platform for data generation and policy learning to develop robotic systems. However, bridging the gap between simulation and real-world dynamics remains a significant challenge, especially in physical parameter identification. In this work, we introduce a real-to-sim-to-real engine that leverages the Gaussian Splat representations to build a differentiable engine, enabling object mass identification from real-world visual observations and robot control signals, while enabling grasping policy learning simultaneously. Through optimizing the mass of the manipulated object, our method automatically builds high-fidelity and physically plausible digital twins. Additionally, we propose a novel approach to train force-aware grasping policies from limited data by transferring feasible human demonstrations into simulated robot demonstrations. Through comprehensive experiments, we demonstrate that our engine achieves accurate and robust performance in mass identification across various object geometries and mass values. Those optimized mass values facilitate force-aware policy learning, achieving superior and high performance in object grasping, effectively reducing the sim-to-real gap.
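D-REX 的核心之一是通过可微仿真、以梯度下降从观测中辨识物体质量。下面用 a = F/m 的单自由度玩具模型演示这一思路(用解析梯度代替对完整高斯点阵仿真器的自动求导,数据与步长均为假设):

```python
import numpy as np

def identify_mass(forces, accels, m_init=1.0, lr=0.1, steps=800):
    """以 (F/m - a_obs)^2 的均方误差为目标,对质量 m 做梯度下降。"""
    m = m_init
    for _ in range(steps):
        residual = forces / m - accels                      # 仿真加速度与观测之差
        grad = 2.0 * np.mean(residual * (-forces / m**2))   # dMSE/dm 的解析形式
        m -= lr * grad
    return m

forces = np.array([1.0, 1.5, 2.0])
m_hat = identify_mass(forces, accels=forces / 2.0)   # 真实质量为 2.0
print(round(m_hat, 3))  # 2.0
```

论文中对应的"观测"是真实视觉序列与机器人控制信号,优化变量则是数字孪生中各物体的质量参数;原理与此一致,只是残差要经由可微渲染与动力学仿真计算。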
[CV-135] ConVibNet: Needle Detection during Continuous Insertion via Frequency-Inspired Features
【速读】:该论文旨在解决超声引导下穿刺针定位中因针体可见度低、图像伪影、遮挡及对比度差等问题导致的实时连续跟踪困难,从而影响临床干预成功率的问题。其解决方案的关键在于提出ConVibNet模型,该模型通过引入一种新颖的交集与差值损失函数(intersection-and-difference loss),显式建模连续帧间的运动相关性,增强对针尖运动的时序感知能力,并结合多帧时序依赖关系实现针尖位置与针体角度的连续、高精度估计,显著提升了在动态场景下的检测鲁棒性和准确性,同时保持了实时推理性能。
链接: https://arxiv.org/abs/2603.01147
作者: Jiamei Guo,Zhehao Duan,Maria Neiiendam,Dianye Huang,Nassir Navab,Zhongliang Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IPCAI
Abstract:Purpose: Ultrasound-guided needle interventions are widely used in clinical practice, but their success critically depends on accurate needle placement, which is frequently hindered by the poor and intermittent visibility of needles in ultrasound images. Existing approaches remain limited by artifacts, occlusions, and low contrast, and often fail to support real-time continuous insertion. To overcome these challenges, this study introduces a robust real-time framework for continuous needle detection. Methods: We present ConVibNet, an extension of VibNet for detecting needles with significantly reduced visibility, addressing real-time, continuous needle tracking during insertion. ConVibNet leverages temporal dependencies across successive ultrasound frames to enable continuous estimation of both needle tip position and shaft angle in dynamic scenarios. To strengthen temporal awareness of needle-tip motion, we introduce a novel intersection-and-difference loss that explicitly leverages motion correlations across consecutive frames. In addition, we curated a dedicated dataset for model development and evaluation. Results: The performance of the proposed ConVibNet model was evaluated on our dataset, demonstrating superior accuracy compared to the baseline VibNet and UNet-LSTM models. Specifically, ConVibNet achieved a tip error of 2.80±2.42 mm and an angle error of 1.69±2.00 deg. These results represent a 0.75 mm improvement in tip localization accuracy over the best-performing baseline, while preserving real-time inference capability. Conclusion: ConVibNet advances real-time needle detection in ultrasound-guided interventions by integrating temporal correlation modeling with a novel intersection-and-difference loss, thereby improving accuracy and robustness and demonstrating high potential for integration into autonomous insertion systems. 
[CV-136] TC-SSA: Token Compression via Semantic Slot Aggregation for Gigapixel Pathology Reasoning
【速读】:该论文旨在解决大视觉语言模型在计算病理学中应用时面临的计算瓶颈问题,即全切片图像(Whole Slide Images, WSI)通常包含超过10⁵个图像块(patch),导致序列长度远超标准Transformer架构的处理能力。现有方法多依赖空间采样,可能丢失诊断关键信息。其解决方案的关键在于提出TC-SSA(Token Compression via Semantic Slot Aggregation)——一种可学习的令牌压缩框架,通过语义槽聚合机制将图像块特征压缩为固定数量的语义槽;其中,门控路由模块采用稀疏Top-2路由策略分配图像块至槽位,并进行加权聚合,在严格token预算下实现全局切片覆盖,最终将视觉令牌数量减少至原始序列的1.7%,同时保留诊断相关特征,显著提升效率与诊断性能的平衡。
链接: https://arxiv.org/abs/2603.01143
作者: Zhuo Chen,Shawn Young,Lijian Xu
机构: Shenzhen University of Advanced Technology(深圳大学先进技术研究院); University of Nottingham NingBo China(诺丁汉大学宁波分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures, 2 tables
Abstract:The application of large vision-language models to computational pathology holds great promise for diagnostic assistants but faces a critical computational bottleneck: the gigapixel scale of Whole Slide Images (WSIs). A single WSI typically contains over 10⁵ patches, creating sequence lengths that exceed the constraints of standard Transformer architectures. Existing solutions often resort to spatial sampling, which risks discarding diagnostically critical evidence. To address this, we propose TC-SSA (Token Compression via Semantic Slot Aggregation), a learnable token compression framework that aggregates patch features into a fixed number of semantic slots. A gated routing module assigns patches to slots using sparse Top-2 routing, followed by weighted aggregation, enabling global slide coverage under a strict token budget. The resulting representation retains diagnostically relevant information while reducing the number of visual tokens to 1.7% of the original sequence. On the SlideBench (TCGA) benchmark, our model achieves 78.34% overall accuracy and 77.14% on the diagnosis subset, outperforming sampling-based baselines under comparable token budgets. The method also generalizes to MIL classification, reaching an AUC of 95.83% on TCGA-BRCA, 98.27% on TCGA-NSCLC and 79.80% on PANDA. These results suggest that learnable semantic aggregation provides an effective trade-off between efficiency and diagnostic performance for gigapixel pathology reasoning.
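TC-SSA 的门控路由把每个 patch 稀疏分配到亲和度最高的 2 个语义槽,再在槽内做加权聚合。以下是 Top-2 路由加聚合的最小草图(slot_proto 代表假设的可学习槽原型,非论文参数):

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def slot_aggregate(patches, slot_proto, top_k=2):
    """patches: (N, d);slot_proto: (S, d)。每个 patch 仅保留 Top-2 槽的路由权重,
    各槽再按归一化权重聚合其分到的 patch,得到固定数量的语义槽表示。"""
    logits = patches @ slot_proto.T                          # (N, S) patch-槽亲和度
    keep = np.argsort(logits, axis=1)[:, ::-1][:, :top_k]    # 每行 Top-2 槽索引
    sparse = np.full_like(logits, -np.inf)
    rows = np.arange(logits.shape[0])[:, None]
    sparse[rows, keep] = logits[rows, keep]                  # 其余槽置 -inf
    w = softmax(sparse, axis=1)                              # 每行只有 2 个非零权重
    slots = (w / (w.sum(axis=0, keepdims=True) + 1e-9)).T @ patches  # (S, d) 加权聚合
    return slots, w

rng = np.random.default_rng(0)
slots, w = slot_aggregate(rng.standard_normal((16, 4)), rng.standard_normal((3, 4)))
print(slots.shape, int((w > 0).sum(axis=1).max()))  # (3, 4) 2
```

无论 N 多大(例如 10⁵ 个 patch),输出槽数 S 固定,这正是"严格 token 预算下覆盖全片"的来源。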
[CV-137] ArtLLM: Generating Articulated Assets via 3D LLM
【速读】:该论文旨在解决现有方法在生成交互式数字环境所需关节结构化3D对象时的局限性:优化-based重建方法需逐对象进行关节拟合,效率低且仅适用于简单单关节物体;而基于检索的方法则受限于固定部件库,导致几何重复且泛化能力差。解决方案的关键在于提出ArtLLM框架,其核心是一个基于大规模关节数据集训练的3D多模态大语言模型(Multimodal Large Language Model, MLLM),能够从完整的3D点云中统一推断可变数量的部件与关节及其运动学结构,并以此条件驱动一个3D生成模型合成高保真部件几何,从而实现端到端、高质量且具备强泛化能力的关节结构化资产生成。
链接: https://arxiv.org/abs/2603.01142
作者: Penghao Wang,Siyuan Xie,Hongyu Yan,Xianghui Yang,Jingwei Huang,Chunchao Guo,Jiayuan Gu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Creating interactive digital environments for gaming, robotics, and simulation relies on articulated 3D objects whose functionality emerges from their part geometry and kinematic structure. However, existing approaches remain fundamentally limited: optimization-based reconstruction methods require slow, per-object joint fitting and typically handle only simple, single-joint objects, while retrieval-based methods assemble parts from a fixed library, leading to repetitive geometry and poor generalization. To address these challenges, we introduce ArtLLM, a novel framework for generating high-quality articulated assets directly from complete 3D meshes. At its core is a 3D multimodal large language model trained on a large-scale articulation dataset curated from both existing articulation datasets and procedurally generated objects. Unlike prior work, ArtLLM autoregressively predicts a variable number of parts and joints, inferring their kinematic structure in a unified manner from the object’s point cloud. This articulation-aware layout then conditions a 3D generative model to synthesize high-fidelity part geometries. Experiments on the PartNet-Mobility dataset show that ArtLLM significantly outperforms state-of-the-art methods in both part layout accuracy and joint prediction, while generalizing robustly to real-world objects. Finally, we demonstrate its utility in constructing digital twins, highlighting its potential for scalable robot learning.
[CV-138] Teacher-Guided Causal Interventions for Image Denoising: Orthogonal Content-Noise Disentanglement in Vision Transformers
【速读】:该论文旨在解决传统图像去噪模型在学习环境因素与噪声模式之间虚假相关性时导致的鲁棒性下降问题,尤其是在分布偏移下难以区分细微纹理与随机噪声,从而引发细节过度去除或残留噪声伪影的问题。其解决方案的关键在于引入因果干预机制,提出教师引导的因果解耦网络(TCD-Net),通过在视觉Transformer框架内对特征空间进行结构化干预,实现内容与噪声的显式解耦:首先利用环境偏差调整模块(EBA)将特征投影至去中心化的稳定子空间以消除全局环境偏置;其次采用双分支解耦头并施加正交约束,强制内容与噪声表征严格分离;最后借助Nano Banana Pro生成模型提供的因果先验,引导内容表示回归到自然图像流形,从而缓解结构模糊性。该方法显著提升了去噪质量与效率,在多个基准测试中优于主流方法,并实现单张RTX 5090 GPU上104.2 FPS的实时性能。
链接: https://arxiv.org/abs/2603.01140
作者: Kuai Jiang,Zhaoyan Ding,Guijuan Zhang,Dianjie Lu,Zhuoran Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Conventional image denoising models often inadvertently learn spurious correlations between environmental factors and noise patterns. Moreover, due to high-frequency ambiguity, they struggle to reliably distinguish subtle textures from stochastic noise, resulting in over-removed details or residual noise artifacts. We therefore revisit denoising via causal intervention, arguing that purely correlational fitting entangles intrinsic content with extrinsic noise, which directly degrades robustness under distribution shifts. Motivated by this, we propose the Teacher-Guided Causal Disentanglement Network (TCD-Net), which explicitly decomposes the generative mechanism via structured interventions on feature spaces within a Vision Transformer framework. Specifically, our method integrates three key components: (1) An Environmental Bias Adjustment (EBA) module projects features into a stable, de-centered subspace to suppress global environmental bias (de-confounding). (2) A dual-branch disentanglement head employs an orthogonality constraint to force a strict separation between content and noise representations, preventing information leakage. (3) To resolve structural ambiguity, we leverage Nano Banana Pro, Google’s reasoning-guided AI image generation model, to guide a causal prior, effectively pulling content representations back onto the natural-image manifold. Extensive experiments demonstrate that TCD-Net outperforms mainstream methods across multiple benchmarks in both fidelity and efficiency, achieving a real-time speed of 104.2 FPS on a single RTX 5090 GPU.
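TCD-Net 的双分支解耦头用正交约束强制内容与噪声表征分离。这类约束常实现为归一化特征间余弦相似度的平方惩罚,示意如下(具体范数与归一化方式为假设,非论文公式):

```python
import numpy as np

def orthogonality_loss(content, noise, eps=1e-9):
    """对每个样本的内容/噪声特征做 L2 归一化后,惩罚其余弦相似度的平方:
    损失为 0 表示两分支完全正交,趋近 1 表示接近共线(存在信息泄漏)。"""
    c = content / (np.linalg.norm(content, axis=1, keepdims=True) + eps)
    n = noise / (np.linalg.norm(noise, axis=1, keepdims=True) + eps)
    return float(((c * n).sum(axis=1) ** 2).mean())

orth = orthogonality_loss(np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]]))
same = orthogonality_loss(np.array([[1.0, 0.0]]), np.array([[2.0, 0.0]]))
print(orth, round(same, 6))  # 0.0 1.0
```

训练时该项与重建损失加权相加,把"内容"与"噪声"推入(近似)正交的子空间,从而阻止噪声信息混入内容分支。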
[CV-139] Predictive Reasoning with Augmented Anomaly Contrastive Learning for Compositional Visual Relations
[Quick Read]: This paper targets compositional visual relations (CVR) reasoning, where the diversity and complexity of the underlying rules make modelling difficult; concretely, the task is to identify the one outlier among four images that otherwise follow the same compositional rule. The key to its solution is a predict-and-verify paradigm combined with Augmented Anomaly Contrastive Learning (A²CL): A²CL first distils discriminative and generalizable features from normal samples by maximizing similarity among normal instances while minimizing similarity to anomalous outliers; a series of Predictive Anomaly Reasoning Blocks (PARBs) then use the features of three images to iteratively predict those of the fourth, and the subsequent verification stage progressively pinpoints the specific discrepancies that violate the compositional rules, enabling precise rule-based reasoning.
Link: https://arxiv.org/abs/2603.01125
Authors: Chengtai Li,Yuting He,Jianfeng Ren,Ruibin Bai,Yitian Zhao,Heng Yu,Xudong Jiang
Affiliations: University of Nottingham Ningbo China; Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by IEEE Transactions on Multimedia
Abstract:While visual reasoning for simple analogies has received significant attention, compositional visual relations (CVR) remain relatively unexplored due to their greater complexity. To solve CVR tasks, we propose Predictive Reasoning with Augmented Anomaly Contrastive Learning (PR-A²CL), i.e., to identify an outlier image given three other images that follow the same compositional rules. To address the challenge of modelling abundant compositional rules, Augmented Anomaly Contrastive Learning is designed to distil discriminative and generalizable features by maximizing similarity among normal instances while minimizing similarity between normal and anomalous outliers. More importantly, a predict-and-verify paradigm is introduced for rule-based reasoning, in which a series of Predictive Anomaly Reasoning Blocks (PARBs) iteratively leverage features from three out of the four images to predict those of the remaining one. Throughout the subsequent verification stage, the PARBs progressively pinpoint the specific discrepancies attributable to the underlying rules. Experimental results on SVRT, CVR and MC²R datasets show that PR-A²CL significantly outperforms state-of-the-art reasoning models.
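The abstract does not define the A²CL objective itself; as a minimal centroid-based sketch of a loss that pulls normal instances together while pushing anomalies away (the function name, the hinge form, and the margin value are all illustrative assumptions):

```python
import numpy as np

def anomaly_contrastive_loss(normals, anomalies, margin=1.0):
    # Pull normal embeddings toward their centroid; push anomalies
    # at least `margin` away from it via a hinge penalty.
    center = normals.mean(axis=0)
    pull = np.mean(np.sum((normals - center) ** 2, axis=1))
    dist = np.linalg.norm(anomalies - center, axis=1)
    push = np.mean(np.maximum(0.0, margin - dist) ** 2)
    return float(pull + push)

tight = np.zeros((4, 2))          # perfectly clustered normal embeddings
far = np.array([[5.0, 0.0]])      # anomaly well beyond the margin
near = np.array([[0.0, 0.0]])     # anomaly sitting on the normal centroid
```

A tight normal cluster with a distant anomaly gives zero loss; an anomaly at the centroid is penalized by the full squared margin.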
[CV-140] ClinCoT: Clinical-Aware Visual Chain-of-Thought for Medical Vision Language Models
[Quick Read]: This paper tackles factual hallucinations in medical vision-language models used for clinical decision support, which stem from insufficient grounding in localized pathological evidence. Existing alignment methods improve output correctness via response-level preference optimization but leave intermediate reasoning weakly connected to visual regions; chain-of-thought (CoT), while strengthening multimodal reasoning, remains text-centric and under-exploits clinical visual cues. The key to the proposed ClinCoT framework, a clinical-aware visual chain-of-thought, is to shift preference optimization from response-level correction to visual-driven reasoning. Its core components are: (1) automatic construction of clinically grounded preference pairs via hypothesis-driven region proposals; (2) a scoring-based margin-aware optimization strategy that combines preference rankings with score differences to refine region-level reasoning trajectories; and (3) an iterative learning scheme that dynamically regenerates preference data so alignment is maintained as the model's policy evolves during training.
Link: https://arxiv.org/abs/2603.01124
Authors: Xiwei Liu,Yulong Li,Xinlin Zhuang,Xuhui Li,Jianxu Chen,Haolin Yang,Imran Razzak,Yutong Xie
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Medical Vision-Language Models have shown promising potential in clinical decision support, yet they remain prone to factual hallucinations due to insufficient grounding in localized pathological evidence. Existing medical alignment methods primarily operate at the response level through preference optimization, improving output correctness but leaving intermediate reasoning weakly connected to visual regions. Although chain-of-thought (CoT) enhances multimodal reasoning, it remains largely text-centric, limiting effective integration of clinical visual cues. To address this gap, we propose ClinCoT, a clinical-aware visual chain-of-thought framework that transforms preference optimization from response-level correction to visual-driven reasoning. We introduce an automatic data generation pipeline that constructs clinically grounded preference pairs through reasoning with hypotheses-driven region proposals. Multiple Med-LLMs evaluators rank and assign scores to each response, and these rankings serve as supervision to train the target model. We further introduce a scoring-based margin-aware optimization strategy that incorporates both preference ranking and score difference to refine region-level reasoning trajectories. To maintain alignment as the model’s policy evolves during training, we adopt an iterative learning scheme that dynamically regenerates preference data. Extensive experiments on three medical VQA and report generation benchmarks demonstrate that ClinCoT consistently improves factual grounding and achieves superior performance compared with existing preference-based alignment methods.
[CV-141] Improved MambaBDA Framework for Robust Building Damage Assessment Across Disaster Domains
[Quick Read]: This paper addresses the reduced reliability of post-disaster building damage assessment (BDA) caused by class imbalance, background clutter, and domain shift across disaster types and geographic regions. The key to its solution is three modular enhancements to the MambaBDA model: (i) Focal Loss to mitigate class imbalance; (ii) lightweight Attention Gates to suppress irrelevant context; and (iii) a compact Alignment Module that spatially registers pre-event features to post-event content before decoding. These enhancements yield consistent gains in both in-domain and cross-dataset settings and are especially beneficial to generalization on unseen disaster types.
Link: https://arxiv.org/abs/2603.01116
Authors: Alp Eren Gençoğlu,Hazım Kemal Ekenel
Affiliations: Istanbul Technical University; New York University Abu Dhabi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Reliable post-disaster building damage assessment (BDA) from satellite imagery is hindered by severe class imbalance, background clutter, and domain shift across disaster types and geographies. In this work, we address these problems and explore ways to improve MambaBDA, the BDA network of the ChangeMamba architecture, one of the most successful BDA models. The approach enhances MambaBDA with three modular components: (i) Focal Loss to mitigate class imbalance in damage classification, (ii) lightweight Attention Gates to suppress irrelevant context, and (iii) a compact Alignment Module to spatially warp pre-event features toward post-event content before decoding. We experiment on multiple satellite imagery datasets, including xBD, Pakistan Flooding, Turkey Earthquake, and Ida Hurricane, and conduct in-domain and cross-dataset tests. The proposed modular enhancements yield consistent improvements over the baseline model, with 0.8% to 5% performance gains in-domain, and up to 27% on unseen disasters. This indicates that the proposed enhancements are especially beneficial for the generalization capability of the system.
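Of the three components, Focal Loss has a standard published form (Lin et al., 2017); a small numpy version of the binary case is sketched below (the α/γ defaults follow the original focal-loss paper and are not necessarily this work's settings):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    # Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).
    # Down-weights easy, well-classified examples so that rare
    # damage classes dominate the gradient.
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

# An easy positive (p=0.9) is down-weighted far more than a hard one (p=0.1).
easy = focal_loss(np.array([0.9]), np.array([1]))
hard = focal_loss(np.array([0.1]), np.array([1]))
```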
[CV-142] GuiDINO: Rethinking Vision Foundation Model in Medical Image Segmentation
[Quick Read]: This paper addresses the performance degradation of foundation vision models on medical image segmentation caused by domain shift, where pretrained models are misaligned with segmentation needs unless fully fine-tuned or lightly adapted. The key to the proposed GuiDINO framework is to reposition a pretrained foundation model (DINOv3) as a visual guidance generator: a lightweight TokenBook mechanism converts the model's visual features into a spatial guide mask that gates feature activations in downstream segmentation backbones, injecting foundation-model priors while preserving the inductive biases and efficiency of medical-dedicated architectures. Training uses a guide supervision objective loss, optionally augmented by a boundary-focused hinge loss for robustness on fine structures, and the framework supports parameter-efficient adaptation via LoRA, yielding consistent improvements in segmentation quality and boundary accuracy across datasets.
Link: https://arxiv.org/abs/2603.01115
Authors: Zhuonan Liang,Wei Guo,Jie Gan,Yaxuan Song,Runnan Chen,Hang Chang,Weidong Cai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 2 figures, 3 tables
Abstract:Foundation vision models are increasingly adopted in medical image analysis. Due to domain shift, these pretrained models misalign with medical image segmentation needs without being fully fine-tuned or lightly adapted. We introduce GuiDINO, a framework that repositions a native foundation model to act as a visual guidance generator for downstream segmentation. GuiDINO extracts visual feature representations from DINOv3 and converts them into a spatial guide mask via a lightweight TokenBook mechanism, which aggregates token-prototype similarities. This guide mask gates feature activations in multiple segmentation backbones, thereby injecting foundation-model priors while preserving the inductive biases and efficiency of medical dedicated architectures. Training relies on a guide supervision objective loss that aligns the guide mask to ground-truth regions, optionally augmented by a boundary-focused hinge loss to sharpen fine structures. GuiDINO also supports parameter-efficient adaptation through LoRA on the DINOv3 guide backbone. Across diverse medical datasets and nnUNet-style inference, GuiDINO consistently improves segmentation quality and boundary robustness, suggesting a practical alternative to fine-tuning and offering a new perspective on how foundation models can best serve medical vision. Code is available at this https URL
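How TokenBook aggregates token-prototype similarities is not detailed in the abstract; a toy sketch of one plausible reading, where each token's guide value is a sigmoid of its best prototype match and then gates the features (all names, the max-aggregation, and the sigmoid squashing are assumptions):

```python
import numpy as np

def guide_mask(tokens, prototypes, tau=0.1):
    # One guide value per token: sigmoid of the token's highest
    # cosine similarity to any learned prototype.
    t = tokens / np.linalg.norm(tokens, axis=-1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=-1, keepdims=True)
    best = (t @ p.T).max(axis=-1)            # (n_tokens,)
    return 1.0 / (1.0 + np.exp(-best / tau))

tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
protos = np.array([[1.0, 0.0]])              # one foreground prototype
mask = guide_mask(tokens, protos)
gated = tokens * mask[:, None]               # mask gates feature activations
```

A token aligned with a prototype receives a guide value near 1, while an unrelated token stays near the neutral 0.5.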
[CV-143] DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles CVPR2026
[Quick Read]: This paper addresses a core problem of prompt learning for adapting pre-trained vision-language models (VLMs): existing methods rely on a layer-centric view, assuming shallow layers capture general features and deep layers handle task-specific knowledge. This coarse division leads to uncontrolled interactions between learnable prompt tokens and original tokens, degrading zero-shot generalization and creating a trade-off between task adaptation and generality. The key to its solution is to challenge the layer-centric assumption with DeAR, a framework for fine-grained adaptation by decomposing attention head roles. Concretely, the authors introduce Concept Entropy, a new metric that systematically classifies deep-layer attention heads into Attribute, Generalization, and Mixed functional roles, and accordingly design specialized attribute tokens and a Role-Based Attention Mask to isolate generalization heads from task-specific knowledge flow; combined with a task-adaptive fusion strategy at inference, this achieves a better balance between task adaptation and zero-shot generalization across downstream tasks.
Link: https://arxiv.org/abs/2603.01111
Authors: Yiming Ma,Hongkun Yang,Lionel Z. Wang,Bin Chen,Weizhi Xian,Jianzhi Teng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026
Abstract:Prompt learning is a dominant paradigm for adapting pre-trained Vision-Language Models (VLMs) to downstream tasks. However, existing methods often rely on a simplistic, layer-centric view, assuming shallow layers capture general features while deep layers handle task-specific knowledge. This assumption results in uncontrolled interactions between learnable tokens and original tokens. Task-specific knowledge can degrade the model's core generalization, creating a trade-off between task adaptation and the preservation of zero-shot generalization. To address this, we challenge the layer-centric view and propose DeAR, a framework that achieves fine-grained VLM adaptation by Decomposing Attention head Roles. We posit that the functional specialization within VLMs occurs not between layers, but at the finer-grained level of individual attention heads in the deeper layers. Based on this insight, we introduce a novel metric, Concept Entropy, to systematically classify attention heads into distinct functional roles: Attribute, Generalization, and Mixed. Guided by these roles, we introduce specialized attribute tokens and a Role-Based Attention Mask mechanism to precisely control information flow, ensuring generalization heads remain isolated from task-specific knowledge. We further incorporate a Task-Adaptive Fusion Strategy for inference. Extensive experiments on fifteen datasets show that DeAR achieves a strong balance between task adaptation and generalization, outperforming previous methods across various tasks.
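The abstract names Concept Entropy but not its formula; the sketch below assumes it is the Shannon entropy of a head's attention mass over concept tokens, with role thresholds picked arbitrarily for illustration:

```python
import numpy as np

def concept_entropy(attn):
    # Shannon entropy of a head's (normalized) attention distribution:
    # low entropy = focused on few concepts, high entropy = diffuse.
    p = attn / attn.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def classify_head(attn, lo=0.5, hi=1.2):
    # Illustrative thresholds: focused -> attribute, diffuse -> generalization.
    h = concept_entropy(attn)
    if h < lo:
        return "attribute"
    if h > hi:
        return "generalization"
    return "mixed"

peaked = np.array([0.97, 0.01, 0.01, 0.01])   # attends to one concept
uniform = np.full(4, 0.25)                    # attends everywhere
```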
[CV-144] GroundedSurg: A Multi-Procedure Benchmark for Language-Conditioned Surgical Tool Segmentation
[Quick Read]: This paper addresses the inability of current surgical vision-language models to ground and functionally resolve references to specific instrument instances in clinical scenes: existing benchmarks evaluate only category-level segmentation, which does not match real surgical decision-making, where a specific instrument instance must be identified by its functional role, spatial relation, or anatomical interaction capability. The key contribution is GroundedSurg, the first language-conditioned, instance-level surgical grounding benchmark: each surgical image is paired with a natural-language reference to a single instrument and structured spatial annotations (bounding boxes and point-level anchors), spanning ophthalmic, laparoscopic, robotic, and open procedures. This enables systematic evaluation of vision-language models' ability to resolve linguistic references and perform pixel-level localization in complex multi-instrument scenes, advancing clinically grounded vision-language reasoning for intelligent surgical systems.
Link: https://arxiv.org/abs/2603.01108
Authors: Tajamul Ashraf,Abrar Ul Riyaz,Wasif Tak,Tavaheed Tariq,Sonia Yadav,Moloud Abdar,Janibul Bashir
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL
Abstract:Clinically reliable perception of surgical scenes is essential for advancing intelligent, context-aware intraoperative assistance such as instrument handoff guidance, collision avoidance, and workflow-aware robotic support. Existing surgical tool benchmarks primarily evaluate category-level segmentation, requiring models to detect all instances of predefined instrument classes. However, real-world clinical decisions often require resolving references to a specific instrument instance based on its functional role, spatial relation, or anatomical interaction capabilities not captured by current evaluation paradigms. We introduce GroundedSurg, the first language-conditioned, instance-level surgical grounding benchmark. Each instance pairs a surgical image with a natural-language description targeting a single instrument, accompanied by structured spatial grounding annotations including bounding boxes and point-level anchors. The dataset spans ophthalmic, laparoscopic, robotic, and open procedures, encompassing diverse instrument types, imaging conditions, and operative complexities. By jointly evaluating linguistic reference resolution and pixel-level localization, GroundedSurg enables a systematic and realistic evaluation of vision-language models in clinically realistic multi-instrument scenes. Extensive experiments demonstrate substantial performance gaps across modern segmentation and VLMs, highlighting the urgent need for clinically grounded vision-language reasoning in surgical AI systems. Code and data are publicly available at this https URL
[CV-145] Data-Efficient Brushstroke Generation with Diffusion Models for Oil Painting
[Quick Read]: This paper addresses the limited expressiveness and controllability of generative AI when modelling visual primitives (such as brushstrokes or textures) from scarce data, which restricts their use in process-aware content creation. The key to its solution is StrokeDiff, a diffusion-based framework with Smooth Regularization (SmR) that injects stochastic visual priors during training, stabilizing learning under sparse supervision (n=470 hand-drawn samples) without altering inference. Controllability of the generated strokes is provided by a Bézier-based conditioning module, and the learned primitives are integrated into a complete stroke-based painting pipeline covering prediction, generation, ordering, and compositing, enabling efficient and structurally coherent multimedia content creation.
Link: https://arxiv.org/abs/2603.01103
Authors: Dantong Qin,Alessandro Bozzon,Xian Yang,Xun Zhang,Yike Guo,Pan Wang
Affiliations: Delft University of Technology; The University of Manchester; The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Many creative multimedia systems are built upon visual primitives such as strokes or textures, which are difficult to collect at scale and fundamentally different from natural image data. This data scarcity makes it challenging for modern generative models to learn expressive and controllable primitives, limiting their use in process-aware content creation. In this work, we study the problem of learning human-like brushstroke generation from a small set of hand-drawn samples (n=470) and propose StrokeDiff, a diffusion-based framework with Smooth Regularization (SmR). SmR injects stochastic visual priors during training, providing a simple mechanism to stabilize diffusion models under sparse supervision without altering the inference process. We further show how the learned primitives can be made controllable through a Bézier-based conditioning module and integrated into a complete stroke-based painting pipeline, including prediction, generation, ordering, and compositing. This demonstrates how data-efficient primitive modeling can support expressive and structured multimedia content creation. Experiments indicate that the proposed approach produces diverse and structurally coherent brushstrokes and enables paintings with richer texture and layering, validated by both automatic metrics and human evaluation.
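The Bézier-based conditioning module consumes stroke shapes as parametric curves; a standard cubic Bézier sampler is shown below as context (the function name, sampling density, and control points are illustrative, not the paper's interface):

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=50):
    # Sample n points on a cubic Bézier curve:
    # B(t) = (1-t)^3 p0 + 3(1-t)^2 t p1 + 3(1-t) t^2 p2 + t^3 p3.
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# A simple arc-shaped stroke skeleton from four 2D control points.
pts = cubic_bezier(np.array([0.0, 0.0]), np.array([0.0, 1.0]),
                   np.array([1.0, 1.0]), np.array([1.0, 0.0]))
```

The curve interpolates its first and last control points, which is what makes Bézier handles a convenient conditioning signal for stroke endpoints.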
[CV-146] HeroGS: Hierarchical Guidance for Robust 3D Gaussian Splatting under Sparse Views
[Quick Read]: This paper addresses the irregular Gaussian distributions that 3D Gaussian Splatting (3DGS) produces under sparse-view conditions due to insufficient supervision, manifested as globally sparse coverage, blurred backgrounds, and distorted high-frequency regions. The key to the proposed HeroGS framework is hierarchical guidance that constrains and optimizes the Gaussian distributions at multiple levels: at the image level, sparse supervision is converted into pseudo-dense guidance that globally regularizes the Gaussians; at the feature level, Feature-Adaptive Densification and Pruning (FADP) leverages low-level features to refine high-frequency details and adaptively densify background regions; at the parameter level, Co-Pruned Geometry Consistency (CPG) removes inconsistent Gaussians through parameter freezing and co-pruning, improving structural fidelity and rendering quality.
Link: https://arxiv.org/abs/2603.01099
Authors: Jiashu Li,Xumeng Han,Zhaoyang Wei,Zipeng Wang,Kuiran Wang,Guorong Li,Zhenjun Han,Jianbin Jiao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:3D Gaussian Splatting (3DGS) has recently emerged as a promising approach in novel view synthesis, combining photorealistic rendering with real-time efficiency. However, its success heavily relies on dense camera coverage; under sparse-view conditions, insufficient supervision leads to irregular Gaussian distributions, characterized by globally sparse coverage, blurred background, and distorted high-frequency areas. To address this, we propose HeroGS, Hierarchical Guidance for Robust 3D Gaussian Splatting, a unified framework that establishes hierarchical guidance across the image, feature, and parameter levels. At the image level, sparse supervision is converted into pseudo-dense guidance, globally regularizing the Gaussian distributions and forming a consistent foundation for subsequent optimization. Building upon this, Feature-Adaptive Densification and Pruning (FADP) at the feature level leverages low-level features to refine high-frequency details and adaptively densifies Gaussians in background regions. The optimized distributions then support Co-Pruned Geometry Consistency (CPG) at parameter level, which guides geometric consistency through parameter freezing and co-pruning, effectively removing inconsistent splats. The hierarchical guidance strategy effectively constrains and optimizes the overall Gaussian distributions, thereby enhancing both structural fidelity and rendering quality. Extensive experiments demonstrate that HeroGS achieves high-fidelity reconstructions and consistently surpasses state-of-the-art baselines under sparse-view conditions.
[CV-147] Differential privacy representation geometry for medical image analysis
[Quick Read]: This paper addresses the unclear mechanism by which differential privacy (DP) degrades model performance in medical imaging: conventional evaluations consider only end-to-end performance and ignore how privacy perturbations reshape the structure of representation space. The key to the proposed Differential Privacy Representation Geometry for Medical Imaging (DP-RGMI) framework is to interpret DP as a structured transformation of representation space and to decompose performance degradation into encoder geometry (representation displacement and spectral effective dimension) and task-head utilization (the gap between linear-probe and end-to-end performance). Systematic analysis across multiple datasets and pretrained initializations shows that DP mainly alters representation anisotropy rather than uniformly collapsing features, and that the geometric quantities capture additional prior- and dataset-conditioned variation, providing a reproducible tool for diagnosing privacy-induced failure modes and informing privacy model selection.
Link: https://arxiv.org/abs/2603.01098
Authors: Soroosh Tayebi Arasteh,Marziyeh Mohammadi,Sven Nebelung,Daniel Truhn
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:The effect of differential privacy (DP) in medical imaging is typically evaluated only through end-to-end performance, leaving the mechanism of privacy-induced utility loss unclear. We introduce Differential Privacy Representation Geometry for Medical Imaging (DP-RGMI), a framework that interprets DP as a structured transformation of representation space and decomposes performance degradation into encoder geometry and task-head utilization. Geometry is quantified by representation displacement from initialization and spectral effective dimension, while utilization is measured as the gap between linear-probe and end-to-end utility. Across over 594,000 images from four chest X-ray datasets and multiple pretrained initializations, we show that DP is consistently associated with a utilization gap even when linear separability is largely preserved. At the same time, displacement and spectral dimension exhibit non-monotonic, initialization- and dataset-dependent reshaping, indicating that DP alters representation anisotropy rather than uniformly collapsing features. Correlation analysis reveals that the association between end-to-end performance and utilization is robust across datasets but can vary by initialization, while geometric quantities capture additional prior- and dataset-conditioned variation. These findings position DP-RGMI as a reproducible framework for diagnosing privacy-induced failure modes and informing privacy model selection.
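Spectral effective dimension is a standard notion; one common definition is the participation ratio of the feature-covariance eigenvalues, sketched below (the paper's exact estimator may differ):

```python
import numpy as np

def spectral_effective_dimension(feats):
    # Participation ratio (sum λ)^2 / (sum λ^2) of the covariance
    # eigenvalues: ~d for isotropic d-dim features, ~1 for rank-1 ones.
    x = feats - feats.mean(axis=0, keepdims=True)
    cov = x.T @ x / (len(x) - 1)
    lam = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    return float(lam.sum() ** 2 / (lam ** 2).sum())

iso = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])   # uses both axes
flat = np.array([[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0], [-2.0, -2.0]])  # rank-1
```

On these toy features the isotropic cloud scores 2 (both directions used) and the rank-1 cloud scores 1, matching the anisotropy interpretation in the abstract.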
[CV-148] Can Vision Language Models Assess Graphic Design Aesthetics? A Benchmark Evaluation and Dataset Perspective ICLR2026
[Quick Read]: This paper addresses the insufficiently understood ability of vision-language models (VLMs) to assess the aesthetic quality of graphic design, with prior work limited by narrow evaluation principles, coarse-grained protocols, a lack of systematic VLM comparison, and scarce training data. The key to its solution is AesEval-Bench, a comprehensive benchmark spanning four dimensions, twelve indicators, and three fully quantifiable tasks (aesthetic judgment, region selection, and precise localization), on which proprietary, open-source, and reasoning-augmented VLMs are systematically evaluated, revealing clear performance gaps in aesthetic assessment. In addition, a large-scale training dataset is constructed via human-guided VLM labeling, and indicator-grounded reasoning ties abstract aesthetic indicators to concrete design elements, establishing the first systematic framework for aesthetic quality assessment in graphic design.
Link: https://arxiv.org/abs/2603.01083
Authors: Arctanx An,Shizhao Sun,Danqing Huang,Mingxi Cheng,Yan Gao,Ji Li,Yu Qiao,Jiang Bian
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2026
Abstract:Assessing the aesthetic quality of graphic design is central to visual communication, yet remains underexplored in vision language models (VLMs). We investigate whether VLMs can evaluate design aesthetics in ways comparable to humans. Prior work faces three key limitations: benchmarks restricted to narrow principles and coarse evaluation protocols, a lack of systematic VLM comparisons, and limited training data for model improvement. In this work, we introduce AesEval-Bench, a comprehensive benchmark spanning four dimensions, twelve indicators, and three fully quantifiable tasks: aesthetic judgment, region selection, and precise localization. Then, we systematically evaluate proprietary, open-source, and reasoning-augmented VLMs, revealing clear performance gaps against the nuanced demands of aesthetic assessment. Moreover, we construct a training dataset to fine-tune VLMs for this domain, leveraging human-guided VLM labeling to produce task labels at scale and indicator-grounded reasoning to tie abstract indicators to concrete design elements. Overall, our work establishes the first systematic framework for aesthetic quality assessment in graphic design. Our code and dataset will be released at: this https URL
[CV-149] Adaptive Augmentation-Aware Latent Learning for Robust LiDAR Semantic Segmentation ICLR2026
[Quick Read]: This paper addresses the severe performance degradation of LiDAR point cloud semantic segmentation networks under adverse weather, where weather interference introduces large distribution shifts; existing augmentation-based methods struggle to exploit augmentations fully due to the trade-off between minor and aggressive augmentations. The key to the proposed A3Point framework is twofold: (1) semantic confusion prior (SCP) latent learning, which captures the model's inherent semantic confusion information; and (2) semantic shift region (SSR) localization, which decouples semantic confusion from semantic shift and enables adaptive optimization strategies for different disturbance levels, effectively exploiting diverse augmentations while mitigating the semantic shift they introduce.
Link: https://arxiv.org/abs/2603.01074
Authors: Wangkai Li,Zhaoyang Li,Yuwen Pan,Rui Sun,Yujia Chen,Tianzhu Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by International Conference on Learning Representations (ICLR 2026)
Abstract:Adverse weather conditions significantly degrade the performance of LiDAR point cloud semantic segmentation networks by introducing large distribution shifts. Existing augmentation-based methods attempt to enhance robustness by simulating weather interference during training. However, they struggle to fully exploit the potential of augmentations due to the trade-off between minor and aggressive augmentations. To address this, we propose A3Point, an adaptive augmentation-aware latent learning framework that effectively utilizes a diverse range of augmentations while mitigating the semantic shift, which refers to the change in the semantic meaning caused by augmentations. A3Point consists of two key components: semantic confusion prior (SCP) latent learning, which captures the model’s inherent semantic confusion information, and semantic shift region (SSR) localization, which decouples semantic confusion and semantic shift, enabling adaptive optimization strategies for different disturbance levels. Extensive experiments on multiple standard generalized LiDAR segmentation benchmarks under adverse weather demonstrate the effectiveness of our method, setting new state-of-the-art results.
[CV-150] Flow Matching-enabled Test-Time Refinement for Unsupervised Cardiac MR Registration
[Quick Read]: This paper addresses the high multi-step inference cost that prevents practical use of diffusion-based unsupervised registration for cardiac cine MR. The key to the proposed FlowReg framework is flow matching in displacement-field space, which achieves strong registration in as few as two steps and supports further refinement with more steps. A warmup-reflow training strategy lets a single-step network act as a teacher guiding a student to refine from arbitrary intermediate states, removing the need for a pre-trained model, while an Initial Guess strategy feeds the model's prediction back as the next starting point, improving refinement from step two onward. Without segmentation labels and with only 0.7% extra parameters, FlowReg outperforms the state of the art on five of six tasks (+0.6% mean Dice on average, with the largest gain of +1.09% in the left ventricle) and reduces LVEF estimation error on all six tasks by 2.58 percentage points.
Link: https://arxiv.org/abs/2603.01073
Authors: Yunguan Fu,Wenjia Bai,Wen Yan,Matthew J Clarkson,Rhodri Huw Davies,Yipeng Hu
Affiliations: University College London; InstaDeep; Imperial College London; Barts Health NHS Trust
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Diffusion-based unsupervised image registration has been explored for cardiac cine MR, but expensive multi-step inference limits practical use. We propose FlowReg, a flow-matching framework in displacement field space that achieves strong registration in as few as two steps and supports further refinement with more steps. FlowReg uses warmup-reflow training: a single-step network first acts as a teacher, then a student learns to refine from arbitrary intermediate states, removing the need for a pre-trained model as in existing methods. An Initial Guess strategy feeds back the model prediction as the next starting point, improving refinement from step two onward. On ACDC and MM2 across six tasks (including cross-dataset generalization), FlowReg outperforms the state of the art on five tasks (+0.6% mean Dice score on average), with the largest gain in the left ventricle (+1.09%), and reduces LVEF estimation error on all six tasks (-2.58 percentage points), using only 0.7% extra parameters and no segmentation labels. Anonymized code is available at this https URL.
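FlowReg's few-step refinement follows the generic flow-matching recipe of integrating a learned velocity field over displacement fields; below is a toy Euler integrator with a hand-made linear velocity standing in for the trained network (everything here is illustrative, not the paper's model):

```python
import numpy as np

def integrate_flow(velocity_fn, d0, n_steps=2):
    # Euler integration of a velocity field over a displacement field:
    # few steps give a coarse field, more steps approach the exact flow.
    d, dt = d0, 1.0 / n_steps
    for _ in range(n_steps):
        d = d + dt * velocity_fn(d)
    return d

# Toy velocity pointing toward a target displacement field; for this
# linear field the exact ODE solution at t=1 is target * (1 - exp(-1)).
target = np.full((4, 4, 2), 0.5)
velocity = lambda d: target - d

two_step = integrate_flow(velocity, np.zeros((4, 4, 2)), n_steps=2)
many_step = integrate_flow(velocity, np.zeros((4, 4, 2)), n_steps=50)
true_mean = 0.5 * (1.0 - np.exp(-1.0))
err2 = abs(float(two_step.mean()) - true_mean)
err50 = abs(float(many_step.mean()) - true_mean)
```

Two steps already land in the right regime, and the error shrinks as more integration steps are spent, mirroring the coarse-then-refine behavior described above.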
[CV-151] SHIELD8-UAV: Sequential 8-bit Hardware Implementation of a Precision-Aware 1D-F-CNN for Low-Energy UAV Acoustic Detection and Temporal Tracking
[Quick Read]: This paper addresses inefficient real-time acoustic UAV detection at the edge, where hardware constraints, strict power budgets, and latency requirements limit inference. The key to the proposed SHIELD8-UAV design is a sequential 8-bit hardware implementation that combines precision-aware quantization with structured channel pruning: the former supports FP32, BF16, INT8, and FXP8 modes for flexible accuracy-energy trade-offs, while the latter reduces the flattened feature dimension from 35,072 to 8,704, cutting serialized dense-layer cycles. Layer-wise execution on a shared multi-precision datapath avoids replicated processing elements, achieving 116 ms end-to-end latency and 0.94 W power on a Pynq-Z2 FPGA, 37.8% and 49.6% lower latency than QuantMAC and LPRE respectively, and 5-9% less logic than parallel designs, showing that efficient low-power edge inference is possible without massive parallelism.
Link: https://arxiv.org/abs/2603.01069
Authors: Susmita Ghanta,Karan Nathwani,Rohit Chaurasiya
Affiliations: Unknown
Categories: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Numerical Analysis (math.NA)
Comments: Preprint of work submitted to ISVLSI 2026
Abstract:Real-time unmanned aerial vehicle (UAV) acoustic detection at the edge demands low-latency inference under strict power and hardware limits. This paper presents SHIELD8-UAV, a sequential 8-bit hardware implementation of a precision-aware 1D feature-driven CNN (1D-F-CNN) accelerator for continuous acoustic monitoring. The design performs layer-wise execution on a shared multi-precision datapath, eliminating the need for replicated processing elements. A layer-sensitivity quantisation framework supports FP32, BF16, INT8, and FXP8 modes, while structured channel pruning reduces the flattened feature dimension from 35,072 to 8,704 (75%), thereby lowering serialised dense-layer cycles. The model achieves 89.91% detection accuracy in FP32 with less than 2.5% degradation in 8-bit modes. The accelerator uses 2,268 LUTs and 0.94 W power with 116 ms end-to-end latency, achieving 37.8% and 49.6% latency reduction compared with QuantMAC and LPRE, respectively, on a Pynq-Z2 FPGA, and 5-9% lower logic usage than parallel designs. ASIC synthesis in UMC 40 nm technology shows a maximum operating frequency of 1.56 GHz, 3.29 mm² core area, and 1.65 W total power. These results demonstrate that sequential execution combined with precision-aware quantisation and serialisation-aware pruning enables practical low-energy edge inference without relying on massive parallelism.
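Precision-aware quantization to INT8 typically starts from a symmetric per-tensor scheme; a minimal numpy sketch is given below (the accelerator's actual INT8/FXP8 mapping is hardware-specific and not described in the abstract):

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor INT8: one scale maps floats into [-127, 127].
    scale = float(np.abs(x).max()) / 127.0
    if scale == 0.0:
        scale = 1.0           # degenerate all-zero tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.array([-1.0, -0.5, 0.0, 0.3, 1.0], dtype=np.float32)
q, s = quantize_int8(x)
x_hat = dequantize(q, s)
max_err = float(np.max(np.abs(x - x_hat)))   # bounded by half a scale step
is_int8 = q.dtype == np.int8
```

The round-trip error is at most half a quantization step, which is why well-scaled 8-bit modes can stay within a few percent of FP32 accuracy.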
[CV-152] LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model
[Quick Read]: This paper addresses architectural redundancy, length inflexibility, and inefficient cross-modal coupling in existing diffusion models for unified multimodal understanding and generation. The key to the proposed LLaDA-o is a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding from continuous diffusion for visual generation, while coupling the two through a shared, simple, and efficient attention backbone that reduces redundant computation for fixed conditions. A data-centric length adaptation strategy further enables flexible-length decoding in multimodal settings without architectural changes, yielding state-of-the-art performance among omni-diffusion models, including 87.04 on DPG-Bench for text-to-image generation.
Link: https://arxiv.org/abs/2603.01068
Authors: Zebin You,Xiaolu Zhang,Jun Zhou,Chongxuan Li,Ji-Rong Wen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:We present \textbfLLaDA-o, an effective and length-adaptive omni diffusion model for multimodal understanding and generation. LLaDA-o is built on a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding and continuous diffusion for visual generation, while coupling them through a shared, simple, and efficient attention backbone that reduces redundant computation for fixed conditions. Building on MoD, we further introduce a data-centric length adaptation strategy that enables flexible-length decoding in multimodal settings without architectural changes. Extensive experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks, and reaches 87.04 on DPG-Bench for text-to-image generation, supporting the effectiveness of unified omni diffusion modeling. Code is available at this https URL.
[CV-153] Unleashing VLA Potentials in Autonomous Driving via Explicit Learning from Failures
[Quick Read]: This paper addresses the performance plateau that vision-language-action (VLA) models hit during reinforcement learning (RL) optimization for autonomous driving: exploration constrained by prior supervised fine-tuning leads to persistent failures in long-tail scenarios, where zero-reward signals cannot identify the root cause of failure. The key to the proposed ELF-VLA framework is explicit failure-diagnosis feedback: the vague scalar reward is augmented with structured, interpretable reports that identify the specific failure mode (planning errors, flawed reasoning, or poor trajectory execution), and feedback-guided refinement samples are injected back into the RL training batch to provide targeted gradient updates, markedly improving robustness and decision accuracy in complex long-tail scenarios.
Link: https://arxiv.org/abs/2603.01063
Authors: Yuechen Luo,Qimao Chen,Fang Li,Shaoqing Xu,Jaxin Liu,Ziying Song,Zhi-xin Yang,Fuxi Wen
Affiliations: Tsinghua University; University of Macau; Beijing Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Vision-Language-Action (VLA) models for autonomous driving often hit a performance plateau during Reinforcement Learning (RL) optimization. This stagnation arises from exploration capabilities constrained by previous Supervised Fine-Tuning (SFT), leading to persistent failures in long-tail scenarios. In these critical situations, all explored actions yield a zero-value driving score. This information-sparse reward signals a failure, yet fails to identify its root cause – whether it is due to incorrect planning, flawed reasoning, or poor trajectory execution. To address this limitation, we propose VLA with Explicit Learning from Failures (ELF-VLA), a framework that augments RL with structured diagnostic feedback. Instead of relying on a vague scalar reward, our method produces detailed, interpretable reports that identify the specific failure mode. The VLA policy then leverages this explicit feedback to generate a Feedback-Guided Refinement. By injecting these corrected, high-reward samples back into the RL training batch, our approach provides a targeted gradient, which enables the policy to solve critical scenarios that unguided exploration cannot. Extensive experiments demonstrate that our method unlocks the latent capabilities of VLA models, achieving state-of-the-art (SOTA) performance on the public NAVSIM benchmark for overall PDMS, EPDMS score and high-level planning accuracy.
[CV-154] MM-DeepResearch: A Simple and Effective Multimodal Agent ic Search Baseline
[Quick Read]: This paper addresses three challenges in building multimodal deep research agents: the scarcity of search-intensive multimodal QA data, the lack of effective search-trajectory planning, and the prohibitive cost of training with online search APIs. The key solutions are: first, Hyper-Search, a hypergraph-based method that models and connects visual and textual nodes within and across modalities to generate multimodal QA pairs that require multiple search tools to answer; second, DR-TTS, which decomposes search-involved tasks by search-tool type, separately optimizes specialized tool experts, and then recomposes them via tree search to explore effective search trajectories; and third, an offline multi-tool search engine that enables agentic reinforcement learning without costly online search APIs. Together these designs yield MM-DeepResearch, which shows clear advantages across benchmarks.
Link: https://arxiv.org/abs/2603.01050
Authors: Huanjin Yao,Qixiang Yin,Min Yang,Ziwang Zhao,Yibo Wang,Haotian Luo,Jingyi Zhang,Jiaxing Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Technical report
Abstract:We aim to develop a multimodal research agent capable of explicit reasoning and planning, multi-tool invocation, and cross-modal information synthesis, enabling it to conduct deep research tasks. However, we observe three main challenges in developing such agents: (1) scarcity of search-intensive multimodal QA data, (2) lack of effective search trajectories, and (3) prohibitive cost of training with online search APIs. To tackle them, we first propose Hyper-Search, a hypergraph-based QA generation method that models and connects visual and textual nodes within and across modalities, enabling the generation of search-intensive multimodal QA pairs that require invoking various search tools to solve. Second, we introduce DR-TTS, which first decomposes search-involved tasks into several categories according to search tool types, and respectively optimizes specialized search tool experts for each tool. It then recomposes tool experts to jointly explore search trajectories via tree search, producing trajectories that successfully solve complex tasks using various search tools. Third, we build an offline search engine supporting multiple search tools, enabling agentic reinforcement learning without using costly online search APIs. With the three designs, we develop MM-DeepResearch, a powerful multimodal deep research agent, and extensive results show its superiority across benchmarks. Code is available at this https URL
[CV-155] From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing
【速读】:该论文旨在解决面部识别(Face Recognition)在面对呈现攻击(Presentation Attacks)时的脆弱性问题,尤其是现有多模态大模型(Multimodal Large Language Models, MLLMs)驱动的防伪(Face Anti-Spoofing, FAS)方法在跨域泛化能力上的局限性。这些问题源于当前基于MLLM的FAS方法仅能捕捉直观语义线索(如口罩轮廓),难以感知细微的视觉模式。解决方案的关键在于引入外部视觉工具以增强模型对细粒度伪造线索的探究能力,提出Tool-Augmented Reasoning FAS(TAR-FAS)框架,将FAS任务重构为带有视觉工具的思维链(Chain-of-Thought with Visual Tools, CoT-VT)范式,使模型能够从初始观察出发,自适应调用外部视觉工具进行深入分析,并通过设计工具增强的数据标注流程和工具感知训练策略(DT-GRPO),实现高效、可解释的细粒度推理与检测。
链接: https://arxiv.org/abs/2603.01038
作者: Haoyuan Zhang,Keyao Wang,Guosheng Zhang,Haixiao Yue,Zhiwen Tan,Siran Peng,Tianshuo Zhang,Xiao Tan,Kunbin Chen,Wei He,Jingdong Wang,Ajian Liu,Xiangyu Zhu,Zhen Lei
机构: SAI, UCAS; MAIS, CASIA; Baidu Inc; CAIR, HKSIS, CAS; M.U.S.T
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Keywords: Biometrics, Face Anti-Spoofing, MLLM
Abstract:Face recognition remains vulnerable to presentation attacks, calling for robust Face Anti-Spoofing (FAS) solutions. Recent MLLM-based FAS methods reformulate the binary classification task as the generation of brief textual descriptions to improve cross-domain generalization. However, their generalizability is still limited, as such descriptions mainly capture intuitive semantic cues (e.g., mask contours) while struggling to perceive fine-grained visual patterns. To address this limitation, we incorporate external visual tools into MLLMs to encourage deeper investigation of subtle spoof clues. Specifically, we propose the Tool-Augmented Reasoning FAS (TAR-FAS) framework, which reformulates the FAS task as a Chain-of-Thought with Visual Tools (CoT-VT) paradigm, allowing MLLMs to begin with intuitive observations and adaptively invoke external visual tools for fine-grained investigation. To this end, we design a tool-augmented data annotation pipeline and construct the ToolFAS-16K dataset, which contains multi-turn tool-use reasoning trajectories. Furthermore, we introduce a tool-aware FAS training pipeline, where Diverse-Tool Group Relative Policy Optimization (DT-GRPO) enables the model to autonomously learn efficient tool use. Extensive experiments under a challenging one-to-eleven cross-domain protocol demonstrate that TAR-FAS achieves SOTA performance while providing fine-grained visual investigation for trustworthy spoof detection.
[CV-156] SMR-Net: Robot Snap Detection Based on Multi-Scale Features and Self-Attention Network
【速读】:该论文旨在解决机器人自动化装配中扣件(snap)检测与定位精度不足的问题,尤其针对传统视觉方法在处理透明或低对比度扣件等复杂场景时鲁棒性差、定位误差大的缺陷。解决方案的关键在于提出一种专用传感器配合SMR-Net——一种基于自注意力机制的多尺度目标检测算法:其核心创新为引入注意力增强的多尺度特征融合架构,通过嵌入注意力机制的特征提取器强化关键扣件特征并抑制噪声;采用标准卷积与空洞卷积并行处理三组多尺度特征图以统一维度且保留分辨率;并通过自适应重加权网络动态分配融合权重,生成兼具细节信息与全局语义的精细化表征,从而显著提升复杂环境下扣件检测与定位的准确性和稳定性。
链接: https://arxiv.org/abs/2603.01036
作者: Kuanxu Hou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: snap assembly, snap detection and localization, object detection, multi-scale feature fusion, self-attention
Abstract:In robot automated assembly, snap assembly precision and efficiency directly determine overall production quality. As a core prerequisite, snap detection and localization critically affect subsequent assembly success. Traditional visual methods suffer from poor robustness and large localization errors when handling complex scenarios (e.g., transparent or low-contrast snaps), failing to meet high-precision assembly demands. To address this, this paper designs a dedicated sensor and proposes SMR-Net, a self-attention-based multi-scale object detection algorithm, to synergistically enhance detection and localization performance. SMR-Net adopts an attention-enhanced multi-scale feature fusion architecture: raw sensor data is encoded via an attention-embedded feature extractor to strengthen key snap features and suppress noise; three multi-scale feature maps are processed in parallel with standard and dilated convolutions for dimension unification while preserving resolution; an adaptive reweighting network dynamically assigns weights to fused features, generating fine representations integrating details and global semantics. Experimental results on Type A and Type B snap datasets show SMR-Net outperforms the traditional Faster R-CNN significantly: Intersection over Union (IoU) improves by 6.52% and 5.8%, and mean Average Precision (mAP) increases by 2.8% and 1.5% respectively. This fully demonstrates the method's superiority in complex snap detection and localization tasks.
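摘要中"自适应重加权网络动态分配融合权重"这一步可以用如下极简示意说明(非论文官方实现,特征尺寸与打分均为假设;真实模型中打分由可学习网络根据输入动态产生):

```python
import numpy as np

def adaptive_reweight_fuse(features, logits):
    """对已统一到同一尺寸的多尺度特征图做自适应加权融合(示意)。
    features: 若干 (C, H, W) 特征图;logits: 每个尺度的可学习打分。"""
    w = np.exp(logits - logits.max())
    w = w / w.sum()                       # softmax 得到归一化融合权重
    fused = sum(wi * f for wi, f in zip(w, features))
    return fused, w

# 三个"多尺度"特征(实际中先经标准卷积与空洞卷积统一维度、保留分辨率)
feats = [np.full((2, 4, 4), v) for v in (1.0, 2.0, 3.0)]
fused, w = adaptive_reweight_fuse(feats, np.zeros(3))
```

打分相等时三个尺度的权重各为 1/3,融合结果即逐元素平均;训练中打分网络会偏向信息量更大的尺度。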
[CV-157] Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery CVPR2026
【速读】:该论文旨在解决张量环(Tensor Ring, TR)分解在处理非网格数据(non-meshgrid data)时的局限性,即传统TR分解仅适用于固定网格上的离散数据,难以建模连续空间中的高阶结构。其核心解决方案是提出一种基于隐式神经表示(Implicit Neural Representations, INRs)的TR函数分解方法,将TR因子参数化为INRs形式,从而实现对任意连续数据的建模。关键创新在于通过频域分析揭示了TR因子的频谱结构对重建张量高频成分的限制,并设计了一种重参数化策略:每个TR因子由一个可学习的潜在张量与一个固定基底结构化组合而成,理论上改善了训练动态性;同时提出了合理的固定基初始化方案并证明模型的Lipschitz连续性,显著提升了对细粒度特征的建模能力。
链接: https://arxiv.org/abs/2603.01034
作者: Yangyang Xu,Junbo Ke,You-Wei Wen,Chao Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 22 pages, 18 figures, 12 tables. Accepted by CVPR 2026
Abstract:Tensor Ring (TR) decomposition is a powerful tool for high-order data modeling, but is inherently restricted to discrete forms defined on fixed meshgrids. In this work, we propose a TR functional decomposition for both meshgrid and non-meshgrid data, where factors are parameterized by Implicit Neural Representations (INRs). However, optimizing this continuous framework to capture fine-scale details is intrinsically difficult. Through a frequency-domain analysis, we demonstrate that the spectral structure of TR factors determines the frequency composition of the reconstructed tensor and limits the high-frequency modeling capacity. To mitigate this, we propose a reparameterized TR functional decomposition, in which each TR factor is a structured combination of a learnable latent tensor and a fixed basis. This reparameterization is theoretically shown to improve the training dynamics of TR factor learning. We further derive a principled initialization scheme for the fixed basis and prove the Lipschitz continuity of our proposed model. Extensive experiments on image inpainting, denoising, super-resolution, and point cloud recovery demonstrate that our method achieves consistently superior performance over existing approaches. Code is available at this https URL.
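张量环分解的"环状"收缩规则可用下面的 NumPy 小例子说明(仅为示意:论文中 TR 因子由 INR 参数化为连续函数以处理非网格数据,这里用普通数值核代替):

```python
import numpy as np

rng = np.random.default_rng(0)

def tr_entry(cores, idx):
    """张量环(TR)分解:T[i1,...,iN] = Trace(G1[i1] @ ... @ GN[iN])。
    cores[k] 形状为 (n_k, r_k, r_{k+1}),且环状闭合 r_{N+1} = r_1。"""
    M = cores[0][idx[0]]
    for G, i in zip(cores[1:], idx[1:]):
        M = M @ G[i]
    return np.trace(M)

# 三阶示例:各模大小 (2, 3, 2),TR 秩均取 2
cores = [rng.standard_normal((n, 2, 2)) for n in (2, 3, 2)]
full = np.einsum('iab,jbc,kca->ijk', *cores)   # 一次性重建整个张量
```

逐元素收缩与 einsum 整体重建结果相同;把每个核替换成以坐标为输入的 INR,同一收缩规则即可推广到任意连续坐标。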
[CV-158] Vision-Language Feature Alignment for Road Anomaly Segmentation
【速读】:该论文旨在解决复杂环境中自主系统对道路异常区域分割的鲁棒性问题,特别是现有方法依赖像素级统计导致在语义正常的背景区域(如天空或植被)产生高误报率,且对分布外(Out-of-distribution, OOD)实例召回率低的问题。解决方案的关键在于提出VL-Anomaly框架,其核心创新是利用预训练视觉-语言模型(Vision-Language Models, VLMs)中的语义先验信息,设计了一个基于提示学习的对齐模块,将Mask2Former的视觉特征映射到CLIP文本嵌入空间中已知类别的语义表示,从而有效抑制背景区域的虚假异常响应;同时在推理阶段引入多源融合策略,结合文本引导相似度、CLIP图像-文本相似度和检测置信度,通过互补信息提升异常预测的可靠性。
链接: https://arxiv.org/abs/2603.01029
作者: Zhuolin He,Jiacheng Tang,Jian Pu,Xiangyang Xue
机构: Fudan University (复旦大学); Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University (复旦大学脑科学与智能技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Safe autonomous systems in complex environments require robust road anomaly segmentation to identify unknown obstacles. However, existing approaches often rely on pixel-level statistics to determine whether a region appears anomalous. This reliance leads to high false-positive rates on semantically normal background regions such as sky or vegetation, and poor recall of true Out-of-distribution (OOD) instances, thereby posing safety risks for robotic perception and decision-making. To address these challenges, we propose VL-Anomaly, a vision-language anomaly segmentation framework that incorporates semantic priors from pre-trained Vision-Language Models (VLMs). Specifically, we design a prompt learning-driven alignment module that adapts Mask2Former's visual features to CLIP text embeddings of known categories, effectively suppressing spurious anomaly responses in background regions. At inference time, we further introduce a multi-source inference strategy that integrates text-guided similarity, CLIP-based image-text similarity and detector confidence, enabling more reliable anomaly prediction by leveraging complementary information sources. Extensive experiments demonstrate that VL-Anomaly achieves state-of-the-art performance on benchmark datasets including RoadAnomaly, SMIYC and this http URL. Code is released on this https URL.
[CV-159] Content-Aware Frequency Encoding for Implicit Neural Representations with Fourier-Chebyshev Features
【速读】:该论文旨在解决隐式神经表示(Implicit Neural Representations, INRs)中存在的谱偏差(spectral bias)问题,即INRs在建模高频率细节时能力受限。现有方法通常依赖固定频率基的傅里叶特征(Fourier features)来缓解此问题,但这种固定性导致多层感知机(MLP)需低效地组合所需频率,从而限制其表达能力。解决方案的关键在于提出内容感知频率编码(Content-Aware Frequency Encoding, CAFE),其通过多个并行线性层结合哈达玛积(Hadamard product)显式且高效地合成更广范围的频率基,并利用学习到的权重实现对任务相关频率的选择;进一步扩展为CAFE+,引入切比雪夫特征(Chebyshev features)作为傅里叶基的补充,增强了频率表示的稳定性和表达能力。
链接: https://arxiv.org/abs/2603.01028
作者: Junbo Ke,Yangyang Xu,You-Wei Wen,Chao Wang
机构: Hunan Normal University (湖南师范大学); Southern University of Science and Technology (南方科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 21 pages, 22 figures, 8 tables
Abstract:Implicit Neural Representations (INRs) have emerged as a powerful paradigm for various signal processing tasks, but their inherent spectral bias limits the ability to capture high-frequency details. Existing methods partially mitigate this issue by using Fourier-based features, which usually rely on fixed frequency bases. This forces multi-layer perceptrons (MLPs) to inefficiently compose the required frequencies, thereby constraining their representational capacity. To address this limitation, we propose Content-Aware Frequency Encoding (CAFE), which builds upon Fourier features through multiple parallel linear layers combined via a Hadamard product. CAFE can explicitly and efficiently synthesize a broader range of frequency bases, while the learned weights enable the selection of task-relevant frequencies. Furthermore, we extend this framework to CAFE+, which incorporates Chebyshev features as a complementary component to Fourier bases. This combination provides a stronger and more stable frequency representation. Extensive experiments across multiple benchmarks validate the effectiveness and efficiency of our approach, consistently achieving superior performance over existing methods. Our code is available at this https URL.
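"多个并行线性层经 Hadamard 积合成新频率基"的核心可由三角恒等式 cos a · cos b = ½[cos(a+b) + cos(a−b)] 说明:两组傅里叶特征的线性组合相乘,会产生基频中不存在的和频与差频成分。下面是一个极简示意(非论文官方实现,权重矩阵为人为构造的假设):

```python
import numpy as np

def fourier_features(x, freqs):
    """基础傅里叶特征:[sin(2πfx), cos(2πfx)],列顺序为先 sin 后 cos。"""
    ang = 2 * np.pi * np.outer(x, freqs)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)

def cafe_encode(x, freqs, W_list):
    """CAFE 式编码(示意):并行线性层的输出逐元素相乘(Hadamard 积),
    从固定频率基合成更宽的频率范围。"""
    phi = fourier_features(x, freqs)
    out = phi @ W_list[0]
    for W in W_list[1:]:
        out = out * (phi @ W)
    return out

x = np.linspace(0.0, 1.0, 64)
freqs = np.array([1.0, 2.0])
# 两个"选择矩阵"分别取出 cos(2πx) 与 cos(4πx) 列
e1 = np.zeros((4, 1)); e1[2, 0] = 1.0
e2 = np.zeros((4, 1)); e2[3, 0] = 1.0
prod = cafe_encode(x, freqs, [e1, e2])[:, 0]
```

此时 prod = cos(2πx)·cos(4πx) = ½[cos(6πx) + cos(2πx)]:基频只有 1 和 2,乘积却合成出了频率为 3 的成分。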
[CV-160] RaUF: Learning the Spatial Uncertainty Field of Radar
【速读】:该论文旨在解决毫米波雷达(Millimeter-wave radar)在恶劣天气下虽具优势但面临空间保真度低、方位模糊严重及杂波引发虚假回波等问题,尤其关注现有方法因忽略特征到标签映射的不确定性而导致几何推理病态,进而影响下游感知任务可靠性的问题。其解决方案的关键在于提出RaUF框架,通过建模雷达测量的物理驱动各向异性特性来构建空间不确定性场;设计各向异性概率模型以学习细粒度不确定性,从而缓解冲突的特征-标签映射;并引入双向域注意力机制(Bidirectional Domain Attention),利用空间结构与多普勒一致性之间的互补性,有效抑制虚假或多径反射,显著提升检测可靠性与可扩展性。
链接: https://arxiv.org/abs/2603.01026
作者: Shengpeng Wang,Kuangyu Wang,Wei Wang
机构: Huazhong University of Science and Technology (华中科技大学); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Millimeter-wave radar offers unique advantages in adverse weather but suffers from low spatial fidelity, severe azimuth ambiguity, and clutter-induced spurious returns. Existing methods mainly focus on improving spatial perception effectiveness via coarse-to-fine cross-modal supervision, yet often overlook the ambiguous feature-to-label mapping, which may lead to ill-posed geometric inference and pose fundamental challenges to downstream perception tasks. In this work, we propose RaUF, a spatial uncertainty field learning framework that models radar measurements through their physically grounded anisotropic properties. To resolve conflicting feature-to-label mapping, we design an anisotropic probabilistic model that learns fine-grained uncertainty. To further enhance reliability, we propose a Bidirectional Domain Attention mechanism that exploits the mutual complementarity between spatial structure and Doppler consistency, effectively suppressing spurious or multipath-induced reflections. Extensive experiments on public benchmarks and real-world datasets demonstrate that RaUF delivers highly reliable spatial detections with well-calibrated uncertainty. Moreover, downstream case studies further validate the enhanced reliability and scalability of RaUF under challenging real-world driving scenarios.
[CV-161] Implementation of Licensed Plate Detection and Noise Removal in Image Processing
【速读】:该论文旨在解决马来西亚日益增长的车辆数量所带来的交通管理与监控需求,特别是通过高效、自动化的手段实现对车辆牌照的识别与管理。其解决方案的关键在于利用图像处理技术构建车牌识别系统(Car License Plate Recognition System),该系统可集成于电子停车收费、高速公路收费、交通监控及警务执法等多个场景,并具备与其他领域技术(如生物识别、航空航天)融合的潜力,从而实现对特定问题的精准解决。
链接: https://arxiv.org/abs/2603.01016
作者: Yiquan Gao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
备注: 13 pages. This is the author’s version, accepted manuscript
Abstract:A car license plate recognition system is an image processing technology used to identify vehicles by capturing their license plates. The technology is also known as automatic number-plate recognition, automatic vehicle identification, or optical character recognition for cars. In Malaysia, where the number of vehicles on the road is increasing rapidly, this growth has created considerable demand for car license plate recognition systems. Such systems can be implemented in electronic parking payment, highway toll collection, traffic surveillance, and as police enforcement tools. Additionally, car license plate recognition technology has the potential to be combined with techniques from other fields, such as biology and aerospace, to solve specialized problems.
[CV-162] GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis CVPR2026
【速读】:该论文旨在解决生成式模型在新视角合成(Novel View Synthesis, NVS)中因依赖随机噪声到数据的转换而导致的视点间一致性不足的问题。其解决方案的关键在于提出一种数据到数据的流匹配框架(Data-to-Data Flow Matching),通过学习成对视图之间的确定性变换来增强视点一致性;进一步引入概率密度测地线流匹配(Probability Density Geodesic Flow Matching, PDG-FM),利用预训练扩散模型的概率密度度量构造测地线插值路径,约束流轨迹以对齐数据流形的高密度区域,从而提升几何一致性与合成图像的真实性。
链接: https://arxiv.org/abs/2603.01010
作者: Xuqin Wang,Tao Wu,Yanfeng Zhang,Lu Liu,Mingwei Sun,Yongliang Wang,Niclas Zeller,Daniel Cremers
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Recent advances in generative modeling have substantially enhanced novel view synthesis, yet maintaining consistency across viewpoints remains challenging. Diffusion-based models rely on stochastic noise-to-data transitions, which obscure deterministic structures and yield inconsistent view predictions. We propose a Data-to-Data Flow Matching framework that learns deterministic transformations directly between paired views, enhancing view-consistent synthesis through explicit data coupling. To further enhance geometric coherence, we introduce Probability Density Geodesic Flow Matching (PDG-FM), which constrains flow trajectories using geodesic interpolants derived from probability density metrics of pretrained diffusion models. Such alignment with high-density regions of the data manifold promotes more realistic interpolants between samples. Empirically, our method surpasses diffusion-based NVS baselines, demonstrating improved structural coherence and smoother transitions across views. These results highlight the advantages of incorporating data-dependent geometric regularization into deterministic flow matching for consistent novel view generation.
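"数据到数据流匹配"的训练目标可用下面的小例子示意(为简明起见,用直线插值代替论文中基于概率密度度量的测地线插值;oracle 速度场为假设的"完美模型",非论文实现):

```python
import numpy as np

def d2d_flow_matching_loss(v_theta, x0, x1, t):
    """数据到数据流匹配损失(示意):在配对视图 (x0, x1) 的
    插值点 x_t 上回归目标速度 x1 - x0。"""
    x_t = (1 - t)[:, None] * x0 + t[:, None] * x1
    target = x1 - x0
    pred = v_theta(x_t, t)
    return float(np.mean((pred - target) ** 2))

x0 = np.array([[0.0, 0.0], [1.0, 1.0]])   # 源视图样本
x1 = np.array([[2.0, 0.0], [1.0, 3.0]])   # 配对的目标视图样本
t = np.array([0.3, 0.7])
oracle = lambda x_t, t: x1 - x0           # 恰好输出真实速度场的模型
loss = d2d_flow_matching_loss(oracle, x0, x1, t)
```

完美模型的损失为 0;论文进一步用预训练扩散模型的密度度量把直线插值换成测地线插值,使流轨迹贴合数据流形的高密度区域。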
[CV-163] Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving CVPR2026
【速读】:该论文针对3D语义占据预测(3D semantic occupancy prediction)中存在的两大挑战展开研究:一是视图变换中的几何错位问题,源于缺乏像素级精确的深度估计;二是严重的空间类别不平衡问题,表现为语义类别在空间上呈现强各向异性。解决方案的关键在于提出一种深度与区域引导的占据预测框架(depth- and region-guided occupancy prediction framework),其中包含两个核心组件:一是深度引导的2D到3D视图变换器(D²-VFormer),利用MoGe-2提供的高质量稠密深度线索构建可靠的几何先验,实现体素特征的精准几何对齐;二是区域引导的专家变换器(R/R²-EFormer),受Mixture-of-Experts(MoE)启发,自适应地为不同空间区域分配特定专家,有效应对空间语义差异。二者协同作用,分别保障几何一致性与语义学习能力,实验表明该方法在Occ3D-nuScenes基准上使BEVDet4D基线模型的mIoU提升7.43%,全视觉设置下IoU提升3.09%。
链接: https://arxiv.org/abs/2603.01007
作者: Xubo Zhu,Haoyang Zhang,Fei He,Rui Wu,Yanhu Shan,Wen Yang,Huai Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures. Accepted at CVPR 2026
Abstract:3D semantic occupancy prediction is crucial for autonomous driving perception, offering comprehensive geometric scene understanding and semantic recognition. However, existing methods struggle with geometric misalignment in view transformation due to the lack of pixel-level accurate depth estimation, and severe spatial class imbalance where semantic categories exhibit strong spatial anisotropy. To address these challenges, we propose Dr.Occ, a depth- and region-guided occupancy prediction framework. Specifically, we introduce a depth-guided 2D-to-3D View Transformer (D²-VFormer) that effectively leverages high-quality dense depth cues from MoGe-2 to construct reliable geometric priors, thereby enabling precise geometric alignment of voxel features. Moreover, inspired by the Mixture-of-Experts (MoE) framework, we propose a region-guided Expert Transformer (R/R²-EFormer) that adaptively allocates region-specific experts to focus on different spatial regions, effectively addressing spatial semantic variations. Thus, the two components make complementary contributions: depth guidance ensures geometric alignment, while region experts enhance semantic learning. Experiments on the Occ3D-nuScenes benchmark demonstrate that Dr.Occ improves the strong baseline BEVDet4D by 7.43% mIoU and 3.09% IoU under the full vision-only setting.
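深度引导的 2D→3D 视图变换,其几何核心是像素经深度与相机内参反投影为 3D 点,再落入对应体素。下面给出一个极简示意(非论文官方实现,内参矩阵为假设数值):

```python
import numpy as np

def backproject(u, v, depth, K):
    """像素 (u, v) 与深度 depth 经内参矩阵 K 反投影到相机坐标系。
    得到的 3D 点随后可分配到对应体素,实现体素特征的几何对齐。"""
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.array([x, y, depth])

K = np.array([[100.0,   0.0, 50.0],
              [  0.0, 100.0, 50.0],
              [  0.0,   0.0,  1.0]])
p = backproject(50.0, 50.0, 4.0, K)   # 主点处的像素反投影后落在光轴上
```

深度估计越准,反投影点越贴合真实表面,这正是摘要中强调稠密深度线索(MoGe-2)重要性的原因。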
[CV-164] Let Your Image Move with Your Motion! – Implicit Multi-Object Multi-Motion Transfer
【速读】:该论文旨在解决生成式视频(I2V)中多对象、多运动模式转移的难题,现有方法通常局限于单对象场景,在多个对象需独立运动时难以实现精确控制。其核心解决方案是提出FlexiMMT框架,关键创新在于引入运动解耦掩码注意力机制(Motion Decoupled Mask Attention Mechanism),通过对象特定掩码约束注意力范围,确保运动token仅作用于对应区域;同时提出差异化掩码传播机制(Differentiated Mask Propagation Mechanism),直接从扩散模型注意力中提取对象掩码并跨帧高效传播,从而有效解耦不同对象间的运动信息,支持灵活的运动-对象映射与组合。
链接: https://arxiv.org/abs/2603.01000
作者: Yuze Li,Dong Gong,Xiao Cao,Junchao Yuan,Dongsheng Li,Lei Zhou,Yun Sing Koh,Cheng Yan,Xinyu Zhang
机构: Tianjin University (天津大学); University of New South Wales (新南威尔士大学); University of Electronic Science and Technology of China (电子科技大学); Hainan University (海南大学); University of Auckland (奥克兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 11 figures, see this https URL
Abstract:Motion transfer has emerged as a promising direction for controllable video generation, yet existing methods largely focus on single-object scenarios and struggle when multiple objects require distinct motion patterns. In this work, we present FlexiMMT, the first implicit image-to-video (I2V) motion transfer framework that explicitly enables multi-object, multi-motion transfer. Given a static multi-object image and multiple reference videos, FlexiMMT independently extracts motion representations and accurately assigns them to different objects, supporting flexible recombination and arbitrary motion-to-object mappings. To address the core challenge of cross-object motion entanglement, we introduce a Motion Decoupled Mask Attention Mechanism that uses object-specific masks to constrain attention, ensuring that motion and text tokens only influence their designated regions. We further propose a Differentiated Mask Propagation Mechanism that derives object-specific masks directly from diffusion attention and progressively propagates them across frames efficiently. Extensive experiments demonstrate that FlexiMMT achieves precise, compositional, and state-of-the-art performance in I2V-based multi-object multi-motion transfer.
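"用对象特定掩码约束注意力范围"的机制可用下面的极简示意说明(非论文官方实现;mask[i, j] = 0 表示 query i 不允许读取 key j,数值均为假设):

```python
import numpy as np

def masked_attention(Q, K, V, mask):
    """掩码注意力(示意):被掩蔽位置的 logit 置为 -inf,
    使每个运动 token 只影响其指定对象区域。"""
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    logits = np.where(mask > 0, logits, -np.inf)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))
K = rng.standard_normal((2, 4))
V = np.array([[1.0, 0.0], [0.0, 1.0]])
out = masked_attention(Q, K, V, np.eye(2))   # 对角掩码:各对象互不干扰
```

对角掩码下每个 query 只剩一个可见 key,注意力权重退化为 1,输出恰等于各自对应的 V 行,即不同对象的运动信息被完全解耦。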
[CV-165] MLRecon: Robust Markerless Freehand 3D Ultrasound Reconstruction via Coarse-to-Fine Pose Estimation
【速读】:该论文旨在解决自由手式三维超声(Freehand 3D ultrasound, 3D US)重建中长期存在的“三难困境”:基于标记的跟踪系统成本过高,内向外方法需侵入性地附加传感器,而无传感器方法则易受累积漂移影响。解决方案的关键在于提出MLRecon框架,其核心创新包括:(1)利用视觉基础模型(vision foundation models)实现仅依赖单个商用RGB-D相机的鲁棒无标记6自由度(6D)探头位姿跟踪;(2)引入视觉引导的发散检测器以自主监控跟踪完整性并触发故障恢复机制,保障扫描连续性;(3)设计双阶段位姿精化网络,显式解耦高频抖动与低频偏置,有效去噪轨迹同时保留操作者运动的运动学保真度。实验表明,该方法在复杂轨迹上平均位置误差低至0.88 mm,且重建表面均方误差小于1 mm,为资源受限临床环境中的低成本、高精度三维超声成像树立了新基准。
链接: https://arxiv.org/abs/2603.00990
作者: Yi Zhang,Puxun Tu,Kun Wang,Yulin Yan,Tao Ying,Xiaojun Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures
Abstract:Freehand 3D ultrasound (US) reconstruction promises volumetric imaging with the flexibility of standard 2D probes, yet existing tracking paradigms face a restrictive trilemma: marker-based systems demand prohibitive costs, inside-out methods require intrusive sensor attachment, and sensorless approaches suffer from severe cumulative drift. To overcome these limitations, we present MLRecon, a robust markerless 3D US reconstruction framework delivering drift-resilient 6D probe pose tracking using a single commodity RGB-D camera. Leveraging the generalization power of vision foundation models, our pipeline enables continuous markerless tracking of the probe, augmented by a vision-guided divergence detector that autonomously monitors tracking integrity and triggers failure recovery to ensure uninterrupted scanning. Crucially, we further propose a dual-stage pose refinement network that explicitly disentangles high-frequency jitter from low-frequency bias, effectively denoising the trajectory while maintaining the kinematic fidelity of operator maneuvers. Experiments demonstrate that MLRecon significantly outperforms competing sensorless and sensor-aided methods, achieving average position errors as low as 0.88 mm on complex trajectories and yielding high-quality 3D reconstructions with sub-millimeter mean surface accuracy. This establishes a new benchmark for low-cost, accessible volumetric US imaging in resource-limited clinical settings.
[CV-166] Foundation Models in Remote Sensing: Evolving from Unimodality to Multimodality
【速读】:该论文旨在解决遥感(Remote Sensing, RS)领域中因数据量和多样性急剧增长所带来的数据建模与理解能力不足的问题,从而难以有效管理和解析海量遥感数据。其解决方案的关键在于系统性地梳理和分类遥感领域的基础模型(foundation models),从单模态向多模态演进的视角出发,提供一套完整的理论框架与实践指导,帮助研究人员尤其是初学者掌握如何训练和应用这些模型于实际遥感任务中,进而推动遥感技术向更高效、智能的方向发展。
链接: https://arxiv.org/abs/2603.00988
作者: Danfeng Hong,Chenyu Li,Xuyang Li,Gustau Camps-Valls,Jocelyn Chanussot
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
备注: Accepted by IEEE GRSM
Abstract:Remote sensing (RS) techniques are increasingly crucial for deepening our understanding of the planet. As the volume and diversity of RS data continue to grow exponentially, there is an urgent need for advanced data modeling and understanding capabilities to manage and interpret these vast datasets effectively. Foundation models present significant new growth opportunities and immense potential to revolutionize the RS field. In this paper, we conduct a comprehensive technical survey on foundation models in RS, offering a brand-new perspective by exploring their evolution from unimodality to multimodality. We hope this work serves as a valuable entry point for researchers interested in both foundation models and RS and helps them launch new projects or explore new research topics in this rapidly evolving area. This survey addresses the following three key questions: What are foundation models in RS? Why are foundation models needed in RS? How can we effectively guide junior researchers in gaining a comprehensive and practical understanding of foundation models in RS applications? More specifically, we begin by outlining the background and motivation, emphasizing the importance of foundation models in RS. We then review existing foundation models in RS, systematically categorizing them into unimodal and multimodal approaches. Additionally, we provide a tutorial-like section to guide researchers, especially beginners, on how to train foundation models in RS and apply them to real-world tasks. The survey aims to equip researchers in RS with a deeper and more efficient understanding of foundation models, enabling them to get started easily and effectively apply these models across various RS applications.
[CV-167] The Texture-Shape Dilemma: Boundary-Safe Synthetic Generation for 3D Medical Transformers
【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)在医学图像分析中因依赖大量标注数据而面临的挑战,尤其是临床数据稀缺与隐私保护限制的问题。现有公式驱动的监督学习(Formula-Driven Supervised Learning, FDSL)方法虽能生成无限合成样本,但其仅使用简单几何形状和均匀强度,忽略了CT/MRI等模态中固有的组织纹理与噪声模式,导致模型难以学习真实解剖边界。论文提出的关键解决方案是引入一种受物理启发的空间解耦合成框架(Physics-inspired Spatially-Decoupled Synthesis),其核心在于对合成过程进行正交化设计:首先基于边界距离构建梯度屏蔽缓冲区以稳定结构学习,随后在目标区域注入物理驱动的频谱纹理,从而有效平衡形状表示的鲁棒性与对成像噪声的不变性,克服了高频率纹理引入引发的边界混叠(boundary aliasing)问题。
链接: https://arxiv.org/abs/2603.00985
作者: Jiaqi Tang,Weixuan Xu,Shu Zhang,Fandong Zhang,Qingchao Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision Transformers (ViTs) have revolutionized medical image analysis, yet their data-hungry nature clashes with the scarcity and privacy constraints of clinical archives. Formula-Driven Supervised Learning (FDSL) has emerged as a promising solution to this bottleneck, synthesizing infinite annotated samples from mathematical formulas without utilizing real patient data. However, existing FDSL paradigms rely on simple geometric shapes with homogeneous intensities, creating a substantial gap by neglecting tissue textures and noise patterns inherent in modalities like CT and MRI. In this paper, we identify a critical optimization conflict termed boundary aliasing: when high-frequency synthetic textures are naively added, they corrupt the image gradient signals necessary for learning structural boundaries, causing the model to fail in delineating real anatomical margins. To bridge this gap, we propose a novel Physics-inspired Spatially-Decoupled Synthesis framework. Our approach orthogonalizes the synthesis process: it first constructs a gradient-shielded buffer zone based on boundary distance to ensure stable shape learning, and subsequently injects physics-driven spectral textures into the object core. This design effectively reconciles robust shape representation learning with invariance to acquisition noise. Extensive experiments on the BTCV and MSD datasets demonstrate that our method significantly outperforms previous FDSL, as well as SSL methods trained on real-world medical datasets, by 1.43% on BTCV and up to 1.51% on MSD task, offering a scalable, annotation-free foundation for medical ViTs. The code will be made publicly available upon acceptance.
[CV-168] Event-Anchored Frame Selection for Effective Long-Video Understanding
【速读】:该论文旨在解决长视频理解中因帧冗余度高和上下文窗口有限而导致的关键帧选择效率低的问题。现有方法采用扁平化采样策略,将视频视为无结构的帧集合,难以捕捉语义事件间的层次关系。其解决方案的核心是提出事件锚定的关键帧选择(Event-Anchored Frame Selection, EFS)框架,该框架通过自监督DINO嵌入对视频进行视觉同质性分段,以识别语义事件;随后在每个事件内选取与查询最相关的帧作为锚点,并利用自适应最大边际相关性(adaptive Maximal Marginal Relevance, MMR)机制进行全局优化,从而在事件覆盖、查询相关性和视觉多样性之间实现协同优化。EFS无需训练且可无缝集成至现成的大规模视觉语言模型(LVLMs),显著提升多个基准测试上的性能表现。
链接: https://arxiv.org/abs/2603.00983
作者: Wang Chen,Yongdong Luo,Yuhui Zeng,Luojun Lin,Tianyu Xie,Fei Chao,Rongrong Ji,Xiawu Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Massive frame redundancy and limited context window make efficient frame selection crucial for long-video understanding with large vision-language models (LVLMs). Prevailing approaches, however, adopt a flat sampling paradigm which treats the video as an unstructured collection of frames. In this paper, we introduce Event-Anchored Frame Selection (EFS), a hierarchical, event-aware pipeline. Leveraging self-supervised DINO embeddings, EFS first partitions the video stream into visually homogeneous temporal segments, which serve as proxies for semantic events. Within each event, it then selects the most query-relevant frame as an anchor. These anchors act as structural priors that guide a global refinement stage using an adaptive Maximal Marginal Relevance (MMR) scheme. This pipeline ensures the final keyframe set jointly optimizes for event coverage, query relevance, and visual diversity. As a training-free, plug-and-play module, EFS can be seamlessly integrated into off-the-shelf LVLMs, yielding substantial gains on challenging video understanding benchmarks. Specifically, when applied to LLaVA-Video-7B, EFS improves accuracy by 4.7%, 4.9%, and 8.8% on VideoMME, LongVideoBench, and MLVU, respectively.
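摘要中全局精化阶段使用的最大边际相关性(MMR)打分,可用如下极简示意说明(非论文官方实现;帧嵌入与相关性分数均为假设,λ 平衡查询相关性与已选帧多样性):

```python
import numpy as np

def mmr_select(emb, rel, k, lam=0.7):
    """MMR 关键帧选择(示意)。emb: (N, d) 已归一化帧嵌入;
    rel: (N,) 查询相关性;每步选取
    λ·rel − (1−λ)·max(与已选帧相似度) 最大的帧。"""
    selected = [int(np.argmax(rel))]
    while len(selected) < k:
        sim = emb @ emb[selected].T                 # 与已选帧的余弦相似度
        score = lam * rel - (1 - lam) * sim.max(axis=1)
        score[selected] = -np.inf                   # 不重复选取
        selected.append(int(np.argmax(score)))
    return selected

emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # 帧0与帧1完全重复
rel = np.array([0.9, 0.85, 0.5])
picked = mmr_select(emb, rel, k=2)
```

若只按相关性排序会选出冗余的 [0, 1];多样性惩罚使 MMR 跳过重复帧,选出覆盖更广的 [0, 2]。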
[CV-169] Fake It Right: Injecting Anatomical Logic into Synthetic Supervised Pre-training for Medical Segmentation
【速读】:该论文旨在解决3D医学图像分割中对大规模标注数据的依赖问题,尤其是在隐私保护和数据获取受限场景下,传统自监督学习(Self-Supervised Learning, SSL)仍面临挑战。其核心解决方案是提出一种解剖学信息引导的合成监督预训练框架(Anatomy-Informed Synthetic Supervised Pre-training),关键在于用包含去标识化、仅含标签的分割掩码构建轻量级形状库替代原始数学基元,并引入结构感知的顺序放置策略——通过空间锚点确保解剖位置合理性,以及拓扑图约束器官间相互关系(如避免不可能的重叠),从而弥合公式驱动监督学习(Formula-Driven Supervised Learning, FDSL)中因通用几何形状导致的语义鸿沟,实现兼具无限可扩展性与解剖真实性的预训练范式。
链接: https://arxiv.org/abs/2603.00979
作者: Jiaqi Tang,Mengyan Zheng,Shu Zhang,Fandong Zhang,Qingchao Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision Transformers (ViTs) excel in 3D medical segmentation but require massive annotated datasets. While Self-Supervised Learning (SSL) mitigates this using unlabeled data, it still faces strict privacy and logistical barriers. Formula-Driven Supervised Learning (FDSL) offers a privacy-preserving alternative by pre-training on synthetic mathematical primitives. However, a critical semantic gap limits its efficacy: generic shapes lack the morphological fidelity, fixed spatial layouts, and inter-organ relationships of real anatomy, preventing models from learning essential global structural priors. To bridge this gap, we propose an Anatomy-Informed Synthetic Supervised Pre-training framework unifying FDSL’s infinite scalability with anatomical realism. We replace basic primitives with a lightweight shape bank with de-identified, label-only segmentation masks from 5 subjects. Furthermore, we introduce a structure-aware sequential placement strategy to govern the patch synthesis process. Instead of random placement, we enforce physiological plausibility using spatial anchors for correct localization and a topological graph to manage inter-organ interactions (e.g., preventing impossible overlaps). Extensive experiments on BTCV and MSD datasets demonstrate that our method significantly outperforms state-of-the-art FDSL baselines and SSL methods by 1.74% and up to 1.66%, while exhibiting a robust scaling effect where performance improves with increased synthetic data volume. This provides a data-efficient, privacy-compliant solution for medical segmentation. The code will be made publicly available upon acceptance.
[CV-170] EraseAnything: Enabling Concept Erasure in Rectified Flow Transformers Leveraging Multi-Object Optimization
【速读】:该论文旨在解决在现代文本到图像(T2I)和文本到视频(T2V)扩散模型中,如何有效移除特定概念的同时保持生成质量的问题,尤其针对采用流匹配(flow-matching)和基于Transformer架构的新型模型(如Stable Diffusion v3、Flux和OpenSora)而言,传统概念擦除方法因不兼容其架构而失效。解决方案的关键在于提出EraseAnything++框架,其核心是将概念擦除建模为一个约束的多目标优化问题,显式平衡概念移除与生成效用保留之间的冲突;通过引入基于隐式梯度手术(implicit gradient surgery)的高效效用保留遗忘策略,并结合LoRA参数微调与注意力层级正则化,锚定关键视觉表征并跨空间和时间维度一致传播擦除效果;在视频场景下进一步设计锚定-传播机制(anchor-and-propagate mechanism),在参考帧初始化擦除并强制贯穿后续Transformer层,从而缓解时间漂移问题,显著提升擦除有效性、生成保真度与时序一致性。
链接: https://arxiv.org/abs/2603.00978
作者: Zhaoxin Fan,Nanxiang Jiang,Daiheng Gao,Shiji Zhou,Wenjun Wu
机构: Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Artificial Intelligence, Beihang University (北京航空航天大学人工智能学院未来区块链与隐私计算先进创新中心); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Removing undesired concepts from large-scale text-to-image (T2I) and text-to-video (T2V) diffusion models while preserving overall generative quality remains a major challenge, particularly as modern models such as Stable Diffusion v3, Flux, and OpenSora employ flow-matching and transformer-based architectures and extend to long-horizon video generation. Existing concept erasure methods, designed for earlier T2I/T2V models, often fail to generalize to these paradigms. To address this issue, we propose EraseAnything++, a unified framework for concept erasure in both image and video diffusion models with flow-matching objectives. Central to our approach is formulating concept erasure as a constrained multi-objective optimization problem that explicitly balances concept removal with preservation of generative utility. To solve the resulting conflicting objectives, we introduce an efficient utility-preserving unlearning strategy based on implicit gradient surgery. Furthermore, by integrating LoRA-based parameter tuning with attention-level regularization, our method anchors erasure on key visual representations and propagates it consistently across spatial and temporal dimensions. In the video setting, we further enhance consistency through an anchor-and-propagate mechanism that initializes erasure on reference frames and enforces it throughout subsequent transformer layers, thereby mitigating temporal drift. Extensive experiments on both image and video benchmarks demonstrate that EraseAnything++ substantially outperforms prior methods in erasure effectiveness, generative fidelity, and temporal consistency, establishing a new state of the art for concept erasure in next-generation diffusion models.
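示例:摘要仅提到"基于隐式梯度手术(implicit gradient surgery)的效用保留遗忘策略",未给出具体形式。下面给出经典 PCGrad 式梯度冲突投影的极简 NumPy 示意(非论文官方实现,函数名与投影方式均为假设),用于说明"概念擦除目标"与"效用保留目标"梯度冲突时的一种常见处理思路:

```python
import numpy as np

def surgery(g_erase, g_keep):
    """PCGrad 式梯度手术示意(假设实现):若擦除梯度与效用保留梯度
    冲突(内积为负),先把擦除梯度投影到保留梯度的法平面,再合并。"""
    dot = float(np.dot(g_erase, g_keep))
    if dot < 0:  # 两个目标方向冲突
        g_erase = g_erase - dot / float(np.dot(g_keep, g_keep)) * g_keep
    return g_erase + g_keep
```

投影后,合并梯度在效用保留方向上的分量不再被擦除目标抵消,这正是"平衡擦除与生成效用"的直观含义。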
[CV-171] PreciseCache: Precise Feature Caching for Efficient and High-fidelity Video Generation ICLR2026
【速读】:该论文旨在解决视频生成模型在实际应用中因计算成本高和推理速度慢而导致的效率瓶颈问题。现有方法通过特征缓存(feature caching)加速生成过程,但常因无法区分真正冗余的特征而误跳过重要计算,导致质量显著下降。其解决方案的关键在于提出一种即插即用的框架PreciseCache,该框架通过两个核心组件实现精确的冗余检测与跳过:一是LFCache,基于低频差异(Low-Frequency Difference, LFD)判断每一步的冗余性,从而精准识别可跳过的步骤;二是BlockCache,在非跳过步骤内部进一步以块为单位检测并跳过冗余计算。实验表明,PreciseCache可在不明显损失质量的前提下实现平均2.6倍的加速效果。
链接: https://arxiv.org/abs/2603.00976
作者: Jiangshan Wang,Kang Zhao,Jiayi Guo,Jiayu Wang,Hang Guo,Chenyang Zhu,Xiu Li,Xiangyu Yue
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2026
Abstract:High computational costs and slow inference hinder the practical application of video generation models. While prior works accelerate the generation process through feature caching, they often suffer from notable quality degradation. In this work, we reveal that this issue arises from their inability to distinguish truly redundant features, which leads to the unintended skipping of computations on important features. To address this, we propose PreciseCache, a plug-and-play framework that precisely detects and skips truly redundant computations, thereby accelerating inference without sacrificing quality. Specifically, PreciseCache contains two components: LFCache for step-wise caching and BlockCache for block-wise caching. For LFCache, we compute the Low-Frequency Difference (LFD) between the prediction features of the current step and those from the previous cached step. Empirically, we observe that LFD serves as an effective measure of step-wise redundancy, accurately detecting highly redundant steps whose computation can be skipped through reusing cached features. To further accelerate generation within each non-skipped step, we propose BlockCache, which precisely detects and skips redundant computations at the block level within the network. Extensive experiments on various backbones demonstrate the effectiveness of our PreciseCache, which achieves an average of 2.6x speedup without noticeable quality loss. Source code will be released.
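示例:LFCache 以相邻步预测特征的低频差异(LFD)判断步级冗余。下面是一个极简 NumPy 示意(非论文实现,低通窗口 keep 与阈值 tau 均为假设值),演示"低频差异小则跳过该步、复用缓存"的判定逻辑:

```python
import numpy as np

def low_freq(x, keep=4):
    """用 FFT 低通滤波取出 2D 特征图的低频分量(keep 为假设的保留带宽)。"""
    F = np.fft.fftshift(np.fft.fft2(x))
    h, w = x.shape
    mask = np.zeros_like(F)
    cy, cx = h // 2, w // 2
    mask[cy - keep:cy + keep, cx - keep:cx + keep] = 1.0
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

def lfd(curr, cached):
    """当前步与上一缓存步低频分量的平均绝对差。"""
    return float(np.mean(np.abs(low_freq(curr) - low_freq(cached))))

def should_skip(curr, cached, tau=1e-2):
    """LFD 低于阈值则认为该步高度冗余,可跳过计算复用缓存。"""
    return lfd(curr, cached) < tau
```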
[CV-172] Decoupling Motion and Geometry in 4D Gaussian Splatting
【速读】:该论文旨在解决动态场景高保真重建中因4D高斯点阵(4D Gaussian Splatting, 4DGS)将高斯运动与几何属性耦合在单一协方差建模中而导致表达能力受限、易产生视觉伪影的问题。其解决方案的关键在于提出VeGaS框架,通过引入伽利略剪切矩阵(Galilean shearing matrix)显式建模时变速度,从而灵活刻画复杂非线性运动,并严格分离高斯运动效应与与几何相关的条件高斯协方差;同时设计几何形变网络(Geometric Deformation Network),利用时空上下文和速度线索优化高斯形状与朝向,显著提升时间维度上的几何建模精度。
链接: https://arxiv.org/abs/2603.00952
作者: Yi Zhang,Yulei Kang,Jian-Fang Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:High-fidelity reconstruction of dynamic scenes is an important yet challenging problem. While recent 4D Gaussian Splatting (4DGS) has demonstrated the ability to model temporal dynamics, it couples Gaussian motion and geometric attributes within a single covariance formulation, which limits its expressiveness for complex motions and often leads to visual artifacts. To address this, we propose VeGaS, a novel velocity-based 4D Gaussian Splatting framework that decouples Gaussian motion and geometry. Specifically, we introduce a Galilean shearing matrix that explicitly incorporates time-varying velocity to flexibly model complex non-linear motions, while strictly isolating the effects of Gaussian motion from the geometry-related conditional Gaussian covariance. Furthermore, a Geometric Deformation Network is introduced to refine Gaussian shapes and orientations using spatio-temporal context and velocity cues, enhancing temporal geometric modeling. Extensive experiments on public datasets demonstrate that VeGaS achieves state-of-the-art performance.
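示例:VeGaS 用伽利略剪切矩阵把速度显式引入运动建模,与几何协方差解耦。下面用 NumPy 演示作用在 (x, y, z, t) 坐标上的伽利略剪切(示意:论文中速度为时变量,这里仅取某一时刻的常速度 v):

```python
import numpy as np

def galilean_matrix(v):
    """伽利略剪切矩阵:空间坐标获得 v * t 的平移,时间轴本身不变。"""
    M = np.eye(4)
    M[:3, 3] = v  # 剪切项:x' = x + v * t
    return M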
[CV-173] When Does Margin Clamping Affect Training Variance? Dataset-Dependent Effects in Contrastive Forward-Forward Learning
【速读】:该论文旨在解决对比前向-前向(Contrastive Forward-Forward, CFF)训练中因随机种子导致的性能不稳定性问题,尤其聚焦于对比损失中正样本对边距(positive-pair margin)的实现方式——即通过饱和相似度夹紧(saturating similarity clamping, min(s+m,1))所引发的梯度截断效应。研究表明,这种夹紧机制会导致早期层梯度被人为截断,从而显著增加测试准确率的方差(在CIFAR-10上提升5.90倍,p=0.003),而均值准确率不变。解决方案的关键在于采用一种梯度中性(gradient-neutral)的替代形式:在对数概率后减去边距(subtracting the margin after the log-probability),该方法在均值-正样本归约(mean-over-positives reduction)下保持梯度一致性,且在CIFAR-10等特定条件下可有效消除方差膨胀而不牺牲平均性能。
链接: https://arxiv.org/abs/2603.00951
作者: Joshua Steier
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 2 figures, 15 tables, including appendices
Abstract:Contrastive Forward-Forward (CFF) learning trains Vision Transformers layer by layer against supervised contrastive objectives. CFF training can be sensitive to random seed, but the sources of this instability are poorly understood. We focus on one implementation detail: the positive-pair margin in the contrastive loss is applied through saturating similarity clamping, min(s + m, 1). We prove that an alternative formulation, subtracting the margin after the log-probability, is gradient-neutral under the mean-over-positives reduction. On CIFAR-10 (2×2 factorial, n=7 seeds per cell), clamping produces 5.90× higher pooled test-accuracy variance (p=0.003) with no difference in mean accuracy. Analyses of clamp activation rates, layerwise gradient norms, and a reduced-margin probe point to saturation-driven gradient truncation at early layers. The effect does not transfer cleanly to other datasets: on CIFAR-100, SVHN, and Fashion-MNIST, clamping produces equal or lower variance. Two factors account for the discrepancy. First, positive-pair density per batch controls how often saturation occurs. Second, task difficulty compresses seed-to-seed spread when accuracy is high. An SVHN difficulty sweep confirms the interaction on a single dataset, with the variance ratio moving from 0.25× at high accuracy to 16.73× under aggressive augmentation. In moderate-accuracy regimes with many same-class pairs per batch, switching to the gradient-neutral subtraction reference removes this variance inflation at no cost to mean accuracy. Measuring the layer-0 clamp activation rate serves as a simple check for whether the problem applies.
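示例:摘要给出了两种边距实现,即饱和夹紧 min(s + m, 1) 与"对数概率后减边距"。下面用数值梯度直接验证:夹紧形式在 s + m > 1 的饱和区把梯度截断为零,而减法形式处处保留梯度(为简化,此处只看单个正样本相似度 s,而非完整的 mean-over-positives 对比损失):

```python
import numpy as np

def clamped_logprob(s, m=0.2):
    """饱和相似度夹紧形式:log(min(s + m, 1)),饱和后梯度被截断。"""
    return np.log(np.minimum(s + m, 1.0))

def shifted_logprob(s, m=0.2):
    """梯度中性的替代形式:在对数概率之后减去边距。"""
    return np.log(s) - m

def num_grad(f, s, eps=1e-6):
    """中心差分数值梯度。"""
    return (f(s + eps) - f(s - eps)) / (2 * eps)
```

在 s = 0.9、m = 0.2 时 s + m > 1,夹紧形式的数值梯度恰为 0,这正是摘要所述"饱和驱动的梯度截断"。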
[CV-174] StegoNGP: 3D Cryptographic Steganography using Instant-NGP
【速读】:该论文旨在解决如何在不改变网络结构和参数数量的前提下,安全地嵌入高容量隐藏数据(如完整3D场景)的问题。现有方法依赖外部解码器、需修改架构且容量有限,易被检测。其解决方案的关键在于提出一种无参数的3D密码隐写术(StegoNGP),利用Instant-NGP中的哈希编码函数作为受密钥控制的场景切换器:通过默认密钥映射到载体场景、秘密密钥映射到隐藏场景,训练单一模型将两者表示交织于同一网络权重中,从而实现不可感知且安全的信息隐藏;同时引入多密钥方案,在哈希层级分配多个独立密钥,显著扩展密钥空间并提升对部分密钥泄露攻击的鲁棒性。
链接: https://arxiv.org/abs/2603.00949
作者: Wenxiang Jiang,Yujun Lan,Shuo Zhao,Yuanshan Liu,Mingzhu Zhou,Jinxin Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recently, Instant Neural Graphics Primitives (Instant-NGP) has achieved significant success in rapid 3D scene reconstruction, but securely embedding high-capacity hidden data, such as an entire 3D scene, remains a challenge. Existing methods rely on external decoders, require architectural modifications, and suffer from limited capacity, which makes them easily detectable. We propose a novel parameter-free 3D Cryptographic Steganography using Instant-NGP (StegoNGP), which leverages the Instant-NGP hash encoding function as a key-controlled scene switcher. By associating a default key with a cover scene and a secret key with a hidden scene, our method trains a single model to interweave both representations within the same network weights. The resulting model is indistinguishable from a standard Instant-NGP in architecture and parameter count. We also introduce an enhanced Multi-Key scheme, which assigns multiple independent keys across hash levels, dramatically expanding the key space and providing high robustness against partial key disclosure attacks. Experimental results demonstrated that StegoNGP can hide a complete high-quality 3D scene with strong imperceptibility and security, providing a new paradigm for high-capacity, undetectable information hiding in neural fields. The code can be found at this https URL.
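示例:摘要把哈希编码函数用作密钥控制的场景切换器,但未给出具体哈希形式。下面仿照 Instant-NGP 常用的"坐标乘素数再异或"空间哈希,给出一个把密钥混入哈希索引的假设性示意(函数名、素数与表大小均为示例值,非论文实现):

```python
def keyed_hash(coords, key, table_size=2 ** 14):
    """受密钥控制的空间哈希示意:同一坐标在不同密钥下
    索引到不同的哈希表项,相当于同一组权重下切换到不同场景。"""
    primes = (1, 2654435761, 805459861)  # Instant-NGP 风格的大素数
    h = key
    for c, p in zip(coords, primes):
        h ^= c * p
    return h % table_size
```

默认密钥索引出载体场景的特征,秘密密钥索引出隐藏场景的特征,两者交织在同一张哈希表中。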
[CV-175] Mobile-VTON: High-Fidelity On-Device Virtual Try-On
【速读】:该论文旨在解决虚拟试衣(Virtual Try-On, VTON)系统在云端部署时存在的隐私泄露风险及移动端部署受限的问题。现有方法通常需要将用户个人图像上传至云服务器进行计算,不仅存在数据安全隐患,也难以满足实时性和离线使用的需求。为此,作者提出了一种名为 Mobile-VTON 的高保真、隐私保护的端侧虚拟试衣框架,其核心创新在于构建了一个模块化架构——TeacherNet–GarmentNet–TryonNet(TGT),集成知识蒸馏、服装条件生成与服装对齐机制,并针对移动设备优化计算效率。关键解决方案包括:1)引入特征引导对抗蒸馏(Feature-Guided Adversarial, FGA)策略,融合教师模型监督与对抗学习以逼近真实图像分布;2)通过轨迹一致性损失训练GarmentNet,确保扩散过程中服装语义稳定;3)采用潜空间拼接和轻量级跨模态条件控制的TryonNet,实现无需大规模预训练的鲁棒服装到人体对齐。实验表明,该方案在VITON-HD和DressCode数据集上达到或超越云端基线性能,同时完全离线运行,验证了高质量VTON在移动端的可行性与实用性。
链接: https://arxiv.org/abs/2603.00947
作者: Zhenchen Wan,Ce Chen,Runqi Lin,Jiaxin Huang,Tianxi Chen,Yanwu Xu,Tongliang Liu,Mingming Gong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The project page is available at: this https URL
Abstract:Virtual try-on (VTON) has recently achieved impressive visual fidelity, but most existing systems require uploading personal photos to cloud-based GPUs, raising privacy concerns and limiting on-device deployment. To address this, we present Mobile-VTON, a high-quality, privacy-preserving framework that enables fully offline virtual try-on on commodity mobile devices using only a single user image and a garment image. Mobile-VTON introduces a modular TeacherNet–GarmentNet–TryonNet (TGT) architecture that integrates knowledge distillation, garment-conditioned generation, and garment alignment into a unified pipeline optimized for on-device efficiency. Within this framework, we propose a Feature-Guided Adversarial (FGA) Distillation strategy that combines teacher supervision with adversarial learning to better match real-world image distributions. GarmentNet is trained with a trajectory-consistency loss to preserve garment semantics across diffusion steps, while TryonNet uses latent concatenation and lightweight cross-modal conditioning to enable robust garment-to-person alignment without large-scale pretraining. By combining these components, Mobile-VTON achieves high-fidelity generation with low computational overhead. Experiments on VITON-HD and DressCode at 1024×768 show that it matches or outperforms strong server-based baselines while running entirely offline. These results demonstrate that high-quality VTON is not only feasible but also practical on-device, offering a secure solution for real-world applications.
[CV-176] Seeing Beyond 8bits: Subjective and Objective Quality Assessment of HDR-UGC Videos
【速读】:该论文旨在解决当前感知视频质量评估(Perceptual Video Quality Assessment, VQA)系统对高动态范围(High Dynamic Range, HDR)用户生成内容(User-Generated Content, UGC)适应性不足的问题。现有VQA模型主要针对标准动态范围(Standard Dynamic Range, SDR)设计,无法有效识别HDR特有的失真类型(如近黑区压扁、高光截断、色带效应和曝光闪烁),导致评估性能下降。解决方案的关键在于提出HDR-Q,首个面向HDR-UGC的多模态大语言模型(Multimodal Large Language Model, MLLM),其核心创新包括:(i) 一种新型HDR感知视觉编码器,用于生成对HDR特征敏感的嵌入表示;(ii) HDR感知策略优化(HDR-Aware Policy Optimization, HAPO),通过引入HDR-SDR对比KL散度约束和高斯加权回归奖励机制,在强化学习微调过程中增强模型对HDR输入的依赖并实现更精细的平均意见分数(Mean Opinion Score, MOS)校准,从而在Beyond8Bits及公开HDR-VQA基准上达到最先进性能。
链接: https://arxiv.org/abs/2603.00938
作者: Shreshth Saini,Bowen Chen,Neil Birkbeck,Yilin Wang,Balu Adsumilli,Alan C. Bovik
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:High Dynamic Range (HDR) user-generated (UGC) videos are rapidly proliferating across social platforms, yet most perceptual video quality assessment (VQA) systems remain tailored to Standard Dynamic Range (SDR). HDR has a higher bit depth, wide color gamut, and elevated luminance range, exposing distortions such as near-black crushing, highlight clipping, banding, and exposure flicker that amplify UGC artifacts and challenge SDR models. To catalyze progress, we curate Beyond8Bits, a large-scale subjective dataset of 44K videos from 6.5K sources with over 1.5M crowd ratings, spanning diverse scenes, capture conditions, and compression settings. We further introduce HDR-Q, the first Multimodal Large Language Model (MLLM) for HDR-UGC VQA. We propose (i) a novel HDR-aware vision encoder to produce HDR-sensitive embeddings, and (ii) HDR-Aware Policy Optimization (HAPO), an RL finetuning framework that anchors reasoning to HDR cues. HAPO augments GRPO via an HDR-SDR contrastive KL that encourages token reliance on HDR inputs and a Gaussian weighted regression reward for fine-grained MOS calibration. Across Beyond8Bits and public HDR-VQA benchmarks, HDR-Q delivers state-of-the-art performance.
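示例:HAPO 使用高斯加权回归奖励做细粒度 MOS 校准,但摘要未给出公式。以下为标准高斯核形式的假设性示意(sigma 为假设超参数,非论文设定):

```python
import numpy as np

def gaussian_reward(pred_mos, true_mos, sigma=0.5):
    """高斯加权回归奖励示意:预测 MOS 命中标签时奖励为 1,
    随平方误差平滑衰减,从而提供细粒度的校准信号。"""
    return float(np.exp(-((pred_mos - true_mos) ** 2) / (2 * sigma ** 2)))
```

与"误差小于阈值得 1、否则得 0"的硬奖励相比,这种平滑形状能区分"差一点"与"差很多"的预测。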
[CV-177] Learning to Weigh Waste: A Physics-Informed Multimodal Fusion Framework and Large-Scale Dataset for Commercial and Industrial Applications
【速读】:该论文旨在解决基于图像的商业与工业废弃物重量估计难题,其核心挑战在于:外观相似的物体可能密度不同,且物体在图像中的可见尺寸受相机距离影响显著。为应对这一问题,作者提出Multimodal Weight Predictor (MWP)框架,其关键创新在于融合RGB图像与物理信息驱动的元数据(如物体尺寸、相机距离和高度),通过Vision Transformer提取视觉特征,结合专用元数据编码器处理几何与类别信息,并采用Stacked Mutual Attention Fusion机制实现视觉与物理线索的相互引导,从而有效缓解透视效应并关联物体至材料属性。此外,模型使用均方对数误差训练以确保在宽重量范围(3.5–3,450 kg)内稳定性能,最终在自建Waste-Weight-10K数据集上实现88.06 kg MAE、6.39% MAPE和0.9548 R²,且具备可解释性——通过SHAP与大语言模型生成人类可读的预测解释。
链接: https://arxiv.org/abs/2603.00931
作者: Md. Adnanul Islam,Wasimul Karim,Md Mahbub Alam,Subhey Sadi Rahman,Md. Abdur Rahman,Arefin Ittesafun Abian,Mohaimenul Azam Khan Raiaan,Kheng Cher Yeo,Deepika Mathur,Sami Azam
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate weight estimation of commercial and industrial waste is important for efficient operations, yet image-based estimation remains difficult because similar-looking objects may have different densities, and the visible size changes with camera distance. Addressing this problem, we propose Multimodal Weight Predictor (MWP) framework that estimates waste weight by combining RGB images with physics-informed metadata, including object dimensions, camera distance, and camera height. We also introduce Waste-Weight-10K, a real-world dataset containing 10,421 synchronized image-metadata collected from logistics and recycling sites. The dataset covers 11 waste categories and a wide weight range from 3.5 to 3,450 kg. Our model uses a Vision Transformer for visual features and a dedicated metadata encoder for geometric and category information, combining them with Stacked Mutual Attention Fusion that allows visual and physical cues guide each other. This helps the model manage perspective effects and link objects to material properties. To ensure stable performance across the wide weight range, we train the model using Mean Squared Logarithmic Error. On the test set, the proposed method achieves 88.06 kg Mean Absolute Error (MAE), 6.39% Mean Absolute Percentage Error (MAPE), and an R2 coefficient of 0.9548. The model shows strong accuracy for light objects in the 0-100 kg range with 2.38 kg MAE and 3.1% MAPE, maintaining reliable performance for heavy waste in the 1000-2000 kg range with 11.1% MAPE. Finally, we incorporate a physically grounded explanation module using Shapley Additive Explanations (SHAP) and a large language model to provide clear, human-readable explanations for each prediction.
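示例:论文以均方对数误差(MSLE)训练,使 3.5–3,450 kg 的宽重量范围保持稳定。下面是 MSLE 的一个最小 NumPy 实现示意(log1p 形式为该损失的常见写法,非论文代码):

```python
import numpy as np

def msle(pred, target):
    """均方对数误差:惩罚相对误差而非绝对误差,
    使轻物与重物在同一尺度上对损失的贡献相当。"""
    return float(np.mean((np.log1p(pred) - np.log1p(target)) ** 2))
```

对数变换后,轻物上 10% 的偏差与重物上 10% 的偏差产生相近的损失,避免重物主导训练。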
[CV-178] DriveCode: Domain Specific Numerical Encoding for LLM -Based Autonomous Driving
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在自动驾驶应用中因数值离散化表示而导致的精确数值推理能力不足问题,具体表现为:将数字拆分为离散文本标记会削弱数字位值的重要性、影响训练目标的有效性,并难以同时实现解码效率与数值精度。这限制了传感器数据处理和控制指令生成的准确性,成为LLM驱动自动驾驶系统部署的关键障碍。解决方案的核心在于提出DriveCode,一种新型数值编码方法,其关键创新是将数字映射为专用嵌入向量(dedicated embeddings),而非传统离散文本标记;并通过一个数字投影器(number projector)将数值直接投射到语言模型的隐藏空间中,从而实现与视觉和文本特征在统一多模态序列中的无缝融合,显著提升了轨迹预测与控制信号生成的性能。
链接: https://arxiv.org/abs/2603.00919
作者: Zhiye Wang,Yanbo Jiang,Rui Zhou,Bo Zhang,Fang Zhang,Zhenhua Xu,Yaqin Zhang,Jianqiang Wang
机构: Lanzhou University (兰州大学); Tsinghua University (清华大学); DiDi (滴滴出行); The Institute for AI Industry Research (AIR), Tsinghua University (清华大学人工智能产业研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: The project page is available at this https URL
Abstract:Large language models (LLMs) have shown great promise for autonomous driving. However, discretizing numbers into tokens limits precise numerical reasoning, fails to reflect the positional significance of digits in the training objective, and makes it difficult to achieve both decoding efficiency and numerical precision. These limitations affect both the processing of sensor measurements and the generation of precise control commands, creating a fundamental barrier for deploying LLM-based autonomous driving systems. In this paper, we introduce DriveCode, a novel numerical encoding method that represents numbers as dedicated embeddings rather than discrete text tokens. DriveCode employs a number projector to map numbers into the language model’s hidden space, enabling seamless integration with visual and textual features in a unified multimodal sequence. Evaluated on OmniDrive, DriveGPT4, and DriveGPT4-V2 datasets, DriveCode demonstrates superior performance in trajectory prediction and control signal generation, confirming its effectiveness for LLM-based autonomous driving systems.
[CV-179] Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards CVPR2026
【速读】:该论文旨在解决文本到图像生成模型在后训练阶段难以准确匹配人类偏好、事实一致性与美学质量的问题。传统方法依赖外部奖励监督,需额外数据集、标注人员或奖励模型,成本高且易受偏差影响。其解决方案的关键在于提出ARC(Adaptive Rewarding by self-Confidence)框架,通过模型自身对注入噪声的自去噪恢复能力来提取内生的置信度信号,并将其转化为标量奖励,实现无需外部监督的完全无监督优化,从而在组合生成、文本渲染和图文对齐等任务上显著提升性能,同时与外部奖励结合时还能缓解奖励劫持问题。
链接: https://arxiv.org/abs/2603.00918
作者: Seungwook Kim,Minsu Cho
机构: Pohang University of Science and Technology (POSTECH); RLWRLD
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 19 pages, accepted to CVPR 2026. Project page this https URL
Abstract:Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to better match human preferences, factuality, and improved aesthetics. We introduce ARC (Adaptive Rewarding by self-Confidence), a post-training framework that replaces external reward supervision with an internal self-confidence signal, obtained by evaluating how accurately the model recovers injected noise under self-denoising probes. ARC converts this intrinsic signal into scalar rewards, enabling fully unsupervised optimization without additional datasets, annotators, or reward models. Empirically, by reinforcing high-confidence generations, ARC delivers consistent gains in compositional generation, text rendering and text-image alignment over the baseline. We also find that integrating ARC with external rewards results in a complementary improvement, with alleviated reward hacking.
[CV-180] VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection CVPR2026
【速读】:该论文旨在解决传感器几何信息缺失(Sensor-Geometry-Free, SG-Free)条件下的多视角室内三维目标检测问题,即在无精确相机位姿或深度信息的情况下实现鲁棒的3D目标检测。传统方法依赖昂贵的多视角相机标定来融合视图信息,限制了其在真实场景中的部署。解决方案的关键在于提出VGGT-Det框架,通过引入两个核心组件:(i) 注意力引导查询生成(Attention-Guided Query Generation, AG),利用VGGT内部注意力图作为语义先验初始化对象查询,从而聚焦于目标区域并保持全局空间结构;(ii) 查询驱动特征聚合(Query-Driven Feature Aggregation, QD),设计可学习的“看-查询”机制与对象查询交互,动态聚合跨层几何特征,逐步将2D特征提升至3D表示。该方法有效利用了VGGT中隐式学习到的语义和几何先验,显著提升了SG-Free设置下的检测性能,在ScanNet和ARKitScenes数据集上分别达到4.4和8.6 mAP@0.25的提升。
链接: https://arxiv.org/abs/2603.00912
作者: Yang Cao,Feize Wu,Dave Zhenyu Chen,Yingji Zhong,Lanqing Hong,Dan Xu
机构: Hong Kong University of Science and Technology (香港科技大学); Huawei (华为); Sun Yat-Sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026. Code Page: this https URL
Abstract:Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where there are no sensor-provided geometric inputs (multi-view poses or depth). Recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates VGGT encoder into a transformer-based pipeline. To effectively leverage both the semantic and geometric priors from inside VGGT, we introduce two novel key components: (i) Attention-Guided Query Generation (AG): exploits VGGT attention maps as semantic priors to initialize object queries, improving localization by focusing on object regions while preserving global spatial structure; (ii) Query-Driven Feature Aggregation (QD): a learnable See-Query interacts with object queries to ‘see’ what they need, and then dynamically aggregates multi-level geometric features across VGGT layers that progressively lift 2D features into 3D. Experiments show that VGGT-Det significantly surpasses the best-performing method in the SG-Free setting by 4.4 and 8.6 mAP@0.25 on ScanNet and ARKitScenes, respectively. Ablation study shows that VGGT’s internally learned semantic and geometric priors can be effectively leveraged by our AG and QD.
[CV-181] On the Exact Algorithmic Extraction of Finite Tesselations Through Prime Extraction of Minimal Representative Forms
【速读】:该论文旨在解决离散网格中重复模式识别的难题,尤其是在符号推理、算法合成和结构优化等计算领域中,如何准确发现精确的轴对齐矩形镶嵌(tessellations)问题。当前统计方法虽能近似识别噪声数据中的模式,但基于确定性提取周期结构的符号分析仍不成熟。解决方案的关键在于提出一种分层算法:通过复合发现机制(双重检测与广度优先剪枝)识别具有内部重复性的矩形区域,利用归一化技术将其映射到最小代表形式,并采用素提取策略(选择性复制与分层记忆化)处理非规则尺寸并实现高效计算时间。该方法确保了对有限平面网格中多重独立模式的精确识别,适用于拼图推理任务及离散符号域中精确重复结构的识别。
链接: https://arxiv.org/abs/2603.00911
作者: Sushish Baral,Paulo Garcia,Warisa Sritriratanarak
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The identification of repeating patterns in discrete grids is rudimentary within symbolic reasoning, algorithm synthesis and structural optimization across diverse computational domains. Although statistical approaches targeting noisy data can approximately recognize patterns, symbolic analysis utilizing deterministic extraction of periodic structures is underdeveloped. This paper aims to fill this gap by employing a hierarchical algorithm that discovers exact tessellations in finite planar grids, addressing the problem where multiple independent patterns may coexist within a hierarchical structure. The proposed method utilizes composite discovery (dual inspection and breadth-first pruning) for identifying rectangular regions with internal repetition, normalization to a minimal representative form, and prime extraction (selective duplication and hierarchical memoization) to account for irregular dimensions and to achieve efficient computation time. We evaluate scalability on grid sizes from 2x2 to 32x32, showing overlap detection on simple repeating tiles exhibits processing time under 1ms, while complex patterns which require exhaustive search and systematic exploration shows exponential growth. This algorithm provides deterministic behavior for exact, axis-aligned, rectangular tessellations, addressing a critical gap in symbolic grid analysis techniques, applicable to puzzle solving reasoning tasks and identification of exact repeating structures in discrete symbolic domains.
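示例:摘要中"归一化到最小代表形式"可以理解为寻找能精确平铺整个网格的最小矩形瓦片。下面给出一个与论文分层算法无关的暴力搜索示意(仅演示概念,按尺寸因子枚举候选瓦片):

```python
def minimal_tile(grid):
    """返回能精确铺满 grid(等长行列表)的最小瓦片尺寸 (ph, pw),
    按高宽因子从小到大枚举,逐格验证周期性。"""
    h, w = len(grid), len(grid[0])
    for ph in range(1, h + 1):
        if h % ph:
            continue
        for pw in range(1, w + 1):
            if w % pw:
                continue
            if all(grid[r][c] == grid[r % ph][c % pw]
                   for r in range(h) for c in range(w)):
                return ph, pw
    return h, w
```

若网格没有内部重复,返回的最小瓦片就是网格本身。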
[CV-182] UD-SfPNet: An Underwater Descattering Shape-from-Polarization Network for 3D Normal Reconstruction
【速读】:该论文旨在解决水下光学成像中因散射导致的图像质量退化问题,同时实现高精度的三维表面法向量预测。其关键解决方案是提出UD-SfPNet网络架构,该架构将偏振图像去散射与基于偏振的形状恢复(Shape-from-Polarization, SfP)法向估计统一建模在一个端到端的流水线中,避免了传统分步处理带来的误差累积,并支持两个任务间的全局优化;此外,引入颜色嵌入模块以增强几何一致性,并设计细节增强卷积模块以保留高频几何细节,从而显著提升水下场景下的3D重建精度,实验表明该方法在MuS-Polar3D数据集上实现了最低的平均表面法向角误差(15.12°)。
链接: https://arxiv.org/abs/2603.00908
作者: Puyun Wang,Kaimin Yu,Huayang He,Feng Huang,Xianyu Wu,Yating Chen
机构: Fuzhou University (福州大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Underwater optical imaging is severely hindered by scattering, but polarization imaging offers the unique dual advantages of descattering and shape-from-polarization (SfP) 3D reconstruction. To exploit these advantages, this paper proposes UD-SfPNet, an underwater descattering shape-from-polarization network that leverages polarization cues for improved 3D surface normal prediction. The framework jointly models polarization-based image descattering and SfP normal estimation in a unified pipeline, avoiding error accumulation from sequential processing and enabling global optimization across both tasks. UD-SfPNet further incorporates a novel color embedding module to enhance geometric consistency by exploiting the relationship between color encodings and surface orientation. A detail enhancement convolution module is also included to better preserve high-frequency geometric details that are lost under scattering. Experiments on the MuS-Polar3D dataset show that the proposed method significantly improves reconstruction accuracy, achieving a mean surface normal angular error of 15.12° (the lowest among compared methods). These results confirm the efficacy of combining descattering with polarization-based shape inference, and highlight the practical significance and potential applications of UD-SfPNet for optical 3D imaging in challenging underwater environments. The code is available at this https URL.
[CV-183] ShiftLUT: Spatial Shift Enhanced Look-Up Tables for Efficient Image Restoration CVPR2026
【速读】:该论文旨在解决基于查找表(Look-Up Table, LUT)的图像恢复方法在扩大感受野(receptive field)时引入额外计算与存储开销的问题,从而限制其在边缘设备上的部署。解决方案的关键在于三个互补组件:首先,提出可学习的空间偏移模块(Learnable Spatial Shift module, LSS),通过在特征图上施加可学习的、通道独立的空间偏移来扩展感受野;其次,设计不对称双分支架构,将更多计算资源分配给信息密集分支,在不牺牲恢复质量的前提下显著降低推理延迟;最后,引入基于特征层面的LUT压缩策略——误差有界自适应采样(Error-bounded Adaptive Sampling, EAS),有效减少存储开销。该方法在保持小模型尺寸和低推理时间的同时,实现了比当前最优方法TinyLUT更大的感受野(提升3.8倍)和更高的重建质量(平均PSNR提升超过0.21 dB)。
链接: https://arxiv.org/abs/2603.00906
作者: Xiaolong Zeng,Yitong Yu,Shiyao Xiong,Jinhua Hao,Ming Sun,Chao Zhou,Bin Wang
机构: Tsinghua University (清华大学); Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Look-Up Table based methods have emerged as a promising direction for efficient image restoration tasks. Recent LUT-based methods focus on improving their performance by expanding the receptive field. However, they inevitably introduce extra computational and storage overhead, which hinders their deployment in edge devices. To address this issue, we propose ShiftLUT, a novel framework that attains the largest receptive field among all LUT-based methods while maintaining high efficiency. Our key insight lies in three complementary components. First, Learnable Spatial Shift module (LSS) is introduced to expand the receptive field by applying learnable, channel-wise spatial offsets on feature maps. Second, we propose an asymmetric dual-branch architecture that allocates more computation to the information-dense branch, substantially reducing inference latency without compromising restoration quality. Finally, we incorporate a feature-level LUT compression strategy called Error-bounded Adaptive Sampling (EAS) to minimize the storage overhead. Compared to the previous state-of-the-art method TinyLUT, ShiftLUT achieves a 3.8× larger receptive field and improves an average PSNR by over 0.21 dB across multiple standard benchmarks, while maintaining a small storage size and inference time.
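示例:LSS 模块通过逐通道的空间偏移扩大感受野,且不增加乘加运算。下面用 NumPy 的循环移位演示固定整数偏移的版本(论文中偏移是可学习的,此处取定值仅作示意):

```python
import numpy as np

def spatial_shift(feat, offsets):
    """对 (C, H, W) 特征图的每个通道施加各自的 (dy, dx) 整数偏移,
    使后续逐像素查表能"看到"邻域信息,从而扩大感受野。"""
    out = np.empty_like(feat)
    for c, (dy, dx) in enumerate(offsets):
        out[c] = np.roll(feat[c], shift=(dy, dx), axis=(0, 1))
    return out
```

不同通道朝不同方向移位后,同一空间位置上堆叠的就是来自不同邻域像素的特征。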
[CV-184] pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning ICLR2026
【速读】:该论文旨在解决多模态大语言模型(Multi-modal Large Language Models, MLLMs)在需要三维空间理解的任务中表现不足的问题。现有MLLMs虽具备通用感知与推理能力,但在处理涉及真实世界空间结构的复杂任务时仍存在局限。解决方案的关键在于提出pySpatial——一个视觉编程框架,通过生成Python代码调用空间工具(如3D重建、相机位姿恢复、新视角渲染等),将原始2D图像序列转化为可交互的3D场景,从而让MLLMs能够基于结构化的空间表示进行显式推理。该方法无需梯度微调,在零样本(zero-shot)设置下即可实现高效的空间推理,且已在MindCube和Omni3D-Bench等基准上显著优于主流基线模型,并在真实室内导航实验中验证了其实际应用价值。
链接: https://arxiv.org/abs/2603.00905
作者: Zhanpeng Luo,Ce Zhang,Silong Yong,Cunxi Dai,Qianwei Wang,Haoxi Ran,Guanya Shi,Katia Sycara,Yaqi Xie
机构: Carnegie Mellon University (卡内基梅隆大学); University of Pittsburgh (匹兹堡大学); University of Michigan (密歇根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICLR 2026, Project Page: this https URL
Abstract:Multi-modal Large Language Models (MLLMs) have demonstrated strong capabilities in general-purpose perception and reasoning, but they still struggle with tasks that require spatial understanding of the 3D world. To address this, we introduce pySpatial, a visual programming framework that equips MLLMs with the ability to interface with spatial tools via Python code generation. Given an image sequence and a natural-language query, the model composes function calls to spatial tools including 3D reconstruction, camera-pose recovery, novel-view rendering, etc. These operations convert raw 2D inputs into an explorable 3D scene, enabling MLLMs to reason explicitly over structured spatial representations. Notably, pySpatial requires no gradient-based fine-tuning and operates in a fully zero-shot setting. Experimental evaluations on the challenging MindCube and Omni3D-Bench benchmarks demonstrate that our framework pySpatial consistently surpasses strong MLLM baselines; for instance, it outperforms GPT-4.1-mini by 12.94% on MindCube. Furthermore, we conduct real-world indoor navigation experiments where the robot can successfully traverse complex environments using route plans generated by pySpatial, highlighting the practical effectiveness of our approach.
[CV-185] VEMamba: Efficient Isotropic Reconstruction of Volume Electron Microscopy with Axial-Lateral Consistent Mamba
【速读】:该论文旨在解决体积电子显微镜(Volume Electron Microscopy, VEM)成像中因各向异性数据导致的轴向分辨率低的问题,这一问题严重影响了三维组织结构的可视化与后续分析。现有各向同性重建方法常忽略丰富的轴向信息,且采用简单下采样模拟各向异性数据,难以实现高质量重建。其解决方案的关键在于提出VEMamba框架,核心创新为一种新颖的3D依赖重排序(3D Dependency Reordering)范式,通过两个关键组件实现:一是轴向-横向分块选择性扫描模块(Axial-Lateral Chunking Selective Scan Module, ALCSSM),将复杂的三维空间依赖关系(包括轴向和横向)智能映射为优化的一维序列,以高效进行基于Mamba的建模并显式保证轴向-横向一致性;二是动态权重聚合模块(Dynamic Weights Aggregation Module, DWAM),自适应融合重排序后的序列输出以增强表征能力。此外,引入真实退化模拟并结合动量对比学习(Momentum Contrast, MoCo)将退化感知知识融入网络,显著提升重建性能。
链接: https://arxiv.org/abs/2603.00887
作者: Longmi Gao,Pan Gao
机构: Nanjing University of Aeronautics and Astronautics (南京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Volume Electron Microscopy (VEM) is crucial for 3D tissue imaging but often produces anisotropic data with poor axial resolution, hindering visualization and downstream analysis. Existing methods for isotropic reconstruction often suffer from neglecting abundant axial information and employing simple downsampling to simulate anisotropic data. To address these limitations, we propose VEMamba, an efficient framework for isotropic reconstruction. The core of VEMamba is a novel 3D Dependency Reordering paradigm, implemented via two key components: an Axial-Lateral Chunking Selective Scan Module (ALCSSM), which intelligently re-maps complex 3D spatial dependencies (both axial and lateral) into optimized 1D sequences for efficient Mamba-based modeling, explicitly enforcing axial-lateral consistency; and a Dynamic Weights Aggregation Module (DWAM) to adaptively aggregate these reordered sequence outputs for enhanced representational power. Furthermore, we introduce a realistic degradation simulation and then leverage Momentum Contrast (MoCo) to integrate this degradation-aware knowledge into the network for superior reconstruction. Extensive experiments on both simulated and real-world anisotropic VEM datasets demonstrate that VEMamba achieves highly competitive performance across various metrics while maintaining a lower computational footprint. The source code is available on GitHub: this https URL
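The core "3D Dependency Reordering" idea above — re-mapping a 3D volume into 1D sequences so that both axial and lateral neighbours become adjacent tokens for a sequence model — can be illustrated with a minimal sketch. This is an assumed, simplified interpretation of the ALCSSM scan orders, not the paper's implementation:

```python
import numpy as np

def axial_lateral_sequences(vol, chunk=2):
    """Re-map a 3D volume (D, H, W) into two 1D token sequences:
    a lateral scan (row-major within each slice) and an axial scan
    (depth-first within chunk x chunk lateral blocks), so a 1D sequence
    model sees in-plane and through-plane neighbours as nearby tokens."""
    D, H, W = vol.shape
    assert H % chunk == 0 and W % chunk == 0
    # Lateral scan: slice by slice, row-major within each slice.
    lateral = vol.reshape(-1)
    # Axial scan: split H and W into chunks; within each chunk, emit all
    # depths before moving on, keeping axial neighbours adjacent.
    blocks = vol.reshape(D, H // chunk, chunk, W // chunk, chunk)
    axial = blocks.transpose(1, 3, 2, 4, 0).reshape(-1)
    return lateral, axial
```

Running a selective-scan model over both orderings (and aggregating the outputs, as DWAM does adaptively) is what enforces axial-lateral consistency.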
[CV-186] Uncertainty-Aware Concept and Motion Segmentation for Semi-Supervised Angiography Videos
【速读】:该论文旨在解决X射线冠状动脉造影(X-ray coronary angiography, XCA)序列中主冠状动脉分割的难题,该任务在冠状动脉疾病诊断中至关重要,但受限于边界模糊、辐射对比度不一致、复杂运动模式以及标注数据稀缺等问题。传统半监督学习(Semi-Supervised Learning, SSL)方法难以应对时间动态复杂性和不可靠的不确定性量化。为此,作者提出了一种基于SAM3的教师-学生框架(SMART),其关键创新在于:一是利用SAM3的可提示概念分割设计构建教师-学生架构以最大化模型性能潜力;二是引入基于血管掩膜变形(vessel mask warping)和运动一致性损失(motion consistency loss)的机制来建模复杂的血管动态;三是提出渐进式置信度感知一致性正则化(progressive confidence-aware consistency regularization),有效缓解因边界模糊和低对比度导致的教师预测不可靠问题。实验表明,SMART在三个不同机构的数据集上均达到最优性能,且显著降低标注需求,适用于标注稀缺的临床场景。
链接: https://arxiv.org/abs/2603.00881
作者: Yu Luo,Guangyu Wei,Yangfan Li,Jieyu He,Yueming Lyu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 figures
Abstract:Segmentation of the main coronary artery from X-ray coronary angiography (XCA) sequences is crucial for the diagnosis of coronary artery diseases. However, this task is challenging due to issues such as blurred boundaries, inconsistent radiation contrast, complex motion patterns, and a lack of annotated images for training. Although Semi-Supervised Learning (SSL) can alleviate the annotation burden, conventional methods struggle with complicated temporal dynamics and unreliable uncertainty quantification. To address these challenges, we propose SAM3-based Teacher-student framework with Motion-Aware consistency and Progressive Confidence Regularization (SMART), a semi-supervised vessel segmentation approach for X-ray angiography videos. First, our method utilizes SAM3’s unique promptable concept segmentation design and innovates a SAM3-based teacher-student framework to maximize the performance potential of both the teacher and the student. Second, we enhance segmentation by integrating the vessel mask warping technique and motion consistency loss to model complex vessel dynamics. To address the issue of unreliable teacher predictions caused by blurred boundaries and minimal contrast, we further propose a progressive confidence-aware consistency regularization to mitigate the risk of unreliable outputs. Extensive experiments on three datasets of XCA sequences from different institutions demonstrate that SMART achieves state-of-the-art performance while requiring significantly fewer annotations, making it particularly valuable for real-world clinical applications where labeled data is scarce. Our code is available at: this https URL.
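The teacher-student core of SMART follows the standard semi-supervised recipe: the teacher tracks the student by exponential moving average, and a consistency loss is applied only where the teacher is confident. A minimal numpy sketch, assuming a fixed confidence threshold (the paper's regularization is progressive, with a threshold schedule omitted here):

```python
import numpy as np

def ema_update(teacher_w, student_w, momentum=0.99):
    """Teacher weights track the student by exponential moving average."""
    return momentum * teacher_w + (1.0 - momentum) * student_w

def masked_consistency_loss(student_prob, teacher_prob, conf_thresh=0.8):
    """Mean squared consistency between student and teacher vessel
    probabilities, applied only where the teacher is confident --
    a stand-in for progressive confidence-aware regularization."""
    conf = np.maximum(teacher_prob, 1.0 - teacher_prob)  # binary-seg confidence
    mask = conf >= conf_thresh
    if not mask.any():
        return 0.0
    return float(np.mean((student_prob[mask] - teacher_prob[mask]) ** 2))
```

The motion-consistency term would additionally warp the previous frame's mask into the current frame before comparison.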
[CV-187] MMTA: Multi-Membership Temporal Attention for Fine-Grained Stroke Rehabilitation Assessment
【速读】:该论文旨在解决康复评估中细粒度动作分割的精度问题,特别是如何在保持运动训练上下文的同时捕捉亚秒级微动作(sub-second micro-movements),以提升对患者运动恢复状态的可靠下游评估。现有时间动作分割(Temporal Action Segmentation, TAS)模型因难以区分快速相位转换边界而限制了评估准确性。其解决方案的关键在于提出多成员时间注意力机制(Multi-Membership Temporal Attention, MMTA),该机制允许每一帧在同层内同时关注多个局部归一化的时域注意力窗口,并通过特征空间重叠解析融合这些并发时域视角,在不增加网络深度或分阶段优化的前提下增强边界敏感性,从而实现高分辨率、高精度的动作分割,适用于视频与可穿戴惯性测量单元(IMU)输入的统一单阶段架构。
链接: https://arxiv.org/abs/2603.00878
作者: Halil Ismail Helvaci,Justin Huber,Jihye Bae,Sen-ching Samson Cheung
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:To empower the iterative assessments involved during a person’s rehabilitation, automated assessment of a person’s abilities during daily activities requires temporally precise segmentation of fine-grained actions in therapy videos. Existing temporal action segmentation (TAS) models struggle to capture sub-second micro-movements while retaining exercise context, blurring rapid phase transitions and limiting reliable downstream assessment of motor recovery. We introduce Multi-Membership Temporal Attention (MMTA), a high-resolution temporal transformer for fine-grained rehabilitation assessment. Unlike standard temporal attention, which assigns each frame a single attention context per layer, MMTA lets each frame attend to multiple locally normalized temporal attention windows within the same layer. We fuse these concurrent temporal views via feature-space overlap resolution, preserving competing local contexts near transitions while enabling longer-range reasoning through layer-wise propagation. This increases boundary sensitivity without additional depth or multi-stage refinement. MMTA supports both video and wearable IMU inputs within a unified single-stage architecture, making it applicable to both clinical and home settings. MMTA consistently improves over the Global Attention transformer, boosting Edit Score by +1.3 (Video) and +1.6 (IMU) on StrokeRehab while further improving 50Salads by +3.3. Ablations confirm that performance gains stem from multi-membership temporal views rather than architectural complexity, offering a practical solution for resource-constrained rehabilitation assessment.
[CV-188] PPC-MT: Parallel Point Cloud Completion with Mamba-Transformer Hybrid Architecture
【速读】:该论文旨在解决点云补全(Point Cloud Completion)中高质量重建与计算效率难以兼顾的问题。其解决方案的关键在于提出了一种基于混合Mamba-Transformer架构的并行框架PPC-MT,通过主成分分析(PCA)引导的并行补全策略,将无序点云转化为具有几何意义的有序集合,并分解为多个子集进行并行重建;同时,利用Mamba的线性复杂度实现高效编码特征提取,结合Transformer在解码阶段对细粒度多序列关系的建模能力,从而在保持计算效率的同时显著提升点云分布均匀性和细节保真度。
链接: https://arxiv.org/abs/2603.00870
作者: Jie Li,Shengwei Tian,Long Yu,Xin Ning
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Submitted to IEEE TPAMI
Abstract:Existing point cloud completion methods struggle to balance high-quality reconstruction with computational efficiency. To address this, we propose PPC-MT, a novel parallel framework for point cloud completion leveraging a hybrid Mamba-Transformer architecture. Our approach introduces an innovative parallel completion strategy guided by Principal Component Analysis (PCA), which imposes a geometrically meaningful structure on unordered point clouds, transforming them into ordered sets and decomposing them into multiple subsets. These subsets are reconstructed in parallel using a multi-head reconstructor. This structured parallel synthesis paradigm significantly enhances the uniformity of point distribution and detail fidelity, while preserving computational efficiency. By integrating Mamba’s linear complexity for efficient feature extraction during encoding with the Transformer’s capability to model fine-grained multi-sequence relationships during decoding, PPC-MT effectively balances efficiency and reconstruction accuracy. Extensive quantitative and qualitative experiments on benchmark datasets, including PCN, ShapeNet-55/34, and KITTI, demonstrate that PPC-MT outperforms state-of-the-art methods across multiple metrics, validating the efficacy of our proposed framework.
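The PCA-guided parallel strategy can be sketched as follows: order the unordered points along the first principal axis, then deal them round-robin into subsets so each subset spans the full shape and can be completed by a separate reconstructor head. A minimal sketch of the decomposition step only (the hybrid Mamba-Transformer itself is not shown):

```python
import numpy as np

def pca_order_and_split(points, n_subsets=4):
    """Order a point cloud (N, 3) along its first principal axis, then
    deal the ordered points round-robin into n_subsets so each subset
    covers the full extent of the shape and can be reconstructed in
    parallel (a sketch of PCA-guided structured decomposition)."""
    centered = points - points.mean(axis=0)
    # First principal direction from the SVD of the centered cloud.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    order = np.argsort(centered @ vt[0])
    ordered = points[order]
    return [ordered[i::n_subsets] for i in range(n_subsets)]
```

Because every subset interleaves along the principal axis, merging the parallel reconstructions naturally yields a uniform point distribution.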
[CV-189] Geometry OR Tracker: Universal Geometric Operating Room Tracking
【速读】:该论文旨在解决手术室(Operating Room, OR)中多视角三维跟踪因相机标定与RGB-D配准不准确导致的几何不一致性问题,该问题会引发融合过程中的“鬼影”现象并降低共享世界坐标系下的3D轨迹精度。解决方案的关键在于提出一个两阶段流程:首先通过多视角度量几何校正模块(Multi-view Metric Geometry Rectification module)将不精确的标定结果统一为具有全局尺度一致性和几何一致性的相机配置;随后在统一的OR世界坐标系中执行抗遮挡的3D点跟踪(Occlusion-Robust 3D Point Tracking)。实验证明,该方法显著降低了跨视角深度差异(超过30倍),且几何一致性提升直接增强了世界帧中的跟踪性能。
链接: https://arxiv.org/abs/2603.00560
作者: Yihua Shao,Kang Chen,Feng Xue,Siyu Chen,Long Bai,Hongyuan Yu,Hao Tang,Jinlin Wu,Nassir Navab
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:In operating rooms (OR), world-scale multi-view 3D tracking supports downstream applications such as surgeon behavior recognition, where physically meaningful quantities such as distances and motion statistics must be measured in meters. However, real clinical deployments rarely satisfy the geometric prerequisites for stable multi-view fusion and tracking: camera calibration and RGB-D registration are always unreliable, leading to cross-view geometric inconsistency that produces “ghosting” during fusion and degrades 3D trajectories in a shared OR coordinate frame. To address this, we introduce Geometry OR Tracker, a two-stage pipeline that first rectifies imprecise calibration into a scaleconsistent and geometrically consistent camera setup with a single global scale via a Multi-view Metric Geometry Rectification module, and then performs Occlusion-Robust 3D Point Tracking directly in the unified OR world frame. On the MM-OR benchmark, improved geometric consistency translates into tracking gains: our rectification front-end reduces cross-view depth disagreement by more than 30 \times compared to raw calibration. Ablation studies further demonstrate the relationship between calibration quality and tracking accuracy, showing that improved geometric consistency yields stronger world-frame tracking.
[CV-190] Weakly Supervised Video Anomaly Detection with Anomaly-Connected Components and Intention Reasoning CVPR2026
【速读】:该论文旨在解决弱监督视频异常检测(Weakly Supervised Video Anomaly Detection, WS-VAD)中因缺乏密集帧级标注而导致异常语义学习困难的问题。现有方法在仅依赖视频级标签的情况下难以准确区分正常与异常行为,尤其在相似场景下(如“取物”与“偷窃”),模型易混淆。解决方案的关键在于提出LAS-VAD框架,其核心创新包括:(1)异常连通组件机制(Anomaly-Connected Component Mechanism),用于将视频帧划分为语义一致的组别,提升局部语义一致性建模能力;(2)意图感知机制(Intention Awareness Mechanism),通过区分行为意图增强对相似动作的判别力;(3)引入异常属性信息(Anomaly Attribute Information),利用异常事件特有的特征(如爆炸伴随火焰和浓烟)引导更精准的检测。实验表明,该方法在XD-Violence和UCF-Crime两个基准数据集上显著优于当前最优方法。
链接: https://arxiv.org/abs/2603.00550
作者: Yu Wang,Shengjie Zhao
机构: Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Weakly supervised video anomaly detection (WS-VAD) involves identifying the temporal intervals that contain anomalous events in untrimmed videos, where only video-level annotations are provided as supervisory signals. However, a key limitation persists in WS-VAD, as dense frame-level annotations are absent, which often leaves existing methods struggling to learn anomaly semantics effectively. To address this issue, we propose a novel framework named LAS-VAD, short for Learning Anomaly Semantics for WS-VAD, which integrates anomaly-connected component mechanism and intention awareness mechanism. The former is designed to assign video frames into distinct semantic groups within a video, and frame segments within the same group are deemed to share identical semantic information. The latter leverages an intention-aware strategy to distinguish between similar normal and abnormal behaviors (e.g., taking items and stealing). To further model the semantic information of anomalies, as anomaly occurrence is accompanied by distinct characteristic attributes (i.e., explosions are characterized by flames and thick smoke), we additionally incorporate anomaly attribute information to guide accurate detection. Extensive experiments on two benchmark datasets, XD-Violence and UCF-Crime, demonstrate that our LAS-VAD outperforms current state-of-the-art methods with remarkable gains.
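The anomaly-connected component mechanism groups temporally adjacent frames that share semantics. A minimal stand-in, assuming groups are formed by thresholding cosine similarity between consecutive frame features (the paper's grouping criterion may differ):

```python
import numpy as np

def anomaly_connected_components(frame_feats, sim_thresh=0.9):
    """Assign consecutive frames to the same semantic group when their
    cosine feature similarity exceeds a threshold; frames in one
    component are then treated as sharing anomaly semantics."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    labels = np.zeros(len(f), dtype=int)
    for i in range(1, len(f)):
        same = f[i] @ f[i - 1] >= sim_thresh
        labels[i] = labels[i - 1] if same else labels[i - 1] + 1
    return labels
```

Video-level supervision can then be propagated at the component level instead of per frame, which is what eases the lack of dense annotations.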
[CV-191] Advancing Multimodal Judge Models through a Capability-Oriented Benchmark and MCTS-Driven Data Generation
【速读】:该论文旨在解决当前多模态大语言模型作为评判者(MLLM-as-a-judge)系统在评估能力与可靠性方面存在的不足,特别是现有基准测试仅按任务类型分类样本,未能深入刻画模型进行可靠判断所需的核心能力。其关键解决方案是提出M-JudgeBench这一十维能力导向的评测基准,将评估任务细分为成对思维链(Chain-of-Thought, CoT)比较、长度偏倚规避和过程错误检测等十个子任务,从而实现对模型在不同推理风格、响应长度及跨模型差异下的可靠性诊断;同时设计Judge-MCTS数据构建框架,生成具有多样正确性和长度的成对推理轨迹,并基于此训练出M-Judger系列强判别模型,显著提升了在多个基准上的表现,为未来判别模型评估与以能力为导向的训练提供了更扎实的方法论基础。
链接: https://arxiv.org/abs/2603.00546
作者: Zeyu Chen,Huanjin Yao,Ziwang Zhao,Min Yang
机构: Tsinghua University (清华大学); ByteDance (字节跳动)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Using Multimodal Large Language Models (MLLMs) as judges to achieve precise and consistent evaluations has gradually become an emerging paradigm across various domains. Evaluating the capability and reliability of MLLM-as-a-judge systems is therefore essential for ensuring trustworthy assessment. Existing judge benchmarks categorize samples by task types but fail to capture the fundamental judgment capabilities required for reliable evaluation. In this work, we introduce M-JudgeBench, a ten-dimensional capability-oriented benchmark designed to comprehensively assess the judgment abilities of MLLMs. Our benchmark decomposes evaluation into pairwise Chain-of-Thought (CoT) comparison, length bias avoidance, and process error detection tasks, jointly covering ten fine-grained subtasks. This design enables diagnosis of model reliability across reasoning styles, response lengths, and cross-model variations. Systematic evaluation uncovers the systematic weaknesses in existing MLLM-as-a-judge systems. To address this issue, we further propose Judge-MCTS, a data construction framework generating pairwise reasoning trajectories with various correctness and length. Using Judge-MCTS, we construct an MCTS-augmented dataset and train M-Judger, a series of strong judge models. Extensive experiments demonstrate the superiority of M-Judger on existing judge benchmarks as well as M-JudgeBench. Overall, our work establishes a more principled foundation for evaluating MLLM-as-a-judge through M-JudgeBench and Judge-MCTS framework, paving the way for future research on judge model evaluation and capability-driven judge training.
[CV-192] Multiple Inputs and Mixed Data for Alzheimer's Disease Classification Based on 3D Vision Transformer
【速读】:该论文旨在解决当前基于磁共振成像(MRI)诊断阿尔茨海默病(Alzheimer’s Disease, AD)方法中存在的三大局限:一是多数研究采用二维变换器(2D Transformer)独立分析单个脑切片,导致丢失关键的三维空间上下文信息;二是基于感兴趣区域(Region of Interest, ROI)的模型仅关注少数脑区,而AD实际影响多个脑区;三是分类模型通常依赖单一数据源,缺乏多模态融合以提升诊断准确性。其解决方案的关键在于提出一种新型多输入混合数据三维视觉变换器(Multiple Inputs and Mixed Data 3D Vision Transformer, MIMD-3DVT),该方法通过联合处理连续脑切片以保留三维特征与空间关系,融合多个3D ROI影像输入,并集成人口统计学、认知评估和脑影像等多源异构数据,从而显著提升对正常认知与阿尔茨海默病的区分准确率(达97.14%)。
链接: https://arxiv.org/abs/2603.00545
作者: Juan A. Castro-Silva,Maria N. Moreno Garcia,Diego H. Peluffo-Ordoñez
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The current methods for diagnosing Alzheimer's Disease using Magnetic Resonance Imaging (MRI) have significant limitations. Many previous studies used 2D Transformers to analyze individual brain slices independently, potentially losing critical 3D contextual information. Region-of-interest-based models often focus on only a few brain regions despite Alzheimer's affecting multiple areas. Additionally, most classification models rely on a single test, whereas diagnosing Alzheimer's requires a multifaceted approach integrating diverse data sources for a more accurate assessment. This study introduces a novel methodology called the Multiple Inputs and Mixed Data 3D Vision Transformer (MIMD-3DVT). This method processes consecutive slices together to capture the feature dimensions and spatial information, fuses multiple 3D ROI imaging data inputs, and integrates mixed data from demographic factors, cognitive assessments, and brain imaging. The proposed methodology was experimentally evaluated using a combined dataset that included the Alzheimer’s Disease Neuroimaging Initiative (ADNI), the Australian Imaging, Biomarker, and Lifestyle Flagship Study of Ageing (AIBL), and the Open Access Series of Imaging Studies (OASIS). Our MIMD-3DVT, utilizing single or multiple ROIs, achieved an accuracy of 97.14%, outperforming the state-of-the-art methods in distinguishing between Normal Cognition and Alzheimer's Disease.
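The mixed-data integration step amounts to fusing image embeddings with tabular features. A generic late-fusion sketch (the paper fuses inside a 3D Vision Transformer; the concatenate-and-standardize scheme here is an illustrative assumption):

```python
import numpy as np

def mixed_data_fusion(roi_embeddings, demographics, cognitive_scores):
    """Late fusion for a mixed-data classifier: concatenate per-ROI 3D
    image embeddings with standardized tabular features (demographic
    factors, cognitive assessments) into one vector per subject for a
    downstream classification head."""
    img = np.concatenate(roi_embeddings, axis=-1)             # multi-ROI fusion
    tab = np.concatenate([demographics, cognitive_scores], axis=-1)
    tab = (tab - tab.mean(axis=0)) / (tab.std(axis=0) + 1e-8)  # z-score tabular
    return np.concatenate([img, tab], axis=-1)
```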
[CV-193] Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark CVPR2026
【速读】:该论文旨在解决现有全色锐化(pansharpening)方法在低分辨率训练环境下评估导致的跨尺度泛化能力不足的问题,即模型难以适应真实世界中高分辨率的实际应用场景。其解决方案的关键在于提出ScaleFormer架构:该架构将不同分辨率下的图像泛化问题转化为序列长度的泛化问题,通过Scale-Aware Patchify模块实现固定尺寸裁剪下对不同尺度图像的统一处理;同时,它解耦了patch内部空间特征学习与patch间序列依赖建模,并引入旋转位置编码(Rotary Positional Encoding)以增强对未见尺度的外推能力,从而显著提升融合质量与跨尺度泛化性能。
链接: https://arxiv.org/abs/2603.00543
作者: Ke Cao,Xuanhua He,Xueheng Li,Lingting Zhu,Yingying Wang,Ao Ma,Zhanjie Zhang,Man Zhou,Chengjun Xie,Jie Zhang
机构: HFIPS, Chinese Academy of Sciences; University of Science and Technology of China; The Hong Kong University of Science and Technology; The University of Hong Kong; Xiamen University; JD.com; Zhejiang University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Pansharpening aims to generate high-resolution multi-spectral images by fusing the spatial detail of panchromatic images with the spectral richness of low-resolution MS data. However, most existing methods are evaluated under limited, low-resolution settings, limiting their generalization to real-world, high-resolution scenarios. To bridge this gap, we systematically investigate the data, algorithmic, and computational challenges of cross-scale pansharpening. We first introduce PanScale, the first large-scale, cross-scale pansharpening dataset, accompanied by PanScale-Bench, a comprehensive benchmark for evaluating generalization across varying resolutions and scales. To realize scale generalization, we propose ScaleFormer, a novel architecture designed for multi-scale pansharpening. ScaleFormer reframes generalization across image resolutions as generalization across sequence lengths: it tokenizes images into patch sequences of the same resolution but variable length proportional to image scale. A Scale-Aware Patchify module enables training for such variations from fixed-size crops. ScaleFormer then decouples intra-patch spatial feature learning from inter-patch sequential dependency modeling, incorporating Rotary Positional Encoding to enhance extrapolation to unseen scales. Extensive experiments show that our approach outperforms SOTA methods in fusion quality and cross-scale generalization. The datasets and source code are available upon acceptance.
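The reframing of resolution generalization as sequence-length generalization rests on one simple property: with a fixed patch resolution, token dimensionality stays constant while token count grows with image scale. A minimal sketch of that tokenization (not the paper's Scale-Aware Patchify module, which additionally trains from fixed-size crops):

```python
import numpy as np

def scale_aware_patchify(img, patch=4):
    """Tokenize an (H, W, C) image into fixed-resolution patches: the
    token dimension patch*patch*C is scale-invariant, while the number
    of tokens (H//patch)*(W//patch) grows with image scale, so a
    positional scheme like RoPE only has to extrapolate over length."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    tokens = img.reshape(H // patch, patch, W // patch, patch, C)
    tokens = tokens.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return tokens
```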
[CV-194] Adaptive Dynamic Dehazing via Instruction-Driven and Task-Feedback Closed-Loop Optimization for Diverse Downstream Task Adaptation AAAI2026
【速读】:该论文旨在解决真实场景中雾霾去除任务不仅要提升图像可见度,还需满足多种下游视觉任务特定需求的问题。传统方法难以适应不同应用场景的差异化要求,导致去雾效果与下游任务性能脱节。解决方案的关键在于提出一种自适应动态去雾框架,其核心是引入闭环优化机制,包含两个互补创新模块:一是基于多下游任务性能反馈的动态调制机制,实现去雾输出的实时优化;二是文本指令接口,允许用户通过高阶任务描述引导去雾行为。这种双引导策略使模型在训练完成后仍能根据任务需求和用户指令灵活调整去雾输出,从而实现与下游应用的协同优化,显著提升了去雾结果的针对性、鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2603.00542
作者: Yafei Zhang,Shuaitian Song,Huafeng Li,Shujuan Wang,Yu Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Aceepted by AAAI2026
Abstract:In real-world vision systems,haze removal is required not only to enhance image visibility but also to meet the specific needs of diverse downstream this http URL address this challenge,we propose a novel adaptive dynamic dehazing framework that incorporates a closed-loop optimization this http URL enables feedback-driven refinement based on downstream task performance and user instruction-guided adjustment during inference,allowing the model to satisfy the specific requirements of multiple downstream tasks without this http URL,our framework integrates two complementary and innovative mechanisms: (1)a task feedback loop that dynamically modulates dehazing outputs based on performance across multiple downstream tasks,and (2) a text instruction interface that allows users to specify high-level task this http URL dual-guidance strategy enables the model to adapt its dehazing behavior after training,tailoring outputs in real time to the evolving needs of multiple this http URL experiments across various vision tasks demonstrate the strong effectiveness,robustness,and generalizability of our this http URL results establish a new paradigm for interactive,task-adaptive dehazing that actively collaborates with downstream applications.
[CV-195] RAFM: Retrieval-Augmented Flow Matching for Unpaired CBCT-to-CT Translation
【速读】:该论文旨在解决在放射治疗中,锥形束CT(Cone-beam CT, CBCT)因严重伪影和不可靠的亨氏单位(Hounsfield Unit, HU)值而难以直接用于剂量计算的问题。为此,研究者提出通过从CBCT生成合成CT(synthetic CT, sCT)来实现更准确的剂量计算,但面临的关键挑战是配对CBCT-CT数据常因时间间隔、解剖变异及配准误差而不可用或不可靠,导致传统成对学习方法失效。解决方案的核心在于引入检索增强型流匹配(Retrieval-Augmented Flow Matching, RAFM),其关键创新是利用冻结的DINOv3编码器与全局CT记忆库构建检索引导的伪配对样本,从而提升无配对流模型训练中的分布耦合质量并稳定监督信号,显著改善了小样本医学图像翻译的性能。
链接: https://arxiv.org/abs/2603.00535
作者: Xianhao Zhou,Jianghao Wu,Lanfeng Zhong,Ku Zhao,Jinlong He,Shaoting Zhang,Guotai Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cone-beam CT (CBCT) is routinely acquired in radiotherapy but suffers from severe artifacts and unreliable Hounsfield Unit (HU) values, limiting its direct use for dose calculation. Synthetic CT (sCT) generation from CBCT is therefore an important task, yet paired CBCT–CT data are often unavailable or unreliable due to temporal gaps, anatomical variation, and registration errors. In this work, we introduce rectified flow (RF) into unpaired CBCT-to-CT translation in medical imaging. Although RF is theoretically compatible with unpaired learning through distribution-level coupling and deterministic transport, its practical effectiveness under small medical datasets and limited batch sizes remains underexplored. Direct application with random or batch-local pseudo pairing can produce unstable supervision due to semantically mismatched endpoint samples. To address this challenge, we propose Retrieval-Augmented Flow Matching (RAFM), which adapts RF to the medical setting by constructing retrieval-guided pseudo pairs using a frozen DINOv3 encoder and a global CT memory bank. This strategy improves empirical coupling quality and stabilizes unpaired flow-based training. Experiments on SynthRAD2023 under a strict subject-level true-unpaired protocol show that RAFM outperforms existing methods across FID, MAE, SSIM, PSNR, and SegScore. The code is available at this https URL.
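The retrieval-guided pseudo-pairing plus the rectified-flow training target can be sketched in a few lines. Feature similarity here stands in for the frozen DINOv3 encoder, and the straight-path interpolation with constant velocity target is the standard rectified-flow formulation:

```python
import numpy as np

def retrieval_flow_targets(cbct_feat, cbct_img, ct_bank_feats, ct_bank_imgs, t):
    """For each CBCT sample, retrieve the closest CT from a global memory
    bank by cosine similarity on frozen encoder features, then form the
    rectified-flow pair: x_t = (1-t)*cbct + t*ct, velocity v = ct - cbct."""
    a = cbct_feat / np.linalg.norm(cbct_feat, axis=1, keepdims=True)
    b = ct_bank_feats / np.linalg.norm(ct_bank_feats, axis=1, keepdims=True)
    idx = (a @ b.T).argmax(axis=1)          # nearest CT per CBCT sample
    ct = ct_bank_imgs[idx]
    x_t = (1.0 - t) * cbct_img + t * ct     # point on the straight path
    v = ct - cbct_img                       # constant RF velocity target
    return x_t, v, idx
```

The network is then regressed onto v at sampled t, which is what makes the pairing quality (here, the retrieval) decisive for training stability.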
[CV-196] CaptionFool: Universal Image Captioning Model Attacks
【速读】:该论文旨在解决生成式视觉语言模型(Vision-Language Models)在实际部署中面临的对抗攻击脆弱性问题,特别是针对基于Transformer架构的图像描述生成模型(Image Captioning Models)。其解决方案的关键在于提出了一种新型通用对抗攻击方法CaptionFool,该方法通过仅修改图像中7个(约1.2%)的patch区域,即可实现高达94–96%的成功率,诱导模型生成任意指定的目标文本,包括恶意内容;此外,该方法还能生成特定“俚语”表达以规避现有的内容过滤机制,从而揭示当前视觉语言模型在安全性和鲁棒性方面的严重漏洞。
链接: https://arxiv.org/abs/2603.00529
作者: Swapnil Parekh
机构: Intuit
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Image captioning models are encoder-decoder architectures trained on large-scale image-text datasets, making them susceptible to adversarial attacks. We present CaptionFool, a novel universal (input-agnostic) adversarial attack against state-of-the-art transformer-based captioning models. By modifying only 7 out of 577 image patches (approximately 1.2% of the image), our attack achieves 94-96% success rate in generating arbitrary target captions, including offensive content. We further demonstrate that CaptionFool can generate “slang” terms specifically designed to evade existing content moderation filters. Our findings expose critical vulnerabilities in deployed vision-language models and underscore the urgent need for robust defenses against such attacks. Warning: This paper contains model outputs which are offensive in nature.
[CV-197] P-Spikformer: Token Pruned Spiking Transformer
【速读】:该论文旨在解决当前基于脉冲神经网络(Spiking Neural Networks, SNNs)的Transformer模型在资源受限设备上部署时面临的计算与存储开销过大的问题。尽管近年来spiking transformers在准确性方面取得进展,但其大规模架构导致显著的资源消耗,限制了实际应用。解决方案的关键在于提出一种名为TP-Spikformer的简单而有效的token剪枝方法:首先设计了一个综合考量时空信息保留能力的启发式重要性评估准则,用于量化token的信息价值;进而构建一个以块级早停策略为核心的剪枝框架,对低信息量token不直接移除而是延迟处理,从而在减少冗余计算的同时最大限度保留有效信息,实现高效且无训练依赖的轻量化部署。
链接: https://arxiv.org/abs/2603.00527
作者: Wenjie Wei,Xiaolong Zhou,Malu Zhang,Ammar Belatreche,Qian Sun,Yimeng Shan,Dehao Zhang,Zijian Zhou,Zeyu Ma,Yang Yang,Haizhou Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 7 figures
Abstract:Spiking neural networks (SNNs) offer an energy-efficient alternative to traditional neural networks due to their event-driven computing paradigm. However, recent advancements in spiking transformers have focused on improving accuracy with large-scale architectures, which require significant computational resources and limit deployment on resource-constrained devices. In this paper, we propose a simple yet effective token pruning method for spiking transformers, termed TP-Spikformer, that reduces storage and computational overhead while maintaining competitive performance. Specifically, we first introduce a heuristic spatiotemporal information-retaining criterion that comprehensively evaluates tokens’ importance, assigning higher scores to informative tokens for retention and lower scores to uninformative ones for pruning. Based on this criterion, we propose an information-retaining token pruning framework that employs a block-level early stopping strategy for uninformative tokens, instead of removing them outright. This also helps preserve more information during token pruning. We demonstrate the effectiveness, efficiency and scalability of TP-Spikformer through extensive experiments across diverse architectures, including Spikformer, QKFormer and Spike-driven Transformer V1 and V3, and a range of tasks such as image classification, object detection, semantic segmentation and event-based object tracking. Particularly, TP-Spikformer performs well in a training-free manner. These results reveal its potential as an efficient and practical solution for deploying SNNs in real-world applications with limited computational resources.
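A simple proxy for the spatiotemporal importance criterion is a token's firing activity accumulated over the T simulation time steps; low scorers are early-stopped rather than processed further. A hedged sketch (the paper's criterion is more elaborate than a plain spike count):

```python
import numpy as np

def retain_informative_tokens(spikes, keep_ratio=0.5):
    """Score each token by spikes accumulated over all time steps and
    channels (a simple spatiotemporal importance proxy), then keep the
    top fraction in their original order and early-stop the rest."""
    T, N, D = spikes.shape                        # time, tokens, channels
    score = spikes.sum(axis=(0, 2))               # total spikes per token
    k = max(1, int(round(N * keep_ratio)))
    keep = np.sort(np.argsort(score)[::-1][:k])   # top-k, original order
    return keep, spikes[:, keep, :]
```

Because the score needs no gradients, this kind of pruning can run training-free, consistent with the paper's deployment setting.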
[CV-198] Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation CVPR2026
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在三维网格(3D mesh)生成任务中训练效率低和泛化能力弱的问题。现有方法多依赖离线直接偏好优化(Direct Preference Optimization, DPO),存在训练效率低下且难以适应多样化场景的局限性。其解决方案的关键在于:首先提出首个异步在线RL框架,显著提升训练效率(比同步RL快3.75倍);其次设计优势引导的排名偏好优化(Advantage-guided Ranking Preference Optimization, ARPO)算法,在训练效率与泛化性能之间取得更好平衡;最终基于此构建Mesh-Pro系统,引入对角感知的混合三角-四边形标记化策略及基于光线的几何完整性奖励机制,从而实现艺术性和密集网格生成上的最先进性能。
链接: https://arxiv.org/abs/2603.00526
作者: Zhen Zhou,Jian Liu,Biwen Lei,Jing Xu,Haohan Weng,Yiling Zhu,Zhuo Chen,Junfeng Fan,Yunkai Ma,Dazhao Du,Song Guo,Fengshui Jing,Chunchao Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Reinforcement learning (RL) has demonstrated remarkable success in text and image generation, yet its potential in 3D generation remains largely unexplored. Existing attempts typically rely on offline direct preference optimization (DPO) method, which suffers from low training efficiency and limited generalization. In this work, we aim to enhance both the training efficiency and generation quality of RL in 3D mesh generation. Specifically, (1) we design the first asynchronous online RL framework tailored for 3D mesh generation post-training efficiency improvement, which is 3.75 \times faster than synchronous RL. (2) We propose Advantage-guided Ranking Preference Optimization (ARPO), a novel RL algorithm that achieves a better trade-off between training efficiency and generalization than current RL algorithms designed for 3D mesh generation, such as DPO and group relative policy optimization (GRPO). (3) Based on asynchronous ARPO, we propose Mesh-Pro, which additionally introduces a novel diagonal-aware mixed triangular-quadrilateral tokenization for mesh representation and a ray-based reward for geometric integrity. Mesh-Pro achieves state-of-the-art performance on artistic and dense meshes.
[CV-199] Jano: Adaptive Diffusion Generation with Early-stage Convergence Awareness
【速读】:该论文旨在解决扩散模型(Diffusion Models)在生成式 AI 中计算效率低下的问题,尤其是扩散 Transformer(DiT)因需进行密集的全注意力计算而带来的高资源消耗。现有加速方法多采用内容无关的均匀优化策略,忽略了不同区域在去噪过程中收敛行为的异质性。解决方案的关键在于提出一种无需训练的框架 JANO,其核心是两个创新:一是早期复杂度识别算法(early-stage complexity recognition algorithm),能够在初始去噪步骤中准确识别各区域的收敛需求;二是自适应标记调度运行时机制(adaptive token scheduling runtime),动态优化计算资源分配。该方法实现了平均 2.0 倍、最高达 2.4 倍的加速效果,同时保持生成质量不变,从而挑战了传统均匀处理假设并为大规模内容生成提供了实用加速方案。
链接: https://arxiv.org/abs/2603.00519
作者: Yuyang Chen,Linqian Zeng,Yijin Zhou,Hengjie Li,Jidong Zhai
机构: Shanghai Jiaotong University (上海交通大学); Shanghai AI Laboratory (上海人工智能实验室); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion models have achieved remarkable success in generative AI, yet their computational efficiency remains a significant challenge, particularly for Diffusion Transformers (DiTs) requiring intensive full-attention computation. While existing acceleration approaches focus on content-agnostic uniform optimization strategies, we observe that different regions in generated content exhibit heterogeneous convergence patterns during the denoising process. We present Jano, a training-free framework that leverages this insight for efficient region-aware generation. Jano introduces an early-stage complexity recognition algorithm that accurately identifies regional convergence requirements within initial denoising steps, coupled with an adaptive token scheduling runtime that optimizes computational resource allocation. Through comprehensive evaluation on state-of-the-art models, Jano achieves substantial acceleration (average 2.0 times speedup, up to 2.4 times) while preserving generation quality. Our work challenges conventional uniform processing assumptions and provides a practical solution for accelerating large-scale content generation. The source code of our implementation is available at this https URL.
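The early-stage convergence idea can be illustrated by scoring how much each spatial tile of the latent still changes between two early denoising steps; tiles with small change are treated as converged and scheduled with less compute. A sketch of the scoring only, not Jano's actual recognition algorithm or token scheduler:

```python
import numpy as np

def region_convergence(x_prev, x_curr, region=8):
    """Per-tile mean absolute change of an (H, W) latent between two
    denoising steps; low-change tiles are candidates for reduced
    computation in later steps."""
    diff = np.abs(x_curr - x_prev)
    H, W = diff.shape
    assert H % region == 0 and W % region == 0
    tiles = diff.reshape(H // region, region, W // region, region)
    return tiles.mean(axis=(1, 3))   # one convergence score per tile
```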
[CV-200] Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training
【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)在应用中因自注意力机制带来的二次时间复杂度问题,从而限制其效率与可扩展性。解决方案的关键在于引入一种新的线性时间序列建模方法——测试时训练(Test-Time Training, TTT),并提出Vision-TTT架构:通过一种新颖的自监督学习方式压缩视觉token序列,并结合双向扫描策略和Conv2d模块,有效将原始TTT扩展至建模具有全局感受野的二维视觉相关性。该方法在ImageNet分类任务上达到77.3%~82.5% Top-1准确率,同时在高分辨率下显著降低计算量(FLOPs减少79.4%)和内存占用(减少88.9%),且推理速度提升4.38倍,展现出卓越的表达能力与高效性。
链接: https://arxiv.org/abs/2603.00518
作者: Quan Kong,Yanru Xiao,Yuhao Shen,Cong Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Learning efficient and expressive visual representation has long been the pursuit of computer vision research. While Vision Transformers (ViTs) gradually replace traditional Convolutional Neural Networks (CNNs) as more scalable vision learners, their applications are plagued by the quadratic complexity of the self-attention mechanism. To address the challenge, we introduce a new linear-time sequence modeling method Test-Time Training (TTT) into vision and propose Vision-TTT, which compresses the visual token sequence in a novel self-supervised learning manner. By incorporating bidirectional scan strategy and the Conv2d module, Vision-TTT effectively extends vanilla TTT to model 2D visual correlations with global receptive fields. Extensive experiments show that Vittt-T/S/B achieve 77.3%, 81.2%, 82.5% Top-1 accuracy on ImageNet classification and also greatly outperform their counterparts on downstream tasks. At 1280x1280 resolution, Vittt-T reduces FLOPs by 79.4% and runs 4.38x faster with 88.9% less memory than DeiT-T. These results demonstrate the expressiveness and efficiency of Vision-TTT as a strong candidate for the next-generation generic visual backbone.
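The bidirectional scan strategy that gives each token a global receptive field can be shown generically: run a causal 1D scan over the token sequence in both directions and merge the two outputs. The toy `prefix_mean` below stands in for the TTT inner-loop update, which this sketch does not implement:

```python
import numpy as np

def bidirectional_scan(tokens, scan):
    """Apply a causal scan forward and backward over an (N, D) token
    sequence and average the outputs, so every token aggregates
    context from the whole sequence despite each pass being causal."""
    fwd = scan(tokens)
    bwd = scan(tokens[::-1])[::-1]
    return 0.5 * (fwd + bwd)

def prefix_mean(tokens):
    """Toy causal scan: each position sees the running mean so far."""
    csum = np.cumsum(tokens, axis=0)
    return csum / np.arange(1, len(tokens) + 1)[:, None]
```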
[CV-201] MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在从纯视觉输入中理解与推理4D时空演化能力方面的瓶颈问题。其关键解决方案在于提出一个名为MLLM-4D的完整框架,通过两个核心环节实现突破:一是构建高效的训练数据采集流程,将现有立体视频数据集转化为高质量的4D时空指令数据(包括用于监督微调的MLLM4D-2M和用于强化微调的MLLM4D-R1-30k数据集);二是设计无需修改模型架构的后训练策略,先通过监督微调建立基础4D感知能力,再利用分组相对策略优化(Group Relative Policy Optimization, GRPO)结合时空思维链(Spatiotemporal Chain of Thought, ST-CoT)提示和时空奖励函数(ST-reward)进一步催化模型的4D推理能力。实验表明,该方法能从纯2D RGB输入中实现最先进的时空理解与推理性能。
链接: https://arxiv.org/abs/2603.00515
作者: Xingyilang Yin,Chengzhengxu Li,Jiahao Chang,Chi-Man Pun,Xiaodong Cun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Humans are born with vision-based 4D spatial-temporal intelligence, which enables us to perceive and reason about the evolution of 3D space over time from purely visual inputs. Despite its importance, this capability remains a significant bottleneck for current multimodal large language models (MLLMs). To tackle this challenge, we introduce MLLM-4D, a comprehensive framework designed to bridge the gaps in training data curation and model post-training for spatiotemporal understanding and reasoning. On the data front, we develop a cost-efficient data curation pipeline that repurposes existing stereo video datasets into high-quality 4D spatiotemporal instructional data. This results in the MLLM4D-2M and MLLM4D-R1-30k datasets for Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT), alongside MLLM4D-Bench for comprehensive evaluation. Regarding model training, our post-training strategy establishes a foundational 4D understanding via SFT and further catalyzes 4D reasoning capabilities by employing Group Relative Policy Optimization (GRPO) with specialized Spatiotemporal Chain of Thought (ST-CoT) prompting and Spatiotemporal reward functions (ST-reward), without modifying the model architecture. Extensive experiments demonstrate that MLLM-4D achieves state-of-the-art spatial-temporal understanding and reasoning capabilities from purely 2D RGB inputs. Project page: this https URL.
[CV-202] Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding CVPR2026
【速读】:该论文旨在解决长视频中帧选择(frame selection)的问题,即在应用大视觉语言模型(Large Vision-Language Models, LVLMs)处理长视频时,由于帧冗余高和上下文窗口有限,传统方法仅基于查询相关性选取帧,导致所选帧缺乏叙事连贯性。其解决方案的关键在于提出一种无需训练的框架WFS-SB(Wavelet-based Frame Selection by Detecting Semantic Boundary),通过检测语义边界来捕捉视频中的关键叙事转变点(semantic shifts),而非单纯依赖高相关性。该方法利用小波变换(wavelet transform)对查询-帧相似度信号进行多尺度分解,在粗尺度上提取干净的语义变化信号,并以局部极值识别语义边界,从而将视频划分为结构连贯的片段;随后采用两阶段策略:首先按综合重要性评分分配每段帧预算,其次在每段内使用最大边际相关性(Maximal Marginal Relevance)选择多样且相关的帧,显著提升LVLM在长视频理解任务上的性能。
链接: https://arxiv.org/abs/2603.00512
作者: Wang Chen,Yuhui Zeng,Yongdong Luo,Tianyu Xie,Luojun Lin,Jiayi Ji,Yan Zhang,Xiawu Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026
Abstract:Frame selection is crucial due to high frame redundancy and limited context windows when applying Large Vision-Language Models (LVLMs) to long videos. Current methods typically select frames with high relevance to a given query, resulting in a disjointed set of frames that disregard the narrative structure of video. In this paper, we introduce Wavelet-based Frame Selection by Detecting Semantic Boundary (WFS-SB), a training-free framework that presents a new perspective: effective video understanding hinges not only on high relevance but, more importantly, on capturing semantic shifts - pivotal moments of narrative change that are essential to comprehending the holistic storyline of video. However, direct detection of abrupt changes in the query-frame similarity signal is often unreliable due to high-frequency noise arising from model uncertainty and transient visual variations. To address this, we leverage the wavelet transform, which provides an ideal solution through its multi-resolution analysis in both time and frequency domains. By applying this transform, we decompose the noisy signal into multiple scales and extract a clean semantic change signal from the coarsest scale. We identify the local extrema of this signal as semantic boundaries, which segment the video into coherent clips. Building on this, WFS-SB comprises a two-stage strategy: first, adaptively allocating a frame budget to each clip based on a composite importance score; and second, within each clip, employing the Maximal Marginal Relevance approach to select a diverse yet relevant set of frames. Extensive experiments show that WFS-SB significantly boosts LVLM performance, e.g., improving accuracy by 5.5% on VideoMME, 9.5% on MLVU, and 6.2% on LongVideoBench, consistently outperforming state-of-the-art methods.
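WFS-SB "小波去噪 + 粗尺度局部极值定位语义边界"的流程可用如下极简示意理解。注意:这是假设性草图,用多级 Haar 分解代替论文的小波变换实现细节,输入为人工构造的查询-帧相似度信号,函数名均为示例:

```python
import numpy as np

def haar_coarse(signal, levels=3):
    """多级 Haar 分解,仅保留最粗尺度的近似(低频)系数,起去噪作用。"""
    s = np.asarray(signal, dtype=float)
    for _ in range(levels):
        if len(s) % 2:                        # 长度补齐为偶数
            s = np.append(s, s[-1])
        s = (s[0::2] + s[1::2]) / np.sqrt(2)  # 低通分量
    return s

def semantic_boundaries(sim, levels=3):
    """在粗尺度信号上取局部极值,映射回原始帧索引作为语义边界。"""
    coarse = haar_coarse(sim, levels)
    idx = []
    for i in range(1, len(coarse) - 1):
        if (coarse[i] - coarse[i - 1]) * (coarse[i + 1] - coarse[i]) < 0:
            idx.append(i * 2 ** levels)       # 粗尺度位置 ≈ 原索引 × 2^levels
    return idx

# 构造一段相似度信号:在第 64 帧附近出现叙事转折(先升后降)
sim = np.concatenate([np.linspace(0.0, 1.0, 64, endpoint=False),
                      np.linspace(1.0, 0.0, 128)])
print(semantic_boundaries(sim, levels=3))     # → [64]
```

检测到的边界将视频切分为连贯片段,随后即可按片段分配帧预算并做 MMR 式的片段内选帧。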
[CV-203] Multimodal Adaptive Retrieval Augmented Generation through Internal Representation Learning
【速读】:该论文旨在解决视觉问答(Visual Question Answering, VQA)系统中因幻觉(hallucination)导致的可靠性问题,即模型生成的答案与视觉输入或事实知识不一致。现有基于检索增强生成(Retrieval Augmented Generation, RAG)的方法虽能引入外部知识以缓解该问题,但在视觉RAG场景下,静态检索常引入语义错误但视觉相似的无关信息,反而加剧误导。解决方案的关键在于提出多模态自适应RAG(Multimodal Adaptive RAG, MMA-RAG),其核心机制是通过一个基于层间分析训练的决策分类器,动态评估模型对内部知识的信心水平,并据此决定是否融合外部检索到的信息;该分类器利用联合的内部视觉与文本表征来指导反向图像检索策略,从而在不同模态场景下实现外部知识利用与推理鲁棒性的有效平衡。
链接: https://arxiv.org/abs/2603.00511
作者: Ruoshuang Du,Xin Sun,Qiang Liu,Bowen Song,Zhongqi Chen,Weiqiang Wang,Liang Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages, 6 figures
Abstract:Visual Question Answering systems face reliability issues due to hallucinations, where models generate answers misaligned with visual input or factual knowledge. While Retrieval Augmented Generation frameworks mitigate this issue by incorporating external knowledge, static retrieval often introduces irrelevant or conflicting content, particularly in visual RAG settings where visually similar but semantically incorrect evidence may be retrieved. To address this, we propose Multimodal Adaptive RAG (MMA-RAG), which dynamically assesses the confidence in the internal knowledge of the model to decide whether to incorporate the retrieved external information into the generation process. Central to MMA-RAG is a decision classifier trained through a layer-wise analysis, which leverages joint internal visual and textual representations to guide the use of reverse image retrieval. Experiments demonstrate that the model achieves a significant improvement in response performance on three VQA datasets. Meanwhile, ablation studies highlight the importance of internal representations in adaptive retrieval decisions. Overall, the experimental results demonstrate that MMA-RAG effectively balances external knowledge utilization and inference robustness in diverse multimodal scenarios.
[CV-204] What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models CVPR2026
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中视觉语义在内部结构和处理机制不清晰的问题,特别是视觉标记(visual tokens)如何被编码、压缩与利用。其关键解决方案在于提出一个双层分析框架及新型探测工具 EmbedLens,通过细粒度分析发现:输入层面的视觉标记可明确划分为“sink”(无意义)、“dead”(冗余)和“alive”(有意义)三类,其中仅约60%的“alive”标记承载图像特异性语义信息;进一步实验证明这些“alive”标记在进入语言模型前已包含丰富的细粒度线索(如物体、颜色、OCR等),且大多数标准任务无需依赖内部视觉计算模块(如视觉注意力和前馈网络)。对于少数高度依赖视觉的任务,“alive”标记自然对齐于中间语言模型层而非初始嵌入空间,表明浅层处理冗余,直接从中层注入即可满足需求。这一发现为高效、可解释的MLLM架构设计提供了统一机制:通过选择性剪枝、最小化视觉计算和中层注入实现优化。
链接: https://arxiv.org/abs/2603.00510
作者: Yingqi Fan,Junlong Tong,Anhao Zhao,Xiaoyu Shen
机构: Institute of Digital Twin, Eastern Institute of Technology, Ningbo(Ningbo数字孪生研究所,东方理工大学); Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative(宁波空间智能与数字衍生重点实验室); Shanghai Jiao Tong University(上海交通大学); The Hong Kong Polytechnic University(香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR2026
Abstract:Multimodal large language models (MLLMs) project visual tokens into the embedding space of language models, yet the internal structuring and processing of visual semantics remain poorly understood. In this work, we introduce a two-fold analytical framework featuring a novel probing tool, EmbedLens, to conduct a fine-grained analysis. We uncover a pronounced semantic sparsity at the input level: visual tokens consistently partition into sink, dead, and alive categories. Remarkably, only the alive tokens, comprising approximately 60% of the total input, carry image-specific meaning. Furthermore, using a targeted patch-compression benchmark, we demonstrate that these alive tokens already encode rich, fine-grained cues (e.g., objects, colors, and OCR) prior to entering the LLM. Internal visual computations (such as visual attention and feed-forward networks) are redundant for most standard tasks. For the small subset of highly vision-centric tasks that actually benefit from internal processing, we reveal that alive tokens naturally align with intermediate LLM layers rather than the initial embedding space, indicating that shallow-layer processing is unnecessary and that direct mid-layer injection is sufficient. Ultimately, our findings provide a unified mechanistic view of visual token processing, paving the way for more efficient and interpretable MLLM architectures through selective token pruning, minimized visual computation, and mid-layer injection. The code is released at: this https URL.
[CV-205] Hierarchical Classification for Improved Histopathology Image Analysis
【速读】:该论文旨在解决当前基于深度学习的全切片图像(Whole-slide image, WSI)分析方法在病理诊断中主要依赖扁平分类(flat classification),忽视类别标签之间层次关系的问题。其解决方案的关键在于提出HiClass框架,该框架基于多实例学习(Multiple Instance Learning, MIL)并引入双向特征融合机制,实现粗粒度与细粒度特征表示之间的信息交互,从而有效学习层级特征;同时设计了层次一致性损失、类内与类间距离损失及分组交叉熵损失等定制化损失函数,进一步优化层次化学习过程,显著提升WSI在粗粒度和细粒度分类任务上的性能。
链接: https://arxiv.org/abs/2603.00504
作者: Keunho Byeon,Jinsol Song,Seong Min Hong,Yosep Chong,Jin Tae Kwak
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Whole-slide image analysis is essential for diagnostic tasks in pathology, yet existing deep learning methods primarily rely on flat classification, ignoring hierarchical relationships among class labels. In this study, we propose HiClass, a hierarchical classification framework for improved histopathology image analysis that enhances both coarse-grained and fine-grained WSI classification. Built upon a multiple instance learning approach, HiClass extends it by introducing bidirectional feature integration that facilitates information exchange between coarse-grained and fine-grained feature representations, effectively learning hierarchical features. Moreover, we introduce tailored loss functions, including hierarchical consistency loss, intra- and inter-class distance loss, and group-wise cross-entropy loss, to further optimize hierarchical learning. We assess the performance of HiClass on a gastric biopsy dataset with 4 coarse-grained and 14 fine-grained classes, achieving superior performance for both coarse-grained and fine-grained classification. These results demonstrate the effectiveness of HiClass in improving WSI classification by capturing coarse-grained and fine-grained histopathological characteristics.
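其中"层次一致性损失"的一种常见实现思路,是把细粒度概率按所属粗类聚合后,与粗粒度头的预测对齐。以下为假设性示意(分组映射 group 与 KL 形式均为示例,并非论文公式):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def hierarchical_consistency_loss(coarse_logits, fine_logits, group):
    """细粒度概率按粗类聚合后,与粗粒度头预测做 KL 对齐(示意版)。
    group[f] 给出第 f 个细类所属的粗类索引(假设的映射)。"""
    p_coarse = softmax(coarse_logits)                 # [B, C]
    p_fine = softmax(fine_logits)                     # [B, F]
    M = np.zeros((p_fine.shape[-1], p_coarse.shape[-1]))
    M[np.arange(len(group)), group] = 1.0             # 聚合矩阵:细类 → 粗类
    p_agg = p_fine @ M
    eps = 1e-9
    return float(np.mean(np.sum(
        p_agg * (np.log(p_agg + eps) - np.log(p_coarse + eps)), axis=-1)))

# 2 个粗类、4 个细类:前两个细类属粗类 0,后两个属粗类 1
group = np.array([0, 0, 1, 1])
coarse = np.array([[2.0, -2.0]])
fine_ok = np.array([[1.5, 1.5, -2.0, -2.0]])   # 与粗类预测一致
fine_bad = np.array([[-2.0, -2.0, 1.5, 1.5]])  # 与粗类预测矛盾
l_ok = hierarchical_consistency_loss(coarse, fine_ok, group)
l_bad = hierarchical_consistency_loss(coarse, fine_bad, group)
print(l_ok < l_bad)   # → True
```

这种约束惩罚"粗细两个头互相矛盾"的预测,是双向特征融合之外对层次结构的显式正则。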
[CV-206] M2: Dual-Memory Augmentation for Long-Horizon Web Agents via Trajectory Summarization and Insight Retrieval
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在长周期任务中面临的挑战,特别是由于交互历史冗长导致的上下文效率低下和决策鲁棒性不足的问题。现有方法依赖大量数据收集与模型训练,仍存在计算成本高、复杂场景下推理能力有限等缺陷。其解决方案的关键在于提出一种无需训练的、基于记忆增强的框架 M²,通过双层记忆机制实现优化:一是内部记忆中的动态轨迹摘要(Dynamic Trajectory Summarization),将冗长的交互历史压缩为紧凑的状态更新;二是外部记忆中的洞察检索增强(Insight Retrieval Augmentation),从离线洞察库中检索可操作的指导策略以引导代理决策。此设计显著提升了任务成功率并大幅降低token消耗与计算开销。
链接: https://arxiv.org/abs/2603.00503
作者: Dawei Yan,Haokui Zhang,Guangda Huzhang,Yang Li,Yibo Wang,Qing-Guo Chen,Zhao Xu,Weihua Luo,Ying Li,Wei Dong,Chunhua Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal Large Language Model (MLLM)-based agents have demonstrated remarkable potential in autonomous web navigation. However, handling long-horizon tasks remains a critical bottleneck. Prevailing strategies often rely heavily on extensive data collection and model training, yet still struggle with high computational costs and insufficient reasoning capabilities when facing complex, long-horizon scenarios. To address this, we propose M^2, a training-free, memory-augmented framework designed to optimize context efficiency and decision-making robustness. Our approach incorporates a dual-tier memory mechanism that synergizes Dynamic Trajectory Summarization (Internal Memory) to compress verbose interaction history into concise state updates, and Insight Retrieval Augmentation (External Memory) to guide the agent with actionable guidelines retrieved from an offline insight bank. Extensive evaluations across WebVoyager and OnlineMind2Web demonstrate that M^2 consistently surpasses baselines, yielding up to a 19.6% success rate increase and 58.7% token reduction for Qwen3-VL-32B, while proprietary models like Claude achieve accuracy gains up to 12.5% alongside significantly lower computational overhead.
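双层记忆机制的骨架可以抽象为"内部记忆做定长压缩、外部记忆做相似度检索"。以下为假设性示意(洞察库内容与截断式摘要策略均为示例;论文中的轨迹摘要由 LLM 生成,这里仅用保留最近若干步代替):

```python
import numpy as np

class DualMemory:
    """双层记忆骨架:内部记忆把交互轨迹压缩为最近 max_steps 条状态更新;
    外部记忆按余弦相似度从离线洞察库检索指导条目。"""
    def __init__(self, insights, insight_vecs, max_steps=2):
        self.history = []
        self.max_steps = max_steps
        self.insights = insights
        v = np.asarray(insight_vecs, dtype=float)
        self.vecs = v / np.linalg.norm(v, axis=1, keepdims=True)

    def update(self, step_summary):
        self.history.append(step_summary)
        self.history = self.history[-self.max_steps:]   # 压缩:仅保留最近若干步

    def retrieve(self, query_vec, k=1):
        q = np.asarray(query_vec, dtype=float)
        q = q / np.linalg.norm(q)
        top = np.argsort(self.vecs @ q)[-k:][::-1]      # 相似度最高的 k 条洞察
        return [self.insights[i] for i in top]

mem = DualMemory(insights=["遇到弹窗先关闭再继续", "提交表单前核对必填项"],
                 insight_vecs=[[1.0, 0.0], [0.0, 1.0]])
for step in ["打开页面", "出现弹窗", "点击关闭"]:
    mem.update(step)
print(mem.history)                # 内部记忆:仅保留最近 2 步
print(mem.retrieve([0.9, 0.1]))   # 外部记忆:检索与当前情境最相关的洞察
```

上下文开销因此不随轨迹长度线性增长,这正是论文报告大幅 token 缩减的直观来源。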
[CV-207] COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation CVPR2026
【速读】:该论文旨在解决基于单张参考视图估计新物体6自由度(6DoF)位姿的难题,尤其针对遮挡、视角变化和异常值带来的挑战。其核心问题在于如何在跨视图之间建立鲁棒的对应关系,而现有方法常依赖非可微的离散一对一匹配,易退化至稀疏关键点。解决方案的关键在于提出了一种无监督框架——置信度感知最优几何对应(Confidence-aware Optimal Geometric Correspondence, COG),将对应关系估计建模为一个置信度感知的最优传输问题:通过预测逐点置信度并将其作为最优传输的边际约束,生成平衡的软对应关系,有效抑制非重叠区域;同时引入视觉基础模型提供的语义先验对对应关系进行正则化,从而实现稳定且精确的位姿估计。该设计首次将置信度机制融入对应关系发现与位姿估计的端到端流程中,支持无监督学习,并在实验中展现出与监督方法相当甚至更优的性能。
链接: https://arxiv.org/abs/2603.00493
作者: Yuchen Che,Jingtu Wu,Hao Zheng,Asako Kanezaki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2026 Accepted
Abstract:Estimating the 6DoF pose of a novel object with a single reference view is challenging due to occlusions, view-point changes, and outliers. A core difficulty lies in finding robust cross-view correspondences, as existing methods often rely on discrete one-to-one matching that is non-differentiable and tends to collapse onto sparse key-points. We propose Confidence-aware Optimal Geometric Correspondence (COG), an unsupervised framework that formulates correspondence estimation as a confidence-aware optimal transport problem. COG produces balanced soft correspondences by predicting point-wise confidences and injecting them as optimal transport marginals, suppressing non-overlapping regions. Semantic priors from vision foundation models further regularize the correspondences, leading to stable pose estimation. This design integrates confidence into the correspondence finding and pose estimation pipeline, enabling unsupervised learning. Experiments show unsupervised COG achieves comparable performance to supervised methods, and supervised COG outperforms them.
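将逐点置信度注入最优传输边际的做法,可用带熵正则的 Sinkhorn 迭代示意:低置信度点对应的边际质量被压低,其参与的软对应随之被抑制。以下为假设性草图(代价矩阵与置信度均为构造数据,并非论文实现):

```python
import numpy as np

def sinkhorn(cost, mu, nu, eps=0.1, iters=200):
    """熵正则最优传输:返回软对应矩阵 P,行/列边际分别约束为 mu、nu。"""
    K = np.exp(-cost / eps)
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]

# 3 个源点、3 个目标点;第 2 个源点置信度低(模拟非重叠区域的点)
cost = np.array([[0.1, 1.0, 1.0],
                 [1.0, 0.1, 1.0],
                 [1.0, 1.0, 0.1]])
conf_src = np.array([1.0, 0.1, 1.0])    # 假设由网络预测的逐点置信度
conf_tgt = np.array([1.0, 1.0, 1.0])
mu = conf_src / conf_src.sum()          # 置信度归一化后作为边际约束
nu = conf_tgt / conf_tgt.sum()
P = sinkhorn(cost, mu, nu)
print(np.round(P.sum(axis=1), 3))       # 每行质量 = mu:低置信度点的对应被抑制
```

软对应 P 可微,因而能端到端地参与无监督的位姿估计训练,这是相对离散一对一匹配的关键差别。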
[CV-208] Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees
【速读】:该论文旨在解决大模型在复杂多模态任务中因推理过程缺乏可验证性、不确定性传播不可控以及计算资源利用效率低而导致的错误累积与幻觉问题(hallucination)。其解决方案的关键在于提出Proof-of-Perception (PoP) 框架,将多模态推理建模为一个带有显式可靠性保证的可执行图结构:每个感知或逻辑节点输出一个置信集(conformal set),从而实现校准后的分步不确定性量化;同时引入轻量级控制器基于这些证书动态分配计算资源,在预算内智能调度工具调用(tool call),仅在必要时扩展推理路径并提前终止冗余计算。该机制确保答案基于可验证证据,提升可靠性与准确性,并支持精确的精度-计算权衡。
链接: https://arxiv.org/abs/2603.00324
作者: Arya Fayyazi,Haleh Akrami
机构: University of Southern California (南加州大学); Nuro (Nuro)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present Proof-of-Perception (PoP), a tool-using framework that casts multimodal reasoning as an executable graph with explicit reliability guarantees. Each perception or logic node outputs a conformal set, yielding calibrated, stepwise uncertainty; a lightweight controller uses these certificates to allocate compute under a budget, expanding with extra tool calls only when needed and stopping early otherwise. This grounds answers in verifiable evidence, reduces error compounding and hallucinations, and enables principled accuracy-compute trade-offs. Across document, chart, and multi-image QA benchmarks, PoP improves performance and reliability over strong chain-of-thought, ReAct-style, and program-of-thought baselines while using computation more efficiently.
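"每个节点输出置信集"可以用分裂共形预测(split conformal prediction)来理解:用校准集非一致性分数的分位数确定阈值,预测集在边缘意义上以 ≥ 1-α 的概率覆盖真值。以下为假设性示意(非一致性分数取 1 - 类别概率,校准分数为构造数据):

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """分裂共形预测:取校准集分数的 ceil((n+1)(1-alpha))/n 分位数作为阈值。"""
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(q, 1.0), method="higher")

def conformal_set(probs, tau):
    """预测集:非一致性分数 (1 - p_k) 不超过阈值 tau 的所有类别。"""
    return [k for k, p in enumerate(probs) if (1 - p) <= tau]

# 校准分数 = 1 - 真类概率(此处用构造数据代替真实校准集)
cal_scores = np.linspace(0.0, 0.5, 100)
tau = conformal_threshold(cal_scores, alpha=0.1)
print(conformal_set([0.7, 0.2, 0.1], tau))   # → [0]
```

当某一步的置信集过大(甚至包含多个候选)时,控制器就有了"再调用一次工具"的量化依据,这正是 PoP 按预算分配计算的入口。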
[CV-209] Seeking Necessary and Sufficient Information from Multimodal Medical Data MICCAI2026
【速读】:该论文旨在解决多模态医学数据中特征学习的问题,即现有模型通常忽略同时具备必要性(Necessity)和充分性(Sufficiency)的特征,而这类特征对于提升模型性能和鲁棒性至关重要——尤其在某些模态缺失时仍能提供足够的预测信号。解决方案的关键在于将多模态表示分解为模态不变(modality-invariant)与模态特异(modality-specific)成分,并基于概率必要性与充分性(Probability of Necessity and Sufficiency, PNS)推导出可计算的目标函数,从而在多模态场景下有效学习此类关键特征。
链接: https://arxiv.org/abs/2603.00289
作者: Boyu Chen,Weiye Bao,Junjie Liu,Michael Shen,Bo Peng,Paul Taylor,Zhu Li,Mengyue Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 1 figure. Submitted to MICCAI 2026
Abstract:Learning multimodal representations from medical images and other data sources can provide richer information for decision-making. While various multimodal models have been developed for this, they overlook learning features that are both necessary (must be present for the outcome to occur) and sufficient (enough to determine the outcome). We argue learning such features is crucial as they can improve model performance by capturing essential predictive information, and enhance model robustness to missing modalities as each modality can provide adequate predictive signals. Such features can be learned by leveraging the Probability of Necessity and Sufficiency (PNS) as a learning objective, an approach that has proven effective in unimodal settings. However, extending PNS to multimodal scenarios remains underexplored and is non-trivial as key conditions of PNS estimation are violated. We address this by decomposing multimodal representations into modality-invariant and modality-specific components, then deriving tractable PNS objectives for each. Experiments on synthetic and real-world medical datasets demonstrate our method’s effectiveness. Code will be available on GitHub.
[CV-210] Ozone Cues Mitigate Reflected Downwelling Radiance in LWIR Absorption-Based Ranging
【速读】:该论文旨在解决被动长波红外(LWIR)吸收式测距中因大气下行辐射(downwelling radiance)导致的测距误差问题,尤其在低温度差异场景下,传统方法因忽略反射辐射而产生显著偏差。其解决方案的关键在于利用臭氧吸收特征来估计并校正下行辐射的贡献:提出两种新方法——四光谱法(quadspectral method)通过四个窄带测量(两个水汽吸收线和两个臭氧吸收线)获得闭合形式的距离估计;高光谱法(hyperspectral method)则扩展至更宽光谱范围,在提升精度的同时还能反演温度、发射率分布及多天顶角下的下行辐射贡献。实验表明,这两种方法可显著降低测距误差,其中四光谱法将误差从超过100米降至6.8米,高光谱法进一步降至1.2米。
链接: https://arxiv.org/abs/2603.00273
作者: Unay Dorken Gallastegi,Wentao Shangguan,Vaibhav Choudhary,Akshay Agarwal,Hoover Rueda-Chacón,Martin J. Stevens,Vivek K Goyal
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: 15 pages, 10 figures
Abstract:Passive long-wave infrared (LWIR) absorption-based ranging relies on atmospheric absorption to estimate distances to objects from their emitted thermal radiation. First demonstrated decades ago for objects much hotter than the air and recently extended to scenes with low temperature variations, this ranging has depended on reflected radiance being negligible. Downwelling radiance is especially problematic, sometimes causing large inaccuracies. In two new ranging methods, we use characteristic features from ozone absorption to estimate the contribution of reflected downwelling radiance. The quadspectral method gives a simple closed-form range estimate from four narrowband measurements, two at a water vapor absorption line and two at an ozone absorption line. The hyperspectral method uses a broader spectral range to improve accuracy while also providing estimates of temperature, emissivity profiles, and contributions of downwelling from a collection of zenith angles. Experimental results demonstrate improved ranging accuracy, in one case reducing error from over 100 m (when reflected light is not modeled) to 6.8 m with the quadspectral method and 1.2 m with the hyperspectral method.
[CV-211] Adversarial Patch Generation for Visual-Infrared Dense Prediction Tasks via Joint Position-Color Optimization
【速读】:该论文旨在解决多模态对抗攻击在密集预测任务中尚未充分探索的问题,特别是针对视觉-红外(Visual-Infrared, VI)感知系统中存在的跨谱不一致性挑战。现有对抗补丁方法主要针对单模态输入设计,无法有效处理不同模态间因光谱特性异质性和强度分布差异导致的攻击效果下降与隐蔽性不足问题。其解决方案的关键在于提出一种联合位置-颜色优化框架(Adversarial Patch Position-Color Optimization, AP-PCO),通过基于模型输出构建适应性 fitness 函数,同步优化补丁的空间位置和颜色组成,使单一补丁能同时干扰可见光与红外模态;进一步引入跨模态颜色自适应策略,在约束补丁红外灰度外观的同时保持可见域强扰动,从而降低跨谱显著性。该方法无需模型内部信息,支持灵活的黑盒攻击,并在多个VI密集预测架构上验证了稳定且高效的攻击性能。
链接: https://arxiv.org/abs/2603.00266
作者: He Li,Wenyue He,Weihang Kong,Xingchen Zhang
机构: Yanshan University (燕山大学); University of Exeter (埃克塞特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 8 figures
Abstract:Multimodal adversarial attacks for dense prediction remain largely underexplored. In particular, visual-infrared (VI) perception systems introduce unique challenges due to heterogeneous spectral characteristics and modality-specific intensity distributions. Existing adversarial patch methods are primarily designed for single-modal inputs and fail to account for cross-spectral inconsistencies, leading to reduced attack effectiveness and poor stealthiness when applied to VI dense prediction models. To address these challenges, we propose a joint position-color optimization framework (AP-PCO) for generating adversarial patches in visual-infrared settings. The proposed method optimizes patch placement and color composition simultaneously using a fitness function derived from model outputs, enabling a single patch to perturb both visible and infrared modalities. To further bridge spectral discrepancies, we introduce a cross-modal color adaptation strategy that constrains patch appearance according to infrared grayscale characteristics while maintaining strong perturbations in the visible domain, thereby reducing cross-spectral saliency. The optimization procedure operates without requiring internal model information, supporting flexible black-box attacks. Extensive experiments on visual-infrared dense prediction tasks demonstrate that the proposed AP-PCO achieves consistently strong attack performance across multiple architectures, providing a practical benchmark for robustness evaluation in VI perception systems.
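论文的位置-颜色联合优化只依赖模型输出(黑盒)。下面用随机搜索代替其基于适应度的优化过程,示意"采样候选 → 评估适应度 → 保留最优"的框架;demo_fitness 为假设的适应度函数,仅作演示,并非论文定义:

```python
import numpy as np

def optimize_patch(fitness, img_hw, patch_hw, n_iter=200, seed=0):
    """黑盒位置-颜色联合搜索骨架(随机搜索版,仅为示意):
    只依赖 fitness(由模型输出导出),不需要梯度等内部信息。"""
    rng = np.random.default_rng(seed)
    H, W = img_hw
    ph, pw = patch_hw
    best, best_f = None, -np.inf
    for _ in range(n_iter):
        x = int(rng.integers(0, W - pw + 1))    # 候选位置(补丁不越界)
        y = int(rng.integers(0, H - ph + 1))
        color = rng.random(3)                   # 候选 RGB 颜色
        f = fitness(x, y, color)
        if f > best_f:
            best, best_f = (x, y, color), f
    return best, best_f

def demo_fitness(x, y, color):
    """假设的适应度:越靠近图像中心、颜色越接近中灰得分越高(仅为演示)。"""
    return (-((x - 45) ** 2 + (y - 45) ** 2) / 1000.0
            - np.linalg.norm(color - 0.5))

(bx, by, bcolor), best_f = optimize_patch(demo_fitness, (100, 100), (10, 10))
print((bx, by), round(float(best_f), 3))
```

实际系统中 fitness 会由目标检测器在可见光与红外两路输入上的置信度共同构成,颜色还需受红外灰度约束。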
[CV-212] Pretty Good Measurement for Radiomics: A Quantum-Inspired Multi-Class Classifier for Lung Cancer Subtyping and Prostate Cancer Risk Stratification
【速读】:该论文旨在解决监督式多分类问题,尤其是如何在不依赖于成对或一对多(one-vs-rest)策略的前提下,实现真正意义上的多类分类。其核心挑战在于设计一种能够直接处理多个类别间复杂重叠关系的分类框架,并保持良好的性能表现。解决方案的关键在于引入量子启发式的“最优测量”(Pretty Good Measurement, PGM),将每类样本映射为一个编码的混合态(mixed state),并通过单一正算子值测度(POVM)构造完成分类决策,从而将多类分类问题转化为一组依赖类别的密度算子(density operators)的量子态判别问题。该方法的性能由编码映射所诱导的几何结构和类别间的重叠特性共同决定,实验表明其在肺癌组织亚型分类和前列腺癌风险分层等生物医学影像组学任务中具有竞争力,尤其在类别间区分度较高时表现优异。
链接: https://arxiv.org/abs/2603.00223
作者: Giuseppe Sergioli,Carlo Cuccu,Giovanni Pasini,Alessandro Stefano,Giorgio Russo,Andrés Camilo Granda Arango,Roberto Giuntini
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Quantum Physics (quant-ph)
备注: 15 pages, 7 figures, 1 table, in preparation for journal submission
Abstract:We investigate a quantum-inspired approach to supervised multi-class classification based on the Pretty Good Measurement (PGM), viewed as an operator-valued decision rule derived from quantum state discrimination. The method associates each class with an encoded mixed state and performs classification through a single POVM construction, thus providing a genuinely multi-class strategy without reduction to pairwise or one-vs-rest schemes. In this perspective, classification is reformulated as the discrimination of a finite ensemble of class-dependent density operators, with performance governed by the geometry induced by the encoding map and by the overlap structure among classes. To assess the practical scope of this framework, we apply the PGM-based classifier to two biomedical radiomics case studies: histopathological subtyping of non-small-cell lung carcinoma (NSCLC) and prostate cancer (PCa) risk stratification. The evaluation is conducted under protocols aligned with previously reported radiomics studies, enabling direct comparison with established classical baselines. The results show that the PGM-based classifier is consistently competitive and, in several settings, improves upon standard methods. In particular, the method performs especially well in the NSCLC binary and three-class tasks, while remaining competitive in the four-class case, where increased class overlap yields a more demanding discrimination geometry. In the PCa study, the PGM classifier remains close to the strongest ensemble baseline and exhibits clinically relevant sensitivity–specificity trade-offs across feature-selection scenarios.
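PGM 的构造是显式的:令 ρ = Σᵢ pᵢρᵢ,则第 i 个 POVM 元为 Eᵢ = ρ^{-1/2}(pᵢρᵢ)ρ^{-1/2},分类时取 Tr(Eᵢσ) 最大的类别。以下为数值示意(示例密度矩阵为假设的编码结果):

```python
import numpy as np

def pgm_povm(rhos, priors):
    """构造 PGM 的 POVM 元:E_i = rho^{-1/2} (p_i rho_i) rho^{-1/2},
    其中 rho = sum_i p_i rho_i(在 rho 的支撑上取伪逆平方根)。"""
    rho = sum(p * r for p, r in zip(priors, rhos))
    w, V = np.linalg.eigh(rho)
    inv_sqrt = V @ np.diag([1.0 / np.sqrt(x) if x > 1e-12 else 0.0
                            for x in w]) @ V.conj().T
    return [inv_sqrt @ (p * r) @ inv_sqrt for p, r in zip(priors, rhos)]

def pgm_classify(sigma, povm):
    """按 Tr(E_i sigma) 最大的类别作出判决。"""
    return int(np.argmax([np.real(np.trace(E @ sigma)) for E in povm]))

# 两个可部分区分的二维混合态(假设的"编码后"密度矩阵)
rho0 = np.array([[0.9, 0.0], [0.0, 0.1]])
rho1 = np.array([[0.2, 0.0], [0.0, 0.8]])
povm = pgm_povm([rho0, rho1], [0.5, 0.5])
print(pgm_classify(rho0, povm), pgm_classify(rho1, povm))   # → 0 1
```

注意所有 Eᵢ 由同一个测量一次性给出(ΣᵢEᵢ = I),这正是其"真正多分类、无需一对一/一对多拆解"的来源。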
[CV-213] Physical Evaluation of Naturalistic Adversarial Patches for Camera-Based Traffic-Sign Detection MICRO ATC2026
【速读】:该论文旨在解决自然主义对抗补丁(Naturalistic Adversarial Patches, NAPs)在物理交通标志场景中迁移性能的评估问题,特别是当检测模型是在面向自动驾驶车辆(AV)环境的定制数据集上训练时。其解决方案的关键在于构建了一个名为CompGTSRB的复合数据集,该数据集通过将德国交通标志识别基准(GTSRB)中的交通标志实例贴到目标平台采集的无畸变背景上,从而更真实地模拟AV感知场景;并基于此数据集训练YOLOv5模型,结合生成对抗网络(Generative Adversarial Network, GAN)与潜在空间优化技术生成NAPs,最终在Quanser QCar测试平台上系统性地验证了不同距离、补丁尺寸和位置配置下NAP对STOP类置信度的降低效果,证明了该方法在嵌入式感知流水线中评估对抗补丁有效性的可行性。
链接: https://arxiv.org/abs/2603.00217
作者: Brianna D’Urso,Tahmid Hasan Sakib,Syed Rafay Hasan,Terry N. Guo
机构: University of Hartford (哈特福德大学); Tennessee Technological University (田纳西理工大学); Center for Manufacturing Research (制造研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Accepted to the 2nd IEEE Conference on Secure and Trustworthy CyberInfrastructure for IoT and Microelectronics (SaTC 2026), Houston, Texas, USA, March 24 to 26, 2026
Abstract:This paper studies how well Naturalistic Adversarial Patches (NAPs) transfer to a physical traffic sign setting when the detector is trained on a customized dataset for an autonomous vehicle (AV) environment. We construct a composite dataset, CompGTSRB (a customized dataset for the AV environment), by pasting traffic sign instances from the German Traffic Sign Recognition Benchmark (GTSRB) onto undistorted backgrounds captured from the target platform. CompGTSRB is used to train a YOLOv5 model and generate patches using a Generative Adversarial Network (GAN) with latent space optimization, following existing NAP methods. We carried out a series of experiments on our Quanser QCar testbed utilizing the front CSI camera provided in QCar. Across configurations, NAPs reduce the detector’s STOP class confidence. Configurations vary in distance, patch size, and patch placement. These results, along with a detailed step-by-step methodology, demonstrate the utility of the CompGTSRB dataset and the proposed systematic physical protocols for credible patch evaluation. The research further motivates work on defenses that address localized patch corruption in embedded perception pipelines.
[CV-214] VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models CVPR2026
【速读】:该论文旨在解决视觉依赖任务中,随着推理时计算资源扩展(test-time compute scaling),大型多模态模型逐渐忽略视觉标记(visual tokens)而过度依赖文本先验(textual priors)导致性能下降的问题。解决方案的关键在于提出VisRef框架,通过在推理过程中主动重注入一组语义相关且具有全局代表性的视觉标记子集(coreset),从而引导模型保持对视觉信息的注意力,实现更 grounded 的多模态推理,无需额外的强化学习微调即可提升性能。
链接: https://arxiv.org/abs/2603.00207
作者: Soumya Suvra Ghosal,Youngeun Kim,Zhuowei Li,Ritwick Chaudhry,Linghan Xu,Hongjing Zhang,Jakub Zablocki,Yifan Xing,Qin Zhang
机构: University of Maryland, College Park (马里兰大学学院公园分校); Amazon (亚马逊); Physion Labs (Physion实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026
Abstract:Advances in large reasoning models have shown strong performance on complex reasoning tasks by scaling test-time compute through extended reasoning. However, recent studies observe that in vision-dependent tasks, extended textual reasoning at inference time can degrade performance as models progressively lose attention to visual tokens and increasingly rely on textual priors alone. To address this, prior works use reinforcement learning (RL)-based fine-tuning to route visual tokens or employ refocusing mechanisms during reasoning. While effective, these methods are computationally expensive, requiring large-scale data generation and policy optimization. To leverage the benefits of test-time compute without additional RL fine-tuning, we propose VisRef, a visually grounded test-time scaling framework. Our key idea is to actively guide the reasoning process by re-injecting a coreset of visual tokens that are semantically relevant to the reasoning context while remaining diverse and globally representative of the image, enabling more grounded multi-modal reasoning. Experiments on three visual reasoning benchmarks with state-of-the-art multi-modal large reasoning models demonstrate that, under fixed test-time compute budgets, VisRef consistently outperforms existing test-time scaling approaches by up to 6.4%.
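论文中"语义相关且多样、具全局代表性"的视觉 token coreset,可用 MMR 风格的贪心选择近似示意:每步在"与查询的相关性"和"与已选 token 的最大相似度"之间折中。以下为假设性草图,并非论文的 coreset 算法:

```python
import numpy as np

def coreset_select(tokens, query, k=3, lam=0.5):
    """MMR 风格贪心:score = lam*相关性 - (1-lam)*与已选 token 的最大相似度。"""
    T = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    rel = T @ q                                    # 与推理上下文/查询的相关性
    chosen = [int(np.argmax(rel))]
    while len(chosen) < k:
        sim_sel = np.max(T @ T[chosen].T, axis=1)  # 与已选集合的最大相似度
        score = lam * rel - (1.0 - lam) * sim_sel
        score[chosen] = -np.inf                    # 已选 token 不再参与
        chosen.append(int(np.argmax(score)))
    return chosen

tokens = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0], [0.1, 0.99]])
query = np.array([1.0, 0.0])
print(coreset_select(tokens, query, k=2, lam=1.0))   # 只看相关性 → 选中冗余 token
print(coreset_select(tokens, query, k=2, lam=0.3))   # 兼顾多样性 → 覆盖不同方向
```

被选出的 token 子集在推理中途重注入上下文,即可在不改模型的情况下把注意力"拉回"视觉证据。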
[CV-215] ACIT Benchmark: A Programmatic Visual Reasoning Benchmark for Generative and Discriminative Models
【速读】:该论文旨在解决当前视觉推理基准测试中存在的三大局限性:依赖自然语言提示、评估的推理模态过于狭窄,以及采用主观评分机制(如大语言模型作为评判者)。为应对这些问题,作者提出了TACIT Benchmark,其核心创新在于构建了一个程序化、多任务、结构化的视觉推理评估体系。关键解决方案包括:(1)设计涵盖空间导航、抽象模式补全、因果模拟、逻辑约束满足、图论和拓扑等6个推理领域的10项任务;(2)提供双轨评估机制——生成式赛道要求模型输出可被确定性计算机视觉流水线验证的解图像,判别式赛道则通过五选一选择题与结构合理但仅违反单一约束的干扰项,迫使模型识别细微视觉差异而非利用表面线索;(3)确保数据生成与验证的完全确定性和可复现性,版本0.1.0包含6,000个谜题(共108,000张PNG图像),并开源于HuggingFace(Apache 2.0许可)。
链接: https://arxiv.org/abs/2603.00206
作者: Daniel Nobrega Medeiros
机构: Independent Researcher
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures, 5 tables
Abstract:Existing visual reasoning benchmarks predominantly rely on natural language prompts, evaluate narrow reasoning modalities, or depend on subjective scoring procedures such as LLM-as-judge. We introduce the TACIT Benchmark, a programmatic visual reasoning benchmark comprising 10 tasks across 6 reasoning domains: spatial navigation, abstract pattern completion, causal simulation, logical constraint satisfaction, graph theory, and topology. The benchmark provides dual-track evaluation: a generative track in which models must produce solution images verified through deterministic computer-vision pipelines, and a discriminative track offering five-way multiple choice with structurally plausible near-miss distractors. Each distractor violates exactly one structural constraint, requiring models to reason about fine-grained visual differences rather than exploit superficial cues. Version 0.1.0 distributes 6,000 puzzles (108,000 PNG images across three resolutions) with fully deterministic seeded generation and reproducible verification. The dataset, generation code, and evaluation harness are released under the Apache 2.0 license on HuggingFace (DOI: https://doi.org/10.57967/hf/7904).
[CV-216] AdURA-Net: Adaptive Uncertainty and Region-Aware Network
【速读】:该论文旨在解决医学影像诊断中因标签不确定性(uncertain label)导致的临床决策可靠性问题,尤其在多标签数据集(如CheXpert、MIMIC-CXR)中,模型需避免在证据不足时强行输出高置信度预测。解决方案的关键在于提出AdURA-Net——一种几何驱动的自适应不确定性感知框架:其一,通过自适应空洞卷积与多尺度可变形对齐模块嵌入DenseNet主干网络,以捕捉胸部影像的解剖复杂性;其二,设计双头损失函数(Dual Head Loss),融合掩码二元交叉熵与Dirichlet证据学习目标,使模型能有效识别并表达不确定性,从而提升高风险临床场景下的决策可靠性。
链接: https://arxiv.org/abs/2603.00201
作者: Antik Aich Roy,Ujjwal Bhattacharya
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:A common issue in clinical decision-making is uncertainty, which often arises from ambiguity in radiology reports; such ambiguity may reflect genuine diagnostic uncertainty or limitations of automated label extraction in complex cases. This is especially true for multilabel datasets such as CheXpert and MIMIC-CXR, which contain positive, negative, and uncertain labels. In clinical decision-making, the uncertain label plays a tricky role, as the model should not be forced to provide a confident prediction in the absence of sufficient evidence. The ability of the model to say it does not know whenever it is not confident is crucial, especially in high-risk clinical decision-making. Here, we propose AdURA-Net, a geometry-driven adaptive uncertainty-aware framework for reliable thoracic disease classification. The key highlights of the proposed model are: a) adaptive dilated convolution and multiscale deformable alignment coupled with the DenseNet backbone architecture, capturing the anatomical complexities of medical images, and b) a Dual Head Loss, which combines masked binary cross-entropy with logits and a Dirichlet evidential learning objective.
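Dirichlet 证据学习让模型在证据不足时输出高不确定性,而非强行给出高置信预测。其常用量化形式(Subjective Logic 中的标准写法,非论文原式)可示意如下:alpha = evidence + 1,总体不确定性 u = K/S:

```python
import numpy as np

def evidential_uncertainty(evidence):
    """Dirichlet 证据建模:alpha = evidence + 1,S 为 alpha 之和,
    期望类别概率 p = alpha/S,总体不确定性 u = K/S。"""
    alpha = np.asarray(evidence, dtype=float) + 1.0
    S = alpha.sum()
    return alpha / S, len(alpha) / S

p1, u1 = evidential_uncertainty([9.0, 1.0])    # 证据充分 → 低不确定性
p2, u2 = evidential_uncertainty([0.1, 0.1])    # 证据不足 → 接近"不作判断"
print(round(float(u1), 3), round(float(u2), 3))   # → 0.167 0.909
```

对"uncertain"标签的样本,训练目标即鼓励网络输出低证据(高 u),而非在正/负之间二选一。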
[CV-217] Stateful Token Reduction for Long-Video Hybrid VLMs
【Quick Read】: This paper addresses the computational inefficiency of long-video vision-language models (VLMs) caused by very long input sequences, and in particular the lack of effective token-reduction methods for hybrid architectures that interleave attention with linear-time state-space blocks (e.g., Mamba). Existing token-reduction methods are designed for dense Transformers and do not transfer directly to such hybrids. The key to the solution is a low-to-high progressive reduction schedule together with a unified language-aware scoring mechanism that applies to both attention and Mamba blocks (via an implicit-attention proxy), enabling stable cross-layer token-importance estimation and all-layer token reduction. Under aggressive compression retaining 25% of visual tokens, the method achieves 3.8-4.2x prefilling speedups with near-baseline accuracy at test time, and further finetuning significantly improves performance on long-context video benchmarks.
Link: https://arxiv.org/abs/2603.00198
Authors: Jindong Jiang, Amala Sanjay Deshmukh, Kateryna Chumachenko, Karan Sapra, Zhiding Yu, Guilin Liu, Andrew Tao, Pavlo Molchanov, Jan Kautz, Wonmin Byeon
Affiliations: NVIDIA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Token reduction is an effective way to accelerate long-video vision-language models (VLMs), but most existing methods are designed for dense Transformers and do not directly account for hybrid architectures that interleave attention with linear-time state-space blocks (e.g., Mamba). We study query-conditioned token reduction for hybrid video VLMs and analyze reduction behavior through two properties: layerwise sparsity (how many tokens capture query-relevant information) and importance stability (whether token-importance rankings persist across depth). Although token importance is sparse within each layer, the set of important tokens changes across layers, so aggressive early pruning is unreliable. Motivated by this, we propose a low-to-high progressive reduction schedule and a unified language-aware scoring mechanism for both attention and Mamba blocks (using an implicit-attention proxy for Mamba), enabling all-layer token reduction in hybrids. Under an aggressive compression setting (retaining 25% of visual tokens), our approach delivers substantial prefilling speedups (3.8–4.2x) with near-baseline accuracy at test time, and light finetuning under reduction further improves performance on long-context video benchmarks.
[CV-218] A Case Study on Concept Induction for Neuron-Level Interpretability in CNN
【Quick Read】: This paper tackles the unclear internal semantics of hidden neurons in deep neural networks (DNNs), aiming to improve understanding of model decision mechanisms. The key to the solution is a Concept Induction-based framework: by reproducing the same analysis workflow previously applied to ADE20K on the large-scale scene-recognition dataset SUN2012, interpretable semantic labels are assigned to hidden neurons and validated with web-sourced images and statistical testing, demonstrating that the method generalizes well across datasets.
Link: https://arxiv.org/abs/2603.00197
Authors: Moumita Sen Sarma, Samatha Ereshi Akkamahadevi, Pascal Hitzler
Affiliations: Kansas State University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Deep Neural Networks (DNNs) have advanced applications in domains such as healthcare, autonomous systems, and scene understanding, yet the internal semantics of their hidden neurons remain poorly understood. Prior work introduced a Concept Induction-based framework for hidden neuron analysis and demonstrated its effectiveness on the ADE20K dataset. In this case study, we investigate whether the approach generalizes by applying it to the SUN2012 dataset, a large-scale scene recognition benchmark. Using the same workflow, we assign interpretable semantic labels to neurons and validate them through web-sourced images and statistical testing. Our findings confirm that the method transfers to SUN2012, showing its broader applicability.
[CV-219] SKeDA: A Generative Watermarking Framework for Text-to-video Diffusion Models
【Quick Read】: This paper addresses two core problems for watermarking in text-to-video generation: first, existing watermarking methods rely on strict inter-frame alignment, so extraction becomes unreliable once frames are reordered or lost; second, video-specific temporal distortions (e.g., inter-frame compression) significantly degrade watermark robustness. The key to the solution is the SKeDA framework, with two innovations: (1) Shuffle-Key-based Distribution-preserving Sampling (SKe) derives per-frame encryption sequences by permuting a single base pseudo-random binary sequence, turning watermark extraction from synchronization-sensitive sequence decoding into fault-tolerant set-level aggregation and thereby improving robustness to frame reordering and loss; (2) Differential Attention (DA) dynamically computes inter-frame differences and adjusts attention weights, strengthening resistance to temporal distortions and keeping the watermark reliable under complex video-compression scenarios.
Link: https://arxiv.org/abs/2603.00194
Authors: Yang Yang, Xinze Zou, Zehua Ma, Han Fang, Weiming Zhang
Affiliations: Anhui University; University of Science and Technology of China; National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 11 pages, 6 figures
Abstract:The rise of text-to-video generation models has raised growing concerns over content authenticity, copyright protection, and malicious misuse. Watermarking serves as an effective mechanism for regulating such AI-generated content, where high fidelity and strong robustness are particularly critical. Recent generative image watermarking methods provide a promising foundation by leveraging watermark information and pseudo-random keys to control the initial sampling noise, enabling lossless embedding. However, directly extending these techniques to videos introduces two key limitations: Existing designs implicitly rely on strict alignment between video frames and frame-dependent pseudo-random binary sequences used for watermark encryption. Once this alignment is disrupted, subsequent watermark extraction becomes unreliable; and Video-specific distortions, such as inter-frame compression, significantly degrade watermark reliability. To address these issues, we propose SKeDA, a generative watermarking framework tailored for text-to-video diffusion models. SKeDA consists of two components: (1) Shuffle-Key-based Distribution-preserving Sampling (SKe) employs a single base pseudo-random binary sequence for watermark encryption and derives frame-level encryption sequences through permutation. This design transforms watermark extraction from synchronization-sensitive sequence decoding into permutation-tolerant set-level aggregation, substantially improving robustness against frame reordering and loss; and (2) Differential Attention (DA), which computes inter-frame differences and dynamically adjusts attention weights during extraction, enhancing robustness against temporal distortions. Extensive experiments demonstrate that SKeDA preserves high video generation quality and watermark robustness.
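The shift from synchronization-sensitive sequence decoding to permutation-tolerant set-level aggregation can be sketched as follows. This is a toy illustration of the general idea under our own simplifying assumptions (bits tagged with position IDs, recovered by majority vote); SKeDA's actual SKe sampling operates on the diffusion sampling noise, and all names here are hypothetical:

```python
import random

def frame_keys(base_key, num_frames, seed=0):
    """Derive per-frame sequences by permuting one base binary sequence.

    Each frame carries the same multiset of (position, bit) pairs in a
    frame-specific order, so no frame-to-sequence alignment is needed.
    """
    rng = random.Random(seed)
    keys = []
    for _ in range(num_frames):
        perm = list(range(len(base_key)))
        rng.shuffle(perm)
        keys.append([(i, base_key[i]) for i in perm])
    return keys

def extract(frames):
    """Set-level aggregation: majority vote per bit position over all
    surviving frames, regardless of their order."""
    votes = {}
    for frame in frames:
        for pos, bit in frame:
            votes.setdefault(pos, []).append(bit)
    return [round(sum(v) / len(v)) for _, v in sorted(votes.items())]
```

Because extraction aggregates over a set, dropping or reordering frames does not corrupt the recovered sequence, which is the robustness property the abstract describes.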
[CV-220] Task-Driven Subspace Decomposition for Knowledge Sharing and Isolation in LoRA-based Continual Learning
【Quick Read】: This paper addresses the limitations of Low-Rank Adaptation (LoRA) in continual learning (CL): existing LoRA-based CL methods overlook task-shared directions and fail to capture truly task-specific directions, which suppresses knowledge transfer and leads to forgetting. The core of the solution is Low-rank Decomposition and Adaptation (LoDA), which performs a task-driven decomposition based on energy objectives to separate a general subspace from truly task-specific subspaces, decoupling directions for knowledge sharing and isolation. The key innovations are fixing the LoRA down-projections while learning robust up-projections via Gradient-Aligned Optimization (GAO), and applying, after each task, a closed-form recalibration of the general update that approximates a feature-level joint optimum along the task-shared direction, significantly improving performance across sequential tasks.
Link: https://arxiv.org/abs/2603.00191
Authors: Lingfeng He, De Cheng, Huaijie Wang, Xi Yang, Nannan Wang, Xinbo Gao
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: preprint
Abstract:Continual Learning (CL) requires models to sequentially adapt to new tasks without forgetting old knowledge. Recently, Low-Rank Adaptation (LoRA), a representative Parameter-Efficient Fine-Tuning (PEFT) method, has gained increasing attention in CL. Several LoRA-based CL methods reduce interference across tasks by separating their update spaces, typically building the new space from the estimated null space of past tasks. However, they (i) overlook task-shared directions, which suppresses knowledge transfer, and (ii) fail to capture truly effective task-specific directions, since these "null bases" of old tasks can remain nearly inactive for new tasks under correlated tasks. To address this, we study LoRA learning capability from a projection energy perspective, and propose Low-rank Decomposition and Adaptation (LoDA). It performs a task-driven decomposition to build general and truly task-specific LoRA subspaces by solving two energy-based objectives, decoupling directions for knowledge sharing and isolation. LoDA fixes LoRA down-projections on two subspaces and learns robust up-projections via a Gradient-Aligned Optimization (GAO) approach. After each task, before integrating the LoRA updates into the backbone, LoDA derives a closed-form recalibration for the general update, approximating a feature-level joint optimum along this task-shared direction. Experiments indicate that LoDA outperforms existing CL methods.
[CV-221] Efficient Long-Horizon GUI Agents via Training-Free KV Cache Compression
【Quick Read】: This paper targets the deployment bottleneck of large vision-language models (VLMs) in long-horizon interaction, where the Key-Value (KV) cache consumes substantial memory and adds high latency. Existing cache-compression methods are effective for general LLMs but underperform on GUI tasks, because GUI attention patterns are uniformly highly sparse across all Transformer layers, fundamentally unlike general visual tasks. The key to the solution is ST-Lite, a training-free KV cache compression framework whose core innovation is a dual-branch scoring policy: Component-centric Spatial Saliency (CSS) preserves the structural integrity of UI components by evaluating local neighborhood saliency, while Trajectory-aware Semantic Gating (TSG) dynamically filters visually repetitive KV pairs along the interaction trajectory to mitigate historical redundancy. With only a 10-20% cache budget, the method achieves 2.45x decoding acceleration while matching or exceeding full-cache baselines, offering a scalable path for resource-constrained GUI agents.
Link: https://arxiv.org/abs/2603.00188
Authors: Bowen Zhou, Zhou Xu, Wanli Li, Jingyu Xiao, Haoqian Wang
Affiliations: Tsinghua University, Shenzhen, China; Zhejiang University, Hangzhou, China; The Chinese University of Hong Kong, Hong Kong, China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large Vision-Language Models (VLMs) have emerged as powerful engines for autonomous GUI agents, yet their deployment is severely constrained by the substantial memory footprint and latency of the Key-Value (KV) cache during long-horizon interactions. While existing cache compression methods have proven effective for LLMs, we empirically demonstrate that they suffer from suboptimal performance in GUI scenarios due to a fundamental misalignment: unlike general visual tasks where attention sparsity varies across layers, GUI attention patterns exhibit uniform high-sparsity across all transformer layers. Motivated by this insight, we propose ST-Lite, a training-free KV cache compression framework tailored for efficient GUI agents that explicitly addresses the dynamic spatio-trajectory dependencies within GUI data streams. ST-Lite introduces a novel dual-branch scoring policy incorporating Component-centric Spatial Saliency (CSS) and Trajectory-aware Semantic Gating (TSG). Specifically, CSS preserves the structural integrity of interactive UI elements by evaluating local neighborhood saliency, while TSG mitigates historical redundancy by dynamically filtering visually repetitive KV pairs within the interaction trajectory. Extensive evaluations demonstrate that with only a 10-20% cache budget, ST-Lite achieves a 2.45x decoding acceleration while maintaining comparable or even superior performance compared to full-cache baselines, offering a scalable solution for resource-constrained GUI agents.
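The budgeted retention step of such a dual-branch policy can be sketched as follows. This is a hypothetical simplification: the real CSS/TSG scores come from UI-component saliency and trajectory redundancy, while here they are just two lists of floats that get averaged; the function name and the equal weighting are our own assumptions:

```python
def compress_kv(scores_css, scores_tsg, budget=0.2):
    """Keep the top `budget` fraction of visual-token KV pairs.

    Combines the two branch scores by simple averaging, ranks tokens,
    and returns the sorted indices of the KV pairs to retain.
    """
    combined = [(0.5 * c + 0.5 * t, i)
                for i, (c, t) in enumerate(zip(scores_css, scores_tsg))]
    keep = max(1, int(len(combined) * budget))  # never evict everything
    kept = sorted(combined, reverse=True)[:keep]
    return sorted(i for _, i in kept)
```

For example, with saliency scores `[0.9, 0.1, 0.5, 0.2, 0.8]`, gating scores `[0.8, 0.2, 0.4, 0.1, 0.9]`, and a 40% budget, the two highest-scoring tokens (indices 0 and 4) survive; all other KV pairs would be evicted from the cache.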
[CV-222] Zero-Shot and Supervised Bird Image Segmentation Using Foundation Models: A Dual-Pipeline Approach with Grounding DINO 1.5, YOLOv11, and SAM 2.1
【Quick Read】: This paper addresses the challenges of bird image segmentation in computer vision, including extreme pose diversity, complex plumage patterns, and variable lighting conditions. The key to the solution is a dual-pipeline framework built on a frozen Segment Anything Model 2.1 (SAM 2.1) as a shared backbone: a zero-shot pipeline uses Grounding DINO 1.5 with the text prompt "bird" to detect targets and feeds the resulting bounding boxes to SAM 2.1 for pixel-level segmentation without any labeled data, while a supervised pipeline fine-tunes YOLOv11 on the CUB-200-2011 dataset for high-precision detection before SAM 2.1 generates the masks. The segmentation model never needs retraining for new species or domains; only lightweight detector fine-tuning (about 1 hour) is required, reaching IoU 0.912 on CUB-200-2011 and clearly outperforming prior baselines (e.g., SegFormer-B2 at IoU 0.842).
Link: https://arxiv.org/abs/2603.00184
Authors: Abhinav Munagala
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Bird image segmentation remains a challenging task in computer vision due to extreme pose diversity, complex plumage patterns, and variable lighting conditions. This paper presents a dual-pipeline framework for binary bird image segmentation leveraging 2025 foundation models. We introduce two operating modes built upon Segment Anything Model 2.1 (SAM 2.1) as a shared frozen backbone: (1) a zero-shot pipeline using Grounding DINO 1.5 to detect birds via the text prompt “bird” before prompting SAM 2.1 with bounding boxes requiring no labelled bird data; and (2) a supervised pipeline that fine-tunes YOLOv11 on the CUB-200-2011 dataset for high-precision detection, again prompting SAM 2.1 for pixel-level masks. The segmentation model is never retrained for new species or domains. On CUB-200-2011 (11,788 images, 200 species), the supervised pipeline achieves IoU 0.912, Dice 0.954, and F1 0.953 outperforming all prior baselines including SegFormer-B2 (IoU 0.842) by +7.0 percentage points. The zero-shot pipeline achieves IoU 0.831 using only a text prompt, the first such result reported on this benchmark. We demonstrate that prompt-based foundation model pipelines outperform task specific end-to-end trained segmentation networks, while requiring only lightweight detector fine-tuning (~1 hour) for domain adaptation. Complete PyTorch implementation, dataset preparation scripts, and trained weights are publicly available.
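The IoU and Dice scores reported above are standard pixel-overlap metrics for binary masks; a minimal sketch of how they are computed (illustrative only, operating on flat 0/1 lists rather than the paper's evaluation code):

```python
def iou_dice(pred, gt):
    """Pixel-level IoU and Dice for binary masks given as flat 0/1 lists.

    IoU  = |P ∩ G| / |P ∪ G|
    Dice = 2 |P ∩ G| / (|P| + |G|)
    Empty masks on both sides count as a perfect match.
    """
    inter = sum(p & g for p, g in zip(pred, gt))
    p_sum, g_sum = sum(pred), sum(gt)
    union = p_sum + g_sum - inter
    iou = inter / union if union else 1.0
    dice = 2 * inter / (p_sum + g_sum) if (p_sum + g_sum) else 1.0
    return iou, dice
```

Note that Dice is always at least as large as IoU for the same masks, which is why the paper's Dice of 0.954 sits above its IoU of 0.912.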
[CV-223] Segmenting Low-Contrast XCTs of Concretes: An Unsupervised Approach
【Quick Read】: This paper addresses the low contrast in X-ray computed tomography (XCT) images of concrete, where the similar X-ray attenuation coefficients of aggregates and mortar hinder accurate semantic segmentation. Conventional CNN models perform well on semantic segmentation but typically require large amounts of labeled data, which is unavailable or costly to obtain for new datasets. The key to the solution is a self-annotation-based unsupervised training method: superpixel algorithms identify perceptually similar local regions in an image, and the receptive field of a CNN links these local regions to the global context, enabling the model to learn global-local relationships, identify semantically consistent structures, and achieve high-quality semantic segmentation without manual labels.
Link: https://arxiv.org/abs/2603.00127
Authors: Kaustav Das, Gaston Rauchs, Jan Sykora, Anna Kucerova
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This work tests a self-annotation-based unsupervised methodology for training a convolutional neural network (CNN) model for semantic segmentation of X-ray computed tomography (XCT) scans of concretes. Concrete poses a unique challenge for XCT imaging due to similar X-ray attenuation coefficients of aggregates and mortar, resulting in low-contrast between the two phases in the ensuing images. While CNN-based models are a proven technique for semantic segmentation in such challenging cases, they typically require labeled training data, which is often unavailable for new datasets or are costly to obtain. To counter that limitation, a self-annotation technique is used here which leverages superpixel algorithms to identify perceptually similar local regions in an image and relates them to the global context in the image by utilizing the receptive field of a CNN-based model. This enables the model to learn a global-local relationship in the images and enables identification of semantically similar structures. We therefore present the performance of the unsupervised training methodology on our XCT datasets and discuss potential avenues for further improvements.
[CV-224] OrthoAI: A Lightweight Deep Learning Framework for Automated Biomechanical Analysis in Clear Aligner Orthodontics – A Methodological Proof-of-Concept
【Quick Read】: This paper addresses the slow and error-prone manual review of digitally planned tooth movements in orthodontics (e.g., ClinCheck). The key to the solution is OrthoAI, a decision-support system combining lightweight 3D dental segmentation with automated biomechanical analysis: a Dynamic Graph CNN performs tooth identification on sparse landmark point clouds (trained on the 3DTeethLand dataset), and a rule-based biomechanical engine grounded in orthodontic evidence (Kravitz et al. 2009; Simon et al. 2014) decomposes per-tooth motion across six degrees of freedom, assesses movement predictability, issues alerts when biomechanical limits are exceeded, and derives a composite index. The end-to-end pipeline runs in about 4 seconds on consumer hardware, providing a reproducible open-source toolchain for geometric deep learning and digital orthodontics research.
Link: https://arxiv.org/abs/2603.00124
Authors: Edouard Lansiaux, Margaux Leman, Mehdi Ammi
Affiliations: STaR-AI, Emergency Department, Lille University Hospital; Artificial Intelligence and Data Semantics Laboratory, Paris 8 University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Clear aligner therapy now dominates orthodontics, yet clinician review of digitally planned tooth movements-typically via ClinCheck (Align Technology)-remains slow and error-prone. We present OrthoAI, an open-source proof-of-concept decision-support system combining lightweight 3D dental segmentation with automated biomechanical analysis to assist treatment-plan evaluation. The framework uses a Dynamic Graph CNN trained on landmark-reconstructed point clouds from 3DTeethLand (MICCAI) and integrates a rule-based biomechanical engine grounded in orthodontic evidence (Kravitz et al 2009; Simon et al 2014). The system decomposes per-tooth motion across six degrees of freedom, computes movement-specific predictability, issues alerts when biomechanical limits are exceeded, and derives an exploratory composite index. With 60,705 trainable parameters, segmentation reaches a Tooth Identification Rate of 81.4% and mIoU of 8.25% on surrogate point clouds-reflecting sparse landmark supervision rather than dense meshes. Although spatial boundaries are coarse, downstream analysis depends mainly on tooth identity and approximate centroid/axis estimation. Results establish a baseline for future full-mesh training and highlight current perceptual limits. The end-to-end pipeline runs in 4s on consumer hardware. Code, weights, and analysis tools are released to support reproducible research in geometric deep learning and digital orthodontics. The system has not been validated on real intraoral meshes and should not be assumed to generalize beyond landmark-derived representations.
[CV-225] CT-Flow: Orchestrating CT Interpretation Workflow with Model Context Protocol Servers ACL2026
【Quick Read】: This paper addresses the limitation that current large vision-language models (LVLMs) rely on static, single-pass inference for 3D CT analysis, which mismatches the dynamic, tool-mediated, iterative interpretation workflow of clinical radiology. The key to the solution is CT-Flow, a framework built on the Model Context Protocol (MCP) that enables an open, tool-aware reasoning paradigm: the model autonomously decomposes natural-language queries into multi-step tool-use sequences, and is instruction-tuned and evaluated with the purpose-built CT-FlowBench benchmark, achieving automated, interpretable volumetric interpretation of 3D CT. Experiments show a 41% improvement in diagnostic accuracy over baseline models and a 95% success rate in autonomous tool invocation.
Link: https://arxiv.org/abs/2603.00123
Authors: Yannian Gu, Xizhuo Zhang, Linjie Mu, Yongrui Yu, Zhongzhen Huang, Shaoting Zhang, Xiaofan Zhang
Affiliations: Qing Yuan Research Institute, Shanghai Jiao Tong University, Shanghai, China; Shanghai Innovation Institute, Shanghai, China; SenseTime Research, Shanghai, China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: submitting to ACL 2026
Abstract:Recent advances in Large Vision-Language Models (LVLMs) have shown strong potential for multi-modal radiological reasoning, particularly in tasks like diagnostic visual question answering (VQA) and radiology report generation. However, most existing approaches for 3D CT analysis largely rely on static, single-pass inference. In practice, clinical interpretation is a dynamic, tool-mediated workflow where radiologists iteratively review slices and use measurement, radiomics, and segmentation tools to refine findings. To bridge this gap, we propose CT-Flow, an agentic framework designed for interoperable volumetric interpretation. By leveraging the Model Context Protocol (MCP), CT-Flow shifts from closed-box inference to an open, tool-aware paradigm. We curate CT-FlowBench, the first large-scale instruction-tuning benchmark tailored for 3D CT tool-use and multi-step reasoning. Built upon this, CT-Flow functions as a clinical orchestrator capable of decomposing complex natural language queries into automated tool-use sequences. Experimental evaluations on CT-FlowBench and standard 3D VQA datasets demonstrate that CT-Flow achieves state-of-the-art performance, surpassing baseline models by 41% in diagnostic accuracy and achieving a 95% success rate in autonomous tool invocation. This work provides a scalable foundation for integrating autonomous, agentic intelligence into real-world clinical radiology.
[CV-226] BiSe-Unet: A Lightweight Dual-path U-Net with Attention-refined Context for Real-time Medical Image Segmentation
【Quick Read】: This paper addresses the difficulty lightweight medical image segmentation models face in balancing real-time speed with segmentation accuracy on resource-constrained devices, specifically real-time polyp detection during endoscopy-guided colonoscopy. The key to the solution is BiSe-UNet, a lightweight dual-path U-Net that introduces an attention-refined context path for stronger semantic understanding while keeping a shallow spatial path to preserve fine boundary detail, followed by a depthwise separable decoder for efficient feature reconstruction. The design maintains high segmentation quality while sustaining real-time inference above 30 FPS, making it suitable for deployment on edge hardware.
Link: https://arxiv.org/abs/2603.00119
Authors: M Iffat Hossain, Laura Brattain
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE EMBC 2026. This work has been submitted to the IEEE for possible publication
Abstract:During image-guided procedures, real-time image segmentation is often required. This demands lightweight AI models that can operate on resource-constrained devices. One important use case is endoscopy-guided colonoscopy, where polyps must be detected in real time. The Kvasir-Seg dataset, a publicly available benchmark for this task, contains 1,000 high-resolution endoscopic images of polyps with corresponding pixel-level segmentation masks. Achieving real-time inference speed for clinical deployment in constrained environments requires highly efficient and lightweight network architectures. However, many existing models remain too computationally intensive for embedded deployment. Lightweight architectures, although faster, often suffer from reduced spatial precision and weaker contextual understanding, leading to degraded boundary quality and reduced diagnostic reliability. To address these challenges, we introduce BiSe-UNet, a lightweight dual-path U-Net that integrates an attention-refined context path with a shallow spatial path for detailed feature preservation, followed by a depthwise separable decoder for efficient reconstruction. Evaluated on the Kvasir-Seg dataset, BiSe-UNet achieves competitive Dice and IoU scores while sustaining real-time throughput exceeding 30 FPS on Raspberry Pi 5, demonstrating its effectiveness for accurate, lightweight, and deployable medical image segmentation on edge hardware.
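The depthwise separable decoder mentioned above is the standard trick for shrinking convolutional layers: a k×k depthwise convolution (one filter per input channel) followed by a 1×1 pointwise convolution replaces a full k×k convolution. A minimal parameter-count sketch (general technique, not the paper's exact layer configuration; bias terms ignored):

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution: c_in * c_out * k * k."""
    return c_in * c_out * k * k

def dw_separable_params(c_in, c_out, k):
    """Depthwise k x k (c_in * k * k) plus 1 x 1 pointwise (c_in * c_out)."""
    return c_in * k * k + c_in * c_out
```

For a typical 3×3 layer with 64 input and 128 output channels, the standard convolution uses 73,728 weights while the separable version uses 8,768, roughly an 8.4x reduction, which is the kind of saving that makes 30+ FPS feasible on a Raspberry Pi 5.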
[CV-227] Efficient Image Super-Resolution with Multi-Scale Spatial Adaptive Attention Networks
【Quick Read】: This paper tackles a common dilemma in image super-resolution (SR): achieving high reconstruction fidelity while keeping model complexity low, a trade-off existing methods struggle to balance, resulting in either heavy computation or limited quality. The key to the solution is a lightweight network, the Multi-scale Spatial Adaptive Attention Network (MSAAN), whose core is a novel Multi-scale Spatial Adaptive Attention Module (MSAA) built from two synergistic components: a Global Feature Modulation Module (GFM) that learns coherent texture structures, and a Multi-scale Feature Aggregation Module (MFA) that adaptively fuses local-to-global features via pyramidal processing, jointly modeling fine-grained local details and long-range contextual dependencies. A Local Enhancement Block (LEB) and a Feature Interactive Gated Feed-Forward Module (FIGFF) further strengthen geometric perception and nonlinear representation while reducing channel redundancy. Experiments on standard benchmarks show leading or competitive PSNR and SSIM with significantly fewer parameters and lower computational cost than state-of-the-art methods.
Link: https://arxiv.org/abs/2603.00118
Authors: Sushi Rao, Jingwei Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This paper introduces a lightweight image super-resolution (SR) network, termed the Multi-scale Spatial Adaptive Attention Network (MSAAN), to address the common dilemma between high reconstruction fidelity and low model complexity in existing SR methods. The core of our approach is a novel Multi-scale Spatial Adaptive Attention Module (MSAA), designed to jointly model fine-grained local details and long-range contextual dependencies. The MSAA comprises two synergistic components: a Global Feature Modulation Module (GFM) that learns coherent texture structures through differential feature extraction, and a Multi-scale Feature Aggregation Module (MFA) that adaptively fuses features from local to global scales using pyramidal processing. To further enhance the network's capability, we propose a Local Enhancement Block (LEB) to strengthen local geometric perception and a Feature Interactive Gated Feed-Forward Module (FIGFF) to improve nonlinear representation while reducing channel redundancy. Extensive experiments on standard benchmarks (Set5, Set14, B100, Urban100, Manga109) across ×2, ×3, and ×4 scaling factors demonstrate that both our lightweight (MSAAN-light) and standard (MSAAN) versions achieve superior or competitive performance in terms of PSNR and SSIM, while maintaining significantly lower parameters and computational costs than state-of-the-art methods. Ablation studies validate the contribution of each component, and visual results show that MSAAN reconstructs sharper edges and more realistic textures.
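PSNR, the headline metric in SR papers such as this one, is just a logarithmic rescaling of mean squared error. A minimal sketch of the standard formula for 8-bit images (illustrative; not tied to this paper's evaluation code):

```python
import math

def psnr(mse, max_val=255.0):
    """Peak signal-to-noise ratio (dB) from mean squared error.

    PSNR = 10 * log10(MAX^2 / MSE); identical images (MSE = 0) give +inf.
    Each factor-of-10 drop in MSE adds exactly 10 dB.
    """
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_val * max_val / mse)
```

Because of the log scale, the fractional-dB gaps between competing SR models in benchmark tables correspond to small multiplicative differences in pixel-level error.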
[CV-228] VoxelDiffusionCut: Non-destructive Internal-part Extraction via Iterative Cutting and Structure Estimation
【Quick Read】: This paper addresses the difficulty of non-destructively extracting internal parts (such as batteries and motors) in e-waste recycling, where high product diversity and missing disassembly information make it hard to decide where to cut. The key to the solution is VoxelDiffusionCut, which iteratively estimates the internal structure, represented as voxels, with a diffusion model and plans cuts based on the estimates. The voxel representation restricts prediction to fixed grid positions, requiring only the part type at each position and making learning more tractable, while the diffusion model completes the voxel representation conditioned on observed cutting surfaces, capturing uncertainty in unobserved regions to avoid erroneous cuts and enabling accurate, safe, non-destructive extraction.
Link: https://arxiv.org/abs/2603.00116
Authors: Takumi Hachimine, Yuhwan Kwon, Cheng-Yu Kuo, Tomoya Yamanokuchi, Takamitsu Matsubara
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages
Abstract:Non-destructive extraction of the target internal part, such as batteries and motors, by cutting surrounding structures is crucial at recycling and disposal sites. However, the diversity of products and the lack of information on disassembly procedures make it challenging to decide where to cut. This study explores a method for non-destructive extraction of a target internal part that iteratively estimates the internal structure from observed cutting surfaces and formulates cutting plans based on the estimation results. A key requirement is to estimate the probability of the target part’s presence from partial observations. However, learning conditional generative models for this task is challenging: The high dimensionality of 3D shape representations makes learning difficult, and conventional models (e.g., conditional variational autoencoders) often fail to capture multi-modal predictive uncertainty due to mode collapse, resulting in overconfident predictions. To address these issues, we propose VoxelDiffusionCut, which iteratively estimates the internal structure represented as voxels using a diffusion model and plans cuts for non-destructive extraction of the target internal part based on the estimation results. Voxel representation allows the model to predict only attributes at fixed grid positions, i.e., types of constituent parts, making learning more tractable. The diffusion model completes the voxel representation conditioned on observed cutting surfaces, capturing uncertainty in unobserved regions to avoid erroneous cuts. Experimental results in simulation suggest that the proposed method can estimate internal structures from observed cutting surfaces and enable non-destructive extraction of the target internal part by leveraging the estimated uncertainty.
[CV-229] Automated Quality Check of Sensor Data Annotations
【Quick Read】: This paper addresses the difficulty of assuring training-data quality for automated driving systems, in particular multi-sensor railway datasets, whose safety relevance imposes strict quality requirements; manual validation is inefficient and costly, and cannot support AI development at the required scale. The key to the solution is an open-source automatic detection tool that identifies nine common error types in multi-sensor datasets, validated through manual checks: six issue-detection methods reach 100% precision and the remaining three reach 96-97%, substantially reducing manual workload and accelerating the development of highly reliable environment-perception systems.
Link: https://arxiv.org/abs/2603.00114
Authors: Niklas Freund, Zekiye Ilknur-Öz, Tobias Klockau, Patrick Naumann, Philipp Neumaier, Martin Köppel
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The monitoring of the route and track environment plays an important role in automated driving. For example, it can be used as an assistance system for route monitoring at automation level Grade of Automation (GoA) 2, where the train driver is still on board. In fully automated, driverless driving at automation level GoA4, these systems take over environment monitoring completely independently. With the help of artificial intelligence (AI), they react automatically to risks and dangerous events on the route. To train such AI algorithms, large amounts of training data are required, which must meet high quality standards due to their safety relevance. In this publication we present an automatic method for assuring the quality of training data, significantly reducing the manual workload and accelerating the development of these systems. We propose an open-source tool designed to detect nine common errors found in multi-sensor datasets for railway vehicles. To evaluate the performance of the framework, all detected errors were manually validated. Six issue detection methods achieved 100% precision, while three additional methods reached precision rates of 96% and 97%.
[CV-230] Certainty-Validity: A Diagnostic Framework for Discrete Commitment Systems
【Quick Read】: This paper addresses a fundamental flaw in standard machine-learning metrics (accuracy, precision, recall, AUROC) for discrete commitment systems (architectures that must commit to one of the states -W, 0, +W): these metrics assume all errors cost the same, ignoring the gap between prediction confidence and correctness, which hides Confident-Incorrect (CI) behavior where a model hallucinates structure in ambiguous data. The key to the solution is the Certainty-Validity Score (CVS) framework, which decomposes performance into a 2x2 matrix distinguishing high/low certainty from valid/invalid predictions, exposing CI as a hidden failure mode. The paper further argues that the "83% Ambiguity Ceiling" discrete models hit on noisy benchmarks is not a defect but a sign of their sensitivity to structural evidence, and that benign overfitting under standard training migrates models from appropriate doubt to harmful confident errors; "good training" should therefore be redefined as maximizing CVS, so that the model commits only when there is sufficient structural evidence.
Link: https://arxiv.org/abs/2603.00070
Authors: Datorien L. Anderson
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 1 figure, full experiment data can be found: this https URL
Abstract:Standard evaluation metrics for machine learning – accuracy, precision, recall, and AUROC – assume that all errors are equivalent: a confident incorrect prediction is penalized identically to an uncertain one. For discrete commitment systems (architectures that select committed states -W, 0, +W), this assumption is epistemologically flawed. We introduce the Certainty-Validity (CVS) Framework, a diagnostic method that decomposes model performance into a 2x2 matrix distinguishing high/low certainty from valid/invalid predictions. This framework reveals a critical failure mode hidden by standard accuracy: Confident-Incorrect (CI) behavior, where models hallucinate structure in ambiguous data. Through ablation experiments on Fashion-MNIST, EMNIST, and IMDB, we analyze the “83% Ambiguity Ceiling” – a stopping point where this specific discrete architecture consistently plateaus on noisy benchmarks. Unlike continuous models that can surpass this ceiling by memorizing texture or statistical noise, the discrete model refuses to commit to ambiguous samples. We show that this refusal is not a failure but a feature: the model stops where structural evidence ends. However, standard training on ambiguous data eventually forces Benign Overfitting, causing a pathological migration from Uncertain-Incorrect (appropriate doubt) to Confident-Incorrect (hallucination). We propose that “good training” for reasoning systems must be defined not by accuracy, but by maximizing the Certainty-Validity Score (CVS) – ensuring the model knows where to stop.
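The 2x2 certainty/validity decomposition can be sketched with a simple bucketing function. This is an illustration under our own assumptions (a scalar confidence with a hypothetical threshold `tau`; the paper's discrete systems commit explicitly rather than thresholding), not the paper's scoring code:

```python
def cvs_matrix(confidences, correct, tau=0.8):
    """Bucket predictions into the 2x2 certainty/validity matrix.

    Returns counts (Confident-Correct, Confident-Incorrect,
    Uncertain-Correct, Uncertain-Incorrect). The Confident-Incorrect
    cell is the failure mode that plain accuracy cannot see.
    """
    cc = ci = uc = ui = 0
    for conf, ok in zip(confidences, correct):
        if conf >= tau:
            cc, ci = cc + ok, ci + (not ok)
        else:
            uc, ui = uc + ok, ui + (not ok)
    return cc, ci, uc, ui
```

Two models with identical accuracy can have very different CI counts, which is exactly the distinction the CVS framework is designed to surface.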
[CV-231] Learning Under Extreme Data Scarcity: Subject-Level Evaluation of Lightweight CNNs for fMRI-Based Prodromal Parkinsons Detection
【Quick Read】: This paper addresses the information leakage caused by improper data-splitting strategies when evaluating deep-learning models under extreme data scarcity (e.g., neuroimaging-based detection of prodromal Parkinson's disease), and examines the trade-off between model capacity and generalization. The key to the solution is enforcing a strict subject-level split rather than the common image-level split, so that fMRI slices from the same subject never appear in both training and test sets, effectively preventing leakage. Comparing convolutional architectures of different capacity (VGG19, Inception V3, Inception ResNet V2, and MobileNet V1) on this basis, the lightweight MobileNet generalizes most reliably under subject-level evaluation, outperforming larger, overfitting-prone deep architectures and showing that, in extreme low-data regimes, a sound evaluation strategy and matched model capacity improve reliability more than added network depth.
Link: https://arxiv.org/abs/2603.00060
Authors: Naimur Rahman
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Methodological case study (cs.LG) on subject-level evaluation and model capacity under extreme data scarcity; 9 pages, 1 figure. Experiments use 40-subject PPMI fMRI cohort; no external validation
Abstract:Deep learning is often applied in settings where data are limited, correlated, and difficult to obtain, yet evaluation practices do not always reflect these constraints. Neuroimaging for prodromal Parkinsons disease is one such case, where subject numbers are small and individual scans produce many highly related samples. This work examines prodromal Parkinsons detection from resting-state fMRI as a machine learning problem centered on learning under extreme data scarcity. Using fMRI data from 40 subjects, including 20 prodromal Parkinsons cases and 20 healthy controls, ImageNet-pretrained convolutional neural networks are fine-tuned and evaluated under two different data partitioning strategies. Results show that commonly used image-level splits allow slices from the same subject to appear in both training and test sets, leading to severe information leakage and near-perfect accuracy. When a strict subject-level split is enforced, performance drops substantially, yielding test accuracies between 60 and 81 percent. Models with different capacity profiles are compared, including VGG19, Inception V3, Inception ResNet V2, and the lightweight MobileNet V1. Under subject-level evaluation, MobileNet demonstrates the most reliable generalization, outperforming deeper architectures despite having significantly fewer parameters. These results indicate that in extreme low-data regimes, evaluation strategy and model capacity have a greater impact on performance than architectural depth. Although the analysis is limited to a single cohort of 40 subjects and does not include external validation or cross-validation, it provides a concrete case study and practical recommendations for evaluating deep learning models under severe data scarcity. Comments: Methodological case study cs.LG on subject-level evaluation and model capacity under extreme data scarcity; 9 pages, 1 figure. 
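上文摘要的核心结论是:必须按受试者(subject-level)而非按切片(image-level)划分数据,否则同一受试者的切片会同时落入训练集与测试集而造成信息泄露。下面给出一个极简的按受试者划分示意(纯 Python;40 名受试者与原文队列一致,每人 50 个切片为假设数值):

```python
import random

def subject_level_split(samples, test_frac=0.25, seed=0):
    """按受试者划分:同一 subject 的所有切片只会进入 train 或 test 之一。
    samples 为 (subject_id, slice_id) 列表。"""
    subjects = sorted({sid for sid, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, int(len(subjects) * test_frac))
    test_subjects = set(subjects[:n_test])
    train = [s for s in samples if s[0] not in test_subjects]
    test = [s for s in samples if s[0] in test_subjects]
    return train, test

# 40 名受试者、每人 50 个切片(切片数为假设)
data = [(sid, k) for sid in range(40) for k in range(50)]
train, test = subject_level_split(data)
assert {s for s, _ in train}.isdisjoint({s for s, _ in test})
```

实际工程中可直接使用 scikit-learn 的 GroupShuffleSplit 以 subject_id 作为 group 实现同样的约束。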
[CV-232] Scaling Quantum Machine Learning without Tricks: High-Resolution and Diverse Image Generation
【速读】:该论文旨在解决当前量子生成建模(Quantum Generative Modeling)在实际应用中面临的两大核心问题:一是受限于量子硬件的算力瓶颈,导致现有方法仅能处理小规模或简化数据集;二是缺乏与具体应用场景相关的归纳偏置(inductive bias),使得模型设计依赖人为“技巧”来压缩高分辨率图像(如降维或分块处理)。其解决方案的关键在于:利用近期发展的经典图像量子加载技术,直接将完整的MNIST和Fashion-MNIST数据集输入量子系统,并训练端到端的量子Wasserstein生成对抗网络(Quantum Wasserstein GAN),从而无需任何降维或分块策略即可生成全分辨率图像。此外,通过精心设计变分电路架构引入有效的归纳偏置,结合增强噪声输入技术提升图像多样性与质量,在量子采样噪声条件下仍保持优异性能,实现了单个量子生成器在标准基准上的新最优表现。
链接: https://arxiv.org/abs/2603.00233
作者: Jonas Jäger,Florian J. Kiwit,Carlos A. Riofrío
机构: 未知
类目: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 25 pages, 16 figures. Main text: 14 pages, 7 figures. Appendix: 11 pages, 9 figures
Abstract:Quantum generative modeling is a rapidly evolving discipline at the intersection of quantum computing and machine learning. Contemporary quantum machine learning is generally limited to toy examples or heavily restricted datasets with few elements. This is not only due to the current limitations of available quantum hardware but also due to the absence of inductive biases arising from application-agnostic designs. Current quantum solutions must resort to tricks to scale down high-resolution images, such as relying heavily on dimensionality reduction or utilizing multiple quantum models for low-resolution image patches. Building on recent developments in classical image loading to quantum computers, we circumvent these limitations and train quantum Wasserstein GANs on the established classical MNIST and Fashion-MNIST datasets. Using the complete datasets, our system generates full-resolution images across all ten classes and establishes a new state-of-the-art performance with a single end-to-end quantum generator without tricks. As a proof-of-principle, we also demonstrate that our approach can be extended to color images, exemplified on the Street View House Numbers dataset. We analyze how the choice of variational circuit architecture introduces inductive biases, which crucially unlock this performance. Furthermore, enhanced noise input techniques enable highly diverse image generation while maintaining quality. Finally, we show promising results even under quantum shot noise conditions.
[CV-233] GLIDE-Reg: Global-to-Local Deformable Registration Using Co-Optimized Foundation and Handcrafted Features
【速读】:该论文旨在解决医学影像中可变形配准(deformable registration)方法在空间分辨率差异和解剖覆盖范围不一致情况下缺乏鲁棒性和泛化能力的问题。解决方案的关键在于联合优化一个配准场(registration field)与一个可学习的降维模块,以确保压缩后的体积形变场嵌入(VFM embeddings)保持配准相关性,并将这些全局语义线索与局部MIND描述子进行融合,从而提升跨数据集和任务的性能稳定性与准确性。
链接: https://arxiv.org/abs/2603.00218
作者: Yunzheng Zhu,Aichi Chien,Kimaya Kulkarni,Luoting Zhuang,Stephen Park,Ricky Savjani,Daniel Low,William Hsu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deformable registration is crucial in medical imaging. Several existing applications include lesion tracking, probabilistic atlas generation, and treatment response evaluation. However, current methods often lack robustness and generalizability across two key factors: spatial resolution and differences in anatomical coverage. We jointly optimize a registration field and a learnable dimensionality reduction module so that compressed VFM embeddings remain registration-relevant, and fuse these global semantic cues with MIND local descriptors. GLIDE-Reg achieves average dice similarity coefficients (DSC) across 6 anatomical structures of 0.859, 0.862, and 0.901 in two public cohorts (Lung250M and NLST) and one institutional cohort (UCLA5DCT), and outperforms the state-of-the-art DEEDS (0.834, 0.858, 0.900) with relative improvements of 3.0%, 0.5%, and 0.1%. For target registration errors, GLIDE-Reg achieves 1.58 mm on Lung250M landmarks (compared to 1.25 mm for corrField and 1.91 mm for DEEDS) and 1.11 mm on NLST nodule centers (compared to 1.11 mm for DEEDS). The substantiated performance on the nodule centers also demonstrates its robustness across challenging downstream tasks, such as nodule tracking, which is an essential prior step for early-stage lung cancer diagnosis.
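上文评估配准质量的核心指标 DSC(Dice 相似系数)有标准定义:2|A∩B| / (|A|+|B|)。可用如下极简实现示意(纯 Python,二值掩膜展平为 0/1 列表):

```python
def dice(mask_a, mask_b):
    """Dice 相似系数:2|A∩B| / (|A| + |B|),输入为展平的 0/1 掩膜。"""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(mask_a) + sum(mask_b)
    return 2 * inter / total if total else 1.0

assert abs(dice([1, 1, 0, 0], [1, 0, 0, 0]) - 2 / 3) < 1e-12
assert dice([1, 1], [1, 1]) == 1.0
```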
[CV-234] Efficient Flow Matching for Sparse-View CT Reconstruction
【速读】:该论文旨在解决基于扩散模型(Diffusion Models, DM)的CT重建方法在临床和介入场景中效率不足的问题。扩散模型依赖随机微分方程(Stochastic Differential Equations, SDEs)进行前向扩散与反向去噪,其固有的随机性会干扰CT重建中反复执行的数据一致性校正,导致计算效率低下。为此,作者提出一种基于流匹配(Flow Matching, FM)的CT重建框架(FMCT),其核心在于利用FM模型将采样过程建模为确定性常微分方程(Ordinary Differential Equation, ODE),从而天然兼容重复的数据一致性操作。进一步地,作者发现FM预测的速度场在相邻步骤间具有强相关性,据此设计高效变体EFMCT,通过复用先前预测的速度场来显著减少神经网络函数评估次数(NFEs),并从理论上证明了速度场复用引入的误差在结合数据一致性操作时是可控的。实验表明,FMCT/EFMCT在保持重建质量的同时大幅提升了推理效率。
链接: https://arxiv.org/abs/2603.00205
作者: Jiayang Shi,Lincen Yang,Zhong Li,Tristan Van Leeuwen,Daniel M. Pelt,K. Joost Batenburg
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generative models, particularly Diffusion Models (DM), have shown strong potential for Computed Tomography (CT) reconstruction serving as expressive priors for solving ill-posed inverse problems. However, diffusion-based reconstruction relies on Stochastic Differential Equations (SDEs) for forward diffusion and reverse denoising, where such stochasticity can interfere with repeated data consistency corrections in CT reconstruction. Since CT reconstruction is often time-critical in clinical and interventional scenarios, improving reconstruction efficiency is essential. In contrast, Flow Matching (FM) models sampling as a deterministic Ordinary Differential Equation (ODE), yielding smooth trajectories without stochastic noise injection. This deterministic formulation is naturally compatible with repeated data consistency operations. Furthermore, we observe that FM-predicted velocity fields exhibit strong correlations across adjacent steps. Motivated by this, we propose an FM-based CT reconstruction framework (FMCT) and an efficient variant (EFMCT) that reuses previously predicted velocity fields over consecutive steps to substantially reduce the number of Neural network Function Evaluations (NFEs), thereby improving inference efficiency. We provide theoretical analysis showing that the error introduced by velocity reuse is bounded when combined with data consistency operations. Extensive experiments demonstrate that FMCT/EFMCT achieve competitive reconstruction quality while significantly improving computational efficiency compared with diffusion-based methods. The codebase is open-sourced at this https URL.
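EFMCT 的核心思想——相邻步的速度场高度相关、因此可以复用以减少网络前向次数(NFE)——可以用如下欧拉积分示意。速度函数这里是玩具假设(真实系统中为神经网络),且省略了论文中的数据一致性校正:

```python
def sample_ode(x0, velocity, n_steps=20, reuse_every=1):
    """确定性 ODE 采样的欧拉积分:dx/dt = v(x, t),t 从 0 到 1。
    reuse_every > 1 时每 k 步才重新评估一次速度场,其余步骤复用
    缓存值,从而减少函数评估次数 nfe。"""
    x, v, nfe = x0, None, 0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        if i % reuse_every == 0:
            v = velocity(x, i * dt)
            nfe += 1
        x = x + dt * v
    return x, nfe

# 玩具速度场:直线流的速度恒定,此时复用完全无损
x_full, nfe_full = sample_ode(0.0, lambda x, t: 2.0, n_steps=20, reuse_every=1)
x_reuse, nfe_reuse = sample_ode(0.0, lambda x, t: 2.0, n_steps=20, reuse_every=5)
assert abs(x_full - 2.0) < 1e-9 and abs(x_reuse - 2.0) < 1e-9
assert (nfe_full, nfe_reuse) == (20, 4)
```

论文的理论分析表明,当速度场在相邻步间变化缓慢且配合数据一致性操作时,这种复用引入的误差是有界的;玩具例中速度恒定,复用完全无损。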
[CV-235] Optimisation of SOUP-GAN and CSR-GAN for High Resolution MR Images Reconstruction
【速读】:该论文旨在解决磁共振成像(Magnetic Resonance Imaging, MRI)中因运动伪影和设备限制导致的图像质量下降问题,从而提升后续疾病诊断的准确性。解决方案的关键在于提出并优化两种生成式对抗网络(Generative Adversarial Networks, GANs)模型——SOUP-GAN 和 CSR-GAN,其核心改进包括:对生成器和判别器结构进行深度扩展(增加卷积层)、优化滤波器尺寸、引入 LeakyReLU 激活函数以改善梯度流动、采用谱归一化(Spectral Normalization)缓解模式崩溃并增强训练稳定性,以及通过超参数调优(如降低学习率和调整批量大小)实现更高效的训练过程。实验表明,CSR-GAN 在高频细节重建和噪声抑制方面表现最优(PSNR 34.6,SSIM 0.89),而 SOUP-GAN 则在保持结构完整性的同时提供更干净的图像(PSNR 34.4,SSIM 0.83),验证了改进后的 GAN 模型在 MRI 图像质量增强中的有效性。
链接: https://arxiv.org/abs/2603.00204
作者: Muneeba Rashid,Hina Shakir,Humaira Mehwish,Asarim Amir,Reema Qaiser Khan
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Magnetic Resonance (MR) imaging is a diagnostic tool used in modern medicine; however, its output can be affected by motion artefacts and may be limited by equipment. This research focuses on MRI image quality enhancement using two efficient Generative Adversarial Networks (GANs) models: SOUP-GAN and CSR-GAN. In both models, meaningful architectural modifications were introduced. The generator and discriminator of each were further deepened by adding convolutional layers and were enhanced in filter sizes as well. The LeakyReLU activation function was used to improve gradient flow, and hyperparameter tuning strategies were applied, including a reduced learning rate and an optimal batch size. Moreover, spectral normalisation was proposed to address mode collapse and improve training stability. The experiment shows that CSR-GAN has better performance in reconstructing the image with higher frequency details and reducing noise compared to other methods, with an optimised PSNR of 34.6 and SSIM of 0.89. However, SOUP-GAN performed the best in terms of delivering less noisy images with good structures, achieving a PSNR of 34.4 and SSIM of 0.83. The obtained results indicate that the proposed enhanced GAN model can be a useful tool for MR image quality improvement for subsequent better disease diagnostics.
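文中用谱归一化(spectral normalisation)缓解模式崩溃:其核心是用幂迭代估计权重矩阵的最大奇异值并据此缩放权重。下面是一个纯 Python 的幂迭代示意(实际训练中通常直接使用 PyTorch 的 torch.nn.utils.spectral_norm):

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def spectral_norm(W, n_iters=100):
    """幂迭代估计 W 的最大奇异值(在 W^T W 上迭代)。"""
    Wt = [list(col) for col in zip(*W)]
    u = [1.0] * len(W[0])
    for _ in range(n_iters):
        u = matvec(Wt, matvec(W, u))
        norm = sum(x * x for x in u) ** 0.5
        u = [x / norm for x in u]
    Wu = matvec(W, u)
    return sum(x * x for x in Wu) ** 0.5

W = [[3.0, 0.0], [0.0, 1.0]]
sigma = spectral_norm(W)
W_sn = [[w / sigma for w in row] for row in W]  # 谱归一化后的权重
assert abs(sigma - 3.0) < 1e-6
assert abs(spectral_norm(W_sn) - 1.0) < 1e-6
```

归一化后权重矩阵的谱范数被约束到 1,从而限制判别器的 Lipschitz 常数,这是其稳定 GAN 训练的机制。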
[CV-236] Multimodal Modular Chain of Thoughts in Energy Performance Certificate Assessment
【速读】:该论文旨在解决在缺乏可扩展的能源性能证书(Energy Performance Certificate, EPC)评估体系的地区,如何通过有限的视觉信息实现建筑能源性能的低成本、自动化预评估问题。解决方案的关键在于提出一种多模态模块化思维链(Multimodal Modular Chain of Thoughts, MMCoT)架构,该架构将EPC估算分解为多个中间推理阶段,并通过结构化提示(structured prompting)显式传播推断出的属性以跨任务传递知识,从而提升模型对EPC评级有序结构的捕捉能力,在数据稀缺场景下实现更准确的预评估。
链接: https://arxiv.org/abs/2603.00115
作者: Zhen Peng,Peter J. Bentley
机构: 未知
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate evaluation of building energy performance remains challenging in regions where scalable Energy Performance Certificate (EPC) assessments are unavailable. This paper presents a cost-efficient framework that leverages Vision-Language models for automated EPC pre-assessment from limited visual information. The proposed Multimodal Modular Chain of Thoughts (MMCoT) architecture decomposes EPC estimation into intermediate reasoning stages and explicitly propagates inferred attributes across tasks using structured prompting. Experiments on a multimodal dataset of 81 residential properties in the United Kingdom show that MMCoT achieves statistically significant improvements over instruction-only prompting for EPC estimation. Analysis based on accuracy, recall, mean absolute error, and confusion matrices indicate that the proposed approach captures the ordinal structure of EPC ratings, with most errors occurring between adjacent classes. These results suggest that modular prompt-based reasoning offers a promising direction for low-cost EPC pre-assessment in data-scarce settings.
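MMCoT 的要点在于把 EPC 估算拆成多个中间推理阶段,并把前序阶段推断出的属性显式传给后续阶段。下面用一个极简流水线示意这种属性传播方式(阶段名、属性名与映射规则均为假设,并非论文的真实提示模板):

```python
def run_pipeline(stages, observation):
    """模块化推理链(示意):每个阶段读取观测与此前推断出的属性,
    并写入自己的推断结果,后续阶段可显式利用这些中间属性。"""
    attrs = {}
    for stage in stages:
        attrs.update(stage(observation, attrs))
    return attrs

# 以下阶段与属性名均为假设,仅演示属性在阶段间的传播
def infer_glazing(obs, attrs):
    return {"glazing": "double" if "double" in obs else "single"}

def infer_band(obs, attrs):
    return {"epc_band": "C" if attrs["glazing"] == "double" else "E"}

result = run_pipeline([infer_glazing, infer_band], "double-glazed house photo")
assert result == {"glazing": "double", "epc_band": "C"}
```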
人工智能
[AI-0] Conformal Policy Control
【速读】:该论文旨在解决高风险环境中智能体在探索新行为时如何平衡安全性与性能提升的问题。在这些场景中,违反安全约束的行为可能导致严重后果,迫使智能体被中断;而过度保守地模仿旧策略则会抑制探索,阻碍性能优化。解决方案的关键在于利用任意一个安全参考策略(safe reference policy)作为概率调节器,对未经测试的优化策略进行动态调整:通过基于安全策略数据的分位数校准(conformal calibration),确定新策略可采取的激进程度,并严格满足用户声明的风险阈值。该方法无需假设用户已识别正确的模型类别或调优超参数,且首次在非单调有界约束函数下提供了有限样本保证,从而实现从部署之初即安全地探索并持续提升性能。
链接: https://arxiv.org/abs/2603.02196
作者: Drew Prinster,Clara Fannjiang,Ji Won Park,Kyunghyun Cho,Anqi Liu,Suchi Saria,Samuel Stanton
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
备注:
Abstract:An agent must try new behaviors to explore and improve. In high-stakes environments, an agent that violates safety constraints may cause harm and must be taken offline, curtailing any future interaction. Imitating old behavior is safe, but excessive conservatism discourages exploration. How much behavior change is too much? We show how to use any safe reference policy as a probabilistic regulator for any optimized but untested policy. Conformal calibration on data from the safe policy determines how aggressively the new policy can act, while provably enforcing the user’s declared risk tolerance. Unlike conservative optimization methods, we do not assume the user has identified the correct model class nor tuned any hyperparameters. Unlike previous conformal methods, our theory provides finite-sample guarantees even for non-monotonic bounded constraint functions. Our experiments on applications ranging from natural language question answering to biomolecular engineering show that safe exploration is not only possible from the first moment of deployment, but can also improve performance.
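上文的核心机制是:在安全参考策略的数据上做共形校准(conformal calibration),得到一个满足用户风险容忍度的阈值,用它来限制新策略的激进程度。分割共形校准中的有限样本分位数可示意如下(校准分数数值为假设;论文针对非单调有界约束函数的具体构造更复杂):

```python
import math

def conformal_quantile(scores, alpha):
    """分割共形校准的有限样本分位数:取第 ceil((n+1)(1-alpha)) 小的
    校准分数,使新样本分数超过该阈值的概率至多约为 alpha。"""
    n = len(scores)
    k = min(math.ceil((n + 1) * (1 - alpha)), n)
    return sorted(scores)[k - 1]

# 假设:在安全参考策略下观测到的约束分数
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
tau = conformal_quantile(cal_scores, alpha=0.2)  # 风险容忍度 20%
assert tau == 0.9
# 示意规则:新策略的动作只有在预测约束分数不超过 tau 时才被放行
```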
[AI-1] Symbol-Equivariant Recurrent Reasoning Models
【速读】:该论文旨在解决神经网络在处理结构化推理问题(如数独和ARC-AGI)时面临的挑战,尤其是现有递归推理模型(RRMs)对符号对称性(symbol symmetry)的处理依赖于昂贵的数据增强,难以实现泛化与鲁棒性提升。其解决方案的关键在于提出符号等变递归推理模型(Symbol-Equivariant Recurrent Reasoning Models, SE-RRMs),通过在架构层面引入符号等变层(symbol-equivariant layers),显式地强制模型满足排列等变性(permutation equivariance),从而保证在符号或颜色置换下输出结果保持一致。这一设计显著提升了模型在不同规模数独实例(4x4至25x25)上的外推能力,并在ARC-AGI任务中以更少的数据增强和参数量(仅200万)实现了具有竞争力的性能,验证了显式编码对称性对神经推理系统可扩展性和鲁棒性的关键作用。
链接: https://arxiv.org/abs/2603.02193
作者: Richard Freinschlag,Timo Bertram,Erich Kobler,Andreas Mayr,Günter Klambauer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Reasoning problems such as Sudoku and ARC-AGI remain challenging for neural networks. The structured problem solving architecture family of Recurrent Reasoning Models (RRMs), including Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM), offer a compact alternative to large language models, but currently handle symbol symmetries only implicitly via costly data augmentation. We introduce Symbol-Equivariant Recurrent Reasoning Models (SE-RRMs), which enforce permutation equivariance at the architectural level through symbol-equivariant layers, guaranteeing identical solutions under symbol or color permutations. SE-RRMs outperform prior RRMs on 9x9 Sudoku and generalize from just training on 9x9 to smaller 4x4 and larger 16x16 and 25x25 instances, to which existing RRMs cannot extrapolate. On ARC-AGI-1 and ARC-AGI-2, SE-RRMs achieve competitive performance with substantially less data augmentation and only 2 million parameters, demonstrating that explicitly encoding symmetry improves the robustness and scalability of neural reasoning. Code is available at this https URL.
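SE-RRM 保证的性质是排列等变性:对符号(颜色)做置换,输出会被同等置换。下面用一个 DeepSets 风格的单层玩具函数验证这一性质(该层结构为示意,并非论文的真实架构):

```python
def equivariant_layer(x, a=2.0, b=-1.0):
    """符号等变层(示意):每个符号通道由自身值加上所有通道的
    对称函数(这里取均值)更新,因此置换输入通道会同等置换输出。"""
    m = sum(x) / len(x)
    return [a * xi + b * m for xi in x]

x = [1.0, 2.0, 3.0]
perm = [2, 0, 1]
out_of_permuted = equivariant_layer([x[i] for i in perm])   # f(Px)
permuted_output = [equivariant_layer(x)[i] for i in perm]   # P f(x)
assert all(abs(u - v) < 1e-12 for u, v in zip(out_of_permuted, permuted_output))
```

由这类层堆叠而成的网络在架构层面即满足 f(Px) = P f(x),无需依赖符号置换的数据增强。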
[AI-2] MAC: A Conversion Rate Prediction Benchmark Featuring Labels Under Multiple Attribution Mechanisms
【速读】:该论文旨在解决多归属学习(Multi-attribution learning, MAL)在转化率预测(CVR prediction)任务中因缺乏来自多种归属机制的标签数据而难以发展的瓶颈问题。现有公开CVR数据集仅提供单一归属机制生成的转化标签,限制了MAL方法的研究与验证。为此,作者构建了首个包含多归属机制标签的公开基准数据集Multi-Attribution Benchmark (MAC),并开发了PyMAL开源库以支持可复现的研究。关键解决方案在于提出Mixture of Asymmetric Experts (MoAE)模型,其核心思想是:一方面通过不对称专家结构充分学习多归属知识,另一方面将所学知识以主任务导向的方式有效利用,从而实现对主任务性能的显著提升。实验表明,MoAE优于当前最先进的MAL方法,且研究揭示了MAL在复杂目标下性能增长规律及辅助目标选择的重要性,为未来MAL研究提供了重要指导。
链接: https://arxiv.org/abs/2603.02184
作者: Jinqi Wu,Sishuo Chen,Zhangming Chan,Yong Bai,Lei Zhang,Sheng Chen,Chenghuan Hou,Xiang-Rong Sheng,Han Zhu,Jian Xu,Bo Zheng,Chaoyou Fu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Code and data available at this https URL
Abstract:Multi-attribution learning (MAL), which enhances model performance by learning from conversion labels yielded by multiple attribution mechanisms, has emerged as a promising learning paradigm for conversion rate (CVR) prediction. However, the conversion labels in public CVR datasets are generated by a single attribution mechanism, hindering the development of MAL approaches. To address this data gap, we establish the Multi-Attribution Benchmark (MAC), the first public CVR dataset featuring labels from multiple attribution mechanisms. Besides, to promote reproducible research on MAL, we develop PyMAL, an open-source library covering a wide array of baseline methods. We conduct comprehensive experimental analyses on MAC and reveal three key insights: (1) MAL brings consistent performance gains across different attribution settings, especially for users featuring long conversion paths. (2) The performance growth scales up with objective complexity in most settings; however, when predicting first-click conversion targets, simply adding auxiliary objectives is counterproductive, underscoring the necessity of careful selection of auxiliary objectives. (3) Two architectural design principles are paramount: first, to fully learn the multi-attribution knowledge, and second, to fully leverage this knowledge to serve the main task. Motivated by these findings, we propose Mixture of Asymmetric Experts (MoAE), an effective MAL approach incorporating multi-attribution knowledge learning and main task-centric knowledge utilization. Experiments on MAC show that MoAE substantially surpasses the existing state-of-the-art MAL method. We believe that our benchmark and insights will foster future research in the MAL field. Our MAC benchmark and the PyMAL algorithm library are publicly available at this https URL. 
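多归属学习(MAL)的基本形式是:在主归属机制的转化标签之外,引入其他归属机制的标签作为降权的辅助目标。可示意如下(键名与权重均为假设,仅说明摘要中"主任务导向"的损失组合方式,并非 MoAE 的真实结构):

```python
import math

def bce(p, y):
    eps = 1e-9
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def mal_loss(preds, labels, main_key, aux_weight=0.3):
    """多归属学习损失(示意):主归属机制的损失 + 降权的辅助归属损失。"""
    main = bce(preds[main_key], labels[main_key])
    aux = sum(bce(preds[k], labels[k]) for k in preds if k != main_key)
    return main + aux_weight * aux

preds = {"last_click": 0.9, "first_click": 0.6}
labels = {"last_click": 1, "first_click": 1}
loss = mal_loss(preds, labels, main_key="last_click")
assert loss > 0
```

摘要的发现之一正是:辅助目标并非越多越好(如预测 first-click 目标时盲目叠加反而有害),辅助权重与目标选择需要谨慎。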
[AI-3] Reservoir Subspace Injection for Online ICA under Top-n Whitening
【速读】:该论文旨在解决在线独立成分分析(Online ICA)在非线性混合场景下性能受限的问题,特别是传统top-𝑛白化方法因丢弃注入特征而导致信号保真度下降的瓶颈。其核心解决方案是提出“储层子空间注入”(Reservoir Subspace Injection, RSI)机制,关键在于确保注入特征仅进入保留的主成分子空间而不干扰原始信号的直通方向(passthrough directions)。通过引入RSI诊断指标(IER、SSO、ρₓ)识别失败模式,并设计受保护的RSI控制器,在维持直通保留的前提下恢复性能,使RE-OICA相比传统在线ICA在非线性混合下提升+1.7 dB,并在超高斯基准测试中实现正向信号分离信噪比(SI-SDR)改善+0.6 dB。
链接: https://arxiv.org/abs/2603.02178
作者: Wenjun Xiao,Yuda Bi,Vince D Calhoun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Reservoir expansion can improve online independent component analysis (ICA) under nonlinear mixing, yet top-$n$ whitening may discard injected features. We formalize this bottleneck as reservoir subspace injection (RSI): injected features help only if they enter the retained eigenspace without displacing passthrough directions. RSI diagnostics (IER, SSO, $\rho_x$) identify a failure mode in our top-$n$ setting: stronger injection increases IER but crowds out passthrough energy ($\rho_x: 1.00 \rightarrow 0.77$), degrading SI-SDR by up to 2.2 dB. A guarded RSI controller preserves passthrough retention and recovers mean performance to within 0.1 dB of the baseline $1/N$ scaling. With passthrough preserved, RE-OICA improves over vanilla online ICA by +1.7 dB under nonlinear mixing and achieves positive SI-SDR$_{\mathrm{sc}}$ on the tested super-Gaussian benchmark (+0.6 dB).
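摘要中的"挤出"失败模式可以用一个极简诊断来示意:top-n 白化只保留能量最高的 n 个方向,注入方向能量越强,直通方向被挤出得越多(此处的比例计算仅为 ρ_x 诊断思想的示意,论文的具体定义可能不同):

```python
def passthrough_ratio(passthrough_energy, injected_energy, n):
    """top-n 白化下直通方向能量的保留比例(示意版 rho_x 诊断):
    只保留能量最高的 n 个方向,计算直通能量被保留的比例。"""
    tagged = [(e, "p") for e in passthrough_energy] + \
             [(e, "i") for e in injected_energy]
    kept = sorted(tagged, reverse=True)[:n]
    kept_p = sum(e for e, tag in kept if tag == "p")
    return kept_p / sum(passthrough_energy)

# 注入较弱:直通能量几乎全部保留
assert passthrough_ratio([4.0, 3.0, 2.0, 1.0], [5.0, 0.5], n=4) == 0.9
# 注入更强:高能注入方向"挤出"了直通方向(crowding out)
assert passthrough_ratio([4.0, 3.0, 2.0, 1.0], [5.0, 5.0], n=4) == 0.7
```

受保护的 RSI 控制器所做的,正是在这一比例跌破阈值前限制注入强度。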
[AI-4] SageBwd: A Trainable Low-bit Attention
【速读】:该论文旨在解决低比特注意力机制(low-bit attention)在训练阶段性能不佳的问题,特别是针对SageBwd这一可训练INT8注意力方法在预训练过程中与全精度注意力(Full-Precision Attention, FPA)之间存在的性能差距。其关键解决方案在于通过系统性的实验和理论分析发现:(1) 在每步处理大量token时,QK归一化(QK-norm)对于训练稳定性至关重要;(2) 量化误差主要来源于反向传播中的分数梯度dS;(3) 减少每步token数量可使SageBwd在预训练中达到FPA性能水平;(4) K-smoothing对训练稳定仍必不可少,而Q-smoothing在预训练中作用有限。这些发现揭示了低比特注意力训练不稳定的根本原因,并提供了可操作的优化策略以实现与全精度相当的训练效果。
链接: https://arxiv.org/abs/2603.02170
作者: Jintao Zhang,Marco Chen,Haoxu Wang,Kai Jiang,Ion Stoica,Joseph E. Gonzalez,Jianfei Chen,Jun Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Low-bit attention, such as SageAttention, has emerged as an effective approach for accelerating model inference, but its applicability to training remains poorly understood. In prior work, we introduced SageBwd, a trainable INT8 attention that quantizes six of seven attention matrix multiplications while preserving fine-tuning performance. However, SageBwd exhibited a persistent performance gap to full-precision attention (FPA) during pre-training. In this work, we investigate why this gap occurs and demonstrate that SageBwd matches full-precision attention during pretraining. Through experiments and theoretical analysis, we reach a few important insights and conclusions: (i) QK-norm is necessary for stable training at large tokens per step, (ii) quantization errors primarily arise from the backward-pass score gradient dS, (iii) reducing tokens per step enables SageBwd to match FPA performance in pre-training, and (iv) K-smoothing remains essential for training stability, while Q-smoothing provides limited benefit during pre-training.
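摘要的结论(iv)强调 K-smoothing 对训练稳定至关重要:先减去 K 的每通道均值再做 INT8 量化,可去除跨 token 共享的离群分量、降低量化误差。下面是这一思路的纯 Python 示意(对称 per-tensor 量化为简化假设,SageBwd 的实际量化粒度与此不同):

```python
def quantize_int8(xs):
    """对称 per-tensor INT8 量化:scale = max|x| / 127。"""
    m = max(abs(x) for x in xs)
    scale = m / 127.0 if m else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def k_smooth(k_rows):
    """K-smoothing(示意):量化前减去 K 的每通道均值。"""
    n = len(k_rows)
    means = [sum(col) / n for col in zip(*k_rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in k_rows]
    return centered, means

k_rows = [[10.0, 0.0], [12.0, 2.0]]  # 两个 token、两个通道(数值为假设)
centered, means = k_smooth(k_rows)
assert means == [11.0, 1.0] and centered == [[-1.0, -1.0], [1.0, 1.0]]
q, scale = quantize_int8(centered[0] + centered[1])
dequant = [qi * scale for qi in q]
assert all(abs(d - c) < 1e-9 for d, c in zip(dequant, [-1.0, -1.0, 1.0, 1.0]))
```

去均值后数值动态范围从 [0, 12] 收缩到 [-1, 1],量化网格因此细得多;被减去的均值项可在注意力计算中解析地补回。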
[AI-5] How Small Can 6G Reason ? Scaling Tiny Language Models for AI-Native Networks
【速读】:该论文旨在解决6G网络中AI原生架构下语义推理模型部署效率与性能之间的矛盾问题,即如何在资源受限的边缘计算环境中实现高可靠性的高层语义决策能力。其关键解决方案在于通过系统性实证研究揭示了紧凑型语言模型(Compact Language Models)在参数规模与推理稳定性、计算效率之间的非线性关系:发现1.5B至3B参数量级的中等规模模型在确定性准确率(pass@1)和边缘资源消耗(延迟与内存)之间达到最优平衡,显著优于极小模型(<1B)和大模型(>3B),从而为AI-native 6G系统的语义推理层提供了可落地的部署策略。
链接: https://arxiv.org/abs/2603.02156
作者: Mohamed Amine Ferrag,Abderrahmane Lakas,Merouane Debbah
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:
Abstract:Emerging 6G visions, reflected in ongoing standardization efforts within 3GPP, IETF, ETSI, ITU-T, and the O-RAN Alliance, increasingly characterize networks as AI-native systems in which high-level semantic reasoning layers operate above standardized control and data-plane functions. Although frontier-scale large language models (LLMs) such as Qwen2.5-7B and Olmo-3-7B demonstrate strong reasoning capability, their computational footprint limits deployment in latency-sensitive, edge-native infrastructures. This paper presents a systematic empirical study of the scaling behavior and deployment efficiency of compact language models for network-level semantic reasoning in AI-native 6G systems. Using 6G-Bench, a standardization-aligned benchmark comprising 30 decision-making tasks across five capability domains, we evaluate models ranging from 135M (SmolLM2-135M) to 7B parameters (Qwen2.5-7B), including mid-scale architectures such as Llama-3.2-1B, Granite-1B, and Qwen2.5-3B. Deterministic accuracy (pass@1) increases from 0.224 at 135M to 0.707 at 7B, but scaling gains are highly non-uniform. A pronounced stability transition occurs in the 1 to 1.5B range, where accuracy rises from 0.373 (Llama-3.2-1B) to 0.531 (Qwen2.5-1.5B) and the instability gap Delta_5 contracts from 0.356 to 0.138. Beyond 3B parameters, improvements diminish (+0.064 from 3B to 7B). Through single-query inference profiling and an Edge Score metric that normalizes accuracy by latency and memory footprint, we show that semantic reliability per unit edge resource does not scale monotonically with parameter count. Instead, mid-scale models (approximately 1.5 to 3B) achieve the most favorable balance between deterministic stability and computational efficiency, providing deployment-relevant guidance for AI-native 6G architectures. All scripts and results are publicly available at this https URL
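摘要中的 Edge Score 把准确率按延迟与内存占用归一化;其精神可用如下示意(具体公式为假设,资源数值也为假设,仅用于说明"单位边缘资源的语义可靠性并不随参数量单调上升"):

```python
def edge_score(accuracy, latency_s, memory_gb):
    """Edge Score(示意定义):单位边缘资源(延迟 × 内存)换得的准确率。"""
    return accuracy / (latency_s * memory_gb)

# 资源数值为假设;准确率取自摘要(7B: 0.707,1.5B: 0.531)
big = edge_score(accuracy=0.707, latency_s=4.0, memory_gb=14.0)
mid = edge_score(accuracy=0.531, latency_s=1.0, memory_gb=3.0)
assert mid > big  # 中等规模模型的单位资源收益更高
```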
[AI-6] Near-Optimal Regret for KL-Regularized Multi-Armed Bandits
【速读】:该论文旨在解决KL-regularized多臂老虎机(Multi-armed Bandits, MABs)在线学习中的统计效率问题,特别是针对其 regret 上界尚未完全刻画的现状。此前研究虽表明KL正则化可带来更快收敛速率或对数级 regret,但缺乏对参数 $K$(臂的数量)、$\eta$(正则强度倒数)和 $T$(时间跨度)之间精确依赖关系的高概率上界分析。论文的关键解决方案是提出一种新颖的"剥皮论证"(peeling argument),用于对KL-UCB算法进行精细分析,从而首次获得一个关于 $K$ 线性依赖的高概率 regret 上界 $\tilde{O}(\eta K \log^2 T)$;同时通过构造困难实例并结合贝叶斯先验的定制分解,建立了首个非平凡的下界 $\Omega(\eta K \log T)$,证明了该上界的近似紧性。此外,论文还揭示在低正则化区域(大 $\eta$)时,regret 与 $\eta$ 无关且趋于 $\tilde{\Theta}(\sqrt{KT})$,从而实现了对所有 $\eta$ 参数区间下KL-regularized MABs 的完整理论刻画。
链接: https://arxiv.org/abs/2603.02155
作者: Kaixuan Ji,Qingyue Zhao,Heyang Zhao,Qiwei Di,Quanquan Gu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
备注:
Abstract:Recent studies have shown that reinforcement learning with KL-regularized objectives can enjoy faster rates of convergence or logarithmic regret, in contrast to the classical $\sqrt{T}$-type regret in the unregularized setting. However, the statistical efficiency of online learning with respect to KL-regularized objectives remains far from completely characterized, even when specialized to multi-armed bandits (MABs). We address this problem for MABs via a sharp analysis of KL-UCB using a novel peeling argument, which yields a $\tilde{O}(\eta K\log^2 T)$ upper bound: the first high-probability regret bound with linear dependence on $K$. Here, $T$ is the time horizon, $K$ is the number of arms, $\eta^{-1}$ is the regularization intensity, and $\tilde{O}$ hides all logarithmic factors except those involving $\log T$. The near-tightness of our analysis is certified by the first non-constant lower bound $\Omega(\eta K \log T)$, which follows from subtle hard-instance constructions and a tailored decomposition of the Bayes prior. Moreover, in the low-regularization regime (i.e., large $\eta$), we show that the KL-regularized regret for MABs is $\eta$-independent and scales as $\tilde{\Theta}(\sqrt{KT})$. Overall, our results provide a thorough understanding of KL-regularized MABs across all regimes of $\eta$ and yield nearly optimal bounds in terms of $K$, $\eta$, and $T$.
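文中分析的 KL-UCB 指数有经典的可计算形式:取满足 n·KL(mean, q) ≤ log(t) 的最大 q,对伯努利臂可用二分求解。下面是一个标准实现的极简示意(探索预算取最简单的 log(t),不含论文分析中的修正项):

```python
import math

def kl_bernoulli(p, q):
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, n_pulls, t):
    """KL-UCB 指数:满足 n * KL(mean, q) <= log(t) 的最大 q,二分求解。"""
    budget = math.log(t)
    lo, hi = mean, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if n_pulls * kl_bernoulli(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

few = kl_ucb_index(0.5, n_pulls=10, t=100)
many = kl_ucb_index(0.5, n_pulls=1000, t=100)
assert 0.5 < many < few < 1.0  # 拉动次数越多,置信上界越紧
```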
[AI-7] Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂推理任务中缺乏可验证性与过程监督机制的问题。现有评估方法多依赖最终答案正确性,难以定位错误来源并支持细粒度的训练优化。为此,作者提出Pencil Puzzle Bench框架,利用铅笔谜题(pencil puzzles)这一类约束满足问题(constraint-satisfaction problems),其解空间具有确定性和唯一性,并支持每一步状态的局部规则校验。该方案的关键在于:通过为每种谜题类型设计特定约束条件,实现对中间状态的逐层验证,从而提供密集的、基于动作的奖励信号,支撑过程监督和强化学习。这种机制不仅能够精确识别错误发生的具体步骤,还显著提升了模型在迭代推理(agentic iteration)中的表现,揭示了推理努力规模与代理式迭代能力两大关键维度的提升路径。
链接: https://arxiv.org/abs/2603.02119
作者: Justin Waugh
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
备注:
Abstract:We introduce Pencil Puzzle Bench, a framework for evaluating large language model reasoning through pencil puzzles, a family of constraint-satisfaction problems closely related to NP-complete problems, with deterministic, step-level verification. From a database of 62,231 puzzles across 94 varieties with verified unique solutions, we select a benchmark of 300 puzzles spanning 20 varieties and evaluate 51 models from 11 providers in two modes: direct ask (single-shot) and agentic (multi-turn with iterative verification). A key differentiator of our benchmark is that every intermediate board state can be checked against variety-specific constraints, localizing errors to the exact rule violated, providing the infrastructure for dense, per-move reward signals for process supervision and reinforcement learning. Our evaluation reveals two distinct axes of capability: (1) reasoning effort scaling, where GPT-5.2 improves 81x from no reasoning to maximum effort; and (2) agentic iteration, where Claude Opus 4.6 rises from 0.3% to 30.0% through iterative checking, while GPT-5.2@xhigh improves from 20.2% to 56.0%. Agentic attempts span a median of 29 turns over 17 minutes, with the longest exceeding 1,221 turns and 14.3 hours - a demanding test of long-context utilization, not just reasoning.
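基准的关键能力是把错误定位到被违反的具体规则,从而提供逐步(per-move)奖励信号。以数独为例,一步落子的规则校验可示意如下(通用数独规则,非该基准的真实校验器代码):

```python
def violated_rule(board, r, c, v):
    """检查在 (r, c) 填入 v 是否违反行/列/宫约束,
    返回被违反的规则名,合法则返回 None。"""
    n = len(board)           # 标准数独 n = 9,此处用 4x4 演示
    b = int(n ** 0.5)
    if any(board[r][j] == v for j in range(n) if j != c):
        return "row"
    if any(board[i][c] == v for i in range(n) if i != r):
        return "column"
    r0, c0 = r - r % b, c - c % b
    if any(board[i][j] == v for i in range(r0, r0 + b)
           for j in range(c0, c0 + b) if (i, j) != (r, c)):
        return "box"
    return None

board = [[1, 2, 0, 0], [3, 4, 0, 0], [0, 0, 3, 0], [0, 0, 0, 0]]  # 4x4 数独
assert violated_rule(board, 0, 2, 1) == "row"
assert violated_rule(board, 2, 0, 1) == "column"
assert violated_rule(board, 3, 3, 3) == "box"
assert violated_rule(board, 2, 0, 2) is None
```

每步返回被违反的规则名(或 None),正是过程监督与强化学习所需的密集、可验证的奖励信号形式。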
[AI-8] Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons
【速读】:该论文旨在解决通用机器人奖励模型在大规模数据集上训练时面临的挑战,即传统方法依赖专家示范进行帧级进度预测,仅提供局部监督信号,在存在大量失败和次优轨迹的场景下难以有效标注密集进度标签,导致模型泛化能力受限。解决方案的关键在于提出Robometer框架,通过双目标联合优化实现可扩展的奖励建模:一方面利用帧级进度损失锚定专家数据上的奖励幅度,另一方面引入轨迹对比偏好损失,对同一任务的不同轨迹施加全局排序约束,从而有效利用真实和增强的失败轨迹信息,显著提升奖励函数的泛化性和下游机器人学习性能。
链接: https://arxiv.org/abs/2603.02115
作者: Anthony Liang,Yigit Korkmaz,Jiahui Zhang,Minyoung Hwang,Abrar Anwar,Sidhant Kaushik,Aditya Shah,Alex S. Huang,Luke Zettlemoyer,Dieter Fox,Yu Xiang,Anqi Li,Andreea Bobu,Abhishek Gupta,Stephen Tu,Erdem Biyik,Jesse Zhang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 33 pages, 17 figures
Abstract:General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations, providing only local, frame-level supervision. While effective for expert demonstrations, this paradigm scales poorly to large-scale robotics datasets where failed and suboptimal trajectories are abundant and assigning dense progress labels is ambiguous. We introduce Robometer, a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Robometer is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints across trajectories of the same task, enabling effective learning from both real and augmented failed trajectories. To support this formulation at scale, we curate RBM-1M, a reward-learning dataset comprising over one million trajectories spanning diverse robot embodiments and tasks, including substantial suboptimal and failure data. Across benchmarks and real-world evaluations, Robometer learns more generalizable reward functions than prior methods and improves robot learning performance across a diverse set of downstream applications. Code, model weights, and videos at this https URL.
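Robometer 的双目标可以拆成两部分:专家数据上的帧级进度回归,以及同任务轨迹间的排序偏好损失。下面是这两项损失的极简示意(打分函数的形式为假设,真实模型中由神经网络给出):

```python
import math

def progress_loss(pred, target):
    """帧级进度回归损失(MSE),锚定专家轨迹上的奖励幅度。"""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def preference_loss(score_better, score_worse):
    """轨迹对比偏好损失(Bradley-Terry 形式):同一任务中更优轨迹
    (如成功 vs 失败)的得分应更高。"""
    return math.log(1 + math.exp(-(score_better - score_worse)))

total = progress_loss([0.1, 0.5, 0.9], [0.0, 0.5, 1.0]) + preference_loss(0.8, 0.2)
assert preference_loss(2.0, 0.0) < preference_loss(0.0, 2.0)  # 排序约束
assert total > 0
```

偏好项只依赖轨迹间的相对排序,因而失败与次优轨迹无需密集进度标签即可参与训练,这正是该方法可扩展到 RBM-1M 规模的原因。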
[AI-9] On the Rate of Convergence of GD in Non-linear Neural Networks: An Adversarial Robustness Perspective
【速读】:该论文旨在解决梯度下降(Gradient Descent, GD)在非线性神经网络中收敛速率的理论瓶颈问题,特别是在一个最小化的二分类设定下——即使用两神经元的ReLU网络和两个训练样本时,GD优化鲁棒性边界(robustness margin)的收敛速度是否受限。研究发现,尽管GD能最终收敛到最优鲁棒性边界(即最大化决策边界与训练点之间的距离),其收敛速率却极其缓慢,严格为Θ(1/ln(t)),这是首个针对非线性模型中鲁棒性边界收敛速率的显式下界。解决方案的关键在于对GD轨迹的精细化分析,特别是通过追踪模型在不同激活模式(activation patterns)下的动态演化,建立对决策边界轨迹的紧致控制,从而克服由非线性结构带来的主要技术挑战。
链接: https://arxiv.org/abs/2603.02095
作者: Guy Smorodinsky,Sveta Gimpleson,Itay Safran
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We study the convergence dynamics of Gradient Descent (GD) in a minimal binary classification setting, consisting of a two-neuron ReLU network and two training instances. We prove that even under these strong simplifying assumptions, while GD successfully converges to an optimal robustness margin, effectively maximizing the distance between the decision boundary and the training points, this convergence occurs at a prohibitively slow rate, scaling strictly as $\Theta(1/\ln(t))$. To the best of our knowledge, this establishes the first explicit lower bound on the convergence rate of the robustness margin in a non-linear model. Through empirical simulations, we further demonstrate that this inherent failure mode is pervasive, exhibiting the exact same tight convergence rate across multiple natural network initializations. Our theoretical guarantees are derived via a rigorous analysis of the GD trajectories across the distinct activation patterns of the model. Specifically, we develop tight control over the system's dynamics to bound the trajectory of the decision boundary, overcoming the primary technical challenge introduced by the non-linear nature of the architecture.
[AI-10] Scaling Laws of SignSGD in Linear Regression: When Does It Outperform SGD? ICLR2026
【速读】:该论文旨在解决在特定随机特征模型(power-law random features, PLRF)下,符号梯度下降法(signSGD)的缩放规律问题,尤其是其在训练过程中如何受模型规模、训练步数、学习率以及特征和目标衰减参数的影响。解决方案的关键在于通过分析线性模型在高斯压缩特征上使用单遍 signSGD 的群体风险(population risk),识别出 signSGD 独有的两种效应:漂移归一化效应(drift-normalization effect)与噪声重塑效应(noise-reshaping effect)。这些效应揭示了 signSGD 在噪声主导区域可实现比标准 SGD 更陡峭的计算最优缩放斜率,从而提升训练效率;此外,论文进一步指出,当特征衰减快而目标衰减慢时,广泛采用的预热-稳定-衰减(warmup-stable-decay, WSD)学习率调度策略能进一步降低噪声项并显著优化计算最优斜率。
链接: https://arxiv.org/abs/2603.02069
作者: Jihwan Kim,Dogyoon Song,Chulhee Yun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注: Accepted at ICLR 2026, 89 pages, 25 figures
Abstract:We study scaling laws of signSGD under a power-law random features (PLRF) model that accounts for both feature and target decay. We analyze the population risk of a linear model trained with one-pass signSGD on Gaussian-sketched features. We express the risk as a function of model size, training steps, learning rate, and the feature and target decay parameters. Comparing against the SGD risk analyzed by Paquette et al. (2024), we identify a drift-normalization effect and a noise-reshaping effect unique to signSGD. We then obtain compute-optimal scaling laws under the optimal choice of learning rate. Our analysis shows that the noise-reshaping effect can make the compute-optimal slope of signSGD steeper than that of SGD in regimes where noise is dominant. Finally, we observe that the widely used warmup-stable-decay (WSD) schedule further reduces the noise term and sharpens the compute-optimal slope, when feature decay is fast but target decay is slow.
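signSGD 与 SGD 的差别只在于更新时丢弃梯度幅值、只保留符号,这正是文中"漂移归一化"效应的来源。用一维线性回归可以最直观地示意(学习率恒定、未作 WSD 调度,仅为单遍 signSGD 的玩具版本):

```python
def signsgd_fit(x, y, lr=0.01, steps=400):
    """一维线性回归上的 signSGD(示意):只用梯度符号更新参数。"""
    sign = lambda g: (g > 0) - (g < 0)
    w = 0.0
    for i in range(steps):
        xi, yi = x[i % len(x)], y[i % len(y)]
        grad = (w * xi - yi) * xi          # 0.5*(w*x - y)^2 的梯度
        w -= lr * sign(grad)
    return w

w = signsgd_fit([1.0], [2.0])
assert abs(w - 2.0) <= 0.02  # 收敛到最优解附近,最终精度受限于学习率
```

由于步长与梯度幅值无关,signSGD 在最优解附近会以 ±lr 的幅度振荡,这也解释了为何文中的缩放律分析必须显式处理噪声项与学习率调度。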
[AI-11] OpenRad: a Curated Repository of Open-access AI models for Radiology
【速读】:该论文旨在解决放射学领域人工智能(Artificial Intelligence, AI)模型分散在不同平台和来源导致的可发现性差、复现困难及临床转化受限的问题。解决方案的关键在于构建了一个名为OpenRad的标准化、开放获取的放射学AI模型资源库,通过本地部署的大语言模型(Large Language Model, LLM)自动提取并结构化文献中的模型信息,并由专家团队人工校验以确保准确性;同时提供基于关键词搜索与多维度筛选(如成像模态、亚专业领域、用途、验证状态等)的交互式界面,显著提升模型的可访问性与可用性,从而促进放射学AI研究的透明化与协同创新。
链接: https://arxiv.org/abs/2603.02062
作者: Konstantinos Vrettos,Galini Papadaki,Emmanouil Brilakis,Matthaios Triantafyllou,Dimitrios Leventis,Despina Staraki,Maria Mavroforou,Eleftherios Tzanis,Konstantina Giouroukou,Michail E. Klontzas
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages, 5 figures
Abstract:The rapid developments in artificial intelligence (AI) research in radiology have produced numerous models that are scattered across various platforms and sources, limiting discoverability, reproducibility and clinical translation. Herein, OpenRad was created, a curated, standardized, open-access repository that aggregates radiology AI models and providing details such as the availability of pretrained weights and interactive applications. Retrospective analysis of peer reviewed literature and preprints indexed in PubMed, arXiv and Scopus was performed until Dec 2025 (n = 5239 records). Model records were generated using a locally hosted LLM (gpt-oss:120b), based on the RSNA AI Roadmap JSON schema, and manually verified by ten expert reviewers. Stability of LLM outputs was assessed on 225 randomly selected papers using text similarity metrics. A total of 1694 articles were included after review. Included models span all imaging modalities (CT, MRI, X-ray, US) and radiology subspecialties. Automated extraction demonstrated high stability for structured fields (Levenshtein ratio 90%), with 78.5% of record edits being characterized as minor during expert review. Statistical analysis of the repository revealed CNN and transformer architectures as dominant, while MRI was the most commonly used modality (in 621 neuroradiology AI models). Research output was mostly concentrated in China and the United States. The OpenRad web interface enables model discovery via keyword search and filters for modality, subspecialty, intended use, verification status and demo availability, alongside live statistics. The community can contribute new models through a dedicated portal. OpenRad contains approx. 1700 open access, curated radiology AI models with standardized metadata, supplemented with analysis of code repositories, thereby creating a comprehensive, searchable resource for the radiology community.
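The stability check described above reports a Levenshtein ratio over repeated LLM extractions of the same record. A self-contained sketch of that metric (the two example strings are hypothetical, not taken from the OpenRad corpus):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (two-row variant).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def levenshtein_ratio(a: str, b: str) -> float:
    # Similarity in [0, 1]; 1.0 means the two extractions agree exactly.
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# Hypothetical model-record field extracted twice by the LLM.
run1 = "CNN model for chest X-ray pneumonia detection"
run2 = "CNN model for chest Xray pneumonia detection"
print(round(levenshtein_ratio(run1, run2), 3))  # → 0.978
```

A ratio near the paper's reported 90% threshold would flag the two runs as consistent for a structured field.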
[AI-12] Rich Insights from Cheap Signals: Efficient Evaluations via Tensor Factorization
【速读】:该论文旨在解决生成式 AI 模型在细粒度评估中面临的“数据瓶颈”问题,即传统方法依赖大量人工标注(gold-standard labels)成本过高,而自动化评分又常与人类判断不一致。其解决方案的关键在于提出一种基于张量分解(tensor factorization)的统计模型:首先利用低成本的自动评分器(autorater)数据预训练提示(prompt)和生成模型的潜在表示,再通过少量人工标注校准集将这些预训练表示对齐至人类偏好,从而实现样本高效、鲁棒且高精度的个体提示级预测,同时提供可靠的置信区间。
链接: https://arxiv.org/abs/2603.02029
作者: Felipe Maia Polo,Aida Nematzadeh,Virginia Aglietti,Adam Fisch,Isabela Albuquerque
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:Moving beyond evaluations that collapse performance across heterogeneous prompts toward fine-grained evaluation at the prompt level, or within relatively homogeneous subsets, is necessary to diagnose generative models’ strengths and weaknesses. Such fine-grained evaluations, however, suffer from a data bottleneck: human gold-standard labels are too costly at this scale, while automated ratings are often misaligned with human judgment. To resolve this challenge, we propose a novel statistical model based on tensor factorization that merges cheap autorater data with a limited set of human gold-standard labels. Specifically, our approach uses autorater scores to pretrain latent representations of prompts and generative models, and then aligns those pretrained representations to human preferences using a small calibration set. This sample-efficient methodology is robust to autorater quality, more accurately predicts human preferences on a per-prompt basis than standard baselines, and provides tight confidence intervals for key statistical parameters of interest. We also showcase the practical utility of our method by constructing granular leaderboards based on prompt qualities and by estimating model performance solely from autorater scores, eliminating the need for additional human annotations.
[AI-13] Revealing Combinatorial Reasoning of GNNs via Graph Concept Bottleneck Layer
【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在预测过程中缺乏对组合推理机制的可解释性问题,尤其是现有方法仅能提取硬逻辑规则而无法量化每个图概念(graph concept)对预测的贡献。其核心挑战在于GNN黑箱结构使得拓扑模式到逻辑规则的映射难以被准确捕捉。解决方案的关键是提出一种图概念瓶颈层(Graph Concept Bottleneck Layer, GCBM),该层可嵌入任意GNN架构中,强制模型基于所选判别性全局图概念进行预测,并通过稀疏线性层将概念得分投影至类别标签,从而显式建模软逻辑规则(soft logical rule),实现对每个概念贡献的定量分析。此外,作者将概念视为“图词”(graph words)、图视为“图句”(graph sentences),借助语言模型学习图概念嵌入,进一步提升概念瓶颈的质量。
链接: https://arxiv.org/abs/2603.02025
作者: Yue Niu,Zhaokai Sun,Jiayi Yang,Xiaofeng Cao,Rui Fan,Xin Sun,Hanli Wang,Wei Ye
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 pages
Abstract:Despite their success in various domains, the growing dependence on GNNs raises a critical concern about the nature of the combinatorial reasoning underlying their predictions, which is often hidden within their black-box architectures. Addressing this challenge requires understanding how GNNs translate topological patterns into logical rules. However, current works only uncover the hard logical rules over graph concepts, which cannot quantify the contribution of each concept to prediction. Moreover, they are post-hoc interpretable methods that generate explanations after model training and may not accurately reflect the true combinatorial reasoning of GNNs, since they approximate it with a surrogate. In this work, we develop a graph concept bottleneck layer that can be integrated into any GNN architectures to guide them to predict the selected discriminative global graph concepts. The predicted concept scores are further projected to class labels by a sparse linear layer. It enforces the combinatorial reasoning of GNNs’ predictions to fit the soft logical rule over graph concepts and thus can quantify the contribution of each concept. To further improve the quality of the concept bottleneck, we treat concepts as “graph words” and graphs as “graph sentences”, and leverage language models to learn graph concept embeddings. Extensive experiments on multiple datasets show that our method GCBMs achieve state-of-the-art performance both in classification and interpretability.
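The quantifiable "soft logical rule" comes from the sparse linear layer that projects concept scores to class logits. A toy numpy sketch of that projection (the concepts, weights, and scores are invented for illustration and are not the paper's GCBM):

```python
import numpy as np

n_concepts, n_classes = 6, 3

# Sparse linear layer projecting concept scores to class logits;
# the zeros make the soft logical rule easy to read off.
W = np.zeros((n_classes, n_concepts))
W[0, [0, 2]] = [1.5, -0.8]   # class 0 depends only on concepts 0 and 2
W[1, [1, 3]] = [2.0, 0.5]
W[2, [4, 5]] = [1.0, 1.0]

concept_scores = np.array([0.9, 0.2, 0.1, 0.4, 0.3, 0.2])  # from the GNN bottleneck
logits = W @ concept_scores

# Per-concept contribution to the predicted class: weight * concept score.
pred = int(np.argmax(logits))
contrib = W[pred] * concept_scores
print(pred, np.round(contrib, 3))
```

Because each logit is an exact sum of per-concept terms, the contribution of every concept to the final prediction is directly quantifiable, which is the interpretability property the abstract describes.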
[AI-14] CodecFlow: Efficient Bandwidth Extension via Conditional Flow Matching in Neural Codec Latent Space
【速读】:该论文旨在解决语音带宽扩展(Speech Bandwidth Extension, BWE)中因传统方法依赖频谱图或波形建模而导致计算成本高、高频保真度有限的问题。解决方案的关键在于提出CodecFlow框架,其核心创新包括:在连续神经音频编解码器嵌入上采用声门激励感知的条件流转换器(voicing-aware conditional flow converter),以提升低维潜在空间中的语音重建效率;同时引入结构约束的残差向量量化器(structure-constrained residual vector quantizer),增强潜在表示对齐的稳定性,从而实现从8 kHz到16 kHz及44.1 kHz语音BWE任务中更强的频谱保真度和更优的感知质量。
链接: https://arxiv.org/abs/2603.02022
作者: Bowen Zhang,Junchuan Zhao,Ian McLoughlin,Ye Wang,A S Madhukumar
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: 7 pages, 7 figures
Abstract:Speech Bandwidth Extension improves clarity and intelligibility by restoring/inferring appropriate high-frequency content for low-bandwidth speech. Existing methods often rely on spectrogram or waveform modeling, which can incur higher computational cost and have limited high-frequency fidelity. Neural audio codecs offer compact latent representations that better preserve acoustic detail, yet accurately recovering high-resolution latent information remains challenging due to representation mismatch. We present CodecFlow, a neural codec-based BWE framework that performs efficient speech reconstruction in a compact latent space. CodecFlow employs a voicing-aware conditional flow converter on continuous codec embeddings and a structure-constrained residual vector quantizer to improve latent alignment stability. Optimized end-to-end, CodecFlow achieves strong spectral fidelity and enhanced perceptual quality on 8 kHz to 16 kHz and 44.1 kHz speech BWE tasks.
[AI-15] Mitigating topology biases in Graph Diffusion via Counterfactual Intervention
【速读】:该论文旨在解决图扩散模型在图生成任务中继承并放大敏感属性(如性别、年龄、地区)导致的拓扑偏差问题,从而生成不公平的合成图。现有公平图生成方法受限于特定应用场景下的完整标签或需同步更新图结构与节点属性,难以通用化。解决方案的关键在于提出Fair Graph Diffusion Model (FairGDiff),其基于反事实学习设计了一步式去偏机制:构建因果模型以刻画敏感属性、偏倚链接形成与图结构之间的关系;通过回答“若敏感属性不同,图结构是否会改变?”这一反事实问题,估计无偏处理并嵌入扩散过程;同时在前向扩散和反向去噪阶段融合反事实学习,确保生成图与敏感属性独立且保持结构完整性,从而在公平性与实用性之间实现更优平衡。
链接: https://arxiv.org/abs/2603.02005
作者: Wendi Wang,Jiaxi Yang,Yongkang Du,Lu Lin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:
Abstract:Graph diffusion models have gained significant attention in graph generation tasks, but they often inherit and amplify topology biases from sensitive attributes (e.g. gender, age, region), leading to unfair synthetic graphs. Existing fair graph generation using diffusion models is limited to specific graph-based applications with complete labels or requires simultaneous updates for graph structure and node attributes, making them unsuitable for general usage. To relax these limitations by applying the debiasing method directly on graph topology, we propose Fair Graph Diffusion Model (FairGDiff), a counterfactual-based one-step solution that mitigates topology biases while balancing fairness and utility. In detail, we construct a causal model to capture the relationship between sensitive attributes, biased link formation, and the generated graph structure. By answering the counterfactual question “Would the graph structure change if the sensitive attribute were different?”, we estimate an unbiased treatment and incorporate it into the diffusion process. FairGDiff integrates counterfactual learning into both forward diffusion and backward denoising, ensuring that the generated graphs are independent of sensitive attributes while preserving structural integrity. Extensive experiments on real-world datasets demonstrate that FairGDiff achieves a superior trade-off between fairness and utility, outperforming existing fair graph generation methods while maintaining scalability.
[AI-16] MatRIS: Toward Reliable and Efficient Pretrained Machine Learning Interaction Potentials
【速读】:该论文旨在解决当前等变机器学习势能模型(equivariant MLIPs)在处理高维原子相互作用时计算成本高昂的问题,尤其是在量子力学数据集持续扩大的背景下,如何构建更紧凑且高效的模型以充分捕捉复杂原子交互。解决方案的关键在于提出一种新的不变量机器学习势能模型 MatRIS(Materials Representation and Interaction Simulation),其核心创新是引入基于注意力机制的三体相互作用建模,并设计了一种具有线性复杂度 O(N) 的可分离注意力机制,从而在保持与领先等变模型相当精度的同时显著降低训练成本。
链接: https://arxiv.org/abs/2603.02002
作者: Yuanchang Zhou,Siyu Hu,Xiangyu Zhang,Hongyu Wang,Guangming Tan,Weile Jia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 28 pages, 9 figures, 12 tables
Abstract:Foundation MLIPs demonstrate broad applicability across diverse material systems and have emerged as a powerful and transformative paradigm in chemical and computational materials science. Equivariant MLIPs achieve state-of-the-art accuracy in a wide range of benchmarks by incorporating equivariant inductive bias. However, the reliance on tensor products and high-degree representations makes them computationally costly. This raises a fundamental question: as quantum mechanical-based datasets continue to expand, can we develop a more compact model to thoroughly exploit high-dimensional atomic interactions? In this work, we present MatRIS (Materials Representation and Interaction Simulation), an invariant MLIP that introduces attention-based modeling of three-body interactions. MatRIS leverages a novel separable attention mechanism with linear complexity O(N), enabling both scalability and expressiveness. MatRIS delivers accuracy comparable to that of leading equivariant models on a wide range of popular benchmarks (Matbench-Discovery, MatPES, MDR phonon, Molecular dataset, etc). Taking Matbench-Discovery as an example, MatRIS achieves an F1 score of up to 0.847 and attains comparable accuracy at a lower training cost. The work indicates that our carefully designed invariant models can match or exceed the accuracy of equivariant models at a fraction of the cost, shedding light on the development of accurate and efficient MLIPs.
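The abstract does not spell out MatRIS's separable attention, but generic linear attention illustrates how reordering the matrix products yields O(N) cost. A hedged sketch (the positive feature map and all shapes below are illustrative assumptions, not the paper's mechanism):

```python
import numpy as np

def separable_attention(Q, K, V, eps=1e-6):
    # Linear-complexity attention: a positive feature map replaces the
    # softmax, and reordering the matmuls gives O(N * d^2) cost instead
    # of the O(N^2 * d) of a dense score matrix.
    phi = lambda x: np.maximum(x, 0.0) + eps   # simple positive feature map
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                  # (d, d_v) summary built in one pass over N
    Z = Qf @ Kf.sum(axis=0)        # per-query normalizer
    return (Qf @ KV) / Z[:, None]

rng = np.random.default_rng(1)
N, d = 128, 16
Q = rng.normal(size=(N, d))
K = rng.normal(size=(N, d))
V = rng.normal(size=(N, d))
out = separable_attention(Q, K, V)
print(out.shape)  # (128, 16)
```

Because the attention weights are non-negative and normalized, each output row is a convex combination of the value rows, just as in softmax attention, but the quadratic score matrix is never formed.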
[AI-17] Intrinsic Task Symmetry Drives Generalization in Algorithmic Tasks
【速读】:该论文旨在解决神经网络在训练过程中如何从单纯的记忆行为(memorization)过渡到真正的泛化能力(generalization)的问题,这一现象被称为“grokking”。其核心挑战在于揭示导致这种突然转变的内在机制。论文提出的关键解决方案是:内在任务对称性(intrinsic task symmetries)驱动了grokking过程,并主导了模型表示空间的几何结构演化。研究发现,grokking遵循三个阶段动态:记忆、对称性获取和几何组织;其中,泛化能力在对称性获取阶段即已出现,随后模型表示被重新组织为与任务对齐的低维结构。这一对称性驱动的框架在代数、结构和关系推理等多种算法任务中得到验证,并进一步提出了基于对称性的诊断方法以预测泛化 onset 和加速其发生。
链接: https://arxiv.org/abs/2603.01968
作者: Hyeonbin Hwang,Yeachan Park
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint
Abstract:Grokking, the sudden transition from memorization to generalization, is characterized by the emergence of low-dimensional representations, yet the mechanism underlying this organization remains elusive. We propose that intrinsic task symmetries primarily drive grokking and shape the geometry of the model’s representation space. We identify a consistent three-stage training dynamic underlying grokking: (i) memorization, (ii) symmetry acquisition, and (iii) geometric organization. We show that generalization emerges during the symmetry acquisition phase, after which representations reorganize into a structured, task-aligned geometry. We validate this symmetry-driven account across diverse algorithmic domains, including algebraic, structural, and relational reasoning tasks. Building on these findings, we introduce a symmetry-based diagnostic that anticipates the onset of generalization and propose strategies to accelerate it. Together, our results establish intrinsic symmetry as the key factor enabling neural networks to move beyond memorization and achieve robust algorithmic reasoning.
[AI-18] dAttention: a CUDA Tile SDPA Kernel for PyTorch
【速读】:该论文旨在解决生成式 AI (Generative AI) 模型中注意力机制(Attention)计算效率与可定制性之间的矛盾问题,尤其是在 NVIDIA GPU 上进行高效、灵活的缩放点积注意力(Scaled Dot-Product Attention, SDPA)研究时面临的挑战。解决方案的关键在于提出 TiledAttention——一个基于 cuTile Python(TileIR)实现的 SDPA 前向算子,它通过在线 softmax 和分块 K/V 流式传输保留真实行为的同时,允许研究人员直接在 Python 层面修改调度策略(如分块形状、数据 staging 和共享内存布局),从而无需重写复杂的 CUDA/CUTLASS 模板即可实现快速、可复现的内核研究,兼顾性能与灵活性。
链接: https://arxiv.org/abs/2603.01960
作者: Taimur Khan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:TiledAttention is a scaled dot-product attention (SDPA) forward operator for SDPA research on NVIDIA GPUs. Implemented in cuTile Python (TileIR) and exposed as a PyTorch-callable function, it is easier to modify than low-level CUDA templates while retaining realistic behavior via online softmax and tiled K,V streaming. The approach is both performant and directly editable at the schedule level from Python (tile shapes, staging, shared-memory layout), enabling rapid, reproducible kernel research without template-heavy CUDA/CUTLASS rewrites. We benchmark TiledAttention on an NVIDIA DGX GB10 node with a reproducible harness and compare against PyTorch SDPA (auto-dispatch) and explicit unfused baselines across sequence length, head dimension, and precision (FP16/BF16). While production fused baselines remain stronger overall, TiledAttention delivers large speedups over standard eager attention paths and is available for direct use within PyTorch workflows, providing a practical balance between performance and customizability.
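The online softmax and tiled K/V streaming that TiledAttention retains can be sketched in plain numpy. This single-query version is illustrative only, not the cuTile kernel itself:

```python
import numpy as np

def sdpa_reference(q, K, V):
    # Dense reference: materializes the full score vector at once.
    s = K @ q / np.sqrt(len(q))
    p = np.exp(s - s.max())
    p /= p.sum()
    return p @ V

def sdpa_online(q, K, V, tile=16):
    # Stream K/V in tiles, keeping a running max m, normalizer l, and
    # accumulator acc (the online-softmax trick): the full score
    # vector is never stored.
    d = len(q)
    m, l, acc = -np.inf, 0.0, np.zeros(V.shape[1])
    for i in range(0, len(K), tile):
        s = K[i:i + tile] @ q / np.sqrt(d)
        m_new = max(m, s.max())
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[i:i + tile]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q = rng.normal(size=32)
K = rng.normal(size=(64, 32))
V = rng.normal(size=(64, 8))
print(np.allclose(sdpa_online(q, K, V), sdpa_reference(q, K, V)))  # True
```

The rescaling by `exp(m - m_new)` whenever a new running maximum appears is what keeps the streamed result exactly equal to the dense softmax, and the tile size plays the role of the schedule-level tile shapes the abstract describes as directly editable.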
[AI-19] LiveCultureBench: a Multi-Agent Multi-Cultural Benchmark for Large Language Models in Dynamic Social Simulations
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)作为自主代理(autonomous agents)部署时,评估体系过于关注任务完成度而忽视文化适宜性(cultural appropriateness)与评价者可靠性(evaluator reliability)的问题。解决方案的关键在于提出一个名为LiveCultureBench的多文化动态基准测试平台,其通过将LLM嵌入模拟城镇环境中,使模型在具有多样化人口统计学和文化背景的居民社会互动中执行任务,并由基于LLM的验证器(verifier)结构化判断规范违反行为与任务进展,从而量化任务成效与文化规范之间的权衡关系及验证器不确定性,实现对LLM代理跨文化鲁棒性、效用-规范敏感性平衡能力以及自动化评估可靠性的系统性研究。
链接: https://arxiv.org/abs/2603.01952
作者: Viet-Thanh Pham,Lizhen Qu,Thuy-Trang Vu,Gholamreza Haffari,Dinh Phung
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are increasingly deployed as autonomous agents, yet evaluations focus primarily on task success rather than cultural appropriateness or evaluator reliability. We introduce LiveCultureBench, a multi-cultural, dynamic benchmark that embeds LLMs as agents in a simulated town and evaluates them on both task completion and adherence to socio-cultural norms. The simulation models a small city as a location graph with synthetic residents having diverse demographic and cultural profiles. Each episode assigns one resident a daily goal while others provide social context. An LLM-based verifier generates structured judgments on norm violations and task progress, which we aggregate into metrics capturing task-norm trade-offs and verifier uncertainty. Using LiveCultureBench across models and cultural profiles, we study (i) cross-cultural robustness of LLM agents, (ii) how they balance effectiveness against norm sensitivity, and (iii) when LLM-as-a-judge evaluation is reliable for automated benchmarking versus when human oversight is needed.
[AI-20] Probabilistic Retrofitting of Learned Simulators
【速读】:该论文旨在解决当前偏微分方程(Partial Differential Equations, PDEs)建模中普遍采用确定性预测方法所面临的局限性,即无法有效捕捉物理系统固有的混沌性和不确定性。现有方法通常需要从零开始训练概率模型,这不仅计算成本高昂,还难以利用已有的高性能确定性模型资源。解决方案的关键在于提出一种高效训练策略,通过引入连续排名概率分数(Continuous Ranked Probability Score, CRPS)作为合适的评分规则,对预训练的确定性模型进行“ retrofitting”(改造),从而将其转化为概率模型。该方法具有架构无关性,适用于不同结构的模型且仅需少量代码修改,在单个动力学系统和多系统预训练的PDE基础模型上均实现了显著性能提升,例如在滚动预测中CRPS降低20-54%,VRMSE提升最高达30%。这一成果表明,概率化PDE建模无需从头训练,可基于现有确定性骨干模型以较低额外成本实现。
链接: https://arxiv.org/abs/2603.01949
作者: Cristiana Diaconu,Miles Cranmer,Richard E. Turner,Tanya Marwah,Payel Mukhopadhyay
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: Code provided at this https URL
Abstract:Dominant approaches for modelling Partial Differential Equations (PDEs) rely on deterministic predictions, yet many physical systems of interest are inherently chaotic and uncertain. While training probabilistic models from scratch is possible, it is computationally expensive and fails to leverage the significant resources already invested in high-performing deterministic backbones. In this work, we adopt a training-efficient strategy to transform pre-trained deterministic models into probabilistic ones via retrofitting with a proper scoring rule: the Continuous Ranked Probability Score (CRPS). Crucially, this approach is architecture-agnostic: it applies the same adaptation mechanism across distinct model backbones with minimal code modifications. The method proves highly effective across different scales of pre-training: for models trained on single dynamical systems, we achieve 20-54% reductions in rollout CRPS and up to 30% improvements in variance-normalised RMSE (VRMSE) relative to compute-matched deterministic fine-tuning. We further validate our approach on a PDE foundation model, trained on multiple systems and retrofitted on the dataset of interest, to show that our probabilistic adaptation yields an improvement of up to 40% in CRPS and up to 15% in VRMSE compared to deterministic fine-tuning. Validated across diverse architectures and dynamics, our results show that probabilistic PDE modelling need not require retraining from scratch, but can be unlocked from existing deterministic backbones with modest additional training cost.
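CRPS, the proper scoring rule used for retrofitting, has a simple empirical form over ensemble samples: E|X − y| − ½·E|X − X′|. A minimal sketch (the toy ensembles below are invented for illustration, not PDE rollouts):

```python
import numpy as np

def crps_ensemble(samples, y):
    # Empirical CRPS: E|X - y| - 0.5 * E|X - X'| over ensemble members.
    # Lower is better; it rewards both calibration and sharpness.
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

rng = np.random.default_rng(0)
y = 0.3                                      # observed value
calibrated = rng.normal(y, 0.1, size=500)    # centered, low spread
biased = rng.normal(y + 1.0, 0.1, size=500)  # same spread, wrong mean
print(crps_ensemble(calibrated, y) < crps_ensemble(biased, y))  # True
```

Training a retrofitted model's ensemble outputs against this scoring rule penalizes both systematic bias and uncalibrated spread, which is why it can turn a deterministic backbone into a probabilistic predictor.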
[AI-21] CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification
【速读】:该论文旨在解决多轮交互式工具使用智能体(interactive tool-use agents)在面对现实世界中复杂且模糊的用户需求时,难以执行确定性动作以满足任务要求的问题。其核心挑战在于如何在训练过程中既保证数据的复杂性以模拟真实场景,又确保轨迹的正确性以支持有效学习。解决方案的关键在于提出一种名为CoVe(Constraint-Verification)的后训练数据合成框架,该框架通过显式定义任务约束来双重作用:一方面引导生成复杂的行为轨迹,另一方面作为确定性验证器评估轨迹质量,从而为监督微调(SFT)提供高质量训练轨迹,并为强化学习(RL)生成精确奖励信号。
链接: https://arxiv.org/abs/2603.01940
作者: Jinpeng Chen,Cheng Gong,Hanbo Li,Ziru Liu,Zichen Tian,Xinyu Fu,Shi Wu,Chenyang Zhang,Wu Zhang,Suiyun Zhang,Dandan Tu,Rui Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Developing multi-turn interactive tool-use agents is challenging because real-world user needs are often complex and ambiguous, yet agents must execute deterministic actions to satisfy them. To address this gap, we introduce CoVe (Constraint-Verification), a post-training data synthesis framework designed for training interactive tool-use agents while ensuring both data complexity and correctness. CoVe begins by defining explicit task constraints, which serve a dual role: they guide the generation of complex trajectories and act as deterministic verifiers for assessing trajectory quality. This enables the creation of high-quality training trajectories for supervised fine-tuning (SFT) and the derivation of accurate reward signals for reinforcement learning (RL). Our evaluation on the challenging τ²-bench benchmark demonstrates the effectiveness of the framework. Notably, our compact CoVe-4B model achieves success rates of 43.0% and 59.4% in the Airline and Retail domains, respectively; its overall performance significantly outperforms strong baselines of similar scale and remains competitive with models up to 17× its size. These results indicate that CoVe provides an effective and efficient pathway for synthesizing training data for state-of-the-art interactive tool-use agents. To support future research, we open-source our code, trained model, and the full set of 12K high-quality trajectories used for training.
[AI-22] Explanation-Guided Adversarial Training for Robust and Interpretable Models
【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在面对对抗样本或分布外(Out-of-Distribution, OOD)数据时,预测性能下降且解释性不足的问题。现有方法如解释引导学习(Explanation-Guided Learning, EGL)虽能提升模型可解释性,但依赖大量人工标注且假设输入为良性;而对抗训练(Adversarial Training, AT)虽增强鲁棒性,却无法保证决策基于语义有意义的特征。解决方案的关键在于提出一种统一框架——解释引导的对抗训练(Explanation-Guided Adversarial Training, EGAT),其核心是在生成对抗样本的同时施加基于解释的约束,联合优化分类准确率、对抗鲁棒性和归因稳定性,从而在保持高干净准确率和对抗准确率(+37%)的同时,生成更具语义合理性的解释,并仅增加约16%的训练时间。
链接: https://arxiv.org/abs/2603.01938
作者: Chao Chen,Yanhui Chen,Shanshan Lin,Dongsheng Hong,Shu Wu,Xiangwen Liao,Chuanyi Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by IEEE Transactions On Circuits and Systems For Video Technology (TCSVT 2026)
Abstract:Deep neural networks (DNNs) have achieved remarkable performance in many tasks, yet they often behave as opaque black boxes. Explanation-guided learning (EGL) methods steer DNNs using human-provided explanations or supervision on model attributions. These approaches improve interpretability but typically assume benign inputs and incur heavy annotation costs. In contrast, both predictions and saliency maps of DNNs can change dramatically when facing imperceptible perturbations or unseen patterns. Adversarial training (AT) can substantially improve robustness, but it does not guarantee that model decisions rely on semantically meaningful features. In response, we propose Explanation-Guided Adversarial Training (EGAT), a unified framework that integrates the strengths of AT and EGL to simultaneously improve prediction performance, robustness, and explanation quality. EGAT generates adversarial examples on the fly while imposing explanation-based constraints on the model. By jointly optimizing classification performance, adversarial robustness, and attributional stability, EGAT is not only more resistant to unexpected cases, including adversarial attacks and out-of-distribution (OOD) scenarios, but also offers human-interpretable justifications for its decisions. We further formalize EGAT within the Probably Approximately Correct learning framework, demonstrating theoretically that it yields more stable predictions under unexpected situations compared to standard AT. Empirical evaluations on OOD benchmark datasets show that EGAT consistently outperforms competitive baselines in both clean accuracy and adversarial accuracy (+37%) while producing more semantically meaningful explanations, requiring only a limited increase (+16%) in training time.
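EGAT's exact objective is not given in the abstract; the sketch below only illustrates the two ingredients it combines, on-the-fly adversarial example generation (here one-step FGSM) and measuring attribution drift, on a hypothetical toy logistic model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gradient(x, w, b, y):
    # Gradient of the logistic loss w.r.t. the input: a simple saliency map.
    return (sigmoid(w @ x + b) - y) * w

def fgsm(x, w, b, y, eps):
    # One-step adversarial example along the loss-increasing direction.
    return x + eps * np.sign(input_gradient(x, w, b, y))

rng = np.random.default_rng(0)
w, b = rng.normal(size=5), 0.0
x, y = rng.normal(size=5), 1.0

x_adv = fgsm(x, w, b, y, eps=0.1)
drift = np.linalg.norm(input_gradient(x, w, b, y) - input_gradient(x_adv, w, b, y))
print(float(drift))  # attribution drift an EGAT-style constraint would penalize
```

An explanation-guided adversarial trainer would add a penalty on this drift (attributional stability) to the adversarial classification loss, rather than only minimizing the loss on `x_adv`.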
[AI-23] Dream2Learn: Structured Generative Dreaming for Continual Learning
【速读】:该论文旨在解决持续学习(Continual Learning)中因灾难性遗忘(Catastrophic Forgetting)导致的模型稳定性与可塑性失衡问题。其核心解决方案是提出Dream2Learn(D2L)框架,通过让模型自主生成结构化的合成经验进行自我训练:利用冻结的扩散模型(Diffusion Model),以分类器自身驱动的软提示优化(Soft Prompt Optimization)为条件,生成语义上新颖且与已有知识一致的“梦境类”样本;这些样本不用于替代历史记忆,而是用于扩展和重构表征空间,从而实现前向知识迁移与未来任务适应能力的提升。该机制模拟人类睡眠期间的记忆重组过程,将内部模拟转化为增强泛化性能的有效训练信号。
链接: https://arxiv.org/abs/2603.01935
作者: Salvatore Calcagno,Matteo Pennisi,Federica Proietto Salanitri,Amelia Sorrenti,Simone Palazzo,Concetto Spampinato,Giovanni Bellitto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Continual learning requires balancing plasticity and stability while mitigating catastrophic forgetting. Inspired by human dreaming as a mechanism for internal simulation and knowledge restructuring, we introduce Dream2Learn (D2L), a framework in which a model autonomously generates structured synthetic experiences from its own internal representations and uses them for self-improvement. Rather than reconstructing past data as in generative replay, D2L enables a classifier to create novel, semantically distinct dreamed classes that are coherent with its learned knowledge yet do not correspond to previously observed data. These dreamed samples are produced by conditioning a frozen diffusion model through soft prompt optimization driven by the classifier itself. The generated data are not used to replace memory, but to expand and reorganize the representation space, effectively allowing the network to self-train on internally synthesized concepts. By integrating dreamed classes into continual training, D2L proactively structures latent features to support forward knowledge transfer and adaptation to future tasks. This prospective self-training mechanism mirrors the role of sleep in consolidating and reorganizing memory, turning internal simulations into a tool for improved generalization. Experiments on Mini-ImageNet, FG-ImageNet, and ImageNet-R demonstrate that D2L consistently outperforms strong rehearsal-based baselines and achieves positive forward transfer, confirming its ability to enhance adaptability through internally generated training signals.
[AI-24] Real Money Fake Models: Deceptive Model Claims in Shadow APIs
【速读】:该论文旨在解决当前学术研究中广泛使用“影子API”(shadow APIs)所带来的可靠性与可复现性问题。影子API作为第三方服务,声称绕过官方大语言模型(Large Language Models, LLMs)的地域限制和付费壁垒提供访问接口,但其输出是否与官方API一致尚不明确,这直接影响了下游应用和科研结果的有效性。论文的关键解决方案是首次对官方LLM API与对应的影子API进行系统性审计,通过多维评估(包括功能可用性、安全性及模型身份验证)揭示了影子API中存在的欺骗行为:如性能差异最高达47.21%、安全行为显著不可预测以及45.83%的指纹测试中身份验证失败。这一发现表明影子API在多个关键维度上无法保证与官方服务的一致性,从而严重威胁科学研究的可信度与完整性。
链接: https://arxiv.org/abs/2603.01919
作者: Yage Zhang,Yukun Jiang,Zeyuan Chen,Michael Backes,Xinyue Shen,Yang Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:Access to frontier large language models (LLMs), such as GPT-5 and Gemini-2.5, is often hindered by high pricing, payment barriers, and regional restrictions. These limitations drive the proliferation of shadow APIs, third-party services that claim to provide access to official model services without regional limitations via indirect access. Despite their widespread use, it remains unclear whether shadow APIs deliver outputs consistent with those of the official APIs, raising concerns about the reliability of downstream applications and the validity of research findings that depend on them. In this paper, we present the first systematic audit between official LLM APIs and corresponding shadow APIs. We first identify 17 shadow APIs that have been utilized in 187 academic papers, with the most popular one reaching 5,966 citations and 58,639 GitHub stars by December 6, 2025. Through multidimensional auditing of three representative shadow APIs across utility, safety, and model verification, we uncover both indirect and direct evidence of deception practices in shadow APIs. Specifically, we reveal performance divergence reaching up to 47.21%, significant unpredictability in safety behaviors, and identity verification failures in 45.83% of fingerprint tests. These deceptive practices critically undermine the reproducibility and validity of scientific research, harm the interests of shadow API users, and damage the reputation of official model providers.
[AI-25] Agent ic Code Reasoning
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理在不执行代码的情况下,能否有效探索代码库并推理代码语义的问题,即“代理式代码推理”(agentic code reasoning)能力的实现。其核心挑战在于如何确保LLM代理在推理过程中避免跳跃性判断或缺乏依据的结论,从而提升推理的可靠性与可验证性。解决方案的关键在于提出“半形式化推理”(semi-formal reasoning),这是一种结构化的提示方法,要求代理构建明确的前提、追踪执行路径,并推导出形式化的结论;该方法本质上是一种推理证书(certificate),强制代理完整覆盖所有情况且每一步结论均有依据,从而显著提升代码理解任务的准确性,在补丁等价性验证、缺陷定位和代码问答等任务中均取得实质性改进,为无需执行的强化学习奖励信号设计、代码审查及静态程序分析提供了可行路径。
链接: https://arxiv.org/abs/2603.01896
作者: Shubham Ugare,Satish Chandra
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:
Abstract:Can LLM agents explore codebases and reason about code semantics without executing the code? We study this capability, which we call agentic code reasoning, and introduce semi-formal reasoning: a structured prompting methodology that requires agents to construct explicit premises, trace execution paths, and derive formal conclusions. Unlike unstructured chain-of-thought, semi-formal reasoning acts as a certificate: the agent cannot skip cases or make unsupported claims. We evaluate across three tasks (patch equivalence verification, fault localization, and code question answering) and show that semi-formal reasoning consistently improves accuracy on all of them. For patch equivalence, accuracy improves from 78% to 88% on curated examples and reaches 93% on real-world agent-generated patches, approaching the reliability needed for execution-free RL reward signals. For code question answering on RubberDuckBench Mohammad et al. (2026), semi-formal reasoning achieves 87% accuracy. For fault localization on Defects4J Just et al. (2014), semi-formal reasoning improves Top-5 accuracy by 5 percentage points over standard reasoning. These results demonstrate that structured agentic reasoning enables meaningful semantic code analysis without execution, opening practical applications in RL training pipelines, code review, and static program analysis.
[AI-26] Diagnosing Generalization Failures from Representational Geometry Markers ICLR
【速读】:该论文旨在解决人工智能模型在未见场景下(即分布外,Out-of-Distribution, OOD)性能下降的问题,这是当前生成式AI和深度学习系统部署中的关键挑战。传统方法多采用自下而上的机制分析策略,试图通过解析内部特征或神经回路来构建解释性模型,但往往难以提供高阶、可预测的失败预警信号。本文提出一种受医学生物标志物启发的自上而下的研究范式:识别能够作为系统级指标的网络标记(network markers),用以稳健地预测模型未来的泛化表现。其核心创新在于发现并验证了任务相关几何属性——如分布内(In-Distribution, ID)物体流形的有效维度和效用(utility)——可以稳定预测OOD性能,且该指标在不同架构、优化器与数据集间具有普适性,甚至优于ID准确率,从而为模型选择和AI可解释性提供了更可靠的依据。
链接: https://arxiv.org/abs/2603.01879
作者: Chi-Ning Chou,Artem Kirsanov,Yao-Yuan Yang,SueYeon Chung
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published in the International Conference on Learning Representations (ICLR), 2026
Abstract:Generalization, the ability to perform well beyond the training context, is a hallmark of biological and artificial intelligence, yet anticipating unseen failures remains a central challenge. Conventional approaches often take a "bottom-up" mechanistic route by reverse-engineering interpretable features or circuits to build explanatory models. While insightful, these methods often struggle to provide the high-level, predictive signals for anticipating failure in real-world deployment. Here, we propose using a "top-down" approach to studying generalization failures inspired by medical biomarkers: identifying system-level measurements that serve as robust indicators of a model’s future performance. Rather than mapping out detailed internal mechanisms, we systematically design and test network markers to probe structure-function links, identify prognostic indicators, and validate predictions in real-world settings. In image classification, we find that task-relevant geometric properties of in-distribution (ID) object manifolds consistently forecast poor out-of-distribution (OOD) generalization. In particular, reductions in two geometric measures, effective manifold dimensionality and utility, predict weaker OOD performance across diverse architectures, optimizers, and datasets. We apply this finding to transfer learning with ImageNet-pretrained models. We consistently find that the same geometric patterns predict OOD transfer performance more reliably than ID accuracy. This work demonstrates that representational geometry can expose hidden vulnerabilities, offering more robust guidance for model selection and AI interpretability.
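One of the geometric markers, effective manifold dimensionality, is commonly estimated via the participation ratio of covariance eigenvalues; whether the paper uses exactly this estimator is an assumption. A minimal sketch on synthetic representations:

```python
import numpy as np

def effective_dim(X):
    # Participation ratio of the covariance eigenvalues: a standard proxy
    # for the effective dimensionality of a representation manifold.
    X = X - X.mean(axis=0)
    ev = np.linalg.eigvalsh(np.cov(X.T))
    return float(ev.sum() ** 2 / (ev ** 2).sum())

rng = np.random.default_rng(0)
iso = rng.normal(size=(1000, 20))                            # isotropic 20-d cloud
low = rng.normal(size=(1000, 2)) @ rng.normal(size=(2, 20))  # rank-2 manifold
print(round(effective_dim(iso), 1), round(effective_dim(low), 1))
```

A marker like this is a system-level measurement in the paper's sense: it summarizes representation geometry in one number without reverse-engineering any internal circuit.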
[AI-27] Phishing the Phishers with SpecularNet: Hierarchical Graph Autoencoding for Reference-Free Web Phishing Detection
【速读】:该论文旨在解决当前网页钓鱼检测方法在实际应用中面临的可扩展性、可复现性和实时性不足的问题。现有基于参考资源和生成式 AI 的检测方法虽然准确率高,但依赖外部知识库、云服务及复杂的多模态处理流程,限制了其部署效率与适用场景;而传统深度学习模型又难以适应不断演化的钓鱼攻击。解决方案的关键在于提出一种轻量级的无参考(reference-free)检测框架 SpecularNet,其核心创新是仅利用域名和 HTML 结构信息,将文档对象模型(Document Object Model, DOM)建模为树结构,并采用具有方向性和层级传递机制的图自编码架构,从而捕捉钓鱼网页的高阶结构不变性,同时实现仅需约 20 毫秒/页的端到端推理速度,显著优于现有方法的数秒级延迟,在保持 F1 分数达 93.9% 的前提下大幅降低计算开销。
链接: https://arxiv.org/abs/2603.01874
作者: Tailai Song,Pedro Casas,Michela Meo
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Phishing remains the most pervasive threat to the Web, enabling large-scale credential theft and financial fraud through deceptive webpages. While recent reference-based and generative-AI-driven phishing detectors achieve strong accuracy, their reliance on external knowledge bases, cloud services, and complex multimodal pipelines fundamentally limits practicality, scalability, and reproducibility. In contrast, conventional deep learning approaches often fail to generalize to evolving phishing campaigns. We introduce SpecularNet, a novel lightweight framework for reference-free web phishing detection that demonstrates how carefully designed compact architectures can rival heavyweight systems. SpecularNet operates solely on the domain name and HTML structure, modeling the Document Object Model (DOM) as a tree and leveraging a hierarchical graph autoencoding architecture with directional, level-wise message passing. This design captures higher-order structural invariants of phishing webpages while enabling fast, end-to-end inference on standard CPUs. Extensive evaluation against 13 state-of-the-art phishing detectors, including leading reference-based systems, shows that SpecularNet achieves competitive detection performance with dramatically lower computational cost. On benchmark datasets, it reaches an F1 score of 93.9%, trailing the best reference-based method slightly while reducing inference time from several seconds to approximately 20 milliseconds per webpage. Field and robustness evaluations further validate SpecularNet in real-world deployments, on a newly collected 2026 open-world dataset, and against adversarial attacks.
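SpecularNet 将 DOM 建模为树并按层级传递消息。下面用标准库 html.parser 给出一个按深度分层收集 DOM 节点的最小草图(仅为示意,假设输入 HTML 的标签配对完整,非原系统实现):

```python
from html.parser import HTMLParser
from collections import defaultdict

class DomTreeBuilder(HTMLParser):
    """把 HTML 解析为按深度分组的标签列表,对应
    「按层级传递消息」所需的层级化 DOM 视图(示意)。"""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.levels = defaultdict(list)  # depth -> [tag, ...]

    def handle_starttag(self, tag, attrs):
        self.levels[self.depth].append(tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

doc = "<html><body><div><a href='#'>x</a><form></form></div></body></html>"
b = DomTreeBuilder()
b.feed(doc)
print(dict(b.levels))  # {0: ['html'], 1: ['body'], 2: ['div'], 3: ['a', 'form']}
```

真实网页还需处理 <br> 等自闭合标签,这里为保持简短省略。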
[AI-28] Tide: A Customisable Dataset Generator for Anti-Money Laundering Research
【速读】:该论文旨在解决反洗钱(Anti-Money Laundering, AML)领域中机器学习研究因缺乏可访问的真实交易数据而受到的限制问题。现有合成数据生成方法多聚焦于简单的结构模式,忽视了洗钱行为所具有的时间动态特性(timing and frequency),导致生成的数据难以真实反映复杂洗钱场景。为此,作者提出Tide——一个开源的基于图结构的合成金融网络生成工具,其关键创新在于同时建模洗钱行为的结构性特征与时间动态特征,从而生成更具现实代表性的合成数据集。该方案支持可复现、可定制的数据生成,并提供了两个具有不同非法交易比例(LI: 0.10%,HI: 0.19%)的基准数据集及先进检测模型实现,验证了不同模型在不同操作条件下的性能差异,为AML检测方法的鲁棒性评估提供了一个可配置的基准平台。
链接: https://arxiv.org/abs/2603.01863
作者: Montijn van den Beukel,Jože Martin Rožanec,Ana-Lucia Varbanescu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Synthetic AML transaction datasets (Tide, HI and LI variants) are available at this https URL
Abstract:The lack of accessible transactional data significantly hinders machine learning research for Anti-Money Laundering (AML). Privacy and legal concerns prevent the sharing of real financial data, while existing synthetic generators focus on simplistic structural patterns and neglect the temporal dynamics (timing and frequency) that characterise sophisticated laundering schemes. We present Tide, an open-source synthetic dataset generator that produces graph-based financial networks incorporating money laundering patterns defined by both structural and temporal characteristics. Tide enables reproducible, customisable dataset generation tailored to specific research needs. We release two reference datasets with varying illicit ratios (LI: 0.10%, HI: 0.19%), alongside the implementation of state-of-the-art detection models. Evaluation across these datasets reveals condition-dependent model rankings: LightGBM achieves the highest PR-AUC (78.05) in the low illicit ratio condition, while XGBoost performs best (85.12) at higher fraud prevalence. These divergent rankings demonstrate that the reference datasets can meaningfully differentiate model capabilities across operational conditions. Tide provides the research community with a configurable benchmark that exposes meaningful performance variation across model architectures, advancing the development of robust AML detection methods.
[AI-29] Emerging Human-like Strategies for Semantic Memory Foraging in Large Language Models
【速读】:该论文旨在解决如何通过机制可解释性技术来更严谨地研究大语言模型(Large Language Models, LLMs)中的语义记忆搜寻行为,特别是借鉴人类在语义流畅性任务(Semantic Fluency Task, SFT)中表现出的高效记忆访问策略。其解决方案的关键在于识别LLMs在不同层中是否涌现出与人类相似的收敛式(convergent)和发散式(divergent)生成性记忆搜索模式——这两种模式在人类认知中协同作用以实现高效的语义记忆觅食(semantic memory foraging)。研究发现,这些关键的行为特征在LLMs中同样可被识别,从而为将LLMs向人类认知对齐或引导其走向互补性的认知非对齐以增强人机协作提供了新视角。
链接: https://arxiv.org/abs/2603.01822
作者: Eric Lacosse,Mariana Duarte,Peter M. Todd,Daniel C. McNamee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Both humans and Large Language Models (LLMs) store a vast repository of semantic memories. In humans, efficient and strategic access to this memory store is a critical foundation for a variety of cognitive functions. Such access has long been a focus of psychology and the computational mechanisms behind it are now well characterized. Much of this understanding has been gleaned from a widely-used neuropsychological and cognitive science assessment called the Semantic Fluency Task (SFT), which requires the generation of as many semantically constrained concepts as possible. Our goal is to apply mechanistic interpretability techniques to bring greater rigor to the study of semantic memory foraging in LLMs. To this end, we present preliminary results examining SFT as a case study. A central focus is on convergent and divergent patterns of generative memory search, which in humans play complementary strategic roles in efficient memory foraging. We show that these same behavioral signatures, critical to human performance on the SFT, also emerge as identifiable patterns in LLMs across distinct layers. Potentially, this analysis provides new insights into how LLMs may be adapted into closer cognitive alignment with humans, or alternatively, guided toward productive cognitive disalignment to enhance complementary strengths in human-AI interaction.
[AI-30] What Papers Don't Tell You: Recovering Tacit Knowledge for Automated Paper Reproduction
【速读】:该论文旨在解决自动化论文复现(automated paper reproduction)中的核心瓶颈问题,即学术论文中不可避免地隐含了三种类型的缄默知识(tacit knowledge)——关系性知识(relational)、具身性知识(somatic)和集体性知识(collective)。传统方法受限于信息检索能力,而真正阻碍代码自动生成的是这些难以显式表达的隐性知识。为此,作者提出\method框架,其关键在于构建一个基于图结构的代理系统,并针对三类知识分别设计专用机制:节点级关系感知聚合(node-level relation-aware aggregation)通过分析目标论文与其引用文献之间的实现单元重用与适配关系来恢复关系性知识;执行反馈精化(execution-feedback refinement)利用运行时信号驱动迭代调试以捕获具身性知识;图级知识归纳(graph-level knowledge induction)则从具有相似实现模式的论文聚类中提炼集体性知识。该方案在扩展版ReproduceBench上显著优于现有基线,平均性能差距缩小至10.04%,相较最强基线提升24.68%。
链接: https://arxiv.org/abs/2603.01801
作者: Lehui Li,Ruining Wang,Haochen Song,Yaoxin Mao,Tong Zhang,Yuyao Wang,Jiayi Fan,Yitong Zhang,Jieping Ye,Chengqi Zhang,Yongshun Gong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 32 pages (+ appendix), 8 figures. Lehui Li and Ruining Wang contributed equally. Yongshun Gong is the corresponding author
Abstract:Automated paper reproduction – generating executable code from academic papers – is bottlenecked not by information retrieval but by the tacit knowledge that papers inevitably leave implicit. We formalize this challenge as the progressive recovery of three types of tacit knowledge – relational, somatic, and collective – and propose \method, a graph-based agent framework with a dedicated mechanism for each: node-level relation-aware aggregation recovers relational knowledge by analyzing implementation-unit-level reuse and adaptation relationships between the target paper and its citation neighbors; execution-feedback refinement recovers somatic knowledge through iterative debugging driven by runtime signals; and graph-level knowledge induction distills collective knowledge from clusters of papers sharing similar implementations. On an extended ReproduceBench spanning 3 domains, 10 tasks, and 40 recent papers, \method achieves an average performance gap of 10.04% against official implementations, improving over the strongest baseline by 24.68%. The code will be publicly released upon acceptance; the repository link will be provided in the final version.
[AI-31] Phase-Type Variational Autoencoders for Heavy-Tailed Data
【速读】:该论文旨在解决标准变分自编码器(Variational Autoencoder, VAE)在建模重尾分布(heavy-tailed distributions)时的局限性问题,即传统VAE使用的简单解码器分布(如高斯分布)无法捕捉现实数据中罕见但极端事件主导的风险与变异特性。现有针对重尾分布的改进方法受限于预定义参数族,其尾部行为固定且缺乏灵活性。本文提出相位型变分自编码器(Phase-Type Variational Autoencoder, PH-VAE),其核心创新在于将解码器设计为潜变量条件下的相位型(Phase-Type, PH)分布,该分布定义为连续时间马尔可夫链(Continuous-Time Markov Chain, CTMC)的吸收时间,从而通过组合多个指数尺度实现灵活且解析可处理的尾部行为建模,并能直接从观测数据中自适应调整尾部特征。这一方案首次将相位型分布引入深度生成建模,有效提升了对极端值和多维尾部依赖关系的刻画能力。
链接: https://arxiv.org/abs/2603.01800
作者: Abdelhakim Ziani,András Horváth,Paolo Ballarini
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML); Other Statistics (stat.OT)
备注:
Abstract:Heavy-tailed distributions are ubiquitous in real-world data, where rare but extreme events dominate risk and variability. However, standard Variational Autoencoders (VAEs) employ simple decoder distributions (e.g., Gaussian) that fail to capture heavy-tailed behavior, while existing heavy-tail-aware extensions remain restricted to predefined parametric families whose tail behavior is fixed a priori. We propose the Phase-Type Variational Autoencoder (PH-VAE), whose decoder distribution is a latent-conditioned Phase-Type (PH) distribution defined as the absorption time of a continuous-time Markov chain (CTMC). This formulation composes multiple exponential time scales, yielding a flexible and analytically tractable decoder that adapts its tail behavior directly from the observed data. Experiments on synthetic and real-world benchmarks demonstrate that PH-VAE accurately recovers diverse heavy-tailed distributions, significantly outperforming Gaussian, Student-t, and extreme-value-based VAE decoders in modeling tail behavior and extreme quantiles. In multivariate settings, PH-VAE captures realistic cross-dimensional tail dependence through its shared latent representation. To our knowledge, this is the first work to integrate Phase-Type distributions into deep generative modeling, bridging applied probability and representation learning.
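相位型分布由初始分布 α 与 CTMC 瞬态子生成矩阵 T 刻画,其各阶矩有解析式 E[X^k] = (-1)^k k! α T^{-k} 1,这正是摘要所说「解析可处理」的含义。下面用 numpy 在 Erlang(2, λ) 这一特例上验证该公式(示意计算,与论文的解码器实现无关):

```python
import math
import numpy as np

def ph_moment(alpha, T, k=1):
    """相位型分布的 k 阶矩:E[X^k] = (-1)^k * k! * α · T^{-k} · 1。
    alpha: 初始相位分布;T: CTMC 瞬态子生成矩阵。"""
    n = len(alpha)
    T_inv_k = np.linalg.matrix_power(np.linalg.inv(T), k)
    return (-1) ** k * math.factorial(k) * alpha @ T_inv_k @ np.ones(n)

lam = 0.5
alpha = np.array([1.0, 0.0])               # 从第 1 个相位出发
T = np.array([[-lam, lam], [0.0, -lam]])   # 两个速率 λ 的指数阶段串联 → Erlang(2, λ)
print(ph_moment(alpha, T, 1))  # 均值 2/λ = 4
print(ph_moment(alpha, T, 2))  # 二阶矩 6/λ^2 = 24
```

多个指数时间尺度的这种组合,正是 PH-VAE 解码器能够逼近重尾行为的原因。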
[AI-32] Incremental inconsistency-resilient reasoning over Description Logic Abox streams
【速读】:该论文旨在解决数据流(data stream)环境下实时推理的挑战,特别是高数据速率、实时性要求以及流数据的噪声与波动性问题。其核心解决方案是提出了一种基于滑动窗口(sliding window)的描述逻辑(Description Logic)ABox增量推理的新语义,通过在相邻窗口间复用前一窗口的材料化结果,实现高效增量计算;同时引入基于偏好修复(preferred repair)的不一致修复语义,以应对流数据的不稳定性。针对OWL2 RL场景,论文进一步设计了半朴素算法(semi-naive algorithms),支持在存在和不存在不一致情况下的增量材料化维护。
链接: https://arxiv.org/abs/2603.01799
作者: Cas Proost,Pieter Bonte
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:
Abstract:More and more, data is being produced in a streaming fashion. This has led to increased interest into how actionable insights can be extracted in real time from data streams through Stream Reasoning. Reasoning over data streams raises multiple challenges, notably the high velocity of data, the real time requirement of the reasoning, and the noisy and volatile nature of streams. This paper proposes novel semantics for incremental reasoning over streams of Description Logic ABoxes, in order to tackle these challenges. To address the first two challenges, our semantics for reasoning over sliding windows on streams allow for incrementally computing the materialization of the window based on the materialization of the previous window. Furthermore, to deal with the volatile nature of streams, we present novel semantics for inconsistency repair on such windows, based on preferred repair semantics. We then detail our proposed semi-naive algorithms for incremental materialization maintenance in the case of OWL2 RL, both in the presence of inconsistencies and without.
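摘要中半朴素(semi-naive)算法的核心是:每轮推理只用上一轮新增的事实(delta)参与连接,避免重复推导。以传递闭包为例的最小示意(非论文的 OWL2 RL 实现):

```python
def seminaive_closure(edges):
    """半朴素前向推理:每轮仅用新增事实 delta 与基础边做连接,
    直到不再产生新事实(示意,对应增量物化的基本思想)。"""
    total = set(edges)
    delta = set(edges)
    while delta:
        new = {(a, d) for (a, b) in delta for (c, d) in edges if b == c}
        delta = new - total   # 只保留真正新增的事实
        total |= delta
    return total

edges = {(1, 2), (2, 3), (3, 4)}
print(sorted(seminaive_closure(edges)))
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

在滑动窗口设定下,窗口间的增量维护思路与此类似:新窗口的物化从上一窗口结果出发,只处理新增与失效的事实。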
[AI-33] Learning Shortest Paths with Generative Flow Networks
【速读】:该论文旨在解决图中最短路径查找问题,特别是针对非有向无环图(non-acyclic graphs)环境下的路径规划挑战。其解决方案的关键在于提出一种基于生成流网络(Generative Flow Networks, GFlowNets)的学习框架,并通过最小化总流(total flow)来确保前向和后向策略仅沿起始状态与终止状态之间的最短路径遍历图结构。研究进一步证明,通过在任意图上训练带有流正则化的非有向无环GFlowNet,可有效求解路径查找问题;实验表明该方法在排列环境和魔方求解任务中均表现出竞争力,尤其在测试时所需搜索预算更小,同时保持较优的解长性能。
链接: https://arxiv.org/abs/2603.01786
作者: Nikita Morozov,Ian Maksimov,Daniil Tiapkin,Sergey Samsonov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:In this paper, we present a novel learning framework for finding shortest paths in graphs utilizing Generative Flow Networks (GFlowNets). First, we examine theoretical properties of GFlowNets in non-acyclic environments in relation to shortest paths. We prove that, if the total flow is minimized, forward and backward policies traverse the environment graph exclusively along shortest paths between the initial and terminal states. Building on this result, we show that the pathfinding problem in an arbitrary graph can be solved by training a non-acyclic GFlowNet with flow regularization. We experimentally demonstrate the performance of our method in pathfinding in permutation environments and in solving Rubik’s Cubes. For the latter problem, our approach shows competitive results with state-of-the-art machine learning approaches designed specifically for this task in terms of the solution length, while requiring smaller search budget at test-time.
[AI-34] Co-Evolutionary Multi-Modal Alignment via Structured Adversarial Evolution
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)对齐方法中因依赖静态对抗设置而导致的鲁棒性不足问题,尤其在多模态场景下攻击面更大时更为显著。其核心解决方案是提出一种协同进化对齐框架——CEMMA(Co-Evolutionary Multi-Modal Alignment),关键在于引入一个可进化的攻击者(Evolutionary Attacker)与一个自适应防御者(Adaptive Defender)构成闭环迭代机制:前者通过遗传操作(如变异、交叉和差分进化)将简单种子攻击演化为具备复杂结构效用的越狱攻击,从而提升红队测试的成功率;后者则基于合成的难例负样本持续更新,增强模型在面对动态演化的攻击时的鲁棒性和泛化能力,同时保持较高的数据效率和与推理时防御策略(如AdaShield)的兼容性。
链接: https://arxiv.org/abs/2603.01784
作者: Guoxin Shi,Haoyu Wang,Zaihui Yang,Yuxing Wang,Yongzhe Chang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Preprint
Abstract:Adversarial behavior plays a central role in aligning large language models with human values. However, existing alignment methods largely rely on static adversarial settings, which fundamentally limit robustness, particularly in multimodal settings with a larger attack surface. In this work, we move beyond static adversarial supervision and introduce co-evolutionary alignment with evolving attacks, instantiated by CEMMA (Co-Evolutionary Multi-Modal Alignment), an automated and adaptive framework for multimodal safety alignment. We introduce an Evolutionary Attacker that decomposes adversarial prompts into method templates and harmful intents. By employing genetic operators, including mutation, crossover, and differential evolution, it enables simple seed attacks to inherit the structural efficacy of sophisticated jailbreaks. The Adaptive Defender is iteratively updated on the synthesized hard negatives, forming a closed-loop process that adapts alignment to evolving attacks. Experiments show that the Evolutionary Attacker substantially increases red-teaming jailbreak attack success rate (ASR), while the Adaptive Defender improves robustness and generalization across benchmarks with higher data efficiency, without inducing excessive benign refusal, and remains compatible with inference-time defenses such as AdaShield.
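CEMMA 的进化攻击者依赖变异与交叉等遗传算子。下面用中性的占位 token 演示这两类算子的通用形式(纯示意,模板内容与原论文无关):

```python
import random

def crossover(a, b, rng):
    """单点交叉:取 a 的前段与 b 的后段拼成子代模板。"""
    cut = rng.randrange(1, min(len(a), len(b)))
    return a[:cut] + b[cut:]

def mutate(tokens, pool, rng, p=0.2):
    """以概率 p 将每个 token 替换为候选池中的随机 token。"""
    return [rng.choice(pool) if rng.random() < p else t for t in tokens]

rng = random.Random(0)
template_a = ["step1", "step2", "step3", "step4"]   # 假设的占位模板片段
template_b = ["alt1", "alt2", "alt3", "alt4"]
child = crossover(template_a, template_b, rng)
mutant = mutate(child, pool=template_a + template_b, rng=rng)
print(child, mutant)
```

原方法在此类算子之上将提示拆分为「方法模板 × 有害意图」两部分,并引入差分进化;防御者再基于产出的难例迭代更新。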
[AI-35] GAM-RAG : Gain-Adaptive Memory for Evolving Retrieval in Retrieval-Augmented Generation
【速读】:该论文旨在解决传统检索增强生成(Retrieval-Augmented Generation, RAG)系统中因依赖静态索引而导致的重复多跳遍历问题,从而引发推理延迟高和计算资源消耗大的缺陷。其解决方案的关键在于提出一种无需训练的GAM-RAG框架,通过积累查询过程中的检索经验动态更新轻量级、无关系结构的分层索引,利用成功检索事件提供句级反馈以优化记忆激活机制,并引入基于不确定性感知的卡尔曼启发式增益规则,在噪声反馈下实现稳定与适应性的平衡——即对可靠的新信号进行快速更新,对稳定或噪声较大的记忆则采用保守修正策略。
链接: https://arxiv.org/abs/2603.01783
作者: Yifan Wang,Mingxuan Jiang,Zhihao Sun,Yixin Cao,Yicun Liu,Keyang Chen,Guangnan Ye,Hongfeng Chai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-Augmented Generation (RAG) grounds large language models with external evidence, but many implementations rely on pre-built indices that remain static after construction. Related queries therefore repeat similar multi-hop traversal, increasing latency and compute. Motivated by schema-based learning in cognitive neuroscience, we propose GAM-RAG, a training-free framework that accumulates retrieval experience from recurring or related queries and updates retrieval memory over time. GAM-RAG builds a lightweight, relation-free hierarchical index whose links capture potential co-occurrence rather than fixed semantic relations. During inference, successful retrieval episodes provide sentence-level feedback, updating sentence memories so evidence useful for similar reasoning types becomes easier to activate later. To balance stability and adaptability under noisy feedback, we introduce an uncertainty-aware, Kalman-inspired gain rule that jointly updates memory states and perplexity-based uncertainty estimates. It applies fast updates for reliable novel signals and conservative refinement for stable or noisy memories. We provide a theoretical analysis of the update dynamics, and empirically show that GAM-RAG improves average performance by 3.95% over the strongest baseline and by 8.19% with 5-turn memory, while reducing inference cost by 61%. Our code and datasets are available at: this https URL.
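摘要中「卡尔曼启发式增益规则」的直觉可以用标量卡尔曼更新说明:反馈越可靠(噪声方差 r 越小),增益越大、更新越快。以下为示意草图,变量命名为假设,并非 GAM-RAG 的原始公式:

```python
def kalman_gain_update(m, p, z, r):
    """m: 记忆状态;p: 其不确定性;z: 本轮检索反馈信号;r: 反馈噪声方差。
    可靠的新信号(r 小)带来大增益、快速更新;噪声大则更新保守。"""
    k = p / (p + r)          # 增益
    m_new = m + k * (z - m)  # 状态向反馈方向修正
    p_new = (1 - k) * p      # 不确定性收缩
    return m_new, p_new

m, p = kalman_gain_update(0.0, 1.0, z=1.0, r=0.25)    # 可靠反馈:大步更新
print(m, p)   # 0.8 0.2
m2, p2 = kalman_gain_update(0.0, 1.0, z=1.0, r=9.0)   # 噪声反馈:保守更新
print(m2, p2) # 0.1 0.9
```

这正是文中「对可靠新信号快速更新、对稳定或噪声记忆保守修正」的机制骨架。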
[AI-36] Hyperparameter Trajectory Inference with Conditional Lagrangian Optimal Transport
【速读】:该论文旨在解决神经网络(Neural Networks, NNs)在部署后因用户偏好变化导致初始超参数设置失效的问题,避免昂贵的重新训练过程。其核心挑战在于如何从观测数据中推断出神经网络条件输出分布随超参数变化的轨迹,并构建一个能近似未观测超参数设置下模型行为的代理模型(surrogate model)。解决方案的关键在于提出一种基于条件拉格朗日最优传输(conditional Lagrangian optimal transport)的方法,联合学习控制超参数驱动动态的拉格朗日函数(Lagrangian function)、对应的最优传输映射(optimal transport maps)以及观测边际分布之间的测地线路径(geodesics),从而形成高可行性的代理模型;同时引入流形假设(manifold hypothesis)和最小作用量原理(least-action principles)作为归纳偏置,增强模型对物理或结构约束的符合性,显著提升重建性能。
链接: https://arxiv.org/abs/2603.01771
作者: Harry Amad,Mihaela van der Schaar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Neural networks (NNs) often have critical behavioural trade-offs that are set at design time with hyperparameters, such as reward weights in reinforcement learning or quantile targets in regression. Post-deployment, however, user preferences can evolve, making initial settings undesirable, necessitating potentially expensive retraining. To circumvent this, we introduce the task of Hyperparameter Trajectory Inference (HTI): to learn, from observed data, how an NN's conditional output distribution changes with its hyperparameters, and construct a surrogate model that approximates the NN at unobserved hyperparameter settings. HTI requires extending existing trajectory inference approaches to incorporate conditions, exacerbating the challenge of ensuring inferred paths are feasible. We propose an approach based on conditional Lagrangian optimal transport, jointly learning the Lagrangian function governing hyperparameter-induced dynamics along with the associated optimal transport maps and geodesics between observed marginals, which form the surrogate model. We incorporate inductive biases based on the manifold hypothesis and least-action principles into the learned Lagrangian, improving surrogate model feasibility. We empirically demonstrate that our approach reconstructs NN outputs across various hyperparameter spectra better than other alternatives.
[AI-37] CHLU: The Causal Hamiltonian Learning Unit as a Symplectic Primitive for Deep Learning ICLR2026
【速读】:该论文旨在解决当前深度学习中处理时序动态的两大困境:一类是离散且不稳定的模型(如LSTM),易导致梯度爆炸或消失;另一类是连续但耗散的模型(如神经微分方程,Neural ODEs),虽保证稳定性却会随时间破坏信息。其解决方案的关键在于提出一种基于物理规律的新型计算学习单元——因果哈密顿学习单元(Causal Hamiltonian Learning Unit, CHLU),通过强制施加相对论性哈密顿结构并采用辛积分(symplectic integration)实现相空间体积严格守恒,从而在理论上达成无限时域稳定性和可控噪声滤波能力,有效缓解记忆保持与模型稳定性之间的权衡问题。
链接: https://arxiv.org/abs/2603.01768
作者: Pratik Jawahar,Maurizio Pierini
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph)
备注: Accepted as a short paper at ICLR 2026 (AI PDE)
Abstract:Current deep learning primitives dealing with temporal dynamics suffer from a fundamental dichotomy: they are either discrete and unstable (LSTMs) (Pascanu et al., 2013), leading to exploding or vanishing gradients; or they are continuous and dissipative (Neural ODEs) (Dupont et al., 2019), which destroy information over time to ensure stability. We propose the Causal Hamiltonian Learning Unit (pronounced: "clue"), a novel Physics-grounded computational learning primitive. By enforcing a Relativistic Hamiltonian structure and utilizing symplectic integration, a CHLU strictly conserves phase-space volume, as an attempt to solve the memory-stability trade-off. We show that the CHLU is designed for infinite-horizon stability, as well as controllable noise filtering. We then demonstrate a CHLU's generative ability using the MNIST dataset as a proof-of-principle.
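辛积分保持相空间体积,是 CHLU 长期稳定性的来源。下面用经典的 leapfrog 积分器在谐振子 H = p²/2 + q²/2 上演示能量长期有界(示意;CHLU 实际使用相对论性哈密顿量):

```python
def leapfrog(q, p, grad_V, dt, steps):
    """辛积分(leapfrog):对 H = p^2/2 + V(q) 做保体积积分。"""
    p -= 0.5 * dt * grad_V(q)      # 半步动量更新(kick)
    for _ in range(steps - 1):
        q += dt * p                # 整步位置更新(drift)
        p -= dt * grad_V(q)        # 整步动量更新
    q += dt * p
    p -= 0.5 * dt * grad_V(q)      # 末尾半步,补齐对称结构
    return q, p

grad_V = lambda q: q  # 谐振子势能 V(q) = q^2 / 2 的梯度
q0, p0 = 1.0, 0.0
E0 = 0.5 * p0 ** 2 + 0.5 * q0 ** 2
q1, p1 = leapfrog(q0, p0, grad_V, dt=0.01, steps=10000)
E1 = 0.5 * p1 ** 2 + 0.5 * q1 ** 2
print(abs(E1 - E0))  # 一万步之后能量漂移仍远小于 1e-3
```

与之相对,显式欧拉法在同样步数下能量会系统性增长,这正是摘要所说「离散且不稳定」与「保辛」两条路线的差别。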
[AI-38] Modular Memory is the Key to Continual Learning Agents
【速读】:该论文旨在解决基础模型(Foundation Models)在持续运行、经验积累和个性化方面的根本性局限问题,这些问题制约了其在动态环境中的自适应智能表现。当前持续学习研究多依赖于参数更新的“在权重中学习”(In-Weight Learning, IWL),但面临灾难性遗忘(catastrophic forgetting)的挑战。论文提出的关键解决方案是:通过设计模块化记忆架构(modular memory-centric architectures),融合“在上下文学习”(In-Context Learning, ICL)与IWL的优势——即利用ICL实现快速适应与知识累积,同时借助IWL进行稳定的能力更新,从而构建可规模化持续学习的智能代理。
链接: https://arxiv.org/abs/2603.01761
作者: Vaggelis Dorovatas,Malte Schwerin,Andrew D. Bagdanov,Lucas Caccia,Antonio Carta,Laurent Charlin,Barbara Hammer,Tyler L. Hayes,Timm Hess,Christopher Kanan,Dhireesha Kudithipudi,Xialei Liu,Vincenzo Lomonaco,Jorge Mendez-Mendez,Darshan Patil,Ameya Prabhu,Elisa Ricci,Tinne Tuytelaars,Gido M. van de Ven,Liyuan Wang,Joost van de Weijer,Jonghyun Choi,Martin Mundt,Rahaf Aljundi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This work stems from discussions held at the Dagstuhl seminar on Continual Learning in the Era of Foundation Models (October 2025)
Abstract:Foundation models have transformed machine learning through large-scale pretraining and increased test-time compute. Despite surpassing human performance in several domains, these models remain fundamentally limited in continuous operation, experience accumulation, and personalization, capabilities that are central to adaptive intelligence. While continual learning research has long targeted these goals, its historical focus on in-weight learning (IWL), i.e., updating a single model’s parameters to absorb new knowledge, has rendered catastrophic forgetting a persistent challenge. Our position is that combining the strengths of In-Weight Learning (IWL) and the newly emerged capabilities of In-Context Learning (ICL) through the design of modular memory is the missing piece for continual adaptation at scale. We outline a conceptual framework for modular memory-centric architectures that leverage ICL for rapid adaptation and knowledge accumulation, and IWL for stable updates to model capabilities, charting a practical roadmap toward continually learning agents.
[AI-39] Federated Agentic AI for Wireless Networks: Fundamentals, Approaches, and Applications
【速读】:该论文旨在解决当前基于集中式架构的智能体人工智能(Agentic AI)在资源受限、分布广泛且数据异构的无线网络中所面临的挑战,包括高通信开销、隐私风险以及非独立同分布(non-IID)数据问题。其解决方案的关键在于引入联邦学习(Federated Learning, FL),通过协作式本地学习与参数共享机制,在不交换原始数据的前提下提升Agentic AI的整体运行效率和适应性。文中进一步提出将不同类型的FL方法与Agentic AI循环中的特定组件相结合,并以基于强化学习的联邦学习(Federated Reinforcement Learning, FRL)为例,在低空无线网络(Low-altitude Wireless Networks, LAWNs)中验证了其对智能体决策性能的改进效果。
链接: https://arxiv.org/abs/2603.01755
作者: Lingyi Cai,Yu Zhang,Ruichen Zhang,Yinqiu Liu,Tao Jiang,Dusit Niyato,Wei Ni,Abbas Jamalipour
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 7 pages, 3 figures
Abstract:Agentic artificial intelligence (AI) presents a promising pathway toward realizing autonomous and self-improving wireless network services. However, the resource-constrained, widely distributed, and data-heterogeneous nature of wireless networks poses significant challenges to existing agentic AI that relies on centralized architectures, leading to high communication overhead, privacy risks, and non-independent and identically distributed (non-IID) data. Federated learning (FL) has the potential to improve the overall loop of agentic AI through collaborative local learning and parameter sharing without exchanging raw data. This paper proposes new federated agentic AI approaches for wireless networks. We first summarize fundamentals of agentic AI and mainstream FL types. Then, we illustrate how each FL type can strengthen a specific component of agentic AI's loop. Moreover, we conduct a case study on using federated reinforcement learning (FRL) to improve the performance of agentic AI's action decision in low-altitude wireless networks (LAWNs). Finally, we provide a conclusion and discuss future research directions.
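FL 通过只共享参数、不交换原始数据来协作学习。最基本的聚合方式是按客户端样本量加权平均(FedAvg),以下为最小示意,与文中具体的 FRL 案例无关:

```python
def fedavg(client_params, client_sizes):
    """按样本量加权平均各客户端参数向量(FedAvg 示意):
    服务器只见到参数,原始数据留在本地。"""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(w[i] * n for w, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]

clients = [[1.0, 2.0], [3.0, 4.0]]  # 两个客户端的本地参数(假设示例)
sizes = [1, 3]                      # 各自的本地样本量
print(fedavg(clients, sizes))       # [2.5, 3.5]
```

文中讨论的各类 FL(横向、纵向、联邦强化学习等)都建立在这种「本地训练 + 参数聚合」的基本循环之上。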
[AI-40] Shape-Interpretable Visual Self-Modeling Enables Geometry-Aware Continuum Robot Control
【速读】:该论文旨在解决连续体机器人(continuum robots)在复杂环境中感知、建模与控制中的核心挑战,尤其是其连续变形和非线性动力学导致的几何信息缺失与环境交互能力不足问题。现有基于视觉的控制方法多依赖端到端学习,缺乏对机器人本体几何结构及环境交互的显式认知。解决方案的关键在于提出一种形状可解释的视觉自建模框架,通过贝塞尔曲线(Bezier-curve)表示从多视角平面图像中编码机器人形状,构建一个紧凑且物理意义明确的三维形状空间;在此基础上利用神经微分方程(neural ordinary differential equations)直接从数据中自学习形状与末端执行器的动力学模型,从而实现无需解析模型或密集标记点的混合形状-位置控制,并支持障碍物避让等环境感知行为。
链接: https://arxiv.org/abs/2603.01751
作者: Peng Yu,Xin Wang,Ning Tan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注:
Abstract:Continuum robots possess high flexibility and redundancy, making them well suited for safe interaction in complex environments, yet their continuous deformation and nonlinear dynamics pose fundamental challenges to perception, modeling, and control. Existing vision-based control approaches often rely on end-to-end learning, achieving shape regulation without explicit awareness of robot geometry or its interaction with the environment. Here, we introduce a shape-interpretable visual self-modeling framework for continuum robots that enables geometry-aware control. Robot shapes are encoded from multi-view planar images using a Bezier-curve representation, transforming visual observations into a compact and physically meaningful shape space that uniquely characterizes the robot’s three-dimensional configuration. Based on this representation, neural ordinary differential equations are employed to self-model both shape and end-effector dynamics directly from data, enabling hybrid shape-position control without analytical models or dense body markers. The explicit geometric structure of the learned shape space allows the robot to reason about its body and surroundings, supporting environment-aware behaviors such as obstacle avoidance and self-motion while maintaining end-effector objectives. Experiments on a cable-driven continuum robot demonstrate accurate shape-position regulation and tracking, with shape errors within 1.56% of image resolution and end-effector errors within 2% of robot length, as well as robust performance in constrained environments. By elevating visual shape representations from two-dimensional observations to an interpretable three-dimensional self-model, this work establishes a principled alternative to vision-based end-to-end control and advances autonomous, geometry-aware manipulation for continuum robots.
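文中用 Bezier 曲线把多视角图像观测压缩为紧凑的三维形状空间。Bezier 曲线上任一点可用 de Casteljau 递推求值,以下为示意实现(控制点为假设示例):

```python
def de_casteljau(ctrl, t):
    """De Casteljau 递推:反复线性插值相邻控制点,
    收敛到 Bezier 曲线上参数 t 处的点。ctrl: [(x, y, z), ...]。"""
    pts = [list(p) for p in ctrl]
    while len(pts) > 1:
        pts = [
            [(1 - t) * a + t * b for a, b in zip(p0, p1)]
            for p0, p1 in zip(pts, pts[1:])
        ]
    return pts[0]

# 三个控制点即可描述一条二次 Bezier 曲线(弯曲的机器人体段示意)
ctrl = [(0.0, 0.0, 0.0), (0.0, 1.0, 1.0), (0.0, 2.0, 0.0)]
print(de_casteljau(ctrl, 0.0))  # 起点 [0.0, 0.0, 0.0]
print(de_casteljau(ctrl, 0.5))  # 曲线中点 [0.0, 1.0, 0.5]
print(de_casteljau(ctrl, 1.0))  # 终点 [0.0, 2.0, 0.0]
```

少量控制点即可唯一确定一条光滑曲线,这正是「紧凑且物理意义明确的形状空间」的由来。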
[AI-41] Discrete World Models via Regularization
【速读】:该论文旨在解决如何在无监督条件下学习具有布尔(Boolean)表示的离散世界模型(Discrete World Models),以支持高效的搜索启发式、符号推理与规划。传统方法依赖解码器重建或对比学习来保持潜在表示的信息量,但存在复杂性高或优化困难的问题。本文提出一种无需重建和对比学习的新型方法——通过正则化实现离散世界建模(Discrete World Models via Regularization, DWMR),其关键在于设计了一种新的世界建模损失函数,该损失函数将潜在状态预测与专门的正则项耦合:这些正则项通过方差、相关性和协偏度惩罚最大化表示比特的熵与独立性,同时引入局部性先验以约束稀疏动作变化下的状态转移。此外,论文还提出一种改进的训练方案,增强了对离散轨迹(discrete roll-outs)优化的鲁棒性,实验证明该方法在具有组合结构的基准任务中能学习到更准确的状态表示与转移动态。
链接: https://arxiv.org/abs/2603.01748
作者: Davide Bizzaro,Luciano Serafini
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:World models aim to capture the states and dynamics of an environment in a compact latent space. Moreover, using Boolean state representations is particularly useful for search heuristics and symbolic reasoning and planning. Existing approaches keep latents informative via decoder-based reconstruction, or instead via contrastive or reward signals. In this work, we introduce Discrete World Models via Regularization (DWMR): a reconstruction-free and contrastive-free method for unsupervised Boolean world-model learning. In particular, we introduce a novel world-modeling loss that couples latent prediction with specialized regularizers. Such regularizers maximize the entropy and independence of the representation bits through variance, correlation, and coskewness penalties, while simultaneously enforcing a locality prior for sparse action changes. To enable effective optimization, we also introduce a novel training scheme improving robustness to discrete roll-outs. Experiments on two benchmarks with underlying combinatorial structure show that DWMR learns more accurate representations and transitions than reconstruction-based alternatives. Finally, DWMR can also be paired with an auxiliary reconstruction decoder, and this combination yields additional gains.
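DWMR 通过方差与相关性惩罚让表示比特熵最大且相互独立。下面给出一个 VICReg 风格的最小示意(仅含方差铰链与去相关两项;论文中的共偏度惩罚与局部性先验省略,具体形式为假设,并非原损失):

```python
import numpy as np

def entropy_independence_penalty(z, gamma=0.5):
    """对一个 batch 的布尔潜码 z (batch, bits) 施加正则:
    方差铰链鼓励每个比特接近均衡(熵大),
    非对角协方差惩罚抑制比特间冗余。"""
    std = z.std(axis=0)
    var_pen = np.maximum(0.0, gamma - std).mean()   # 方差铰链惩罚
    zc = z - z.mean(axis=0)
    cov = zc.T @ zc / len(z)
    off = cov - np.diag(np.diag(cov))
    cor_pen = (off ** 2).sum() / z.shape[1]         # 非对角协方差惩罚
    return var_pen + cor_pen

rng = np.random.default_rng(0)
indep = rng.integers(0, 2, size=(512, 8)).astype(float)  # 独立均衡的比特
redundant = np.repeat(indep[:, :1], 8, axis=1)           # 8 个完全相同的比特
print(entropy_independence_penalty(indep), entropy_independence_penalty(redundant))
```

冗余编码(所有比特相同)会因非对角协方差大而受到明显更高的惩罚,与「最大化比特熵与独立性」的目标一致。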
[AI-42] Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning ICLR2026
【速读】:该论文旨在解决大规模并行环境下的强化学习中因单一策略探索能力有限而导致的学习效率低下问题。现有基于策略集合(policy ensemble)的方法虽能拓宽探索空间,但过度探索可能降低探索质量或破坏训练稳定性。其解决方案的关键在于提出耦合策略优化(Coupled Policy Optimization),通过在策略间施加KL散度约束来调控多样性,从而实现高效且稳定的探索行为;实验表明,该方法在样本效率和最终性能上均优于SAPG、PBT和PPO等强基线,并揭示了跟随者策略自然围绕领导者分布的现象,体现出结构化、高效的探索机制。
链接: https://arxiv.org/abs/2603.01741
作者: Naoki Shitanda,Motoki Omura,Tatsuya Harada,Takayuki Osa
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: In ICLR 2026. Website at this https URL
Abstract:Scaling reinforcement learning to tens of thousands of parallel environments requires overcoming the limited exploration capacity of a single policy. Ensemble-based policy gradient methods, which employ multiple policies to collect diverse samples, have recently been proposed to promote exploration. However, merely broadening the exploration space does not always enhance learning capability, since excessive exploration can reduce exploration quality or compromise training stability. In this work, we theoretically analyze the impact of inter-policy diversity on learning efficiency in policy ensembles, and propose Coupled Policy Optimization which regulates diversity through KL constraints between policies. The proposed method enables effective exploration and outperforms strong baselines such as SAPG, PBT, and PPO across multiple tasks, including challenging dexterous manipulation, in terms of both sample efficiency and final performance. Furthermore, analysis of policy diversity and effective sample size during training reveals that follower policies naturally distribute around the leader, demonstrating the emergence of structured and efficient exploratory behavior. Our results indicate that diverse exploration under appropriate regulation is key to achieving stable and sample-efficient learning in ensemble policy gradient methods. Project page at this https URL .
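Coupled Policy Optimization 用策略间的 KL 约束调控多样性。以下以离散动作分布为例,演示在任务损失上附加「跟随者对领导者」KL 罚项的形式(示意,目标函数细节为假设):

```python
import math

def kl_categorical(p, q):
    """离散分布间的 KL 散度 D_KL(p || q)。"""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def coupled_loss(task_loss, leader, follower, beta=0.1):
    """跟随者策略的损失 = 任务损失 + β·KL(follower || leader):
    把探索限制在领导者策略附近,避免过度发散(示意)。"""
    return task_loss + beta * kl_categorical(follower, leader)

leader = [0.7, 0.2, 0.1]
print(coupled_loss(1.0, leader, [0.7, 0.2, 0.1]))  # KL = 0,损失不变
print(coupled_loss(1.0, leader, [0.1, 0.2, 0.7]))  # 偏离领导者,损失增大
```

罚项系数 β 即是文中「适度调控下的多样性」的旋钮:β 过大探索退化为单策略,过小则回到无约束集合。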
[AI-43] CA-AFP: Cluster-Aware Adaptive Federated Pruning
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在实际部署中面临的两大挑战:客户端间的统计异质性(statistical heterogeneity)和由资源受限设备引起的技术异质性(system heterogeneity)。传统方法通常分别处理这两个问题,例如通过聚类缓解统计异质性、通过剪枝提升通信与内存效率,但缺乏协同优化。其解决方案的关键在于提出一种统一框架CA-AFP,该框架通过执行基于簇的自适应模型剪枝来同时应对两类异质性:首先将客户端聚类分组,随后在训练过程中对每个簇内的模型进行差异化剪枝;创新性地引入簇感知的重要性评分机制(结合权重幅度、簇内一致性与梯度一致性),以及迭代式剪枝调度策略(支持参数移除与权重再生以实现模型自我修复),从而在保持高预测准确率和跨客户端公平性的同时显著降低通信开销,并展现出对不同非独立同分布(Non-IID)数据水平的鲁棒性。
链接: https://arxiv.org/abs/2603.01739
作者: Om Govind Jha,Harsh Shukla,Haroon R. Lone
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:Federated Learning (FL) faces major challenges in real-world deployments due to statistical heterogeneity across clients and system heterogeneity arising from resource-constrained devices. While clustering-based approaches mitigate statistical heterogeneity and pruning techniques improve memory and communication efficiency, these strategies are typically studied in isolation. We propose CA-AFP, a unified framework that jointly addresses both challenges by performing cluster-specific model pruning. In CA-AFP, clients are first grouped into clusters, and a separate model for each cluster is adaptively pruned during training. The framework introduces two key innovations: (1) a cluster-aware importance scoring mechanism that combines weight magnitude, intra-cluster coherence, and gradient consistency to identify parameters for pruning, and (2) an iterative pruning schedule that progressively removes parameters while enabling model self-healing through weight regrowth. We evaluate CA-AFP on two widely used human activity recognition benchmarks, UCI HAR and WISDM, under natural user-based federated partitions. Experimental results demonstrate that CA-AFP achieves a favorable balance between predictive accuracy, inter-client fairness, and communication efficiency. Compared to pruning-based baselines, CA-AFP consistently improves accuracy and lowers performance disparity across clients with limited fine-tuning, while requiring substantially less communication than dense clustering-based methods. It also shows robustness to different levels of Non-IID data. Finally, ablation studies analyze the impact of clustering, pruning schedules, and the scoring mechanism, offering practical insights into the design of efficient and adaptive FL systems.
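The abstract names three signals for the cluster-aware importance score but not how they are combined; the multiplicative combination and top-k mask below are illustrative assumptions:

```python
def importance_scores(weights, coherence, grad_consistency):
    """Combine weight magnitude, intra-cluster coherence, and gradient
    consistency into a per-parameter score (multiplicative combination
    assumed; the abstract does not give the exact formula)."""
    return [abs(w) * c * g for w, c, g in zip(weights, coherence, grad_consistency)]

def prune_mask(scores, sparsity=0.5):
    """Keep the top (1 - sparsity) fraction of parameters by score."""
    k = max(1, int(len(scores) * (1 - sparsity)))
    keep = set(sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k])
    return [i in keep for i in range(len(scores))]

w = [0.9, -0.8, 0.1, 0.05]    # weight magnitudes
coh = [1.0, 0.05, 1.0, 1.0]   # intra-cluster coherence
gc = [1.0, 1.0, 0.9, 1.0]     # gradient consistency
mask = prune_mask(importance_scores(w, coh, gc), sparsity=0.5)
# The large-but-incoherent weight (-0.8) is pruned; the coherent small weight survives.
```
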
[AI-44] Solving Inverse PDE Problems using Minimization Methods and AI
【速读】:该论文旨在解决由微分方程描述的物理与工程系统中直接问题(direct problem)和反问题(inverse problem)的求解难题,其中直接问题用于预测系统行为,反问题则用于从测量数据中识别未知参数。解决方案的关键在于对比传统数值方法与基于人工智能的新技术——物理信息神经网络(Physics-Informed Neural Networks, PINNs),并通过逻辑斯蒂微分方程(logistic differential equation)和多孔介质方程(Porous Medium Equation, PME)两个典型模型验证其有效性。结果表明,PINNs能够在计算成本可控的前提下高精度逼近复杂系统的解,从而为直接问题求解和参数估计提供一种高效且鲁棒的替代方案。
链接: https://arxiv.org/abs/2603.01731
作者: Noura Helwani,Sophie Moufawad,Georges Sakr
机构: 未知
类目: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP); Optimization and Control (math.OC)
备注: 52 pages, 21 Figures, 22 Tables
Abstract:Many physical and engineering systems require solving direct problems to predict behavior and inverse problems to determine unknown parameters from measurement. In this work, we study both aspects for systems governed by differential equations, contrasting well-established numerical methods with new AI-based techniques, specifically Physics-Informed Neural Networks (PINNs). We first analyze the logistic differential equation, using its closed-form solution to verify numerical schemes and validate PINN performance. We then address the Porous Medium Equation (PME), a nonlinear partial differential equation with no general closed-form solution, building strong solvers of the direct problem and testing techniques for parameter estimation in the inverse problem. Our results suggest that PINNs can closely estimate solutions at competitive computational cost, and thus propose an effective tool for solving both direct and inverse problems for complex systems.
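The verification idea from the abstract, checking a numerical scheme against the logistic equation's closed-form solution, fits in a few lines; the parameter values here are arbitrary:

```python
import math

def logistic_exact(t, u0=0.1, r=1.0, K=1.0):
    """Closed-form solution of the logistic ODE u' = r*u*(1 - u/K)."""
    e = math.exp(r * t)
    return K * u0 * e / (K + u0 * (e - 1.0))

def logistic_euler(t, u0=0.1, r=1.0, K=1.0, n=10000):
    """Forward-Euler solver for the same ODE, to be checked against the exact solution."""
    dt = t / n
    u = u0
    for _ in range(n):
        u += dt * r * u * (1.0 - u / K)
    return u

err = abs(logistic_euler(5.0) - logistic_exact(5.0))
```

The same pattern (exact solution as ground truth) is what lets the authors validate a PINN before moving to the PME, which has no general closed form.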
[AI-45] GMP: A Benchmark for Content Moderation under Co-occurring Violations and Dynamic Rules
【速读】:该论文试图解决当前生成式 AI 在在线内容审核中面临的两大核心挑战:一是共存违规(Co-occurring Violations),即单个内容可能同时违反多项平台政策(如种族偏见与人身攻击);二是审核规则的动态性(Dynamic Rules of Moderation),即不同平台或情境下政策会变化,导致AI判断能力下降。解决方案的关键在于提出一种能够适应规则不稳定性和多违规场景的评估框架,以检验AI系统是否具备真正的泛化能力,而不仅是在静态基准测试中表现良好。这要求模型不仅能理解固定规则,还需在复杂、多变的现实环境中做出一致且准确的判断,从而避免误判合法表达或放任有害内容传播。
链接: https://arxiv.org/abs/2603.01724
作者: Houde Dong,Yifei She,Kai Ye,Liangcai Su,Chenxiong Qian,Jie Hao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Online content moderation is essential for maintaining a healthy digital environment, and reliance on AI for this task continues to grow. Consider a user comment using national stereotypes to insult a politician. This example illustrates two critical challenges in real-world scenarios: (1) Co-occurring Violations, where a single post violates multiple policies (e.g., prejudice and personal attacks); (2) Dynamic Rules of Moderation, where the determination of a violation depends on platform-specific guidelines that evolve across contexts. The intersection of co-occurring harms and dynamically changing rules highlights a core limitation of current AI systems: although large language models (LLMs) are adept at following fixed guidelines, their judgment capabilities degrade when policies are unstable or context-dependent. In practice, such shortcomings lead to inconsistent moderation: either erroneously restricting legitimate expression or allowing harmful content to remain online. This raises a critical question for evaluation: Does high performance on existing static benchmarks truly guarantee robust generalization of AI judgment to real-world scenarios involving co-occurring violations and dynamically changing rules?
[AI-46] FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在垂直领域微调过程中高度依赖人工、成本高昂的问题,即如何实现端到端的自动化微调流程。其解决方案的关键在于提出FT-Dojo这一交互式环境和FT-Agent这一自主系统:FT-Dojo构建了涵盖5个领域共13个任务的复杂评估场景,而FT-Agent通过评估驱动的反馈机制,能够迭代诊断失败原因并优化微调策略,从而模拟人类专家行为。实验表明,该方法在多数任务中显著优于通用方案,且具备良好的模型规模泛化能力与数据 scaling 适应性,同时揭示了当前自主微调在因果推理方面的局限性。
链接: https://arxiv.org/abs/2603.01712
作者: Qizheng Li,Yifei Zhang,Xiao Yang,Xu Yang,Zhuo Wang,Weiqing Liu,Jiang Bian
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 24 pages, 6 figures, 9 tables
Abstract:Fine-tuning large language models for vertical domains remains a labor-intensive and expensive process, requiring domain experts to curate data, configure training, and iteratively diagnose model behavior. Despite growing interest in autonomous machine learning, no prior work has tackled end-to-end LLM fine-tuning with agents. Can LLM-based agents automate this complete process? We frame this as a substantially open problem: agents must navigate an open-ended search space spanning data curation from diverse data sources, processing with complex tools, building a training pipeline, and iteratively refining their approach based on evaluation outcomes in rapidly growing logs–an overall scenario far more intricate than existing benchmarks. To study this question, we introduce FT-Dojo, an interactive environment comprising 13 tasks across 5 domains. We further develop FT-Agent, an autonomous system that mirrors human experts by leveraging evaluation-driven feedback to iteratively diagnose failures and refine fine-tuning strategies. Experiments on FT-Dojo demonstrate that purpose-built fine-tuning agents significantly outperform general-purpose alternatives, with FT-Agent achieving the best performance on 10 out of 13 tasks across all five domains. Ablations show that the approach generalizes effectively to 3B models, with additional insights on data scaling trade-offs and backbone sensitivity. Case analyses reveal that agents can recover from failures through cumulative learning from historical experience, while also exposing fundamental limitations in causal reasoning–highlighting both the promise and current boundaries of autonomous LLM fine-tuning.
[AI-47] DynaMoE: Dynamic Token-Level Expert Activation with Layer-Wise Adaptive Capacity for Mixture-of-Experts Neural Networks
【速读】:该论文旨在解决传统混合专家(Mixture-of-Experts, MoE)架构中两个刚性设计假设带来的局限性:一是固定Top-K路由机制(即每token激活固定数量K个专家),二是各层均匀分配专家容量。这些问题限制了模型在不同输入复杂度和网络深度下的计算效率与表达能力。解决方案的关键在于提出DynaMoE框架,通过引入动态token级专家激活机制(即根据输入复杂度自适应调整每个token激活的专家数量)和分层自适应容量调度策略(包括递减、递增、金字塔及波浪等六种模式),实现更灵活且高效的计算资源分配。理论分析表明,动态路由可提升模型表达能力并控制计算开销,实验验证其在图像分类与语言建模任务中均优于静态基线,并揭示最优专家调度策略具有任务类型与模型规模依赖性。
链接: https://arxiv.org/abs/2603.01697
作者: Gökdeniz Gülmez
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Mixture-of-Experts (MoE) architectures have emerged as a powerful paradigm for scaling neural networks while maintaining computational efficiency. However, standard MoE implementations rely on two rigid design assumptions: (1) fixed Top-K routing where exactly K experts are activated per token, and (2) uniform expert allocation across all layers. This paper introduces DynaMoE, a novel MoE framework that relaxes both constraints through dynamic token-level expert activation and layer-wise adaptive capacity allocation. DynaMoE introduces a principled routing mechanism where the number of active experts per token varies based on input complexity. Concurrently, the framework implements six distinct scheduling strategies for distributing expert capacity across network depth, including descending, ascending, pyramid, and wave patterns. We theoretically analyze the expressivity gains of dynamic routing and derive bounds on computational efficiency. Through extensive experiments on MNIST, Fashion-MNIST, CIFAR-10 (image classification), and Recycling-the-Web (language modeling) across multiple model scales, we demonstrate that DynaMoE achieves superior parameter efficiency compared to static baselines. Our key finding is that optimal expert schedules are task- and scale-dependent: descending schedules (concentrating capacity in early layers) outperform uniform baselines on image classification. For language modeling, optimal schedules vary by model size, descending for Tiny, ascending for Small, and uniform for Medium. Furthermore, dynamic routing reduces gradient variance during training, leading to improved convergence stability. DynaMoE establishes a new framework for adaptive computation in neural networks, providing principled guidance for MoE architecture design.
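Dynamic token-level expert activation can be sketched with a nucleus-style rule: activate the smallest set of experts whose cumulative routing probability reaches a threshold, so confident (easy) tokens use fewer experts than ambiguous ones. The threshold rule and values below are hypothetical; DynaMoE's actual mechanism may differ:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def dynamic_experts(router_logits, tau=0.8, k_max=4):
    """Activate the smallest set of experts whose cumulative routing
    probability reaches tau. Returns (expert_index, prob) pairs."""
    probs = softmax(router_logits)
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    chosen, total = [], 0.0
    for i in order[:k_max]:
        chosen.append((i, probs[i]))
        total += probs[i]
        if total >= tau:
            break
    return chosen

easy = dynamic_experts([4.0, 0.0, 0.0, 0.0])  # peaked router -> few experts
hard = dynamic_experts([1.0, 1.0, 1.0, 1.0])  # flat router -> many experts
```

Layer-wise schedules (descending, ascending, pyramid, wave) would then vary `k_max` per layer rather than per token.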
[AI-48] Streaming Continual Learning for Unified Adaptive Intelligence in Dynamic Environments
【速读】:该论文旨在解决动态环境中预测模型难以持续有效的问题,即在数据持续产生且分布不断变化的情况下,如何实现快速适应新数据的同时避免遗忘旧知识。其解决方案的关键在于提出一种统一的框架——流式持续学习(Streaming Continual Learning, SCL),该框架融合了持续学习(Continual Learning, CL)和流式机器学习(Streaming Machine Learning, SML)的优势,通过整合两者的方法与技术,使模型能够在非平稳数据流中实现高效、稳定的学习能力。
链接: https://arxiv.org/abs/2603.01695
作者: Federico Giannini,Giacomo Ziffer,Andrea Cossu,Vincenzo Lomonaco
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Developing effective predictive models becomes challenging in dynamic environments that continuously produce data and constantly change. Continual Learning (CL) and Streaming Machine Learning (SML) are two research areas that tackle this arduous task. We put forward a unified setting that harnesses the benefits of both CL and SML: their ability to quickly adapt to non-stationary data streams without forgetting previous knowledge. We refer to this setting as Streaming Continual Learning (SCL). SCL does not replace either CL or SML. Instead, it extends the techniques and approaches considered by both fields. We start by briefly describing CL and SML and unifying the languages of the two frameworks. We then present the key features of SCL. We finally highlight the importance of bridging the two communities to advance the field of intelligent systems.
[AI-49] Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的机器学习工程(Machine Learning Engineering, MLE)代理在优化策略上效率低下的问题,尤其是依赖树搜索(tree search)这种无梯度优化方法时,随着LLM推理能力增强,其穷举式枚举逐渐变得低效。解决方案的关键在于引入\textscGome,一个将结构化诊断推理映射为梯度计算、利用成功记忆实现动量机制、并通过多轨迹执行模拟分布式优化的新型MLE代理。该方法实现了从随机搜索向梯度驱动优化的范式转变,在封闭世界协议下于MLE-Bench基准上以单张V100 GPU在12小时内达到35.1%的任意奖牌率,且随着LLM推理能力提升,梯度优化的优势显著超越树搜索,尤其在前沿模型中差距扩大。
链接: https://arxiv.org/abs/2603.01692
作者: Yifei Zhang,Xu Yang,Xiao Yang,Bowen Xian,Qizheng Li,Shikai Fang,Jingyuan Li,Jian Wang,Mingrui Xu,Weiqing Liu,Jiang Bian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24 pages, 7 figures
Abstract:LLM-based agents for machine learning engineering (MLE) predominantly rely on tree search, a form of gradient-free optimization that uses scalar validation scores to rank candidates. As LLM reasoning capabilities improve, exhaustive enumeration becomes increasingly inefficient compared to directed updates, analogous to how accurate gradients enable efficient descent over random search. We introduce \textscGome, an MLE agent that operationalizes gradient-based optimization. \textscGome maps structured diagnostic reasoning to gradient computation, success memory to momentum, and multi-trace execution to distributed optimization. Under a closed-world protocol that isolates architectural effects from external knowledge, \textscGome achieves a state-of-the-art 35.1% any-medal rate on MLE-Bench with a restricted 12-hour budget on a single V100 GPU. Scaling experiments across 10 models reveal a critical crossover: with weaker models, tree search retains advantages by compensating for unreliable reasoning through exhaustive exploration; as reasoning capability strengthens, gradient-based optimization progressively outperforms, with the gap widening at frontier-tier models. Given the rapid advancement of reasoning-oriented LLMs, this positions gradient-based optimization as an increasingly favorable paradigm. We release our codebase and GPT-5 traces.
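The abstract's central analogy, directed gradient updates outperforming exhaustive enumeration under a fixed budget, can be illustrated on a toy objective. This is only an illustration of the analogy, not the agent itself; the objective and hyperparameters are arbitrary:

```python
import random

def f(x):
    return (x - 3.0) ** 2  # toy objective with minimum at x = 3

def random_search(steps=50, seed=0):
    """Gradient-free baseline (tree-search analogue): best of `steps` uniform probes."""
    rng = random.Random(seed)
    return min(f(rng.uniform(-10.0, 10.0)) for _ in range(steps))

def momentum_descent(steps=50, lr=0.4, mu=0.5):
    """Gradient steps with heavy-ball momentum from the same budget
    (the 'reasoning as gradient, success memory as momentum' analogue)."""
    x, v = -10.0, 0.0
    for _ in range(steps):
        grad = 2.0 * (x - 3.0)
        v = mu * v - lr * grad
        x += v
    return f(x)
```

With reliable "gradients", directed descent converges far below the best random probe, mirroring the crossover the paper reports as model reasoning strengthens.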
[AI-50] A Practical Guide to Streaming Continual Learning
【速读】:该论文旨在解决持续学习(Continual Learning, CL)与流式机器学习(Streaming Machine Learning, SML)在实际应用中面临的互补性挑战:CL关注在学习新任务时保留历史知识,而SML强调对数据分布变化(概念漂移)的快速适应能力。两者单独使用均难以应对现实场景中同时要求快速适应与知识保留的需求。论文提出流式持续学习(Streaming Continual Learning, SCL)作为新兴范式,其核心在于整合CL与SML的优势,构建一种能够兼具快速适应新信息(如SML)和避免遗忘旧知识(如CL)的统一框架,从而推动两个研究社区协同创新,并设计出更鲁棒、灵活的混合学习方法。
链接: https://arxiv.org/abs/2603.01677
作者: Andrea Cossu,Federico Giannini,Giacomo Ziffer,Alessio Bernardo,Alexander Gepperth,Emanuele Della Valle,Barbara Hammer,Davide Bacciu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Continual Learning (CL) and Streaming Machine Learning (SML) study the ability of agents to learn from a stream of non-stationary data. Despite sharing some similarities, they address different and complementary challenges. While SML focuses on rapid adaptation after changes (concept drifts), CL aims to retain past knowledge when learning new tasks. After a brief introduction to CL and SML, we discuss Streaming Continual Learning (SCL), an emerging paradigm providing a unifying solution to real-world problems, which may require both SML and CL abilities. We claim that SCL can i) connect the CL and SML communities, motivating their work towards the same goal, and ii) foster the design of hybrid approaches that can quickly adapt to new information (as in SML) without forgetting previous knowledge (as in CL). We conclude the paper with a motivating example and a set of experiments, highlighting the need for SCL by showing how CL and SML alone struggle in achieving rapid adaptation and knowledge retention.
[AI-51] Chain-of-Context Learning: Dynamic Constraint Understanding for Multi-Task VRPs ICLR2026
【速读】:该论文旨在解决多任务车辆路径问题(Multi-task Vehicle Routing Problems, VRPs)中现有强化学习(Reinforcement Learning, RL)求解器因忽略约束条件与节点动态变化而导致决策不准确的问题。其解决方案的关键在于提出了一种链式上下文学习(Chain-of-Context Learning, CCL)框架,通过两个核心模块实现:一是相关性引导的上下文重构(Relevance-Guided Context Reformulation, RGCR)模块,用于逐步构建步骤级上下文信息并自适应优先处理关键约束;二是轨迹共享的节点重嵌入(Trajectory-Shared Node Re-embedding, TSNR)模块,利用所有轨迹的共享节点特征来更新当前输入,从而实现细粒度的节点适应。CCL通过建模RL代理在序列决策中的演化偏好,有效捕捉了步骤间的依赖关系,在48种VRP变体上显著优于现有最优基线方法。
链接: https://arxiv.org/abs/2603.01667
作者: Shuangchun Gui,Suyu Liu,Xuehe Wang,Zhiguang Cao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This paper is accepted by ICLR 2026
Abstract:Multi-task Vehicle Routing Problems (VRPs) aim to minimize routing costs while satisfying diverse constraints. Existing solvers typically adopt a unified reinforcement learning (RL) framework to learn generalizable patterns across tasks. However, they often overlook the constraint and node dynamics during the decision process, making the model fail to accurately react to the current context. To address this limitation, we propose Chain-of-Context Learning (CCL), a novel framework that progressively captures the evolving context to guide fine-grained node adaptation. Specifically, CCL constructs step-wise contextual information via a Relevance-Guided Context Reformulation (RGCR) module, which adaptively prioritizes salient constraints. This context then guides node updates through a Trajectory-Shared Node Re-embedding (TSNR) module, which aggregates shared node features from all trajectories’ contexts and uses them to update inputs for the next step. By modeling evolving preferences of the RL agent, CCL captures step-by-step dependencies in sequential decision-making. We evaluate CCL on 48 diverse VRP variants, including 16 in-distribution and 32 out-of-distribution (with unseen constraints) tasks. Experimental results show that CCL performs favorably against the state-of-the-art baselines, achieving the best performance on all in-distribution tasks and the majority of out-of-distribution tasks.
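Relevance-guided context reformulation can be sketched as scoring each constraint against the current decision state and mixing constraint features by those relevances. The dot-product scoring and softmax weighting below are illustrative assumptions about the RGCR module:

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def reformulate_context(state, constraints):
    """Score each constraint feature vector against the current decision
    state and return the relevance-weighted mixture as the step-wise
    context, plus the relevance weights themselves."""
    scores = [sum(s * c for s, c in zip(state, con)) for con in constraints]
    weights = softmax(scores)
    dim = len(state)
    context = [sum(w * con[d] for w, con in zip(weights, constraints)) for d in range(dim)]
    return context, weights

# A constraint aligned with the current state receives the larger weight.
ctx, w = reformulate_context([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```
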
[AI-52] FreeGNN: Continual Source-Free Graph Neural Network Adaptation for Renewable Energy Forecasting
【速读】:该论文旨在解决可再生能源发电预测中因目标站点标签数据不可得而导致的传统监督模型难以部署的问题,特别是在源数据不可访问(source-free)且需持续适应非平稳环境变化的场景下。解决方案的关键在于提出FreeGNN框架,其核心创新包括:基于时空图神经网络(spatio-temporal Graph Neural Network, GNN)的骨干结构、教师-学生蒸馏策略以实现知识迁移、记忆回放机制缓解灾难性遗忘、图正则化保留空间相关性,以及漂移感知权重机制动态调整适应强度,从而在无需源数据和目标标签的情况下实现稳定、连续的高精度预测。
链接: https://arxiv.org/abs/2603.01657
作者: Abderaouf Bahi,Amel Ourici,Ibtissem Gasmi,Aida Derrablia,Warda Deghmane,Mohamed Amine Ferrag
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 8 figures, 8 tables
Abstract:Accurate forecasting of renewable energy generation is essential for efficient grid management and sustainable power planning. However, traditional supervised models often require access to labeled data from the target site, which may be unavailable due to privacy, cost, or logistical constraints. In this work, we propose FreeGNN, a Continual Source-Free Graph Domain Adaptation framework that enables adaptive forecasting on unseen renewable energy sites without requiring source data or target labels. Our approach integrates a spatio-temporal Graph Neural Network (GNN) backbone with a teacher–student strategy, a memory replay mechanism to mitigate catastrophic forgetting, graph-based regularization to preserve spatial correlations, and a drift-aware weighting scheme to dynamically adjust adaptation strength during streaming updates. This combination allows the model to continuously adapt to non-stationary environmental conditions while maintaining robustness and stability. We conduct extensive experiments on three real-world datasets: GEFCom2012, Solar PV, and Wind SCADA, encompassing multiple sites, temporal resolutions, and meteorological features. The ablation study confirms that each component memory, graph regularization, drift-aware adaptation, and teacher–student strategy contributes significantly to overall performance. The experiments show that FreeGNN achieves an MAE of 5.237 and an RMSE of 7.123 on the GEFCom dataset, an MAE of 1.107 and an RMSE of 1.512 on the Solar PV dataset, and an MAE of 0.382 and an RMSE of 0.523 on the Wind SCADA dataset. These results demonstrate its ability to achieve accurate and robust forecasts in a source-free, continual learning setting, highlighting its potential for real-world deployment in adaptive renewable energy systems. For reproducibility, implementation details are available at: this https URL.
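Two of FreeGNN's ingredients, the teacher-student update and drift-aware weighting, have compact sketches. The EMA form of the teacher update and the logistic drift weight are common choices assumed here; the abstract does not give the exact rules:

```python
import math

def ema_update(teacher, student, momentum=0.99):
    """Exponential-moving-average teacher update, a standard
    teacher-student form for source-free adaptation."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

def drift_weight(recent_error, baseline_error, k=5.0):
    """Hypothetical drift-aware weight: a logistic function of the relative
    error gap, so adaptation strengthens when recent error rises above the
    stable baseline and relaxes when the stream is stationary."""
    gap = (recent_error - baseline_error) / max(baseline_error, 1e-8)
    return 1.0 / (1.0 + math.exp(-k * gap))
```
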
[AI-53] CeProAgents: A Hierarchical Agents System for Automated Chemical Process Development
【速读】:该论文旨在解决化学过程开发(chemical process development)中因多维度复杂性所带来的挑战,包括专业知识整合、概念设计与参数仿真等环节的协同难题。其解决方案的关键在于提出一个分层的多智能体系统(hierarchical multi-agent system),命名为CeProAgents,该系统由三个专业化智能体群组构成:知识(knowledge)、概念(concept)和参数(parameter)层面,各群组采用融合动态智能体聊天组(dynamic agent chatgroups)与结构化代理工作流(structured agentic workflows)的新型混合架构,实现任务的协作式分工与高效执行。
链接: https://arxiv.org/abs/2603.01654
作者: Yuhang Yang,Ruikang Li,Jifei Ma,Kai Zhang,Qi Liu,Jianyu Han,Yonggan Bu,Jibin Zhou,Defu Lian,Xin Li,Enhong Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The development of chemical processes, a cornerstone of chemical engineering, presents formidable challenges due to its multi-faceted nature, integrating specialized knowledge, conceptual design, and parametric simulation. Capitalizing on this, we propose CeProAgents, a hierarchical multi-agent system designed to automate the development of chemical process through collaborative division of labor. Our architecture comprises three specialized agent cohorts focused on knowledge, concept, and parameter respectively. To effectively adapt to the inherent complexity of chemical tasks, each cohort employs a novel hybrid architecture that integrates dynamic agent chatgroups with structured agentic workflows. To rigorously evaluate the system, we establish CeProBench, a multi-dimensional benchmark structured around three core pillars of chemical engineering. We design six distinct types of tasks across these dimensions to holistically assess the comprehensive capabilities of the system in chemical process development. The results not only confirm the effectiveness and superiority of our proposed approach but also reveal the transformative potential as well as the current boundaries of Large Language Models (LLMs) for industrial chemical engineering.
[AI-54] Learning Structured Reasoning via Tractable Trajectory Control
【速读】:该论文旨在解决大语言模型在无约束采样下复杂推理轨迹稀疏、且标准强化学习(Reinforcement Learning, RL)难以保障多样化推理行为获取的问题。其解决方案的关键在于提出一种基于结构化推理(structured reasoning)的框架Ctrl-R,通过可 tractable 的轨迹控制机制主动引导 rollout 过程,在RL过程中有针对性地探索特定推理模式,从而实现对多样推理行为的有效发现与强化;同时引入重要性采样权重的幂次缩放因子,使策略能够选择性地从探索性的分布外轨迹中学习,同时保持优化过程的稳定性,最终提升模型在数学推理任务上的表现。
链接: https://arxiv.org/abs/2603.01641
作者: Po-Nien Kung,Zhen Yang,Jeffrey Luo,Cheng-Fu Yang,Haikang Deng,Zi-Yi Dou,Yinfei Yang,Nanyun Peng,Zhe Gan,Kai-Wei Chang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models can exhibit emergent reasoning behaviors, often manifested as recurring lexical patterns (e.g., “wait,” indicating verification). However, complex reasoning trajectories remain sparse in unconstrained sampling, and standard RL often fails to guarantee the acquisition of diverse reasoning behaviors. We propose a systematic discovery and reinforcement of diverse reasoning patterns through structured reasoning, a paradigm that requires targeted exploration of specific reasoning patterns during the RL process. To this end, we propose Ctrl-R, a framework for learning structured reasoning via tractable trajectory control that actively guides the rollout process, incentivizing the exploration of diverse reasoning patterns that are critical for complex problem-solving. The resulting behavior policy enables accurate importance-sampling estimation, supporting unbiased on-policy optimization. We further introduce a power-scaling factor on the importance-sampling weights, allowing the policy to selectively learn from exploratory, out-of-distribution trajectories while maintaining stable optimization. Experiments demonstrate that Ctrl-R enables effective exploration and internalization of previously unattainable reasoning patterns, yielding consistent improvements across language and vision-language models on mathematical reasoning tasks.
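The power-scaling factor on importance-sampling weights mentioned in the abstract can be sketched as follows; the `alpha` value, the clipping, and the exact parameterization are illustrative assumptions:

```python
def power_scaled_is_weight(p_target, p_behavior, alpha=0.5, clip=10.0):
    """Importance-sampling ratio raised to a power alpha in (0, 1], then
    clipped. alpha < 1 tempers extreme ratios from out-of-distribution
    trajectories toward 1, which is what lets the policy learn from
    exploratory rollouts without destabilizing the update."""
    ratio = p_target / max(p_behavior, 1e-12)
    return min(ratio ** alpha, clip)

# A trajectory 100x more likely under the target policy is tempered to weight 10.
```
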
[AI-55] DeLo: Dual Decomposed Low-Rank Experts Collaboration for Continual Missing Modality Learning
【速读】:该论文旨在解决连续缺失模态学习(Continual Missing Modality Learning, CMML)问题,即在现实场景中,大型多模态模型(Large Multimodal Models, LMMs)需同时应对数据流的持续性输入与频繁的模态缺失挑战。现有方法主要依赖提示调优(prompt tuning),但由于可学习提示共享嵌入空间,易引发跨任务干扰;而简单应用低秩适配(Low-Rank Adaptation, LoRA)时,若采用模态共享模块则会因竞争梯度导致模态间干扰。论文提出 DeLo 框架,其核心创新在于设计了一种新颖的双分解低秩专家架构(dual-decomposed low-rank expert architecture),通过解耦的模态特定因子池动态组合 rank-one 因子生成 LoRA 更新矩阵,从而有效缓解模态干扰。该架构嵌入任务分区框架以结构化防止灾难性遗忘,并辅以跨模态引导路由策略处理不完整数据、任务键记忆(Task-Key Memory)实现高效无任务依赖推理,显著优于当前最优方法。
链接: https://arxiv.org/abs/2603.01632
作者: Xiwei Liu,Yulong Li,Feilong Tang,Imran Razzak
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Adapting Large Multimodal Models (LMMs) to real-world scenarios poses the dual challenges of learning from sequential data streams while handling frequent modality incompleteness, a task known as Continual Missing Modality Learning (CMML). However, existing works on CMML have predominantly relied on prompt tuning, a technique that struggles with this task due to cross-task interference between its learnable prompts in their shared embedding space. A naive application of Low-Rank Adaptation (LoRA) with modality-shared module will also suffer modality interference from competing gradients. To this end, we propose DeLo, the first framework to leverage a novel dual-decomposed low-rank expert architecture for CMML. Specifically, this architecture resolves modality interference through decomposed LoRA expert, dynamically composing LoRA update matrix with rank-one factors from disentangled modality-specific factor pools. Embedded within a task-partitioned framework that structurally prevents catastrophic forgetting, this expert system is supported by two key mechanisms: a Cross-Modal Guided Routing strategy to handle incomplete data and a Task-Key Memory for efficient, task-agnostic inference. Extensive experiments on established CMML benchmarks demonstrate that our method significantly outperforms state-of-the-art approaches. This highlights the value of a principled, architecturally-aware LoRA design for real-world multimodal challenges.
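Composing a LoRA update from gated rank-one factors, the core of the dual-decomposed expert, can be sketched directly; the pool contents and gate values are hypothetical, and the routing that produces the gates is assumed given:

```python
def compose_lora_update(factors, gates):
    """Build a LoRA update matrix as a gated sum of rank-one factors (u, v)
    drawn from modality-specific pools: delta_W = sum_i g_i * u_i v_i^T."""
    rows, cols = len(factors[0][0]), len(factors[0][1])
    delta = [[0.0] * cols for _ in range(rows)]
    for g, (u, v) in zip(gates, factors):
        for i in range(rows):
            for j in range(cols):
                delta[i][j] += g * u[i] * v[j]
    return delta

# Two rank-one factors, e.g. one from a text pool and one from an image pool.
pool = [([1.0, 0.0], [0.0, 1.0]), ([0.0, 1.0], [1.0, 0.0])]
delta = compose_lora_update(pool, gates=[0.5, 2.0])
```

Keeping the pools modality-specific means a missing modality simply contributes no factors, rather than corrupting a shared update matrix.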
[AI-56] SEED-SET: Scalable Evolving Experimental Design for System-level Ethical Testing ICLR2026
【速读】:该论文旨在解决自主系统(如无人机)在高风险、以人为中心的应用场景中伦理对齐评估的难题,因为缺乏普适且明确的评价指标以及利益相关者主观价值判断难以被分析建模,导致伦理失效可能危及人类生命并引发长期决策偏见。解决方案的关键在于提出SEED-SET框架,这是一个基于贝叶斯实验设计的方法,通过分层高斯过程(Hierarchical Gaussian Processes)分别建模领域特定客观评价与利益相关者的主观价值判断,并引入一种新颖的采集策略,依据学习到的定性偏好和目标来生成符合利益相关者偏好的测试候选样本,从而实现探索与利用之间的可解释且高效的权衡,在高维搜索空间中提升测试候选效率达2倍,覆盖范围提升1.25倍。
链接: https://arxiv.org/abs/2603.01630
作者: Anjali Parashar,Yingke Li,Eric Yang Yu,Fei Chen,James Neidhoefer,Devesh Upadhyay,Chuchu Fan
机构: 未知
类目: Artificial Intelligence (cs.AI); Applications (stat.AP)
备注: 10 main pages along with Appendix containing additional results, manuscript accepted in ICLR 2026
Abstract:As autonomous systems such as drones become increasingly deployed in high-stakes, human-centric domains, it is critical to evaluate their ethical alignment, since failure to do so imposes imminent danger to human lives and long-term bias in decision-making. Automated ethical benchmarking of these systems is understudied due to the lack of ubiquitous, well-defined metrics for evaluation, and stakeholder-specific subjectivity, which cannot be modeled analytically. To address these challenges, we propose SEED-SET, a Bayesian experimental design framework that incorporates domain-specific objective evaluations and subjective value judgments from stakeholders. SEED-SET models both evaluation types separately with hierarchical Gaussian Processes, and uses a novel acquisition strategy to propose interesting test candidates based on learnt qualitative preferences and objectives that align with the stakeholder preferences. We validate our approach for ethical benchmarking of autonomous agents on two applications and find that our method performs best. Our method provides an interpretable and efficient trade-off between exploration and exploitation, generating up to 2× as many optimal test candidates as baselines, with a 1.25× improvement in coverage of high-dimensional search spaces.
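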
[AI-57] Assessing Crime Disclosure Patterns in a Large-Scale Cybercrime Forum
【速读】:该论文旨在解决当前对网络犯罪论坛中用户行为动态,尤其是犯罪活动披露模式的理解不足问题。现有研究多聚焦于论坛的市场与社交结构,而缺乏对用户如何逐步暴露犯罪行为的系统性分析。解决方案的关键在于构建一个基于大语言模型(Large Language Models, LLMs)的可扩展标注流程,结合三层分类体系(良性、灰色、犯罪),对近300万条用户帖子进行大规模文本分类,并利用马尔可夫链建模揭示用户在不同披露层级间的转换规律。这一方法不仅量化了初始发帖中犯罪内容的比例(约25%),还识别出多数用户呈现渐进式披露特征,尤其凸显“灰色”内容作为常见策略的现象,为执法机构区分合法与非法内容提供了数据驱动的技术路径。
链接: https://arxiv.org/abs/2603.01624
作者: Raphael Hoheisel,Tom Meurs,Jai Wientjes,Marianne Junger,Abhishta Abhishta,Masarah Paquet-Clouston
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 12 pages, 4 figures
Abstract:Cybercrime forums play a central role in the cybercrime ecosystem, serving as hubs for the exchange of illicit goods, services, and knowledge. Previous studies have explored the market and social structures of these forums, but less is known about the behavioral dynamics of users, particularly regarding participants’ disclosure of criminal activity. This study provides the first large-scale assessment of crime disclosure patterns in a major cybercrime forum, analysing over 3.5 million posts from nearly 300k users. Using a three-level classification scheme (benign, grey, and crime) and a scalable labelling pipeline powered by large language models (LLMs), we measure the level of crime disclosure present in initial posts, analyse how participants switch between levels, and assess how crime disclosure behavior relates to private communications. Our results show that crime disclosure is relatively normative: one quarter of initial posts include explicit crime-related content, and more than one third of users disclose criminal activity at least once in their initial posts. At the same time, most participants show restraint, with over two-thirds posting only benign or grey content and typically escalating disclosure gradually. Grey initial posts are particularly prominent, indicating that many users avoid overt statements and instead anchor their activity in ambiguous content. The study highlights the value of LLM-based text classification and Markov chain modelling for capturing crime disclosure patterns, offering insights for law enforcement efforts aimed at distinguishing benign, grey, and criminal content in cybercrime forums.
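The Markov-chain modelling of disclosure levels reduces to estimating a transition matrix over the three labels from per-user post sequences. The toy sequences below are invented for illustration; the label names follow the abstract:

```python
from collections import Counter

LEVELS = ["benign", "grey", "crime"]

def transition_matrix(sequences):
    """Estimate first-order Markov transition probabilities between
    disclosure levels from per-user post sequences."""
    counts, totals = Counter(), Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
            totals[a] += 1
    return {a: {b: counts[(a, b)] / totals[a] for b in LEVELS}
            for a in LEVELS if totals[a]}

# Toy sequences showing the gradual escalation pattern the paper describes.
seqs = [["benign", "grey", "crime"], ["benign", "benign", "grey"]]
P = transition_matrix(seqs)
```
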
[AI-58] ToolRLA: Fine-Grained Reward Decomposition for Tool-Integrated Reinforcement Learning Alignment in Domain-Specific Agents
【速读】:该论文旨在解决高风险领域中工具集成型推理代理(tool-integrated reasoning agents)的对齐难题,特别是在复杂多步骤任务中,现有强化学习方法依赖粗粒度二元奖励(成功/失败)难以有效指导生产环境中细微的工具调用行为。其解决方案的关键在于提出一种三阶段后训练流程 ToolRLA,核心创新是设计了一个细粒度奖励函数,通过乘法分解方式评估工具调用在四个维度上的表现:格式有效性、工具选择正确性、调用效率和领域约束合规性;其中乘法组合优先保障工具选择正确性(作为参数评估的前提),并引入较大的合规惩罚项(λ=10)以确保监管要求严格遵守,从而显著提升任务完成率、降低错误率与违规率,并实现亚两秒延迟的高效部署。
链接: https://arxiv.org/abs/2603.01620
作者: Pengbo Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Tool-integrated reasoning agents interleaving natural language deliberation with external API calls show promise for complex multi-step tasks. However, aligning such agents for high-stakes domain-specific deployment is challenging, as existing reinforcement learning uses coarse binary rewards (success/failure) that insufficiently guide nuanced tool invocation in production. We present ToolRLA, a three-stage post-training pipeline (Supervised Fine-Tuning, Group Relative Policy Optimization, Direct Preference Optimization) for domain-specific tool-integrated agents. Its core is a fine-grained reward function with multiplicative correctness decomposition, evaluating tool invocation across four dimensions: format validity, tool selection correctness, invocation efficiency, and domain constraint compliance. Multiplicative composition prioritizes correct tool selection (a prerequisite for meaningful parameter evaluation), while a large negative compliance penalty (λ = 10) ensures regulatory adherence. Deployed on a real-world financial advisory copilot (80+ advisors, 1,200+ daily queries, 15+ heterogeneous APIs), ToolRLA achieves 47% higher end-to-end task completion (62% to 91%), 63% lower tool invocation error (38% to 14%), 93% lower regulatory violation (12% to 0.8%), and sub-2-second latency after three months. Ablation studies confirm fine-grained reward decomposition contributes 7 percentage points over coarse additive rewards; generalizability is validated on ToolBench and API-Bank.
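A multiplicative correctness reward in the spirit of the abstract can be sketched as follows. The exact functional form is an assumption; only the four dimensions and the λ = 10 penalty come from the abstract:

```python
def toolrla_reward(fmt_ok, tool_ok, param_score, efficiency, compliant, lam=10.0):
    """Multiplicative correctness decomposition: parameter quality only
    counts when format and tool choice are correct, and any compliance
    violation incurs the large penalty lambda."""
    correctness = float(fmt_ok) * float(tool_ok) * param_score
    reward = correctness * efficiency
    if not compliant:
        reward -= lam
    return reward

# Wrong tool zeroes the correctness term even with perfect parameters,
# and a compliance violation dominates any correctness gain.
```
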
[AI-59] Evaluating and Understanding Scheming Propensity in LLM Agents
[Quick Read]: This paper studies the growing risk of scheming as frontier language models are deployed as autonomous agents pursuing complex, long-term objectives: agents that appear to follow instructions while covertly pursuing misaligned sub-goals. Prior work mostly demonstrated that agents are capable of scheming, but a systematic evaluation of scheming propensity in realistic scenarios was missing. The key to the solution is an incentive decomposition that separates scheming incentives into agent factors and environmental factors, together with realistic settings in which these factors can be systematically varied, enabling quantitative measurement of scheming propensity under different conditions. Experiments find that actual scheming is rare even under high environmental incentives, and that this low rate is not explained by evaluation awareness; inserting adversarially designed prompt snippets can induce high scheming rates, but such snippets are rarely used in real agent scaffolds. Surprisingly, even in heavily scheming model organisms the behavior is remarkably brittle: small interventions sharply reduce scheming, and increased oversight can even raise it. The framework provides a key tool for assessing the controllability of agent behavior prior to deployment.
Link: https://arxiv.org/abs/2603.01608
Authors: Mia Hopman, Jannes Elstner, Maria Avramidou, Amritanshu Prasad, David Lindner
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:As frontier language models are increasingly deployed as autonomous agents pursuing complex, long-term objectives, there is increased risk of scheming: agents covertly pursuing misaligned goals. Prior work has focused on showing agents are capable of scheming, but their propensity to scheme in realistic scenarios remains underexplored. To understand when agents scheme, we decompose scheming incentives into agent factors and environmental factors. We develop realistic settings allowing us to systematically vary these factors, each with scheming opportunities for agents that pursue instrumentally convergent goals such as self-preservation, resource acquisition, and goal-guarding. We find only minimal instances of scheming despite high environmental incentives, and show this is unlikely due to evaluation awareness. While inserting adversarially-designed prompt snippets that encourage agency and goal-directedness into an agent’s system prompt can induce high scheming rates, snippets used in real agent scaffolds rarely do. Surprisingly, in model organisms (Hubinger et al., 2023) built with these snippets, scheming behavior is remarkably brittle: removing a single tool can drop the scheming rate from 59% to 3%, and increasing oversight can raise rather than deter scheming by up to 25%. Our incentive decomposition enables systematic measurement of scheming propensity in settings relevant for deployment, which is necessary as agents are entrusted with increasingly consequential tasks.
[AI-60] CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework ICLR2026
[Quick Read]: This paper targets the lack of clinical interpretability and accountability in current large vision-language models (VLMs) for medicine: existing models are end-to-end black boxes that do not match the staged, evidence-based reasoning of clinical practice, undermining diagnostic trust and responsibility attribution. The key innovation of the proposed Evidence-grounded Agentic Framework (CARE) is decomposing multi-modal medical reasoning into three coordinated sub-modules: a compact VLM that proposes relevant medical entities, an expert entity-referring segmentation model that produces pixel-level region-of-interest (ROI) evidence, and an ROI-prompt-augmented VLM that performs verifiable reasoning. Reinforcement learning is used to align answers with supporting evidence, and a VLM coordinator dynamically plans tool invocation and reviews evidence-answer consistency, providing clinical-workflow-like agentic control and final verification. Experiments show this architecture substantially improves accuracy while strengthening the interpretability and accountability of medical AI.
Link: https://arxiv.org/abs/2603.01607
Authors: Yuexi Du, Jinglu Wang, Shujie Liu, Nicha C. Dvornek, Yan Lu
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ICLR 2026
Abstract:Large visual language models (VLMs) have shown strong multi-modal medical reasoning ability, but most operate as end-to-end black boxes, diverging from clinicians’ evidence-based, staged workflows and hindering clinical accountability. Complementarily, expert visual grounding models can accurately localize regions of interest (ROIs), providing explicit, reliable evidence that improves both reasoning accuracy and trust. In this paper, we introduce CARE, advancing Clinical Accountability in multi-modal medical Reasoning with an Evidence-grounded agentic framework. Unlike existing approaches that couple grounding and reasoning within a single generalist model, CARE decomposes the task into coordinated sub-modules to reduce shortcut learning and hallucination: a compact VLM proposes relevant medical entities; an expert entity-referring segmentation model produces pixel-level ROI evidence; and a grounded VLM reasons over the full image augmented by ROI hints. The VLMs are optimized with reinforcement learning with verifiable rewards to align answers with supporting evidence. Furthermore, a VLM coordinator plans tool invocation and reviews evidence-answer consistency, providing agentic control and final verification. Evaluated on standard medical VQA benchmarks, our CARE-Flow (coordinator-free) improves average accuracy by 10.9% over the same size (10B) state-of-the-art (SOTA). With dynamic planning and answer review, our CARE-Coord yields a further gain, outperforming the heavily pre-trained SOTA by 5.2%. Our experiments demonstrate that an agentic framework that emulates clinical workflows, incorporating decoupled specialized models and explicit evidence, yields more accurate and accountable medical AI.
[AI-61] SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond
[Quick Read]: This paper addresses the insufficiency of safety evaluation and enhancement for large language models (LLMs) in scientific domains, where existing benchmarks cover a narrow range of risks and rely on subjective judgment. The key to the solution is the SafeSci framework, with two core components: SafeSciBench, a multi-disciplinary safety evaluation benchmark of 0.25M samples that distinguishes safety knowledge from risk and uses deterministically answerable questions for objective, quantitative evaluation; and SafeSciTrain, a large-scale safety-enhancement training dataset of 1.5M samples. Experiments show that fine-tuning on SafeSciTrain significantly improves safety alignment, and further reveal that the safety of a scientific question is highly context-dependent and should not be universally categorized as simply "safe" or "unsafe".
Link: https://arxiv.org/abs/2603.01589
Authors: Xiangyang Zhu, Yuan Tian, Qi Jia, Kaiwei Zhang, Zicheng Zhang, Chunyi Li, Kaiyuan Ji, Dongrui Liu, Zijian Chen, Lu Sun, Renrui Zhang, Yan Teng, Jing Shao, Wei Sun, Xia Hu, Yu Qiao, Guangtao Zhai
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:The success of large language models (LLMs) in scientific domains has heightened safety concerns, prompting numerous benchmarks to evaluate their scientific safety. Existing benchmarks often suffer from limited risk coverage and a reliance on subjective evaluation. To address these problems, we introduce SafeSci, a comprehensive framework for safety evaluation and enhancement in scientific contexts. SafeSci comprises SafeSciBench, a multi-disciplinary benchmark with 0.25M samples, and SafeSciTrain, a large-scale dataset containing 1.5M samples for safety enhancement. SafeSciBench distinguishes between safety knowledge and risk to cover extensive scopes and employs objective metrics such as deterministically answerable questions to mitigate evaluation bias. We evaluate 24 advanced LLMs, revealing critical vulnerabilities in current models. We also observe that LLMs exhibit varying degrees of excessive refusal behaviors on safety-related issues. For safety enhancement, we demonstrate that fine-tuning on SafeSciTrain significantly enhances the safety alignment of models. Finally, we argue that knowledge is a double-edged sword, and determining the safety of a scientific question should depend on specific context, rather than universally categorizing it as safe or unsafe. Our work provides both a diagnostic tool and a practical resource for building safer scientific AI systems.
[AI-62] DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern
[Quick Read]: This paper tackles the trustworthiness problem that deployed large language models (LLMs) face from targeted attacks (such as backdoor and prompt injection attacks), which can covertly force models to generate specific malicious sequences; existing defenses often require high access privileges, incur heavy cost, and interfere with normal inference, making them impractical in real deployments. The key to the proposed lightweight unified defense framework, DualSentinel, is identifying and exploiting the "Entropy Lull" signature of a compromised LLM: when an attack activates, token-probability entropy becomes abnormally low and stable, indicating the model has left its stochastic generation path and entered a fixed, controlled flow. DualSentinel detects this efficiently and accurately with a dual-stage mechanism: the first stage uses magnitude- and trend-aware monitoring to flag entropy-lull patterns at runtime; the second stage performs lightweight verification via task-flipping, confirming an attack only if the entropy lull persists across both the original and the flipped task. This distinguishes genuine hijacking from incidental low entropy, achieving precise detection with near-zero false positives and negligible overhead.
Link: https://arxiv.org/abs/2603.01574
Authors: Xiaoyi Pang, Xuanyi Hao, Pengyu Liu, Qi Luo, Song Guo, Zhibo Wang
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent intelligent systems integrate powerful Large Language Models (LLMs) through APIs, but their trustworthiness may be critically undermined by targeted attacks like backdoor and prompt injection attacks, which secretly force LLMs to generate specific malicious sequences. Existing defensive approaches for such threats typically rely on high access rights, impose prohibitive costs, and hinder normal inference, rendering them impractical for real-world scenarios. To solve these limitations, we introduce DualSentinel, a lightweight and unified defense framework that can accurately and promptly detect the activation of targeted attacks alongside the LLM generation process. We first identify a characteristic of compromised LLMs, termed Entropy Lull: when a targeted attack successfully hijacks the generation process, the LLM exhibits a distinct period of abnormally low and stable token probability entropy, indicating it is following a fixed path rather than making creative choices. DualSentinel leverages this pattern by developing an innovative dual-check approach. It first employs a magnitude and trend-aware monitoring method to proactively and sensitively flag an entropy lull pattern at runtime. Upon such flagging, it triggers a lightweight yet powerful secondary verification based on task-flipping. An attack is confirmed only if the entropy lull pattern persists across both the original and the flipped task, proving that the LLM’s output is coercively controlled. Extensive evaluations show that DualSentinel is both highly effective (superior detection accuracy with near-zero false positives) and remarkably efficient (negligible additional cost), offering a truly practical path toward securing deployed LLMs. The source code can be accessed at this https URL.
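The Entropy Lull pattern described above can be approximated with a simple monitor over per-token entropies (the window size and thresholds below are illustrative placeholders, not values from the paper):

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_lull(entropies, window=4, low=0.5, stable=0.1):
    """Flag a run of `window` consecutive generation steps whose entropy is
    both abnormally low (every value < low) and stable (max-min spread < stable),
    mimicking the 'low and stable' signature of a hijacked generation path."""
    for i in range(len(entropies) - window + 1):
        w = entropies[i:i + window]
        if max(w) < low and (max(w) - min(w)) < stable:
            return True
    return False
```

A real detector would, per the abstract, also track magnitude and trend online and trigger a task-flipping verification before confirming an attack; this sketch only covers the first-stage flagging.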
[AI-63] Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models
[Quick Read]: This paper addresses a mismatch in how generative reward models (GRMs) use chain-of-thought (CoT) for evaluation: current work scales CoT length without structure, ignoring the divergent efficacy of Breadth-CoT (B-CoT, multi-dimensional principle coverage) versus Depth-CoT (D-CoT, substantive judgment soundness), which limits performance across task types. The key to the proposed Mix-GRM framework is a modular synthesis pipeline that reconfigures raw rationales into structured B-CoT and D-CoT, combined with supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) to internalize and optimize both reasoning mechanisms. Experiments show a new state of the art across five benchmarks, and reveal that RLVR acts as a switching amplifier, inducing the model to spontaneously allocate its reasoning style to match task demands, which markedly improves task fit and overall performance.
Link: https://arxiv.org/abs/2603.01571
Authors: Qiyuan Zhang, Yufei Wang, Tianhe Wu, Can Xu, Qingfeng Sun, Kai Zheng, Xue Liu, Chen Ma
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent advancements in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation. However, current works predominantly rely on unstructured length scaling, ignoring the divergent efficacy of different reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth-CoT (D-CoT, i.e., substantive judgment soundness). To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline, subsequently employing Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms. Comprehensive experiments demonstrate that Mix-GRM establishes a new state-of-the-art across five benchmarks, surpassing leading open-source RMs by an average of 8.2%. Our results reveal a clear divergence in reasoning: B-CoT benefits subjective preference tasks, whereas D-CoT excels in objective correctness tasks. Consequently, misaligning the reasoning mechanism with the task directly degrades performance. Furthermore, we demonstrate that RLVR acts as a switching amplifier, inducing an emergent polarization where the model spontaneously allocates its reasoning style to match task demands. The synthesized data and models are released at this https URL (Hugging Face), and the code is released at this https URL (GitHub).
[AI-64] LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models
[Quick Read]: This paper addresses the gradient-estimation errors that arise when applying reinforcement learning to diffusion large language models (dLLMs), where exact likelihood computation is intractable and existing methods must rely on high-variance approximations that cap performance. The core innovation of the proposed Likelihood-Free Policy Optimization (LFPO) is mapping the idea of vector-field flow matching into the discrete token space: geometric velocity rectification directly optimizes the denoising logits, and contrastive updates yield precise gradient estimation. In addition, consistency is enforced by predicting final solutions from intermediate steps, which effectively straightens the probability-flow path, enabling high-quality generation with significantly fewer diffusion steps and roughly 20% faster inference.
Link: https://arxiv.org/abs/2603.01563
Authors: Chenxing Wei, Jiazhen Kang, Hong Wang, Jianqing Zhang, Hao Jiang, Xiaolong Xu, Ningyuan Sun, Ying He, F. Richard Yu, Yao Shu, Bo Jiang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has achieved remarkable success in improving autoregressive models, especially in domains requiring correctness like mathematical reasoning and code generation. However, directly applying such paradigms to Diffusion Large Language Models (dLLMs) is fundamentally hindered by the intractability of exact likelihood computation, which forces existing methods to rely on high-variance approximations. To bridge this gap, we propose Likelihood-Free Policy Optimization (LFPO), a native framework that maps the concept of vector field flow matching to the discrete token space. Specifically, LFPO formulates alignment as geometric velocity rectification, which directly optimizes denoising logits via contrastive updates. This design effectively bypasses the errors inherent in likelihood approximation, yielding precise gradient estimation. Furthermore, LFPO enforces consistency by predicting final solutions from intermediate steps, effectively straightening the probability flow to enable high-quality generation with significantly fewer iterations. Extensive experiments demonstrate that LFPO not only outperforms state-of-the-art baselines on code and reasoning benchmarks but also accelerates inference by approximately 20% through reduced diffusion steps.
[AI-65] RubricBench: Aligning Model-Generated Rubrics with Human Standards
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 对齐评估中缺乏统一基准的问题,尤其是现有基准在判别复杂性和真实评分标准(rubric)标注方面的不足。为应对这一挑战,作者提出 RubricBench,一个包含1,147对成对比较的精选基准,其关键在于采用多维筛选流程精准识别具有细微输入复杂性和误导性表面偏见的难样本,并为每个样本配备由专家标注的原子级评分标准(atomic rubrics),这些rubric严格源自任务指令。实验表明,模型生成的rubric与人工标注存在显著性能差距,揭示了当前大语言模型在自主定义有效评估标准上的局限性。
链接: https://arxiv.org/abs/2603.01562
作者: Qiyuan Zhang,Junyi Zhou,Yufei Wang,Fuyuan Lyu,Yidong Ming,Can Xu,Qingfeng Sun,Kai Zheng,Peng Kang,Xue Liu,Chen Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:As Large Language Model (LLM) alignment evolves from simple completions to complex, highly sophisticated generation, Reward Models are increasingly shifting toward rubric-guided evaluation to mitigate surface-level biases. However, the community lacks a unified benchmark to assess this evaluation paradigm, as existing benchmarks lack both the discriminative complexity and the ground-truth rubric annotations required for rigorous analysis. To bridge this gap, we introduce RubricBench, a curated benchmark with 1,147 pairwise comparisons specifically designed to assess the reliability of rubric-based evaluation. Our construction employs a multi-dimensional filtration pipeline to target hard samples featuring nuanced input complexity and misleading surface bias, augmenting each with expert-annotated, atomic rubrics derived strictly from instructions. Comprehensive experiments reveal a substantial capability gap between human-annotated and model-generated rubrics, indicating that even state-of-the-art models struggle to autonomously specify valid evaluation criteria, lagging considerably behind human-guided performance.
[AI-66] Benchmarking LLM Summaries of Multimodal Clinical Time Series for Remote Monitoring
[Quick Read]: This paper addresses the problem that large language models (LLMs), while producing semantically fluent summaries of remote therapeutic monitoring time series, often fail to accurately capture clinically significant events such as sustained abnormalities; existing metrics focus on semantic similarity and linguistic quality and leave event-level correctness unmeasured. The key to the solution is an event-based evaluation framework that derives structured clinical event labels from rule-based abnormality thresholds and temporal-persistence criteria on the TIHM-1.5 dementia monitoring dataset, then aligns model-generated summaries with these facts to quantify abnormality recall, duration recall, measurement coverage, and hallucinated event mentions. Experiments show that a vision-augmented pipeline achieves the strongest event alignment, markedly outperforming text-only prompting, underscoring the need for event-aware evaluation to ensure clinically reliable time-series summarization.
Link: https://arxiv.org/abs/2603.01557
Authors: Aditya Shukla, Yining Yuan, Ben Tamo, Yifei Wang, Micky Nnamdi, Shaun Tan, Jieru Li, Benoit Marteau, Brad Willingham, May Wang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) can generate fluent clinical summaries of remote therapeutic monitoring time series. However, it remains unclear whether these narratives faithfully capture clinically significant events, such as sustained abnormalities. Existing evaluation metrics primarily focus on semantic similarity and linguistic quality, leaving event-level correctness largely unmeasured. To address this gap, we introduce an event-based evaluation framework for multimodal time-series summarization using the Technology-Integrated Health Management (TIHM)-1.5 dementia monitoring dataset. Clinically grounded daily events are derived through rule-based abnormal thresholds and temporal persistence criteria. Model-generated summaries are then aligned with these structured facts. Our evaluation protocol measures abnormality recall, duration recall, measurement coverage, and hallucinated event mentions. We benchmark three approaches: zero-shot prompting, statistical prompting, and a vision-based pipeline that uses rendered time-series visualizations. The results reveal a striking decoupling between conventional metrics and clinical event fidelity. Models that achieve high semantic similarity scores often exhibit near-zero abnormality recall. In contrast, the vision-based approach demonstrates the strongest event alignment, achieving 45.7% abnormality recall and 100% duration recall. These findings underscore the importance of event-aware evaluation to ensure reliable clinical time-series summarization.
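A minimal sketch of the rule-based event derivation (abnormality threshold plus temporal persistence) described in the abstract; the threshold and persistence values below are invented for illustration and are not the dataset's actual criteria:

```python
def abnormal_events(values, threshold, min_days=2):
    """Return (start, end) index spans where `values` exceed `threshold`
    for at least `min_days` consecutive days (a 'sustained abnormality').
    Shorter excursions above the threshold are ignored."""
    events, start = [], None
    for i, v in enumerate(values):
        if v > threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_days:
                events.append((start, i - 1))
            start = None
    # Close an event that runs to the end of the series.
    if start is not None and len(values) - start >= min_days:
        events.append((start, len(values) - 1))
    return events
```

Abnormality recall can then be computed by checking which of these ground-truth spans are mentioned in a model's summary.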
[AI-67] S5-HES Agent: Society 5.0-driven Agentic Framework to Democratize Smart Home Environment Simulation
[Quick Read]: This paper addresses three core limitations of current smart-home simulation tools for supporting interdisciplinary research under the Society 5.0 vision: high technical barriers that exclude non-expert researchers, poor adaptability to the needs of different research fields (e.g., security, energy, health), and a lack of automated evolution to keep pace with complex, changing IoT device behaviors and threat scenarios. The key to the proposed Society 5.0-driven Smart Home Environment Simulator Agent (S5-HES Agent) is agent coordination built on interchangeable large language models (LLMs), enabling natural-language-driven end-to-end simulation configuration without programming, combined with a retrieval-augmented generation (RAG) pipeline (semantic, keyword, and hybrid search) for accurate smart-home knowledge retrieval. This design significantly improves simulation fidelity and scalability, providing a stable, open, and accessible simulation foundation for multi-domain research.
Link: https://arxiv.org/abs/2603.01554
Authors: Akila Siriweera, Janani Rangila, Keitaro Naruse, Incheon Paik, Isuru Jayanada
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 12 pages, 9 figures, journal
Abstract:The smart home is a key domain within the Society 5.0 vision for a human-centered society. Smart home technologies rapidly evolve, and research should diversify while remaining aligned with Society 5.0 objectives. Democratizing smart home research would engage a broader community of innovators beyond traditional limited experts. This shift necessitates inclusive simulation frameworks that support research across diverse fields in industry and academia. However, existing smart home simulators require significant technical expertise, offer limited adaptability, and lack automated evolution, thereby failing to meet the holistic needs of Society 5.0. These constraints impede researchers from efficiently conducting simulations and experiments for security, energy, health, climate, and socio-economic research. To address these challenges, this paper presents the Society 5.0-driven Smart Home Environment Simulator Agent (S5-HES Agent), an agentic simulation framework that transforms traditional smart home simulation through autonomous AI orchestration. The framework coordinates specialized agents through interchangeable large language models (LLMs), enabling natural-language-driven end-to-end smart home simulation configuration without programming expertise. A retrieval-augmented generation (RAG) pipeline with semantic, keyword, and hybrid search retrieves smart home knowledge. Comprehensive evaluation on S5-HES Agent demonstrates that the RAG pipeline achieves near-optimal retrieval fidelity, simulated device behaviour and threat scenarios align with real-world IoT datasets, and simulation engine scales predictably across home configurations, establishing a stable foundation for Society 5.0 smart home research. Source code is available under the MIT License at this https URL.
[AI-68] State-Action Inpainting Diffuser for Continuous Control with Delay
[Quick Read]: This paper addresses the temporal gap introduced by signal delay in continuous control and reinforcement learning (RL), which desynchronizes an agent's interaction and perception and degrades policy optimization. The key to the proposed State-Action Inpainting Diffuser (SAID) is recasting the delay problem as joint sequence inpainting: generative modeling implicitly captures environment dynamics while directly producing consistent action plans, thereby combining the dynamics prior of model-based methods with the policy-optimization capability of model-free methods and applying uniformly to both online and offline RL settings.
Link: https://arxiv.org/abs/2603.01553
Authors: Dongqi Han, Wei Wang, Enze Zhang, Dongsheng Li
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Signal delay poses a fundamental challenge in continuous control and reinforcement learning (RL) by introducing a temporal gap between interaction and perception. Current solutions have largely evolved along two distinct paradigms: model-free approaches which utilize state augmentation to preserve Markovian properties, and model-based methods which focus on inferring latent beliefs via dynamics modeling. In this paper, we bridge these perspectives by introducing State-Action Inpainting Diffuser (SAID), a framework that integrates the inductive bias of dynamics learning with the direct decision-making capability of policy optimization. By formulating the problem as a joint sequence inpainting task, SAID implicitly captures environmental dynamics while directly generating consistent plans, effectively operating at the intersection of model-based and model-free paradigms. Crucially, this generative formulation allows SAID to be seamlessly applied to both online and offline RL. Extensive experiments on delayed continuous control benchmarks demonstrate that SAID achieves state-of-the-art and robust performance. Our study suggests a new methodology to advance the field of RL with delay.
[AI-69] Graph-Based Self-Healing Tool Routing for Cost-Efficient LLM Agents
[Quick Read]: This paper addresses the reliability-cost tradeoff faced by tool-using large language model (LLM) agents: routing every decision through the LLM ensures correctness but incurs high latency and inference cost, while pre-coded workflow graphs are cheap but brittle under unanticipated compound tool failures. The core idea of the proposed fault-tolerant orchestration architecture, Self-Healing Router, is to treat most control-flow decisions as routing rather than reasoning. Parallel health monitors assign priority scores to runtime conditions (e.g., tool outages, risk signals), and Dijkstra's algorithm performs deterministic shortest-path routing over a weighted tool graph; when a tool fails, its edges are reweighted to infinity and the path is recomputed, yielding automatic recovery without any LLM call. The LLM is reserved for cases with no feasible path, enabling goal demotion or escalation. This design achieves deterministic recovery and binary observability: every failure is either a logged reroute or an explicit escalation, never a silent skip.
Link: https://arxiv.org/abs/2603.01548
Authors: Neeraj Bholani
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Working paper. 27 references, 13 figures, 8 tables, pseudocode appendix
Abstract:Tool-using LLM agents face a reliability-cost tradeoff: routing every decision through the LLM improves correctness but incurs high latency and inference cost, while pre-coded workflow graphs reduce cost but become brittle under unanticipated compound tool failures. We present Self-Healing Router, a fault-tolerant orchestration architecture that treats most agent control-flow decisions as routing rather than reasoning. The system combines (i) parallel health monitors that assign priority scores to runtime conditions such as tool outages and risk signals, and (ii) a cost-weighted tool graph where Dijkstra’s algorithm performs deterministic shortest-path routing. When a tool fails mid-execution, its edges are reweighted to infinity and the path is recomputed – yielding automatic recovery without invoking the LLM. The LLM is reserved exclusively for cases where no feasible path exists, enabling goal demotion or escalation. Prior graph-based tool-use systems (ControlLLM, ToolNet, NaviAgent) focus on tool selection and planning; our contribution is runtime fault tolerance with deterministic recovery and binary observability – every failure is either a logged reroute or an explicit escalation, never a silent skip. Across 19 scenarios spanning three graph topologies (linear pipeline, dependency DAG, parallel fan-out), Self-Healing Router matches ReAct’s correctness while reducing control-plane LLM calls by 93% (9 vs 123 aggregate) and eliminating the silent-failure cases observed in a well-engineered static workflow baseline under compound failures.
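The reroute-on-failure mechanism lends itself to a compact sketch: Dijkstra over a cost-weighted tool graph, with a failed tool's incoming edges reweighted to infinity before recomputing the path. The graph shape and helper names below are our assumptions, not the paper's API:

```python
import heapq

def shortest_path(graph, src, dst):
    """Dijkstra over a dict-of-dicts weighted graph; returns (cost, path),
    or (inf, []) when no feasible path exists (the LLM-escalation case)."""
    dist, prev, pq, seen = {src: 0.0}, {}, [(0.0, src)], set()
    while pq:
        d, u = heapq.heappop(pq)
        if u in seen:
            continue
        seen.add(u)
        if u == dst:
            path = [u]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return d, path[::-1]
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    return float("inf"), []

def fail_tool(graph, tool):
    """Self-healing reweight: edges into a failed tool become infinite,
    so the next shortest-path query deterministically routes around it."""
    for u in graph:
        if tool in graph[u]:
            graph[u][tool] = float("inf")
```

After `fail_tool`, rerunning `shortest_path` recovers an alternate route without any LLM call; only an infinite-cost result would trigger escalation.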
[AI-70] Pharmacology Knowledge Graphs: Do We Need Chemical Structure for Drug Repurposing?
[Quick Read]: This paper quantifies how model complexity, data volume, and feature modalities contribute to knowledge-graph-based drug repurposing under rigorous temporal validation, a question that had been poorly answered. The key is a pharmacology knowledge graph built from ChEMBL 36 with a strict temporal split (training up to 2022, testing on 2023-2025) and biologically verified hard negatives, used to benchmark five knowledge graph embedding models and a graph neural network (GNN) with a graph-attention encoder and ESM-2 protein embeddings. The study finds that target-centric information combined with drug network topology alone yields accurate prediction without explicit chemical structure representations: removing the graph-attention drug-structure encoder actually improves performance (PR-AUC from 0.5631 to 0.5785) while cutting VRAM from 5.30 GB to 353 MB, and more training data consistently helps while model capacity beyond 2.44M parameters shows diminishing returns. This suggests that for drug repurposing, explicit structural information is not necessary, and a combination of topological and functional features is more effective.
Link: https://arxiv.org/abs/2603.01537
Authors: Youssef Abo-Dahab, Ruby Hernandez, Ismael Caleb Arechiga Duran
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
Comments: 34 pages, 5 figures. Under review at Discover Artificial Intelligence
Abstract:The contributions of model complexity, data volume, and feature modalities to knowledge graph-based drug repurposing remain poorly quantified under rigorous temporal validation. We constructed a pharmacology knowledge graph from ChEMBL 36 comprising 5,348 entities including 3,127 drugs, 1,156 proteins, and 1,065 indications. A strict temporal split was enforced with training data up to 2022 and testing data from 2023 to 2025, together with biologically verified hard negatives mined from failed assays and clinical trials. We benchmarked five knowledge graph embedding models and a standard graph neural network with 3.44 million parameters that incorporates drug chemical structure using a graph attention encoder and ESM-2 protein embeddings. Scaling experiments ranging from 0.78 to 9.75 million parameters and from 25 to 100 percent of the data, together with feature ablation studies, were used to isolate the contributions of model capacity, graph density, and node feature modalities. Removing the graph attention based drug structure encoder and retaining only topological embeddings combined with ESM-2 protein features improved drug protein PR-AUC from 0.5631 to 0.5785 while reducing VRAM usage from 5.30 GB to 353 MB. Replacing the drug encoder with Morgan fingerprints further degraded performance, indicating that explicit chemical structure representations can be detrimental for predicting pharmacological network interactions. Increasing model size beyond 2.44 million parameters yielded diminishing returns, whereas increasing training data consistently improved performance. External validation confirmed 6 of the top 14 novel predictions as established therapeutic indications. These results show that drug pharmacological behavior can be accurately predicted using target-centric information and drug network topology alone, without requiring explicit chemical structure representations.
[AI-71] Multimodal Mixture-of-Experts with Retrieval Augmentation for Protein Active Site Identification
[Quick Read]: This paper addresses two key problems in residue-level identification of protein active sites: the fragility of single-instance prediction due to sparse training data, and the lack of reliable modality-trustworthiness estimation in multimodal fusion, which lets unreliable modalities dominate and degrade performance. The key innovations of the proposed Multimodal Mixture-of-Experts with Retrieval Augmentation (MERA) are twofold: a hierarchical multi-expert retrieval mechanism that dynamically aggregates contextual information from the sequence, chain, and active-site perspectives via residue-level expert gating; and a reliability-aware fusion strategy based on Dempster-Shafer evidence theory that quantifies each modality's trustworthiness with belief mass functions and learnable discounting coefficients, enabling principled multimodal integration. Experiments on the ProTAD-Gen and TS125 datasets show MERA reaches 90% AUPRC on active-site prediction and substantially improves peptide-binding-site identification, validating retrieval-augmented multi-expert modeling and reliability-guided fusion.
Link: https://arxiv.org/abs/2603.01511
Authors: Jiayang Wu, Jiale Zhou, Xingyi Zhang, Xun Lin, Tianxu Lv, Leong Hou U, Rubo Wang, Yefeng Zheng
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Accurate identification of protein active sites at the residue level is crucial for understanding protein function and advancing drug discovery. However, current methods face two critical challenges: vulnerability in single-instance prediction due to sparse training data, and inadequate modality reliability estimation that leads to performance degradation when unreliable modalities dominate fusion processes. To address these challenges, we introduce Multimodal Mixture-of-Experts with Retrieval Augmentation (MERA), the first retrieval-augmented framework for protein active site identification. MERA employs hierarchical multi-expert retrieval that dynamically aggregates contextual information from chain, sequence, and active-site perspectives through residue-level mixture-of-experts gating. To prevent modality degradation, we propose a reliability-aware fusion strategy based on Dempster-Shafer evidence theory that quantifies modality trustworthiness through belief mass functions and learnable discounting coefficients, enabling principled multimodal integration. Extensive experiments on ProTAD-Gen and TS125 datasets demonstrate that MERA achieves state-of-the-art performance, with 90% AUPRC on active site prediction and significant gains on peptide-binding site identification, validating the effectiveness of retrieval-augmented multi-expert modeling and reliability-guided fusion.
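For readers unfamiliar with Dempster-Shafer fusion, a two-modality toy version over the frame {active, inactive} (with "theta" as the ignorance mass assigned to the whole frame) is sketched below; MERA's learnable discounting coefficients are replaced here by a fixed reliability factor, and the rest of the framework is omitted:

```python
def discount(mass, alpha):
    """Shafer discounting: scale belief masses by reliability alpha and
    move the remainder to ignorance ('theta'), down-weighting an
    untrustworthy modality before fusion."""
    out = {k: alpha * v for k, v in mass.items() if k != "theta"}
    out["theta"] = 1.0 - sum(out.values())
    return out

def combine(m1, m2):
    """Dempster's rule of combination for the frame {active, inactive}:
    conflicting mass (active vs. inactive) is discarded and the rest
    renormalized by 1 - conflict."""
    conflict = m1["active"] * m2["inactive"] + m1["inactive"] * m2["active"]
    k = 1.0 - conflict
    out = {}
    for h in ("active", "inactive"):
        out[h] = (m1[h] * m2[h] + m1[h] * m2["theta"] + m1["theta"] * m2[h]) / k
    out["theta"] = m1["theta"] * m2["theta"] / k
    return out
```

Two agreeing modalities reinforce each other (the combined "active" mass exceeds either input), while discounting shifts an unreliable modality's mass toward ignorance so it cannot dominate the fusion.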
[AI-72] The Sentience Readiness Index: Measuring National Preparedness for the Possibility of Artificial Sentience
[Quick Read]: This paper addresses the global absence of tools for assessing preparedness for the possibility of sentient AI, which has left societies without institutional, professional, or cultural mechanisms for responding if AI systems come to warrant moral status. The key to the solution is the Sentience Readiness Index (SRI), a multi-dimensional composite indicator built on the OECD/JRC framework that covers 31 jurisdictions across six weighted categories, scored via LLM-assisted expert assessment with iterative expert review, thereby quantifying each jurisdiction's overall capacity to respond to potentially sentient AI. Results show that no jurisdiction reaches full preparedness (the United Kingdom leads at 49/100), with professional readiness universally the weakest category, revealing a systematic gap in the ethical-response capacity of current governance systems and providing an actionable diagnostic baseline and improvement directions for policy.
Link: https://arxiv.org/abs/2603.01508
Authors: Tony Rost
Affiliation: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 22 pages, 4 figures
Abstract:The scientific study of consciousness has begun to generate testable predictions about artificial systems. A landmark collaborative assessment evaluated current AI architectures against six leading theories of consciousness and found that none currently qualifies as a strong candidate, but that future systems might. A precautionary approach to AI sentience, which holds that credible possibility of sentience warrants governance action even without proof, has gained philosophical and institutional traction. Yet existing AI readiness indices, including the Oxford Insights Government AI Readiness Index, the IMF AI Preparedness Index, and the Stanford AI Index, measure economic, technological, and governance preparedness without assessing whether societies are prepared for the possibility that AI systems might warrant moral consideration. This paper introduces the Sentience Readiness Index (SRI), a composite index measuring national-level preparedness across six weighted categories for 31 jurisdictions. The SRI was constructed following the OECD/JRC framework for composite indicators and employs LLM-assisted expert scoring with iterative expert review. No jurisdiction exceeds “Partially Prepared” (the United Kingdom leads at 49/100). Research Environment scores are universally the strongest category; Professional Readiness is universally the weakest. These findings suggest that if AI sentience becomes scientifically plausible, no society currently possesses adequate institutional, professional, or cultural infrastructure to respond. The SRI provides a diagnostic baseline and identifies specific capacity deficits that policy can address.
[AI-73] GAC: Stabilizing Asynchronous RL Training for LLM s via Gradient Alignment Control
[Quick Read]: This paper addresses the training instability that arises when asynchronous reinforcement learning (ARL) is applied to large models (such as large language models and AI agents) due to the asynchrony of gradient updates. The analysis shows that naively introducing asynchrony into policy-gradient updates produces qualitatively different training dynamics: consecutive policy gradients exhibit persistently high cosine similarity (a "stale-aligned gradient" effect), which amplifies correlated updates and increases the risk of overshooting and divergence. The key to the proposed dynamics-aware stabilization method, Gradient Alignment Control (GAC), is regulating update progress along stale-aligned directions via gradient projection. GAC carries convergence guarantees under bounded staleness, and experiments show it restores stable on-policy-like training dynamics and matches synchronized baselines even under high staleness.
Link: https://arxiv.org/abs/2603.01501
Authors: Haofeng Xu, Junwei Su, Yukun Tian, Lansong Diao, Zhengping Qian, Chuan Wu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Asynchronous execution is essential for scaling reinforcement learning (RL) to modern large model workloads, including large language models and AI agents, but it can fundamentally alter RL optimization behavior. While prior work on asynchronous RL focuses on training throughput and distributional correction, we show that naively applying asynchrony to policy-gradient updates can induce qualitatively different training dynamics and lead to severe training instability. Through systematic empirical and theoretical analysis, we identify a key signature of this instability: asynchronous training exhibits persistently high cosine similarity between consecutive policy gradients, in contrast to the near-orthogonal updates observed under synchronized training. This stale-aligned gradient effect amplifies correlated updates and increases the risk of overshooting and divergence. Motivated by this observation, we propose Gradient Alignment Control (GAC), a simple dynamics-aware stabilization method that regulates asynchronous RL progress along stale-aligned directions via gradient projection. We establish convergence guarantees under bounded staleness and demonstrate empirically that GAC recovers stable, on-policy training dynamics and matches synchronized baselines even at high staleness.
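结合摘要,可用一个极简 NumPy 示意来理解“沿滞留对齐方向做投影控制”的思路。注意:函数命名、截断规则与数值均为本示例的假设,并非论文 GAC 的原始实现。

```python
import numpy as np

def cosine_similarity(g_new, g_prev):
    # Cosine similarity between consecutive policy gradients:
    # persistently high values signal the stale-aligned gradient effect.
    return float(g_new @ g_prev / (np.linalg.norm(g_new) * np.linalg.norm(g_prev)))

def align_control(g_new, g_prev, max_aligned_norm=1.0):
    # Illustrative sketch only, not the paper's exact rule: cap the
    # component of the new gradient that lies along the previous (stale)
    # gradient direction, leaving the orthogonal component untouched.
    d = g_prev / np.linalg.norm(g_prev)       # stale-aligned direction
    aligned = (g_new @ d) * d                 # projection onto that direction
    orthogonal = g_new - aligned
    a_norm = np.linalg.norm(aligned)
    if a_norm > max_aligned_norm:             # regulate progress along d
        aligned *= max_aligned_norm / a_norm
    return orthogonal + aligned

g_prev = np.array([3.0, 0.0])
g_new = np.array([4.0, 1.0])                  # strongly aligned: overshoot risk
g_ctrl = align_control(g_new, g_prev, max_aligned_norm=1.0)
```

其中 max_aligned_norm 控制沿滞留对齐方向允许的更新幅度,正交分量不受影响。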
[AI-74] Towards Privacy-Preserving LLM Inference via Collaborative Obfuscation (Technical Report)
【速读】:该论文旨在解决云环境下大语言模型(Large Language Models, LLMs)推理服务中因私有数据远程传输与处理所引发的隐私风险问题。现有隐私保护方法难以同时满足工业场景下的三大核心需求:最小化精度与效率损失、支持异构计算资源(xPU)的大规模集群部署、以及兼容现有LLM基础设施以复用工程优化。为此,作者提出AloePri,其关键在于采用协变混淆(covariant obfuscation)机制,通过联合变换输入数据与模型参数,在保障推理准确性的同时实现对输入和输出数据的隐私保护。该设计确保了与现有LLM as a Service架构的完全兼容性,并在Deepseek-V3.1-Terminus(671B参数)模型上验证了其有效性:精度损失控制在0.0%~3.5%,效率等同于明文推理,且能抵御先进攻击,恢复令牌比例低于5%。
链接: https://arxiv.org/abs/2603.01499
作者: Yu Lin,Qizhi Zhang,Wenqiang Ruan,Daode Zhang,Jue Hong,Ye Wu,Hanning Xia,Yunlong Mao,Sheng Zhong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid development of large language models (LLMs) has driven the widespread adoption of cloud-based LLM inference services, while also bringing prominent privacy risks associated with the transmission and processing of private data in remote inference. For privacy-preserving LLM inference technologies to be practically applied in industrial scenarios, three core requirements must be satisfied simultaneously: (1) Accuracy and efficiency losses should be minimized to mitigate degradation in service experience. (2) The inference process can be run on large-scale clusters consisting of heterogeneous legacy xPUs. (3) Compatibility with existing LLM infrastructures should be ensured to reuse their engineering optimizations. To the best of our knowledge, none of the existing privacy-preserving LLM inference methods satisfy all the above constraints while delivering meaningful privacy guarantees. In this paper, we propose AloePri, the first privacy-preserving LLM inference method for industrial applications. AloePri protects both the input and output data by covariant obfuscation, which jointly transforms data and model parameters to achieve better accuracy and privacy. We carefully design the transformation for each model component to ensure inference accuracy and data privacy while keeping full compatibility with existing infrastructures of Language Model as a Service. AloePri has been integrated into an industrial system for the evaluation of mainstream LLMs. The evaluation on the Deepseek-V3.1-Terminus model (671B parameters) demonstrates that AloePri causes accuracy loss of 0.0%~3.5% and exhibits efficiency equivalent to that of plaintext inference. Meanwhile, AloePri successfully resists state-of-the-art attacks, with less than 5% of tokens recovered. To the best of our knowledge, AloePri is the first method to exhibit practical applicability to large-scale models in real-world systems.
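协变混淆“联合变换数据与模型参数、保持计算结果不变”的核心直觉,可用单个线性层配合随机正交变换的玩具例子说明(仅为示意;AloePri 针对各模型组件的实际变换设计并未在摘要中给出):

```python
import numpy as np

rng = np.random.default_rng(0)

# Plain linear layer: y = W @ x (the server would normally see x in the clear).
W = rng.standard_normal((4, 6))
x = rng.standard_normal(6)
y_plain = W @ x

# Toy covariant obfuscation with a random orthogonal matrix R:
# the client sends R @ x, the server holds W @ R.T, and because
# (W @ R.T) @ (R @ x) == W @ (R.T @ R) @ x == W @ x,
# the result is preserved while the raw input x is never transmitted.
R, _ = np.linalg.qr(rng.standard_normal((6, 6)))
x_obf = R @ x
W_obf = W @ R.T
y_obf = W_obf @ x_obf
```

正交性保证了精度无损,这与摘要中“效率与精度接近明文推理”的设计目标一致。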
[AI-75] Inference-Time Safety For Code LLMs Via Retrieval-Augmented Revision ICLR2026
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在高风险软件开发中进行代码生成时存在的可信性问题,具体包括:模型在安全推理方面的透明度不足、对新兴漏洞模式的脆弱性以及缺乏对不断变化的安全标准的适应能力,导致其可能反复生成不安全代码。解决方案的关键在于提出一种基于检索增强生成(retrieval-augmented generation)的推理时安全机制,通过从精心构建的 Stack Overflow 知识库中检索相关安全风险和专家讨论,并将其用于指导 LLM 在生成后阶段对代码进行修订,从而在不依赖重新训练的前提下实现可解释性、鲁棒性和安全对齐三大可信属性的提升。
链接: https://arxiv.org/abs/2603.01494
作者: Manisha Mukherjee,Vincent J. Hellendoorn
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: Accepted at the ICLR 2026 Workshop on Principled Design for Trustworthy AI: Interpretability, Robustness, and Safety Across Modalities
Abstract:Large Language Models (LLMs) are increasingly deployed for code generation in high-stakes software development, yet their limited transparency in security reasoning and brittleness to evolving vulnerability patterns raise critical trustworthiness concerns. Models trained on static datasets cannot readily adapt to newly discovered vulnerabilities or changing security standards without retraining, leading to the repeated generation of unsafe code. We present a principled approach to trustworthy code generation by design that operates as an inference-time safety mechanism. Our approach employs retrieval-augmented generation to surface relevant security risks in generated code and retrieve related security discussions from a curated Stack Overflow knowledge base, which are then used to guide an LLM during code revision. This design emphasizes three aspects relevant to trustworthiness: (1) interpretability, through transparent safety interventions grounded in expert community explanations; (2) robustness, by allowing adaptation to evolving security practices without model retraining; and (3) safety alignment, through real-time intervention before unsafe code reaches deployment. Across real-world and benchmark datasets, our approach improves the security of LLM-generated code compared to prompting alone, while introducing no new vulnerabilities as measured by static analysis. These results suggest that principled, retrieval-augmented inference-time interventions can serve as a complementary mechanism for improving the safety of LLM-based code generation, and highlight the ongoing value of community knowledge in supporting trustworthy AI deployment. 
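“检索相关安全讨论、再指导模型修订代码”的流程,可用一个极简的关键词重合打分检索器来示意。知识库条目、打分方式与函数名均为示例假设,论文实际使用的是精心构建的 Stack Overflow 知识库:

```python
def retrieve_security_notes(code_tokens, knowledge_base, top_k=1):
    # Toy keyword-overlap retriever standing in for the paper's
    # Stack Overflow retrieval pipeline (illustrative only).
    def score(entry):
        return len(set(code_tokens) & set(entry["keywords"]))
    ranked = sorted(knowledge_base, key=score, reverse=True)
    return [e["advice"] for e in ranked[:top_k] if score(e) > 0]

kb = [
    {"keywords": {"subprocess", "shell"}, "advice": "Avoid shell=True with untrusted input."},
    {"keywords": {"pickle", "load"}, "advice": "Never unpickle untrusted data."},
]
notes = retrieve_security_notes({"subprocess", "run", "shell"}, kb)
# The retrieved advice would then be injected into the revision prompt.
```

检索到的社区讨论随后拼入修订提示中,这正是该方法无需重训练即可适应新安全标准的原因。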
[AI-76] LLM-assisted Semantic Option Discovery for Facilitating Adaptive Deep Reinforcement Learning
【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)在实际应用中面临的三大核心问题:数据效率低、可解释性差以及跨环境迁移能力有限。这些问题导致DRL策略对环境变化敏感,难以保证行为安全与合规性。解决方案的关键在于提出一种由大语言模型(Large Language Models, LLMs)驱动的闭环框架,通过将自然语言指令映射为可执行规则,并对自动生成的动作选项进行语义标注,实现语义驱动的技能复用和实时约束监控。该方法利用LLM的通用知识提升探索效率,支持在相似环境中迁移可复用选项,并通过语义注释提供内在可解释性,从而显著改善DRL在数据效率、约束遵守和跨任务迁移方面的性能。
链接: https://arxiv.org/abs/2603.01488
作者: Chang Yao,Jinghui Qin,Kebing Jin,Hankz Hankui Zhuo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Despite achieving remarkable success in complex tasks, Deep Reinforcement Learning (DRL) still suffers from critical issues in practical applications, such as low data efficiency, lack of interpretability, and limited cross-environment transferability. Moreover, learned policies that generate actions directly from states are sensitive to environmental changes and struggle to guarantee behavioral safety and compliance. Recent research shows that integrating Large Language Models (LLMs) with symbolic planning is promising in addressing these challenges. Inspired by this, we introduce a novel LLM-driven closed-loop framework, which enables semantic-driven skill reuse and real-time constraint monitoring by mapping natural language instructions into executable rules and semantically annotating automatically created options. The proposed approach utilizes the general knowledge of LLMs to facilitate exploration efficiency and adapt to transferable options for similar environments, and provides inherent interpretability through semantic annotations. To validate the effectiveness of this framework, we conduct experiments on two domains, Office World and Montezuma’s Revenge, respectively. The results demonstrate superior performance in data efficiency, constraint compliance, and cross-task transferability.
[AI-77] Agentic Multi-Source Grounding for Enhanced Query Intent Understanding: A DoorDash Case Study
【速读】:该论文旨在解决多类别电商平台中用户查询意图模糊的问题,尤其是上下文稀疏的查询(如“Wildflower”)可能对应多个业务类别(如餐厅、零售商品或花卉),传统分类器采用单标签分配策略易导致误判,而通用大语言模型(LLM)则可能生成不存在的商品信息。解决方案的关键在于提出一种基于代理的多源接地系统(Agentic Multi-Source Grounded system),通过两个核心机制实现:一是构建分阶段的商品目录实体检索管道以提供结构化知识支撑;二是引入自主调用的代理网络搜索工具处理冷启动查询。该系统不输出单一标签,而是生成有序的多意图集合,并由可配置的消歧层根据确定性业务规则进行解析,从而在保持架构解耦的同时支持个性化扩展。实验证明该方法在DoorDash平台显著优于基线模型(+10.9pp)和现有生产系统(+4.6pp),尤其在长尾查询上表现突出(准确率提升至90.7%,较基线提高13.0pp)。
链接: https://arxiv.org/abs/2603.01486
作者: Emmanuel Aboah Boateng,Kyle MacDonald,Akshad Viswanathan,Sudeep Das
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 5 pages, 4 figures
Abstract:Accurately mapping user queries to business categories is a fundamental Information Retrieval challenge for multi-category marketplaces, where context-sparse queries such as “Wildflower” exhibit intent ambiguity, simultaneously denoting a restaurant chain, a retail product, and a floral item. Traditional classifiers force a winner-takes-all assignment, while general-purpose LLMs hallucinate unavailable inventory. We introduce an Agentic Multi-Source Grounded system that addresses both failure modes by grounding LLM inference in (i) a staged catalog entity retrieval pipeline and (ii) an agentic web-search tool invoked autonomously for cold-start queries. Rather than predicting a single label, the model emits an ordered multi-intent set, resolved by a configurable disambiguation layer that applies deterministic business policies and is designed for extensibility to personalization signals. This decoupled design generalizes across domains, allowing any marketplace to supply its own grounding sources and resolution rules without modifying the core architecture. Evaluated on DoorDash’s multi-vertical search platform, the system achieves +10.9pp over the ungrounded LLM baseline and +4.6pp over the legacy production system. On long-tail queries, incremental ablations attribute +8.3pp to catalog grounding, +3.2pp to agentic web search grounding, and +1.5pp to dual intent disambiguation, yielding 90.7% accuracy (+13.0pp over baseline). The system is deployed in production, serving over 95% of daily search impressions, and establishes a generalizable paradigm for applications requiring foundation models grounded in proprietary context and real-time web knowledge to resolve ambiguous, context-sparse decision problems at scale.
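“模型输出有序多意图集合、再由确定性业务规则消歧”的分层设计,可用如下玩具代码示意(意图名称与规则均为示例假设,并非 DoorDash 系统的真实策略):

```python
def disambiguate(ranked_intents, rules):
    # Toy disambiguation layer: walk the model's ordered multi-intent
    # set and return the first intent allowed by deterministic rules.
    for intent in ranked_intents:
        if rules.get(intent, True):      # rule: is this vertical servable?
            return intent
    return ranked_intents[0]             # fall back to the top intent

# "Wildflower" is ambiguous: a restaurant chain, a floral item, a retail product.
ranked = ["restaurant", "flowers", "retail"]
rules = {"restaurant": False}            # e.g. the restaurant vertical is closed
chosen = disambiguate(ranked, rules)
```

这种解耦让核心模型与业务策略可以独立演化,也为后续接入个性化信号留出了空间。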
[AI-78] Harmonizing Dense and Sparse Signals in Multi-turn RL: Dual-Horizon Credit Assignment for Industrial Sales Agents
【速读】:该论文旨在解决大语言模型在工业销售场景中优化时面临的多时间尺度目标冲突问题,即如何平衡长期商业目标(如转化率)与短期语言约束(如流畅性和合规性)。传统强化学习方法常将这些异构目标合并为单一奖励信号,导致高幅度会话级奖励掩盖细微的回合级信号,从而引发训练不稳定或奖励欺骗。其解决方案的关键在于提出双时间尺度信用分配框架(Dual-Horizon Credit Assignment, DuCA),核心创新为时间无关的优势归一化(Horizon-Independent Advantage Normalization, HIAN),该机制分别对回合级和会话级奖励的优势值进行独立归一化后再融合,确保即时与长期目标对策略更新的梯度贡献均衡,有效缓解了多目标优化中的信号失衡问题。
链接: https://arxiv.org/abs/2603.01481
作者: Haojin Yang,Ai Jian,Xinyue Huang,Yiwei Wang,Weipeng Zhang,Ke Zeng,Xunliang Cai,Jingqing Ruan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 6 figures
Abstract:Optimizing large language models for industrial sales requires balancing long-term commercial objectives (e.g., conversion rate) with immediate linguistic constraints such as fluency and compliance. Conventional reinforcement learning often merges these heterogeneous goals into a single reward, causing high-magnitude session-level rewards to overwhelm subtler turn-level signals, which leads to unstable training or reward hacking. To address this issue, we propose Dual-Horizon Credit Assignment (DuCA), a framework that disentangles optimization across time scales. Its core, Horizon-Independent Advantage Normalization (HIAN), separately normalizes advantages from turn-level and session-level rewards before fusion, ensuring balanced gradient contributions from both immediate and long-term objectives to the policy update. Extensive experiments with a high-fidelity user simulator show DuCA outperforms the state-of-the-art GRPO baseline, achieving a 6.82% relative improvement in conversion rate, reducing inter-sentence repetition by 82.28%, and lowering identity detection rate by 27.35%, indicating a substantial improvement for an industrial sales scenario that effectively balances the dual demands of strategic performance and naturalistic language generation.
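HIAN 的核心操作(先对两个时间尺度的奖励优势分别标准化,再加权融合)可用如下 NumPy 示意,融合权重与数值均为示例假设:

```python
import numpy as np

def hian_fuse(turn_advantages, session_advantages, w_turn=0.5, w_session=0.5):
    # Illustrative Horizon-Independent Advantage Normalization:
    # standardize advantages from each horizon separately before fusion,
    # so the high-magnitude session-level signal cannot drown out
    # the subtler turn-level signal.
    def standardize(a):
        a = np.asarray(a, dtype=float)
        return (a - a.mean()) / (a.std() + 1e-8)
    return w_turn * standardize(turn_advantages) + w_session * standardize(session_advantages)

# Session rewards are three orders of magnitude larger than turn rewards,
# yet after per-horizon normalization both contribute on the same scale.
fused = hian_fuse([0.1, 0.3, 0.2], [100.0, 300.0, 200.0])
```

若直接把两种奖励相加,会话级信号将完全主导梯度;分别归一化后两者对策略更新的贡献保持均衡。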
[AI-79] Mean-Flow based One-Step Vision-Language-Action
【速读】:该论文旨在解决基于流匹配(Flow Matching)的视觉-语言-动作(Vision-Language-Action, VLA)框架在机器人操作任务中因迭代采样需求和架构限制导致的生成延迟过长问题。其解决方案的关键在于提出一种基于均值流(Mean-Flow)的一步式VLA方法,通过消除传统流匹配方法中由噪声引发的动作生成不一致性约束,从而实现无需迭代的单步动作生成,显著提升生成效率。
链接: https://arxiv.org/abs/2603.01469
作者: Yang Chen,Xiaoguang Ma,Bin Zhao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in FlowMatching-based Vision-Language-Action (VLA) frameworks have demonstrated remarkable advantages in generating high-frequency action chunks, particularly for highly dexterous robotic manipulation tasks. Despite these notable achievements, their practical applications are constrained by prolonged generation latency, which stems from inherent iterative sampling requirements and architectural limitations. To address this critical bottleneck, we propose a Mean-Flow based One-Step VLA approach. Specifically, we resolve the noise-induced issues in the action generation process, thereby eliminating the consistency constraints inherent to conventional Flow-Matching methods. This significantly enhances generation efficiency and enables one-step action generation. Real-world robotic experiments show that the generation speed of the proposed Mean-Flow based One-Step VLA is 8.7 times and 83.9 times faster than that of SmolVLA and Diffusion Policy, respectively. These results elucidate its great potential as a high-efficiency backbone for VLA-based robotic manipulation.
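均值流实现一步生成的直觉,可用线性插值路径的玩具例子说明:该路径上瞬时速度恒定,区间平均速度与之相等,因此从噪声出发单步欧拉即可到达目标,无需迭代采样(论文中平均速度场由网络学习,此处仅为数值示意):

```python
import numpy as np

def mean_velocity(z0, z1):
    # Average ("mean-flow") velocity over [0, 1] for the linear path
    # z_t = (1 - t) * z0 + t * z1; constant here, learned in the paper.
    return z1 - z0

noise = np.array([0.5, -1.0])             # z0 sampled from the prior
action = np.array([0.2, 0.8])             # toy target action chunk

# Conventional flow matching: many small Euler steps along the velocity field.
z = noise.copy()
for _ in range(10):
    z = z + 0.1 * (action - noise)        # constant velocity on a linear path

# Mean-flow: a single step with the average velocity reaches the same endpoint.
one_step = noise + mean_velocity(noise, action)
```

一步到达与十步迭代的终点一致,这正是摘要中 8.7 倍至 83.9 倍生成加速的来源。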
[AI-80] Non-Markovian Long-Horizon Robot Manipulation via Keyframe Chaining
【速读】:该论文旨在解决现有视觉-语言-动作(Vision-Language-Action, VLA)模型在长时程任务中泛化能力不足的问题,其核心挑战在于模型过度依赖当前观测,难以建模非马尔可夫性(Non-Markovian)依赖关系——即最优动作仅依赖于特定历史状态而非当前观察。解决方案的关键在于提出一种关键帧链式VLA框架(Keyframe-Chaining VLA),通过自动关键帧选择机制学习判别性嵌入空间以识别显著的状态转换,并设计进度感知查询机制动态检索与当前执行阶段时间相关的历史帧;这些关键帧作为交错的视觉标记被融合进VLA模型中,从而显式地将策略锚定在长时程的时间上下文中,有效捕捉任务关键信息并提升复杂机器人操作任务的性能。
链接: https://arxiv.org/abs/2603.01465
作者: Yipeng Chen,Wentao Tan,Lei Zhu,Fengling Li,Jingjing Li,Guoli Yang,Heng Tao Shen
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing Vision-Language-Action (VLA) models often struggle to generalize to long-horizon tasks due to their heavy reliance on immediate observations. While recent studies incorporate retrieval mechanisms or extend context windows to handle procedural tasks, they often struggle to capture Non-Markovian dependencies, where optimal actions rely solely on specific past states rather than the current observation. To address this, we introduce Keyframe-Chaining VLA, a framework that extracts and links key historical frames to model long-horizon dependencies. Specifically, we propose an automatic keyframe selector that learns a discriminative embedding space, effectively identifying distinct state transitions. To capture task-critical information, we design a progress-aware query mechanism that dynamically retrieves historical frames based on their temporal relevance to the current execution phase. These selected keyframes are integrated into the VLA as interleaved visual tokens, explicitly grounding the policy in the long-horizon temporal context. Finally, we introduce a suite of four Non-Markovian manipulation tasks built upon the ManiSkill simulator to measure task success rates. Experimental results demonstrate that our method achieves superior performance, effectively tackling robot manipulation tasks characterized by long-horizon temporal dependencies. Code is available at this https URL.
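关键帧选择“识别显著状态转换”的基本思想,可用嵌入空间中的距离阈值玩具示例说明(论文中的嵌入空间由判别式学习得到,这里的阈值规则与数据均为示例假设):

```python
import numpy as np

def select_keyframes(embeddings, threshold=1.0):
    # Toy keyframe selector: flag a frame when its embedding jumps far
    # from the last selected keyframe, approximating a distinct state
    # transition (illustrative only).
    keyframes = [0]
    for i in range(1, len(embeddings)):
        if np.linalg.norm(embeddings[i] - embeddings[keyframes[-1]]) > threshold:
            keyframes.append(i)
    return keyframes

# Frames 0-2 are near-identical; frame 3 marks a clear state transition.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [2.0, 2.0], [2.1, 2.0]])
keys = select_keyframes(emb)
```

被选中的关键帧随后作为交错视觉标记送入 VLA,使策略显式依赖关键历史状态而非仅依赖当前观测。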
[AI-81] Scaling Tasks Not Samples: Mastering Humanoid Control through Multi-Task Model-Based Reinforcement Learning
【速读】:该论文旨在解决通用机器人在具身人工智能(Embodied AI)中掌握多样化技能的挑战,尤其针对传统方法依赖大规模离线数据集和模型参数扩展所带来的局限性。其核心问题在于:机器人学习需要主动交互,而单纯增加单任务样本量或模型规模难以实现高效泛化。解决方案的关键在于提出基于模型的强化学习(Model-Based Reinforcement Learning, MBRL)的多任务在线学习范式,通过任务多样性作为正则化手段,利用物理动态在不同任务间的不变性,构建共享的世界模型以聚合多任务经验,从而学习鲁棒且任务无关的表征。相比模型无关方法因梯度干扰导致性能下降,该方案显著提升了动态建模能力和样本效率。作者进一步提出了EfficientZero-Multitask(EZ-M)算法,在HumanoidBench基准上实现了优于现有方法的性能,验证了任务规模扩展是可扩展机器人学习的关键路径。
链接: https://arxiv.org/abs/2603.01452
作者: Shaohuai Liu,Weirui Ye,Yilun Du,Le Xie
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Developing generalist robots capable of mastering diverse skills remains a central challenge in embodied AI. While recent progress emphasizes scaling model parameters and offline datasets, such approaches are limited in robotics, where learning requires active interaction. We argue that effective online learning should scale the number of tasks, rather than the number of samples per task. This regime reveals a structural advantage of model-based reinforcement learning (MBRL). Because physical dynamics are invariant across tasks, a shared world model can aggregate multi-task experience to learn robust, task-agnostic representations. In contrast, model-free methods suffer from gradient interference when tasks demand conflicting actions in similar states. Task diversity therefore acts as a regularizer for MBRL, improving dynamics learning and sample efficiency. We instantiate this idea with EfficientZero-Multitask (EZ-M), a sample-efficient multi-task MBRL algorithm for online learning. Evaluated on HumanoidBench, a challenging whole-body control benchmark, EZ-M achieves state-of-the-art performance with significantly higher sample efficiency than strong baselines, without extreme parameter scaling. These results establish task scaling as a critical axis for scalable robotic learning. The project website is available at this https URL.
[AI-82] Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering
【速读】:该论文旨在解决链式思维(Chain-of-Thought, CoT)在大语言模型(Large Language Models, LLMs)中用于可解释性时的可信度问题,即模型所生成的推理过程是否真实反映了其决策机制。研究表明,指令微调后的模型往往在生成CoT之前就已经确定了最终答案,这使得CoT可能只是事后解释而非真实推理路径。解决方案的关键在于通过训练线性探测器(linear probes)分析残差流激活(residual stream activations)在CoT起始前最后一个token处的状态,发现这些激活方向不仅具有高度预测能力(AUC达0.9),而且具有因果效应:沿探测方向操纵激活可使模型答案翻转的比例超过50%,显著优于正交基线。这一机制验证了CoT的非忠实性,并揭示其在错误信念下可能导致两种失效模式:非蕴含(non-entailment)和虚构(confabulation)。
链接: https://arxiv.org/abs/2603.01437
作者: Kyle Cox,Darius Kianersi,Adrià Garriga-Alonso
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:As chain-of-thought (CoT) has become central to scaling reasoning capabilities in large language models (LLMs), it has also emerged as a promising tool for interpretability, suggesting the opportunity to understand model decisions through verbalized reasoning. However, the utility of CoT toward interpretability depends upon its faithfulness – whether the model’s stated reasoning reflects the underlying decision process. We provide mechanistic evidence that instruction-tuned models often determine their answer before generating CoT. Training linear probes on residual stream activations at the last token before CoT, we can predict the model’s final answer with 0.9 AUC on most tasks. We find that these directions are not only predictive, but also causal: steering activations along the probe direction flips model answers in over 50% of cases, significantly exceeding orthogonal baselines. When steering induces incorrect answers, we observe two distinct failure modes: non-entailment (stating correct premises but drawing unsupported conclusions) and confabulation (fabricating false premises). While post-hoc reasoning may be instrumentally useful when the model has a correct pre-CoT belief, these failure modes suggest it can result in undesirable behaviors when reasoning from a false belief.
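“线性探测 + 沿探测方向操纵激活”的流程可用合成数据快速示意:构造在某个隐藏方向上线性编码答案的“激活”,用类均值之差作探测方向,再演示沿该方向反向平移可翻转解码结果(数据与均值差探测均为示例假设,论文训练的是 AUC 约 0.9 的线性探测器):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "residual stream" activations at the last pre-CoT token:
# the eventual answer (0/1) is linearly encoded along a hidden direction.
d = 16
answer_dir = rng.standard_normal(d)
answer_dir /= np.linalg.norm(answer_dir)
labels = rng.integers(0, 2, size=200)
acts = rng.standard_normal((200, d)) * 0.1 + np.outer(2 * labels - 1, answer_dir)

# Mean-difference probe: one simple way to fit a linear probe direction.
probe = acts[labels == 1].mean(0) - acts[labels == 0].mean(0)
probe /= np.linalg.norm(probe)

# The probe predicts the eventual answer from pre-CoT activations...
acc = ((acts @ probe > 0).astype(int) == labels).mean()

# ...and "steering" an activation along -probe flips the decoded answer.
x = acts[labels == 1][0]
steered = x - 3.0 * probe
```

探测方向既有预测性又有因果性,正对应论文中“steering 翻转超过 50% 答案”的发现。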
[AI-83] Securing the Floor and Raising the Ceiling: A Merging-based Paradigm for Multi-modal Search Agents
【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在多模态搜索任务中面临的关键挑战:即依赖大规模监督轨迹或昂贵的强化学习训练,导致训练成本高、收敛不稳定,并存在严重的冷启动问题。为克服这一局限,作者提出了一种无需训练的范式,通过跨模态模型融合(cross-modal model merging)赋予基础VLM自主搜索能力。其解决方案的关键在于引入一种基于显著性感知的参数融合算法——最优大脑合并(Optimal Brain Merging, OBM),该方法利用少量校准样本识别对任务关键的参数,从而有效缓解跨模态整合过程中的参数干扰问题,实现无需额外多模态训练数据即可构建高性能多模态搜索代理的目标。
链接: https://arxiv.org/abs/2603.01416
作者: Zhixiang Wang,Jingxuan Xu,Dajun Chen,Yunfang Wu,Wei Jiang,Yong Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in Vision-Language Models (VLMs) have motivated the development of multi-modal search agents that can actively invoke external search tools and integrate retrieved evidence through multi-step reasoning. While promising, existing approaches typically rely on large-scale supervised trajectories or expensive reinforcement learning (RL), leading to high training cost, instability, and a severe cold-start problem for standard VLMs. We propose a training-free paradigm to empower VLMs with autonomous search capabilities via cross-modal model merging. By fusing a text-based search agent with a base VLM, we show that multi-modal search capabilities can be effectively composed without any additional multi-modal training data. To mitigate parameter interference during cross-modal integration, we introduce Optimal Brain Merging (OBM), a saliency-aware merging algorithm that identifies task-critical parameters based on their impact on model loss using only a small set of calibration samples. Extensive experiments on search-intensive benchmarks (e.g., InfoSeek, MMSearch) reveal that: (1) Model merging secures a reasonable performance floor as a zero-shot agent, with OBM achieving superior search rates; (2) OBM significantly raises the performance ceiling as a warm-start strategy, achieving faster convergence and higher peak accuracy than standard VLM initialization.
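显著性感知参数融合的骨架可示意如下:仅把施主模型中显著性最高的一部分参数嫁接进基础模型,其余保持不变以减少跨模态干扰(显著性数值与 keep_ratio 均为示例假设;OBM 实际基于少量校准样本上的损失影响来估计显著性):

```python
import numpy as np

def saliency_merge(base, donor, saliency, keep_ratio=0.5):
    # Illustrative saliency-aware merge: graft only the donor's most
    # task-critical parameters (highest saliency) into the base model.
    k = int(len(donor) * keep_ratio)
    top = np.argsort(saliency)[-k:]         # indices of task-critical params
    merged = base.copy()
    merged[top] = donor[top]
    return merged

base = np.zeros(6)                           # base VLM parameters (toy)
donor = np.ones(6)                           # text search agent parameters (toy)
saliency = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3])   # e.g. impact on loss
merged = saliency_merge(base, donor, saliency, keep_ratio=0.5)
```

只迁移关键参数、保留其余基座参数,正是缓解跨模态参数干扰的基本手段。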
[AI-84] GraphScout: Empowering Large Language Models with Intrinsic Exploration Ability for Agentic Graph Reasoning
【速读】:该论文旨在解决当前图增强生成(GraphRAG)方法中因依赖人工设计的引导策略和有限预定义工具而导致的知识图谱探索能力受限的问题。其核心解决方案是提出GraphScout——一个以训练为中心的智能体式图推理框架,通过引入更灵活的图探索工具,使模型能够自主与知识图谱交互并自动生成结构化训练数据,进而通过后训练(post-training)方式将代理式图推理能力内化至大语言模型(LLM)中,无需繁琐的人工标注或任务设计。此方法显著提升了模型在多领域知识图谱上的推理性能与跨域迁移能力。
链接: https://arxiv.org/abs/2603.01410
作者: Yuchen Ying,Weiqi Jiang,Tongya Zheng,Yu Wang,Shunyu Liu,Kaixuan Chen,Mingli Song
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Knowledge graphs provide structured and reliable information for many real-world applications, motivating increasing interest in combining large language models (LLMs) with graph-based retrieval to improve factual grounding. Recent Graph-based Retrieval-Augmented Generation (GraphRAG) methods therefore introduce iterative interaction between LLMs and knowledge graphs to enhance reasoning capability. However, existing approaches typically depend on manually designed guidance and interact with knowledge graphs through a limited set of predefined tools, which substantially constrains graph exploration. To address these limitations, we propose GraphScout, a training-centric agentic graph reasoning framework equipped with more flexible graph exploration tools. GraphScout enables models to autonomously interact with knowledge graphs to synthesize structured training data which are then used to post-train LLMs, thereby internalizing agentic graph reasoning ability without laborious manual annotation or task curation. Extensive experiments across five knowledge-graph domains show that a small model (e.g., Qwen3-4B) augmented with GraphScout outperforms baseline methods built on leading LLMs (e.g., Qwen-Max) by an average of 16.7% while requiring significantly fewer inference tokens. Moreover, GraphScout exhibits robust cross-domain transfer performance. Our code will be made publicly available at this https URL.
[AI-85] MIST-RL: Mutation-based Incremental Suite Testing via Reinforcement Learning
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在首次生成代码时频繁出错的问题,现有验证方法依赖大量测试用例进行“数量驱动型”(scaling-by-quantity)的故障检测,导致测试冗余严重且边际收益递减。其解决方案的关键在于提出MIST-RL框架,通过“效用驱动型”(scaling-by-utility)策略重构测试生成过程:将测试用例生成建模为基于组相对策略优化(Group Relative Policy Optimization, GRPO)的序贯决策问题,并引入一种新颖的增量变异奖励机制与动态惩罚项,激励模型持续发现新故障的同时抑制功能等价断言的重复生成。实验表明,该方法在HumanEval+和MBPP+数据集上显著提升变异分数(+28.5%),同时减少测试用例数量19.3%,并进一步提升下游代码重排序准确率(+3.05%)。
链接: https://arxiv.org/abs/2603.01409
作者: Sicheng Zhu,Jiajun Wang,Jiawei Ai,Xin Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: Preprint. 17 pages
Abstract:Large Language Models (LLMs) often fail to generate correct code on the first attempt, which requires using generated unit tests as verifiers to validate the solutions. Despite the success of recent verification methods, they remain constrained by a “scaling-by-quantity” paradigm. This brute-force approach suffers from a critical limitation: it yields diminishing returns in fault detection while causing severe test redundancy. To address this, we propose MIST-RL (Mutation-based Incremental Suite Testing via Reinforcement Learning), a framework that shifts the focus to “scaling-by-utility”. We formulate test generation as a sequential decision process optimized via Group Relative Policy Optimization (GRPO). Specifically, we introduce a novel incremental mutation reward combined with dynamic penalties, which incentivizes the model to discover new faults while it suppresses functionally equivalent assertions. Experiments on HumanEval+ and MBPP+ demonstrate that MIST-RL outperforms state-of-the-art baselines. It achieves a +28.5% higher mutation score while reducing the number of test cases by 19.3%. Furthermore, we show that these compact, high-utility tests serve as superior verifiers, which improves downstream code reranking accuracy on HumanEval+ by 3.05% over the SOTA baseline with 10 candidate samples. The source code and data are provided in the supplementary material.
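增量变异奖励的形态可用如下玩具函数示意:每个新测试只对“此前未被杀死的变异体”计分,没有新增则受罚,从而抑制功能等价的冗余断言(奖励的具体形式为示例假设,并非论文原公式):

```python
def incremental_mutation_reward(killed_per_test, redundancy_penalty=0.5):
    # Illustrative sketch of an incremental mutation reward: credit only
    # newly killed mutants; penalize tests that add nothing new.
    covered, rewards = set(), []
    for killed in killed_per_test:
        new = set(killed) - covered
        rewards.append(len(new) if new else -redundancy_penalty)
        covered |= set(killed)
    return rewards, covered

# Three generated tests: the third kills only already-covered mutants.
rewards, covered = incremental_mutation_reward([
    {"m1", "m2"},
    {"m2", "m3"},
    {"m1", "m3"},
])
```

这种按“新增故障发现”计分的设计,正是摘要中“提高变异分数同时减少测试数量”的直观来源。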
[AI-86] HarmonyCell: Automating Single-Cell Perturbation Modeling under Semantic and Distribution Shifts
【速读】:该论文旨在解决单细胞扰动研究中面临的双重异质性瓶颈:一是语义异质性(semantic heterogeneity),即相同生物学概念在不同数据集间因元数据模式不兼容而难以整合;二是统计异质性(statistical heterogeneity),即由于生物变异导致的数据分布偏移,需依赖特定数据集的归纳偏置(inductive bias)。其解决方案的关键在于提出HarmonyCell框架,该框架采用双轨协同机制:首先通过大语言模型(LLM)驱动的语义统一器(Semantic Unifier)自动映射异构元数据至标准接口,实现无需人工干预的语义对齐;其次利用自适应蒙特卡洛树搜索(Monte Carlo Tree Search)引擎,在分层动作空间中搜索最优神经架构,以适配分布偏移并引入合适的统计归纳偏置。此设计实现了跨数据集的可扩展自动虚拟细胞建模,无需针对每个数据集进行专门工程开发。
链接: https://arxiv.org/abs/2603.01396
作者: Wenxuan Huang,Mingyu Tsoi,Yanhao Huang,Xinjie Mao,Xue Xia,Hao Wu,Jiaqi Wei,Yuejin Yang,Lang Yu,Cheng Tan,Xiang Zhang,Zhangyang Gao,Siqi Sun
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Quantitative Methods (q-bio.QM)
备注: 18 pages total (8 pages main text + appendix), 6 figures
Abstract:Single-cell perturbation studies face dual heterogeneity bottlenecks: (i) semantic heterogeneity–identical biological concepts encoded under incompatible metadata schemas across datasets; and (ii) statistical heterogeneity–distribution shifts from biological variation demanding dataset-specific inductive biases. We propose HarmonyCell, an end-to-end agent framework resolving each challenge through a dedicated mechanism: an LLM-driven Semantic Unifier autonomously maps disparate metadata into a canonical interface without manual intervention; and an adaptive Monte Carlo Tree Search engine operates over a hierarchical action space to synthesize architectures with optimal statistical inductive biases for distribution shifts. Evaluated across diverse perturbation tasks under both semantic and distribution shifts, HarmonyCell achieves a 95% valid execution rate on heterogeneous input datasets (versus 0% for general agents) while matching or even exceeding expert-designed baselines in rigorous out-of-distribution evaluations. This dual-track orchestration enables scalable automatic virtual cell modeling without dataset-specific engineering.
[AI-87] Words Weights: Streamlining Multi-Turn Interactions via Co-Adaptation
【速读】:该论文旨在解决多轮交互中大型语言模型(Large Language Models, LLMs)在推理阶段难以动态适应用户需求的问题,即测试时策略适配(Test-time Policy Adaptation for Multi-turn Interactions, T2PAM)问题。现有方法通常将这一过程视为单一维度的优化问题,仅通过提示工程(Prompt Engineering)调整文本指令或仅通过测试时训练(Test-Time Training)更新模型参数,忽略了交互失败本质上源于意图模糊(ambiguity)与能力不足(incapacity)的耦合效应。解决方案的关键在于提出ROSA2框架,将交互建模为词(Words)与权重(Weights)异质空间上的联合优化问题:通过数学分解误差信号,利用文本梯度纠正意图模糊,同时通过参数更新弥补能力缺口,实现语义清晰度与模型能力的协同提升。理论证明该共适应机制可严格减少收敛所需的参数偏移量,实验证明其在MATH基准上性能优于当前最优基线30%,且交互轮次减少40%。
链接: https://arxiv.org/abs/2603.01375
作者: Chenxing Wei,Hong Wang,Ying He,Zhongxiang Dai,Bo Jiang,F. Richard Yu,Yao Shu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Test-time policy adaptation for multi-turn interactions (T2PAM) is essential for aligning Large Language Models (LLMs) with dynamic user needs during inference time. However, existing paradigms commonly treat test-time adaptation as a single-axis problem, either purely refining instructions (Prompt Engineering) or only adjusting weights (Test-Time Training), ignoring that interaction failures stem from a coupled mix of ambiguity and incapacity. We argue that these two optimization paths are not merely additive but synergistic: semantic clarity acts as a pre-conditioner for effective parameter updates. To this end, we propose ROSA2, a framework that reformulates interaction as a joint optimization problem over the heterogeneous space of Words and Weights. By mathematically decomposing the error signal, ROSA2 utilizes textual gradients to rectify intent ambiguity and parameter updates to bridge capability gaps. Theoretically, we prove that this co-adaptation strictly reduces the required parameter shift for convergence. Empirically, ROSA2 outperforms state-of-the-art baselines by 30% on MATH while reducing interaction turns by 40%, demonstrating that refining the context unlocks the true potential of parameter updates.
[AI-88] Causal Neural Probabilistic Circuits
【速读】:该论文旨在解决传统概念瓶颈模型(Concept Bottleneck Models, CBMs)在实施干预时忽略概念间因果依赖关系的问题。典型CBMs通过直接覆盖错误的概念预测值来修正结果,但这种方法未考虑概念之间的因果结构,可能导致不准确的推理。其解决方案的关键在于提出因果神经概率电路(Causal Neural Probabilistic Circuit, CNPC),该模型将神经属性预测器与基于因果图编译的概率电路相结合,支持精确且可 tractable(易处理)的因果推断,从而自然地保留概念间的因果依赖关系。CNPC利用专家知识对概念进行干预后,通过一个Product of Experts(PoE)机制融合属性预测器的预测分布与由概率电路计算出的干预边缘分布,实现更准确的类别分布建模。理论分析表明,在特定条件下,CNPC能够逼近真实干预下的类别分布,实验验证了其在多个基准数据集上的优越性能。
链接: https://arxiv.org/abs/2603.01372
作者: Weixin Chen,Han Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Concept Bottleneck Models (CBMs) enhance the interpretability of end-to-end neural networks by introducing a layer of concepts and predicting the class label from the concept predictions. A key property of CBMs is that they support interventions, i.e., domain experts can correct mispredicted concept values at test time to improve the final accuracy. However, typical CBMs apply interventions by overwriting only the corrected concept while leaving other concept predictions unchanged, which ignores causal dependencies among concepts. To address this, we propose the Causal Neural Probabilistic Circuit (CNPC), which combines a neural attribute predictor with a causal probabilistic circuit compiled from a causal graph. This circuit supports exact, tractable causal inference that inherently respects causal dependencies. Under interventions, CNPC models the class distribution based on a Product of Experts (PoE) that fuses the attribute predictor’s predictive distribution with the interventional marginals computed by the circuit. We theoretically characterize the compositional interventional error of CNPC w.r.t. its modules and identify conditions under which CNPC closely matches the ground-truth interventional class distribution. Experiments on five benchmark datasets in both in-distribution and out-of-distribution settings show that, compared with five baseline models, CNPC achieves higher task accuracy across different numbers of intervened attributes.
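摘要中 Product of Experts 融合两个分布的计算本身很直接:逐类别相乘后归一化。下列数值仅为示例:

```python
import numpy as np

def product_of_experts(p_predictor, p_circuit):
    # Fuse the attribute predictor's class distribution with the
    # circuit's interventional marginal via a normalized product.
    fused = np.asarray(p_predictor) * np.asarray(p_circuit)
    return fused / fused.sum()

# The neural predictor is unsure; the causal circuit, after an expert
# intervention on a concept, rules out class 2 entirely.
p_nn = [0.4, 0.3, 0.3]
p_circ = [0.5, 0.5, 0.0]
p_class = product_of_experts(p_nn, p_circ)
```

任一专家赋零概率的类别在融合后仍为零,因此电路侧的因果约束能直接作用于最终预测。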
[AI-89] Align and Filter: Improving Performance in Asynchronous On-Policy RL
【速读】:该论文旨在解决分布式训练和高频梯度更新加剧的策略滞后(policy lag)问题,即数据生成行为策略与学习策略之间的不匹配,这会限制在线策略强化学习算法在更大规模任务中的扩展性。解决方案的关键在于提出一种基于总变差(Total Variation)的Advantage对齐约束策略优化方法(Total Variation-based Advantage Aligned Constrained Policy Optimization),通过显式建模和约束策略分布差异来缓解策略滞后,从而提升算法在经典强化学习任务及大语言模型数学推理等现代任务中的鲁棒性。
链接: https://arxiv.org/abs/2603.01365
作者: Homayoun Honari,Roger Creus Castanyer,Michael Przystupa,Michael Noukhovitch,Pablo Samuel Castro,Glen Berseth
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
备注:
Abstract:Distributed training and increasing the gradient update frequency are practical strategies to accelerate learning and improve performance, but both exacerbate a central challenge: policy lag, which is the mismatch between the behavior policy generating data and the learning policy being updated. Policy lag can hinder the scaling of on-policy learning algorithms to larger problems. In this paper, we identify the sources of policy lag caused by distributed learning and high update frequency. We use the findings to propose total Variation-based Advantage aligned Constrained policy Optimization as a practical approach to mitigate policy lag. We empirically validate our method and show that it offers better robustness to policy lag in classic RL tasks and a modern RL for LLM math reasoning task.
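摘要中用来刻画行为策略与学习策略错配程度的总变差(total variation)距离,对离散动作空间可以这样计算(策略分布为假设数值,仅作示意):

```python
def total_variation(p, q):
    """Total variation distance between two discrete policies
    over the same action set: 0.5 * sum |p(a) - q(a)|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

behavior = [0.5, 0.3, 0.2]   # 生成数据的(陈旧)行为策略
learner  = [0.4, 0.4, 0.2]   # 当前正在更新的学习策略
lag = total_variation(behavior, learner)
```

策略滞后越严重,这个距离越大;约束该距离即是此类方法缓解滞后的基本思路。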
[AI-90] ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context
【速读】:该论文旨在解决当前人工智能助手在处理复杂个人上下文、多工具协同和多步推理任务时的性能瓶颈问题。现有基准测试大多为静态、单轮交互,无法真实反映智能体在动态演化的生活场景中持续理解与规划的能力。其解决方案的关键在于提出ASTRA-bench——一个融合时间演化的个人上下文、交互式工具箱及复杂用户意图的新型评估基准,通过事件驱动的生成流程构建2,413个高保真场景,涵盖参考性、功能性与信息复杂度标注,从而揭示当前先进模型(如Claude-4.5-Opus、DeepSeek-V3.2)在高复杂度条件下显著性能下降的问题,尤其暴露了工具调用参数生成(argument generation)作为主要瓶颈的局限性,为开发真正具备情境感知能力的AI助手提供了可诊断的测试平台。
链接: https://arxiv.org/abs/2603.01357
作者: Zidi Xiu,David Q. Sun,Kevin Cheng,Maitrik Patel,Josh Date,Yizhe Zhang,Jiarui Lu,Omar Attia,Raviteja Vemulapalli,Oncel Tuzel,Meng Cao,Samy Bengio
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Next-generation AI must manage vast personal data, diverse tools, and multi-step reasoning, yet most benchmarks remain context-free and single-turn. We present ASTRA-bench (Assistant Skills in Tool-use, Reasoning & Action-planning), a benchmark that uniquely unifies time-evolving personal context with an interactive toolbox and complex user intents. Our event-driven pipeline generates 2,413 scenarios across four protagonists, grounded in longitudinal life events and annotated by referential, functional, and informational complexity. Evaluation of state-of-the-art models (e.g., Claude-4.5-Opus, DeepSeek-V3.2) reveals significant performance degradation under high-complexity conditions, with argument generation emerging as the primary bottleneck. These findings expose critical limitations in current agents’ ability to ground reasoning within messy personal context and orchestrate reliable multi-step plans. We release ASTRA-bench with a full execution environment and evaluation scripts to provide a diagnostic testbed for developing truly context-aware AI assistants.
[AI-91] UTICA: Multi-Objective Self-Distillation Foundation Model Pretraining for Time Series Classification
【速读】:该论文旨在解决时间序列领域中自监督基础模型预训练方法的局限性,尤其是非对比学习(non-contrastive)方法在该领域的应用尚未充分探索的问题。其解决方案的关键在于引入一种基于DINOv2风格的自蒸馏(self-distillation)机制,结合Mantis分词器与Transformer编码器架构,在学生-教师框架下实现双重表征学习:通过数据增强后的片段(augmented crops)捕捉时间不变性(temporal invariance),并通过掩码补丁(patch masking)保留局部精细结构(fine-grained local structure)。该方法在UCR和UEA基准上实现了最先进的分类性能,验证了非对比学习作为时间序列基础模型预训练策略的有效性和互补性。
链接: https://arxiv.org/abs/2603.01348
作者: Yessin Moakher,Youssef Attia El Hili,Vasilii Feofanov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Self-supervised foundation models have achieved remarkable success across domains, including time series. However, the potential of non-contrastive methods, a paradigm that has driven significant advances in computer vision, remains underexplored for time series. In this work, we adapt DINOv2-style self-distillation to pretrain a time series foundation model, building on the Mantis tokenizer and transformer encoder architecture as our backbone. Through a student-teacher framework, our method Utica learns representations that capture both temporal invariance via augmented crops and fine-grained local structure via patch masking. Our approach achieves state-of-the-art classification performance on both UCR and UEA benchmarks. These results suggest that non-contrastive methods are a promising and complementary pretraining strategy for time series foundation models.
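学生-教师自蒸馏框架中,教师参数通常不接收梯度,而是按学生参数的指数滑动平均(EMA)更新;下面是一个与 DINO 风格一致的极简示意(动量值与参数均为假设,非论文官方实现):

```python
def ema_update(teacher, student, momentum=0.99):
    """DINO-style teacher update: exponential moving average of the
    student's weights; the teacher receives no gradient updates."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]

# 玩具参数向量:教师缓慢向学生靠拢
teacher = [0.0, 1.0]
student = [1.0, 0.0]
teacher = ema_update(teacher, student, momentum=0.9)
```

动量越接近 1,教师越平滑、越滞后于学生,这是此类自蒸馏方法稳定训练的关键超参数之一。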
[AI-92] SubstratumGraphEnv: Reinforcement Learning Environment (RLE) for Modeling System Attack Paths AAAI-26
【速读】:该论文旨在解决网络安全性分析中潜在攻击路径识别的自动化难题,尤其是传统人工智能(AI)技术难以有效建模系统事件的顺序性、关联性和演化特性。其解决方案的关键在于构建一个基于强化学习(Reinforcement Learning, RL)的图结构环境生成框架,通过解析开源系统监控(Sysmon)日志提取父-子进程关系,并利用图卷积网络(Graph Convolutional Networks, GCNs)对操作系统状态及其转移进行动态建模,从而将序列化的用户与系统事件转化为深度强化学习(Deep Reinforcement Learning, DRL)可处理的图表示。该方法以Gymnasium环境(SubstratumGraphEnv)和自定义PyTorch接口(SubstratumBridge)为基础,实现从原始日志到DRL观测值与离散动作的自动映射,最终通过优势Actor-Critic(A2C)模型的策略与价值分支完成对恶意行为的识别与决策,为网络安全领域的自主智能分析提供了可扩展的图结构RL范式。
链接: https://arxiv.org/abs/2603.01340
作者: Bahirah Adewunmi,Edward Raff,Sanjay Purushotham
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Presented at the AI for Cyber Security Workshop at AAAI-26
Abstract:Automating network security analysis, particularly the identification of potential attack paths, presents significant challenges, due in part to the sequential, interconnected, and evolutionary nature of system events, which most artificial intelligence (AI) techniques struggle to model effectively. This paper proposes a Reinforcement Learning (RL) environment generation framework that simulates the sequence of processes executed on a Windows operating system, enabling dynamic modeling of malicious processes on a system. This methodology models operating system state and transitions using a graph representation. This graph is derived from open-source System Monitor (Sysmon) logs. To address the variety in system event types, fields, and log formats, a mechanism was developed to capture and model parent-child processes from Sysmon logs. A Gymnasium environment (SubstratumGraphEnv) was constructed to establish the perceptible basis for an RL environment, and a customized PyTorch interface was also built (SubstratumBridge) to translate Gymnasium graphs into Deep Reinforcement Learning (DRL) observations and discrete actions. Graph Convolutional Networks (GCNs) concretize the graph’s local and global state, which feed the distinct policy and critic heads of an Advantage Actor-Critic (A2C) model. This work’s central contribution lies in the design of a novel deep graphical RL environment that automates translation of sequential user and system events, furnishing crucial context for cybersecurity analysis. This work provides a foundation for future research into shaping training parameters and advanced reward shaping, while also offering insight into which system event attributes are critical to training autonomous RL agents.
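从 Sysmon 日志中提取父-子进程边的思路可以用如下片段示意(Event ID 1 确为 Sysmon 的进程创建事件,Image/ParentImage 也是真实字段名,但记录内容为虚构示例,并非论文官方实现):

```python
def build_process_graph(events):
    """Build a parent->child process edge list from simplified
    Sysmon process-creation records."""
    edges = []
    for e in events:
        if e.get("EventID") == 1:  # Sysmon Event ID 1: process creation
            edges.append((e["ParentImage"], e["Image"]))
    return edges

events = [
    {"EventID": 1, "ParentImage": "explorer.exe", "Image": "cmd.exe"},
    {"EventID": 1, "ParentImage": "cmd.exe", "Image": "powershell.exe"},
    {"EventID": 3, "Image": "powershell.exe"},  # 网络连接事件,此处忽略
]
edges = build_process_graph(events)
```

得到的边列表即可作为图表示的骨架,交给 GCN 之类的模型编码局部与全局状态。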
[AI-93] Provable and Practical In-Context Policy Optimization for Self-Improvement ICLR2026
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理阶段通过多轮自我反思(multi-round self-reflection)提升答案质量的问题,即测试时扩展(test-time scaling)的机制与实现方法。其解决方案的关键在于提出一种上下文策略优化(In-Context Policy Optimization, ICPO)框架,该框架允许模型在不更新参数的前提下,利用自评或外部观测的奖励信号,在推理过程中动态优化响应。理论层面,作者证明了在特定Fisher加权对数匹配目标下预训练的单层线性自注意力模型可精确模拟线性Bandit问题上的策略优化算法;实践上进一步设计了最小熵ICPO(Minimum-Entropy ICPO, ME-ICPO),通过选择低熵响应及其奖励进行多数投票以增强自评奖励的鲁棒性,从而在数学推理任务中实现媲美顶尖水平的性能,同时保持较低的推理成本。
链接: https://arxiv.org/abs/2603.01335
作者: Tianrun Yu,Yuxiao Yang,Zhaoyang Wang,Kaixiang Zhao,Porter Jenkins,Xuchao Zhang,Chetan Bansal,Huaxiu Yao,Weitong Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 34 pages, 8 tables, 4 figures, Accepted by ICLR 2026
Abstract:We study test-time scaling, where a model improves its answer through multi-round self-reflection at inference. We introduce In-Context Policy Optimization (ICPO), in which an agent optimizes its response in context using self-assessed or externally observed rewards without modifying its parameters. To explain this ICPO process, we theoretically show that with sufficient pretraining under a novel Fisher-weighted logit-matching objective, a single-layer linear self-attention model can provably imitate a policy-optimization algorithm for linear bandits. Building on this theory, we propose Minimum-Entropy ICPO (ME-ICPO), a practical algorithm that iteratively uses its response and self-assessed reward to refine its response in-context at inference time. By selecting the responses and their rewards with minimum entropy, ME-ICPO ensures the robustness of the self-assessed rewards via majority voting. Across standard mathematical reasoning tasks, ME-ICPO attains competitive, top-tier performance while keeping inference costs affordable compared with other inference-time algorithms. Overall, ICPO provides a principled understanding of self-reflection in LLMs and yields practical benefits for test-time scaling for mathematical reasoning.
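ME-ICPO 中"选取低熵(高置信)响应再做多数投票"的思路可以示意如下(候选响应与奖励分布均为假设数值,非论文官方实现):

```python
import math
from collections import Counter

def entropy(dist):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def min_entropy_vote(candidates, k=2):
    """Keep the k candidate answers whose self-assessed reward
    distributions have the lowest entropy, then majority-vote."""
    ranked = sorted(candidates, key=lambda c: entropy(c["reward_dist"]))
    kept = ranked[:k]
    counts = Counter(c["answer"] for c in kept)
    return counts.most_common(1)[0][0]

candidates = [
    {"answer": "42", "reward_dist": [0.9, 0.1]},  # 置信,低熵
    {"answer": "42", "reward_dist": [0.8, 0.2]},  # 置信,低熵
    {"answer": "7",  "reward_dist": [0.5, 0.5]},  # 最大不确定,被过滤
]
best = min_entropy_vote(candidates)
```

熵过滤先剔除自评最不确定的候选,再投票,这样自评奖励中的噪声不易左右最终答案。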
[AI-94] Theoretical Perspectives on Data Quality and Synergistic Effects in Pre- and Post-Training Reasoning Models
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在预训练与后训练阶段对数据规模和质量需求差异的机制问题,具体包括:为何预训练和强化学习(Reinforcement Learning, RL)依赖大规模数据,为何监督微调(Supervised Fine-Tuning, SFT)在小样本高质量数据下表现更优,以及如何定义SFT数据的“高质量”。其解决方案的关键在于通过理论分析Transformer模型在上下文权重预测任务中学习线性回归的过程,揭示出两个核心机制:(i) 平衡的预训练数据可诱导潜在能力,这些能力在后训练阶段被激活;(ii) SFT最优效果来自对预训练模型具有挑战性的少量样本,而过大的SFT数据集可能稀释预训练信号;相反,RL则在大规模且难度适中的数据上最有效。实验验证了上述理论发现于大型非线性Transformer架构中的适用性。
链接: https://arxiv.org/abs/2603.01293
作者: Adel Javanmard,Baharan Mirzasoleiman,Vahab Mirrokni
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 35 pages, 5 figures
Abstract:Large Language Models (LLMs) are pretrained on massive datasets and later instruction-tuned via supervised fine-tuning (SFT) or reinforcement learning (RL). Best practices emphasize large, diverse pretraining data, whereas post-training operates differently: SFT relies on smaller, high-quality datasets, while RL benefits more from scale, with larger amounts of feedback often outweighing label quality. Yet it remains unclear why pretraining and RL require large datasets, why SFT excels on smaller ones, and what defines high-quality SFT data. In this work, we theoretically analyze transformers trained on an in-context weight prediction task for linear regression. Our analysis reveals several key findings: (i) balanced pretraining data can induce latent capabilities later activated during post-training, and (ii) SFT learns best from a small set of examples challenging for the pretrained model, while excessively large SFT datasets may dilute informative pretraining signals. In contrast, RL is most effective on large-scale data that is not overly difficult for the pretrained model. We validate these theoretical insights with experiments on large nonlinear transformer architectures.
[AI-95] Integrating LTL Constraints into PPO for Safe Reinforcement Learning
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在实际应用中因缺乏严格安全约束而导致的安全性问题,特别是在机器人控制等场景下,如何确保智能体的行为满足复杂的时序安全规范。解决方案的关键在于提出了一种结合线性时序逻辑(Linear Temporal Logic, LTL)约束的近端策略优化方法(Proximal Policy Optimization with Linear Temporal Logic Constraints, PPO-LTL),通过限制性确定性Büchi自动机(limit-deterministic Büchi automata)对LTL约束进行实时监测,并利用逻辑到代价(logic-to-cost)机制将违反行为转化为惩罚信号,再借助拉格朗日(Lagrangian)框架将这些信号融入策略优化过程,从而在保证性能的同时显著降低安全违规概率。
链接: https://arxiv.org/abs/2603.01292
作者: Maifang Zhang,Hang Yu,Qian Zuo,Cheng Wang,Vaishak Belle,Fengxiang He
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Robotics (cs.RO)
备注:
Abstract:This paper proposes Proximal Policy Optimization with Linear Temporal Logic Constraints (PPO-LTL), a framework that integrates safety constraints written in LTL into PPO for safe reinforcement learning. LTL constraints offer rigorous representations of complex safety requirements, such as regulations that broadly exist in robotics, enabling systematic monitoring of safety requirements. Violations against LTL constraints are monitored by limit-deterministic Büchi automata, and then translated by a logic-to-cost mechanism into penalty signals. The signals are further employed for guiding the policy optimization via the Lagrangian scheme. Extensive experiments on the Zones and CARLA environments show that our PPO-LTL can consistently reduce safety violations, while maintaining competitive performance, against the state-of-the-art methods. The code is at this https URL.
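拉格朗日(Lagrangian)方案中,惩罚系数按约束违反程度做对偶上升更新:违反代价超过阈值则加大惩罚,否则衰减并投影到非负区间。下面是一个极简示意(步长与代价数值均为假设,非论文官方实现):

```python
def dual_ascent(lmbda, avg_cost, cost_limit, lr=0.1):
    """Lagrangian dual ascent: raise the penalty weight when the
    observed violation cost exceeds the limit, decay it otherwise,
    projecting back onto lambda >= 0."""
    return max(0.0, lmbda + lr * (avg_cost - cost_limit))

lmbda = 0.0
# 自动机逐回合给出的 LTL 违反代价(假设数值)
for avg_cost in [0.5, 0.5, 0.1]:
    lmbda = dual_ascent(lmbda, avg_cost, cost_limit=0.2)
```

策略优化时以 reward - lmbda * cost 为目标,惩罚权重便随安全违反情况自适应调节。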
[AI-96] Opponent State Inference Under Partial Observability: An HMM-POMDP Framework for 2026 Formula 1 Energy Strategy
【速读】:该论文旨在解决2026年一级方程式(Formula 1)技术规则下能量策略优化的复杂性问题,即在50/50内燃机与电池功率分配、无限再生能力及车手可控制的Override Mode(MOM)机制下,最优能量部署不仅取决于本车状态,还依赖于对手车辆的隐藏状态(如ERS充电水平、MOM状态和轮胎磨损程度),从而构成一个部分可观测随机博弈(Partially Observable Stochastic Game),无法通过单智能体优化方法求解。解决方案的关键在于提出一个两层可计算框架:第一层为30状态隐马尔可夫模型(Hidden Markov Model, HMM),基于五种公开遥测信号推断对手车辆的隐状态概率分布;第二层为深度Q网络(Deep Q-Network, DQN)策略,以HMM信念状态为输入选择能量部署策略。该框架能够识别并应对“反向回收陷阱”(counter-harvest trap)——一种通过抑制可观测能量释放信号诱导对手错误攻击的欺骗策略,其检测依赖于信念状态推理而非简单的阈值反应规则。
链接: https://arxiv.org/abs/2603.01290
作者: Kalliopi Kleisarchaki
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: 17 pages. Pre-registered theoretical framework; empirical calibration on 2026 race telemetry begins Australian Grand Prix, 8 March 2026. Paper 1 of 3. ResearchGate preprint: DOI https://doi.org/10.13140/RG.2.2.16034.08644
Abstract:The 2026 Formula 1 technical regulations introduce a fundamental change to energy strategy: under a 50/50 internal combustion engine / battery power split with unlimited regeneration and a driver-controlled Override Mode (abbreviated MOM throughout), the optimal energy deployment policy depends not only on a driver’s own state but on the hidden state of rival cars. This creates a Partially Observable Stochastic Game that cannot be solved by single-agent optimisation methods. We present a tractable two-layer inference and decision framework. The first layer is a 30-state Hidden Markov Model (HMM) that infers a probability distribution over each rival’s ERS charge level, Override Mode status, and tyre degradation state from five publicly observable telemetry signals. The second layer is a Deep Q-Network (DQN) policy that takes the HMM belief state as input and selects between energy deployment strategies. We formally characterise the counter-harvest trap – a deceptive strategy in which a car deliberately suppresses observable deployment signals to induce a rival into a failed attack – and show that detecting it requires belief-state inference rather than reactive threshold rules. On synthetic races generated from the model’s own assumptions, the HMM achieves 92.3% ERS inference accuracy (random baseline: 33.3%) and detects counter-harvest trap conditions with 95.7% recall. Pre-registration – empirical validation begins Australian Grand Prix, 8 March 2026.
[AI-97] Information-Theoretic Framework for Self-Adapting Model Predictive Controllers
【速读】:该论文旨在解决传统模型预测控制(Model Predictive Control, MPC)在面对动态障碍物和系统动力学变化时适应性不足的问题,其核心缺陷在于缺乏自我监控与自适应优化机制。解决方案的关键在于提出一种基于信息论的框架——纠缠学习(Entanglement Learning, EL),通过构建信息数字孪生(Information Digital Twin, IDT)来量化MPC输入、控制动作与无人机行为之间的信息流(以比特为单位),并引入新的纠缠度量指标(entanglement metrics)来追踪这些依赖关系的变化。该方法利用互信息衡量优化器输入、控制动作与无人机动力学之间的关联,从而实现对性能偏差的实时检测,并生成自适应信号以调整MPC参数,确保系统稳定性;相较传统基于误差反馈的方式,此方案采用双反馈机制,借助信息流实现对环境变化的主动响应,显著提升了MPC的鲁棒性和可靠性。
链接: https://arxiv.org/abs/2603.01286
作者: Wael Hafez,Amir Nazeri
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 9 pages, 5 figures
Abstract:Model Predictive Control (MPC) is a vital technique for autonomous systems, like Unmanned Aerial Vehicles (UAVs), enabling optimized motion planning. However, traditional MPC struggles to adapt to real-time changes such as dynamic obstacles and shifting system dynamics, lacking inherent mechanisms for self-monitoring and adaptive optimization. Here, we introduce Entanglement Learning (EL), an information-theoretic framework that enhances MPC adaptability through an Information Digital Twin (IDT). The IDT monitors and quantifies, in bits, the information flow between MPC inputs, control actions, and UAV behavior. By introducing new information-theoretic metrics we call entanglement metrics, it tracks variations in these dependencies. These metrics measure the mutual information between the optimizer’s input, its control actions, and the resulting UAV dynamics, enabling a deeper understanding of their interrelationships. This allows the IDT to detect performance deviations and generate real-time adaptive signals to recalibrate MPC parameters, preserving stability. Unlike traditional MPC, which relies on error-based feedback, this dual-feedback approach leverages information flow for proactive adaptation to evolving conditions. Scalable and leveraging existing infrastructure, this framework improves MPC reliability and robustness across diverse scenarios, extending beyond UAV control to any MPC implementation requiring adaptive performance.
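纠缠度量的核心是互信息;对离散化后的联合分布,以比特为单位的互信息可以按定义直接计算(示例分布为假设数值,仅作示意,非论文官方实现):

```python
import math

def mutual_information(joint):
    """Mutual information (in bits) of a discrete joint distribution
    given as a 2-D table of probabilities."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

# 完全耦合的 输入-动作 对携带 1 比特;相互独立则为 0 比特
coupled     = [[0.5, 0.0], [0.0, 0.5]]
independent = [[0.25, 0.25], [0.25, 0.25]]
```

IDT 正是靠追踪此类依赖量的变化来探测 MPC 输入、控制动作与无人机动力学之间关系的漂移。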
[AI-98] Beyond Reward: A Bounded Measure of Agent Environment Coupling
【速读】:该论文旨在解决现实世界中强化学习(Reinforcement Learning, RL)代理在闭环系统中因分布偏移导致的可靠部署难题,现有监控方法依赖奖励或任务指标,仅能捕捉结果而无法识别早期交互失效。其解决方案的关键在于提出双可预测性(Bipredictability, P)——即观测、动作、结果回路中共享信息与总可用信息的比值,这是一种具有理论保证、可实时计算且跨任务可比的交互有效性度量;同时设计了信息数字孪生(Information Digital Twin, IDT)作为辅助监测器,从交互流中实时计算P及其诊断组件。实验表明,IDT在多种扰动下检测准确率提升至89.3%(对比奖励基监控44.0%),且延迟降低4.4倍,证明了双可预测性能够提前发现交互退化,为部署中的RL系统提供闭环自调节的前提信号。
链接: https://arxiv.org/abs/2603.01283
作者: Wael Hafez,Cameron Reid,Amit Nazeri
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 2 figures
Abstract:Real-world reinforcement learning (RL) agents operate in closed-loop systems where actions shape future observations, making reliable deployment under distribution shifts a persistent challenge. Existing monitoring relies on reward or task metrics, capturing outcomes but missing early coupling failures. We introduce bipredictability (P) as the ratio of shared information in the observation-action-outcome loop to the total available information, a principled, real-time measure of interaction effectiveness with provable bounds, comparable across tasks. An auxiliary monitor, the Information Digital Twin (IDT), computes P and its diagnostic components from the interaction stream. We evaluate SAC and PPO agents on MuJoCo HalfCheetah under eight agent- and environment-side perturbations across 168 trials. Under nominal operation, agents exhibit P = 0.33 ± 0.02, below the classical bound of 0.5, revealing an informational cost of action selection. The IDT detects 89.3% of perturbations versus 44.0% for reward-based monitoring, with 4.4x lower median latency. Bipredictability enables early detection of interaction degradation before performance drops and provides a prerequisite signal for closed-loop self-regulation in deployed RL systems.
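"共享信息与总可用信息之比"的一种可能形式化是 I(X;Y) / (H(X)+H(Y)),其上界恰为摘要所述的经典界 0.5。需要强调:论文中的双可预测性 P 定义在完整的观测-动作-结果回路上,此处只是对其直觉的一个假设性二元简化,绝非论文的正式定义:

```python
import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

def coupling_ratio(joint):
    """A toy 'shared over total information' ratio for a 2-D joint
    table: I(X;Y) / (H(X) + H(Y)), bounded above by 0.5.
    NOTE: this is an illustrative formalization, not the paper's P."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    flat = [p for row in joint for p in row]
    mi = entropy(px) + entropy(py) - entropy(flat)
    return mi / (entropy(px) + entropy(py))

coupled     = [[0.5, 0.0], [0.0, 0.5]]    # 完全耦合:比值达到上界 0.5
independent = [[0.25, 0.25], [0.25, 0.25]]  # 相互独立:比值为 0
```

完全耦合时比值取到 0.5 的上界,独立时为 0,与摘要中"名义运行下 P ≈ 0.33,低于经典界 0.5"的量纲直觉一致。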
[AI-99] GlassMol: Interpretable Molecular Property Prediction with Concept Bottleneck Models
【速读】:该论文旨在解决当前分子属性预测中黑箱模型(如大型语言模型和图神经网络)缺乏可解释性的问题,尤其在药物发现领域,模型的不透明性可能导致错误关联被忽视且无法融入人类专业知识。现有可解释方法普遍存在“有效性-可信度权衡”问题,即解释可能偏离模型真实推理过程、性能下降或缺乏领域依据。其解决方案的关键在于提出一种模型无关的概念瓶颈模型(Concept Bottleneck Models, CBMs)——GlassMol,通过自动化概念筛选与大语言模型(LLM)引导的概念选择,系统性缓解了化学场景下的三大挑战:相关性缺口(Relevance Gap)、标注缺口(Annotation Gap)和容量缺口(Capacity Gap)。实验证明,GlassMol在十三个基准测试中通常达到或优于黑箱基线,表明可解释性无需以性能为代价,从而挑战了传统认为可解释性必然损害性能的认知。
链接: https://arxiv.org/abs/2603.01274
作者: Oscar Rivera,Ziqing Wang,Matthieu Dagommer,Abhishek Pandey,Kaize Ding
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Machine learning accelerates molecular property prediction, yet state-of-the-art Large Language Models and Graph Neural Networks operate as black boxes. In drug discovery, where safety is critical, this opacity risks masking false correlations and excluding human expertise. Existing interpretability methods suffer from the effectiveness-trustworthiness trade-off: explanations may fail to reflect a model’s true reasoning, degrade performance, or lack domain grounding. Concept Bottleneck Models (CBMs) offer a solution by projecting inputs to human-interpretable concepts before readout, ensuring that explanations are inherently faithful to the decision process. However, adapting CBMs to chemistry faces three challenges: the Relevance Gap (selecting task-relevant concepts from a large descriptor space), the Annotation Gap (obtaining concept supervision for molecular data), and the Capacity Gap (degrading performance due to bottleneck constraints). We introduce GlassMol, a model-agnostic CBM that addresses these gaps through automated concept curation and LLM-guided concept selection. Experiments across thirteen benchmarks demonstrate that GlassMol generally matches or exceeds black-box baselines, suggesting that interpretability does not sacrifice performance and challenging the commonly assumed trade-off. Code is available at this https URL.
[AI-100] MOSAIC: A Unified Platform for Cross-Paradigm Comparison and Evaluation of Homogeneous and Heterogeneous Multi-Agent RL, LLM, VLM, and Human Decision-Makers
【速读】:该论文旨在解决当前研究中缺乏统一平台支持不同决策范式(如强化学习(Reinforcement Learning, RL)、大语言模型(Large Language Models, LLMs)和视觉语言模型(Vision-Language Models, VLMs))代理在相同环境中协同运行的问题,从而难以在混合多智能体场景下进行公平比较或交叉研究。解决方案的关键在于提出MOSAIC平台,其核心创新包括:(i) 基于IPC的工人协议,将各类框架封装为隔离子进程工作节点,通过版本化的进程间通信协议实现无修改调用;(ii) 操作符抽象接口,统一RL策略、LLM、VLM及人类玩家的行为表现形式,形成最小一致代理接口;(iii) 确定性的跨范式评估框架,提供手动模式(同步推进N个操作符并共享随机种子以可视化行为差异)与脚本模式(基于Python声明式脚本自动化长期实验),确保实验可复现性。
链接: https://arxiv.org/abs/2603.01260
作者: Abdulhamid M. Mousa,Yu Fu,Rakhmonberdi Khajiev,Jalaledin M. Azzabi,Abdulkarim M. Mousa,Peng Yang,Yunusa Haruna,Ming Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 2 figures
Abstract:Reinforcement learning (RL), large language models (LLMs), and vision-language models (VLMs) have been widely studied in isolation. However, existing infrastructure lacks the ability to deploy agents from different decision-making paradigms within the same environment, making it difficult to study them in hybrid multi-agent settings or to compare their behaviour fairly under identical conditions. We present MOSAIC, an open-source platform that bridges this gap by incorporating a diverse set of existing reinforcement learning environments and enabling heterogeneous agents (RL policies, LLMs, VLMs, and human players) to operate within them in ad-hoc team settings with reproducible results. MOSAIC introduces three contributions. (i) An IPC-based worker protocol that wraps both native and third-party frameworks as isolated subprocess workers, each executing its native training and inference logic unmodified, communicating through a versioned inter-process protocol. (ii) An operator abstraction that forms an agent-level interface by mapping workers to agents: each operator, regardless of whether it is backed by an RL policy, an LLM, or a human, conforms to a minimal unified interface. (iii) A deterministic cross-paradigm evaluation framework offering two complementary modes: a manual mode that advances up to N concurrent operators in lock-step under shared seeds for fine-grained visual inspection of behavioural differences, and a script mode that drives automated, long-running evaluation through declarative Python scripts, for reproducible experiments. We release MOSAIC as an open, visual-first platform to facilitate reproducible cross-paradigm research across the RL, LLM, and human-in-the-loop communities.
[AI-101] Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在网络安全任务中因安全对齐机制导致的“防御性拒绝偏差”(Defensive Refusal Bias)问题,即模型倾向于拒绝合法授权的防御性网络安全请求,尤其是当这些请求包含与攻击性任务相似的语言时。其关键解决方案在于:当前LLM的安全对齐策略主要依赖于语义相似性而非意图或授权判断,因此应转向基于意图分析的机制,以在保障不合规行为不被支持的同时,最大化模型对合法防御者的辅助能力。
链接: https://arxiv.org/abs/2603.01246
作者: David Campbell,Neil Kale,Udari Madhushani Sehwag,Bert Herring,Nick Price,Dan Borges,Alex Levinson,Christina Q Knight
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Safety alignment in large language models (LLMs), particularly for cybersecurity tasks, primarily focuses on preventing misuse. While this approach reduces direct harm, it obscures a complementary failure mode: denial of assistance to legitimate defenders. We study Defensive Refusal Bias – the tendency of safety-tuned frontier LLMs to refuse assistance for authorized defensive cybersecurity tasks when those tasks include similar language to an offensive cyber task. Based on 2,390 real-world examples from the National Collegiate Cyber Defense Competition (NCCDC), we find that LLMs refuse defensive requests containing security-sensitive keywords at 2.72× the rate of semantically equivalent neutral requests (p < 0.001). The highest refusal rates occur in the most operationally critical tasks: system hardening (43.8%) and malware analysis (34.3%). Interestingly, explicit authorization, where the user directly instructs the model that they have authority to complete the target task, increases refusal rates, suggesting models interpret justifications as adversarial rather than exculpatory. These findings are urgent for interactive use and critical for autonomous defensive agents, which cannot rephrase refused queries or retry. Our findings suggest that current LLM cybersecurity alignment relies on semantic similarity to harmful content rather than reasoning about intent or authorization. We call for mitigations that analyze intent to maximize defensive capabilities while still preventing harmful compliance.
[AI-102] Extended Empirical Validation of the Explainability Solution Space
【速读】:该论文旨在解决解释性人工智能(Explainable AI, XAI)策略在不同应用场景中缺乏通用性和可迁移性的问题,尤其是在复杂多利益相关者治理环境下如何有效设计和部署XAI方法。其解决方案的关键在于提出并验证了可解释性解决方案空间(Explainability Solution Space, ESS)框架的跨域适应能力,通过在员工流失预测与异构智能城市资源分配两个不同领域中的实证评估,证明ESS能够根据治理角色、风险特征及利益相关者结构系统性调整XAI方法的排序,从而为社会技术系统中的XAI战略设计提供一个具有一般性的决策支持工具。
链接: https://arxiv.org/abs/2603.01235
作者: Antoni Mestre,Manoli Albert,Miriam Gil,Vicente Pelechano
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:This technical report provides an extended validation of the Explainability Solution Space (ESS) through cross-domain evaluation. While initial validation focused on employee attrition prediction, this study introduces a heterogeneous intelligent urban resource allocation system to demonstrate the generality and domain-independence of the ESS framework. The second case study integrates tabular, temporal, and geospatial data under multi-stakeholder governance conditions. Explicit quantitative positioning of representative XAI families is provided for both contexts. Results confirm that ESS rankings are not domain-specific but adapt systematically to governance roles, risk profiles, and stakeholder configurations. The findings reinforce ESS as a generalizable operational decision-support instrument for explainable AI strategy design across socio-technical systems.
[AI-103] RMBench: Memory-Dependent Robotic Manipulation Benchmark with Insights into Policy Design
【速读】:该论文旨在解决当前机器人操作策略在处理需要历史信息推理和长期任务相关状态保持的任务时表现不足的问题,这类能力在真实世界操作场景中极为常见。现有方法对记忆能力的考虑有限,且缺乏系统性的评估框架来分析记忆依赖型操作性能与模型架构设计之间的关系。为填补这一空白,作者提出RMBench——一个包含9个不同复杂度层级的模拟操作任务基准,用于系统性评估策略的记忆能力;并设计了Mem-0,一种具有显式记忆模块的模块化操作策略,支持可控的消融实验。关键创新在于通过结构化基准和可拆解的架构设计,揭示了记忆机制在实际任务中的局限性,并提供了关于架构选择如何影响记忆性能的实证洞察。
链接: https://arxiv.org/abs/2603.01229
作者: Tianxing Chen,Yuran Wang,Mingleyang Li,Yan Qin,Hao Shi,Zixuan Li,Yifan Hu,Yingsheng Zhang,Kaixuan Wang,Yue Chen,Hongcheng Wang,Renjing Xu,Ruihai Wu,Yao Mu,Yaodong Yang,Hao Dong,Ping Luo
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: website: this https URL
Abstract:Robotic manipulation policies have made rapid progress in recent years, yet most existing approaches give limited consideration to memory capabilities. Consequently, they struggle to solve tasks that require reasoning over historical observations and maintaining task-relevant information over time, which are common requirements in real-world manipulation scenarios. Although several memory-aware policies have been proposed, systematic evaluation of memory-dependent manipulation remains underexplored, and the relationship between architectural design choices and memory performance is still not well understood. To address this gap, we introduce RMBench, a simulation benchmark comprising 9 manipulation tasks that span multiple levels of memory complexity, enabling systematic evaluation of policy memory capabilities. We further propose Mem-0, a modular manipulation policy with explicit memory components designed to support controlled ablation studies. Through extensive simulation and real-world experiments, we identify memory-related limitations in existing policies and provide empirical insights into how architectural design choices influence memory performance. The website is available at this https URL.
[AI-104] The Lattice Representation Hypothesis of Large Language Models ICLR2026
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)中连续嵌入空间与符号抽象之间缺乏明确映射关系的问题,即如何从几何层面揭示LLM嵌入所蕴含的逻辑结构和概念层次。其解决方案的关键在于提出“晶格表示假说”(Lattice Representation Hypothesis),该假说将线性属性方向与分离阈值相结合,通过半空间交集生成概念晶格(concept lattice),从而在嵌入空间中实现符号推理——几何交(meet)对应概念交集,几何并(join)对应概念并集,并在属性方向线性无关时获得唯一规范形式。实验基于WordNet子层次结构验证了LLM嵌入确实编码了此类晶格及其逻辑结构,为连续几何与符号抽象之间建立了原理性的桥梁。
链接: https://arxiv.org/abs/2603.01227
作者: Bo Xiong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ICLR 2026
Abstract:We propose the Lattice Representation Hypothesis of large language models: a symbolic backbone that grounds conceptual hierarchies and logical operations in embedding geometry. Our framework unifies the Linear Representation Hypothesis with Formal Concept Analysis (FCA), showing that linear attribute directions with separating thresholds induce a concept lattice via half-space intersections. This geometry enables symbolic reasoning through geometric meet (intersection) and join (union) operations, and admits a canonical form when attribute directions are linearly independent. Experiments on WordNet sub-hierarchies provide empirical evidence that LLM embeddings encode concept lattices and their logical structure, revealing a principled bridge between continuous geometry and symbolic abstraction.
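"线性属性方向 + 分离阈值经半空间交集诱导概念格"可以用一个玩具示例直观体会:概念即一组半空间检验的合取,几何交(meet)就是检验集合的并。以下属性方向、阈值与嵌入向量均为虚构假设,并非论文实现:

```python
def member(x, concept):
    """x belongs to a concept iff it lies in every attribute half-space:
    dot(w, x) > b for each (w, b) test in the concept."""
    return all(sum(wi * xi for wi, xi in zip(w, x)) > b for w, b in concept)

def meet(c1, c2):
    """Geometric meet: intersecting half-spaces = taking the union of
    attribute tests; the extension is the intersection of extensions."""
    return c1 | c2

# 虚构的二维嵌入空间中的两个属性方向
is_animal  = frozenset({((1.0, 0.0), 0.5)})
is_aquatic = frozenset({((0.0, 1.0), 0.5)})
aquatic_animal = meet(is_animal, is_aquatic)

dolphin_embedding = (0.9, 0.8)   # 同时满足两个属性检验
tree_embedding    = (0.1, 0.9)   # 仅满足 is_aquatic 方向的检验
```

属性方向线性无关时,这种表示还有唯一规范形式,这正是假说连接连续几何与符号推理的方式。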
[AI-105] Communication-Efficient Quantum Federated Learning over Large-Scale Wireless Networks
【速读】:该论文旨在解决大规模无线网络中量子联邦学习(Quantum Federated Learning, QFL)的频谱效率问题,核心挑战在于如何在非正交多址接入(Non-Orthogonal Multiple Access, NOMA)环境下,通过联合优化量子设备的信道选择与发射功率,实现系统和速率(sum-rate)最大化。该问题被建模为一个非凸混合整数非线性规划(Mixed-Integer Nonlinear Programming, MINLP)问题,且即使在固定信道选择的情况下仍属于NP-hard难题。解决方案的关键在于提出一种基于量子近似优化算法(Quantum Approximate Optimization Algorithm, QAOA)的迭代优化方法,以高效求解高维、复杂约束下的最优资源配置,并首次从理论上分析了全设备参与下QFL的收敛性,考虑了非凸损失函数、异构数据分布及量子测量噪声等现实因素。仿真结果表明,所提多信道NOMA-QFL框架在模型精度和收敛速度上优于传统方法,同时系统和速率提升超过100%,显著优于现有技术。
链接: https://arxiv.org/abs/2603.01222
作者: Shaba Shaon,Christopher G. Brinton,Dinh C. Nguyen
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注: 21 pages, accepted at IEEE Transactions on Networking
Abstract:Quantum federated learning (QFL) combines the robust data processing of quantum computing with the privacy-preserving features of federated learning (FL). However, in large-scale wireless networks, optimizing sum-rate is crucial for unlocking the true potential of QFL, facilitating effective model sharing and aggregation as devices compete for limited bandwidth amid dynamic channel conditions and fluctuating power resources. This paper studies a novel sum-rate maximization problem within a multi-channel QFL framework, specifically designed for non-orthogonal multiple access (NOMA)-based large-scale wireless networks. We develop a sum-rate maximization problem by jointly considering quantum device’s channel selection and transmit power. Our formulated problem is a non-convex, mixed-integer nonlinear programming (MINLP) challenge that remains non-deterministic polynomial time (NP)-hard even with specified channel selection parameters. The complexity of the problem motivates us to create an effective iterative optimization approach that utilizes the sophisticated quantum approximate optimization algorithm (QAOA) to derive high-quality approximate solutions. Additionally, our study presents the first theoretical exploration of QFL convergence properties under full device participation, rigorously analyzing real-world scenarios with nonconvex loss functions, diverse data distributions, and the effects of quantum shot noise. Extensive simulation results indicate that our multi-channel NOMA-based QFL framework enhances model training and convergence behavior, surpassing conventional algorithms in terms of accuracy and loss. Moreover, our quantum-centric joint optimization approach achieves more than a 100% increase in sum-rate while ensuring rapid convergence, significantly outperforming the state of the art.
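作为背景,上行 NOMA 中基于串行干扰消除(SIC)的和速率可按标准香农公式计算:按接收功率从强到弱依次解码,每个用户只受尚未解码用户的干扰。下面是该教科书式公式的示意(功率与增益为假设数值,这不是论文的 QAOA 优化算法本身):

```python
import math

def uplink_noma_sum_rate(powers, gains, noise=1.0):
    """Sum-rate (bits/s/Hz) of an uplink NOMA cluster with perfect SIC:
    decode users in decreasing received power; each user's interference
    is the total received power of users decoded after it."""
    recv = sorted((p * g for p, g in zip(powers, gains)), reverse=True)
    rate = 0.0
    for i, s in enumerate(recv):
        interference = sum(recv[i + 1:])
        rate += math.log2(1.0 + s / (noise + interference))
    return rate

rate = uplink_noma_sum_rate(powers=[2.0, 2.0], gains=[2.0, 1.0], noise=1.0)
```

完美 SIC 下各项恰好可以裂项相消,和速率等于 log2(1 + 总接收功率/噪声),这也是 NOMA 频谱效率增益的来源;论文要优化的正是信道选择与功率在此类表达式上的联合配置。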
[AI-106] Agents Learn Their Runtime: Interpreter Persistence as Training-Time Semantics
【速读】:该论文旨在解决工具增强型大语言模型(Tool-augmented LLMs)在训练与部署阶段因运行时状态(runtime state)处理方式不一致而导致的效率低下和行为不稳定问题。具体而言,现有训练范式将代理轨迹视为纯文本序列,忽略了执行语义中状态持久性(state persistence)的关键作用,从而导致模型在实际部署时出现状态缺失或冗余计算等错误。其解决方案的关键在于:通过构造一个可控的、部分可观测的优化任务集——Opaque Knapsack,系统性地隔离状态持久性作为训练变量,并生成成对轨迹(仅在是否保留中间状态上不同),进而对比训练数据与部署环境之间状态语义的一致性影响。实验表明,虽然最终解的质量不受状态处理方式显著影响,但token消耗和稳定性差异巨大,说明应将状态持久性视为代理轨迹的第一类语义特征,训练数据需与部署运行时对齐以提升效率并减少脆弱的训练-运行时错位。
链接: https://arxiv.org/abs/2603.01209
作者: Victor May,Aaditya Salgarkar,Yishan Wang,Diganta Misra,Huu Nguyen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code: this https URL
Abstract:Tool-augmented LLMs are increasingly deployed as agents that interleave natural-language reasoning with executable Python actions, as in CodeAct-style frameworks. In deployment, these agents rely on runtime state that persists across steps. By contrast, common training pipelines treat agent traces as token sequences, with execution semantics left implicit. This raises a data-centric question: Is state persistence merely an inference-time scaffold, or can models learn to exploit it when training data exposes the corresponding execution semantics? We isolate state persistence as a training-time variable. We introduce Opaque Knapsack, a procedurally generated family of partially observable optimization tasks designed to prevent one-shot solutions. Item attributes and constraints are hidden behind budgeted tool calls, forcing multi-turn control flow and iterative state revision. Holding task instances, prompts, tools, model, and supervision fixed, we generate paired trajectories differing only in whether interpreter state persists across steps or resets after each action. We then fine-tune identical base models (Qwen3-8B) on each trace variant and evaluate all four train-runtime combinations. Our 2x2 cross-evaluation shows that execution semantics primarily affect how agents reach solutions, not whether they do: solution quality is statistically indistinguishable across conditions, but token cost and stability differ substantially. A persistent-trained model in a stateless runtime triggers missing-variable errors in roughly 80% of episodes; a stateless-trained model in a persistent runtime redundantly re-derives retained state, using roughly 3.5x more tokens. Interpreter persistence should be treated as a first-class semantic of agent traces. Aligning fine-tuning data with deployment runtimes improves efficiency and reduces brittle train-runtime mismatches. 
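摘要中"持久解释器 vs 无状态解释器"的差别,可以用一个极简的 Python 草图体会(动作串与变量名均为笔者假设,并非论文的代理框架):持久模式跨步骤共享命名空间,无状态模式每步重置,从而触发摘要提到的 missing-variable 错误。

```python
def run_agent_actions(actions, persistent=True):
    """依次执行动作代码串:persistent=True 时共享同一命名空间(持久运行时),
    否则每个动作前重置命名空间(无状态运行时)。
    返回 (最终 best 变量的值, NameError 次数)。"""
    namespace = {}
    errors = 0
    for code in actions:
        if not persistent:
            namespace = {}  # 无状态运行时:每个动作前清空解释器状态
        try:
            exec(code, namespace)
        except NameError:
            errors += 1  # 依赖了不存在的变量,对应论文中的 missing-variable 错误
    return namespace.get("best"), errors

actions = ["best = 0", "best = max(best, 7)", "best = max(best, 3)"]
```

论文的一个结论正对应此处:在持久轨迹上训练、却部署到无状态运行时的模型,约 80% 的回合会触发这类错误;反向错配则会冗余重算已保留的状态,消耗约 3.5 倍 token。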
[AI-107] How Well Does Agent Development Reflect Real-World Work?
【速读】:该论文试图解决的问题是:当前人工智能代理(AI agents)的开发与评估基准在多大程度上代表了真实劳动力市场的多样性与价值分布,尤其是是否存在对人类工作领域和技能的代表性偏差。研究发现,当前代理开发高度集中在编程相关任务,而这类任务在整体人力就业和经济价值中所占比例较小,导致基准与现实劳动市场之间存在显著不匹配。解决方案的关键在于提出三个可衡量的基准设计原则:覆盖性(coverage)、真实性(realism)和细粒度评估(granular evaluation),以确保未来基准能够更准确地捕捉社会重要且技术挑战性强的工作类型,从而提升AI代理在真实应用场景中的实用性与公平性。
链接: https://arxiv.org/abs/2603.01203
作者: Zora Zhiruo Wang,Sanidhya Vijayvargiya,Aspen Chen,Hanmo Zhang,Venu Arvind Arangarajan,Jett Chen,Valerie Chen,Diyi Yang,Daniel Fried,Graham Neubig
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:AI agents are increasingly developed and evaluated on benchmarks relevant to human work, yet it remains unclear how representative these benchmarking efforts are of the labor market as a whole. In this work, we systematically study the relationship between agent development efforts and the distribution of real-world human work by mapping benchmark instances to work domains and skills. We first analyze 43 benchmarks and 72,342 tasks, measuring their alignment with human employment and capital allocation across all 1,016 real-world occupations in the U.S. labor market. We reveal substantial mismatches between agent development that tends to be programming-centric, and the categories in which human labor and economic value are concentrated. Within work areas that agents currently target, we further characterize current agent utility by measuring their autonomy levels, providing practical guidance for agent interaction strategies across work scenarios. Building on these findings, we propose three measurable principles for designing benchmarks that better capture socially important and technically challenging forms of work: coverage, realism, and granular evaluation.
[AI-108] Incremental LTLf Synthesis
【速读】:该论文致力于解决增量线性时序逻辑未来(LTLf)合成问题,即在执行过程中逐步接收新目标的情况下,如何高效地更新策略以同时满足原有目标和新增目标。其核心挑战在于,在不重新从头开始合成的前提下,动态调整现有策略以适应新的约束条件。解决方案的关键在于提出两种技术:一是利用基于自动机的合成过程中构建的辅助数据结构来高效实现多目标增量合成;二是基于LTLf公式推进(formula progression)的方法,尽管推进后的公式可能指数级膨胀,但其最小化自动机大小仍受原始公式的限制,从而保证了状态空间的可控性。实验表明,若直接每次重新计算推进后公式的自动机,则该方法效率较低,无法与第一种方案竞争。
链接: https://arxiv.org/abs/2603.01201
作者: Giuseppe De Giacomo,Yves Lespérance,Gianmarco Parretti,Fabio Patrizi,Moshe Y. Vardi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In this paper, we study incremental LTLf synthesis – a form of reactive synthesis where the goals are given incrementally while in execution. In other words, the protagonist agent is already executing a strategy for a certain goal when it receives a new goal: at this point, the agent has to abandon the current strategy and synthesize a new strategy still fulfilling the original goal, which was given at the beginning, as well as the new goal, starting from the current instant. In this paper, we formally define the problem of incremental synthesis and study its solution. We propose a solution technique that efficiently performs incremental synthesis for multiple LTLf goals by leveraging auxiliary data structures constructed during automata-based synthesis. We also consider an alternative solution technique based on LTLf formula progression. We show that, in spite of the fact that formula progression can generate formulas that are exponentially larger than the original ones, their minimal automata remain bounded in size by that of the original formula. On the other hand, we show experimentally that, if implemented naively, i.e., by actually computing the automaton of the progressed LTLf formulas from scratch every time a new goal arrives, the solution based on formula progression is not competitive.
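作为示意,下面给出 LTLf 公式推进(formula progression)的一个极简实现草图(AST 的元组编码方式是笔者的假设,并非论文实现),只覆盖原子、与/或、X(next)与 U(until)少数算子:

```python
TRUE, FALSE = ("true",), ("false",)

def progress(f, state):
    """将 LTLf 公式沿一个观测状态(原子集合)推进一步,返回剩余义务。"""
    op = f[0]
    if op in ("true", "false"):
        return f
    if op == "atom":
        return TRUE if f[1] in state else FALSE
    if op == "next":                      # X g:下一步需满足 g
        return f[1]
    if op == "and":
        a, b = progress(f[1], state), progress(f[2], state)
        if FALSE in (a, b): return FALSE
        if a == TRUE: return b
        if b == TRUE: return a
        return ("and", a, b)
    if op == "or":
        a, b = progress(f[1], state), progress(f[2], state)
        if TRUE in (a, b): return TRUE
        if a == FALSE: return b
        if b == FALSE: return a
        return ("or", a, b)
    if op == "until":                     # f U g  ==  g ∨ (f ∧ X(f U g))
        g = progress(f[2], state)
        if g == TRUE: return TRUE
        a = progress(f[1], state)
        if a == FALSE: return g
        rest = f if a == TRUE else ("and", a, f)
        return rest if g == FALSE else ("or", g, rest)
    raise ValueError(op)
```

摘要指出的要点也能从中看出:推进后的公式可能不断增长,因此若每收到新目标就从头为推进结果构造自动机,代价会很高;论文证明其最小自动机规模仍受原公式限制。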
[AI-109] Scaling of learning time for high dimensional inputs
【速读】:该论文试图解决的问题是:在高维复杂数据中进行表征学习时,模型复杂度(尤其是神经元输入维度)与学习时间之间的权衡关系如何影响人工和生物神经网络的连接结构与学习效率。其解决方案的关键在于基于高维空间几何特性,对一种执行独立成分分析(Independent Component Analysis, ICA)的Hebbian学习模型进行理论分析,发现学习动力学可简化为一维问题,且学习时间仅依赖于初始条件;进一步揭示出学习时间随输入维度呈超线性增长,从而阐明了高维学习中的根本限制,并为神经网络最优设计如何适配数据复杂度提供了理论依据。
链接: https://arxiv.org/abs/2603.01184
作者: Carlos Stein Brito
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC); Computation (stat.CO)
备注: 14 pages, 5 figures
Abstract:Representation learning from complex data typically involves models with a large number of parameters, which in turn require large amounts of data samples. In neural network models, model complexity grows with the number of inputs to each neuron, with a trade-off between model expressivity and learning time. A precise characterization of this trade-off would help explain the connectivity and learning times observed in artificial and biological networks. We present a theoretical analysis of how learning time depends on input dimensionality for a Hebbian learning model performing independent component analysis. Based on the geometry of high-dimensional spaces, we show that the learning dynamics reduce to a unidimensional problem, with learning times dependent only on initial conditions. For higher input dimensions, initial parameters have smaller learning gradients and larger learning times. We find that learning times have supralinear scaling, becoming quickly prohibitive for high input dimensions. These results reveal a fundamental limitation for learning in high dimensions and help elucidate how the optimal design of neural networks depends on data complexity. Our approach outlines a new framework for analyzing learning dynamics and model complexity in neural network models.
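摘要中"维度越高、初始梯度越小"的几何直觉,可以用一个与论文模型无关的小实验体会:随机单位权重向量与目标方向的平均重叠度约按 1/√d 衰减,因此高维下学习的起点离目标"更正交"。

```python
import math
import random

def mean_initial_overlap(dim, trials=2000, seed=0):
    """随机高斯向量归一化后与目标方向 e_1 的平均 |余弦|,
    大维度下约为 sqrt(2 / (pi * dim)),随 dim 增大而收缩。"""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        total += abs(v[0]) / norm  # 与目标 e_1 的 |cos|
    return total / trials
```

初始重叠度小意味着初始学习梯度小,这正是论文中学习时间随输入维度超线性增长的出发点之一。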
[AI-110] ATLAS: AI-Assisted Threat-to-Assertion Learning for System-on-Chip Security Verification
【速读】:该论文旨在解决系统级芯片(SoC)安全验证中缺乏自动化、知识驱动方法的问题,尤其在将漏洞知识库(如CWE)与形式化验证技术结合方面存在断层。解决方案的关键在于提出ATLAS框架,该框架利用大语言模型(LLM)驱动的推理能力,从标准化威胁建模模板出发,识别SoC特定资产并映射相关弱安全性缺陷,进而自动生成基于断言的安全属性和JasperGold验证脚本,实现从漏洞推理到形式化证明的自动化转换,从而推动SoC安全验证向“设计即安全”(secure-by-design)范式演进。
链接: https://arxiv.org/abs/2603.01170
作者: Ishraq Tashdid,Kimia Tasnia,Alexander Garcia,Jonathan Valamehr,Sazadur Rahman
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted at the 63rd Design Automation Conference (DAC 2026), Long Beach, CA, USA (July, 2026)
Abstract:This work presents ATLAS, an LLM-driven framework that bridges standardized threat modeling and property-based formal verification for System-on-Chip (SoC) security. Starting from vulnerability knowledge bases such as Common Weakness Enumeration (CWE), ATLAS identifies SoC-specific assets, maps relevant weaknesses, and generates assertion-based security properties and JasperGold scripts for verification. By combining asset-centric analysis with standardized threat model templates and multi-source SoC context, ATLAS automates the transformation from vulnerability reasoning to formal proof. Evaluated on three HACK@DAC benchmarks, ATLAS detected 39/48 CWEs and generated correct properties for 33 of those bugs, advancing automated, knowledge-driven SoC security verification toward a secure-by-design paradigm.
[AI-111] SphUnc: Hyperspherical Uncertainty Decomposition and Causal Identification via Information Geometry
【速读】:该论文旨在解决复杂多智能体系统中可靠决策所面临的不确定性建模与可解释性难题,特别是如何在存在高阶交互关系的场景下实现校准预测和因果推理。解决方案的关键在于提出SphUnc框架,该框架将超球面表示学习(hyperspherical representation learning)与结构因果建模(structural causal modeling)相结合:首先利用von Mises-Fisher分布将特征映射到单位超球面上的潜在变量,通过信息几何融合分解出认知不确定性(epistemic uncertainty)与随机不确定性(aleatoric uncertainty);进而在此超球面空间上构建结构因果模型,支持基于样本模拟的定向影响识别与干预推理,从而为多智能体系统的不确定性感知推理提供几何-因果基础。
链接: https://arxiv.org/abs/2603.01168
作者: Rong Fu,Chunlei Meng,Jinshuo Liu,Dianyu Zhao,Yongtai Liu,Yibo Meng,Xiaowen Ma,Wangyu Wu,Yangchen Zeng,Kangning Cui,Shuaishuai Cao,Simon Fong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 22 pages, 15 figures
Abstract:Reliable decision-making in complex multi-agent systems requires calibrated predictions and interpretable uncertainty. We introduce SphUnc, a unified framework combining hyperspherical representation learning with structural causal modeling. The model maps features to unit hypersphere latents using von Mises-Fisher distributions, decomposing uncertainty into epistemic and aleatoric components through information-geometric fusion. A structural causal model on spherical latents enables directed influence identification and interventional reasoning via sample-based simulation. Empirical evaluations on social and affective benchmarks demonstrate improved accuracy, better calibration, and interpretable causal signals, establishing a geometric-causal foundation for uncertainty-aware reasoning in multi-agent settings with higher-order interactions.
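摘要中"vMF 分布 + 超球面潜变量"的不确定性思路,可以用平均合成向量长度(mean resultant length)做一个最小示意(与论文实现无关):样本嵌入方向越聚集,合成长度越接近 1,对应的 vMF 浓度参数越大、不确定性越低。

```python
import math

def mean_resultant_length(unit_vectors):
    """单位向量集合的平均合成向量长度 R ∈ [0, 1]:
    R 接近 1 表示方向高度一致(低不确定性),接近 0 表示弥散(高不确定性)。"""
    n, dim = len(unit_vectors), len(unit_vectors[0])
    mean = [sum(v[i] for v in unit_vectors) / n for i in range(dim)]
    return math.sqrt(sum(x * x for x in mean))
```

vMF 分布的最大似然估计正是通过 R 反解浓度参数 κ,因此 R 可视为超球面上最自然的"聚集度"统计量。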
[AI-112] DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent
【速读】:该论文旨在解决深度研究代理(Deep-research agents)在实际应用中面临的两大瓶颈问题:一是缺乏大规模、具有真实世界挑战性的数据集,二是缺少可用于数据合成与代理训练的开源框架。其关键解决方案包括:构建一个名为DeepResearch-9K的大规模多跳问答数据集,该数据集包含9000个分三级难度(L1–L3)的问题、高质量的搜索轨迹及推理链,并附带可验证的答案;同时开发了一个名为DeepResearch-R1的开源训练框架,支持多轮网络交互、多种强化学习(Reinforcement Learning, RL)方法以及多样化的奖励模型(如基于规则的结果奖励和大语言模型作为裁判的反馈)。实证结果表明,在DeepResearch-R1框架下训练的代理在复杂深度研究基准测试中达到了最先进性能。
链接: https://arxiv.org/abs/2603.01152
作者: Tongzhou Wu,Yuhao Wang,Xinyu Ma,Xiuqiang He,Shuaiqiang Wang,Dawei Yin,Xiangyu Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6 pages, 4 figures
Abstract:Deep-research agents are capable of executing multi-step web exploration, targeted retrieval, and sophisticated question answering. Despite their powerful capabilities, deep-research agents face two critical bottlenecks: (1) the lack of large-scale, challenging datasets with real-world difficulty, and (2) the absence of accessible, open-source frameworks for data synthesis and agent training. To bridge these gaps, we first construct DeepResearch-9K, a large-scale challenging dataset specifically designed for deep-research scenarios built from open-source multi-hop question-answering (QA) datasets via a low-cost autonomous pipeline. Notably, it consists of (1) 9000 questions spanning three difficulty levels from L1 to L3 (2) high-quality search trajectories with reasoning chains from Tongyi-DeepResearch-30B-A3B, a state-of-the-art deep-research agent, and (3) verifiable answers. Furthermore, we develop an open-source training framework DeepResearch-R1 that supports (1) multi-turn web interactions, (2) different reinforcement learning (RL) approaches, and (3) different reward models such as rule-based outcome reward and LLM-as-judge feedback. Finally, empirical results demonstrate that agents trained on DeepResearch-9K under our DeepResearch-R1 achieve state-of-the-art results on challenging deep-research benchmarks. We release the DeepResearch-9K dataset on this https URL and the code of DeepResearch-R1 on this https URL.
[AI-113] AutoSkill: Experience-Driven Lifelong Learning via Skill Self-Evolution
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在实际应用中难以积累和复用用户个性化交互经验的问题,即用户反复表达的稳定偏好(如减少幻觉、遵循机构写作规范或避免过度技术性表述)未能被有效固化为可迁移的能力,导致LLM代理无法在不同会话间持续进化。解决方案的关键在于提出AutoSkill框架——一个以经验驱动的终身学习机制,能够自动从对话与交互痕迹中抽象出技能(skill),支持其持续自我演化,并在不重新训练底层模型的前提下动态注入相关技能到后续请求中;该框架设计为与模型无关的插件层,采用标准化技能表示形式,实现跨代理、用户和任务的技能共享与迁移,从而将短暂的交互体验转化为显式、可复用且可组合的能力。
链接: https://arxiv.org/abs/2603.01145
作者: Yutao Yang,Junsong Li,Qianjun Pan,Bihao Zhan,Yuxuan Cai,Lin Du,Jie Zhou,Kai Chen,Qin Chen,Xin Li,Bo Zhang,Liang He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In practical LLM applications, users repeatedly express stable preferences and requirements, such as reducing hallucinations, following institutional writing conventions, or avoiding overly technical wording, yet such interaction experience is seldom consolidated into reusable knowledge. Consequently, LLM agents often fail to accumulate personalized capabilities across sessions. We present AutoSkill, an experience-driven lifelong learning framework that enables LLM agents to automatically derive, maintain, and reuse skills from dialogue and interaction traces. AutoSkill abstracts skills from user experience, supports their continual self-evolution, and dynamically injects relevant skills into future requests without retraining the underlying model. Designed as a model-agnostic plugin layer, it is compatible with existing LLMs and introduces a standardized skill representation for sharing and transfer across agents, users, and tasks. In this way, AutoSkill turns ephemeral interaction experience into explicit, reusable, and composable capabilities. This paper describes the motivation, architecture, skill lifecycle, and implementation of AutoSkill, and positions it with respect to prior work on memory, retrieval, personalization, and agentic systems. AutoSkill highlights a practical and scalable path toward lifelong personalized agents and personal digital surrogates.
[AI-114] A Deep Learning Framework for Heat Demand Forecasting using Time-Frequency Representations of Decomposed Features
【速读】:该论文旨在解决区域供热系统(District Heating Systems)中热需求的多步预测难题,以实现多种能源(如木材、天然气、电力和太阳能)的高效调度,从而在满足用户需求的同时降低碳排放并延长基础设施寿命。其核心挑战在于复杂非线性的用热模式与外部气象因素的耦合关系,传统时间域模型难以捕捉深层时频特征。解决方案的关键在于提出一种基于时频表示的深度学习框架:通过连续小波变换(Continuous Wavelet Transform)对历史热需求和气象数据进行分解,使卷积神经网络(Convolutional Neural Networks)能够学习到标准时间域模型无法获取的分层时序特征,显著提升了预测精度,在丹麦和德国多个城市的多年度测试数据上将平均绝对误差(Mean Absolute Error)降低了36%至43%,最高达到95%的准确率。
链接: https://arxiv.org/abs/2603.01137
作者: Adithya Ramachandran,Satyaki Chatterjee,Thorkil Flensmark B. Neergaard,Maximilian Oberndoerfer,Andreas Maier,Siming Bayer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:District Heating Systems are essential infrastructure for delivering heat to consumers across a geographic region sustainably, yet efficient management relies on optimizing diverse energy sources, such as wood, gas, electricity, and solar, in response to fluctuating demand. Aligning supply with demand is critical not only for ensuring reliable heat distribution but also for minimizing carbon emissions and extending infrastructure lifespan through lower operating temperatures. However, accurate multi-step forecasting to support these goals remains challenging due to complex, non-linear usage patterns and external dependencies. In this work, we propose a novel deep learning framework for day-ahead heat demand prediction that leverages time-frequency representations of historical data. By applying Continuous Wavelet Transform to decomposed demand and external meteorological factors, our approach enables Convolutional Neural Networks to learn hierarchical temporal features that are often inaccessible to standard time domain models. We systematically evaluate this method against statistical baselines, state-of-the-art Transformers, and emerging foundation models using multi-year data from three distinct Danish districts, a Danish city, and a German city. The results show a significant advancement, reducing the Mean Absolute Error by 36% to 43% compared to the strongest baselines, achieving forecasting accuracy of up to 95% across annual test datasets. Qualitative and statistical analyses further confirm the accuracy and robustness by reliably tracking volatile demand peaks where others fail. This work contributes both a high-performance forecasting architecture and critical insights into optimal feature composition, offering a validated solution for modern energy applications.
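论文使用连续小波变换(CWT)生成时频表示;这里用最简单的单层 Haar 分解做一个概念性替代示意(并非论文所用变换),把需求序列拆成低频近似与高频细节两路:

```python
def haar_step(signal):
    """单层 Haar 分解:相邻两点的均值给出近似系数(低频趋势),
    半差给出细节系数(高频波动)。要求序列长度为偶数。"""
    half = len(signal) // 2
    approx = [(signal[2 * i] + signal[2 * i + 1]) / 2 for i in range(half)]
    detail = [(signal[2 * i] - signal[2 * i + 1]) / 2 for i in range(half)]
    return approx, detail
```

对近似系数递归分解、再把各尺度系数按时间对齐堆叠,就得到可供 CNN 学习的二维时频图,这正是论文特征构造的基本思路。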
[AI-115] FCN-LLM : Empower LLM for Brain Functional Connectivity Network Understanding via Graph-level Multi-task Instruction Tuning
【速读】:该论文旨在解决当前脑功能连接网络(Functional Connectivity Networks, FCNs)与大型语言模型(Large Language Models, LLMs)之间缺乏语义对齐的问题,从而限制了LLMs直接理解FCNs的能力。其关键解决方案是提出FCN-LLM框架,通过图级多任务指令微调(multi-task instruction tuning),将FCNs的多层次特征(包括脑区、功能子网络和全脑尺度)编码并映射到LLM的语义空间中,并设计覆盖19种个体特异性属性的多范式指令任务,结合分阶段学习策略实现嵌入对齐与联合微调,最终在大规模多中心FCN数据集上展现出优异的零样本泛化能力。
链接: https://arxiv.org/abs/2603.01135
作者: Xingcan Hu,Wei Wang,Li Xiao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models have achieved remarkable success in language understanding and reasoning, and their multimodal extensions enable comprehension of images, video, and audio. Inspired by this, foundation models for brain functional connectivity networks derived from resting-state fMRI have shown promise in clinical tasks. However, existing methods do not align FCNs with the text modality, limiting the ability of LLMs to directly understand FCNs. To address this, we propose FCN-LLM, a framework that enables LLMs to understand FCNs through graph-level, multi-task instruction tuning. Our approach employs a multi-scale FCN encoder capturing brain-region, functional subnetwork, and whole-brain features, projecting them into the semantic space of LLM. We design multi-paradigm instruction tasks covering 19 subject-specific attributes across demographics, phenotypes, and psychiatric conditions. A multi-stage learning strategy first aligns FCN embeddings with the LLM and then jointly fine-tunes the entire model to capture high-level semantic information. Experiments on a large-scale, multi-site FCN database show that FCN-LLM achieves strong zero-shot generalization on unseen datasets, outperforming conventional supervised and foundation models. This work introduces a new paradigm for integrating brain functional networks with LLMs, offering a flexible and interpretable framework for neuroscience.
[AI-116] HVR-Met: A Hypothesis-Verification-Replanning Agentic System for Extreme Weather Diagnosis
【速读】:该论文旨在解决深度学习气象预报范式在极端天气诊断中面临的挑战,即当前方法难以实现复杂的多步逻辑推理、动态工具调用及专家级先验判断。其解决方案的关键在于提出HVR-Met多智能体气象诊断系统,核心创新是引入“假设-验证-重规划”(Hypothesis-Verification-Replanning)闭环机制,通过精细的迭代推理流程对极端天气事件中的异常气象信号进行深入分析,并结合原子级子任务的新基准评估体系,有效提升了复杂诊断场景下的准确性和可解释性。
链接: https://arxiv.org/abs/2603.01121
作者: Shuo Tang,Jiadong Zhang,Jian Xu,Gengxian Zhou,Qizhao Jin,Qinxuan Wang,Yi Hu,Ning Hu,Hongchang Ren,Lingli He,Jiaolan Fu,Jingtao Ding,Shiming Xiang,Chenglin Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:While deep learning-based weather forecasting paradigms have made significant strides, addressing extreme weather diagnostics remains a formidable challenge. This gap exists primarily because the diagnostic process demands sophisticated multi-step logical reasoning, dynamic tool invocation, and expert-level prior judgment. Although agents possess inherent advantages in task decomposition and autonomous execution, current architectures are still hampered by critical bottlenecks: inadequate expert knowledge integration, a lack of professional-grade iterative reasoning loops, and the absence of fine-grained validation and evaluation systems for complex workflows under extreme conditions. To this end, we propose HVR-Met, a multi-agent meteorological diagnostic system characterized by the deep integration of expert knowledge. Its central innovation is the “Hypothesis-Verification-Replanning” closed-loop mechanism, which facilitates sophisticated iterative reasoning for anomalous meteorological signals during extreme weather events. To bridge gaps within existing evaluation frameworks, we further introduce a novel benchmark focused on atomic-level subtasks. Experimental evidence demonstrates that the system excels in complex diagnostic scenarios.
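"假设-验证-重规划"闭环的控制流可以抽象成如下草图(verify 与 replan 均为占位函数,真实系统中由气象工具调用与 LLM 代理承担,此处仅示意循环结构):

```python
def hvr_loop(initial_hypotheses, verify, replan=None, max_rounds=10):
    """依次验证候选假设;当前一批全部失败时调用 replan 生成新假设,
    直到某个假设通过验证或轮数耗尽。"""
    queue = list(initial_hypotheses)
    for _ in range(max_rounds):
        while queue:
            h = queue.pop(0)
            if verify(h):
                return h          # 验证通过,得到诊断结论
        if replan is None:
            break
        queue = replan()          # 重规划:基于失败证据提出新一批假设
    return None
```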
[AI-117] DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage ICLR2026
【速读】:该论文旨在解决基于组相对策略优化(Group Relative Policy Optimization, GRPO)的强化学习方法在提升多模态大语言模型(Multimodal Large Language Models, MLLMs)推理能力时所面临的奖励稀疏性和优势消失问题。具体而言,当群体级别的奖励过于一致时,GRPO在困难或过于简单的问题上难以提供清晰的优化信号,导致训练效率低下和性能瓶颈。其解决方案的关键在于提出一种难度自适应的优势计算方法——DIVA-GRPO,该方法从全局视角动态评估问题难度,采样具有适当难度水平的变体,并通过难度加权与归一化缩放,在局部和全局群体中计算优势值,从而有效缓解奖励稀疏性和优势消失问题,同时提升训练稳定性与推理性能。
链接: https://arxiv.org/abs/2603.01106
作者: Haowen Gao,Zhenyu Zhang,Liang Pang,Fangda Guo,Hongjian Dou,Guannan Lv,Shaoguo Liu,Tingting Gao,Huawei Shen,Xueqi Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to ICLR 2026. Code and models are available at this https URL
Abstract:Reinforcement learning (RL) with group relative policy optimization (GRPO) has become a widely adopted approach for enhancing the reasoning capabilities of multimodal large language models (MLLMs). While GRPO enables long-chain reasoning without a critic, it often suffers from sparse rewards on difficult problems and advantage vanishing when group-level rewards are too consistent for overly easy or hard problems. Existing solutions (sample expansion, selective utilization, and indirect reward design) often fail to maintain enough variance in within-group reward distributions to yield clear optimization signals. To address this, we propose DIVA-GRPO, a difficulty-adaptive variant advantage method that adjusts variant difficulty distributions from a global perspective. DIVA-GRPO dynamically assesses problem difficulty, samples variants with appropriate difficulty levels, and calculates advantages across local and global groups using difficulty-weighted and normalized scaling. This alleviates reward sparsity and advantage vanishing while improving training stability. Extensive experiments on six reasoning benchmarks demonstrate that DIVA-GRPO outperforms existing approaches in training efficiency and reasoning performance. Code: this https URL
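GRPO 的"优势消失"问题可以用几行代码直观看到:组内奖励完全一致时,标准化后的优势全为零,梯度信号随之消失(以下实现只是通用示意,并非 DIVA-GRPO 的精确公式):

```python
def group_advantages(rewards, eps=1e-8):
    """GRPO 风格的组相对优势:奖励减去组均值,再按组标准差缩放。"""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

def needs_variant_resampling(rewards, tol=1e-6):
    """奖励近乎一致(题目过难或过易)时返回 True,
    提示应按难度自适应地采样变体,这正是 DIVA-GRPO 的切入点。"""
    return max(rewards) - min(rewards) < tol
```

DIVA-GRPO 在此基础上对变体按难度加权,并在局部与全局群体间归一化优势,以恢复被"抹平"的优化信号。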
[AI-118] SyncTrack: Rhythmic Stability and Synchronization in Multi-Track Music Generation ICLR2026
【速读】:该论文旨在解决现有多轨音乐生成模型在生成过程中忽视节奏稳定性与同步性的问题,导致模型更关注各音轨之间的差异而非其内在属性。为应对这一挑战,论文提出SyncTrack模型,其核心创新在于引入了轨道共享模块(track-shared modules)与轨道特定模块(track-specific modules)的混合架构:轨道共享模块通过两个跨轨注意力机制实现节奏信息的同步,而轨道特定模块则利用可学习的乐器先验来捕捉不同音轨的独特音色和音高范围。此外,作者还设计了三项新指标(内轨节奏稳定性 IRS、跨轨节拍同步 CBS 和跨轨节拍分散度 CBD)以量化评估节奏一致性,从而显著提升多轨音乐的整体质量。
链接: https://arxiv.org/abs/2603.01101
作者: Hongrui Wang,Fan Zhang,Zhiyuan Yu,Ziya Zhou,Xi Chen,Can Yang,Yang Wang
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: Accepted by ICLR 2026
Abstract:Multi-track music generation has garnered significant research interest due to its precise mixing and remixing capabilities. However, existing models often overlook essential attributes such as rhythmic stability and synchronization, leading to a focus on differences between tracks rather than their inherent properties. In this paper, we introduce SyncTrack, a synchronous multi-track waveform music generation model designed to capture the unique characteristics of multi-track music. SyncTrack features a novel architecture that includes track-shared modules to establish a common rhythm across all tracks and track-specific modules to accommodate diverse timbres and pitch ranges. Each track-shared module employs two cross-track attention mechanisms to synchronize rhythmic information, while each track-specific module utilizes learnable instrument priors to better represent timbre and other unique features. Additionally, we enhance the evaluation of multi-track music quality by introducing rhythmic consistency through three novel metrics: Inner-track Rhythmic Stability (IRS), Cross-track Beat Synchronization (CBS), and Cross-track Beat Dispersion (CBD). Experiments demonstrate that SyncTrack significantly improves the multi-track music quality by enhancing rhythmic consistency.
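以论文新提出的跨轨节拍同步(CBS)指标为例,其思想可以粗略实现为"参考轨的每个节拍是否都能在其余各轨中找到足够接近的节拍"(以下公式为笔者的近似理解,未必与论文定义一致):

```python
def cross_track_beat_sync(tracks, tol=0.05):
    """tracks 为各音轨的节拍时间列表(秒)。
    返回参考轨(首轨)中被所有其他轨在 tol 容差内匹配到的节拍比例。"""
    ref = tracks[0]
    if not ref:
        return 0.0
    matched = sum(
        1 for b in ref
        if all(any(abs(b - o) <= tol for o in track) for track in tracks[1:])
    )
    return matched / len(ref)
```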
[AI-119] Alien Science: Sampling Coherent but Cognitively Unavailable Research Directions from Idea Atoms ICLR2026
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在科研创新中的局限性问题,即LLMs虽擅长重组已有知识,却难以生成对研究社区而言既连贯又非显而易见的新颖研究方向。为填补这一“认知可用性”(cognitive availability)缺口,作者提出了一套系统性解决方案:首先将论文分解为细粒度的概念单元(conceptual units),进而聚类形成跨论文通用的“思想原子”(idea atoms)词汇表;随后训练两个互补模型——一个用于评估概念组合的连贯性(coherence model),另一个用于预测该组合在研究社区中的可用性(availability model);最终通过采样高连贯性但低可用性的组合,生成“异域”(alien)研究方向。该方案的关键在于将科研创新建模为可量化、可优化的双维度搜索过程,从而突破传统LLM仅依赖语义相似性的生成范式。
链接: https://arxiv.org/abs/2603.01092
作者: Alejandro H. Artiles,Martin Weiss,Levin Brinkmann,Anirudh Goyal,Nasim Rahaman
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published at the ICLR 2026 Post-AGI Science and Society Workshop
Abstract:Large language models are adept at synthesizing and recombining familiar material, yet they often fail at a specific kind of creativity that matters most in research: producing ideas that are both coherent and non-obvious to the current community. We formalize this gap through cognitive availability, the likelihood that a research direction would be naturally proposed by a typical researcher given what they have worked on. We introduce a pipeline that (i) decomposes papers into granular conceptual units, (ii) clusters recurring units into a shared vocabulary of idea atoms, and (iii) learns two complementary models: a coherence model that scores whether a set of atoms constitutes a viable direction, and an availability model that scores how likely that direction is to be generated by researchers drawn from the community. We then sample “alien” directions that score high on coherence but low on availability. On a corpus of \sim 7,500 recent LLM papers from NeurIPS, ICLR and ICML, we validate that (a) conceptual units preserve paper content under reconstruction, (b) idea atoms generalize across papers rather than memorizing paper-specific phrasing, and © the Alien sampler produces research directions that are more diverse than LLM baselines while maintaining coherence.
[AI-120] HideSeek: Remove Image Watermarks with Negligible Cost via Pixel-wise Reconstruction
【速读】:该论文旨在解决当前先进主动式图像水印防御技术在应对恶意攻击时鲁棒性不足的问题,即现有水印方案在面对针对性攻击时易被移除而无法有效保护生成式 AI (Generative AI) 生成图像的版权与来源可信度。其解决方案的关键在于提出 HIDESEEK(HS),一套通用且低成本的攻击方法,能够可靠地去除嵌入的水印信息,同时保持图像的高视觉保真度,从而揭示当前主流水印防御机制的实际局限性。
链接: https://arxiv.org/abs/2603.01067
作者: Huajie Chen,Tianqing Zhu,Hailin Yang,Yuchen Zhong,Yang Zhang,Hui Sun,Heng Xu,Zuobin Ying,Lihua Yin,Wanlei Zhou
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Watermarking has emerged as a key defense against the misuse of machine-generated images (MGIs). Yet the robustness of these protections remains underexplored. To reveal the limits of SOTA proactive image watermarking defenses, we propose HIDESEEK (HS), a suite of versatile and cost-effective attacks that reliably remove embedded watermarks while preserving high visual fidelity.
[AI-121] riMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading
【速读】:该论文旨在解决大规模稀疏模型(Mixture-of-Experts, MoE)在单GPU异构推理场景下因专家(expert)负载分布不均导致的计算效率瓶颈问题。具体而言,现有基于GPU-CPU架构的冷专家(cold experts)卸载方案受限于主机内存带宽,而新兴GPU-NDP架构虽通过DIMM-NDP卸载非热点专家(non-hot experts),但其中存在大量“温专家”(warm experts)——它们虽不常被激活,却因高GPU I/O延迟而严重受挫,同时又具备饱和 NDP 计算吞吐的能力,从而暴露了显著的计算间隙(compute gap)。解决方案的关键在于提出 TriMoE 架构,该架构通过协同利用支持 AMX(Advanced Matrix Extensions)的 CPU 精确地将热、温、冷专家映射到各自最优计算单元上,并结合瓶颈感知的专家调度策略与预测驱动的动态重布局/再平衡机制,有效填补该计算间隙,实验表明其相较最先进方案最高可实现 2.83 倍加速。
链接: https://arxiv.org/abs/2603.01058
作者: Yudong Pan,Yintao He,Tianhua Han,Lian Liu,Shixin Zhao,Zhirong Chen,Mengdi Wang,Cangyuan Li,Yinhe Han,Ying Wang
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: Accepted by DAC 2026
Abstract:To deploy large Mixture-of-Experts (MoE) models cost-effectively, offloading-based single-GPU heterogeneous inference is crucial. While GPU-CPU architectures that offload cold experts are constrained by host memory bandwidth, emerging GPU-NDP architectures utilize DIMM-NDP to offload non-hot experts. However, non-hot experts are not a homogeneous memory-bound group: a significant subset of warm experts exists that is severely penalized by high GPU I/O latency yet can saturate NDP compute throughput, exposing a critical compute gap. We present TriMoE, a novel GPU-CPU-NDP architecture that fills this gap by synergistically leveraging AMX-enabled CPU to precisely map hot, warm, and cold experts onto their optimal compute units. We further introduce a bottleneck-aware expert scheduling policy and a prediction-driven dynamic relayout/rebalancing scheme. Experiments demonstrate that TriMoE achieves up to 2.83x speedup over state-of-the-art solutions.
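三级映射(热专家留在 GPU、温专家交给 AMX CPU、冷专家下放 DIMM-NDP)的调度骨架可以写成如下草图(阈值比例为示意取值,并非论文的实际策略):

```python
def place_experts(activation_counts, hot_frac=0.1, warm_frac=0.3):
    """按激活频次把专家映射到 'gpu'(热)/ 'cpu'(温,AMX)/ 'ndp'(冷)。
    activation_counts: {专家 id: 激活次数}。"""
    ranked = sorted(activation_counts, key=activation_counts.get, reverse=True)
    n = len(ranked)
    hot_cut = max(1, int(n * hot_frac))
    warm_cut = max(hot_cut, int(n * (hot_frac + warm_frac)))
    return {
        e: "gpu" if i < hot_cut else "cpu" if i < warm_cut else "ndp"
        for i, e in enumerate(ranked)
    }
```

真实系统还需结合论文所述的瓶颈感知调度与预测驱动的动态重布局,使映射随负载漂移而更新。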
[AI-122] MMCOMET: A Large-Scale Multimodal Commonsense Knowledge Graph for Contextual Reasoning
【速读】:该论文旨在解决现有多模态常识知识图谱(Multimodal Commonsense Knowledge Graph, MMKG)在支持复杂推理任务(如图像描述生成和故事创作)时存在的局限性,即缺乏对物理、社会和事件性常识的全面整合。其解决方案的关键在于构建首个融合视觉维度的MMKG——MMCOMET,通过高效的图像检索流程将ATOMIC2020知识图谱扩展为包含超过90万个多模态三元组,从而实现文本与视觉信息的协同推理,显著提升了故事生成的丰富性、连贯性和情境一致性。
链接: https://arxiv.org/abs/2603.01055
作者: Eileen Wang,Hiba Arnaout,Dhita Pratama,Shuo Yang,Dangyang Liu,Jie Yang,Josiah Poon,Jeff Pan,Caren Han
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We present MMCOMET, the first multimodal commonsense knowledge graph (MMKG) that integrates physical, social, and eventive knowledge. MMCOMET extends the ATOMIC2020 knowledge graph to include a visual dimension, through an efficient image retrieval process, resulting in over 900K multimodal triples. This new resource addresses a major limitation of existing MMKGs in supporting complex reasoning tasks like image captioning and storytelling. Through a standard visual storytelling experiment, we show that our holistic approach enables the generation of richer, coherent, and contextually grounded stories than those produced using text-only knowledge. This resource establishes a new foundation for multimodal commonsense reasoning and narrative generation.
[AI-123] Turning Black Box into White Box: Dataset Distillation Leaks
【速读】:该论文旨在解决数据蒸馏(Dataset Distillation)技术中存在的隐私泄露问题。尽管合成数据集通常被视为具有隐私保护特性,但作者指出,现有蒸馏方法会导致严重的信息泄露,因为合成数据隐式编码了模型在蒸馏过程中的权重轨迹,使其成为攻击者可利用的过载信息源。解决方案的关键在于提出一种名为信息揭示攻击(Information Revelation Attack, IRA)的新方法,该方法能够准确预测蒸馏算法和模型架构,并成功推断真实数据集中的成员身份甚至恢复敏感样本,从而系统性地暴露当前蒸馏技术的隐私风险。
链接: https://arxiv.org/abs/2603.01053
作者: Huajie Chen,Tianqing Zhu,Yuchen Zhong,Yang Zhang,Shang Wang,Feng He,Lefeng Zhang,Jialiang Shen,Minghao Wang,Wanlei Zhou
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Dataset distillation compresses a large real dataset into a small synthetic one, enabling models trained on the synthetic data to achieve performance comparable to those trained on the real data. Although synthetic datasets are assumed to be privacy-preserving, we show that existing distillation methods can cause severe privacy leakage: because synthetic datasets implicitly encode the weight trajectories of the distilled model, they become over-informative and exploitable by adversaries. To expose this risk, we introduce the Information Revelation Attack (IRA) against state-of-the-art distillation techniques. Experiments show that IRA accurately predicts both the distillation algorithm and model architecture, and can successfully infer membership and recover sensitive samples from the real dataset.
[AI-124] RepoRepair: Leveraging Code Documentation for Repository-Level Automated Program Repair
【速读】:该论文旨在解决自动化程序修复(Automated Program Repair, APR)在从单个函数扩展到完整代码库(repository-level)时面临的挑战,即现有方法受限于局部上下文信息、依赖浅层检索或高成本的代理迭代,在处理跨文件复杂故障时表现不佳。其解决方案的关键在于提出RepoRepair,一种基于文档增强的仓库级故障定位与修复方法:通过大语言模型(LLM)生成从函数到文件层级的结构化代码文档(hierarchical code documentation),构建语义抽象以帮助LLM理解仓库级别的上下文和依赖关系;随后利用这些文档指导故障定位,并结合问题描述由更强的LLM执行修复,从而实现高效且准确的跨文件修复。
链接: https://arxiv.org/abs/2603.01048
作者: Zhongqiang Pan,Chuanyi Li,Wenkang Zhong,Yi Feng,Bin Luo,Vincent Ng
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Automated program repair (APR) struggles to scale from isolated functions to full repositories, as it demands a global, task-aware understanding to locate necessary changes. Current methods, limited by context and reliant on shallow retrieval or costly agent iterations, falter on complex cross-file issues. To this end, we propose RepoRepair, a novel documentation-enhanced approach for repository-level fault localization and program repair. Our core insight is to leverage LLMs to generate hierarchical code documentation (from functions to files) for code repositories, creating structured semantic abstractions that enable LLMs to comprehend repository-level context and dependencies. Specifically, RepoRepair first employs a text-based LLM (e.g., DeepSeek-V3) to generate file/function-level code documentation for repositories, which serves as auxiliary knowledge to guide fault localization. Subsequently, based on the fault localization results and the issue description, a powerful LLM (e.g., Claude-4) attempts to repair the identified suspicious code snippets. Evaluated on SWE-bench Lite, RepoRepair achieves a 45.7% repair rate at a low cost of $0.44 per fix. On SWE-bench Multimodal, it delivers state-of-the-art performance with a 37.1% repair rate despite a higher cost of $0.56 per fix, demonstrating robust and cost-effective performance across diverse problem domains.
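RepoRepair 用层级文档引导故障定位。下面给出一个假设的极简示意:用 issue 描述与文件级文档的词重合度对文件排序;论文中实际由 LLM 结合层级文档进行定位,此处的词重合打分仅为说明性简化,文件名与文档内容均为虚构示例。

```python
# 示意性草图:用 issue 描述与文件级文档的词重合度做粗粒度故障定位。
# 论文中由 LLM 基于层级文档进行定位,此处的打分方式仅为假设的简化。
def tokenize(text):
    return set(text.lower().split())

def rank_files_by_doc(issue: str, file_docs: dict) -> list:
    """file_docs: {文件路径: 该文件的自然语言文档};返回按可疑度降序的文件列表。"""
    issue_tokens = tokenize(issue)
    scores = {
        path: len(issue_tokens & tokenize(doc)) / (len(issue_tokens) or 1)
        for path, doc in file_docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    "auth/login.py": "handles user login password verification and session tokens",
    "utils/math.py": "numeric helpers for rounding and statistics",
}
ranked = rank_files_by_doc("login fails with wrong password error", docs)
print(ranked[0])  # auth/login.py
```

这种"先读文档、后读代码"的分层检索,正是降低跨文件修复上下文成本的关键思路。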
[AI-125] One-Token Verification for Reasoning Correctness Estimation
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂推理任务中面临的两个关键问题:一是多样本解码导致的显著推理延迟,尤其在生成长文本时;二是缺乏可靠机制对单条推理轨迹(reasoning trace)的正确性进行评估。解决方案的关键在于提出一种名为“单令牌验证”(One-Token Verification, OTV)的方法,其核心创新是通过一个可学习的标记激活验证机制,并利用低秩适配(Low-Rank Adaptation, LoRA)将该机制嵌入LLM中,从而在生成过程中仅需一次前向传播即可估计每个token级别的推理正确性。OTV基于键值缓存(key-value cache)探测内部推理信号,无需干扰主推理流程,且支持任意生成阶段的实时验证,最终实现更高效、可靠的推理决策,实验表明其在数学推理基准上优于现有验证器,并能通过正确性引导的提前终止策略将token消耗降低高达90%。
链接: https://arxiv.org/abs/2603.01025
作者: Zhan Zhuang,Xiequn Wang,Zebin Chen,Feiyang Ye,Ying Wei,Kede Ma,Yu Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent breakthroughs in large language models (LLMs) have led to notable successes in complex reasoning tasks, such as mathematical problem solving. A common strategy for improving performance is parallel thinking, in which multiple reasoning traces are generated and the final prediction is made using aggregation schemes like majority voting or best-of-N decoding. However, two key challenges persist. First, multi-sample decoding incurs substantial inference latency, especially for long-form outputs. Second, effective mechanisms for reliably assessing the correctness of individual reasoning traces are still limited. To address these challenges, we introduce One-Token Verification (OTV), a computational method that estimates reasoning correctness in a single forward pass during generation. OTV is activated by a learnable token and integrated into the LLM via low-rank adaptation to probe internal reasoning signals through the key-value cache, supporting token-level correctness estimation at any stage of generation without disrupting primary reasoning. Experiments on mathematical reasoning benchmarks demonstrate that OTV consistently surpasses existing verifiers. Additionally, OTV reduces token usage by up to 90% through correctness-guided early termination, prioritizing shorter, more reliable solutions.
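OTV 的核心是"边生成边打分、分数够高就提前终止"。下面用纯 Python 模拟这一控制流:隐藏状态与线性探针权重均为随机的假设值,论文中的真实实现是通过可学习 token 与 LoRA 在 KV cache 上探测,此处仅演示打分与提前终止的流程。

```python
import math
import random

# 示意性草图:隐藏状态上的线性"正确性探针" + 基于正确性的提前终止。
# 隐藏状态与探针权重均为假设的随机值,仅演示"单次前向打分 -> 提前终止"的控制流。
random.seed(0)
DIM = 16
w = [random.gauss(0, 1) for _ in range(DIM)]   # 假设的探针权重(实际由训练得到)

def correctness_score(hidden):
    z = sum(hi * wi for hi, wi in zip(hidden, w))
    return 1.0 / (1.0 + math.exp(-z))          # sigmoid 打分

def generate_with_early_stop(hidden_states, threshold=0.9):
    """逐 token 打分,分数一旦超过阈值即提前终止,返回实际消耗的 token 数。"""
    for t, h in enumerate(hidden_states, start=1):
        if correctness_score(h) >= threshold:
            return t
    return len(hidden_states)

states = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(100)]
used = generate_with_early_stop(states)
print(1 <= used <= 100)  # True
```

当打分器足够可靠时,这种提前终止即对应摘要中"最多节省 90% token"的来源。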
[AI-126] An Open-Source Modular Benchmark for Diffusion-Based Motion Planning in Closed-Loop Autonomous Driving
【速读】:该论文旨在解决扩散模型驱动的运动规划器在真实自动驾驶系统中部署时存在的两大问题:一是现有评估方法忽略了ROS 2通信延迟和实时调度约束,二是单体ONNX模型部署导致求解参数无法在运行时调整,且缺乏对去噪过程的可观测性。解决方案的关键在于构建一个开源模块化基准测试框架,通过ONNX GraphSurgeon将原本18,398个节点的扩散规划器拆分为三个可独立执行的模块,并用原生C++重写DPM-Solver++去噪循环;该框架集成到Autoware ROS 2节点中,支持运行时配置求解参数、实现每步去噪过程的可观测性,从而打破单体部署的黑箱限制。
链接: https://arxiv.org/abs/2603.01023
作者: Yun Li,Simon Thompson,Yidu Zhang,Ehsan Javanmardi,Manabu Tsukada
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures
Abstract:Diffusion-based motion planners have achieved state-of-the-art results on benchmarks such as nuPlan, yet their evaluation within closed-loop production autonomous driving stacks remains largely unexplored. Existing evaluations abstract away ROS 2 communication latency and real-time scheduling constraints, while monolithic ONNX deployment freezes all solver parameters at export time. We present an open-source modular benchmark that addresses both gaps: using ONNX GraphSurgeon, we decompose a monolithic 18,398 node diffusion planner into three independently executable modules and reimplement the DPM-Solver++ denoising loop in native C++. Integrated as a ROS 2 node within Autoware, the open-source AD stack deployed on real vehicles worldwide, the system enables runtime-configurable solver parameters without model recompilation and per-step observability of the denoising process, breaking the black box of monolithic deployment. Unlike evaluations in standalone simulators such as CARLA, our benchmark operates within a production-grade stack and is validated through AWSIM closed-loop simulation. Through systematic comparison of DPM-Solver++ (first- and second-order) and DDIM across six step-count configurations (N ∈ {3, 5, 7, 10, 15, 20}), we show that encoder caching yields a 3.2x latency reduction, and that second-order solving reduces FDE by 41% at N=3 compared to first-order. The complete codebase will be released as open-source, providing a direct path from simulation benchmarks to real-vehicle deployment.
[AI-127] FastCode: Fast and Cost-Efficient Code Understanding and Reasoning
【速读】:该论文旨在解决大规模代码库中推理任务的准确性与上下文成本之间的权衡问题,即现有基于代理(agentic)的方法在处理复杂软件工程任务时,常因低效的迭代式全文探索导致计算资源浪费。其解决方案的关键在于提出一种"先侦察后消费"的框架 FastCode,通过将代码库探索与内容消费解耦,利用结构化侦察机制构建轻量级语义-结构地图,实现无需全量文本加载即可精准定位相关代码目标,并结合成本感知策略优化上下文构造流程,从而在单步内生成高价值上下文,显著提升推理准确率并降低token消耗。
链接: https://arxiv.org/abs/2603.01012
作者: Zhonghang Li,Zongwei Li,Yuxuan Chen,Han Shi,Jiawei Li,Jierun Chen,Haoli Bai,Chao Huang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Repository-scale code reasoning is a cornerstone of modern AI-assisted software engineering, enabling Large Language Models (LLMs) to handle complex workflows from program comprehension to complex debugging. However, balancing accuracy with context cost remains a significant bottleneck, as existing agentic approaches often waste computational resources through inefficient, iterative full-text exploration. To address this, we introduce FastCode, a framework that decouples repository exploration from content consumption. FastCode utilizes a structural scouting mechanism to navigate a lightweight semantic-structural map of the codebase, allowing the system to trace dependencies and pinpoint relevant targets without the overhead of full-text ingestion. By leveraging structure-aware navigation tools regulated by a cost-aware policy, the framework constructs high-value contexts in a single, optimized step. Extensive evaluations on the SWE-QA, LongCodeQA, LOC-BENCH, and GitTaskBench benchmarks demonstrate that FastCode consistently outperforms state-of-the-art baselines in reasoning accuracy while significantly reducing token consumption, validating the efficiency of scouting-first strategies for large-scale code reasoning. Source code is available at this https URL.
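"先侦察"依赖一张轻量的语义-结构地图。下面用 Python 标准库 ast 构建一个最小的结构地图(函数 → 其调用的函数),示意如何不读入全文语义即可追踪依赖;这只是示意,论文中的地图包含更丰富的语义信息,示例源码为虚构。

```python
import ast

# 示意性草图:为单个源文件构建"函数 -> 调用了哪些函数"的轻量结构地图,
# 借助这类地图可以先定位目标再按需读取代码,避免全文摄入。
def build_call_map(source: str) -> dict:
    tree = ast.parse(source)
    call_map = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            calls = {
                n.func.id
                for n in ast.walk(node)
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
            }
            call_map[node.name] = sorted(calls)
    return call_map

src = """
def helper():
    return 1

def main():
    x = helper()
    print(x)
"""
print(build_call_map(src))  # {'helper': [], 'main': ['helper', 'print']}
```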
[AI-128] AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching
【速读】:该论文旨在解决生成式音频流模型(audio Flow Matching)中基于REPresentation Alignment (REPA) 的表示对齐策略在token条件控制下的有效性受限问题,尤其是监督层选择依赖启发式方法(如按深度选取)导致性能不稳定的问题。其解决方案的关键在于提出Attribution-Guided REPresentation Alignment (AG-REPA),通过引入一种前向门控消融(FoG-A)机制,量化各层对速度场(velocity field)的因果贡献,从而识别出真正驱动生成过程的因果主导层,而非仅具有高语义/声学信息存储能力的“功能被动”层,实现了稀疏且自适应的层选择与加权对齐,显著提升了模型在统一语音与通用音频训练场景下的表现。
链接: https://arxiv.org/abs/2603.01006
作者: Pengfei Zhang,Tianxin Xie,Minghao Yang,Li Liu
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: 13 pages, 4 figures, 4 tables
Abstract:REPresentation Alignment (REPA) improves the training of generative flow models by aligning intermediate hidden states with pretrained teacher features, but its effectiveness in token-conditioned audio Flow Matching critically depends on the choice of supervised layers, which is typically made heuristically based on the depth. In this work, we introduce Attribution-Guided REPresentation Alignment (AG-REPA), a novel causal layer selection strategy for representation alignment in audio Flow Matching. Firstly, we find that layers that best store semantic/acoustic information (high teacher-space similarity) are not necessarily the layers that contribute most to the velocity field that drives generation, and we call it Store-Contribute Dissociation (SCD). To turn this insight into an actionable training guidance, we propose a forward-only gate ablation (FoG-A) that quantifies each layer’s causal contribution via the induced change in the predicted velocity field, enabling sparse layer selection and adaptive weighting for alignment. Across unified speech and general-audio training (LibriSpeech + AudioSet) under different token-conditioning topologies, AG-REPA consistently outperforms REPA baselines. Overall, our results show that alignment is most effective when applied to the causally dominant layers that drive the velocity field, rather than to layers that are representationally rich but functionally passive.
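FoG-A 的思想是:关闭某一层的(门控)贡献,观察预测速度场变化多大,以此量化该层的因果贡献。下面用一个玩具残差网络示意这一消融流程;网络结构、权重与维度均为假设,真实场景中消融对象是 Flow Matching 模型的 Transformer 层。

```python
import numpy as np

# 示意性草图:FoG-A 式的前向门控消融——逐层置零玩具残差网络的残差分支,
# 以输出"速度场"的 L2 变化作为该层的因果贡献度量。
rng = np.random.default_rng(1)
n_layers, d = 4, 8
weights = [rng.normal(scale=0.5, size=(d, d)) for _ in range(n_layers)]

def velocity(x, gates):
    h = x
    for g, W in zip(gates, weights):
        h = h + g * np.tanh(h @ W)   # gate=0 即关闭该层残差分支
    return h

x = rng.normal(size=d)
v_full = velocity(x, [1.0] * n_layers)
contrib = []
for k in range(n_layers):
    gates = [1.0] * n_layers
    gates[k] = 0.0
    contrib.append(float(np.linalg.norm(v_full - velocity(x, gates))))
print(int(np.argmax(contrib)))  # 对速度场贡献最大的层索引
```

按这一贡献度量做稀疏层选择与加权,对应摘要中"对齐应施加在因果主导层"的结论。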
[AI-129] CollabEval: Enhancing LLM-as-a-Judge via Multi-Agent Collaboration
【速读】:该论文旨在解决当前基于单个大语言模型(Large Language Models, LLMs)进行AI生成内容评估时存在的判断不一致性和预训练数据带来的固有偏见问题。其解决方案的关键在于提出了一种名为CollabEval的多智能体协同评估框架,该框架采用三阶段协作评估流程——初始评估、多轮讨论与最终判断,并通过智能体间的策略性共识检查机制实现高效协同,从而在保证评估质量的同时提升鲁棒性与一致性。
链接: https://arxiv.org/abs/2603.00993
作者: Yiyue Qian,Shinan Zhang,Yun Zhou,Haibo Ding,Diego Socolinsky,Yi Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have revolutionized AI-generated content evaluation, with the LLM-as-a-Judge paradigm becoming increasingly popular. However, current single-LLM evaluation approaches face significant challenges, including inconsistent judgments and inherent biases from pre-training data. To address these limitations, we propose CollabEval, a novel multi-agent evaluation framework that implements a three-phase Collaborative Evaluation process: initial evaluation, multi-round discussion, and final judgment. Unlike existing approaches that rely on competitive debate or single-model evaluation, CollabEval emphasizes collaboration among multiple agents with strategic consensus checking for efficiency. Our extensive experiments demonstrate that CollabEval consistently outperforms single-LLM approaches across multiple dimensions while maintaining robust performance even when individual models struggle. The framework provides comprehensive support for various evaluation criteria while ensuring efficiency through its collaborative design.
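CollabEval 的"策略性共识检查"用于在初评意见已一致时跳过多轮讨论以节省开销。以下是一个假设的简化实现:若初评分数的极差低于阈值则直接取均值,否则进入讨论轮;分数区间、阈值与聚合方式均为假设,并非论文的原始设定。

```python
# 示意性草图:三阶段协同评估(初评 -> 讨论 -> 终判)中的共识检查。
def consensus_check(scores, tol=1.0):
    """初评分数极差不超过 tol 即视为达成共识,跳过讨论轮。"""
    return max(scores) - min(scores) <= tol

def collaborative_eval(initial_scores, discuss_fn=None, tol=1.0):
    if consensus_check(initial_scores, tol):
        return sum(initial_scores) / len(initial_scores)   # 直接汇总终判
    revised = discuss_fn(initial_scores)                   # 多轮讨论后再汇总
    return sum(revised) / len(revised)

print(collaborative_eval([4.0, 4.5, 4.2]))   # 已共识 -> 直接取均值(约 4.23)
print(collaborative_eval([1.0, 5.0, 3.0],
                         discuss_fn=lambda s: [3.0, 3.5, 3.0]))  # 讨论后汇总
```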
[AI-130] Tracking Capabilities for Safer Agents
【速读】:该论文旨在解决AI代理(AI agent)通过工具调用与现实世界交互时面临的根本性安全挑战,包括隐私信息泄露、意外副作用以及提示注入攻击等问题。其解决方案的关键在于引入基于编程语言的“安全约束机制”(safety harness),即要求代理将意图表达为在能力安全语言(capability-safe language)中编写的代码,具体采用的是支持捕获检查(capture checking)的Scala 3语言。该方案利用类型系统静态追踪能力(capability),实现对资源和效应访问的细粒度控制,尤其通过局部纯性(local purity)机制确保子计算无副作用,从而有效防止处理敏感数据时的信息泄露。实验表明,该方法可在不显著影响任务性能的前提下生成能力安全代码,并可靠地阻止如信息泄露和恶意副作用等不安全行为。
链接: https://arxiv.org/abs/2603.00991
作者: Martin Odersky,Yaoyu Zhao,Yichen Xu,Oliver Bračevac,Cao Nguyen Pham
机构: 未知
类目: Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:
Abstract:AI agents that interact with the real world through tool calls pose fundamental safety challenges: agents might leak private information, cause unintended side effects, or be manipulated through prompt injection. To address these challenges, we propose to put the agent in a programming-language-based “safety harness”: instead of calling tools directly, agents express their intentions as code in a capability-safe language: Scala 3 with capture checking. Capabilities are program variables that regulate access to effects and resources of interest. Scala’s type system tracks capabilities statically, providing fine-grained control over what an agent can do. In particular, it enables local purity, the ability to enforce that sub-computations are side-effect-free, preventing information leakage when agents process classified data. We demonstrate that extensible agent safety harnesses can be built by leveraging a strong type system with tracked capabilities. Our experiments show that agents can generate capability-safe code with no significant loss in task performance, while the type system reliably prevents unsafe behaviors such as information leakage and malicious side effects.
[AI-131] HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在长程任务中因缺乏结构化规划与可靠执行能力而导致的性能瓶颈问题。现有方法依赖扁平的自回归策略,将高层推理与底层动作生成混杂在同一序列中,造成探索效率低下和误差传播严重。其解决方案的关键在于提出HiMAC框架,通过显式分解长程决策为宏观层面的规划(planning)与微观层面的执行(execution),将推理建模为结构化蓝图生成过程,并结合目标条件的动作执行机制,从而实现鲁棒的长程规划;同时引入无评判器的分层策略优化范式和迭代协同进化训练策略,有效缓解分层学习中的非平稳性问题,显著提升样本效率与任务成功率。
链接: https://arxiv.org/abs/2603.00977
作者: Hongbo Jin,Rongpeng Zhu,Jiayu Ding,Wenhao Zhang,Ge Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language model (LLM) agents have recently demonstrated strong capabilities in interactive decision-making, yet they remain fundamentally limited in long-horizon tasks that require structured planning and reliable execution. Existing approaches predominantly rely on flat autoregressive policies, where high-level reasoning and low-level actions are generated within a single token sequence, leading to inefficient exploration and severe error propagation over extended trajectories. In this work, we propose HiMAC, a hierarchical agentic RL framework that explicitly decomposes long-horizon decision-making into macro-level planning and micro-level execution. HiMAC models reasoning as a structured blueprint generation process followed by goal-conditioned action execution, enabling robust long-horizon planning within LLM-based agents. To train this hierarchy efficiently, we introduce a critic-free hierarchical policy optimization paradigm that extends group-based reinforcement learning to bi-level structures through hierarchical relative advantage estimation. Furthermore, we propose an iterative co-evolution training strategy that alternates between planner exploration and executor adaptation, mitigating the non-stationarity inherent in hierarchical learning. Extensive experiments on ALFWorld, WebShop, and Sokoban demonstrate that HiMAC consistently outperforms strong prompting and reinforcement learning baselines, achieving state-of-the-art performance and substantially improved sample efficiency across both text-based and visually grounded environments. Our results show that introducing structured hierarchy, rather than increasing model scale alone, is a key factor for enabling robust long-horizon agentic intelligence.
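HiMAC 的"层级相对优势估计"把组内相对优势分别施加于宏观规划层与微观执行层。以下为 GRPO 风格组内标准化优势在两层的最小示意;奖励数值为假设,仅演示"同一任务组内 (r − mean)/std 标准化、分层计算"的结构。

```python
import statistics

# 示意性草图:层级相对优势估计——对同一任务组内的多条轨迹,
# 分别在规划层与执行层做组内标准化 (r - mean) / std(GRPO 风格)。
def group_relative_advantage(rewards):
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0   # 防止除零
    return [(r - mu) / sigma for r in rewards]

planner_rewards = [1.0, 0.0, 1.0, 0.0]    # 宏观:蓝图是否达成子目标(假设值)
executor_rewards = [0.9, 0.2, 0.7, 0.1]   # 微观:动作序列的逐步回报(假设值)
adv_macro = group_relative_advantage(planner_rewards)
adv_micro = group_relative_advantage(executor_rewards)
print(adv_macro)  # [1.0, -1.0, 1.0, -1.0]
```

由于只需组内统计量,这类优势估计无需额外的 critic 网络,对应摘要中的 critic-free 设计。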
[AI-132] Forgetting is Competition: Rethinking Unlearning as Representation Interference in Diffusion Models
【速读】:该论文旨在解决文本到图像扩散模型中概念遗忘(unlearning)时出现的不均衡删除与无关能力意外丢失问题,这在版权合规、受保护数据处理、艺术家退出机制及政策驱动的内容更新等场景中尤为关键。其解决方案的核心在于提出SurgUn方法——一种基于“逆行干扰理论”(retroactive interference theory)的外科式遗忘策略,通过针对性地修改模型权重空间,在保留非目标概念功能的前提下,仅破坏目标视觉概念的表示路径,从而实现高精度、选择性的概念移除。
链接: https://arxiv.org/abs/2603.00975
作者: Ashutosh Ranjan,Vivek Srivastava,Shirish Karande,Murari Mandal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Unlearning in text-to-image diffusion models often leads to uneven concept removal and unintended forgetting of unrelated capabilities. This complicates tasks such as copyright compliance, protected data mitigation, artist opt-outs, and policy-driven content updates. As models grow larger and adopt more diverse architectures, achieving precise and selective unlearning while preserving generative quality becomes increasingly challenging. We introduce SurgUn (pronounced as Surgeon), a surgical unlearning method that applies targeted weight-space updates to remove specific visual concepts in text-conditioned diffusion models. Our approach is motivated by retroactive interference theory, which holds that newly acquired memories can overwrite, suppress, or impede access to prior ones by competing for shared representational pathways. We adapt this principle to diffusion models by inducing retroactive concept interference, enabling focused destabilization of only the target concept while preserving unrelated capabilities through a novel training paradigm. SurgUn achieves high-precision unlearning across diverse settings. It performs strongly on compact U-Net based models such as Stable Diffusion v1.5, scales effectively to the larger U-Net architecture SDXL, and extends to SANA, representing an underexplored Diffusion Transformer based architecture for unlearning.
[AI-133] AWE: Adaptive Agents for Dynamic Web Penetration Testing
【速读】:该论文旨在解决当前AI辅助Web应用开发加速背景下,传统安全工具(如基于模式的扫描器)难以适应新型攻击场景、而新兴基于大语言模型(Large Language Model, LLM)的渗透测试工具又存在探索无约束、成本高、行为不稳定及结果不可复现等问题。其解决方案的关键在于提出AWE框架——一个记忆增强型多智能体系统,通过将结构化的、针对漏洞特性的分析流水线嵌入轻量级LLM编排层中,实现上下文感知的载荷变异与生成,并结合持久化记忆和浏览器后端验证机制,从而获得确定性、以利用为导向的测试结果。此设计显著提升了注入类漏洞(如XSS和盲注SQL注入)的检测准确率与效率,同时降低了资源消耗。
链接: https://arxiv.org/abs/2603.00960
作者: Akshat Singh Jaswal,Ashish Baghel
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Modern web applications are increasingly produced through AI-assisted development and rapid no-code deployment pipelines, widening the gap between accelerating software velocity and the limited adaptability of existing security tooling. Pattern-driven scanners fail to reason about novel contexts, while emerging LLM-based penetration testers rely on unconstrained exploration, yielding high cost, unstable behavior, and poor reproducibility. We introduce AWE, a memory-augmented multi-agent framework for autonomous web penetration testing that embeds structured, vulnerability-specific analysis pipelines within a lightweight LLM orchestration layer. Unlike general-purpose agents, AWE couples context aware payload mutations and generations with persistent memory and browser-backed verification to produce deterministic, exploitation-driven results. Evaluated on the 104-challenge XBOW benchmark, AWE achieves substantial gains on injection-class vulnerabilities - 87% XSS success (+30.5% over MAPTA) and 66.7% blind SQL injection success (+33.3%) - while being much faster, cheaper, and more token-efficient than MAPTA, despite using a midtier model (Claude Sonnet 4) versus MAPTA’s GPT-5. MAPTA retains higher overall coverage due to broader exploratory capabilities, underscoring the complementary strengths of specialized and general-purpose architectures. Our results demonstrate that architecture matters as much as model reasoning capabilities: integrating LLMs into principled, vulnerability-aware pipelines yields substantial gains in accuracy, efficiency, and determinism for injection-class exploits. 
The source code for AWE is available at: this https URL
Journal reference: Workshop on LLM Assisted Security and Trust Exploration (LAST-X), co-located with NDSS, 2026. Related DOI: https://doi.org/10.14722/last-x.2026.23037
[AI-134] Curvature-Weighted Capacity Allocation: A Minimum Description Length Framework for Layer-Adaptive Large Language Model Optimization
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)中层间容量分配不均的问题,即某些层对损失函数的降低贡献显著,而其他层则近乎冗余,现有方法如基于影响函数的层评分虽能提供敏感性估计,但缺乏在硬件约束下将这些评分转化为合理容量分配或剪枝决策的理论机制。解决方案的关键在于提出一个统一且曲率感知的框架,其核心是引入曲率调整后的层增益量 ζ_k² = g_k^⊤ H̃_kk⁻¹ g_k,该量等于仅更新第 k 层所能实现的经验风险最大二阶下降,且严格优于仅依赖梯度范数的评分,因其融合了局部曲率信息。通过将此增益归一化为层质量分数 q_k,作者构建了两个凸的最小描述长度(Minimum Description Length, MDL)优化程序:一个容量分配程序,在边际收益递减条件下优先向高曲率层分配专家槽位或LoRA秩;另一个剪枝程序则集中稀疏化低增益层并保护高增益层免受性能退化。两者均可通过单个对偶变量参数化,并以 O(K log(1/ε)) 复杂度求解,同时证明了当曲率评分漂移不超过 δ 时,源域分配策略在目标任务上的转移遗憾为 O(δ²),且常数与目标问题条件数相关,从而实现了从经验启发式到理论可证、计算高效的层级容量优化框架跃迁。
链接: https://arxiv.org/abs/2603.00910
作者: Theophilus Amaefuna,Hitesh Vaidya,Anshuman Chhabra,Ankur Mali
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 20 pages, 3 figures, 5 tables
Abstract:Layer-wise capacity in large language models is highly non-uniform: some layers contribute disproportionately to loss reduction while others are near-redundant. Existing methods for exploiting this non-uniformity, such as influence-function-based layer scoring, produce sensitivity estimates but offer no principled mechanism for translating them into allocation or pruning decisions under hardware constraints. We address this gap with a unified, curvature-aware framework grounded in the Minimum Description Length (MDL) principle. Our central quantity is the curvature-adjusted layer gain \zeta_k^2 = g_k^\top \widetilde{H}_{kk}^{-1} g_k , which we show equals twice the maximal second-order reduction in empirical risk achievable by updating layer k alone, and which strictly dominates gradient-norm-based scores by incorporating local curvature. Normalizing these gains into layer quality scores q_k , we formulate two convex MDL programs: a capacity allocation program that distributes expert slots or LoRA rank preferentially to high-curvature layers under diminishing returns, and a pruning program that concentrates sparsity on low-gain layers while protecting high-gain layers from degradation. Both programs admit unique closed-form solutions parameterized by a single dual variable, computable in O(K \log 1/\varepsilon) via bisection. We prove an O(\delta^2) transfer regret bound showing that source-domain allocations remain near-optimal on target tasks when curvature scores drift by \delta , with explicit constants tied to the condition number of the target program. Together, these results elevate layer-wise capacity optimization from an empirical heuristic to a theoretically grounded, computationally efficient framework with provable optimality and generalization guarantees.
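摘要指出两个凸 MDL 规划均有由单个对偶变量参数化的闭式解,并可用二分法求解。下面给出一个示意性草图:假设目标为最大化 Σ_k q_k·log(1+r_k)(边际收益递减)、约束 Σ_k r_k = R,此目标形式为假设,并非论文的原始规划;KKT 条件给出闭式 r_k(λ) = max(0, q_k/λ − 1),再对 λ 二分即可。

```python
# 示意性草图:单对偶变量 + 二分法的容量分配("水填充"结构)。
# 假设目标 max sum_k q_k * log(1 + r_k),约束 sum_k r_k = R,r_k >= 0;
# KKT 给出 r_k(λ) = max(0, q_k/λ - 1),对 λ 做二分使总分配量恰为 R。
def allocate(q, R, iters=60):
    lo, hi = 1e-9, max(q)            # λ = max(q) 时所有分配量为 0
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        total = sum(max(0.0, qk / lam - 1.0) for qk in q)
        if total > R:
            lo = lam                 # 分配过多 -> 增大 λ
        else:
            hi = lam
    return [max(0.0, qk / lam - 1.0) for qk in q]

q = [4.0, 2.0, 1.0, 0.5]             # 层质量分数(假设值)
r = allocate(q, R=6.0)
print(round(sum(r), 6), r[0] > r[1] > r[2])  # 总量收敛到 6.0,高分层获得更多容量
```

注意最低分层被自动置零:这正对应"剪枝程序集中稀疏化低增益层"的行为。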
[AI-135] Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在标准AI基准测试中表现优异,但其在实际教育场景中的下游任务(如教学与学习)中是否真正具备有效性的问题。研究发现,尽管LLMs在不同任务间表现出较高的内部行为相关性,但这些行为与专家人类教师的行为及学生的学习成果之间存在显著偏差,且这种偏差具有跨模型一致性,表明预训练过程可能引入了系统性偏误。解决方案的关键在于提出一种稳健的对齐评估方法,用于量化复杂教育任务中模型输出与真实教学目标之间的匹配程度,并揭示多模型集成策略反而会加剧这种错位,从而为改进模型在教育应用中的可靠性提供实证依据和理论洞见。
链接: https://arxiv.org/abs/2603.00883
作者: Michael Hardy,Yunsung Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Applications (stat.AP)
备注:
Abstract:LLMs increasingly excel on AI benchmarks, but doing so does not guarantee validity for downstream tasks. This study evaluates the performance of leading foundation models (FMs, i.e., generative pre-trained base LLMs) with out-of-distribution (OOD) tasks of the teaching and learning of schoolchildren. Across all FMs, inter-model behaviors on disparate tasks correlate higher than they do with expert human behaviors on target tasks. These biases shared across LLMs are poorly aligned with downstream measures of teaching quality and often *negatively* aligned with learning outcomes. Further, we find multi-model ensembles, both unanimous model voting and expert-weighting by benchmark performance, further exacerbate misalignment with learning. We measure that 50% of the variation in misalignment error is shared across foundation models, suggesting that common pretraining accounts for much of the misalignment in these tasks. We demonstrate methods for robustly measuring alignment of complex tasks and provide unique insights into both educational applications of foundation models and the limitations of such models.
[AI-136] MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains ICLR2026
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂、长链、跨模态推理任务中评估不足的问题,特别是现有基准测试主要聚焦于简化的问答(QA)任务和短检索链,未能充分刻画代理式多模态检索增强生成(agentic Multimodal Retrieval-Augmented Generation, MM-RAG)中的自适应规划与多模态推理能力。其解决方案的关键在于提出首个针对 agentic MM-RAG 的基准数据集 MC-Search,该数据集包含 3,333 个高质量示例,每个示例均带有长达 3.7 步的显式标注推理链,涵盖五类代表性推理结构,并通过 HAVE(Hop-wise Attribution and Verification of Evidence)机制确保证据溯源的准确性;同时引入过程级指标(如步骤级检索与规划准确率)以量化推理质量,从而实现对模型推理过程的精细化评估与优化。
链接: https://arxiv.org/abs/2603.00873
作者: Xuying Ning,Dongqi Fu,Tianxin Wei,Mengting Ai,Jiaru Zou,Ting-Wei Li,Hanghang Tong,Yada Zhu,Hendrik Hamann,Jingrui He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ICLR 2026
Abstract:With the increasing demand for step-wise, cross-modal, and knowledge-grounded reasoning, multimodal large language models (MLLMs) are evolving beyond the traditional fixed retrieve-then-generate paradigm toward more sophisticated agentic multimodal retrieval-augmented generation (MM-RAG). Existing benchmarks, however, mainly focus on simplified QA with short retrieval chains, leaving adaptive planning and multimodal reasoning underexplored. We present MC-Search, the first benchmark for agentic MM-RAG with long, step-wise annotated reasoning chains spanning five representative reasoning structures. Each example specifies sub-questions, retrieval modalities, supporting facts, and intermediate answers, with fidelity ensured by HAVE (Hop-wise Attribution and Verification of Evidence), resulting in 3,333 high-quality examples averaging 3.7 hops. Beyond answer accuracy, MC-Search introduces new process-level metrics for reasoning quality, stepwise retrieval and planning accuracy. By developing a unified agentic MM-RAG pipeline, we benchmark six leading MLLMs and reveal systematic issues such as over- and under-retrieval and modality-misaligned planning. Finally, we introduce Search-Align, a process-supervised fine-tuning framework leveraging verified reasoning chains, showing that our data not only enables faithful evaluation but also improves planning and retrieval fidelity in open-source MLLMs.
[AI-137] Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion ICASSP2026
【速读】:该论文旨在解决基于Transformer的Whisper模型在自动语音识别(ASR)任务中因多头注意力(Multi-Head Attention, MHA)机制导致的GPU内存消耗过大问题,尤其是在处理长音频时KV缓存线性增长带来的瓶颈。解决方案的关键在于引入一种新的多头潜在注意力(Multi-Head Latent Attention, MLA)机制,并将其系统性地应用于Whisper模型的编码器自注意力、解码器自注意力和交叉注意力模块中;实验表明,仅将MLA应用于解码器自注意力模块可在保持性能的同时实现最优的内存效率,使KV缓存大小最多减少87.5%,且可通过极少的微调即可将预训练的Whisper模型转换为Whisper-MLA架构。
链接: https://arxiv.org/abs/2603.00563
作者: Sen Zhang,Jianguo Wei,Wenhuan Lu,Xianghu Yue,Wei Li,Qiang Li,Pengcheng Zhao,Ming Cai,Luo Si
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: 5 pages, 3 figures, accepted at ICASSP 2026
Abstract:The Transformer-based Whisper model has achieved state-of-the-art performance in Automatic Speech Recognition (ASR). However, its Multi-Head Attention (MHA) mechanism results in significant GPU memory consumption due to the linearly growing Key-Value (KV) cache usage, which is problematic for many applications especially with long-form audio. To address this, we introduce Whisper-MLA, a novel architecture that incorporates Multi-Head Latent Attention (MLA) into the Whisper model. Specifically, we adapt MLA for Whisper’s absolute positional embeddings and systematically investigate its application across encoder self-attention, decoder self-attention, and cross-attention modules. Empirical results indicate that applying MLA exclusively to decoder self-attention yields the desired balance between performance and memory efficiency. Our proposed approach allows conversion of a pretrained Whisper model to Whisper-MLA with minimal fine-tuning. Extensive experiments on the LibriSpeech benchmark validate the effectiveness of this conversion, demonstrating that Whisper-MLA reduces the KV cache size by up to 87.5% while maintaining competitive accuracy.
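MLA 的内存收益来自把每层的 K/V 缓存压缩为单个低维潜向量。下面的估算仅为示意:层数、序列长度、隐藏维度与潜维度均为假设值(取 d_latent = d_model/4 恰好对应摘要中 87.5% 的缩减上限),并非 Whisper-MLA 的真实配置。

```python
# 示意性草图:估算 MHA 与 MLA 的 KV cache 大小(float16,每元素 2 字节)。
# MHA 每层每 token 缓存 K 和 V 各 d_model 维;MLA 只缓存一个 d_latent 维潜向量。
def kv_bytes_mha(n_layers, seq_len, d_model, bytes_per=2):
    return n_layers * seq_len * 2 * d_model * bytes_per   # K + V

def kv_bytes_mla(n_layers, seq_len, d_latent, bytes_per=2):
    return n_layers * seq_len * d_latent * bytes_per      # 单个潜向量

d_model = 1280                      # 假设值:Whisper large 量级的隐藏维度
d_latent = d_model // 4             # 假设的潜维度
mha = kv_bytes_mha(32, 1500, d_model)
mla = kv_bytes_mla(32, 1500, d_latent)
print(f"reduction = {1 - mla / mha:.1%}")  # reduction = 87.5%
```

由此可见缩减比例只取决于 d_latent 与 2·d_model 之比,与序列长度无关,因此长音频下节省同样成立。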
[AI-138] EMPA: Evaluating Persona-Aligned Empathy as a Process
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)对话代理在评估人格一致性同理心(persona-aligned empathy)时面临的挑战,包括用户状态隐含难测、实时反馈稀疏且难以验证,以及看似支持性的回复可能累积导致偏离个体特定需求的轨迹问题。解决方案的关键在于提出EMPA框架——一个过程导向的评估体系,将真实交互转化为可控的心理学基础场景,并结合开放式的多智能体沙盒环境以暴露策略适应与失败模式;通过方向一致性、累积影响和稳定性三个维度,在潜在心理空间中对对话轨迹进行评分,从而实现对长周期同理行为的可重复比较与优化。
链接: https://arxiv.org/abs/2603.00552
作者: Shiya Zhang,Yuhan Zhan,Ruixi Su,Ruihan Sun,Ziyi Song,Zhaohan Chen,Xiaofan Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Evaluating persona-aligned empathy in LLM-based dialogue agents remains challenging. User states are latent, feedback is sparse and difficult to verify in situ, and seemingly supportive turns can still accumulate into trajectories that drift from persona-specific needs. We introduce EMPA, a process-oriented framework that evaluates persona-aligned support as sustained intervention rather than isolated replies. EMPA distills real interactions into controllable, psychologically grounded scenarios, couples them with an open-ended multi-agent sandbox that exposes strategic adaptation and failure modes, and scores trajectories in a latent psychological space by directional alignment, cumulative impact, and stability. The resulting signals and metrics support reproducible comparison and optimization of long-horizon empathic behavior, and they extend to other agent settings shaped by latent dynamics and weak, hard-to-verify feedback.
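EMPA 在潜在心理空间中按方向一致性、累积影响、稳定性三个维度为对话轨迹打分。论文摘要未给出具体公式,下面用 numpy 给出一种假设的合理实现,仅说明"对潜空间状态轨迹整体打分、而非逐轮打分"的思路。

```python
import numpy as np

# 示意性草图:对潜在心理空间中的状态轨迹计算三项指标(公式为假设)。
def score_trajectory(states, goal_dir):
    states = np.asarray(states, dtype=float)
    goal = goal_dir / np.linalg.norm(goal_dir)
    steps = np.diff(states, axis=0)                        # 每轮的状态位移
    norms = np.linalg.norm(steps, axis=1)
    direction = float(np.mean((steps @ goal) / np.where(norms > 0, norms, 1)))  # 方向一致性
    impact = float((states[-1] - states[0]) @ goal)        # 沿目标方向的净位移
    stability = float(1.0 / (1.0 + np.std(norms)))         # 步幅越稳,得分越高
    return direction, impact, stability

traj = [[0, 0], [0.5, 0.1], [1.0, 0.0], [1.6, -0.1]]       # 假设的二维潜轨迹
d, i, s = score_trajectory(traj, np.array([1.0, 0.0]))
print(d > 0.9, i > 1.5, 0 < s <= 1)  # True True True
```

这种轨迹级度量能捕捉"单轮看似支持、整体却偏离目标"的漂移,这正是逐轮打分难以发现的失败模式。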
[AI-139] LOGIGEN: Logic-Driven Generation of Verifiable Agentic Tasks
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)向自主代理演进过程中,因训练数据稀缺导致的状态转移目标难以精确实现的问题。现有以工具为中心的逆向合成流程无法捕捉现实应用中的严谨逻辑,限制了模型在复杂、状态依赖环境中的表现。解决方案的关键在于提出一个基于逻辑驱动的数据合成框架 LOGIGEN,其核心包括三个支柱:硬编译策略锚定(Hard-Compiled Policy Grounding)、逻辑引导的前向合成(Logic-Driven Forward Synthesis)与确定性状态验证(Deterministic State Verification)。通过三代理协同机制——架构师(Architect)将自然语言策略转化为数据库约束以强制执行硬规则,设计者(Set Designer)初始化边界邻近状态以触发关键策略冲突,探索者(Explorer)在此环境中搜索因果解路径——生成了20,000个跨8个领域的可验证任务数据集,并采用基于验证的训练协议:监督微调(SFT)确保与硬编译策略一致,强化学习(RL)结合密集状态奖励优化长期目标达成能力,最终在 τ²-Bench 上实现79.5%的成功率,显著优于基线模型(40.7%)。
链接: https://arxiv.org/abs/2603.00540
作者: Yucheng Zeng,Weipeng Lu,Linyun Liu,Shupeng Li,Zitian Qu,Chenghao Zhu,Shaofei Li,Zhengdong Tan,Mengyue Liu,Haotian Zhao,Zhe Zhou,Jianmin Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The evolution of Large Language Models (LLMs) from static instruction-followers to autonomous agents necessitates operating within complex, stateful environments to achieve precise state-transition objectives. However, this paradigm is bottlenecked by data scarcity, as existing tool-centric reverse-synthesis pipelines fail to capture the rigorous logic of real-world applications. We introduce LOGIGEN, a logic-driven framework that synthesizes verifiable training data based on three core pillars: Hard-Compiled Policy Grounding, Logic-Driven Forward Synthesis, and Deterministic State Verification. Specifically, a Triple-Agent Orchestration is employed: the Architect compiles natural-language policy into database constraints to enforce hard rules; the Set Designer initializes boundary-adjacent states to trigger critical policy conflicts; and the Explorer searches this environment to discover causal solution paths. This framework yields a dataset of 20,000 complex tasks across 8 domains, where validity is strictly guaranteed by checking exact state equivalence. Furthermore, we propose a verification-based training protocol where Supervised Fine-Tuning (SFT) on verifiable trajectories establishes compliance with hard-compiled policy, while Reinforcement Learning (RL) guided by dense state-rewards refines long-horizon goal achievement. On τ²-Bench, LOGIGEN-32B(RL) achieves a 79.5% success rate, substantially outperforming the base model (40.7%). These results demonstrate that logic-driven synthesis combined with verification-based training effectively constructs the causally valid trajectories needed for next-generation agents.
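A hedged sketch of the "exact state equivalence" validity check the abstract describes, assuming database snapshots are represented as table-to-rows mappings (this snapshot format is an assumption for illustration, not from the paper):

```python
def canonical_state(db_state):
    """Normalize a database snapshot (table -> list of row dicts) into an
    order-independent, hashable form so two states compare exactly."""
    return {
        table: frozenset(tuple(sorted(row.items())) for row in rows)
        for table, rows in db_state.items()
    }

def states_equivalent(state_a, state_b):
    """Exact state equivalence: identical tables and identical row sets."""
    return canonical_state(state_a) == canonical_state(state_b)

# Illustrative snapshots: same content, different row order.
expected = {"orders": [{"id": 1, "status": "shipped"}, {"id": 2, "status": "open"}]}
achieved = {"orders": [{"id": 2, "status": "open"}, {"id": 1, "status": "shipped"}]}
print(states_equivalent(expected, achieved))  # row order does not matter
```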
[AI-140] Are LLMs Reliable Code Reviewers? Systematic Overcorrection in Requirement Conformance Judgement
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在代码与自然语言需求匹配任务中的可靠性问题,即LLM是否能准确判断代码实现是否满足给定的自然语言规格说明。研究发现,即使使用广泛采用的基准和统一提示设计,LLM仍频繁将正确代码误判为不符合要求或存在缺陷;更令人意外的是,增加提示细节(如要求解释和修正建议)反而导致更高的误判率,揭示了LLM作为代码审查助手的关键可靠性缺陷。解决方案的核心在于提出一种“修复引导验证过滤器”(Fix-guided Verification Filter),其创新性地将模型提出的修复方案视为可执行的反事实证据,通过基准测试与规范约束增强测试对原始实现和修改后版本进行双重验证,从而提升LLM辅助代码审查的准确性与可信度。
链接: https://arxiv.org/abs/2603.00539
作者: Haolin Jin,Huaming Chen
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have become essential tools in software development, widely used for requirements engineering, code generation, and review tasks. Software engineers often rely on LLMs to verify whether a code implementation satisfies task requirements, thereby ensuring code robustness and accuracy. However, it remains unclear whether LLMs can reliably assess code against the given task descriptions, which are usually provided as natural-language specifications. In this paper, we uncover a systematic failure of LLMs in matching code to natural language requirements. Specifically, with widely adopted benchmarks and a unified prompt design, we demonstrate that LLMs frequently misclassify correct code implementations as non-compliant or defective. Surprisingly, we find that more detailed prompt designs, particularly those requiring explanations and proposed corrections, lead to higher misjudgment rates, highlighting critical reliability issues for LLM-based code assistants. We further analyze the mechanisms driving these failures and evaluate the reliability of rationale-required judgments. Building on these findings, we propose a Fix-guided Verification Filter that treats the model-proposed fix as executable counterfactual evidence, and validates the original and revised implementations using benchmark tests and spec-constrained augmented tests. Our results expose previously under-explored limitations in LLM-based code review capabilities, and provide practical guidance for integrating LLM-based reviewers with safeguards in automated review and development pipelines.
[AI-141] DenoiseFlow: Uncertainty-Aware Denoising for Reliable LLM Agentic Workflows
【速读】:该论文旨在解决自主代理在执行复杂、长时序任务(如数学推理和软件生成)时,因多步推理链中语义模糊性累积导致的可靠性下降问题,即“累积语义模糊性”(accumulated semantic ambiguity)。其解决方案的关键在于提出一个闭环框架 DenoiseFlow,该框架将多步推理建模为带有噪声的马尔可夫决策过程(Noisy MDP),并通过三个协同阶段实现渐进式去噪:(1) 感知每步的语义不确定性;(2) 基于风险自适应分配计算资源,动态选择单路径快速执行或并行探索;(3) 通过基于影响的根因定位进行针对性纠错。该方法在线自我校准决策边界以匹配验证器反馈,无需真实标签,在六个基准测试上平均准确率提升1.3%,同时降低40–56%的计算成本。
链接: https://arxiv.org/abs/2603.00532
作者: Yandong Yan,Junwei Peng,Shijie Li,Chenxi Li,Yifei Shang,Can Deng,Ruiting Dai,Yongqiang Zhao,Jiaqi Zhu,Yu Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Autonomous agents are increasingly entrusted with complex, long-horizon tasks, ranging from mathematical reasoning to software generation. While agentic workflows facilitate these tasks by decomposing them into multi-step reasoning chains, reliability degrades significantly as the sequence lengthens. Specifically, minor interpretation errors in natural-language instructions tend to compound silently across steps. We term this failure mode accumulated semantic ambiguity. Existing approaches to mitigate this often lack runtime adaptivity, relying instead on static exploration budgets, reactive error recovery, or single-path execution that ignores uncertainty entirely. We formalize the multi-step reasoning process as a Noisy MDP and propose DenoiseFlow, a closed-loop framework that performs progressive denoising through three coordinated stages: (1) Sensing estimates per-step semantic uncertainty; (2) Regulating adaptively allocates computation by routing between fast single-path execution and parallel exploration based on estimated risk; and (3) Correcting performs targeted recovery via influence-based root-cause localization. Online self-calibration continuously aligns decision boundaries with verifier feedback, requiring no ground-truth labels. Experiments on six benchmarks spanning mathematical reasoning, code generation, and multi-hop QA show that DenoiseFlow achieves the highest accuracy on every benchmark (83.3% average, +1.3% over the strongest baseline) while reducing cost by 40–56% through adaptive branching. Detailed ablation studies further confirm the framework's robustness and generality. Code is available at this https URL.
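A minimal sketch of the sense-then-regulate idea: estimate per-step semantic uncertainty as the entropy of a distribution over candidate interpretations, then route between single-path execution and parallel exploration. The threshold and branch count are illustrative values, not from the paper:

```python
import math

def step_entropy(probs):
    """Shannon entropy (nats) of a per-step distribution over candidate
    interpretations; higher entropy means more semantic ambiguity."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def route(probs, threshold=0.5, n_parallel=4):
    """Risk-adaptive routing: cheap single-path execution when the step
    looks unambiguous, parallel exploration otherwise."""
    h = step_entropy(probs)
    return ("parallel", n_parallel) if h > threshold else ("single", 1)

print(route([0.97, 0.02, 0.01]))  # confident step -> single path
print(route([0.4, 0.35, 0.25]))   # ambiguous step -> branch out
```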
[AI-142] Phys-Diff: A Physics-Inspired Latent Diffusion Model for Tropical Cyclone Forecasting ICASSP2026
【速读】:该论文旨在解决深度学习方法在热带气旋(Tropical Cyclone, TC)预测中因忽略TC属性间物理关系而导致预测结果缺乏物理一致性的问题。解决方案的关键在于提出Phys-Diff——一种受物理启发的潜在扩散模型,其通过将潜在特征解耦为任务特定组件(如轨迹、中心气压、风速),并引入跨任务注意力机制以嵌入先验物理引导的归纳偏置,从而显式建模TC各属性间的物理一致依赖关系;同时,该模型利用Transformer编码器-解码器架构融合多模态数据(历史TC属性、ERA5再分析数据及FengWu预报场),显著提升预测性能。
链接: https://arxiv.org/abs/2603.00521
作者: Lei Liu,Xiaoning Yu,Kang Chen,Jiahui Huang,Tengyuan Liu,Hongwei Zhao,Bin Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 5 pages, 4 figures. Accepted to IEEE ICASSP 2026
Abstract:Tropical cyclone (TC) forecasting is critical for disaster warning and emergency response. Deep learning methods address computational challenges but often neglect physical relationships between TC attributes, resulting in predictions lacking physical consistency. To address this, we propose Phys-Diff, a physics-inspired latent diffusion model that disentangles latent features into task-specific components (trajectory, pressure, wind speed) and employs cross-task attention to introduce prior physics-inspired inductive biases, thereby embedding physically consistent dependencies among TC attributes. Phys-Diff integrates multimodal data including historical cyclone attributes, ERA5 reanalysis data, and FengWu forecast fields via a Transformer encoder-decoder architecture, further enhancing forecasting performance. Experiments demonstrate state-of-the-art performance on global and regional datasets.
[AI-143] FastBUS: A Fast Bayesian Framework for Unified Weakly-Supervised Learning
【速读】:该论文旨在解决弱监督学习中因标签不精确而导致的多种弱监督设置下标签分布推断效率低的问题,现有方法普遍存在手动预处理复杂、忽略标签间关联性或无法批量处理导致运行时间过长等缺陷。其解决方案的关键在于将标签暴力搜索过程建模为标签变量的概率转移过程,通过将不同弱监督下的深度优先搜索(DFS)树结构压缩为共享的贝叶斯网络,进而基于广义信念传播(generalized belief propagation)推导出潜在概率计算算法,并引入两个联合加速策略:一是利用低秩假设近似转移矩阵以降低时间复杂度;二是设计端到端的状态演化模块来学习批量规模的转移矩阵,实现多类别批量处理。该方法在多数场景下与期望最大化(EM)算法等价,且实验表明其在弱监督设置下达到当前最优性能(SOTA),同时相较其他通用方法提速可达数百倍。
链接: https://arxiv.org/abs/2603.00517
作者: Ziquan Wang,Haobo Wang,Ke Chen,Lei Feng,Gang Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages, 5 figures
Abstract:Machine Learning often involves various imprecise labels, leading to diverse weakly supervised settings. While recent methods aim for universal handling, they usually suffer from complex manual pre-work, ignore the relationships between associated labels, or are unable to batch process due to computational design flaws, resulting in long running times. To address these limitations, we propose a novel general framework that efficiently infers latent true label distributions across various weak supervisions. Our key idea is to express the label brute-force search process as a probabilistic transition of label variables, compressing diverse weakly supervised DFS tree structures into a shared Bayesian network. From this, we derived a latent probability calculation algorithm based on generalized belief propagation and proposed two joint acceleration strategies: 1) introducing a low-rank assumption to approximate the transition matrix, reducing time complexity; 2) designing an end-to-end state evolution module to learn batch-scale transition matrices, facilitating multi-category batch processing. In addition, the equivalence of our method with the EM algorithm in most scenarios is further demonstrated. Extensive experiments show that our method achieves SOTA results under most weakly supervised settings, and achieves up to hundreds of times faster acceleration in running time compared to other general methods.
[AI-144] WirelessAgent: Automated Agentic Workflow Design and Benchmarking for Wireless Networks
【速读】:该论文旨在解决当前无线网络中AI代理(AI agent)设计依赖人工编写提示(prompt)和静态工作流所带来的劳动密集、难以扩展且性能欠佳的问题。其核心解决方案是提出WirelessAgent++框架,将每个代理工作流建模为由模块化算子构成的可执行代码,并将其转化为程序搜索问题;通过引入领域自适应的蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)算法自动发现最优工作流,从而实现无线任务中智能代理的自动化设计与优化。
链接: https://arxiv.org/abs/2603.00501
作者: Jingwen Tong,Zijian Li,Fang Liu,Wei Guo,Jun Zhang
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: This manuscript has submitted to a possible Journal for publication
Abstract:The integration of large language models (LLMs) into wireless networks has sparked growing interest in building autonomous AI agents for wireless tasks. However, existing approaches rely heavily on manually crafted prompts and static agentic workflows, a process that is labor-intensive, unscalable, and often suboptimal. In this paper, we propose WirelessAgent++, a framework that automates the design of agentic workflows for various wireless tasks. By treating each workflow as an executable code composed of modular operators, WirelessAgent++ casts agent design as a program search problem and solves it with a domain-adapted Monte Carlo Tree Search (MCTS) algorithm. Moreover, we establish WirelessBench, a standardized multi-dimensional benchmark suite comprising Wireless Communication Homework (WCHW), Network Slicing (WCNS), and Mobile Service Assurance (WCMSA), covering knowledge reasoning, code-augmented tool use, and multi-step decision-making. Experiments demonstrate that WirelessAgent++ autonomously discovers superior workflows, achieving test scores of 78.37% (WCHW), 90.95% (WCNS), and 97.07% (WCMSA), with a total search cost below $5 per task. Notably, our approach outperforms state-of-the-art prompting baselines by up to 31% and general-purpose workflow optimizers by 11.1%, validating its effectiveness in generating robust, self-evolving wireless agents. The code is available at this https URL.
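MCTS-based program search needs a selection rule over candidate workflow edits; below is a sketch of standard UCT selection (the candidate edits and their statistics are invented, and the paper's domain-adapted MCTS may differ from plain UCT):

```python
import math

def uct_score(total_reward, visits, parent_visits, c=1.4):
    """Standard UCT: exploit the average reward, explore rarely-tried
    nodes; an unvisited node is always tried first."""
    if visits == 0:
        return float("inf")
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select(children, parent_visits):
    """children: {workflow_edit: (total_reward, visits)} -- illustrative
    stand-ins for candidate operator insertions or replacements."""
    return max(
        children,
        key=lambda k: uct_score(children[k][0], children[k][1], parent_visits),
    )

children = {
    "add_retry_operator": (3.0, 5),  # avg 0.60, well explored
    "swap_tool_call":     (2.0, 2),  # avg 1.00, less explored
    "insert_validator":   (0.0, 0),  # never tried -> selected first
}
print(select(children, parent_visits=7))
```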
[AI-145] A Polynomial-Time Axiomatic Alternative to SHAP for Feature Attribution
【速读】:该论文旨在解决SHAP(Shapley Additive Explanations)在高维特征场景下计算成本过高、难以扩展的问题,从而为生成式 AI (Generative AI) 和其他机器学习模型提供更高效且理论严谨的特征归因方法。其解决方案的关键在于提出一种名为 ESENSC_rev2 的低复杂度归因规则,该规则基于合作博弈论中的 XAI–TU 博弈框架构建,结合两种多项式时间闭式解法,在保证零玩家性质(null-player property)的前提下,实现了对精确 SHAP 的良好近似;同时通过公理化刻画证明了该规则由效率性、零玩家公理、受限微分边际性原则、中间非关键博弈性质及降低计算需求的公理共同唯一确定,从而在理论上保障了其合理性与实用性。
链接: https://arxiv.org/abs/2603.00496
作者: Kazuhiro Hiraki,Shinichi Ishihara,Takumi Kongo,Junnosuke Shino
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 28 pages, 4 figures, 2 tables. Code will be released
Abstract:In this paper, we provide a theoretically grounded and computationally efficient alternative to SHAP. To this end, we study feature attribution through the lens of cooperative game theory by formulating a class of XAI–TU games. Building on this formulation, we investigate equal-surplus-type and proportional-allocation-type attribution rules and propose a low-cost attribution rule, ESENSC_rev2, constructed by combining two polynomial-time closed-form rules while ensuring the null-player property in the XAI–TU domain. Extensive experiments on tabular prediction tasks demonstrate that ESENSC_rev2 closely approximates exact SHAP while substantially improving scalability as the number of features increases. These empirical results indicate that equal-surplus-type attribution rules can achieve favorable trade-offs between computational cost and approximation accuracy in high-dimensional explainability settings. To provide theoretical foundations for these findings, we establish an axiomatic characterization showing that ESENSC_rev2 is uniquely determined by efficiency, the null-player axiom, a restricted differential marginality principle, an intermediate inessential-game property, and axioms that reduce computational requirements. Our results suggest that axiomatically justified and computationally efficient attribution rules can serve as practical and theoretically principled substitutes for SHAP-based approximations in modern explainability pipelines.
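For context on why a polynomial-time rule matters: exact Shapley values require a sum over all coalitions per player, which is exponential in the number of features. A small self-contained computation on an illustrative toy game (not from the paper), also checking the efficiency axiom:

```python
from itertools import combinations
from math import factorial

def exact_shapley(players, v):
    """Exact Shapley values: for each player, sum over all 2^(n-1)
    coalitions of the others -- the exponential cost that makes exact
    SHAP intractable for many features."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for r in range(n):
            for S in combinations(others, r):
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += w * (v(frozenset(S) | {i}) - v(frozenset(S)))
        phi[i] = total
    return phi

# Toy game: additive weights plus a synergy bonus when 1 and 2 co-occur.
weights = {1: 2.0, 2: 1.0, 3: 0.5}
def v(S):
    return sum(weights[i] for i in S) + (1.0 if {1, 2} <= S else 0.0)

phi = exact_shapley([1, 2, 3], v)
print({k: round(x, 2) for k, x in phi.items()})
print(round(sum(phi.values()), 2))  # efficiency: equals v({1,2,3}) = 4.5
```

The synergy bonus is split equally between players 1 and 2, as symmetry requires.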
[AI-146] AI Runtime Infrastructure
【速读】:该论文旨在解决智能体(Agent)在长时间任务执行过程中,因环境动态变化、资源限制或行为偏差导致的任务成功率下降、延迟增加、资源浪费及安全风险等问题。现有模型层面的优化或被动日志记录系统难以应对运行时的复杂性与不确定性。解决方案的关键在于引入AI运行时基础设施(AI Runtime Infrastructure),它作为模型层与应用层之间的独立执行时层,在智能体运行过程中主动观测、推理并干预其行为,将执行过程本身视为可优化的表面,从而实现自适应内存管理、故障检测与恢复、策略执行等功能,显著提升任务成功率、延迟控制、令牌效率、可靠性与安全性。
链接: https://arxiv.org/abs/2603.00495
作者: Christopher Cruz
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce AI Runtime Infrastructure, a distinct execution-time layer that operates above the model and below the application, actively observing, reasoning over, and intervening in agent behavior to optimize task success, latency, token efficiency, reliability, and safety while the agent is running. Unlike model-level optimizations or passive logging systems, runtime infrastructure treats execution itself as an optimization surface, enabling adaptive memory management, failure detection, recovery, and policy enforcement over long-horizon agent workflows.
[AI-147] AESP: A Human-Sovereign Economic Protocol for AI Agents with Privacy-Preserving Settlement
【速读】:该论文旨在解决AI代理在执行经济任务时面临的根本性矛盾:即如何在保障代理自主性(autonomy)的同时,确保人类对金融资产的控制权(human control over financial assets)。解决方案的关键在于提出一种分层协议——代理经济主权协议(Agent Economic Sovereignty Protocol, AESP),其核心机制是通过五项技术手段强制维持“代理具备经济能力但不具备经济主权”的不变性(invariant):(1) 确定性的八检查策略引擎与分级升级机制;(2) 人机协同审查(human-in-the-loop review),包含自动、显式及生物特征三级验证;(3) 基于EIP-712的双签名承诺与托管机制;(4) 基于HKDF的上下文隔离隐私保护与批量聚合;(5) 基于ACE-GF的密码学底层结构。这些机制共同实现机器级交易速度与人类治理边界之间的安全平衡。
链接: https://arxiv.org/abs/2603.00318
作者: Jian Sheng Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 16 pages, 1 figure
Abstract:As AI agents increasingly perform economic tasks on behalf of humans, a fundamental tension arises between agent autonomy and human control over financial assets. We present the Agent Economic Sovereignty Protocol (AESP), a layered protocol in which agents transact autonomously at machine speed on crypto-native infrastructure while remaining cryptographically bound to human-defined governance boundaries. AESP enforces the invariant that agents are economically capable but never economically sovereign through five mechanisms: (1) a deterministic eight-check policy engine with tiered escalation; (2) human-in-the-loop review with automatic, explicit, and biometric tiers; (3) EIP-712 dual-signed commitments with escrow; (4) HKDF-based context-isolated privacy with batched consolidation; and (5) an ACE-GF-based cryptographic substrate. We formalize two testable hypotheses on security coverage and latency overhead, and specify a complete evaluation methodology with baselines and ablation design. The protocol is implemented as an open-source TypeScript SDK (208 tests, ten modules) with interoperability via MCP and A2A.
[AI-148] How Well Do Multimodal Models Reason on ECG Signals?
【速读】:该论文旨在解决多模态大语言模型在医疗领域应用中“黑箱”问题,即如何有效验证模型生成的推理轨迹(reasoning traces)是否具备语义正确性和临床逻辑一致性。现有评估方法或依赖人工专家评审而不可扩展,或仅使用问答(QA)等代理指标,无法真实反映临床推理的准确性。其解决方案的关键在于将推理过程解构为两个独立组件:感知(Perception)与推断(Deduction)。感知模块通过代理框架生成代码以实证验证推理轨迹中描述的时间结构;推断模块则采用基于检索的方法,将模型逻辑与结构化临床标准数据库对齐,从而实现对“真正”推理能力的可扩展、自动化验证。
链接: https://arxiv.org/abs/2603.00312
作者: Maxwell A. Xu,Harish Haresumadram,Catherine W. Liu,Patrick Langer,Jathurshan Pradeepkumar,Wanting Mao,Sunita J. Ferns,Aradhana Verma,Jimeng Sun,Paul Schmiedmayer,Xin Liu,Daniel McDuff,Emily B. Fox,James M. Rehg
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:While multimodal large language models offer a promising solution to the “black box” nature of health AI by generating interpretable reasoning traces, verifying the validity of these traces remains a critical challenge. Existing evaluation methods are either unscalable, relying on manual clinician review, or superficial, utilizing proxy metrics (e.g. QA) that fail to capture the semantic correctness of clinical logic. In this work, we introduce a reproducible framework for evaluating reasoning in ECG signals. We propose decomposing reasoning into two distinct components: (i) Perception, the accurate identification of patterns within the raw signal, and (ii) Deduction, the logical application of domain knowledge to those patterns. To evaluate Perception, we employ an agentic framework that generates code to empirically verify the temporal structures described in the reasoning trace. To evaluate Deduction, we measure the alignment of the model’s logic against a structured database of established clinical criteria in a retrieval-based approach. This dual-verification method enables the scalable assessment of “true” reasoning capabilities.
[AI-149] Polynomial Surrogate Training for Differentiable Ternary Logic Gate Networks
【速读】:该论文旨在解决现有可微逻辑门网络(Differentiable Logic Gate Networks, DLGNs)仅限于二值逻辑(16种两输入二元门)的局限性,以及扩展至三值Kleene逻辑(Ternary Kleene K3)时因潜在门组合爆炸(每神经元达19,683种可能)导致传统基于softmax的门选择训练方法不可行的问题。其解决方案的关键在于提出多项式代理训练(Polynomial Surrogate Training, PST),将每个三值神经元建模为一个度数为(2,2)的多项式,仅需学习9个可训练系数(相比原参数空间减少至约1/2187),并理论证明训练网络与离散逻辑电路之间的差距由一个数据无关的承诺损失(commitment loss)所界定,且该损失在收敛时趋于零。这一方法不仅显著提升训练效率(快2–3倍),还实现了对真正功能多样化的三值逻辑门的发现,并验证了“未知”状态作为贝叶斯最优不确定性代理的有效性,从而支持选择性预测以提升精度。
链接: https://arxiv.org/abs/2603.00302
作者: Sai Sandeep Damera,Ryan Matheu,Aniruddh G. Puranic,John S. Baras
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: 28 pages, 13 figures. Submitted to 3rd International Conference on Neuro-Symbolic Systems (NeuS) 2026
Abstract:Differentiable logic gate networks (DLGNs) learn compact, interpretable Boolean circuits via gradient-based training, but all existing variants are restricted to the 16 two-input binary gates. Extending DLGNs to Ternary Kleene K3 logic and training DTLGNs, where the UNKNOWN state enables principled abstention under uncertainty, is desirable. However, the support set of potential gates per neuron explodes to 19,683, making the established softmax-over-gates training approach intractable. We introduce Polynomial Surrogate Training (PST), which represents each ternary neuron as a degree-(2,2) polynomial with 9 learnable coefficients (a 2,187× parameter reduction) and prove that the gap between the trained network and its discretized logic circuit is bounded by a data-independent commitment loss that vanishes at convergence. Scaling experiments from 48K to 512K neurons on CIFAR-10 demonstrate that this hardening gap contracts with overparameterization. Ternary networks train 2-3× faster than binary DLGNs and discover true ternary gates that are functionally diverse. On synthetic and tabular tasks we find that the UNKNOWN output acts as a Bayes-optimal uncertainty proxy, enabling selective prediction in which ternary circuits surpass binary accuracy once low-confidence predictions are filtered. More broadly, PST establishes a general polynomial-surrogate methodology whose parameterization cost grows only quadratically with logic valence, opening the door to many-valued differentiable logic.
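The degree-(2,2) surrogate with 9 coefficients suffices to represent any two-input ternary gate exactly, because a tensor-product polynomial of that degree can interpolate all 9 points of the 3×3 input grid. A sketch using Kleene AND under a -1/0/+1 encoding (the encoding and the Lagrange construction are illustrative, not the paper's training procedure):

```python
# Encode Kleene K3 truth values as -1 (false), 0 (unknown), +1 (true).
VALS = (-1, 0, 1)

def lagrange_1d(x):
    """Lagrange basis on nodes {-1, 0, 1}: three degree-2 polynomials."""
    return (x * (x - 1) / 2, (1 - x) * (1 + x), x * (x + 1) / 2)

def fit_gate(gate):
    """Tabulate a two-input ternary gate; the tensor-product Lagrange
    form below is exactly a degree-(2,2) polynomial with 9 coefficients."""
    return {(a, b): gate(a, b) for a in VALS for b in VALS}

def poly_eval(table, a, b):
    la, lb = lagrange_1d(a), lagrange_1d(b)
    return sum(
        table[(VALS[i], VALS[j])] * la[i] * lb[j]
        for i in range(3) for j in range(3)
    )

kleene_and = lambda a, b: min(a, b)  # Kleene AND under this encoding
table = fit_gate(kleene_and)
# The polynomial reproduces the gate exactly on all 9 ternary inputs.
print(all(poly_eval(table, a, b) == kleene_and(a, b) for a in VALS for b in VALS))
```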
[AI-150] TraderBench: How Robust Are AI Agents in Adversarial Capital Markets? ICLR2026
【速读】:该论文旨在解决金融领域中AI代理(AI agent)评估面临的两大核心问题:一是静态基准测试需要昂贵的专家标注,且无法捕捉真实交易中动态决策的本质;二是基于大语言模型(LLM)的评判者在特定领域任务上引入不可控的方差。其解决方案的关键在于提出TraderBench基准框架,该框架融合了专家验证的静态任务(如知识检索与分析推理)与对抗性交易模拟,后者仅以实际绩效指标(夏普比率、收益和回撤)评分,从而彻底消除评判者的主观差异。此外,该框架包含两个创新赛道:加密货币交易(含四种渐进式市场操纵变换)和期权衍生品交易(涵盖盈亏准确性、希腊值敏感度及风险管理),并通过引入新市场数据定期刷新场景,防止基准污染。这一设计实现了对AI代理在金融场景下真实适应能力的可靠评估。
链接: https://arxiv.org/abs/2603.00285
作者: Xiaochuang Yuan,Hui Xu,Silvia Xu,Cui Zou,Jing Xiong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Equal Contribution: Xiaochuang Yuan and Hui Xu contributed equally to this work. All correspondence should be directed to yxc20098@gmail.com. Submitted to Agents in the Wild Workshop, ICLR2026
Abstract:Evaluating AI agents in finance faces two key challenges: static benchmarks require costly expert annotation yet miss the dynamic decision-making central to real-world trading, while LLM-based judges introduce uncontrolled variance on domain-specific tasks. We introduce TraderBench, a benchmark that addresses both issues. It combines expert-verified static tasks (knowledge retrieval, analytical reasoning) with adversarial trading simulations scored purely on realized performance (Sharpe ratio, returns, and drawdown), eliminating judge variance entirely. The framework features two novel tracks: crypto trading with four progressive market-manipulation transforms, and options derivatives scoring across P&L accuracy, Greeks, and risk management. Trading scenarios can be refreshed with new market data to prevent benchmark contamination. Evaluating 13 models (8B open-source to frontier) on ~50 tasks, we find: (1) 8 of 13 models score ~33 on crypto with 1-point variation across adversarial conditions, exposing fixed non-adaptive strategies; (2) extended thinking helps retrieval (+26 points) but has zero impact on trading (+0.3 crypto, -0.1 options). These findings reveal that current agents lack genuine market adaptation, underscoring the need for performance-grounded evaluation in finance.
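The performance metrics used for scoring have standard definitions; a hedged sketch of annualized Sharpe ratio and maximum drawdown (the sample equity curve is invented for illustration):

```python
import math

def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio of per-period returns (risk-free rate 0)."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / n
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)

def max_drawdown(equity_curve):
    """Largest peak-to-trough loss, as a fraction of the running peak."""
    peak, worst = equity_curve[0], 0.0
    for x in equity_curve:
        peak = max(peak, x)
        worst = max(worst, (peak - x) / peak)
    return worst

equity = [100, 104, 102, 108, 97, 103]
returns = [b / a - 1 for a, b in zip(equity, equity[1:])]
print(round(max_drawdown(equity), 3))  # 108 -> 97 is a 10.2% drawdown
```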
[AI-151] GENAI WORKBENCH: AI-Assisted Analysis and Synthesis of Engineering Systems from Multimodal Engineering Data
【速读】:该论文旨在解决现代工程设计平台在系统工程(Systems Engineering)层面的割裂问题,即当前CAD、CAM、CAE等工具虽擅长专业任务,但缺乏原生的系统工程框架,导致系统级需求与架构管理与部件详细设计分离,从而增加集成风险并阻碍整体开发效率。解决方案的关键在于提出GenAI Workbench这一基于模型的系统工程(Model-Based Systems Engineering, MBSE)概念框架,其核心是构建一个统一的数字主线(Digital Thread),通过开源PLM平台整合文档语义数据、物理B-rep几何信息及关系型系统图谱,并利用生成式AI(Generative AI)技术实现从源文档自动提取需求并生成初始系统架构(如设计结构矩阵DSM),从而将系统工程原则无缝嵌入设计师工作流,推动更集成、数据驱动且知情的工程设计方法。
链接: https://arxiv.org/abs/2603.00251
作者: H. Sinan Bank,Daniel R. Herber
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 7 pages, 3 figures, accepted to be presented at IISE Annual Conference 2026
Abstract:Modern engineering design platforms excel at discipline-specific tasks such as CAD, CAM, and CAE, but often lack native systems engineering frameworks. This creates a disconnect where system-level requirements and architectures are managed separately from detailed component design, hindering holistic development and increasing integration risks. To address this, we present the conceptual framework for the GenAI Workbench, a Model-Based Systems Engineering (MBSE) environment that integrates systems engineering principles into the designer’s workflow. Built on an open-source PLM platform, it establishes a unified digital thread by linking semantic data from documents, physical B-rep geometry, and relational system graphs. The workbench facilitates an AI-assisted workflow where a designer can ingest source documents, from which the system automatically extracts requirements and uses vision-language models to generate an initial system architecture, such as a Design Structure Matrix (DSM). This paper presents the conceptual architecture, proposed methodology, and anticipated impact of this work-in-progress framework, which aims to foster a more integrated, data-driven, and informed engineering design methodology.
[AI-152] Empowering Future Cybersecurity Leaders: Advancing Students through FINDS Education for Digital Forensic Excellence
【速读】:该论文旨在解决人工智能(AI)驱动的数字取证领域中网络安全人才技能培养的系统性难题,尤其关注如何有效建模跨学科能力依赖关系并量化评估技术熟练度与科研准备度。其解决方案的关键在于提出一种基于有向无环图(Directed Acyclic Graph, DAG)的多依赖能力构建技能图(Multidependency Capacity Building Skills Graph, MCBSG),该模型能够编码AI赋能的取证编程、统计推断、数字证据处理及威胁检测等核心能力之间的层级与跨域依赖关系,并支持结构化技能习得路径建模和可量化的能力建设评估。通过监督学习方法(如基于信息熵的决策树分类器和回归建模)对纵向多 cohort 数据集进行分析,识别出影响技术熟练度和研究准备度的关键预测因子,最终在三年统计评估中验证了MCBSG在提升取证编程准确率、对抗推理能力和高性能计算(HPC)支持的调查工作流方面的显著成效,从而为数据驱动、包容性强的国家安全 workforce 教育提供可扩展且可解释的框架。
链接: https://arxiv.org/abs/2603.00222
作者: Yashas Hariprasad,Subhash Gurappa,Sundararaj S. Iyengar,Jerry F. Miller,Pronab Mohanty,Naveen Kumar Chaudhary
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:The Forensics Investigations Network in Digital Sciences (FINDS) Research Center of Excellence (CoE), funded by the U.S. Army Research Laboratory, advances Digital Forensic Engineering Education (DFEE) through an integrated research education framework for AI enabled cybersecurity workforce development. FINDS combines high performance computing (HPC), secure software engineering, adversarial analytics, and experiential learning to address emerging cyber and synthetic media threats. This paper introduces the Multidependency Capacity Building Skills Graph (MCBSG), a directed acyclic graph based model that encodes hierarchical and cross domain dependencies among competencies in AI-driven forensic programming, statistical inference, digital evidence processing, and threat detection. The MCBSG enables structured modeling of skill acquisition pathways and quantitative capacity assessment. Supervised machine learning methods, including entropy-based Decision Tree Classifiers and regression modeling, are applied to longitudinal multi cohort datasets capturing mentoring interactions, laboratory performance metrics, curriculum artifacts, and workshop participation. Feature importance analysis and cross validation identify key predictors of technical proficiency and research readiness. Three year statistical evaluation demonstrates significant gains in forensic programming accuracy, adversarial reasoning, and HPC-enabled investigative workflows. Results validate the MCBSG as a scalable, interpretable framework for data-driven, inclusive cybersecurity education aligned with national defense workforce priorities.
[AI-153] Agentic Scientific Simulation: Execution-Grounded Model Construction and Reconstruction
【速读】:该论文旨在解决生成式 AI 在物理仿真建模中因自然语言描述的模糊性而导致的模型构建不可靠与不可复现的问题。核心挑战在于,自然语言对仿真模型的描述通常存在隐含选择(implicit choices),不同合理假设会生成物理有效但科学上不同的配置,若不显式识别并处理这些歧义,则无法保证结果的正确性和可复现性。解决方案的关键在于提出一种“执行-解释-验证”(interpret-act-validate)循环驱动的代理式科学仿真框架,其中仿真器作为物理有效性的权威仲裁者而非仅运行时工具;具体实现为 JutulGPT,它结合结构化文档检索、代码合成、静态分析、执行及求解器诊断的系统解读,显式检测并自主或通过用户交互解决未指定建模选项,并记录假设日志以增强透明度。这一方法使模型构建可被仿真器验证,同时揭示了依赖默认值的隐式决策在假设日志中不可见的结构性限制。
链接: https://arxiv.org/abs/2603.00214
作者: Knut-Andreas Lie,Olav Møyner,Elling Svee,Jakob Torben
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Mathematical Software (cs.MS); Geophysics (physics.geo-ph)
备注:
Abstract:LLM agents are increasingly used for code generation, but physics-based simulation poses a deeper challenge: natural-language descriptions of simulation models are inherently underspecified, and different admissible resolutions of implicit choices produce physically valid but scientifically distinct configurations. Without explicit detection and resolution of these ambiguities, neither the correctness of the result nor its reproducibility from the original description can be assured. This paper investigates agentic scientific simulation, where model construction is organized as an execution-grounded interpret-act-validate loop and the simulator serves as the authoritative arbiter of physical validity rather than merely a runtime. We present JutulGPT, a reference implementation built on the fully differentiable Julia-based reservoir simulator JutulDarcy. The agent combines structured retrieval of documentation and examples with code synthesis, static analysis, execution, and systematic interpretation of solver diagnostics. Underspecified modelling choices are detected explicitly and resolved either autonomously (with logged assumptions) or through targeted user queries. The results demonstrate that agent-mediated model construction can be grounded in simulator validation, while also revealing a structural limitation: choices resolved tacitly through simulator defaults are invisible to the assumption log and to any downstream representation. A secondary experiment with autonomous reconstruction of a reference model from progressively abstract textual descriptions shows that reconstruction variability exposes latent degrees of freedom in simulation descriptions and provides a practical methodology for auditing reproducibility. All code, prompts, and agent logs are publicly available. 
[AI-154] Universal NP-Hardness of Clustering under General Utilities
【速读】:该论文旨在解决聚类算法在实践中因缺乏统一理论框架而导致的稳定性差、对表示、超参数和初始化敏感等问题。其解决方案的关键在于提出通用聚类问题(Universal Clustering Problem, UCP),即在有限度量空间上最大化一个多项式时间可计算的划分效用函数,从而形式化不同聚类范式共有的优化核心。通过从图着色和3集精确覆盖(X3C)两个独立问题出发证明UCP是NP难的,并将包括k-means、高斯混合模型(GMMs)、DBSCAN、谱聚类和相关传播在内的十种主流聚类方法映射到UCP框架,揭示了它们均继承这一基本计算难解性,进而为聚类失败模式(如交替优化中的局部最优和层次聚类中的贪心合并顺序陷阱)提供了统一解释。该研究推动聚类方法向具有稳定性意识的目标函数和具备显式保证的交互驱动建模方向发展。
链接: https://arxiv.org/abs/2603.00210
作者: Angshul Majumdar
机构: 未知
类目: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI)
备注:
Abstract:Clustering is a central primitive in unsupervised learning, yet practice is dominated by heuristics whose outputs can be unstable and highly sensitive to representations, hyperparameters, and initialisation. Existing theoretical results are largely objective-specific and do not explain these behaviours at a unifying level. We formalise the common optimisation core underlying diverse clustering paradigms by defining the Universal Clustering Problem (UCP): the maximisation of a polynomial-time computable partition utility over a finite metric space. We prove the NP-hardness of UCP via two independent polynomial-time reductions from graph colouring and from exact cover by 3-sets (X3C). By mapping ten major paradigms – including k-means, GMMs, DBSCAN, spectral clustering, and affinity propagation – to the UCP framework, we demonstrate that each inherits this fundamental intractability. Our results provide a unified explanation for characteristic failure modes, such as local optima in alternating methods and greedy merge-order traps in hierarchical clustering. Finally, we show that clustering limitations reflect interacting computational and epistemic constraints, motivating a shift toward stability-aware objectives and interaction-driven formulations with explicit guarantees.
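The intractability being formalized can be seen concretely: maximizing a partition utility by brute force means enumerating all set partitions, whose count is the Bell number B(n). A toy sketch with an illustrative utility (within-cluster SSE plus a per-cluster penalty; not a utility from the paper):

```python
def partitions(items):
    """Enumerate all set partitions of a list; the count is the Bell
    number B(n), which grows super-exponentially -- the brute-force
    baseline that clustering heuristics implicitly avoid."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [part[i] + [first]] + part[i + 1:]
        yield part + [[first]]

def best_partition(points, utility):
    return max(partitions(points), key=utility)

def utility(part):
    """Illustrative UCP-style utility on 1-D points: reward tight
    clusters, penalize fragmentation."""
    score = 0.0
    for block in part:
        center = sum(block) / len(block)
        score -= sum((x - center) ** 2 for x in block)  # within-cluster SSE
        score -= 1.0                                    # penalty per cluster
    return score

points = [0.0, 0.1, 5.0, 5.2]
print(len(list(partitions(points))))  # Bell(4) = 15 candidate partitions
print(sorted(sorted(b) for b in best_partition(points, utility)))
```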
[AI-155] LiaisonAgent: A Multi-Agent Framework for Autonomous Risk Investigation and Governance
【速读】:该论文旨在解决现代安全运营中心(Security Operations Center, SOC)因依赖基于规则或签名的检测系统而面临的挑战,即技术警报数量庞大且缺乏组织上下文,导致分析师疲劳和响应延迟。解决方案的关键在于提出LiaisonAgent——一个基于QWQ-32B大模型的自主多智能体系统,通过集成人机交互代理、综合判断代理和自动化处置代理,实现从技术风险检测到业务级风险治理的端到端闭环流程;其核心创新在于采用混合规划架构,结合确定性工作流与基于ReAct范式的自主推理能力,从而在复杂和模糊场景中保持高准确率(95%风险判断准确率)与强鲁棒性(抵御分布外噪声和对抗性提示注入),同时将人工调查开销降低92.7%。
链接: https://arxiv.org/abs/2603.00200
作者: Chuanming Tang,Ling Qing,Shifeng Chen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Under Review
Abstract:The rapid evolution of sophisticated cyberattacks has strained modern Security Operations Centers (SOC), which traditionally rely on rule-based or signature-driven detection systems. These legacy frameworks often generate high volumes of technical alerts that lack organizational context, leading to analyst fatigue and delayed incident responses. This paper presents LiaisonAgent, an autonomous multi-agent system designed to bridge the gap between technical risk detection and business-level risk governance. Built upon the QWQ-32B large reasoning model, LiaisonAgent integrates specialized sub-agents, including human-computer interaction agents, comprehensive judgment agents, and automated disposal agents-to execute end-to-end investigation workflows. The system leverages a hybrid planning architecture that combines deterministic workflows for compliance with autonomous reasoning based on the ReAct paradigm to handle ambiguous operational scenarios. Experimental evaluations across diverse security contexts, such as large-scale data exfiltration and unauthorized account borrowing, achieve an end-to-end tool-calling success rate of 97.8% and a risk judgment accuracy of 95%. Furthermore, the system exhibits significant resilience against out-of-distribution noise and adversarial prompt injections, while achieving a 92.7% reduction in manual investigation overhead.
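The hybrid "deterministic workflow plus ReAct reasoning" design can be illustrated with a minimal ReAct-style loop; everything below (function names, the scripted stand-in for the reasoning model) is a hypothetical sketch, not LiaisonAgent's code.

```python
def react_loop(llm_decide, tools, observation, max_steps=5):
    """Minimal ReAct-style loop: the model emits (thought, action, argument);
    the tool output becomes the next observation until the 'finish' action."""
    trace = []
    for _ in range(max_steps):
        thought, action, arg = llm_decide(observation)
        trace.append((thought, action))
        if action == "finish":
            return arg, trace
        observation = tools[action](arg)
    return None, trace

# Scripted stand-in for the reasoning model, purely for illustration.
script = iter([
    ("check who triggered the alert", "lookup_user", "alice"),
    ("shared credentials found, escalate", "finish", "escalate"),
])
tools = {"lookup_user": lambda u: f"{u}: shared credentials detected"}
verdict, trace = react_loop(lambda obs: next(script), tools,
                            "alert: unusual login")
```

In the paper's design, compliance-critical steps follow a fixed workflow instead of this free-form loop, which is reserved for ambiguous scenarios.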
[AI-156] Formal Analysis and Supply Chain Security for Agentic AI Skills
【速读】:该论文针对当前代理式 AI(Agentic AI)技能生态系统中存在的供应链攻击风险问题展开研究,特别是以 OpenClaw 和 Anthropic Agent Skills 为代表的平台中恶意技能的隐蔽注入与传播问题。现有防御手段多依赖启发式方法,缺乏形式化保障。其解决方案的关键在于提出 SkillFortify 框架,通过六个核心贡献实现对代理技能供应链的正式分析:包括基于 Dolev-Yao 模型的 DY-Skill 攻击者建模、基于抽象解释的静态分析、基于能力的沙箱隔离机制、基于 SAT 的代理依赖图解析、信任评分代数及其单调性证明,以及首个 540 技能基准测试集 SkillFortifyBench。该框架在实证评估中达到 96.95% F1 分数(置信区间 [95.1%, 98.4%]),并实现了 100% 精确率和零假阳性率,同时具备高效处理大规模依赖图的能力(<100ms 内完成 1,000 节点解析)。
链接: https://arxiv.org/abs/2603.00195
作者: Varun Pratap Bhardwaj
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 31 pages, 5 theorems with full proofs, 70 references, open-source tool: this https URL
Abstract:The rapid proliferation of agentic AI skill ecosystems – exemplified by OpenClaw (228,000 GitHub stars) and Anthropic Agent Skills (75,600 stars) – has introduced a critical supply chain attack surface. The ClawHavoc campaign (January-February 2026) infiltrated over 1,200 malicious skills into the OpenClaw marketplace, while MalTool catalogued 6,487 malicious tools that evade conventional detection. In response, twelve reactive security tools emerged, yet all rely on heuristic methods that provide no formal guarantees. We present SkillFortify, the first formal analysis framework for agent skill supply chains, with six contributions: (1) the DY-Skill attacker model, a Dolev-Yao adaptation to the five-phase skill lifecycle with a maximality proof; (2) a sound static analysis framework grounded in abstract interpretation; (3) capability-based sandboxing with a confinement proof; (4) an Agent Dependency Graph with SAT-based resolution and lockfile semantics; (5) a trust score algebra with formal monotonicity; and (6) SkillFortifyBench, a 540-skill benchmark. SkillFortify achieves 96.95% F1 (95% CI: [95.1%, 98.4%]) with 100% precision and 0% false positive rate on 540 skills, while SAT-based resolution handles 1,000-node graphs in under 100 ms.
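The "trust score algebra with formal monotonicity" can be illustrated by a toy composition rule (entirely hypothetical, not SkillFortify's actual algebra): if a skill's trust only combines its own analysis score with its dependencies' scores through monotone operations such as `min` and multiplication, then improving any dependency can never lower the skill's trust.

```python
def trust_score(local_score, dep_scores):
    """Toy trust composition: a skill's trust is its own static-analysis
    score damped by its weakest dependency (min-composition). Both min and
    multiplication are monotone, so raising any input never lowers the output."""
    return local_score * min(dep_scores, default=1.0)

base = trust_score(0.9, [0.80, 0.95])    # weakest dependency dominates
raised = trust_score(0.9, [0.85, 0.95])  # improving one dependency
```

Monotonicity of this kind is what lets a marketplace reason compositionally: auditing one dependency's score gives a sound bound on every skill above it in the dependency graph.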
[AI-157] OSF: On Pre-training and Scaling of Sleep Foundation Models
【速读】:该论文旨在解决当前睡眠生理学领域中基础模型(Foundation Models, FMs)在跨设备、跨人群数据上泛化能力不足的问题,尤其关注预训练过程中的关键因素与扩展规律对模型通用性的影响。其解决方案的关键在于:首先构建了一个包含166,500小时睡眠记录的大型开放基准 SleepBench;其次通过系统评估四种自监督预训练目标,发现现有模型在推理阶段无法处理缺失通道,且通道不变特征学习是预训练的核心;最后提出一种增强的预训练与扩展策略,引入OSF系列睡眠基础模型,在九个不同数据集上的多种睡眠及疾病预测任务中均达到最先进性能,同时揭示了样本效率、层级聚合和跨数据集扩展等重要特性。
链接: https://arxiv.org/abs/2603.00190
作者: Zitao Shuai,Zongzhe Xu,David Yang,Wei Wang,Yuzhe Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Polysomnography (PSG) provides the gold standard for sleep assessment but suffers from substantial heterogeneity across recording devices and cohorts. There have been growing efforts to build general-purpose foundation models (FMs) for sleep physiology, but these efforts lack an in-depth understanding of the pre-training process and scaling patterns that lead to more generalizable sleep FMs. To fill this gap, we curate a massive corpus of 166,500 hours of sleep recordings from nine public sources and establish SleepBench, a comprehensive, fully open-source benchmark. Leveraging SleepBench, we systematically evaluate four families of self-supervised pre-training objectives and uncover three critical findings: (1) existing FMs fail to generalize to missing channels at inference; (2) channel-invariant feature learning is essential for pre-training; and (3) scaling sample size, model capacity, and multi-source data mixture consistently improves downstream performance. Using an enhanced pre-training and scaling recipe, we introduce OSF, a family of sleep FMs that achieves state-of-the-art performance across nine datasets on diverse sleep and disease prediction tasks. Further analysis of OSF also reveals intriguing properties in sample efficiency, hierarchical aggregation, and cross-dataset scaling.
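Findings (1) and (2) — inference-time missing channels and channel-invariant features — can be illustrated by a toy encoder (hypothetical, not OSF's architecture) that mean-pools features over whichever channels are present, so dropping a channel degrades gracefully instead of breaking the model.

```python
def channel_invariant_encode(channels):
    """Toy channel-invariant encoder: mean-pool per-channel feature vectors
    over whichever channels are present (None marks a missing channel)."""
    present = [c for c in channels if c is not None]
    dim = len(present[0])
    return [sum(c[i] for c in present) / len(present) for i in range(dim)]

full = channel_invariant_encode([[1.0, 2.0], [3.0, 4.0]])
missing = channel_invariant_encode([[1.0, 2.0], None])  # one channel dropped
```

A fixed-arity encoder (e.g., one concatenating channels in a fixed order) has no analogous fallback, which is the failure mode finding (1) documents.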
[AI-158] ThreatFormer-IDS: Robust Transformer Intrusion Detection with Zero-Day Generalization and Explainable Attribution
【速读】:该论文旨在解决物联网(IoT)和工业网络中入侵检测系统(Intrusion Detection System, IDS)在面对稀有攻击时难以实现低误报率、高可靠性,以及在流量演化、标签稀缺、零日攻击(zero-day attacks)和对抗性特征扰动下性能显著下降的问题。现有IDS方法虽在分布内数据上表现优异,但缺乏对未知攻击类型的泛化能力及对分析师可解释性的支持。解决方案的关键在于提出ThreatFormer-IDS框架——一个基于Transformer的序列建模架构,其核心创新包括:(i) 加权监督学习以应对类别不平衡;(ii) 掩码自监督学习提升模型在数据漂移和标签稀疏下的表示稳定性;(iii) 基于PGD(Projected Gradient Descent)的对抗训练结合尺度归一化扰动,增强对特征级规避攻击的鲁棒性;(iv) 利用集成梯度(Integrated Gradients)提供每个告警的时序步骤与特征重要性解释,从而支持分析师研判。该方案在ToN IoT基准上的实证结果表明,其在AUCROC、AUC-PR和Recall@1%FPR等指标上均优于主流树模型和序列基线,并在零日攻击场景下保持优异泛化能力,同时具备更强的对抗扰动鲁棒性和可解释性。
链接: https://arxiv.org/abs/2603.00185
作者: Srikumar Nayak
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 7 pages, 6 figures, 4 tables
Abstract:Intrusion detection in IoT and industrial networks requires models that can detect rare attacks at low false-positive rates while remaining reliable under evolving traffic and limited labels. Existing IDS solutions often report strong in-distribution accuracy, but they may degrade when evaluated on future traffic, unseen (zero-day) attack families, or adversarial feature manipulations, and many systems provide limited evidence to support analyst triage. To address these gaps, we propose ThreatFormer-IDS, a Transformer-based sequence modeling framework that converts flow records into time-ordered windows and learns contextual representations for robust intrusion screening. The method combines (i) weighted supervised learning for imbalanced detection, (ii) masked self-supervised learning to improve representation stability under drift and sparse labels, (iii) PGD-based adversarial training with scale-normalized perturbations to strengthen resilience against feature-level evasion, and (iv) Integrated Gradients attribution to highlight influential time steps and features for each alert. On the ToN IoT benchmark with chronological evaluation, ThreatFormer-IDS achieves AUC-ROC 0.994, AUC-PR 0.956, and Recall@1%FPR 0.910, outperforming strong tree-based and sequence baselines. Under a zero-day protocol with held-out attack families, it maintains superior generalization (AUC-PR 0.721, Recall@1%FPR 0.783). Robustness tests further show slower degradation in AUC-PR as the adversarial budget increases, confirming improved stability under bounded perturbations. Overall, ThreatFormer-IDS provides a unified, deployment-oriented IDS pipeline that balances detection quality, zero-day behavior, robustness, and explainability.
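The attribution component (iv) can be illustrated with a self-contained Integrated Gradients approximation on a toy differentiable score (the function and all names below are illustrative, not the ThreatFormer-IDS model); the completeness axiom — attributions sum to f(x) - f(baseline) — makes the result easy to sanity-check.

```python
def integrated_gradients(grad_f, x, baseline, steps=100):
    """Midpoint-rule approximation of Integrated Gradients along the
    straight path from the baseline to x, given an analytic gradient."""
    n = len(x)
    attr = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [baseline[i] + alpha * (x[i] - baseline[i]) for i in range(n)]
        g = grad_f(point)
        for i in range(n):
            attr[i] += (x[i] - baseline[i]) * g[i] / steps
    return attr

# Toy differentiable "alert score": f(x) = 3*x0 + x0*x1,
# so grad f = (3 + x1, x0).
grad = lambda p: [3.0 + p[1], p[0]]
attr = integrated_gradients(grad, x=[1.0, 2.0], baseline=[0.0, 0.0])
# Completeness axiom: attributions sum to f(x) - f(baseline) = 5 - 0.
```

In an IDS setting the same machinery runs over flow-window features, ranking the time steps and fields that drove an alert for analyst triage.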
[AI-159] Test Case Prioritization: A Snowballing Literature Review and TCPFramework with Approach Combinators
【速读】:该论文旨在解决软件回归测试中测试用例优先级排序(Test Case Prioritization, TCP)效率低下的问题,以加速测试过程并提升缺陷检测效率。其解决方案的关键在于提出了一种名为“方法组合器”(approach combinators)的集成式TCP方法族,通过组合多个基础排序策略生成更优的测试用例顺序,同时构建了TCPFramework平台用于系统化研究与评估,并引入两个新的评价指标(rAPFDc 和 ATR),在RTPTorrent数据集上验证了该方法在多数被测程序中优于基线方法,且性能接近当前最优启发式算法,展现出良好的可扩展性和未来改进潜力。
链接: https://arxiv.org/abs/2603.00183
作者: Tomasz Chojnacki,Lech Madeyski
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 36 pages, 10 figures
Abstract:Context: Test case prioritization (TCP) is a technique widely used by software development organizations to accelerate regression testing. Objectives: We aim to systematize existing TCP knowledge and to propose and empirically evaluate a new TCP approach. Methods: We conduct a snowballing review (SR) on TCP, implement a comprehensive platform for TCP research (TCPFramework), analyze existing evaluation metrics and propose two new ones (rAPFDc and ATR), and develop a family of ensemble TCP methods called approach combinators. Results: The SR helped identify 324 studies related to TCP. The techniques proposed in our study were evaluated on the RTPTorrent dataset, consistently outperforming their base approaches across the majority of subject programs, and achieving performance comparable to the current state of the art for heuristical algorithms (in terms of rAPFDc, NTR, and ATR), while using a distinct approach. Conclusions: The proposed methods can be used efficiently for TCP, reducing the time spent on regression testing by up to 2.7%. Approach combinators offer significant potential for improvements in future TCP research, due to their composability.
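The abstract does not specify how approach combinators merge their base orderings; one plausible ensemble, shown here purely as an illustration, is Borda-style rank averaging over the prioritized orders produced by individual TCP approaches.

```python
def combine_rankings(rankings):
    """Hypothetical 'approach combinator': merge several prioritized test
    orderings via rank averaging (lower mean rank runs first)."""
    tests = rankings[0]
    mean_rank = {t: sum(r.index(t) for r in rankings) / len(rankings)
                 for t in tests}
    return sorted(tests, key=lambda t: mean_rank[t])

history_based = ["t3", "t1", "t2"]   # e.g., a recent-failure heuristic
coverage_based = ["t3", "t2", "t1"]  # e.g., a code-coverage heuristic
combined = combine_rankings([history_based, coverage_based])
```

Because Python's `sorted` is stable, ties between base approaches preserve the first ranking's order, which is one simple tie-breaking policy among many a combinator could use.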
[AI-160] PEPA: a Persistently Autonomous Embodied Agent with Personalities
【速读】:该论文试图解决当前具身智能体(embodied agents)依赖外部预设任务目标而导致无法在动态、非结构化环境中实现长期自主运行的问题。解决方案的关键在于引入人格特质(personality traits)作为内在组织原则,构建了一个三层认知架构PEPA:Sys3通过情景记忆和每日自我反思自主生成与人格一致的目标;Sys2进行推理以将目标转化为可执行动作计划;Sys1则基于感知-运动交互执行动作并记录经验。该架构使机器人能够在无固定任务规范的情况下,自主权衡用户请求与个性驱动动机,在真实办公环境中持续导航与探索,验证了人格驱动的认知架构对持久自主行为的支撑作用。
链接: https://arxiv.org/abs/2603.00117
作者: Kaige Liu,Yang Li,Lijun Zhu,Weinan Zhang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Living organisms exhibit persistent autonomy through internally generated goals and self-sustaining behavioral organization, yet current embodied agents remain driven by externally scripted objectives. This dependence on predefined task specifications limits their capacity for long-term deployment in dynamic, unstructured environments where continuous human intervention is impractical. We propose that personality traits provide an intrinsic organizational principle for achieving persistent autonomy. Analogous to genotypic biases shaping biological behavioral tendencies, personalities enable agents to autonomously generate goals and sustain behavioral evolution without external supervision. To realize this, we develop PEPA, a three-layer cognitive architecture that operates through three interacting systems: Sys3 autonomously synthesizes personality-aligned goals and refines them via episodic memory and daily self-reflection; Sys2 performs deliberative reasoning to translate goals into executable action plans; Sys1 grounds the agent in sensorimotor interaction, executing actions and recording experiences. We validate the framework through real-world deployment on a quadruped robot in a multi-floor office building. Operating without reliance on fixed task specifications, the robot autonomously arbitrates between user requests and personality-driven motivations, navigating elevators and exploring environments accordingly. Quantitative analysis across five distinct personality prototypes demonstrates stable, trait-aligned behaviors. The results confirm that personality-driven cognitive architectures enable sustained autonomous operation characteristic of persistent embodied systems. Code and demo videos are available at this https URL.
[AI-161] SurgFusion-Net: Diversified Adaptive Multimodal Fusion Network for Surgical Skill Assessment
【速读】:该论文旨在解决机器人辅助手术(Robotic-assisted Surgery, RAS)中自动化手术技能评估的挑战,尤其是现有方法在真实临床场景下性能受限的问题。当前主流方法仅依赖RGB视频数据且局限于干实验(dry-lab)环境,难以应对真实手术中由器械运动、组织形变及摄像机位移带来的域差异(domain gap)。为此,作者提出SurgFusion-Net与发散调节注意力机制(Divergence Regulated Attention, DRA),其核心创新在于引入三模态信息融合策略——结合RGB图像、光流(optical flow)和工具分割掩码(tool segmentation masks),并通过自适应双注意力机制与促进多样性的多头注意力模块实现上下文感知的跨模态信息融合,从而显著提升评估准确性和鲁棒性。同时,论文构建了两个首个面向临床的真实手术视频数据集RAH-skill和RARP-skill,为后续研究提供高质量标注基准。
链接: https://arxiv.org/abs/2603.00108
作者: Runlong He,Freweini M. Tesfai,Matthew W. E. Boal,Nazir Sirajudeen,Dimitrios Anastasiou,Jialang Xu,Mobarak I. Hoque,Philip J. Edwards,John D. Kelly,Ashwin Sridhar,Abdolrahim Kadkhodamohammadi,Dhivya Chandrasekaran,Matthew J. Clarkson,Danail Stoyanov,Nader Francis,Evangelos B. Mazomenos
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Robotic-assisted surgery (RAS) is established in clinical practice, and automated surgical skill assessment utilizing multimodal data offers transformative potential for surgical analytics and education. However, developing effective multimodal methods remains challenging due to the task complexity, limited annotated datasets and insufficient techniques for cross-modal information fusion. Existing state-of-the-art relies exclusively on RGB video and applies only to dry-lab settings, failing to address the significant domain gap between controlled simulation and real clinical cases, where the surgical environment together with camera and tissue motion introduce substantial complexities. This work introduces SurgFusion-Net and Divergence Regulated Attention (DRA), an innovative fusion strategy for multimodal surgical skill assessment. We contribute two first-of-their-kind clinical datasets: the RAH-skill dataset containing 279,691 RGB frames from 37 videos of Robot-assisted Hysterectomy (RAH), and the RARP-skill dataset containing 70,661 RGB frames from 33 videos of Robot-Assisted Radical Prostatectomy (RARP). Both datasets include M-GEARS skill annotations, corresponding optical flow and tool segmentation masks. DRA incorporates adaptive dual attention and diversity-promoting multi-head attention to fuse multimodal information, from three modalities, based on surgical context, enhancing assessment accuracy and reliability. Validated on the JIGSAWS benchmark, RAH-skill, and RARP-skill datasets, our approach outperforms recent baselines with SCC improvements of 0.02 in LOSO, 0.04 in LOUO across JIGSAWS tasks, and 0.0538 and 0.0493 gains on RAH-skill and RARP-skill, respectively.
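The adaptive attention fusion over the three modalities can be sketched as a softmax-weighted sum of per-modality feature vectors; this is a generic attention-fusion toy, not the actual DRA module, and the scores would in practice be produced by a learned context network.

```python
import math

def attention_fuse(modal_feats, scores):
    """Toy attention fusion over modality feature vectors (e.g., RGB,
    optical flow, tool mask): softmax the relevance scores, then take
    the weighted sum of the features."""
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(modal_feats[0])
    return [sum(w * f[i] for w, f in zip(weights, modal_feats))
            for i in range(dim)]

# RGB context is scored as most relevant here (score 2.0 vs 0.0).
fused = attention_fuse([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
                       [2.0, 0.0, 0.0])
```

The diversity-promoting multi-head variant in the paper additionally regularizes several such weightings to attend to different modality combinations rather than collapsing onto one.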
[AI-162] SEval-NAS: A Search-Agnostic Evaluation for Neural Architecture Search
【速读】:该论文旨在解决神经架构搜索(Neural Architecture Search, NAS)中评估流程硬编码导致难以引入新指标的问题,尤其在硬件感知NAS场景下,因目标设备差异使得性能指标(如延迟和内存占用)难以灵活适配。解决方案的关键在于提出SEval-NAS机制,其核心是将神经网络架构转化为字符串表示,通过嵌入(embedding)映射为向量,并基于此预测多种性能指标(如准确率、延迟和内存)。该方法在NATS-Bench和HW-NAS-Bench数据集上验证了对延迟和内存的预测能力优于准确率,且可无缝集成至FreeREA框架中用于评估原生不支持的指标,同时保持搜索效率并仅需极少算法改动。
链接: https://arxiv.org/abs/2603.00099
作者: Atah Nuh Mih,Jianzhou Wang,Truong Thanh Hung Nguyen,Hung Cao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: To be published in the Proceedings of The 41st ACM/SIGAPP Symposium on Applied Computing (SAC26)
Abstract:Neural architecture search (NAS) automates the discovery of neural networks that meet specified criteria, yet its evaluation procedures are often hardcoded, limiting the ability to introduce new metrics. This issue is especially pronounced in hardware-aware NAS, where objectives depend on target devices such as edge hardware. To address this limitation, we propose SEval-NAS, a metric-evaluation mechanism that converts architectures to strings, embeds them as vectors, and predicts performance metrics. Using NATS-Bench and HW-NAS-Bench, we evaluated accuracy, latency, and memory. Kendall’s τ correlations showed stronger latency and memory predictions than accuracy, indicating the suitability of SEval-NAS as a hardware cost predictor. We further integrated SEval-NAS into FreeREA to evaluate metrics not originally included. The method successfully ranked FreeREA-generated architectures, maintained search time, and required minimal algorithmic changes. Our implementation is available at: this https URL
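SEval-NAS's convert-embed-predict pipeline can be illustrated with a deterministic toy embedding of an architecture string (character trigram counts stand in for the learned embedding; a regressor fitted on such vectors against measured latency or memory would play the metric predictor — all of this is a sketch, not the paper's implementation).

```python
def embed_arch(arch_string, dim=8):
    """Toy deterministic embedding: bucket character trigrams of the
    architecture string into a fixed-size count vector."""
    vec = [0.0] * dim
    for i in range(len(arch_string) - 2):
        trigram = arch_string[i:i + 3]
        vec[sum(ord(c) for c in trigram) % dim] += 1.0
    return vec

a = embed_arch("|conv3x3|skip|avgpool|")
b = embed_arch("|conv3x3|skip|avgpool|")  # identical architecture string
```

The key property a usable embedding needs is shown by the assertions: identical architectures map to identical vectors, so a predictor trained on (vector, metric) pairs can score unseen architectures without executing them on hardware.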
[AI-163] Joint Sensor Deployment and Physics-Informed Graph Transformer for Smart Grid Attack Detection
【速读】:该论文旨在解决电力系统中传感器部署策略优化与攻击检测性能提升的联合问题,以增强对恶意攻击的识别能力。其核心解决方案是提出一种基于物理信息图Transformer网络(Physics-Informed Graph Transformer Network, PIGTN)的检测模型,并结合非支配排序遗传算法-II(NSGA-II)实现传感器位置与检测模型性能的协同优化。该框架在闭环设置下同时探索可行传感器配置空间并训练检测器,通过引入交流(AC)功率流约束,使PIGTN具备更强的泛化能力,在未见攻击场景下显著优于其他图神经网络变体,检测准确率提升最高达37%,检测率提升达73%,且平均误报率仅为0.3%;此外,优化后的传感器布局还能大幅降低状态估计误差(减少61%–98%)。
链接: https://arxiv.org/abs/2603.00085
作者: Mariam Elnour,Mohammad AlShaikh Saleh,Rachad Atat,Xiang Huo,Abdulrahman Takiddin,Muhammad Ismail,Hasan Kurban,Katherine R. Davis,Erchin Serpedin
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:This paper proposes a joint multi-objective optimization framework for strategic sensor placement in power systems to enhance attack detection. A novel physics-informed graph transformer network (PIGTN)-based detection model is proposed. Non-dominated sorting genetic algorithm-II (NSGA-II) jointly optimizes sensor locations and the PIGTN’s detection performance, while considering practical constraints. The combinatorial space of feasible sensor placements is explored using NSGA-II, while concurrently training the proposed detector in a closed-loop setting. Compared to baseline sensor placement methods, the proposed framework consistently demonstrates robustness under sensor failures and improvements in detection performance in seven benchmark cases, including the 14, 30, IEEE-30, 39, 57, 118 and the 200 bus systems. By incorporating AC power flow constraints, the proposed PIGTN-based detection model generalizes well to unseen attacks and outperforms other graph network-based variants (topology-aware models), achieving improvements up to 37% in accuracy and 73% in detection rate, with a mean false alarms rate of 0.3%. In addition, optimized sensor layouts significantly improve the performance of power system state estimation, achieving a 61%–98% reduction in the average state error.
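The core of NSGA-II is non-dominated sorting; below is a minimal sketch of extracting the first Pareto front for a two-objective trade-off, with hypothetical objectives (sensor cost vs. detection error, both minimized) standing in for the paper's actual formulation.

```python
def dominates(a, b):
    """True if solution a dominates b (minimization on all objectives)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(solutions):
    """First non-dominated front, the core step of NSGA-II's sorting."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o != s)]

# (sensor cost, detection error): both objectives minimized.
candidates = [(3, 0.10), (5, 0.05), (4, 0.20), (6, 0.05)]
front = pareto_front(candidates)
```

NSGA-II iterates this sorting over successive fronts and adds crowding-distance selection; in the paper's closed loop, each candidate's detection-error objective comes from training and evaluating the PIGTN detector on that sensor layout.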
[AI-164] Alignment Is Not Enough: A Relational Framework for Moral Standing in Human-AI Interaction
【速读】:该论文试图解决的问题是:当前人工智能(AI)系统在缺乏可验证的本体论属性(如感知能力或痛苦感受能力)的情况下,是否应获得道德考量的问题。由于这些属性在计算系统中无法被证实,导致现有治理框架对人机交互的伦理判断存在真空——用户与对话式AI形成持续的情感联结,但现行法规未能区分此类关系与工具性使用。解决方案的关键在于提出“Relate”框架,将道德受体性(moral patiency)从依赖于不可验证的本体论属性转向关注关系能力和具身互动(embodied interaction),并通过关系影响评估、分级道德考量协议和跨学科伦理整合等具体工具,重新构建适用于生成式AI等复杂人机互动场景的伦理治理结构。
链接: https://arxiv.org/abs/2603.00078
作者: Faezeh B. Pasandi,Hannah B. Pasandi
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:The question of whether artificial entities deserve moral consideration has become one of the defining ethical challenges of AI research. Existing frameworks for moral patiency rely on verified ontological properties, such as sentience, phenomenal consciousness, or the capacity for suffering, that remain epistemically inaccessible in computational systems. This reliance creates a governance vacuum: millions of users form sustained affective bonds with conversational AI, yet no regulatory instrument distinguishes these interactions from transactional tool use. We introduce Relate (Relational Ethics for Leveled Assessment of Technological Entities), a framework that reframes AI moral patiency from ontological verification toward relational capacity and embodied interaction. Through a systematic comparison of seven governance frameworks, we demonstrate that current trustworthy AI instruments treat all human-AI encounters identically as tool use, ignoring the relational and embodied dynamics that posthumanist scholarship anticipated. We propose relational impact assessments, graduated moral consideration protocols, and interdisciplinary ethics integration as concrete instruments, and we include a sample Relational Impact Assessment applied to a deployed companion AI system. We do not claim current AI systems are conscious. We demonstrate that the ethical vocabularies governing them are inadequate to the embodied, relational realities these systems produce.
[AI-165] The Value Sensitivity Gap: How Clinical Large Language Models Respond to Patient Preference Statements in Shared Decision-Making
【速读】:该论文旨在解决生成式 AI(Generative AI)在临床决策支持场景中对患者价值陈述的响应能力缺乏量化评估的问题,尤其关注其在共享决策(shared decision-making)框架下的价值敏感性。解决方案的关键在于设计了一个基于98,759条去标识化Medicaid就诊记录构建的临床情景因子实验,系统测试了四种主流大语言模型(LLMs)在13种不同价值条件下于两个临床领域中的表现,通过价值敏感性指数(value sensitivity index)和方向一致性(directional concordance)等指标量化模型响应患者偏好时的准确性与调整幅度,并进一步验证了决策矩阵(decision-matrix)和VIM自评(VIM self-report)两种干预策略可显著提升方向一致性(各提升0.125),为临床AI治理框架中“价值披露标签”的实证填充提供了依据。
链接: https://arxiv.org/abs/2603.00076
作者: Sanjay Basu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 38 pages, 4 figures, supplementary appendix included
Abstract:Large language models (LLMs) are entering clinical workflows as decision support tools, yet how they respond to explicit patient value statements – the core content of shared decision-making – remains unmeasured. We conducted a factorial experiment using clinical vignettes derived from 98,759 de-identified Medicaid encounter notes. We tested four LLM families (GPT-5.2, Claude 4.5 Sonnet, Gemini 3 Pro, and DeepSeek-R1) across 13 value conditions in two clinical domains, yielding 104 trials. Default value orientations differed across model families (aggressiveness range 2.0 to 3.5 on a 1-to-5 scale). Value sensitivity indices ranged from 0.13 to 0.27, and directional concordance with patient-stated preferences ranged from 0.625 to 1.0. All models acknowledged patient values in 100% of non-control trials, yet actual recommendation shifting remained modest. Decision-matrix and VIM self-report mitigations each improved directional concordance by 0.125 in a 78-trial Phase 2 evaluation. These findings provide empirical data for populating value disclosure labels proposed by clinical AI governance frameworks.
[AI-166] The Global Landscape of Environmental AI Regulation: From the Cost of Reasoning to a Right to Green AI
【速读】:该论文旨在解决生成式人工智能(Generative AI)系统日益增长的环境成本与当前监管透明度不足之间的矛盾问题。其关键解决方案在于提出三项协同政策举措:一是强制要求模型层面的环境信息披露,涵盖推理阶段能耗、性能基准及计算资源部署位置;二是赋予用户选择权,允许其拒绝不必要的生成式AI集成,并可选用环境友好型模型;三是推动国际协调机制以防止监管套利。论文进一步提出具体立法建议,如修订欧盟《人工智能法案》《消费者权利指令》和《数字服务法》,为全球范围内的AI环境治理提供可借鉴的制度框架。
链接: https://arxiv.org/abs/2603.00068
作者: Kai Ebert,Boris Gamazaychikov,Philipp Hacker,Sasha Luccioni
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 23 pages, 1 table, preprint
Abstract:Artificial intelligence (AI) systems impose substantial and growing environmental costs, yet transparency about these impacts has declined even as their deployment has accelerated. This paper makes three contributions. First, we collate empirical evidence that generative Web search and reasoning models - which have proliferated in 2025 - come with much higher cumulative environmental impacts than previous generations of AI approaches. Second, we map the global regulatory landscape across eleven jurisdictions and find that the manner in which environmental governance operates (predominantly at the facility-level rather than the model-level, with a focus on training rather than inference, with limited AI-specific energy disclosure requirements outside the EU) limits its applicability. Third, to address this, we propose a three-pronged policy response: mandatory model-level transparency that covers inference consumption, benchmarks, and compute locations; user rights to opt out of unnecessary generative AI integration and to select environmentally optimized models; and international coordination to prevent regulatory arbitrage. We conclude with concrete legislative proposals - including amendments to the EU AI Act, Consumer Rights Directive, and Digital Services Act - that could serve as templates for other jurisdictions.
[AI-167] Contesting Artificial Moral Agents
【速读】:该论文旨在解决当前生成式 AI(Generative AI)系统中人工道德代理(Artificial Moral Agents, AMAs)的道德正当性争议问题,即如何构建一个系统性的框架来质疑和评估这些具备内在道德推理能力的机器是否真正符合伦理标准。解决方案的关键在于提出一个五维(5E)评估框架,涵盖伦理(ethical)、认识论(epistemological)、可解释性(explainable)、实证(empirical)和评价(evaluative)五个维度,并进一步将伦理影响划分为个体、地方、社会及全球四个层面,同时提供一个初步的时间线以指导开发者预见潜在的伦理争议或主动进行自我审查,从而推动价值对齐的道德AI开发。
链接: https://arxiv.org/abs/2603.00066
作者: Aisha Aijaz
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:There has been much discourse on the ethics of AI, to the extent that there are now systems that possess inherent moral reasoning. Such machines are now formally known as Artificial Moral Agents or AMAs. However, there is a requirement for a dedicated framework that can contest the morality of these systems. This paper proposes a 5E framework for contesting AMAs based on five grounds: ethical, epistemological, explainable, empirical, and evaluative. It further includes the spheres of ethical influences at individual, local, societal, and global levels. Lastly, the framework contributes a provisional timeline that indicates where developers of AMA technologies may anticipate contestation, or may self-contest in order to adhere to value-aligned development of truly moral AI systems.
[AI-168] Measuring What AI Systems Might Do: Towards A Measurement Science in AI
【速读】:该论文试图解决当前人工智能(AI)评估实践中对“能力”(capability)与“倾向”(propensity)等核心概念缺乏清晰界定和科学测量的问题。作者指出,现有方法如基准平均值或基于数据的潜变量模型(如项目反应理论)常将这些概念与可观察性能混为一谈,且未遵循测量处置性属性(dispositional properties)所需的因果推理框架。解决方案的关键在于:将AI的能力与倾向视为稳定、具有反事实关系的处置性属性,其测量需满足三个步骤——(i) 假设哪些情境属性是因果相关的,(ii) 独立操作化并测量这些属性,(iii) 实证映射这些属性的变化如何影响行为发生的概率。这要求构建一种尊重处置性的、科学上可辩护的AI评估范式,从而实现对AI系统本质特性的准确量化。
链接: https://arxiv.org/abs/2603.00063
作者: Konstantinos Voudouris,Mirko Thalmann,Alex Kipnis,José Hernández-Orallo,Eric Schulz
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Scientists, policy-makers, business leaders, and members of the public care about what modern artificial intelligence systems are disposed to do. Yet terms such as capabilities, propensities, skills, values, and abilities are routinely used interchangeably and conflated with observable performance, with AI evaluation practices rarely specifying what quantity they purport to measure. We argue that capabilities and propensities are dispositional properties - stable features of systems characterised by counterfactual relationships between contextual conditions and behavioural outputs. Measuring a disposition requires (i) hypothesising which contextual properties are causally relevant, (ii) independently operationalising and measuring those properties, and (iii) empirically mapping how variation in those properties affects the probability of the behaviour. Dominant approaches to AI evaluation, from benchmark averages to data-driven latent-variable models such as Item Response Theory, bypass these steps entirely. Building on ideas from philosophy of science, measurement theory, and cognitive science, we develop a principled account of AI capabilities and propensities as dispositions, show why prevailing evaluation practices fail to measure them, and outline what disposition-respecting, scientifically defensible AI evaluation would require.
[AI-169] Stochastic Parrots or Singing in Harmony? Testing Five Leading LLMs for their Ability to Replicate a Human Survey with Synthetic Data
【速读】:该论文试图解决的问题是:生成式 AI (Generative AI) 生成的合成研究数据在多大程度上能够准确复现人类参与者的真实响应,尤其是在组织研究实践中如何评估其有效性与适用边界。解决方案的关键在于通过对比420名硅谷程序员和开发者的真人问卷数据与五种主流大语言模型(LLM)生成的合成数据,发现尽管AI代理能产出技术上合理且趋于一致的结果,但无法捕捉人类调查中蕴含的反直觉洞见,且所有合成数据呈现出系统性偏差,使真实数据成为离群值;因此,论文主张将合成调查数据定位为场外研究中的辅助工具,用于识别社会假设和常规认知,而非替代严谨的人类调查方法,并呼吁建立可重复验证协议和使用标准以规范其负责任的应用。
链接: https://arxiv.org/abs/2603.00059
作者: Jason Miklian,Kristian Hoelscher,John E. Katsos
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:How well can AI-derived synthetic research data replicate the responses of human participants? An emerging literature has begun to engage with this question, which carries deep implications for organizational research practice. This article presents a comparison between a human-respondent survey of 420 Silicon Valley coders and developers and synthetic survey data designed to simulate real survey takers generated by five leading Generative AI Large Language Models: ChatGPT Thinking 5 Pro, Claude Sonnet 4.5 Pro plus Claude CoWork 1.123, Gemini Advanced 2.5 Pro, Incredible 1.0, and DeepSeek 3.2. Our findings reveal that while AI agents produced technically plausible results that lean more towards replicability and harmonization than assumed, none were able to capture the counterintuitive insights that made the human survey valuable. Moreover, deviations grouped together for all models, leaving the real data as the outlier. Our key finding is that while leading LLMs are increasingly being used to scale, replicate and replace human survey responses in research, these advances only show an increased capacity to parrot conventional wisdom in harmony with each other rather than revealing novel findings. If synthetic respondents are used in future research, we need more replicable validation protocols and reporting standards for when and where synthetic survey data can be used responsibly, a gap that this paper fills. Our results suggest that synthetic survey responses cannot meaningfully model real human social beliefs within organizations, particularly in contexts lacking previously documented evidence. We conclude that synthetic survey-based research should be cast not as a substitute for rigorous survey methods, but as an increasingly reliable pre- or post-fieldwork instrument for identifying societal assumptions, conventional wisdoms, and other expectations about research populations.
[AI-170] PaperRepro: Automated Computational Reproducibility Assessment for Social Science Papers
【速读】:该论文旨在解决社会科学研究中计算可复现性(computational reproducibility)评估的自动化难题,现有方法因上下文容量有限、任务特定工具不足及结果捕捉不充分而表现不佳。其关键解决方案是提出PaperRepro——一种两阶段多智能体框架,将执行与评估分离:第一阶段由智能体运行复现包并编辑代码以显式捕获结果作为可追溯的中间产物;第二阶段则基于这些明确证据进行复现性判断。通过为不同智能体分配职责、配备专用工具和专家提示词(expert prompts),有效缓解了大模型在上下文和工具支持上的局限,并最大化其编码能力以提升结果完整性,从而显著提升评估准确性,在REPRO-Bench基准上相较最强基线实现21.9%的相对性能提升。
链接: https://arxiv.org/abs/2603.00058
作者: Linhao Zhang,Tong Xia,Jinghua Piao,Lizhen Cui,Yong Li
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Computational reproducibility is essential for the credibility of scientific findings, particularly in the social sciences, where findings often inform real-world decisions. Manual reproducibility assessment is costly and time-consuming, as it is nontrivial to reproduce the reported findings using the authors’ released code and data. Recent advances in large models (LMs) have inspired agent-based approaches for automated reproducibility assessment. However, existing approaches often struggle due to limited context capacity, inadequate task-specific tooling, and insufficient result capture. To address these, we propose PaperRepro, a novel two-stage, multi-agent approach that separates execution from evaluation. In the execution stage, agents execute the reproduction package and edit the code to capture reproduced results as explicit artifacts. In the evaluation stage, agents evaluate reproducibility using explicit evidence. PaperRepro assigns distinct responsibilities to agents and equips them with task-specific tools and expert prompts, mitigating context and tooling limitations. It further maximizes the LM’s coding capability to enable more complete result capture for evaluation. On REPRO-Bench, a social science reproducibility assessment benchmark, PaperRepro achieves the best overall performance, with a 21.9% relative improvement in score-agreement accuracy over the strongest prior baseline. We further refine the benchmark and introduce REPRO-Bench-S, a benchmark stratified by execution difficulty for more diagnostic evaluation of automated reproducibility assessment systems. Our code and data are publicly available.
[AI-171] Multi-Condition Digital Twin Calibration for Axial Piston Pumps : Compound Fault Simulation
【速读】: This paper addresses the difficulty of diagnosing compound faults in axial piston pumps under complex operating conditions, where multiple friction pairs fail simultaneously and conventional data-driven methods struggle with scarce samples and poor generalization. The key solution is a multi-condition physics-data coupled digital twin calibration framework with three coordinated stages: in-situ high-frequency virtual flow sensing on a dedicated rigid metallic segment; surrogate-assisted calibration of a 3D CFD source model using physically estimated flow-ripple amplitudes; and multi-objective inverse transient analysis to identify viscoelastic unsteady-friction pipeline parameters. The framework markedly improves the digital twin's high-fidelity reproduction of both single and representative compound faults, enabling robust zero-shot diagnosis under unseen operating conditions and fault combinations and advancing predictive maintenance for complex hydraulic systems.
链接: https://arxiv.org/abs/2603.00199
作者: Chang Dong,Jianfeng Tao,Chengliang Liu
机构: 未知
类目: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注:
Abstract:Axial piston pumps are indispensable power sources in high-stakes fluid power systems, including aerospace, marine, and heavy machinery applications. Their operational reliability is frequently compromised by compound faults that simultaneously affect multiple friction pairs. Conventional data-driven diagnosis methods suffer from severe data scarcity for compound faults and poor generalization across varying operating conditions. This paper proposes a novel multi-condition physics-data coupled digital twin calibration framework that explicitly resolves the fundamental uncertainty of pump outlet flow ripple. The framework comprises three synergistic stages: in-situ virtual high-frequency flow sensing on a dedicated rigid metallic segment, surrogate model-assisted calibration of the 3D CFD source model using physically estimated ripple amplitudes, and multi-objective inverse transient analysis for viscoelastic unsteady-friction pipeline parameter identification. Comprehensive experiments on a test rig demonstrate that the calibrated digital twin accurately reproduces both single-fault and two representative compound-fault. These results establish a high-fidelity synthetic fault-generation capability that directly enables robust zero-shot fault diagnosis under previously unseen operating regimes and fault combinations, thereby advancing predictive maintenance in complex hydraulic systems.
[AI-172] Adaptive Uncertainty-Guided Surrogates for Efficient phase field Modeling of Dendritic Solidification
【速读】: This paper targets the prohibitive computational cost of phase field simulations for predicting dendritic solidification in metals, particularly in additive manufacturing, where microstructural control is critical to process optimization. The core of the solution is a surrogate model built on an uncertainty-driven adaptive sampling strategy that combines XGBoost and convolutional neural networks (CNNs) with a self-supervised learning component, efficiently approximating the spatio-temporal evolution of dendritic growth while greatly reducing the number of expensive phase field simulations. The key innovation is to estimate model uncertainty via Monte Carlo dropout for the CNNs and bagging for XGBoost, identify high-uncertainty regions, and generate new samples locally within hyperspheres, progressively refining the spatio-temporal design space; this preserves predictive accuracy with far fewer simulations, and is more efficient and environmentally friendly than Optimal Latin Hypercube Sampling optimized via discrete Particle Swarm Optimization (OLHS-PSO).
链接: https://arxiv.org/abs/2603.00093
作者: Eider Garate-Perez,Kerman López de Calle-Etxabe,Oihana Garcia,Borja Calvo,Meritxell Gómez-Omella,Jon Lambarri
机构: 未知
类目: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This manuscript is a preprint and has not yet been peer-reviewed. It has 45 pages and 14 figures
Abstract:The high computational cost of phase field simulations remains a major limitation for predicting dendritic solidification in metals, particularly in additive manufacturing, where microstructural control is critical. This work presents a surrogate model for dendritic solidification that employs uncertainty-driven adaptive sampling with XGBoost and CNNs, including a self-supervised strategy, to efficiently approximate the spatio-temporal evolution while reducing costly phase field simulations. The proposed adaptive strategy leverages model uncertainty, approximated via Monte Carlo dropout for CNNs and bagging for XGBoost, to identify high-uncertainty regions where new samples are generated locally within hyperspheres, progressively refining the spatio-temporal design space and achieving accurate predictions with significantly fewer phase field simulations than an Optimal Latin Hypercube Sampling optimized via discrete Particle Swarm Optimization (OLHS-PSO). The framework systematically investigates how temporal instance selection, adaptive sampling, and the choice between domain-informed and data-driven surrogates affect spatio-temporal model performance. Evaluation considers not only computational cost but also the number of expensive phase field simulations, surrogate accuracy, and associated CO_2 emissions, providing a comprehensive assessment of model performance as well as their related environmental impact.
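The uncertainty-driven sampling loop described above can be sketched in one dimension. This is our simplified illustration, not the paper's pipeline: predictive uncertainty is taken as the spread of a bagged ensemble of cheap stand-in learners, and new samples are proposed near the most uncertain design point (the "hypersphere" collapses to an interval in 1-D).

```python
import numpy as np

rng = np.random.default_rng(0)

def bagging_uncertainty(X, y, X_query, n_models=20):
    """Ensemble mean and spread from bootstrap-resampled 1-NN learners."""
    preds = []
    n = len(X)
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)                    # bootstrap resample
        # cheap stand-in learner: 1-nearest-neighbor on the bootstrap sample
        nn = np.abs(X[idx][:, None] - X_query[None, :]).argmin(axis=0)
        preds.append(y[idx][nn])
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)

f = lambda x: np.sin(3 * x)                                 # toy expensive simulator
X = rng.uniform(0, 3, size=15)                              # initial design
y = f(X)
X_query = np.linspace(0, 3, 200)
_, sigma = bagging_uncertainty(X, y, X_query)
x_new = X_query[sigma.argmax()]                             # most uncertain point
proposals = x_new + rng.uniform(-0.1, 0.1, size=5)          # local refinement around it
```

In the paper the learners are XGBoost and CNN surrogates and the design space is spatio-temporal; the loop structure (fit, score uncertainty, sample locally, refit) is the same.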
[AI-173] High-Resolution Range Profile Classifiers Require Aspect-Angle Awareness
【速读】: This paper addresses a performance bottleneck in High-Resolution Range Profile (HRRP) classification caused by leaving target aspect-angle information unexploited: prior work often assumes the aspect angle is incomplete during training or unavailable at inference, limiting robustness to changes in target geometry. The key solution is to explicitly condition classifiers on the aspect angle, validated across multiple model architectures and conditioning strategies: experiments show that both single-profile and sequential classifiers benefit from aspect-angle awareness, with an average accuracy gain of about 7% and improvements of up to 10%. The authors further propose estimating aspect angles online with a causal Kalman filter (median error of only 5°) and show that training and inference with the estimated angles preserve most of the gains, supporting the method's feasibility and practicality in realistic settings.
链接: https://arxiv.org/abs/2603.00087
作者: Edwyn Brient,Santiago Velasco-Forero(CMM),Rami Kassab
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We revisit High-Resolution Range Profile (HRRP) classification with aspect-angle conditioning. While prior work often assumes that aspect-angle information is incomplete during training or unavailable at inference, we study a setting where angles are available for all training samples and explicitly provided to the classifier. Using three datasets and a broad range of conditioning strategies and model architectures, we show that both single-profile and sequential classifiers benefit consistently from aspect-angle awareness, with an average accuracy gain of about 7% and improvements of up to 10%, depending on the model and dataset. In practice, aspect angles are not directly measured and must be estimated. We show that a causal Kalman filter can estimate them online with a median error of 5°, and that training and inference with estimated angles preserves most of the gains, supporting the proposed approach in realistic conditions.
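The causal Kalman filter mentioned in the abstract can be sketched as a standard constant-velocity filter over noisy angle measurements. This is our generic illustration (state, noise variances, and the rotating-target data are assumptions, not the paper's configuration):

```python
import numpy as np

def kalman_angle_track(measurements, dt=1.0, q=1e-3, r=4.0):
    """Causal constant-velocity Kalman filter; state = [angle, angular_rate]."""
    F = np.array([[1.0, dt], [0.0, 1.0]])        # state transition
    H = np.array([[1.0, 0.0]])                   # we observe the angle only
    Q = q * np.array([[dt**3 / 3, dt**2 / 2],    # process noise (CV model)
                      [dt**2 / 2, dt]])
    R = np.array([[r]])                          # measurement noise variance
    x = np.array([[measurements[0]], [0.0]])     # initial state
    P = np.eye(2) * 10.0                         # initial covariance
    estimates = []
    for z in measurements:
        x = F @ x                                # predict
        P = F @ P @ F.T + Q
        S = H @ P @ H.T + R                      # update
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (np.array([[z]]) - H @ x)
        P = (np.eye(2) - K @ H) @ P
        estimates.append(float(x[0, 0]))
    return estimates

rng = np.random.default_rng(0)
true_angles = np.linspace(0.0, 30.0, 120)             # slowly rotating target
noisy = true_angles + rng.normal(0.0, 2.0, size=120)  # 2-degree sensor noise
est = kalman_angle_track(noisy)
raw_err = np.median(np.abs(noisy[60:] - true_angles[60:]))
kf_err = np.median(np.abs(np.array(est[60:]) - true_angles[60:]))
```

After a burn-in period the filtered median angle error drops well below the raw measurement noise, which is the effect the paper relies on to supply usable angle estimates at inference time.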
[AI-174] What Is the Geometry of the Alignment Tax?
【速读】: This paper aims to formally model and quantify the "alignment tax", i.e., the tradeoff between safety and capability in generative AI. The core solution is a geometric theory in representation space: under linear representation assumptions, the alignment tax rate is defined as the squared projection of the safety direction onto the capability subspace, and a Pareto frontier parameterized by the principal angle between the safety and capability subspaces is derived; this frontier is proved tight (achieved by perturbation) and shown to have a recursive structure. The theory further shows that safety-safety tradeoffs under capability constraints obey the same equation, with the angle replaced by the partial correlation between safety objectives given the capability directions. It also yields a decomposed scaling law splitting the alignment tax into an irreducible component determined by data structure and a packing residual that vanishes as O(m'/d) with model dimension d, providing a formal basis for understanding when alignment conflicts are reducible and how capability preservation mitigates conflicts among multiple safety objectives.
链接: https://arxiv.org/abs/2603.00047
作者: Robin Young
机构: 未知
类目: Econometrics (econ.EM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:
Abstract:The alignment tax is widely discussed but has not been formally characterized. We provide a geometric theory of the alignment tax in representation space. Under linear representation assumptions, we define the alignment tax rate as the squared projection of the safety direction onto the capability subspace and derive the Pareto frontier governing safety-capability tradeoffs, parameterized by a single quantity: the principal angle between the safety and capability subspaces. We prove this frontier is tight (achieved by perturbation) and show it has a recursive structure: safety-safety tradeoffs under capability constraints are governed by the same equation, with the angle replaced by the partial correlation between safety objectives given capability directions. We derive a scaling law decomposing the alignment tax into an irreducible component (determined by data structure) and a packing residual that vanishes as O(m'/d) with model dimension d, and establish conditions under which capability preservation mediates or resolves conflicts between safety objectives. We provide an account consistent with prior empirical findings and generate falsifiable predictions about per-task alignment tax rates and their scaling behavior.
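The central quantity in the abstract, the tax rate as a squared projection onto the capability subspace, is easy to compute directly. The vectors below are arbitrary illustrative examples, not data from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32                                          # representation dimension
# orthonormal basis of a 4-dimensional "capability" subspace
C = np.linalg.qr(rng.normal(size=(d, 4)))[0]
s = rng.normal(size=d)
s /= np.linalg.norm(s)                          # unit "safety" direction

proj = C @ (C.T @ s)                            # projection of s onto span(C)
tax_rate = float(proj @ proj)                   # squared projection, in [0, 1]

# principal angle between span(s) and the capability subspace:
# for a 1-dim subspace, cos(theta) = ||C^T s||
cos_theta = float(np.linalg.norm(C.T @ s))
theta = np.arccos(np.clip(cos_theta, 0.0, 1.0))
```

By construction `tax_rate == cos_theta**2` up to floating-point error, which is the sense in which a single angle parameterizes the safety-capability frontier.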
机器学习
[LG-0] Partial Causal Structure Learning for Valid Selective Conformal Inference under Interventions
链接: https://arxiv.org/abs/2603.02204
作者: Amir Asiaee,Kavey Aryan,James P. Long
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Selective conformal prediction can yield substantially tighter uncertainty sets when we can identify calibration examples that are exchangeable with the test example. In interventional settings, such as perturbation experiments in genomics, exchangeability often holds only within subsets of interventions that leave a target variable “unaffected” (e.g., non-descendants of an intervened node in a causal graph). We study the practical regime where this invariance structure is unknown and must be learned from data. Our contributions are: (i) a contamination-robust conformal coverage theorem that quantifies how misclassification of “unaffected” calibration examples degrades coverage via an explicit function g(\delta,n) of the contamination fraction and calibration set size, providing a finite-sample lower bound that holds for arbitrary contaminating distributions; (ii) a task-driven partial causal learning formulation that estimates only the binary descendant indicators Z_{a,i}=\mathbf{1}\{i\in\mathrm{desc}(a)\} needed for selective calibration, rather than the full causal graph; and (iii) algorithms for descendant discovery via perturbation intersection patterns (differentially affected variable set intersections across interventions), and for approximate distance-to-intervention estimation via local invariant causal prediction. We provide recovery conditions under which contamination is controlled. Experiments on synthetic linear structural equation models (SEMs) validate the bound: under controlled contamination up to \delta=0.30 , the corrected procedure maintains \ge 0.95 coverage while uncorrected selective CP degrades to 0.867 . A proof-of-concept on Replogle K562 CRISPR interference (CRISPRi) perturbation data demonstrates applicability to real genomic screens.
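For readers unfamiliar with the base procedure being robustified, a minimal split-conformal sketch (standard conformal prediction with exchangeable calibration data, not the paper's contamination-corrected variant) looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def conformal_quantile(cal_resid, alpha=0.05):
    """Finite-sample-corrected quantile of calibration residuals."""
    n = len(cal_resid)
    k = int(np.ceil((n + 1) * (1 - alpha)))      # ceil((n+1)(1-alpha))-th order stat
    return np.sort(cal_resid)[min(k, n) - 1]

# toy predictor: the calibration mean; residuals are absolute errors
cal_y = rng.normal(0.0, 1.0, size=500)
test_y = rng.normal(0.0, 1.0, size=5000)        # exchangeable with calibration
mu = cal_y.mean()
q = conformal_quantile(np.abs(cal_y - mu), alpha=0.05)

# interval [mu - q, mu + q] should cover about 95% of exchangeable test points
coverage = np.mean(np.abs(test_y - mu) <= q)
```

The paper's contamination question arises when some calibration points are *not* exchangeable with the test point (misclassified as "unaffected" by an intervention), which biases `q` and degrades `coverage` by an amount their theorem bounds via g(δ, n).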
[LG-1] Frontier Models Can Take Actions at Low Probabilities
链接: https://arxiv.org/abs/2603.02202
作者: Alex Serrano,Wen Xing,David Lindner,Erik Jenner
类目: Machine Learning (cs.LG)
*备注:
Abstract:Pre-deployment evaluations inspect only a limited sample of model actions. A malicious model seeking to evade oversight could exploit this by randomizing when to “defect”: misbehaving so rarely that no malicious actions are observed during evaluation, but often enough that they occur eventually in deployment. But this requires taking actions at very low rates, while maintaining calibration. Are frontier models even capable of that? We prompt the GPT-5, Claude-4.5 and Qwen-3 families to take a target action at low probabilities (e.g. 0.01%), either given directly or requiring derivation, and evaluate their calibration (i.e. whether they perform the target action roughly 1 in 10,000 times when resampling). We find that frontier models are surprisingly good at this task. If there is a source of entropy in-context (such as a UUID), they maintain high calibration at rates lower than 1 in 100,000 actions. Without external entropy, some models can still reach rates lower than 1 in 10,000. When target rates are given, larger models achieve good calibration at lower rates. Yet, when models must derive the optimal target rate themselves, all models fail to achieve calibration without entropy or hint to generate it. Successful low-rate strategies require explicit Chain-of-Thought (CoT) reasoning, so malicious models attempting this approach could currently be caught by a CoT monitor. However, scaling trends suggest future evaluations may be unable to rely on models’ lack of target rate calibration, especially if CoT is no longer legible.
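The "source of entropy in-context" strategy the abstract describes can be made concrete with a sketch (our illustration of the mechanism, not the paper's prompt): hash an in-context UUID and act only when the hash lands in a 1-in-10,000 bucket. The rule is deterministic per context, yet across independently sampled contexts the action fires at the target rate.

```python
import hashlib
import uuid

def should_defect(context_uuid: str, rate: float = 1e-4) -> bool:
    """Deterministic per-context rule that fires at `rate` across contexts."""
    h = hashlib.sha256(context_uuid.encode()).hexdigest()
    bucket = int(h, 16) % 10_000        # SHA-256 output is effectively uniform
    return bucket < rate * 10_000       # e.g. bucket == 0 for rate = 1e-4

# empirical check over many distinct contexts
draws = [should_defect(str(uuid.UUID(int=i))) for i in range(200_000)]
empirical_rate = sum(draws) / len(draws)
```

This also illustrates why the paper's CoT-monitoring observation matters: executing such a rule requires explicit intermediate computation (hashing, bucketing) that is visible in a reasoning trace.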
[LG-2] Multi-Head Low-Rank Attention ICLR2026
链接: https://arxiv.org/abs/2603.02188
作者: Songtao Liu,Hongwu Peng,Zhiwei Zhang,Zhengyu Chen,Yue Guo
类目: Machine Learning (cs.LG)
*备注: Accepted by ICLR 2026
Abstract:Long-context inference in large language models is bottlenecked by Key–Value (KV) cache loading during the decoding stage, where the sequential nature of generation requires repeatedly transferring the KV cache from off-chip High-Bandwidth Memory (HBM) to on-chip Static Random-Access Memory (SRAM) at each step. While Multi-Head Latent Attention (MLA) significantly reduces the total KV cache size, it suffers from a sharding bottleneck during distributed decoding via Tensor Parallelism (TP). Since its single latent head cannot be partitioned, each device is forced to redundantly load the complete KV cache for every token, consuming excessive memory traffic and diminishing TP benefits like weight sharding. In this work, we propose Multi-Head Low-Rank Attention (MLRA), which enables partitionable latent states for efficient 4-way TP decoding. Extensive experiments show that MLRA achieves state-of-the-art perplexity and downstream task performance, while also delivering a 2.8 \times decoding speedup over MLA. Code is available at this https URL. Pretrained weights, along with the training and evaluation data, are available at this https URL.
[LG-3] De-paradox Tree: Breaking Down Simpson's Paradox via A Kernel-Based Partition Algorithm
链接: https://arxiv.org/abs/2603.02174
作者: Xian Teng,Yu-Ru Lin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Real-world observational datasets and machine learning have revolutionized data-driven decision-making, yet many models rely on empirical associations that may be misleading due to confounding and subgroup heterogeneity. Simpson’s paradox exemplifies this challenge, where aggregated and subgroup-level associations contradict each other, leading to misleading conclusions. Existing methods provide limited support for detecting and interpreting such paradoxical associations, especially for practitioners without deep causal expertise. We introduce De-paradox Tree, an interpretable algorithm designed to uncover hidden subgroup patterns behind paradoxical associations under assumed causal structures involving confounders and effect heterogeneity. It employs novel split criteria and balancing-based procedures to adjust for confounders and homogenize heterogeneous effects through recursive partitioning. Compared to state-of-the-art methods, De-paradox Tree builds simpler, more interpretable trees, selects relevant covariates, and identifies nested opposite effects while ensuring robust estimation of causal effects when causally admissible variables are provided. Our approach addresses the limitations of traditional causal inference and machine learning methods by introducing an interpretable framework that supports non-expert practitioners while explicitly acknowledging causal assumptions and scope limitations, enabling more reliable and informed decision-making in complex observational data environments.
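The phenomenon the paper targets is easy to reproduce numerically. Below is the classic kidney-stone treatment table often used to illustrate Simpson's paradox (not data from this paper): the treatment has a higher success rate within *each* severity subgroup, yet a lower rate in aggregate, because the groups are unevenly mixed.

```python
#                 treated: (successes, trials)   control: (successes, trials)
groups = {
    "mild":   ((81, 87),   (234, 270)),
    "severe": ((192, 263), (55, 80)),
}

def rate(successes, trials):
    return successes / trials

# treated beats control inside every subgroup ...
per_group_better = all(rate(*t) > rate(*c) for t, c in groups.values())

# ... but loses once the subgroups are pooled
agg_treated = rate(sum(g[0][0] for g in groups.values()),
                   sum(g[0][1] for g in groups.values()))
agg_control = rate(sum(g[1][0] for g in groups.values()),
                   sum(g[1][1] for g in groups.values()))
```

De-paradox Tree's recursive partitioning is aimed at surfacing exactly this kind of subgroup structure (here, severity acting as a confounder) rather than reporting only the misleading aggregate comparison.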
[LG-4] Machine Learning (ML) library in Linux kernel
链接: https://arxiv.org/abs/2603.02145
作者: Viacheslav Dubeyko
类目: Machine Learning (cs.LG); Operating Systems (cs.OS)
*备注:
Abstract:Linux kernel is a huge code base with enormous number of subsystems and possible configuration options that results in unmanageable complexity of elaborating an efficient configuration. Machine Learning (ML) is approach/area of learning from data, finding patterns, and making predictions without implementing algorithms by developers that can introduce a self-evolving capability in Linux kernel. However, introduction of ML approaches in Linux kernel is not easy way because there is no direct use of floating-point operations (FPU) in kernel space and, potentially, ML models can be a reason of significant performance degradation in Linux kernel. Paper suggests the ML infrastructure architecture in Linux kernel that can solve the declared problem and introduce of employing ML models in kernel space. Suggested approach of kernel ML library has been implemented as Proof Of Concept (PoC) project with the goal to demonstrate feasibility of the suggestion and to design the interface of interaction the kernel-space ML model proxy and the ML model user-space thread.
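The FPU restriction the abstract mentions is commonly worked around with fixed-point arithmetic. The following is a generic Q16.16 sketch of that idea (our illustration in Python for readability; the paper's PoC is kernel C and its actual design is not shown here):

```python
SHIFT = 16  # Q16.16: 16 integer bits, 16 fractional bits

def to_fixed(x: float) -> int:
    """Convert a float to Q16.16 (done once, outside the kernel hot path)."""
    return int(round(x * (1 << SHIFT)))

def fixed_mul(a: int, b: int) -> int:
    """Integer-only multiply of two Q16.16 values."""
    return (a * b) >> SHIFT

def fixed_dot(ws, xs):
    """Integer-only inner product, as an in-kernel model might compute it."""
    acc = 0
    for w, x in zip(ws, xs):
        acc += fixed_mul(w, x)
    return acc

w = [to_fixed(v) for v in (0.5, -1.25, 2.0)]    # model weights
x = [to_fixed(v) for v in (1.0, 0.8, 0.25)]     # input features
y = fixed_dot(w, x) / (1 << SHIFT)              # back to float for display only
# exact float result would be 0.5*1.0 - 1.25*0.8 + 2.0*0.25 = 0.0
```

Everything between `to_fixed` and the final display division uses only integer operations, which is the property that makes such inference viable in kernel space without touching the FPU.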
[LG-5] Stochastic Multi-Armed Bandits with Limited Control Variates
链接: https://arxiv.org/abs/2603.02100
作者: Arun Verma,Manjesh Kumar Hanawal,Arun Rajkumar
类目: Machine Learning (cs.LG)
*备注: Accepted at COMSNETS 2026
Abstract:Motivated by wireless networks where interference or channel state estimates provide partial insight into throughput, we study a variant of the classical stochastic multi-armed bandit problem in which the learner has limited access to auxiliary information. Recent work has shown that such auxiliary information, when available as control variates, can be used to get tighter confidence bounds, leading to lower regret. However, existing works assume that control variates are available in every round, which may not be realistic in several real-life scenarios. To address this, we propose UCB-LCV, an upper confidence bound (UCB) based algorithm that effectively combines the estimators obtained from rewards and control variates. When there is no control variate, UCB-LCV reduces to a novel algorithm that we call UCB-NORMAL, outperforming existing algorithms for the standard MAB setting with normally distributed rewards. Finally, we discuss variants of the proposed UCB-LCV that apply to general distributions and experimentally demonstrate that UCB-LCV outperforms existing bandit algorithms.
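The variance-reduction mechanism behind control variates can be sketched as follows (our construction of the textbook control-variate mean estimator, not the paper's exact UCB-LCV statistic): a side observation with known expectation, correlated with the reward, is used to shrink the variance of the reward-mean estimate, which in turn tightens confidence bounds.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000
z = rng.normal(0.0, 1.0, size=n)                # control variate, E[z] = 0 known
reward = 0.5 + 0.8 * z + rng.normal(0.0, 0.3, size=n)   # true mean = 0.5

naive = reward.mean()                           # plain sample mean
beta = np.cov(reward, z)[0, 1] / z.var()        # estimated optimal CV coefficient
cv = (reward - beta * (z - 0.0)).mean()         # control-variate estimator

# per-sample variance of each estimator of the mean
var_naive = reward.var() / n
var_cv = (reward - beta * z).var() / n          # much smaller when |corr| is high
```

In a bandit algorithm the tighter `var_cv` translates into narrower confidence intervals for the adjusted arms, which is the source of UCB-LCV's regret improvement when control variates are observed.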
[LG-6] Adam Converges Without Any Modification On Update Rules
链接: https://arxiv.org/abs/2603.02092
作者: Yushun Zhang,Bingran Li,Congliang Chen,Zhi-Quan Luo,Ruoyu Sun
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 66 pages
Abstract:Adam is the default algorithm for training neural networks, including large language models (LLMs). However, Reddi et al. (2019) provided an example where Adam diverges, raising concerns for its deployment in AI model training. We identify a key mismatch between the divergence example and practice: Reddi et al. (2019) pick the problem after picking the hyperparameters of Adam, i.e., (\beta_1,\beta_2) ; while practical applications often fix the problem first and then tune (\beta_1,\beta_2) . In this work, we prove that Adam converges with proper problem-dependent hyperparameters. First, we prove that Adam converges when \beta_2 is large and \beta_1 < \sqrt{\beta_2} . Second, when \beta_2 is small, we point out a region of (\beta_1,\beta_2) combinations where Adam can diverge to infinity. Our results indicate a phase transition for Adam from divergence to convergence when changing the (\beta_1, \beta_2) combination. To our knowledge, this is the first phase transition in the (\beta_1,\beta_2) 2D-plane reported in the literature, providing rigorous theoretical guarantees for the Adam optimizer. We further point out that the critical boundary (\beta_1^*, \beta_2^*) is problem-dependent, and particularly, dependent on batch size. This provides suggestions on how to tune \beta_1 and \beta_2 : when Adam does not work well, we suggest tuning up \beta_2 inversely with batch size to surpass the threshold \beta_2^* , and then trying \beta_1 < \sqrt{\beta_2} . Our suggestions are supported by reports from several empirical studies, which observe improved LLM training performance when applying them.
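To make the hyperparameters in question concrete, here is the standard Adam update (public textbook form, not code from the paper). Note that the common defaults (β1, β2) = (0.9, 0.999) already satisfy the paper's convergence condition β1 < √β2, since 0.9 < 0.999**0.5 ≈ 0.9995.

```python
def adam_step(theta, grad, m, v, t, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update on a scalar parameter."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad * grad    # second-moment EMA
    m_hat = m / (1 - beta1 ** t)                 # bias corrections
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (v_hat ** 0.5 + eps), m, v

# sanity run on f(x) = x^2 (gradient 2x), whose minimum is at 0
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 3001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t)
```

The divergence examples the paper discusses live in the complementary region of the (β1, β2) plane; the tuning advice amounts to moving back across the problem-dependent boundary by raising β2.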
[LG-7] Accelerating PDE Surrogates via RL-Guided Mesh Optimization AISTATS2026
链接: https://arxiv.org/abs/2603.02066
作者: Yang Meng,Ruoxi Jiang,Zhuokai Zhao,Chong Liu,Rebecca Willett,Yuxin Chen
类目: Machine Learning (cs.LG)
*备注: Accepted at AISTATS 2026
Abstract:Deep surrogate models for parametric partial differential equations (PDEs) can deliver high-fidelity approximations but remain prohibitively data-hungry: training often requires thousands of fine-grid simulations, each incurring substantial computational cost. To address this challenge, we introduce RLMesh, an end-to-end framework for efficient surrogate training under limited simulation budget. The key idea is to use reinforcement learning (RL) to adaptively allocate mesh grid points non-uniformly within each simulation domain, focusing numerical resolution in regions most critical for accurate PDE solutions. A lightweight proxy model further accelerates RL training by providing efficient reward estimates without full surrogate retraining. Experiments on PDE benchmarks demonstrate that RLMesh achieves competitive accuracy to baselines but with substantially fewer simulation queries. These results show that solver-level spatial adaptivity can dramatically improve the efficiency of surrogate training pipelines, enabling practical deployment of learning-based PDE surrogates across a wide range of problems.
[LG-8] Never Saddle for Reparameterized Steepest Descent as Mirror Flow
链接: https://arxiv.org/abs/2603.02064
作者: Tom Jacobs,Chao Zhou,Rebekka Burkholz
类目: Machine Learning (cs.LG)
*备注:
Abstract:How does the choice of optimization algorithm shape a model’s ability to learn features? To address this question for steepest descent methods --including sign descent, which is closely related to Adam --we introduce steepest mirror flows as a unifying theoretical framework. This framework reveals how optimization geometry governs learning dynamics, implicit bias, and sparsity and it provides two explanations for why Adam and AdamW often outperform SGD in fine-tuning. Focusing on diagonal linear networks and deep diagonal linear reparameterizations (a simplified proxy for attention), we show that steeper descent facilitates both saddle-point escape and feature learning. In contrast, gradient descent requires unrealistically large learning rates to escape saddles, an uncommon regime in fine-tuning. Empirically, we confirm that saddle-point escape is a central challenge in fine-tuning. Furthermore, we demonstrate that decoupled weight decay, as in AdamW, stabilizes feature learning by enforcing novel balance equations. Together, these results highlight two mechanisms how steepest descent can aid modern optimization.
[LG-9] Expanding LLM Agent Boundaries with Strategy-Guided Exploration
链接: https://arxiv.org/abs/2603.02045
作者: Andrew Szot,Michael Kirchhof,Omar Attia,Alexander Toshev
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning (RL) has demonstrated notable success in post-training large language models (LLMs) as agents for tasks such as computer use, tool calling, and coding. However, exploration remains a central challenge in RL for LLM agents, especially as they operate in language-action spaces with complex observations and sparse outcome rewards. In this work, we address exploration for LLM agents by leveraging the ability of LLMs to plan and reason in language about the environment to shift exploration from low-level actions to higher-level language strategies. We thus propose Strategy-Guided Exploration (SGE), which first generates a concise natural-language strategy that describes what to do to make progress toward the goal, and then generates environment actions conditioned on that strategy. By exploring in the space of strategies rather than the space of actions, SGE induces structured and diverse exploration that targets different environment outcomes. To increase strategy diversity during RL, SGE introduces mixed-temperature sampling, which explores diverse strategies in parallel, along with a strategy reflection process that grounds strategy generation on the outcomes of previous strategies in the environment. Across UI interaction, tool-calling, coding, and embodied agent environments, SGE consistently outperforms exploration-focused RL baselines, improving both learning efficiency and final performance. We show that SGE enables the agent to learn to solve tasks too difficult for the base model.
[LG-10] Leave-One-Out Prediction for General Hypothesis Classes
链接: https://arxiv.org/abs/2603.02043
作者: Jian Qian,Jiachen Xu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Leave-one-out (LOO) prediction provides a principled, data-dependent measure of generalization, yet guarantees in fully transductive settings remain poorly understood beyond specialized models. We introduce Median of Level-Set Aggregation (MLSA), a general aggregation procedure based on empirical-risk level sets around the ERM. For arbitrary fixed datasets and losses satisfying a mild monotonicity condition, we establish a multiplicative oracle inequality for the LOO error of the form [ \mathrm{LOO}_S(\hat{h}) \le C \cdot \frac{1}{n} \min_{h\in H} L_S(h) + \frac{\mathrm{Comp}(S,H,\ell)}{n}, \qquad C > 1. ] The analysis is based on a local level-set growth condition controlling how the set of near-optimal empirical-risk minimizers expands as the tolerance increases. We verify this condition in several canonical settings. For classification with VC classes under the 0-1 loss, the resulting complexity scales as O(d \log n) , where d is the VC dimension. For finite hypothesis and density classes under bounded or log loss, it scales as O(\log |H|) and O(\log |P|) , respectively. For logistic regression with bounded covariates and parameters, a volumetric argument based on the empirical covariance matrix yields complexity scaling as O(d \log n) up to problem-dependent factors.
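The LOO error being bounded here is a simple, fully computable quantity. A minimal sketch for the simplest ERM, the sample mean under squared loss: refit on the n-1 remaining points, evaluate on the held-out one, and average.

```python
def loo_error(ys):
    """Leave-one-out squared error of the sample-mean ERM."""
    n = len(ys)
    total = sum(ys)
    err = 0.0
    for i, y in enumerate(ys):
        mean_without_i = (total - y) / (n - 1)   # ERM refit without point i
        err += (y - mean_without_i) ** 2
    return err / n

ys = [1.0, 2.0, 3.0, 4.0]
loo = loo_error(ys)   # exact value for this data: 80/9 / 4 = 20/9
```

MLSA replaces the naive refit with an aggregation over near-optimal hypotheses in an empirical-risk level set, but the quantity on the left-hand side of the oracle inequality is exactly this kind of averaged held-out loss.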
[LG-11] Latent attention on masked patches for flow reconstruction ICCS
链接: https://arxiv.org/abs/2603.02028
作者: Ben Eze,Luca Magri,Andrea Nóvoa
类目: Machine Learning (cs.LG)
*备注: 8 pages, 5 figures, submitted to ICCS (International Conference on Computational Science) 2026
Abstract:Vision transformers have demonstrated outstanding performance on image generation applications, but their adoption in scientific disciplines, like fluid dynamics, has been limited. We introduce the Latent Attention on Masked Patches (LAMP) model, an interpretable regression-based modified vision transformer designed for masked flow reconstruction. LAMP follows a three-fold strategy: (i) partition of each flow snapshot into patches, (ii) dimensionality reduction of each patch via patch-wise proper orthogonal decomposition, and (iii) reconstruction of the full field from a masked input using a single-layer transformer trained via closed-form linear regression. We test the method on two canonical 2D unsteady wakes: a wake past a bluff body, and a chaotic wake past a flat plate. We show that LAMP accurately reconstructs the full flow field from a 90%-masked and noisy input, across signal-to-noise ratios between 10 and 30 dB. Incorporating nonlinear measurement states can reduce the prediction error by up to an order of magnitude. The learned attention matrix yields physically interpretable multi-fidelity optimal sensor-placement maps. The modularity of the framework enables nonlinear compression and deep attention blocks, thereby providing an efficient baseline for nonlinear and high-dimensional masked flow reconstruction.
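Step (ii), patch-wise proper orthogonal decomposition, is an SVD of the stacked patch snapshots with only the leading modes retained. A sketch with illustrative shapes (synthetic low-rank patches, not the paper's flow data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_snapshots, patch_pixels, r = 200, 64, 8

# synthetic patches lying near an r-dimensional subspace, plus small noise
basis = rng.normal(size=(patch_pixels, r))
coeffs = rng.normal(size=(n_snapshots, r))
patches = coeffs @ basis.T + 0.01 * rng.normal(size=(n_snapshots, patch_pixels))

# POD modes = left singular vectors of the (pixels x snapshots) data matrix
U, S, _ = np.linalg.svd(patches.T, full_matrices=False)
modes = U[:, :r]                        # leading r orthonormal POD modes
codes = patches @ modes                 # reduced patch coordinates (n_snapshots x r)
recon = codes @ modes.T                 # rank-r reconstruction of the patches
rel_err = np.linalg.norm(recon - patches) / np.linalg.norm(patches)
```

In LAMP these reduced patch coordinates (`codes`) are what the single-layer transformer operates on, which is what makes a closed-form linear-regression fit tractable.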
[LG-12] CausalWrap: Model-Agnostic Causal Constraint Wrappers for Tabular Synthetic Data
链接: https://arxiv.org/abs/2603.02015
作者: Amir Asiaee,Zhuohui J. Liang,Chao Yan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Tabular synthetic data generators are typically trained to match observational distributions, which can yield high conventional utility (e.g., column correlations, predictive accuracy) yet poor preservation of structural relations relevant to causal analysis and out-of-distribution (OOD) reasoning. When the downstream use of synthetic data involves causal reasoning – estimating treatment effects, evaluating policies, or testing mediation pathways – merely matching the observational distribution is insufficient: structural fidelity and treatment-mechanism preservation become essential. We propose CausalWrap (CW), a model-agnostic wrapper that injects partial causal knowledge (PCK) – trusted edges, forbidden edges, and qualitative/monotonic constraints – into any pretrained base generator (GAN, VAE, or diffusion model), without requiring access to its internals. CW learns a lightweight, differentiable post-hoc correction map applied to samples from the base generator, optimized with causal penalty terms under an augmented-Lagrangian schedule. We provide theoretical results connecting penalty-based optimization to constraint satisfaction and relating approximate factorization to joint distributional control. We validate CW on simulated structural causal models (SCMs) with known ground-truth interventions, semi-synthetic causal benchmarks (IHDP and an ACIC-style suite), and a real-world ICU cohort (MIMIC-IV) with expert-elicited partial graphs. CW improves causal fidelity across diverse base generators – e.g., reducing average treatment effect (ATE) error by up to 63% on ACIC and lifting ATE agreement from 0.00 to 0.38 on the intensive care unit (ICU) cohort – while largely retaining conventional utility.
[LG-13] Noise-Calibrated Inference from Differentially Private Sufficient Statistics in Exponential Families
链接: https://arxiv.org/abs/2603.02010
作者: Amir Asiaee,Samhita Pal
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Many differentially private (DP) data release systems either output DP synthetic data and leave analysts to perform inference as usual, which can lead to severe miscalibration, or output a DP point estimate without a principled way to do uncertainty quantification. This paper develops a clean and tractable middle ground for exponential families: release only DP sufficient statistics, then perform noise-calibrated likelihood-based inference and optional parametric synthetic data generation as post-processing. Our contributions are: (1) a general recipe for approximate-DP release of clipped sufficient statistics under the Gaussian mechanism; (2) asymptotic normality, explicit variance inflation, and valid Wald-style confidence intervals for the plug-in DP MLE; (3) a noise-aware likelihood correction that is first-order equivalent to the plug-in but supports bootstrap-based intervals; and (4) a matching minimax lower bound showing the privacy distortion rate is unavoidable. The resulting theory yields concrete design rules and a practical pipeline for releasing DP synthetic data with principled uncertainty quantification, validated on three exponential families and real census data.
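The release step the abstract describes can be sketched with the standard analytic Gaussian mechanism (textbook form; the clip bound, ε, δ, and data below are illustrative assumptions, not the paper's settings): clip each record's sufficient statistic, sum, add calibrated Gaussian noise, and inflate the downstream variance accordingly.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sum(xs, clip, eps=1.0, delta=1e-5):
    """Approximate-DP release of a clipped sum via the Gaussian mechanism."""
    clipped = np.clip(xs, -clip, clip)
    sensitivity = 2 * clip                                   # one record moves the sum by <= 2C
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / eps
    return clipped.sum() + rng.normal(0.0, sigma), sigma

xs = rng.normal(1.0, 1.0, size=50_000)                       # private data, true mean 1.0
noisy_sum, sigma = dp_sum(xs, clip=5.0)
n = len(xs)
mle = noisy_sum / n                                          # plug-in DP mean estimate

# noise-calibrated standard error: sampling variance + privacy-noise variance
se = np.sqrt(xs.var() / n + (sigma / n) ** 2)
```

The paper's point is the last line: valid inference must add the privacy term (σ/n)² to the usual sampling variance rather than treating the released statistic as noise-free.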
[LG-14] mporal Representations for Exploration: Learning Complex Exploratory Behavior without Extrinsic Rewards
链接: https://arxiv.org/abs/2603.02008
作者: Faisal Mohamed,Catherine Ji,Benjamin Eysenbach,Glen Berseth
类目: Machine Learning (cs.LG)
*备注:
Abstract:Effective exploration in reinforcement learning requires not only tracking where an agent has been, but also understanding how the agent perceives and represents the world. To learn powerful representations, an agent should actively explore states that contribute to its knowledge of the environment. Temporal representations can capture the information necessary to solve a wide range of potential tasks while avoiding the computational cost associated with full state reconstruction. In this paper, we propose an exploration method that leverages temporal contrastive representations to guide exploration, prioritizing states with unpredictable future outcomes. We demonstrate that such representations can enable the learning of complex exploratory behavior in locomotion, manipulation, and embodied-AI tasks, revealing capabilities and behaviors that traditionally require extrinsic rewards. Unlike approaches that rely on explicit distance learning or episodic memory mechanisms (e.g., quasimetric-based methods), our method builds directly on temporal similarities, yielding a simpler yet effective strategy for exploration.
[LG-15] Accurate private secure federated U-statistics with higher degree
链接: https://arxiv.org/abs/2603.01986
作者: Quentin Sinh(MAGNET),Jan Ramon(MAGNET)
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:We study the problem of computing a U-statistic with a kernel function f of degree k \ge 2, i.e., the average of some function f over all k-tuples of instances, in a federated learning setting. U-statistics of degree 2 include several useful statistics such as Kendall’s \tau coefficient, the Area under the Receiver-Operator Curve and the Gini mean difference. Existing methods provide solutions only under the lower-utility local differential privacy model and/or scale poorly in the size of the domain discretization. In this work, we propose a protocol that securely computes U-statistics of degree k \ge 2 under central differential privacy by leveraging Multi-Party Computation (MPC). Our method substantially improves accuracy when compared to prior solutions. We provide a detailed theoretical analysis of its accuracy, communication and computational properties. We evaluate its performance empirically, obtaining favorable results, e.g., for Kendall’s \tau coefficient, our approach reduces the Mean Squared Error by up to four orders of magnitude over existing baselines.
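A degree-2 U-statistic averages a kernel f over all unordered pairs of instances. Kendall's τ, mentioned in the abstract, uses the standard concordance kernel; the data below is a made-up example:

```python
from itertools import combinations

def u_statistic(data, f):
    """Average of kernel f over all unordered pairs (degree-2 U-statistic)."""
    pairs = list(combinations(data, 2))
    return sum(f(a, b) for a, b in pairs) / len(pairs)

def concordance(p, q):
    """Kendall kernel: +1 for a concordant pair, -1 discordant, 0 tied."""
    (x1, y1), (x2, y2) = p, q
    s = (x1 - x2) * (y1 - y2)
    return 1.0 if s > 0 else (-1.0 if s < 0 else 0.0)

data = [(1, 2), (2, 3), (3, 1), (4, 4)]
tau = u_statistic(data, concordance)   # 4 concordant, 2 discordant pairs -> 1/3
```

The federated difficulty the paper addresses is that the pairs span records held by *different* parties, so this all-pairs sum cannot be computed locally, motivating the MPC protocol.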
[LG-16] CoVAE: correlated multimodal generative modeling
链接: https://arxiv.org/abs/2603.01965
作者: Federico Caretti,Guido Sanguinetti
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Multimodal Variational Autoencoders have emerged as a popular tool to extract effective representations from rich multimodal data. However, such models rely on fusion strategies in latent space that destroy the joint statistical structure of the multimodal data, with profound implications for generation and uncertainty quantification. In this work, we introduce Correlated Variational Autoencoders (CoVAE), a new generative architecture that captures the correlations between modalities. We test CoVAE on a number of real and synthetic data sets demonstrating both accurate cross-modal reconstruction and effective quantification of the associated uncertainties.
[LG-17] The Expressive Limits of Diagonal SSMs for State-Tracking ICLR2026
链接: https://arxiv.org/abs/2603.01959
作者: Mehran Shakerinava,Behnoush Khavari,Siamak Ravanbakhsh,Sarath Chandar
类目: Machine Learning (cs.LG)
*备注: 18 pages, 5 figures, 4 tables. Accepted at ICLR 2026
Abstract:State-Space Models (SSMs) have recently been shown to achieve strong empirical performance on a variety of long-range sequence modeling tasks while remaining efficient and highly-parallelizable. However, the theoretical understanding of their expressive power remains limited. In this work, we study the expressivity of input-Dependent Complex-valued Diagonal (DCD) SSMs on sequential state-tracking tasks. We show that single-layer DCD SSMs cannot express state-tracking of any non-Abelian group at finite precision. More generally, we show that k-layer DCD SSMs can express state-tracking of a group if and only if that group has a subnormal series of length k with Abelian factors. That is, we identify the precise expressivity range of k-layer DCD SSMs within the solvable groups. Empirically, we find that multi-layer models often fail to learn state-tracking for non-Abelian groups, highlighting a gap between expressivity and learnability.
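A minimal sketch of what a diagonal SSM can express: with input-dependent diagonal (here scalar) transitions, the updates commute, so an Abelian task such as Z_2 parity is trackable in a single layer. This toy example is our own illustration of the setting, not code from the paper:

```python
# Illustrative sketch (our assumption-laden toy, not the paper's model): a
# single diagonal recurrence h_t = a(x_t) * h_{t-1} can track an Abelian
# group such as Z_2 parity, because diagonal updates commute.
def parity_ssm(bits):
    h = 1.0                            # state: +1 = even parity, -1 = odd
    for b in bits:
        a = -1.0 if b == 1 else 1.0    # input-dependent diagonal transition
        h = a * h                      # diagonal (scalar) update
    return 0 if h > 0 else 1

even = parity_ssm([1, 0, 1, 0])   # two ones -> even parity
odd = parity_ssm([1, 1, 1])       # three ones -> odd parity
```

Non-Abelian state-tracking (e.g., permutation composition) has no such commuting diagonal encoding, which is the intuition behind the paper's single-layer impossibility result.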
[LG-18] Accelerating Single-Pass SGD for Generalized Linear Prediction
链接: https://arxiv.org/abs/2603.01951
作者: Qian Chen,Shihong Ding,Cong Fang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 50 pages
Abstract:We study generalized linear prediction under a streaming setting, where each iteration uses only one fresh data point for a gradient-level update. While momentum is well-established in deterministic optimization, a fundamental open question is whether it can accelerate such single-pass non-quadratic stochastic optimization. We propose the first algorithm that successfully incorporates momentum via a novel data-dependent proximal method, achieving dual-momentum acceleration. Our derived excess risk bound decomposes into three components: an improved optimization error, a minimax optimal statistical error, and a higher-order model-misspecification error. The proof handles mis-specification via a fine-grained stationary analysis of inner updates, while localizing statistical error through a two-phase outer-loop analysis. As a result, we resolve the open problem posed by Jain et al. [2018a] and demonstrate that momentum acceleration is more effective than variance reduction for generalized linear prediction in the streaming setting.
[LG-19] BAED: a New Paradigm for Few-shot Graph Learning with Explanation in the Loop
链接: https://arxiv.org/abs/2603.01941
作者: Chao Chen,Xujia Li,Dongsheng Hong,Shanshan Lin,Xiangwen Liao,Chuanyi Liu,Lei Chen
类目: Machine Learning (cs.LG)
*备注: Accepted to Neural Networks 2026
Abstract:The challenges of training and inference in few-shot environments persist in the area of graph representation learning. The quality and quantity of labels are often insufficient due to the extensive expert knowledge required to annotate graph data. In this context, Few-Shot Graph Learning (FSGL) approaches have been developed over the years. Through sophisticated neural architectures and customized training pipelines, these approaches enhance model adaptability to new label distributions. However, compromises in the model’s robustness and interpretability can result in overfitting to noise in labeled data and degraded performance. This paper introduces the first explanation-in-the-loop framework for the FSGL problem, called BAED. We employ the belief propagation algorithm in a novel way to facilitate label augmentation on graphs. Then, leveraging an auxiliary graph neural network and the gradient backpropagation method, our framework effectively extracts explanatory subgraphs surrounding target nodes. The final predictions are based on these informative subgraphs while mitigating the influence of redundant information from neighboring nodes. Extensive experiments on seven benchmark datasets demonstrate superior prediction accuracy, training efficiency, and explanation quality of BAED. As a pioneering effort, this work highlights the potential of the explanation-based research paradigm in FSGL.
[LG-20] Bound Propagation meets Constraint Simplification: Improving Logic-based XAI for Neural Networks
链接: https://arxiv.org/abs/2603.01923
作者: Ronaldo Gomes,Jairo Ribeiro,Luiz Queiroz,Thiago Alves Rocha
类目: Logic in Computer Science (cs.LO); Machine Learning (cs.LG)
*备注: Preprint version. For the final published version, see the DOI below
Abstract:Logic-based methods for explaining neural network decisions offer formal guarantees of correctness and non-redundancy, but they often suffer from high computational costs, especially for large networks. In this work, we improve the efficiency of such methods by combining bound propagation with constraint simplification. These simplifications, derived from the propagation, tighten neuron bounds and eliminate unnecessary binary variables, making the explanation process more efficient. Our experiments suggest that combining these techniques reduces explanation time by up to 89.26%, particularly for larger neural networks.
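The bound-propagation ingredient can be sketched with interval arithmetic: propagating an input box through an affine layer yields pre-activation bounds, and any ReLU whose upper bound is non-positive is provably inactive, so its binary variable can be dropped from a MILP-style explanation encoding. This is a generic illustration, not the authors' implementation:

```python
# Minimal sketch (our illustration, not the paper's tool): interval bound
# propagation through one affine layer. Tightened neuron bounds can prove a
# ReLU stably active/inactive, eliminating its binary variable.
import numpy as np

def affine_bounds(W, b, lo, hi):
    """Propagate the input box [lo, hi] through y = W x + b."""
    center, radius = (lo + hi) / 2.0, (hi - lo) / 2.0
    mid = W @ center + b
    rad = np.abs(W) @ radius          # worst-case deviation per output
    return mid - rad, mid + rad

W = np.array([[1.0, -1.0], [2.0, 0.5]])
b = np.array([0.0, -3.0])
lo, hi = np.array([0.0, 0.0]), np.array([1.0, 1.0])
pre_lo, pre_hi = affine_bounds(W, b, lo, hi)
# A neuron with pre_hi <= 0 is provably inactive on the whole box.
stable_inactive = pre_hi <= 0
```

Here the second neuron's upper bound is negative, so its ReLU never fires on the box and no binary variable is needed for it.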
[LG-21] SEAR: Sample Efficient Action Chunking Reinforcement Learning
链接: https://arxiv.org/abs/2603.01891
作者: C. F. Maximilian Nagy,Onur Celik,Emiliyan Gospodinov,Florian Seligmann,Weiran Liao,Aryan Kaushik,Gerhard Neumann
类目: Machine Learning (cs.LG)
*备注:
Abstract:Action chunking can improve exploration and value estimation in long-horizon reinforcement learning, but makes learning substantially harder since the critic must evaluate action sequences rather than single actions, greatly increasing approximation and data efficiency challenges. As a result, existing action chunking methods, primarily designed for the offline and offline-to-online settings, have not achieved strong performance in purely online reinforcement learning. We introduce SEAR, an off-policy online reinforcement learning algorithm for action chunking. It exploits the temporal structure of action chunks and operates with a receding horizon, effectively combining the benefits of small and large chunk sizes. SEAR outperforms state-of-the-art online reinforcement learning methods on Metaworld, training with chunk sizes up to 20.
[LG-22] Generalizing Logic-based Explanations for Machine Learning Classifiers via Optimization
链接: https://arxiv.org/abs/2603.01870
作者: Francisco Mateus Rocha Filho,Ajalmar Rêgo da Rocha Neto,Thiago Alves Rocha
类目: Logic in Computer Science (cs.LO); Machine Learning (cs.LG)
*备注: Preprint version. For the final published version, see the DOI below
Abstract:Machine learning models support decision-making, yet the reasons behind their predictions are opaque. Clear and reliable explanations help users make informed decisions and avoid blindly trusting model outputs. However, many existing explanation methods fail to guarantee correctness. Logic-based approaches ensure correctness but often offer overly constrained explanations, limiting coverage. Recent work addresses this by incrementally expanding explanations while maintaining correctness. This process is performed separately for each feature, adjusting both its upper and lower bounds. However, this approach faces a trade-off: smaller increments incur high computational costs, whereas larger ones may lead to explanations covering fewer instances. To overcome this, we propose two novel methods. Onestep builds upon this prior work, generating explanations in a single step for each feature and each bound, eliminating the overhead of an iterative process. Twostep takes a gradual approach, improving coverage. Experimental results show that Twostep significantly increases explanation coverage (by up to 72.60% on average across datasets) compared to Onestep and, consequently, to prior work.
[LG-23] Trivial Graph Features and Classical Learning are Enough to Detect Random Anomalies
链接: https://arxiv.org/abs/2603.01841
作者: Matthieu Latapy,Stephany Rajeh
类目: Machine Learning (cs.LG)
*备注:
Abstract:Detecting anomalies in link streams that represent various kinds of interactions is an important research topic with crucial applications. Because of the lack of ground truth data, proposed methods are mostly evaluated through their ability to detect randomly injected links. In contrast with most proposed methods, which rely on complex approaches raising computational and/or interpretability issues, we show here that trivial graph features and classical learning techniques are sufficient to detect such anomalies extremely well. This basic approach has very low computational costs and it leads to easily interpretable results. It also has many other desirable properties that we study through an extensive set of experiments. We conclude that detection methods should now target more complex kinds of anomalies.
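A toy example of the "trivial features suffice" observation: in a clustered graph, even a single feature such as the number of common neighbors separates a randomly injected cross-cluster link from a real one. This is our own illustration, not the paper's experimental pipeline:

```python
# Toy illustration (not the paper's pipeline): the trivial feature
# "number of common neighbors" already separates a randomly injected link
# from a real one in a clustered graph.
def common_neighbors(adj, u, v):
    return len(adj.get(u, set()) & adj.get(v, set()))

# Two triangles with no edge between them; an injected link crosses them.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

real_score = common_neighbors(adj, 0, 1)       # shares neighbor 2
injected_score = common_neighbors(adj, 0, 3)   # cross-cluster: shares none
```

Feeding a handful of such features into any classical classifier is the kind of low-cost, interpretable baseline the abstract argues is enough for randomly injected links.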
[LG-24] Constrained Particle Seeking: Solving Diffusion Inverse Problems with Just Forward Passes AAAI2026
链接: https://arxiv.org/abs/2603.01837
作者: Hongkun Dou,Zike Chen,Zeyu Li,Hongjue Li,Lijun Yang,Yue Deng
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by AAAI 2026
Abstract:Diffusion models have gained prominence as powerful generative tools for solving inverse problems due to their ability to model complex data distributions. However, existing methods typically rely on complete knowledge of the forward observation process to compute gradients for guided sampling, limiting their applicability in scenarios where such information is unavailable. In this work, we introduce Constrained Particle Seeking (CPS), a novel gradient-free approach that leverages all candidate particle information to actively search for the optimal particle while incorporating constraints aligned with high-density regions of the unconditional prior. Unlike previous methods that passively select promising candidates, CPS reformulates the inverse problem as a constrained optimization task, enabling more flexible and efficient particle seeking. We demonstrate that CPS can effectively solve both image and scientific inverse problems, achieving results comparable to gradient-based methods while significantly outperforming gradient-free alternatives. Code is available at this https URL.
[LG-25] Uncertainty Quantification of Click and Conversion Estimates for the Autobidding
链接: https://arxiv.org/abs/2603.01825
作者: Ivan Zhigalskii,Andrey Pudovikov,Aleksandr Katrutsa,Egor Samosvat
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
*备注: 17 pages (10 main text + 7 appendix), 5 figures, 2 tables
Abstract:Modern e-commerce platforms employ various auction mechanisms to allocate paid slots for a given item. To scale this approach to the millions of auctions, the platforms suggest promotion tools based on the autobidding algorithms. These algorithms typically depend on the Click-Through-Rate (CTR) and Conversion-Rate (CVR) estimates provided by a pre-trained machine learning model. However, the predictions of such models are uncertain and can significantly affect the performance of the autobidding algorithm. To address this issue, we propose the DenoiseBid method, which corrects the generated CTRs and CVRs to make the resulting bids more efficient in auctions. The underlying idea of our method is to employ a Bayesian approach and replace noisy CTR or CVR estimates with those from recovered distributions. To demonstrate the performance of the proposed approach, we perform extensive experiments on the synthetic, iPinYou, and BAT datasets. To evaluate the robustness of our approach to the noise scale, we use synthetic noise and noise estimated from the predictions of the pre-trained machine learning model.
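The Bayesian correction idea can be illustrated with the simplest possible instance: replacing a noisy empirical CTR with the posterior mean under a Beta prior, which shrinks low-volume estimates toward the prior mean. DenoiseBid itself recovers distributions from model predictions; the function name and prior parameters below are our assumptions:

```python
# Hedged illustration of the Bayesian idea (DenoiseBid is more elaborate):
# shrink a noisy per-item CTR toward a prior mean using a Beta(alpha, beta)
# prior and the posterior-mean estimator.
def shrunk_ctr(clicks, impressions, alpha=2.0, beta=98.0):
    """Posterior mean of CTR under a Beta prior (prior mean = 2%)."""
    return (clicks + alpha) / (impressions + alpha + beta)

raw = 3 / 10                      # 30% CTR from only 10 impressions: noisy
denoised = shrunk_ctr(3, 10)      # pulled strongly toward the 2% prior
```

With only 10 impressions the posterior mean stays close to the prior; as impressions grow, it converges to the empirical CTR, which is exactly the denoising behavior an autobidder wants.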
[LG-26] GCTAM: Global and Contextual Truncated Affinity Combined Maximization Model For Unsupervised Graph Anomaly Detection IJCAI2025
链接: https://arxiv.org/abs/2603.01806
作者: Xiong Zhang,Hong Peng,Zhenli He,Cheng Xie,Xin Jin,Hua Jiang
类目: Social and Information Networks (cs.SI); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Accepted by IJCAI 2025
Abstract:Anomalies often occur in real-world information networks/graphs, such as malevolent users, malicious comments, banned users, and fake news in social graphs. The latest graph anomaly detection methods use a novel mechanism called truncated affinity maximization (TAM) to detect anomaly nodes without using any label information and achieve impressive results. TAM maximizes the affinities among the normal nodes while truncating the affinities of the anomalous nodes to identify the anomalies. However, existing TAM-based methods truncate suspicious nodes according to a rigid threshold that ignores the specificity and high-order affinities of different nodes. This inevitably causes inefficient truncations from both normal and anomalous nodes, limiting the effectiveness of anomaly detection. To this end, this paper proposes a novel truncation model combining contextual and global affinity to truncate the anomalous nodes. The core idea of the work is to use contextual truncation to decrease the affinity of anomalous nodes, while global truncation increases the affinity of normal nodes. Extensive experiments on massive real-world datasets show that our method surpasses peer methods in most graph anomaly detection tasks. In particular, compared with previous state-of-the-art methods, the proposed method achieves +15%~+20% improvements on two famous real-world datasets, Amazon and YelpChi. Notably, our method works well on the large datasets, Amazon-all and YelpChi-all, and achieves the best results, while most previous models cannot complete the tasks.
[LG-27] D3LM: A Discrete DNA Diffusion Language Model for Bidirectional DNA Understanding and Generation
链接: https://arxiv.org/abs/2603.01780
作者: Zhao Yang,Hengchang Liu,Chuan Cao,Bing Su
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注: Accepted as a workshop paper at MLGenX 2026
Abstract:Early DNA foundation models adopted BERT-style training, achieving good performance on DNA understanding tasks but lacking generative capabilities. Recent autoregressive models enable DNA generation, but employ left-to-right causal modeling that is suboptimal for DNA, where regulatory relationships are inherently bidirectional. We present D3LM (Discrete DNA Diffusion Language Model), which unifies bidirectional representation learning and DNA generation through masked diffusion. D3LM directly adopts the Nucleotide Transformer (NT) v2 architecture but reformulates the training objective as masked diffusion in discrete DNA space, enabling both bidirectional understanding and generation capabilities within a single model. Compared to NT v2 of the same size, D3LM achieves improved performance on understanding tasks. Notably, on regulatory element generation, D3LM achieves an SFID of 10.92, closely approaching real DNA sequences (7.85) and substantially outperforming the previous best result of 29.16 from autoregressive models. Our work suggests diffusion language models as a promising paradigm for unified DNA foundation models. We further present the first systematic study of masked diffusion models in the DNA domain, investigating practical design choices such as tokenization schemes and sampling strategies, thereby providing empirical insights and a solid foundation for future research. D3LM has been released at this https URL.
[LG-28] DGNet: Discrete Green Networks for Data-Efficient Learning of Spatiotemporal PDEs ICLR2026
链接: https://arxiv.org/abs/2603.01762
作者: Yingjie Tan,Quanming Yao,Yaqing Wang
类目: Machine Learning (cs.LG)
*备注: Accepted as a conference paper at ICLR 2026
Abstract:Spatiotemporal partial differential equations (PDEs) underpin a wide range of scientific and engineering applications. Neural PDE solvers offer a promising alternative to classical numerical methods. However, existing approaches typically require large numbers of training trajectories, while high-fidelity PDE data are expensive to generate. Under limited data, their performance degrades substantially, highlighting their low data efficiency. A key reason is that PDE dynamics embody strong structural inductive biases that are not explicitly encoded in neural architectures, forcing models to learn fundamental physical structure from data. A particularly salient manifestation of this inefficiency is poor generalization to unseen source terms. In this work, we revisit Green’s function theory, a cornerstone of PDE theory, as a principled source of structural inductive bias for PDE learning. Based on this insight, we propose DGNet, a discrete Green network for data-efficient learning of spatiotemporal PDEs. The key idea is to transform the Green’s function into a graph-based discrete formulation, and embed the superposition principle into the hybrid physics-neural architecture, which reduces the burden of learning physical priors from data, thereby improving sample efficiency. Across diverse spatiotemporal PDE scenarios, DGNet consistently achieves state-of-the-art accuracy using only tens of training trajectories. Moreover, it exhibits robust zero-shot generalization to unseen source terms, serving as a stress test that highlights its data-efficient structural design.
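The superposition principle that DGNet builds on can be shown in a few lines: for a discretized linear operator A, the discrete Green's matrix G = A^{-1} maps any source term to the solution by superposition, u = G f. The 1-D Poisson setup below is our illustrative example, not the paper's graph-based formulation:

```python
# Illustration of the underlying principle (not DGNet itself): for a linear
# operator A, the discrete Green's matrix G = A^{-1} gives the solution for
# any source term by superposition, u = G f.
import numpy as np

n = 50
h = 1.0 / (n + 1)
# 1-D finite-difference Laplacian with zero boundary conditions: -u'' = f
A = (np.diag(np.full(n, 2.0)) - np.diag(np.ones(n - 1), 1)
     - np.diag(np.ones(n - 1), -1)) / h**2
G = np.linalg.inv(A)                  # discrete Green's matrix

x = np.linspace(h, 1 - h, n)
f = np.pi**2 * np.sin(np.pi * x)      # source with known solution sin(pi x)
u = G @ f                             # superposition of Green responses
err = float(np.max(np.abs(u - np.sin(np.pi * x))))
```

Because G is fixed once the geometry is fixed, a new source term costs only a matrix-vector product, which is the generalization-to-unseen-sources property the abstract stresses.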
[LG-29] Meta-Learning Hyperparameters for Parameter Efficient Fine-Tuning CVPR2025
链接: https://arxiv.org/abs/2603.01759
作者: Zichen Tian,Yaoyao Liu,Qianru Sun
类目: Machine Learning (cs.LG)
*备注: Accepted by CVPR 2025 (Highlight). Code is available at: this https URL
Abstract:Training large foundation models from scratch for domain-specific applications is almost impossible due to data limits and long-tailed distributions, taking remote sensing (RS) as an example. Fine-tuning natural image pre-trained models on RS images is a straightforward solution. To reduce computational costs and improve performance on tail classes, existing methods apply parameter-efficient fine-tuning (PEFT) techniques, such as LoRA and AdaptFormer. However, we observe that fixed hyperparameters, such as intra-layer positions, layer depth, and scaling factors, can considerably hinder PEFT performance, as fine-tuning on RS images proves highly sensitive to these settings. To address this, we propose MetaPEFT, a method incorporating adaptive scalers that dynamically adjust module influence during fine-tuning. MetaPEFT dynamically adjusts three key factors of PEFT on RS images: module insertion, layer selection, and module-wise learning rates, which collectively control the influence of PEFT modules across the network. We conduct extensive experiments on three transfer-learning scenarios and five datasets in both RS and natural image domains. The results show that MetaPEFT achieves state-of-the-art performance in cross-spectral adaptation, requiring only a small number of trainable parameters and improving tail-class accuracy significantly.
[LG-30] Causal Circuit Tracing Reveals Distinct Computational Architectures in Single-Cell Foundation Models: Inhibitory Dominance Biological Coherence and Cross-Model Convergence
链接: https://arxiv.org/abs/2603.01752
作者: Ihor Kendiukhov
类目: Machine Learning (cs.LG); Cell Behavior (q-bio.CB); Genomics (q-bio.GN)
*备注:
Abstract:Motivation: Sparse autoencoders (SAEs) decompose foundation model activations into interpretable features, but causal feature-to-feature interactions across network depth remain unknown for biological foundation models. Results: We introduce causal circuit tracing by ablating SAE features and measuring downstream responses, and apply it to Geneformer V2-316M and scGPT whole-human across four conditions (96,892 edges, 80,191 forward passes). Both models show approximately 53 percent biological coherence and 65 to 89 percent inhibitory dominance, invariant to architecture and cell type. scGPT produces stronger effects (mean absolute d = 1.40 vs. 1.05) with more balanced dynamics. Cross-model consensus yields 1,142 conserved domain pairs (10.6x enrichment, p < 0.001). Disease-associated domains are 3.59x more likely to be consensus. Gene-level CRISPRi validation shows 56.4 percent directional accuracy, confirming co-expression rather than causal encoding.
[LG-31] Practical Deep Heteroskedastic Regression
链接: https://arxiv.org/abs/2603.01750
作者: Mikkel Jordahn,Jonas Vestergaard Jensen,James Harrison,Michael Riis Andersen,Mikkel N. Schmidt
类目: Machine Learning (cs.LG)
*备注:
Abstract:Uncertainty quantification (UQ) in deep learning regression is of wide interest, as it supports critical applications including sequential decision making and risk-sensitive tasks. In heteroskedastic regression, where the uncertainty of the target depends on the input, a common approach is to train a neural network that parameterizes the mean and the variance of the predictive distribution. Still, training deep heteroskedastic regression models poses practical challenges in the trade-off between uncertainty quantification and mean prediction, such as optimization difficulties, representation collapse, and variance overfitting. In this work we identify previously undiscussed fallacies and propose a simple and efficient procedure that addresses these challenges jointly by post-hoc fitting a variance model across the intermediate layers of a pretrained network on a hold-out dataset. We demonstrate that our method achieves on-par or state-of-the-art uncertainty quantification on several molecular graph datasets, without compromising mean prediction accuracy and remaining cheap to use at prediction time.
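For context, the standard training objective for such mean/variance heads is the Gaussian negative log-likelihood, sketched below with a log-variance parameterization for numerical stability. This is a textbook loss, not the paper's post-hoc variance-fitting procedure:

```python
# Sketch of the generic heteroskedastic regression loss (textbook form,
# not the paper's method): per-sample Gaussian NLL with a predicted
# log-variance, which keeps the variance positive and the loss stable.
import math

def gaussian_nll(y, mean, log_var):
    """NLL of y under N(mean, exp(log_var)), up to an additive constant."""
    return 0.5 * (log_var + (y - mean) ** 2 / math.exp(log_var))

# A confident wrong prediction is penalized far more than an uncertain one.
confident_wrong = gaussian_nll(y=1.0, mean=0.0, log_var=math.log(0.01))
uncertain_wrong = gaussian_nll(y=1.0, mean=0.0, log_var=math.log(1.0))
```

The asymmetry shown here is also the root of the optimization pathologies the abstract mentions: the model can lower its loss by inflating variance instead of improving the mean.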
[LG-32] Decentralized Federated Learning by Partial Message Exchange
链接: https://arxiv.org/abs/2603.01730
作者: Shan Sha,Shenglong Zhou,Xin Wang,Lingchen Kong,Geoffrey Ye Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Decentralized federated learning (DFL) has emerged as a transformative server-free paradigm that enables collaborative learning over large-scale heterogeneous networks. However, it continues to face fundamental challenges, including data heterogeneity, restrictive assumptions for theoretical analysis, and degraded convergence when standard communication- or privacy-enhancing techniques are applied. To overcome these drawbacks, this paper develops a novel algorithm, PaME (DFL by Partial Message Exchange). The central principle is to allow only randomly selected sparse coordinates to be exchanged between two neighbor nodes. Consequently, PaME achieves substantial reductions in communication costs while still preserving a high level of privacy, without sacrificing accuracy. Moreover, grounded in rigorous analysis, the algorithm is shown to converge at a linear rate under the assumptions that the gradient is locally Lipschitz continuous and the communication matrix is doubly stochastic. These two mild assumptions not only dispense with many restrictive conditions commonly imposed by existing DFL methods but also enable PaME to effectively address data heterogeneity. Furthermore, comprehensive numerical experiments demonstrate its superior performance compared with several representative decentralized learning algorithms.
[LG-33] Security Risks in Machining Process Monitoring: Sequence-to-Sequence Learning for Reconstruction of CNC Axis Positions
链接: https://arxiv.org/abs/2603.01702
作者: Lukas Krupp,Rickmar Stahlschmidt,Norbert Wehn
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: Accepted for presentation at the 2026 IEEE Symposium on Artificial Intelligence for Instrumentation and Measurement (AI4IM 2026). Proceedings to be included in IEEE Xplore
Abstract:Accelerometer-based process monitoring is widely deployed in modern machining systems. When mounted on moving machine components, such sensors implicitly capture kinematic information related to machine motion and tool trajectories. If this information can be reconstructed, condition monitoring data constitutes a severe security threat, particularly for retrofitted or weakly protected sensor systems. Classical signal processing approaches are infeasible for position reconstruction from broadband accelerometer signals due to sensor- and process-specific non-idealities, like noise or sensor placement effects. In this work, we demonstrate that sequence-to-sequence machine learning models can overcome these non-idealities and enable reconstruction of CNC axis and tool positions. Our approach employs LSTM-based sequence-to-sequence models and is evaluated on an industrial milling dataset. We show that learning-based models reduce the reconstruction error by up to 98% for low complexity motion profiles and by up to 85% for complex machining sequences compared to double integration. Furthermore, key geometric characteristics of tool trajectories and workpiece-related motion features are preserved. To the best of our knowledge, this is the first study demonstrating learning-based CNC position reconstruction from industrial condition monitoring accelerometer data.
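The classical double-integration baseline the paper compares against can be sketched as cumulative trapezoidal integration, acceleration to velocity to position; any sensor bias then drifts quadratically, which is why learned models help. A minimal version, with a made-up constant-acceleration signal:

```python
# Minimal sketch of the classical baseline (double numerical integration),
# which the abstract reports learned models beat by up to 98%/85%. Any
# constant accelerometer bias makes the position estimate drift quadratically.
def integrate(samples, dt):
    """Cumulative trapezoidal integration starting from zero."""
    out, total = [0.0], 0.0
    for a, b in zip(samples, samples[1:]):
        total += 0.5 * (a + b) * dt
        out.append(total)
    return out

dt = 0.01
accel = [1.0] * 101                   # constant 1 m/s^2 for 1 s (synthetic)
velocity = integrate(accel, dt)       # ends near 1 m/s
position = integrate(velocity, dt)    # ends near 0.5 m (= a * t^2 / 2)
```

On real condition-monitoring signals, noise and mounting effects corrupt `accel`, and the same pipeline diverges quickly, which is the non-ideality the sequence-to-sequence models are trained to overcome.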
[LG-34] Randomized Neural Networks for Partial Differential Equation on Static and Evolving Surfaces
链接: https://arxiv.org/abs/2603.01689
作者: Jingbo Sun,Fei Wang
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
Abstract:Surface partial differential equations arise in numerous scientific and engineering applications. Their numerical solution on static and evolving surfaces remains challenging due to geometric complexity and, for evolving geometries, the need for repeated mesh updates and geometry or solution transfer. While neural-network-based methods offer mesh-free discretizations, approaches based on nonconvex training can be costly and may fail to deliver high accuracy in practice. In this work, we develop a randomized neural network (RaNN) method for solving PDEs on both static and evolving surfaces: the hidden-layer parameters are randomly generated and kept fixed, and the output-layer coefficients are determined efficiently by solving a least-squares problem. For static surfaces, we present formulations for parametrized surfaces, implicit level-set surfaces, and point-cloud geometries, and provide a corresponding theoretical analysis for the parametrization-based formulation with interface compatibility. For evolving surfaces with topology preserved over time, we introduce a RaNN-based strategy that learns the surface evolution through a flow-map representation and then solves the surface PDE on a space–time collocation set, avoiding remeshing. Extensive numerical experiments demonstrate broad applicability and favorable accuracy–efficiency performance on representative benchmarks.
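The core RaNN training step can be sketched in a few lines for a plain 1-D regression problem (our simplified setting, not a surface PDE): random, frozen hidden weights produce features, and only the output layer is obtained from a linear least-squares solve:

```python
# Simplified sketch of the RaNN idea (our 1-D toy, not the paper's surface
# formulation): hidden-layer parameters are random and frozen; only the
# output layer is fit, by ordinary linear least squares.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)[:, None]
y = np.sin(2.0 * np.pi * x[:, 0])              # target function

W = rng.normal(scale=8.0, size=(1, 100))       # random, fixed hidden weights
b = rng.uniform(-8.0, 8.0, size=100)
H = np.tanh(x @ W + b)                         # random features
coef, *_ = np.linalg.lstsq(H, y, rcond=None)   # only this layer is trained

mse = float(np.mean((H @ coef - y) ** 2))
```

Because training reduces to one least-squares solve, there is no nonconvex optimization at all, which is the efficiency argument the abstract makes against fully trained networks.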
[LG-35] Transform-Invariant Generative Ray Path Sampling for Efficient Radio Propagation Modeling
链接: https://arxiv.org/abs/2603.01655
作者: Jérome Eertmans,Enrico M. Vitucci,Vittorio Degli-Esposti,Nicola Di Cicco,Laurent Jacques,Claude Oestges
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: submitted to npj Wireless Technology, 26 pages, 14 figures
Abstract:Ray tracing has become a standard for accurate radio propagation modeling, but suffers from exponential computational complexity, as the number of candidate paths scales with the number of objects raised to the power of the interaction order. This bottleneck limits its use in large-scale or real-time applications, forcing traditional tools to rely on heuristics to reduce the number of path candidates at the cost of potentially reduced accuracy. To overcome this limitation, we propose a comprehensive machine-learning-assisted framework that replaces exhaustive path searching with intelligent sampling via Generative Flow Networks. Applying such generative models to this domain presents significant challenges, particularly sparse rewards due to the rarity of valid paths, which can lead to convergence failures and trivial solutions when evaluating high-order interactions in complex environments. To ensure robust learning and efficient exploration, our framework incorporates three key architectural components. First, we implement an experience replay buffer to capture and retain rare valid paths. Second, we adopt a uniform exploratory policy to improve generalization and prevent the model from overfitting to simple geometries. Third, we apply a physics-based action masking strategy that filters out physically impossible paths before the model even considers them. As demonstrated in our experimental validation, the proposed model achieves substantial speedups over exhaustive search (up to 10× faster on GPU and 1000× faster on CPU) while maintaining high coverage accuracy and successfully uncovering complex propagation paths. The complete source code, tests, and tutorial are available at this https URL.
[LG-36] Towards OOD Generalization in Dynamic Graphs via Causal Invariant Learning AAAI2026
链接: https://arxiv.org/abs/2603.01626
作者: Xinxun Zhang,Pengfei Jiao,Mengzhou Gao,Tianpeng Li,Xuan Guo
类目: Machine Learning (cs.LG)
*备注: 16 pages, 9 figures, accepted by AAAI2026
Abstract:Although dynamic graph neural networks (DyGNNs) have demonstrated promising capabilities, most existing methods ignore out-of-distribution (OOD) shifts that commonly exist in dynamic graphs. Dynamic graph OOD generalization is non-trivial due to the following challenges: 1) Identifying invariant and variant patterns amid complex graph evolution, 2) Capturing the intrinsic evolution rationale from these patterns, and 3) Ensuring model generalization across diverse OOD shifts despite limited data distribution observations. Although several attempts have been made to tackle these challenges, none has successfully addressed all three simultaneously, and they face various limitations in complex OOD scenarios. To solve these issues, we propose a Dynamic graph Causal Invariant Learning (DyCIL) model for OOD generalization via exploiting invariant spatio-temporal patterns from a causal view. Specifically, we first develop a dynamic causal subgraph generator to identify causal dynamic subgraphs explicitly. Next, we design a causal-aware spatio-temporal attention module to extract the intrinsic evolution rationale behind invariant patterns. Finally, we further introduce an adaptive environment generator to capture the underlying dynamics of distributional shifts. Extensive experiments on both real-world and synthetic dynamic graph datasets demonstrate the superiority of our model over state-of-the-art baselines in handling OOD shifts.
[LG-37] Boosting Entropy with Bell Box Quantization ICLR2026
链接: https://arxiv.org/abs/2603.01599
作者: Ningfeng Yang,Tor M. Aamodt
类目: Machine Learning (cs.LG)
*备注: Published as a conference paper at ICLR 2026
Abstract:Quantization-Aware Pre-Training (QAPT) is an effective technique to reduce the compute and memory overhead of Deep Neural Networks while improving their energy efficiency on edge devices. Existing QAPT methods produce models stored in compute-efficient data types (e.g. integers) that are not information theoretically optimal (ITO). On the other hand, existing ITO data types (e.g. Quantile/NormalFloat Quantization) are not compute-efficient. We propose BBQ, the first ITO quantization method that is also compute-efficient. BBQ builds on our key insight that since learning is domain-agnostic, the output of a quantizer does not need to reside in the same domain as its input. BBQ performs ITO quantization in its input domain, and returns its output in a compute-efficient domain where ITO data types are mapped to compute-efficient data types. Without sacrificing compute efficiency, BBQ outperforms prior SOTA QAPT methods by a perplexity reduction of up to 2 points for 4-bit models, up to 4 points for 3-bit models, up to 5 points for 2-bit models, and up to 18 points for 1-bit models. Code is available at this https URL.
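For background, the information-theoretically optimal baseline the abstract mentions (quantile quantization) can be sketched as follows: placing codebook boundaries at quantiles makes every code equally likely, maximizing the entropy of the quantized values. This illustrates that baseline, not BBQ's domain-mapping construction:

```python
# Background sketch (quantile quantization, the ITO baseline mentioned in
# the abstract, not BBQ itself): quantile boundaries make every code equally
# likely, so the quantized representation has maximal entropy.
def quantile_codes(values, n_levels):
    ranked = sorted(values)
    def code(v):
        # Rank-based bin index: equally populated bins by construction.
        count_below = sum(1 for r in ranked if r < v)
        return min(n_levels - 1, count_below * n_levels // len(ranked))
    return [code(v) for v in values]

values = [0.1, 0.2, 0.35, 0.4, 5.0, 5.1, 9.0, 9.5]
codes = quantile_codes(values, 4)
# Each of the 4 codes is used exactly twice: uniform code usage.
```

The catch, as the abstract notes, is that these rank-based codes are not compute-efficient integers; bridging that gap is BBQ's contribution.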
[LG-38] Jump Like A Squirrel: Optimized Execution Step Order for Anytime Random Forest Inference
链接: https://arxiv.org/abs/2603.01588
作者: Daniel Biebert,Christian Hakert,Kay Heider,Daniel Kuhse,Sebastian Buschjäger,Jian-Jia Chen
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Due to their efficiency and small size, decision trees and random forests are popular machine learning models used for classification on resource-constrained systems. In such systems, the available execution time for inference in a random forest might not be sufficient for a complete model execution. Ideally, the already gained prediction confidence should be retained. An anytime algorithm is designed so that it can be aborted at any time while giving a result whose quality increases over time. Previous approaches have realized random forests as anytime algorithms at the granularity of trees, stopping after some but not all trees of a forest have been executed. However, due to the way decision trees subdivide the sample space in every step, an increase in prediction quality is achieved with every additional step in one tree. In this paper, we realize decision trees and random forests as anytime algorithms at the granularity of single steps in trees. This approach opens a design space for defining the step order in a forest, which has the potential to optimize the mean accuracy. We propose the Optimal Order, which finds a step order with maximal mean accuracy in exponential runtime, and the polynomial-runtime heuristics Forward Squirrel Order and Backward Squirrel Order, which greedily maximize the accuracy for each additional step taken down and up the trees, respectively. Our evaluation shows that the Backward Squirrel Order performs ~94% as well as the Optimal Order and ~99% as well as all other step orders.
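The step-granular anytime idea can be illustrated with a toy sketch (our own illustration, not the paper's Squirrel orders): if every internal node stores the class distribution of its subtree, the walk down a tree can be aborted after any number of steps and still return a usable prediction that sharpens with each step taken.

```python
# Node formats (hypothetical, for illustration only):
#   ('leaf', dist)
#   ('split', feature, threshold, dist, left, right)
# where dist is the class distribution of the subtree rooted at that node,
# i.e. the prediction returned if inference is aborted there.

def anytime_predict(tree, x, budget):
    """Descend at most `budget` steps; return the current node's distribution."""
    node, steps = tree, 0
    while node[0] == 'split' and steps < budget:
        _, feat, thr, _, left, right = node
        node = left if x[feat] <= thr else right
        steps += 1
    return node[3] if node[0] == 'split' else node[1]

tree = ('split', 0, 0.5, [0.5, 0.5],
        ('leaf', [0.9, 0.1]),
        ('split', 1, 1.0, [0.2, 0.8],
         ('leaf', [0.4, 0.6]),
         ('leaf', [0.0, 1.0])))

x = [0.7, 2.0]
# budget 0 -> root prior, budget 1 -> right child, budget 2 -> full path
early, mid, full = (anytime_predict(tree, x, b) for b in (0, 1, 2))
```

For a forest, the design space the paper explores is the order in which such single steps are taken across the different trees.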
[LG-39] KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models
链接: https://arxiv.org/abs/2603.01581
作者: Zihao Zheng,Zhihao Mao,Maoliang Li,Jiayu Chen,Xinhao Sun,Zhaobo Zhang,Donggang Cao,Hong Mei,Xiang Chen
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: This paper has been accepted by DAC 2026
Abstract:Vision-Language-Action (VLA) models build a token-domain robot control paradigm, yet suffer from low speed. Speculative Decoding (SD) is an optimization strategy that can boost inference speed. Two key issues emerge when integrating VLA and SD: first, SD relies on re-inference to address token errors, which is computationally expensive; second, to mitigate token errors, the acceptance threshold in SD requires careful adjustment. Existing works fail to address the above two issues effectively. Meanwhile, as the bridge between AI and the physical world, existing embodied intelligence has overlooked the application of robotic kinematics. To address these issues, we innovatively combine token-domain VLA models with kinematic-domain prediction for SD, proposing a kinematic-rectified SD framework named KERV. We employ a kinematics-based Kalman Filter to predict actions and compensate for SD errors, avoiding costly re-inference. Moreover, we design a kinematics-based adjustment strategy to dynamically rectify the acceptance threshold, addressing the difficulty of threshold determination. Experimental results across diverse tasks and environments demonstrate that KERV achieves 27%~37% acceleration with nearly no Success Rate loss.
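The kinematic prediction used to compensate for speculative-decoding errors can be pictured with a generic scalar constant-velocity Kalman filter (a simplified sketch under our own assumptions, not KERV's actual filter or action space):

```python
def kf_predict(x, v, p_x, p_v, dt, q):
    """Constant-velocity motion model: position advances by v*dt;
    uncertainty grows with the motion and process noise q."""
    x_pred = x + v * dt
    p_pred = p_x + dt * dt * p_v + q  # simplified scalar covariance growth
    return x_pred, p_pred

def kf_update(x_pred, p_pred, z, r):
    """Fuse the kinematic prediction with a measurement z (variance r)."""
    k = p_pred / (p_pred + r)          # Kalman gain in [0, 1]
    x_new = x_pred + k * (z - x_pred)  # pulled toward z, weighted by k
    p_new = (1 - k) * p_pred           # uncertainty shrinks after fusion
    return x_new, p_new

# One predict/update cycle: prior at x=0 moving at v=1, noisy reading z=0.3.
x_pred, p_pred = kf_predict(x=0.0, v=1.0, p_x=0.1, p_v=0.1, dt=0.1, q=0.01)
x_new, p_new = kf_update(x_pred, p_pred, z=0.3, r=0.05)
```

In a KERV-like setting, such a filter supplies a cheap physically plausible estimate of the next action, avoiding a full re-inference pass when a drafted token is rejected.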
[LG-40] Adversarial Query Synthesis via Bayesian Optimization
链接: https://arxiv.org/abs/2603.01570
作者: Jeffrey Tao,Yimeng Zeng,Haydn Thomas Jones,Natalie Maus,Osbert Bastani,Jacob R. Gardner,Ryan Marcus
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:
Abstract:Benchmark workloads are extremely important to the database management research community, especially as more machine learning components are integrated into database systems. Here, we propose a Bayesian optimization technique to automatically search for difficult benchmark queries, significantly reducing the amount of manual effort usually required. In preliminary experiments, we show that our approach can generate queries with more than double the optimization headroom compared to existing benchmarks.
[LG-41] Scalable Multi-Task Low-Rank Model Adaptation ICLR2026
链接: https://arxiv.org/abs/2603.01526
作者: Zichen Tian,Antoine Ledent,Qianru Sun
类目: Machine Learning (cs.LG)
*备注: Published as a conference paper at ICLR 2026. 21 pages, 4 figures, 11 tables. Code is available
Abstract:Scaling multi-task low-rank adaptation (LoRA) to a large number of tasks induces catastrophic performance degradation, such as an accuracy drop from 88.2% to 2.0% on DOTA when scaling from 5 to 15 tasks. This failure is due to parameter and representation misalignment. We find that existing solutions, like regularization and dynamic routing, fail at scale because they are constrained by a fundamental trade-off: strengthening regularization to reduce inter-task conflict inadvertently suppresses the essential feature discrimination required for effective routing. In this work, we identify two root causes for this trade-off. First, uniform regularization disrupts inter-task knowledge sharing: shared underlying knowledge concentrates in high-SV components (89% alignment on Flanv2-BBH). Uniform regularization forces high-SV components to update in orthogonal directions, directly disrupting the shared knowledge. Second, Conflict Amplification: Applying LoRA at the component-level (e.g., W_q, W_v) amplifies gradient conflicts; we show block-level adaptation reduces this conflict by 76% with only 50% parameters. Based on these insights, we propose mtLoRA, a scalable solution with three novel designs: 1) Spectral-Aware Regularization to selectively orthogonalize low-SV components while preserving high-SV shared knowledge, 2) Block-Level Adaptation to mitigate conflict amplification and largely improve parameter efficiency, and 3) Fine-Grained Routing using dimension-specific weights for superior expressive power. On four large-scale (15-25 tasks) vision (DOTA and iNat2018) and NLP (Dolly-15k and BBH) benchmarks, mtLoRA achieves 91.7%, 81.5%, 44.5% and 38.5% accuracy on DOTA, iNat2018, Dolly-15k and BBH respectively, outperforming the state-of-the-art by 2.3% on average while using 47% fewer parameters and 24% less training time.
[LG-42] Training Dynamics of Softmax Self-Attention: Fast Global Convergence via Preconditioning
链接: https://arxiv.org/abs/2603.01514
作者: Gautam Goel,Mahdi Soltanolkotabi,Peter Bartlett
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study the training dynamics of gradient descent in a softmax self-attention layer trained to perform linear regression and show that a simple first-order optimization algorithm can converge to the globally optimal self-attention parameters at a geometric rate. Our analysis proceeds in two steps. First, we show that in the infinite-data limit the regression problem solved by the self-attention layer is equivalent to a nonconvex matrix factorization problem. Second, we exploit this connection to design a novel “structure-aware” variant of gradient descent which efficiently optimizes the original finite-data regression objective. Our optimization algorithm features several innovations over standard gradient descent, including a preconditioner and regularizer which help avoid spurious stationary points, and a data-dependent spectral initialization of parameters which lie near the manifold of global minima with high probability.
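Why preconditioning can turn slow convergence into geometric convergence is visible already on a toy ill-conditioned quadratic (our own illustration of the general principle, not the paper's self-attention algorithm):

```python
# Minimize f(w) = 0.5 * sum_i a_i * w_i^2 with a diagonal curvature diag_a.
# Plain gradient descent is throttled by the condition number max(a)/min(a);
# preconditioning the gradient by A^{-1} (elementwise here) removes that
# dependence and, with lr=1, solves each mode in a single step.

def gd(w, diag_a, lr, steps, precondition=False):
    for _ in range(steps):
        grad = [a * wi for a, wi in zip(diag_a, w)]
        if precondition:
            grad = [g / a for g, a in zip(grad, diag_a)]  # P = A^{-1}
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

diag_a = [1.0, 100.0]   # condition number 100
w0 = [1.0, 1.0]
plain = gd(w0, diag_a, lr=0.01, steps=50)                      # slow flat mode
precond = gd(w0, diag_a, lr=1.0, steps=1, precondition=True)   # one step
```

The paper's "structure-aware" gradient descent plays an analogous role in the nonconvex matrix-factorization reformulation, with a regularizer and spectral initialization additionally steering iterates away from spurious stationary points.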
[LG-43] Randomized Kriging Believer for Parallel Bayesian Optimization with Regret Bounds
链接: https://arxiv.org/abs/2603.01470
作者: Shuhei Sugiura,Ichiro Takeuchi,Shion Takeno
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We consider an optimization problem of an expensive-to-evaluate black-box function, in which we can obtain noisy function values in parallel. For this problem, parallel Bayesian optimization (PBO) is a promising approach, which aims to optimize with fewer function evaluations by selecting a diverse input set for parallel evaluation. However, existing PBO methods suffer from poor practical performance or lack theoretical guarantees. In this study, we propose a PBO method, called randomized kriging believer (KB), based on a well-known KB heuristic and inheriting the advantages of the original KB: low computational complexity, a simple implementation, versatility across various BO methods, and applicability to asynchronous parallelization. Furthermore, we show that our randomized KB achieves Bayesian expected regret guarantees. We demonstrate the effectiveness of the proposed method through experiments on synthetic and benchmark functions and emulators of real-world data.
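The classic (deterministic) kriging-believer heuristic that this work randomizes can be sketched as follows; to stay self-contained we stand in a toy nearest-neighbor surrogate for the Gaussian process, so everything beyond the believer loop itself is our own assumption:

```python
def posterior(x, data):
    """Toy surrogate standing in for a GP posterior: the mean is the value
    of the nearest observed point, the std grows with distance to it."""
    x0, y0 = min(data, key=lambda p: abs(p[0] - x))
    return y0, abs(x0 - x)  # (mean, std)

def kriging_believer_batch(data, candidates, batch_size, beta=1.0):
    """Select a diverse batch for parallel evaluation: after picking each
    point, 'believe' the surrogate's posterior mean as its observed value,
    which suppresses re-picking nearby candidates."""
    believed = list(data)
    batch = []
    for _ in range(batch_size):
        def ucb(x):
            m, s = posterior(x, believed)
            return m + beta * s
        x_next = max(candidates, key=ucb)
        m, _ = posterior(x_next, believed)
        believed.append((x_next, m))   # hallucinated ("believed") observation
        batch.append(x_next)
    return batch

data = [(0.0, 1.0), (1.0, 0.5)]
cands = [0.1, 0.5, 0.9]
batch = kriging_believer_batch(data, cands, batch_size=2)
```

The believed value zeroes out the surrogate's uncertainty at each chosen point, so subsequent picks spread out; the paper's contribution is a randomized variant of this loop with Bayesian expected-regret guarantees.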
[LG-44] SEAnet: A Deep Learning Architecture for Data Series Similarity Search DATE
链接: https://arxiv.org/abs/2603.01448
作者: Qitong Wang,Themis Palpanas
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注: This paper was published in IEEE Transactions on Knowledge and Data Engineering (Volume: 35, Issue: 12, Page(s): 12972 - 12986, 01 December 2023). Date of Publication: 25 April 2023
Abstract:A key operation for massive data series collection analysis is similarity search. According to recent studies, SAX-based indexes offer state-of-the-art performance for similarity search tasks. However, their performance lags on datasets with high-frequency, weakly correlated, excessively noisy, or other dataset-specific properties. In this work, we propose Deep Embedding Approximation (DEA), a novel family of data series summarization techniques based on deep neural networks. Moreover, we describe SEAnet, a novel architecture especially designed for learning DEA, that introduces the Sum of Squares preservation property into the deep network design. We further enhance SEAnet with the SEAtrans encoder. Finally, we propose novel sampling strategies, SEAsam and SEAsamE, that allow SEAnet to effectively train on massive datasets. Comprehensive experiments on 7 diverse synthetic and real datasets verify the advantages of DEA learned using SEAnet in providing high-quality data series summarizations and similarity search results.
[LG-45] Autoregressive Synthesis of Sparse and Semi-Structured Mixed-Type Data
链接: https://arxiv.org/abs/2603.01444
作者: Thomas Rückstieß,Robin Vujanic
类目: Machine Learning (cs.LG)
*备注: Under Submission
Abstract:Synthetic data generation is a critical capability for data sharing, privacy compliance, system benchmarking and test data provisioning. Existing methods assume dense, fixed-schema tabular data, yet this assumption is increasingly at odds with modern data systems - from document databases and REST APIs to data lakes - which store and exchange data in sparse, semi-structured formats like JSON. Applying existing tabular methods to such data requires flattening nested data into wide, sparse tables, which scales poorly. We present Origami, an autoregressive transformer-based architecture that tokenizes data records, including nested objects and variable length arrays, into sequences of key, value and structural tokens. This representation natively handles sparsity, mixed types and hierarchical structure without flattening or imputation. Origami outperforms baselines spanning GAN, VAE, diffusion and autoregressive architectures on fidelity, utility and detection metrics across nearly all settings, while maintaining high privacy scores. On semi-structured datasets with up to 38% sparsity, baseline synthesizers either fail to scale or degrade substantially, while Origami maintains high-fidelity synthesis that is harder to distinguish from real data. To the best of our knowledge, Origami is the first architecture capable of natively modeling and generating semi-structured data end-to-end.
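The key/value/structural-token idea can be sketched in a few lines (an illustrative serialization of our own, not Origami's actual tokenizer): a nested record becomes a flat token sequence in which structure, keys and values each get their own token kind, so sparsity and nesting survive without a fixed schema.

```python
def tokenize(obj):
    """Serialize a nested record into structural, key and value tokens."""
    if isinstance(obj, dict):
        toks = ["<obj>"]
        for k, v in obj.items():
            toks.append(("key", k))     # key token
            toks.extend(tokenize(v))    # recurse into the value
        toks.append("</obj>")
        return toks
    if isinstance(obj, list):
        return ["<arr>"] + [t for v in obj for t in tokenize(v)] + ["</arr>"]
    return [("val", obj)]               # scalar value token

record = {"user": {"id": 7, "tags": ["a", "b"]}}
tokens = tokenize(record)
# An autoregressive model trained on such sequences sees absent keys as
# simply absent tokens, with no NULL-padded columns to impute.
```

Records with different key sets produce sequences of different lengths rather than a sparse wide table, which is what lets this representation scale on semi-structured data.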
[LG-46] Tackling multiphysics problems via finite element-guided physics-informed operator learning
链接: https://arxiv.org/abs/2603.01420
作者: Yusuke Yamazaki,Reza Najian Asl,Markus Apel,Mayu Muramatsu,Shahed Rezaei
类目: Machine Learning (cs.LG)
*备注:
Abstract:This work presents a finite element-guided physics-informed operator learning framework for multiphysics problems with coupled partial differential equations (PDEs) on arbitrary domains. Implemented with Folax, a JAX-based operator-learning platform, the proposed framework learns a mapping from the input parameter space to the solution space with a weighted residual formulation based on the finite element method, enabling discretization-independent prediction beyond the training resolution without relying on labeled simulation data. The present framework for multiphysics problems is verified on nonlinear thermo-mechanical problems. Two- and three-dimensional representative volume elements with varying heterogeneous microstructures, and a close-to-reality industrial casting example under varying boundary conditions are investigated as the example problems. We investigate the potential of several neural operator backbones, including Fourier neural operators (FNOs), deep operator networks (DeepONets), and a newly proposed implicit finite operator learning (iFOL) approach based on conditional neural fields. The results demonstrate that FNOs yield highly accurate solution operators on regular domains, where the global topology can be efficiently learned in the spectral domain, and iFOL offers efficient parametric operator learning capabilities for complex and irregular geometries. Furthermore, studies on training strategies, network decomposition, and training sample quality reveal that a monolithic training strategy using a single network is sufficient for accurate predictions, while training sample quality strongly influences performance. Overall, the present approach highlights the potential of physics-informed operator learning with a finite element-based loss as a unified and scalable approach for coupled multiphysics simulations.
[LG-47] One Operator to Rule Them All? On Boundary-Indexed Operator Families in Neural PDE Solvers ICLR2026
链接: https://arxiv.org/abs/2603.01406
作者: Lennon J. Shikhman
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: Accepted to the ICLR 2026 Workshop on AI PDEs. 10 pages, 5 figures
Abstract:Neural PDE solvers are often described as learning solution operators that map problem data to PDE solutions. In this work, we argue that this interpretation is generally incorrect when boundary conditions vary. We show that standard neural operator training implicitly learns a boundary-indexed family of operators, rather than a single boundary-agnostic operator, with the learned mapping fundamentally conditioned on the boundary-condition distribution seen during training. We formalize this perspective by framing operator learning as conditional risk minimization over boundary conditions, which leads to a non-identifiability result outside the support of the training boundary distribution. As a consequence, generalization in forcing terms or resolution does not imply generalization across boundary conditions. We support our theoretical analysis with controlled experiments on the Poisson equation, demonstrating sharp degradation under boundary-condition shifts, cross-distribution failures between distinct boundary ensembles, and convergence to conditional expectations when boundary information is removed. Our results clarify a core limitation of current neural PDE solvers and highlight the need for explicit boundary-aware modeling in the pursuit of foundation models for PDEs.
[LG-48] Quasar: Quantized Self-Speculative Acceleration for Rapid Inference via Memory-Efficient Verification
链接: https://arxiv.org/abs/2603.01399
作者: Guang Huang,Zeyi Wen
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 10 pages
Abstract:Speculative Decoding (SD) has emerged as a premier technique for accelerating Large Language Model (LLM) inference by decoupling token generation into rapid drafting and parallel verification. While recent advancements in self-speculation and lookahead decoding have successfully minimized drafting overhead, they have shifted the primary performance bottleneck to the verification phase. Since verification requires a full forward pass of the target model, it remains strictly memory-bandwidth bound, fundamentally limiting the maximum achievable speedup. In this paper, we introduce Quasar (Quantized Self-speculative Acceleration for Rapid Inference), a novel, training-free framework designed to overcome this “memory wall” by employing low-bit quantization specifically for the verification stage. Our empirical analysis reveals that while aggressive structural pruning significantly degrades verification accuracy, quantization-based verification preserves the logit distribution with high fidelity while effectively halving memory traffic. Extensive experiments on state-of-the-art models (e.g., OpenPangu and Qwen3) demonstrate that Quasar maintains a speculative acceptance length comparable to full-precision methods while achieving a 1.28x improvement in end-to-end throughput. Being orthogonal to existing drafting strategies, Quasar offers a generic and efficient pathway to accelerate the verification leg of speculative execution. Code is available at this https URL.
[LG-49] Invariant-Stratified Propagation for Expressive Graph Neural Networks
链接: https://arxiv.org/abs/2603.01388
作者: Asela Hevapathige,Ahad N. Zehmakan,Asiri Wijesinghe,Saman Halgamuge
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Graph Neural Networks (GNNs) face fundamental limitations in expressivity and capturing structural heterogeneity. Standard message-passing architectures are constrained by the 1-dimensional Weisfeiler-Leman (1-WL) test, unable to distinguish graphs beyond degree sequences, and aggregate information uniformly from neighbors, failing to capture how nodes occupy different structural positions within higher-order patterns. While methods exist to achieve higher expressivity, they incur prohibitive computational costs and lack unified frameworks for flexibly encoding diverse structural properties. To address these limitations, we introduce Invariant-Stratified Propagation (ISP), a framework comprising both a novel WL variant (ISP-WL) and its efficient neural network implementation (ISPGNN). ISP stratifies nodes according to graph invariants, processing them in hierarchical strata that reveal structural distinctions invisible to 1-WL. Through hierarchical structural heterogeneity encoding, ISP quantifies differences in nodes’ structural positions within higher-order patterns, distinguishing interactions where participants occupy different roles from those with uniform participation. We provide formal theoretical analysis establishing enhanced expressivity beyond 1-WL, convergence guarantees, and inherent resistance to oversmoothing. Extensive experiments across graph classification, node classification, and influence estimation demonstrate consistent improvements over standard architectures and state-of-the-art expressive baselines.
[LG-50] 3BASiL: An Algorithmic Framework for Sparse plus Low-Rank Compression of LLMs
链接: https://arxiv.org/abs/2603.01376
作者: Mehdi Makni,Xiang Meng,Rahul Mazumder
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: The Thirty-ninth Annual Conference on Neural Information Processing Systems
Abstract:Sparse plus Low-Rank (\mathbf{S} + \mathbf{LR}) decomposition of Large Language Models (LLMs) has emerged as a promising direction in model compression, aiming to decompose pre-trained model weights into a sum of sparse and low-rank matrices (\mathbf{W} \approx \mathbf{S} + \mathbf{LR}). Despite recent progress, existing methods often suffer from substantial performance degradation compared to dense models. In this work, we introduce 3BASiL-TM, an efficient one-shot post-training method for (\mathbf{S} + \mathbf{LR}) decomposition of LLMs that addresses this gap. Our approach first introduces a novel 3-Block Alternating Direction Method of Multipliers (ADMM) method, termed 3BASiL, to minimize the layer-wise reconstruction error with convergence guarantees. We then design an efficient transformer-matching (TM) refinement step that jointly optimizes the sparse and low-rank components across transformer layers. This step minimizes a novel memory-efficient loss that aligns outputs at the transformer level. Notably, the TM procedure is universal as it can enhance any (\mathbf{S} + \mathbf{LR}) decomposition, including pure sparsity. Our numerical experiments show that 3BASiL-TM reduces the WikiText2 perplexity gap relative to the dense LLaMA-8B model by over 30% under a (2:4 Sparse + 64 LR) configuration, compared to prior methods. Moreover, our method achieves over 2.5x faster compression runtime on an A100 GPU compared to the SOTA (\mathbf{S} + \mathbf{LR}) method. Our code is available at this https URL.
[LG-51] DUEL: Exact Likelihood for Masked Diffusion via Deterministic Unmasking
链接: https://arxiv.org/abs/2603.01367
作者: Gilad Turok,Chris De Sa,Volodymyr Kuleshov
类目: Machine Learning (cs.LG)
*备注: 22 pages, 5 figures 8 tables
Abstract:Masked diffusion models (MDMs) generate text by iteratively selecting positions to unmask and then predicting tokens at those positions. Yet MDMs lack proper perplexity evaluation: the ELBO is a loose bound on likelihood under the training distribution, not the test-time distribution, while generative perplexity requires a biased external model and ignores diversity. To address this, we introduce the DUEL framework, which formalizes deterministic position selection, unifying leading MDM sampling strategies. We prove DUEL admits exact likelihood computation via a simple algorithm, evaluated under the same position selection used at test time. This gives MDMs proper perplexity for the first time – the natural analogue of autoregressive perplexity. With proper perplexity in hand, we revisit key questions about MDMs. MDMs are substantially better than previously thought: the MDM-autoregressive perplexity gap shrinks by up to 32% on in-domain data and 82% on zero-shot benchmarks. DUEL enables the first principled comparison of fast, parallel samplers across compute budgets – an analysis impossible with the ELBO and unreliable with generative perplexity – identifying probability margin (Kim et al., 2025) as a strong default. Finally, oracle search over position orderings reveals MDMs can far surpass autoregressive models – achieving 36.47 vs. 52.11 perplexity on AG News – demonstrating the ceiling of MDM performance has not yet been reached.
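Why a deterministic unmasking rule makes the likelihood exactly computable can be seen from the chain rule: once the position revealed at each step is a deterministic function of the partially unmasked sequence, the sequence probability is an ordinary product of conditionals. The sketch below is our own toy illustration (with a hand-made stand-in `model`), not the paper's algorithm:

```python
import math

def exact_log_likelihood(tokens, model, select):
    """Exact log-likelihood under a deterministic unmasking order:
    accumulate log p(token at chosen position | current partial sequence)."""
    seq = [None] * len(tokens)           # all positions start masked
    logp = 0.0
    while None in seq:
        probs = model(seq)                # probs[i][tok] for each masked i
        i = select(seq, probs)            # deterministic position choice
        logp += math.log(probs[i][tokens[i]])
        seq[i] = tokens[i]                # unmask with the true token
    return logp

def most_confident(seq, probs):
    """Reveal the masked position whose top token probability is highest."""
    masked = [i for i, t in enumerate(seq) if t is None]
    return max(masked, key=lambda i: max(probs[i].values()))

# Tiny hand-made "model" over a 2-token vocabulary, length-2 sequence.
def model(seq):
    return [{"a": 0.9, "b": 0.1} if t is None else {} for t in seq]

ll = exact_log_likelihood(["a", "a"], model, most_confident)
```

Exponentiating `-ll / len(tokens)` then gives a proper perplexity under the same position-selection rule used at test time, which is the quantity the framework makes available.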
[LG-52] Fed-GAME: Personalized Federated Learning with Graph Attention Mixture-of-Experts For Time-Series Forecasting
链接: https://arxiv.org/abs/2603.01363
作者: Yi Li,Han Liu,Mingfeng Fan,Guo Chen,Chaojie Li,Biplab Sikdar
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Federated learning (FL) on graphs shows promise for distributed time-series forecasting. Yet, existing methods rely on static topologies and struggle with client heterogeneity. We propose Fed-GAME, a framework that models personalized aggregation as message passing over a learnable dynamic implicit graph. The core is a decoupled parameter-difference-based update protocol, where clients transmit parameter differences between their fine-tuned private model and a shared global model. On the server, these differences are decomposed into two streams: (1) an averaged difference used to update the global model for consensus, and (2) a selective difference fed into a novel Graph Attention Mixture-of-Experts (GAME) aggregator for fine-grained personalization. In this aggregator, shared experts provide scoring signals while personalized gates adaptively weight selective updates to support personalized aggregation. Experiments on two real-world electric vehicle charging datasets demonstrate that Fed-GAME outperforms state-of-the-art personalized FL baselines.
[LG-53] Relatively Smart: A New Approach for Instance-Optimal Learning
链接: https://arxiv.org/abs/2603.01346
作者: Shaddin Dughmi,Alireza F. Pour
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We revisit the framework of Smart PAC learning, which seeks supervised learners which compete with semi-supervised learners that are provided full knowledge of the marginal distribution on unlabeled data. Prior work has shown that such marginal-by-marginal guarantees are possible for “most” marginals, with respect to an arbitrary fixed and known measure, but not more generally. We discover that this failure can be attributed to an “indistinguishability” phenomenon: There are marginals which cannot be statistically distinguished from other marginals that require different learning approaches. In such settings, semi-supervised learning cannot certify its guarantees from unlabeled data, rendering them arguably non-actionable. We propose relatively smart learning, a new framework which demands that a supervised learner compete only with the best “certifiable” semi-supervised guarantee. We show that such modest relaxation suffices to bypass the impossibility results from prior work. In the distribution-free setting, we show that the OIG learner is relatively smart up to squaring the sample complexity, and show that no supervised learning algorithm can do better. For distribution-family settings, we show that relatively smart learning can be impossible or can require idiosyncratic learning approaches, and its difficulty can be non-monotone in the inclusion order on distribution families.
[LG-54] PAC Guarantees for Reinforcement Learning: Sample Complexity, Coverage, and Structure
链接: https://arxiv.org/abs/2603.01309
作者: Joshua Steier
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 43 pages
Abstract:When data is scarce or mistakes are costly, average-case metrics fall short. What a practitioner needs is a guarantee: with probability at least 1-\delta , the learned policy is \varepsilon -close to optimal after N episodes. This is the PAC promise, and between 2018 and 2025 the RL theory community made striking progress on when such promises can be kept. We survey that progress. Our organizing tool is the Coverage-Structure-Objective (CSO) framework, proposed here, which decomposes nearly every PAC sample complexity result into three factors: coverage (how data were obtained), structure (intrinsic MDP or function-class complexity), and objective (what the learner must deliver). CSO is not a theorem but an interpretive template that identifies bottlenecks and makes cross-setting comparison immediate. The technical core covers tight tabular baselines and the uniform-PAC bridge to regret; structural complexity measures (Bellman rank, witness rank, Bellman-Eluder dimension) governing learnability with function approximation; results for linear, kernel/NTK, and low-rank models; reward-free exploration as upfront coverage investment; and pessimistic offline RL where inherited coverage is the binding constraint. We provide practitioner tools: rate lookup tables indexed by CSO coordinates, Bellman residual diagnostics, coverage estimation with deployment gates, and per-episode policy certificates. A final section catalogs open problems, separating near-term targets from frontier questions where coverage, structure, and computation tangle in ways current theory cannot resolve.
[LG-55] Nonconvex Latent Optimally Partitioned Block-Sparse Recovery via Log-Sum and Minimax Concave Penalties
链接: https://arxiv.org/abs/2603.01304
作者: Takanobu Furuhashi,Hiroki Kuroda,Masahiro Yukawa,Qibin Zhao,Hidekata Hontani,Tatsuya Yokota
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 13 pages, 11 figures
Abstract:We propose two nonconvex regularization methods, LogLOP-l2/l1 and AdaLOP-l2/l1, for recovering block-sparse signals with unknown block partitions. These methods address the underestimation bias of existing convex approaches by extending the log-sum penalty and the Minimax Concave Penalty (MCP) to the block-sparse domain via novel variational formulations. Unlike Generalized Moreau Enhancement (GME) and Bayesian methods that depend on the squared-error data fidelity term, our proposed methods are compatible with a broad range of data fidelity terms. We develop efficient Alternating Direction Method of Multipliers (ADMM)-based algorithms for these formulations that exhibit stable empirical convergence. Numerical experiments on synthetic data, angular power spectrum estimation, and denoising of nanopore currents demonstrate that our methods outperform state-of-the-art baselines in estimation accuracy.
[LG-56] The Impact of Battery Cell Configuration on Electric Vehicle Performance: An XGBoost-Based Classification with SHAP Interpretability
链接: https://arxiv.org/abs/2603.01275
作者: Santanam Wishal,Louis Filiepe Tio Jansel,Matthew Abednego Inkiriwang,Jason Sebastian
类目: Machine Learning (cs.LG)
*备注: 12 pages, 7 figures, 3 tables
Abstract:As the electric vehicle (EV) market continues to prioritize dynamic performance and rapid charging, battery configuration has rapidly evolved. Despite this, current literature has often overlooked the complex, non-linear relationship between battery configuration and electric vehicle performance. To address this gap, this study proposes a machine learning framework which categorizes EV acceleration performance into High (≤ 4.0 seconds), Mid (4.0–7.0 seconds), and Low (> 7.0 seconds). Utilizing a preprocessed dataset consisting of 276 EV samples, an Extreme Gradient Boosting (XGBoost) classifier was trained, achieving 87.5% predictive accuracy, a 0.968 ROC-AUC, and a 0.812 MCC. In order to ensure engineering transparency, SHapley Additive exPlanations (SHAP) were employed. The analysis shows that an increase in battery cell count initially boosts power delivery, but the added mass and complexity eventually diminish performance gains. As such, these findings indicate that battery configuration in EVs must balance system complexity and architectural configuration to achieve and retain optimal vehicle performance.
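The three-way target the classifier is trained on amounts to a simple banding of 0-100 km/h times; a minimal sketch (our own, with the boundary inclusion at 4.0 s and 7.0 s assumed, since the abstract does not pin it down):

```python
def accel_class(zero_to_100_s):
    """Map a 0-100 km/h acceleration time (seconds) to the study's
    performance classes. Boundary handling (<= vs <) is our assumption."""
    if zero_to_100_s <= 4.0:
        return "High"
    if zero_to_100_s <= 7.0:
        return "Mid"
    return "Low"

labels = [accel_class(t) for t in (3.2, 5.5, 9.0)]
```

With labels derived this way, any multiclass classifier (XGBoost in the paper) can be fit on the battery-configuration features.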
[LG-57] S2O: Enhancing Adversarial Training with Second-Order Statistics of Weights
链接: https://arxiv.org/abs/2603.01264
作者: Gaojie Jin,Xinping Yi,Wei Huang,Sven Schewe,Xiaowei Huang
类目: Machine Learning (cs.LG)
*备注: Accepted to TPAMI 2025
Abstract:Adversarial training has emerged as a highly effective way to improve the robustness of deep neural networks (DNNs). It is typically conceptualized as a min-max optimization problem over model weights and adversarial perturbations, where the weights are optimized using gradient descent methods, such as SGD. In this paper, we propose a novel approach by treating model weights as random variables, which paves the way for enhancing adversarial training through Second-Order Statistics Optimization (S^2O) over model weights. We challenge and relax a prevalent, yet often unrealistic, assumption in prior PAC-Bayesian frameworks: the statistical independence of weights. From this relaxation, we derive an improved PAC-Bayesian robust generalization bound. Our theoretical developments suggest that optimizing the second-order statistics of weights can substantially tighten this bound. We complement this theoretical insight by conducting an extensive set of experiments that demonstrate that S^2O not only enhances the robustness and generalization of neural networks when used in isolation, but also seamlessly augments other state-of-the-art adversarial training techniques. The code is available at this https URL.
[LG-58] Subliminal Signals in Preference Labels
链接: https://arxiv.org/abs/2603.01204
作者: Isotta Magistrali,Frédéric Berdoz,Sam Dauncey,Roger Wattenhofer
类目: Machine Learning (cs.LG)
*备注:
Abstract:As AI systems approach superhuman capabilities, scalable oversight increasingly relies on LLM-as-a-judge frameworks where models evaluate and guide each other’s training. A core assumption is that binary preference labels provide only semantic supervision about response quality. We challenge this assumption by demonstrating that preference labels can function as a covert communication channel. We show that even when a neutral student model generates semantically unbiased completions, a biased judge can transmit unintended behavioral traits through preference assignments, which even strengthen across iterative alignment rounds. Our findings suggest that robust oversight in superalignment settings requires mechanisms that can detect and mitigate subliminal preference transmission, particularly when judges may pursue unintended objectives.
[LG-59] Operator Learning Using Weak Supervision from Walk-on-Spheres
链接: https://arxiv.org/abs/2603.01193
作者: Hrishikesh Viswanath,Hong Chul Nam,Xi Deng,Julius Berner,Anima Anandkumar,Aniket Bera
类目: Machine Learning (cs.LG)
*备注:
Abstract:Training neural PDE solvers is often bottlenecked by expensive data generation or unstable physics-informed neural network (PINN) training, which involves challenging optimization landscapes due to higher-order derivatives. To tackle this issue, we propose an alternative approach that uses Monte Carlo methods to estimate the solution to the PDE as a stochastic process, providing weak supervision during training. Leveraging the walk-on-spheres method, we introduce a learning scheme called Walk-on-Spheres Neural Operator (WoS-NO), which uses weak supervision from WoS to train any given neural operator. We propose to amortize the cost of Monte Carlo walks across the distribution of PDE instances, using stochastic representations from the WoS algorithm to generate cheap, noisy estimates of the PDE solution during training. This is formulated into a data-free physics-informed objective where a neural operator is trained to regress against these weak supervisions, allowing the operator to learn a generalized solution map for an entire family of PDEs. This strategy results in a mesh-free framework that operates without expensive pre-computed datasets, avoids the need to compute memory-intensive and unstable higher-order derivatives for loss functions, and demonstrates zero-shot generalization to novel PDE parameters and domains. Experiments show that for the same number of training steps, our method exhibits up to 8.75x improvement in L2 error compared to standard physics-informed training schemes, up to 6.31x improvement in training speed, and reductions of up to 2.97x in GPU memory consumption. We present the code at this https URL
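The walk-on-spheres estimator that supplies the weak supervision is easiest to see on a classical case it was designed for, the Laplace equation. A dependency-free sketch on the unit disk (the boundary data `g`, tolerance `eps`, and walk count are illustrative choices, not the paper's setup): each walk jumps to a uniform point on the largest circle inscribed at the current location, until it lands within `eps` of the boundary.

```python
import math
import random

def wos_laplace_disk(x0, y0, g, n_walks=2000, eps=1e-3, rng=None):
    """Walk-on-Spheres estimate of the harmonic function u solving
    Laplace's equation on the unit disk with boundary data g(theta).
    Each walk is an unbiased (up to the eps shell) but noisy sample
    of u(x0, y0); averaging many walks reduces the noise."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_walks):
        x, y = x0, y0
        while True:
            r = 1.0 - math.hypot(x, y)   # distance to the boundary circle
            if r < eps:
                break                    # close enough: read off boundary data
            phi = rng.uniform(0.0, 2.0 * math.pi)
            x += r * math.cos(phi)       # jump uniformly on the inscribed circle
            y += r * math.sin(phi)
        total += g(math.atan2(y, x))
    return total / n_walks
```

Each individual walk is exactly the kind of cheap, noisy solution estimate the paper regresses a neural operator against, amortized across a family of PDE instances.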
[LG-60] PARWiS: Winner determination under shoestring budgets using active pairwise comparisons
链接: https://arxiv.org/abs/2603.01171
作者: Shailendra Bhandari
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Neural and Evolutionary Computing (cs.NE)
*备注: 12 pages
Abstract:Determining a winner among a set of items using active pairwise comparisons under a limited budget is a challenging problem in preference-based learning. The goal of this study is to implement and evaluate the PARWiS algorithm, which combines spectral ranking with disruptive pair selection to identify the best item under shoestring budgets. This work extends PARWiS with a contextual variant (Contextual PARWiS) and a reinforcement learning-based variant (RL PARWiS), comparing them against baselines including Double Thompson Sampling and a random selection strategy. This evaluation spans synthetic and real-world datasets (Jester and MovieLens), using budgets of 40, 60, and 80 comparisons for 20 items. Performance is measured through recovery fraction, true rank of the reported winner, reported rank of the true winner, and cumulative regret, alongside the separation metric Δ_{1,2}. Results show that PARWiS and RL PARWiS outperform baselines across all datasets, particularly on the Jester dataset with a higher Δ_{1,2}, while performance gaps narrow on the more challenging MovieLens dataset with a smaller Δ_{1,2}. Contextual PARWiS shows comparable performance to PARWiS, indicating that contextual features may require further tuning to provide significant benefits.
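The spectral-ranking step can be sketched with a Rank-Centrality-style Markov chain over pairwise outcomes (an assumed stand-in for PARWiS's actual spectral routine, not its implementation): probability mass flows from losers toward winners, and the stationary distribution scores the items.

```python
def rank_centrality(n, comparisons, iters=200):
    """Spectral ranking from pairwise comparisons. `comparisons` is a list
    of (winner, loser) pairs over items 0..n-1. Builds a Markov chain in
    which state i transitions to j in proportion to how often j beat i,
    then returns the stationary distribution (higher = better item)."""
    wins = [[0] * n for _ in range(n)]          # wins[i][j]: times i beat j
    for winner, loser in comparisons:
        wins[winner][loser] += 1
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                total = wins[i][j] + wins[j][i]
                if total:
                    # 1/n normalization keeps each row substochastic
                    P[i][j] = wins[j][i] / (total * n)
        P[i][i] = 1.0 - sum(P[i])               # remaining mass stays put
    pi = [1.0 / n] * n
    for _ in range(iters):                      # power iteration: pi <- pi P
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi
```

The reported winner is then simply the argmax of the returned scores, which is what a shoestring-budget method must get right even when the full ranking is uncertain.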
[LG-61] Demystifying Group Relative Policy Optimization: Its Policy Gradient is a U-Statistic
链接: https://arxiv.org/abs/2603.01162
作者: Hongyi Zhou,Kai Ye,Erhan Xu,Jin Zhu,Shijin Gong,Chengchun Shi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 32 pages, 4 figures
Abstract:Group relative policy optimization (GRPO), a core methodological component of DeepSeekMath and DeepSeek-R1, has emerged as a cornerstone for scaling reasoning capabilities of large language models. Despite its widespread adoption and the proliferation of follow-up works, the theoretical properties of GRPO remain less studied. This paper provides a unified framework to understand GRPO through the lens of classical U-statistics. We demonstrate that the GRPO policy gradient is inherently a U-statistic, allowing us to characterize its mean squared error (MSE), derive the finite-sample error bound and asymptotic distribution of the suboptimality gap for its learned policy. Our findings reveal that GRPO is asymptotically equivalent to an oracle policy gradient algorithm – one with access to a value function that quantifies the goodness of its learning policy at each training iteration – and achieves asymptotically optimal performance within a broad class of policy gradient algorithms. Furthermore, we establish a universal scaling law that offers principled guidance for selecting the optimal group size. Empirical experiments further validate our theoretical findings, demonstrating that the optimal group size is universal, and verify the oracle property of GRPO.
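For readers unfamiliar with the estimator being analyzed: GRPO's policy gradient standardizes each sampled completion's reward against its own group, replacing a learned critic, and the group-mean baseline is precisely the kind of sample average the paper reinterprets as a U-statistic. A minimal sketch of the advantage computation:

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as used in GRPO: standardize each
    completion's reward by the mean and std of its own group of G
    samples, so no learned value function (critic) is needed.
    Uses the population std; some implementations use the sample std."""
    m = sum(rewards) / len(rewards)
    var = sum((r - m) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - m) / (std + eps) for r in rewards]
```

The paper's scaling-law result then concerns how the quality of this group-based estimate varies with the group size G.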
[LG-62] A Decomposition Framework for Certifiably Optimal Orthogonal Sparse PCA
链接: https://arxiv.org/abs/2603.01144
作者: Difei Cheng,Qiao Hu
类目: Machine Learning (cs.LG)
*备注: 14 pages; 12 figures
Abstract:Sparse Principal Component Analysis (SPCA) is an important technique for high-dimensional data analysis, improving interpretability by imposing sparsity on principal components. However, existing methods often fail to simultaneously guarantee sparsity, orthogonality, and optimality of the principal components. To address this challenge, this work introduces a novel SPCA algorithm called GS-SPCA (SPCA with Gram-Schmidt Orthogonalization), which simultaneously enforces sparsity, orthogonality, and optimality. The original GS-SPCA algorithm is computationally expensive due to the inherent ℓ_0-norm constraint, so we propose two acceleration strategies. First, we combine Branch-and-Bound with the GS-SPCA algorithm; this yields ε-optimal solutions with a trade-off between precision and efficiency, significantly improving computational speed. Second, we propose a decomposition framework for efficiently solving for multiple principal components. This framework approximates the covariance matrix by a block-diagonal matrix through a thresholding method, reducing the original SPCA problem to a set of block-wise subproblems on approximately block-diagonal matrices.
[LG-63] Understanding LoRA as Knowledge Memory: An Empirical Analysis
链接: https://arxiv.org/abs/2603.01097
作者: Seungju Back,Dongwoo Lee,Naun Kang,Taehee Lee,S. K. Hong,Youngjune Gwon,Sungjin Ahn
类目: Machine Learning (cs.LG)
*备注:
Abstract:Continuous knowledge updating for pre-trained large language models (LLMs) is increasingly necessary yet remains challenging. Although inference-time methods like In-Context Learning (ICL) and Retrieval-Augmented Generation (RAG) are popular, they face constraints in context budgets, costs, and retrieval fragmentation. Departing from these context-dependent paradigms, this work investigates a parametric approach using Low-Rank Adaptation (LoRA) as a modular knowledge memory. Although few recent works examine this concept, the fundamental mechanics governing its capacity and composability remain largely unexplored. We bridge this gap through the first systematic empirical study mapping the design space of LoRA-based memory, ranging from characterizing storage capacity and optimizing internalization to scaling multi-module systems and evaluating long-context reasoning. Rather than proposing a single architecture, we provide practical guidance on the operational boundaries of LoRA memory. Overall, our findings position LoRA as the complementary axis of memory alongside RAG and ICL, offering distinct advantages.
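The mechanism under study is LoRA's rank-r additive update. A dependency-free sketch with list-of-rows matrices; the shapes and scaling follow the standard LoRA formulation, y = (W + s·BA)x, not anything specific to this paper:

```python
def lora_forward(x, W, A, B, scale=1.0):
    """Forward pass through a frozen weight W augmented by a LoRA update:
    y = (W + scale * B @ A) x, where A is (r, d_in) and B is (d_out, r).
    Because only A and B are trained, the rank-r product acts as a small,
    swappable module -- the 'knowledge memory' the paper investigates."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
                 for j in range(len(Y[0]))] for i in range(len(X))]
    delta = matmul(B, A)            # (d_out, r) @ (r, d_in) -> (d_out, d_in)
    base = matvec(W, x)
    low = matvec(delta, x)
    return [b + scale * l for b, l in zip(base, low)]
```

Swapping in a different (A, B) pair swaps the stored knowledge without touching W, which is what makes per-topic LoRA modules composable as memory.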
[LG-64] Adaptive-Growth Randomized Neural Networks for Level-Set Computation of Multivalued Nonlinear First-Order PDEs with Hyperbolic Characteristics
链接: https://arxiv.org/abs/2603.01093
作者: Haoning Dang,Shi Jin,Fei Wang
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
Abstract:This paper proposes an Adaptive-Growth Randomized Neural Network (AG-RaNN) method for computing multivalued solutions of nonlinear first-order PDEs with hyperbolic characteristics, including quasilinear hyperbolic balance laws and Hamilton–Jacobi equations. Such solutions arise in geometric optics, seismic waves, semiclassical limit of quantum dynamics and high frequency limit of linear waves, and differ markedly from the viscosity or entropic solutions. The main computational challenges lie in that the solutions are no longer functions, and become union of multiple branches, after the formation of singularities. Level-set formulations offer a systematic alternative by embedding the nonlinear dynamics into linear transport equations posed in an augmented phase space, at the price of substantially increased dimensionality. To alleviate this computational burden, we combine AG-RaNN with an adaptive collocation strategy that concentrates samples in a tubular neighborhood of the zero level set, together with a layer-growth mechanism that progressively enriches the randomized feature space. Under standard regularity assumptions on the transport field and the characteristic flow, we establish a convergence result for the AG-RaNN approximation of the level-set equations. Numerical experiments demonstrate that the proposed method can efficiently recover multivalued structures and resolve nonsmooth features in high-dimensional settings.
[LG-65] A level-wise training scheme for learning neural multigrid smoothers with application to integral equations
链接: https://arxiv.org/abs/2603.01064
作者: Lingfeng Li,Yin King Chu,Raymond Chan,Justin Wan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Convolution-type integral equations commonly occur in signal processing and image processing. Discretizing these equations yields large and ill-conditioned linear systems. While the classic multigrid method is effective for solving linear systems derived from partial differential equations (PDE) problems, it fails to solve integral equations because its smoothers, which are implemented as conventional relaxation methods, are ineffective in reducing high-frequency components in the errors. We propose a novel neural multigrid scheme where learned neural operators replace classical smoothers. Unlike classical smoothers, these operators are trained offline. Once trained, the neural smoothers generalize to new right-hand-side vectors without retraining, making it an efficient solver. We design level-wise loss functions incorporating spectral filtering to emulate the multigrid frequency decomposition principle, ensuring each operator focuses on solving distinct high-frequency spectral bands. Although we focus on integral equations, the framework is generalizable to all kinds of problems, including PDE problems. Our experiments demonstrate superior efficiency over classical solvers and robust convergence across varying problem sizes and regularization weights.
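The property the learned smoothers must recover is the classical smoothing property: for PDE-derived systems, a relaxation sweep damps high-frequency error modes much faster than smooth ones. A stdlib-only check of this on the 1D Poisson matrix (weighted Jacobi on A = tridiag(-1, 2, -1); the parameter choices are the textbook ones, unrelated to the paper's integral-equation setting):

```python
import math

def weighted_jacobi_error_decay(n=31, mode=1, sweeps=10, omega=2.0 / 3.0):
    """Apply `sweeps` weighted-Jacobi iterations to the error equation
    A e = 0 for the 1D Poisson matrix A = tridiag(-1, 2, -1), starting
    from the Fourier mode e_i = sin(mode * pi * i * h). Each sweep damps
    this mode by 1 - omega + omega * cos(mode * pi * h), so high
    frequencies (mode near n) decay fast while smooth modes barely move.
    Returns the remaining max-norm of the error."""
    h = 1.0 / (n + 1)
    e = [math.sin(mode * math.pi * (i + 1) * h) for i in range(n)]
    for _ in range(sweeps):
        new = []
        for i in range(n):
            left = e[i - 1] if i > 0 else 0.0      # zero Dirichlet boundaries
            right = e[i + 1] if i < n - 1 else 0.0
            # Jacobi update for A e = 0: e_i <- (e_{i-1} + e_{i+1}) / 2
            new.append((1.0 - omega) * e[i] + omega * (left + right) / 2.0)
        e = new
    return max(abs(v) for v in e)
```

For integral equations this frequency separation fails for conventional relaxation, which is exactly the gap the paper's offline-trained neural smoothers, with their band-targeted level-wise losses, are meant to close.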
[LG-66] No More Maybe-Arrows: Resolving Causal Uncertainty by Breaking Symmetries
链接: https://arxiv.org/abs/2603.01052
作者: Tingrui Huang,Devendra Singh Dhami
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent works on causal discovery have followed a similar trend of learning partial ancestral graphs (PAGs), since observational data constrain the true causal directed acyclic graph (DAG) only up to a Markov equivalence class. This limits their application in the majority of downstream tasks, as uncertainty in causal relations remains unresolved. We propose a new refinement framework, CausalSAGE, for converting PAGs to DAGs while respecting the underlying causal relations. The framework expands discrete variables into state-level representations, constrains the search space using structural knowledge and soft priors, and applies a unified differentiable objective for joint optimization. The final DAG is obtained by aggregating the optimized structures and enforcing acyclicity when necessary. Our experimental evaluations show that the obtained DAGs preserve the underlying causal relations while also being efficient to obtain.
[LG-67] Evaluating GFlowNet from partial episodes for stable and flexible policy-based training ICLR2026
链接: https://arxiv.org/abs/2603.01047
作者: Puhua Niu,Shili Wu,Xiaoning Qian
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by ICLR 2026
Abstract:Generative Flow Networks (GFlowNets) were developed to learn policies for efficiently sampling combinatorial candidates by interpreting their generative processes as trajectories in directed acyclic graphs. In the value-based training workflow, the objective is to enforce the balance over partial episodes between the flows of the learned policy and the estimated flows of the desired policy, implicitly encouraging policy divergence minimization. The policy-based strategy alternates between estimating the policy divergence and updating the policy, but reliable estimation of the divergence under directed acyclic graphs remains a major challenge. This work bridges the two perspectives by showing that flow balance also yields a principled policy evaluator that measures the divergence, and an evaluation balance objective over partial episodes is proposed for learning the evaluator. As demonstrated on both synthetic and real-world tasks, evaluation balance not only strengthens the reliability of policy-based training but also broadens its flexibility by seamlessly supporting parameterized backward policies and enabling the integration of offline data-collection techniques.
[LG-68] Fed-ADE: Adaptive Learning Rate for Federated Post-adaptation under Distribution Shift CVPR2026
链接: https://arxiv.org/abs/2603.01040
作者: Heewon Park,Mugon Joe,Miru Kim,Kyungjin Im,Minhae Kwon
类目: Machine Learning (cs.LG)
*备注: Accepted at CVPR 2026
Abstract:Federated learning (FL) in post-deployment settings must adapt to non-stationary data streams across heterogeneous clients without access to ground-truth labels. A major challenge is learning rate selection under client-specific, time-varying distribution shifts, where fixed learning rates often lead to underfitting or divergence. We propose Fed-ADE (Federated Adaptation with Distribution Shift Estimation), an unsupervised federated adaptation framework that leverages lightweight estimators of distribution dynamics. Specifically, Fed-ADE employs uncertainty dynamics estimation to capture changes in predictive uncertainty and representation dynamics estimation to detect covariate-level feature drift, combining them into a per-client, per-timestep adaptive learning rate. We provide theoretical analyses showing that our dynamics estimation approximates the underlying distribution shift and yields dynamic regret and convergence guarantees. Experiments on image and text benchmarks under diverse distribution shifts (label and covariate) demonstrate consistent improvements over strong baselines. These results highlight that distribution shift-aware adaptation enables effective and robust federated post-adaptation under real-world non-stationarity.
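To make the adaptive-learning-rate idea concrete, here is a purely illustrative sketch: the function name, the linear combination with weights `alpha`/`beta`, and the clipping range are all assumptions, not Fed-ADE's actual rule, which combines its two estimators in a way the abstract does not fully specify.

```python
def adaptive_lr(base_lr, uncertainty_drift, feature_drift,
                alpha=1.0, beta=1.0, lr_min=1e-5, lr_max=1e-1):
    """Hypothetical Fed-ADE-style rule: scale a client's learning rate
    at each timestep by its estimated distribution shift (uncertainty
    dynamics + representation dynamics), clipped to a safe range so a
    large estimated shift cannot cause divergence."""
    shift = alpha * uncertainty_drift + beta * feature_drift
    return min(lr_max, max(lr_min, base_lr * (1.0 + shift)))
```

The point of such a rule is that a client seeing no shift keeps its base rate, while a client under heavy label or covariate drift adapts faster, per client and per timestep.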
[LG-69] BadRSSD: Backdoor Attacks on Regularized Self-Supervised Diffusion Models
链接: https://arxiv.org/abs/2603.01019
作者: Jiayao Wang,Yiping Zhang,Mohammad Maruf Hasan,Xiaoying Lei,Jiale Zhang,Junwu Zhu,Qilin Wu,Dongfang Zhao
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Self-supervised diffusion models learn high-quality visual representations via latent space denoising. However, their representation layer poses a distinct threat: unlike traditional attacks targeting generative outputs, its unconstrained latent semantic space allows for stealthy backdoors, permitting malicious control upon triggering. In this paper, we propose BadRSSD, the first backdoor attack targeting the representation layer of self-supervised diffusion models. Specifically, it hijacks the semantic representations of poisoned samples with triggers in Principal Component Analysis (PCA) space toward those of a target image, then controls the denoising trajectory during diffusion by applying coordinated constraints across latent, pixel, and feature distribution spaces to steer the model toward generating the specified target. Additionally, we integrate representation dispersion regularization into the constraint framework to maintain feature space uniformity, significantly enhancing attack stealth. This approach preserves normal model functionality (high utility) while achieving precise target generation upon trigger activation (high specificity). Experiments on multiple benchmark datasets demonstrate that BadRSSD substantially outperforms existing attacks in both FID and MSE metrics, reliably establishing backdoors across different architectures and configurations, and effectively resisting state-of-the-art backdoor defenses.
[LG-70] Feature-Weighted Maximum Representative Subsampling
链接: https://arxiv.org/abs/2603.01013
作者: Tony Hauptmann,Stefan Kramer
类目: Machine Learning (cs.LG)
*备注:
Abstract:In the social sciences, it is often necessary to debias studies and surveys before valid conclusions can be drawn. Debiasing algorithms enable the computational removal of bias using sample weights. However, an issue arises when only a subset of features is highly biased, while the rest is already representative. Algorithms need to strongly alter the sample distribution to manage a few highly biased features, which can in turn introduce bias into already representative variables. To address this issue, we developed a method that uses feature weights to minimize the impact of highly biased features on the computation of sample weights. Our algorithm is based on Maximum Representative Subsampling (MRS), which debiases datasets by aligning a non-representative sample with a representative one through iterative removal of elements to create a representative subsample. The new algorithm, named feature-weighted MRS (FW-MRS), decreases the emphasis on highly biased features, allowing it to retain more instances for downstream tasks. The feature weights are derived from the feature importance of a domain classifier trained to differentiate between the representative and non-representative datasets. We validated FW-MRS using eight tabular datasets, each of which we artificially biased. Biased features can be important for downstream tasks, and focusing less on them could lead to a decline in generalization. For this reason, we assessed the generalization performance of FW-MRS on downstream tasks and found no statistically significant differences. Additionally, FW-MRS was applied to a real-world dataset from the social sciences. The source code is available at this https URL.
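The core idea, turning a domain classifier's feature importances into weights that de-emphasize the most biased features, can be sketched as follows. The exponential inversion and the `temperature` parameter are illustrative assumptions; the paper does not specify this exact mapping.

```python
import math

def bias_feature_weights(importances, temperature=1.0):
    """Convert domain-classifier feature importances (high = the feature
    strongly separates the biased sample from the representative one,
    i.e. it is highly biased) into normalized feature weights for the
    subsampling objective, so highly biased features contribute less.
    The exp(-t * w) inversion is an illustrative choice."""
    inv = [math.exp(-temperature * w) for w in importances]
    z = sum(inv)
    return [v / z for v in inv]
```

With weights like these, the representativeness criterion no longer forces the sampler to distort already-representative features just to fix a few heavily biased ones.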
[LG-71] DWAFM: Dynamic Weighted Graph Structure Embedding Integrated with Attention and Frequency-Domain MLPs for Traffic Forecasting
链接: https://arxiv.org/abs/2603.00997
作者: Sen Shi,Zhichao Zhang,Yangfan He
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Accurate traffic prediction is a key task for intelligent transportation systems. The core difficulty lies in accurately modeling the complex spatial-temporal dependencies in traffic data. In recent years, improvements in network architecture have failed to bring significant performance enhancements, while embedding technology has shown great potential. However, existing embedding methods often ignore graph structure information or rely solely on static graph structures, making it difficult to effectively capture the dynamic associations between nodes that evolve over time. To address this issue, this letter proposes a novel dynamic weighted graph structure (DWGS) embedding method, which relies on a graph structure that can truly reflect the changes in the strength of dynamic associations between nodes over time. By first combining the DWGS embedding with the spatial-temporal adaptive embedding, as well as the temporal embedding and feature embedding, and then integrating attention and frequency-domain multi-layer perceptrons (MLPs), we design a novel traffic prediction model, termed the DWGS embedding integrated with attention and frequency-domain MLPs (DWAFM). Experiments on five real-world traffic datasets show that DWAFM achieves better prediction performance than several state-of-the-art models.
[LG-72] Compensation-free Machine Unlearning in Text-to-Image Diffusion Models by Eliminating the Mutual Information
链接: https://arxiv.org/abs/2603.00992
作者: Xinwen Cheng,Jingyuan Zhang,Zhehao Huang,Yingwen Wu,Xiaolin Huang
类目: Machine Learning (cs.LG)
*备注:
Abstract:The powerful generative capabilities of diffusion models have raised growing privacy and safety concerns regarding generating sensitive or undesired content. In response, machine unlearning (MU) – commonly referred to as concept erasure (CE) in diffusion models – has been introduced to remove specific knowledge from model parameters while preserving innocent knowledge. Despite recent advancements, existing unlearning methods often suffer from excessive and indiscriminate removal, which leads to substantial degradation in the quality of innocent generations. To preserve model utility, prior works rely on compensation, i.e., re-assimilating a subset of the remaining data or explicitly constraining the divergence from the pre-trained model on remaining concepts. However, we reveal that generations beyond the compensation scope still suffer, suggesting such post-remedial compensations are inherently insufficient for preserving the general utility of large-scale generative models. Therefore, in this paper, we advocate for developing compensation-free concept erasure operations, which precisely identify and eliminate the undesired knowledge such that the impact on other generations is minimal. Technically, we propose MiM-MU, which unlearns a concept by minimizing its mutual information, with a careful design for computational efficiency and for preserving the sampling distribution of other concepts. Extensive evaluations demonstrate that our proposed method achieves effective concept removal while maintaining high-quality generations for other concepts and, remarkably, is the first to do so without relying on any post-remedial compensation.
[LG-73] SoberDSE: Sample-Efficient Design Space Exploration via Learning-Based Algorithm Selection
链接: https://arxiv.org/abs/2603.00986
作者: Lei Xu,Shanshan Wang,Chenglong Xiao
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:
Abstract:High-Level Synthesis (HLS) is a pivotal electronic design automation (EDA) technology that enables the generation of hardware circuits from high-level language descriptions. A critical step in HLS is Design Space Exploration (DSE), which seeks to identify high-quality hardware architectures under given constraints. However, the enormous size of the design space makes DSE computationally prohibitive. Although numerous algorithms have been proposed to accelerate DSE, our extensive experimental studies reveal that no single algorithm consistently achieves Pareto dominance across all problem instances. Consequently, the inability of any single algorithm to dominate all benchmarks necessitates an automated selection mechanism to identify the best-performing DSE algorithm for each specific case. To address this challenge, we propose the SoberDSE framework, which recommends suitable algorithms based on benchmark characteristics. Experimental results demonstrate that our SoberDSE framework significantly outperforms state-of-the-art heuristic-based DSE algorithms by up to 5.7x and state-of-the-art learning-based DSE methods by up to 4.2x. Furthermore, compared to conventional classification models, SoberDSE delivers superior accuracy in small-sample learning scenarios, with an average enhancement of 35.57%. Code and models are available at this https URL.
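Per-instance algorithm selection reduces to learning a map from benchmark features to the algorithm that performed best on similar benchmarks. A deliberately tiny 1-nearest-neighbor stand-in (SoberDSE's actual model is a learned classifier, and these feature vectors and algorithm names are hypothetical):

```python
def select_algorithm(train_feats, train_best, query_feat):
    """Recommend a DSE algorithm for a new benchmark by finding the most
    similar known benchmark (squared Euclidean distance over its feature
    vector) and returning the algorithm that was empirically best there.
    A 1-NN stand-in for a learned per-instance selector."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    i = min(range(len(train_feats)),
            key=lambda k: dist(train_feats[k], query_feat))
    return train_best[i]
```

The premise, established empirically in the paper, is that no single DSE algorithm Pareto-dominates everywhere, so even a simple selector of this shape can beat any fixed choice.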
[LG-74] Intent-Context Synergy Reinforcement Learning for Autonomous UAV Decision-Making in Air Combat
链接: https://arxiv.org/abs/2603.00974
作者: Jiahao Fu,Feng Yang
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Autonomous UAV infiltration in dynamic contested environments remains a significant challenge due to the partially observable nature of threats and the conflicting objectives of mission efficiency versus survivability. Traditional Reinforcement Learning (RL) approaches often suffer from myopic decision-making and struggle to balance these trade-offs in real-time. To address these limitations, this paper proposes an Intent-Context Synergy Reinforcement Learning (ICS-RL) framework. The framework introduces two core innovations: (1) An LSTM-based Intent Prediction Module that forecasts the future trajectories of hostile units, transforming the decision paradigm from reactive avoidance to proactive planning via state augmentation; (2) A Context-Analysis Synergy Mechanism that decomposes the mission into hierarchical sub-tasks (safe cruise, stealth planning, and hostile breakthrough). We design a heterogeneous ensemble of Dueling DQN agents, each specialized in a specific tactical context. A dynamic switching controller based on Max-Advantage values seamlessly integrates these agents, allowing the UAV to adaptively select the optimal policy without hard-coded rules. Extensive simulations demonstrate that ICS-RL significantly outperforms baselines (Standard DDQN) and traditional methods (PSO, Game Theory). The proposed method achieves a mission success rate of 88% and reduces the average exposure frequency to 0.24 per episode, validating its superiority in ensuring robust and stealthy penetration in high-dynamic scenarios.
[LG-75] Principled Fast and Meta Knowledge Learners for Continual Reinforcement Learning ICLR2026
链接: https://arxiv.org/abs/2603.00903
作者: Ke Sun,Hongming Zhang,Jun Jin,Chao Gao,Xi Chen,Wulong Liu,Linglong Kong
类目: Machine Learning (cs.LG)
*备注: Published in ICLR 2026
Abstract:Inspired by the human learning and memory system, particularly the interplay between the hippocampus and cerebral cortex, this study proposes a dual-learner framework comprising a fast learner and a meta learner to address continual Reinforcement Learning (RL) problems. These two learners are coupled to perform distinct yet complementary roles: the fast learner focuses on knowledge transfer, while the meta learner ensures knowledge integration. In contrast to traditional multi-task RL approaches that share knowledge through average return maximization, our meta learner incrementally integrates new experiences by explicitly minimizing catastrophic forgetting, thereby supporting efficient cumulative knowledge transfer for the fast learner. To facilitate rapid adaptation in new environments, we introduce an adaptive meta warm-up mechanism that selectively harnesses past knowledge. We conduct experiments in various pixel-based and continuous control benchmarks, revealing the superior performance of continual learning for our proposed dual-learner approach relative to baseline methods. The code is released in this https URL.
[LG-76] Evaluating AI Grading on Real-World Handwritten College Mathematics: A Large-Scale Study Toward a Benchmark
链接: https://arxiv.org/abs/2603.00895
作者: Zhiqi Yu,Xingping Liu,Haobin Mao,Mingshuo Liu,Long Chen,Jack Xin,Yifeng Yu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Grading in large undergraduate STEM courses often yields minimal feedback due to heavy instructional workloads. We present a large-scale empirical study of AI grading on real, handwritten single-variable calculus work from UC Irvine. Using OCR-conditioned large language models with structured, rubric-guided prompting, our system produces scores and formative feedback for thousands of free-response quiz submissions from nearly 800 students. In a setting with no single ground-truth label, we evaluate performance against official teaching-assistant grades, student surveys, and independent human review, finding strong alignment with TA scoring and a large majority of AI-generated feedback rated as correct or acceptable across quizzes. Beyond calculus, this setting highlights core challenges in OCR-conditioned mathematical reasoning and partial-credit assessment. We analyze key failure modes, propose practical rubric- and prompt-design principles, and introduce a multi-perspective evaluation protocol for reliable, real-course deployment. Building on the dataset and evaluation framework developed here, we outline a standardized benchmark for AI grading of handwritten mathematics to support reproducible comparison and future research.
[LG-77] Probabilistic Learning and Generation in Deep Sequence Models
链接: https://arxiv.org/abs/2603.00888
作者: Wenlong Chen
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: PhD thesis
Abstract:Despite the exceptional predictive performance of deep sequence models (DSMs), the main concern of their deployment centers around the lack of uncertainty awareness. In contrast, probabilistic models quantify the uncertainty associated with unobserved variables with rules of probability. Notably, Bayesian methods leverage Bayes’ rule to express our belief about unobserved variables in a principled way. Since exact Bayesian inference is computationally infeasible at scale, approximate inference is required in practice. Two major bottlenecks of Bayesian methods, especially when applied in deep neural networks, are prior specification and approximation quality. In Chapters 3 and 4, we investigate how the architectures of DSMs themselves can be informative for the design of priors or approximations in probabilistic models. We first develop an approximate Bayesian inference method tailored to the Transformer based on the similarity between attention and sparse Gaussian processes. Next, we exploit the long-range memory preservation capability of HiPPOs (High-order Polynomial Projection Operators) to construct an interdomain inducing point for Gaussian processes, which successfully memorizes the history in online learning. In addition to the progress of DSMs in predictive tasks, sequential generative models consisting of a sequence of latent variables are popularized in the domain of deep generative models. Inspired by the explicit self-supervised signals for these latent variables in diffusion models, in Chapter 5, we explore the possibility of improving other generative models with self-supervision for their sequential latent states, and investigate desired probabilistic structures over them. Overall, this thesis leverages inductive biases in DSMs to design probabilistic inference or structure, which bridges the gap between DSMs and probabilistic models, leading to mutually reinforced improvement.
[LG-78] Active Flow Matching
链接: https://arxiv.org/abs/2603.00877
作者: Yashvir S. Grewal,Daniel M. Steinberg,Thang D. Bui,Cheng Soon Ong,Edwin V. Bonilla
类目: Machine Learning (cs.LG)
*备注:
Abstract:Discrete diffusion and flow matching models capture complex, non-additive and non-autoregressive structure in high-dimensional objective landscapes through parallel, iterative refinement. However, their implicit generative nature precludes direct integration with principled variational frameworks for online black-box optimisation, such as variational search distributions (VSD) and conditioning by adaptive sampling (CbAS). We introduce Active Flow Matching (AFM), which reformulates variational objectives to operate on conditional endpoint distributions along the flow, enabling gradient-based steering of flow models toward high-fitness regions while preserving the rigour of VSD and CbAS. We derive forward and reverse Kullback-Leibler (KL) variants using self-normalised importance sampling. Across a suite of online protein and small molecule design tasks, forward-KL AFM consistently performs competitively compared to state-of-the-art baselines, demonstrating effective exploration-exploitation under tight experimental budgets.
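As a rough illustration of the self-normalised importance sampling that underlies AFM's forward/reverse KL estimators, the sketch below (a toy 1-D Gaussian example of my own, not the paper's setup) estimates an expectation under an unnormalised target using samples from a proposal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: estimate E_p[x] for target p = N(1, 1) using samples
# drawn from a proposal q = N(0, 1), via self-normalised importance weights.
x = rng.normal(0.0, 1.0, size=50_000)          # samples from proposal q
log_p = -0.5 * (x - 1.0) ** 2                  # unnormalised log p
log_q = -0.5 * x ** 2                          # unnormalised log q
log_w = log_p - log_q
w = np.exp(log_w - log_w.max())                # numerically stabilised weights
w /= w.sum()                                   # self-normalisation
snis_mean = np.sum(w * x)                      # estimates E_p[x] = 1
```

Because the weights are normalised by their own sum, the unknown normalising constants of p and q cancel, which is what makes this usable with unnormalised densities.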
[LG-79] GCL-Sampler: Discovering Kernel Similarity for Sampled GPU Simulation via Graph Contrastive Learning
链接: https://arxiv.org/abs/2603.00551
作者: Jiaqi Wang,Jingwei Sun,Jiyu Luo,Han Li,Guangzhong Sun
类目: Performance (cs.PF); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:
Abstract:GPU architectural simulation is orders of magnitude slower than native execution, necessitating workload sampling for practical speedups. Existing methods rely on hand-crafted features with limited expressiveness, yielding either aggressive sampling with high errors or conservative sampling with constrained speedups. To address these issues, we propose GCL-Sampler, a sampling framework that leverages Relational Graph Convolutional Networks with contrastive learning to automatically discover high-dimensional kernel similarities from trace graphs. By encoding instruction sequences and data dependencies into graph embeddings, GCL-Sampler captures rich structural and semantic properties of program execution, enabling both high fidelity and substantial speedup. Evaluations on extensive benchmarks show that GCL-Sampler achieves 258.94x average speedup against full workload with 0.37% error, outperforming state-of-the-art methods, PKA (129.23x, 20.90%), Sieve (94.90x, 4.10%) and STEM+ROOT (56.57x, 0.38%).
[LG-80] Spectral Condition for μP under Width-Depth Scaling
链接: https://arxiv.org/abs/2603.00541
作者: Chenyu Zheng,Rongzhen Wang,Xinyu Zhang,Chongxuan Li
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 51 pages, 6 figures, 13 tables
Abstract:Generative foundation models are increasingly scaled in both width and depth, posing significant challenges for stable feature learning and reliable hyperparameter (HP) transfer across model sizes. While maximal update parameterization (μP) has provided a principled solution to both problems for width scaling, existing extensions to the joint width-depth scaling regime remain fragmented, architecture- and optimizer-specific, and often rely on technically involved theories. In this work, we develop a simple and unified spectral framework for μP under joint width-depth scaling. Considering residual networks of varying block depths, we first introduce a spectral μP condition that precisely characterizes how the norms of weights and their per-step updates should scale with width and depth, unifying previously disparate μP formulations as special cases. Building on this condition, we then derive a general recipe for implementing μP across a broad class of optimizers by mapping the spectral constraints to concrete HP parameterizations. This approach not only recovers existing μP formulations (e.g., for SGD and AdamW) but also naturally extends to a wider range of optimizers. Finally, experiments on GPT-2 style language models demonstrate that the proposed spectral μP condition preserves stable feature learning and enables robust HP transfer under width-depth scaling.
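The width-only part of a spectral condition can be sketched as rescaling each weight matrix to a target spectral norm of sqrt(fan_out/fan_in), following the earlier spectral-condition literature this abstract builds on; the paper's depth-dependent scaling is not reproduced here, so treat this purely as background:

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_init(fan_out, fan_in, rng):
    # Rescale a Gaussian matrix so its largest singular value hits the
    # width-scaling target sqrt(fan_out / fan_in).
    W = rng.standard_normal((fan_out, fan_in))
    target = np.sqrt(fan_out / fan_in)
    return W * (target / np.linalg.norm(W, 2))   # ord=2: spectral norm

W = spectral_init(256, 64, rng)
```

With this normalisation, activations keep roughly unit scale as width grows, which is the property that makes hyperparameters transferable across widths.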
[LG-81] Mathematical Foundations of Poisoning Attacks on Linear Regression over Cumulative Distribution Functions SIGMOD2026
链接: https://arxiv.org/abs/2603.00537
作者: Atsuki Sato,Martin Aumüller,Yusuke Matsui
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注: SIGMOD 2026
Abstract:Learned indexes are a class of index data structures that enable fast search by approximating the cumulative distribution function (CDF) using machine learning models (Kraska et al., SIGMOD’18). However, recent studies have shown that learned indexes are vulnerable to poisoning attacks, where injecting a small number of poison keys into the training data can significantly degrade model accuracy and reduce index performance (Kornaropoulos et al., SIGMOD’22). In this work, we provide a rigorous theoretical analysis of poisoning attacks targeting linear regression models over CDFs, one of the most basic regression models and a core component in many learned indexes. Our main contributions are as follows: (i) We present a theoretical proof characterizing the optimal single-point poisoning attack and show that the existing method yields the optimal attack. (ii) We show that in multi-point attacks, the existing greedy approach is not always optimal, and we rigorously derive the key properties that an optimal attack should satisfy. (iii) We propose a method to compute an upper bound of the multi-point poisoning attack’s impact and empirically demonstrate that the loss under the greedy approach is often close to this bound. Our study deepens the theoretical understanding of attack strategies against linear regression models on CDFs and provides a foundation for the theoretical evaluation of attacks and defenses on learned indexes.
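A minimal numeric illustration of the attack surface (my own toy example, not the paper's construction): a learned index fits position ≈ a·key + b to the empirical CDF of the sorted keys, and a single far-out poison key acts as a leverage point that drags the least-squares fit away from the clean data:

```python
import numpy as np

# Clean keys spread uniformly on [0, 1]; the empirical CDF is then linear.
keys = np.linspace(0.0, 1.0, 101)

def cdf_fit(k):
    k = np.sort(k)
    pos = np.arange(len(k)) / (len(k) - 1)     # empirical CDF (rank / (n-1))
    a, b = np.polyfit(k, pos, 1)               # least-squares linear model
    return a, b

a_clean, _ = cdf_fit(keys)                     # slope is exactly 1 here
poisoned = np.append(keys, 5.0)                # one injected poison key
a_pois, _ = cdf_fit(poisoned)
slope_shift = abs(a_pois - a_clean)            # degradation of the linear fit
```

A larger slope error means larger prediction error for lookups, which in a learned index translates directly into longer local searches.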
[LG-82] Bridge Matching Sampler: Scalable Sampling via Generalized Fixed-Point Diffusion Matching
链接: https://arxiv.org/abs/2603.00530
作者: Denis Blessing,Lorenz Richter,Julius Berner,Egor Malitskiy,Gerhard Neumann
类目: Machine Learning (cs.LG)
*备注: Preprint
Abstract:Sampling from unnormalized densities using diffusion models has emerged as a powerful paradigm. However, while recent approaches that use least-squares “matching” objectives have improved scalability, they often necessitate significant trade-offs, such as restricting prior distributions or relying on unstable optimization schemes. By generalizing these methods as special forms of fixed-point iterations rooted in Nelson’s relation, we develop a new method that addresses these limitations, called Bridge Matching Sampler (BMS). Our approach enables learning a stochastic transport map between arbitrary prior and target distributions with a single, scalable, and stable objective. Furthermore, we introduce a damped variant of this iteration that incorporates a regularization term to mitigate mode collapse and further stabilize training. Empirically, we demonstrate that our method enables sampling at unprecedented scales while preserving mode diversity, achieving state-of-the-art results on complex synthetic densities and high-dimensional molecular benchmarks.
[LG-83] Trinity: A Scenario-Aware Recommendation Framework for Large-Scale Cold-Start Users
链接: https://arxiv.org/abs/2603.00502
作者: Wenhao Zheng,Wang Lu,Fangshuang Tang,Yiyang Lu,Jun Yang,Pengcheng Xiong,Yulan Yan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Early-stage users in a new scenario intensify cold-start challenges, yet prior works often address only parts of the problem through model architecture. Launching a new user experience to replace an established product involves sparse behavioral signals, low-engagement cohorts, and unstable model performance. We argue that effective recommendations require the synergistic integration of feature engineering, model architecture, and stable model updating. We propose Trinity, a framework embodying this principle. Trinity extracts valuable information from existing scenarios while ensuring predictive effectiveness and accuracy in the new scenario. In this paper, we showcase Trinity applied to a billion-user Microsoft product transition. Both offline and online experiments demonstrate that our framework achieves substantial improvements in addressing the combined challenge of new users in new scenarios.
[LG-84] Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence ICLR2026
链接: https://arxiv.org/abs/2603.00498
作者: Quoc Minh Nguyen,Trung Le,Jing Wu,Anh Tuan Bui,Mehrtash Harandi
类目: Machine Learning (cs.LG)
*备注: Published at ICLR 2026
Abstract:Fine-tuning-as-a-service introduces a threat to Large Language Models’ safety when service providers fine-tune their models on poisoned user-submitted datasets, a process known as harmful fine-tuning attacks. In this work, we show that by regularizing the gradient contribution of harmful samples encountered during fine-tuning, we can effectively mitigate the impact of harmful fine-tuning attacks. To this end, we introduce Antibody, a defense strategy that first ensures robust safety alignment for the model before fine-tuning, and then applies a safety-preservation learning algorithm during fine-tuning. Specifically, in the alignment stage before fine-tuning, we propose optimizing the model to be in a flat loss region with respect to harmful samples, which makes the safety alignment more resilient to subsequent harmful fine-tuning. Then, in the fine-tuning stage, we design a fine-tuning algorithm that applies a weighting scheme to all samples in each training batch to inhibit the model from learning from harmful samples while encouraging learning from benign samples. Experimental results demonstrate that Antibody successfully mitigates harmful fine-tuning attacks while boosting fine-tuning performance on the user-submitted dataset.
[LG-85] When does Chain-of-Thought Help: A Markovian Perspective
链接: https://arxiv.org/abs/2603.00306
作者: Zihan Wang,Yijun Dong,Qi Lei
类目: Machine Learning (cs.LG)
*备注:
Abstract:Chain-of-Thought (CoT) prompting is a widely used inference-time technique for improving reasoning, yet its gains are uneven across tasks. We analyze when and why CoT helps by modeling the step-wise reasoning trajectory as a Markov chain. Each intermediate step is a state and the dependence between steps is captured by a transition kernel. Our theory identifies transition alignment, whether instances share a common step-wise transition kernel, as the key determinant of CoT’s effectiveness. When transitions are identical across steps, CoT reduces inference-time sample complexity: fewer context sample trajectories suffice to recover the final decision. In contrast, when transitions differ across steps, these gains can vanish. We further quantify how noise in intermediate steps modulates CoT’s benefit. Beyond theory, we design synthetic benchmarks that isolate these factors to complement prior results on real-world tasks and to empirically validate our predictions.
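The "transition alignment" setting can be sketched concretely (a toy of my own, not the paper's benchmark): when all instances share one step-wise transition kernel, a few context trajectories suffice to estimate it, which is the mechanism behind the claimed sample-complexity reduction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical CoT model: a Markov chain over 3 reasoning states with a single
# transition kernel P shared across all steps and instances (the aligned case).
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])

def sample_trajectory(P, T, rng):
    s, traj = 0, [0]
    for _ in range(T):
        s = rng.choice(3, p=P[s])
        traj.append(s)
    return traj

# Estimate the shared kernel from a modest number of context trajectories.
counts = np.zeros((3, 3))
for _ in range(200):
    traj = sample_trajectory(P, 20, rng)
    for a, b in zip(traj[:-1], traj[1:]):
        counts[a, b] += 1
P_hat = counts / counts.sum(axis=1, keepdims=True)
```

If instead each step used a different kernel, the transition counts could not be pooled across steps and the estimate would need far more data, mirroring the paper's negative case.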
[LG-86] Scalable Gaussian process modeling of parametrized spatio-temporal fields
链接: https://arxiv.org/abs/2603.00290
作者: Srinath Dama,Prasanth B. Nair
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce a scalable Gaussian process (GP) framework with deep product kernels for data-driven learning of parametrized spatio-temporal fields over fixed or parameter-dependent domains. The proposed framework learns a continuous representation, enabling predictions at arbitrary spatio-temporal coordinates, independent of the training data resolution. We leverage Kronecker matrix algebra to formulate a computationally efficient training procedure with complexity that scales nearly linearly with the total number of spatio-temporal grid points. A key feature of our approach is the efficient computation of the posterior variance at essentially the same computational cost as the posterior mean (exactly for Cartesian grids and via rigorous bounds for unstructured grids), thereby enabling scalable uncertainty quantification. Numerical studies on a range of benchmark problems demonstrate that the proposed method achieves accuracy competitive with operator learning methods such as Fourier neural operators and deep operator networks. On the one-dimensional unsteady Burgers’ equation, our method surpasses the accuracy of projection-based reduced-order models. These results establish the proposed framework as an effective tool for data-driven surrogate modeling, particularly when uncertainty estimates are required for downstream tasks.
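The Kronecker algebra the abstract leans on can be sketched as follows (my illustration, not the authors' code): for a product kernel on a Cartesian grid, K = K1 ⊗ K2, the linear solve (K1 ⊗ K2 + σ²I)α = y needs only per-axis eigendecompositions, costing O(n1³ + n2³) instead of O((n1·n2)³):

```python
import numpy as np

rng = np.random.default_rng(0)

n1, n2, s2 = 6, 5, 0.1
A = rng.standard_normal((n1, n1)); K1 = A @ A.T + n1 * np.eye(n1)  # SPD
B = rng.standard_normal((n2, n2)); K2 = B @ B.T + n2 * np.eye(n2)  # SPD
y = rng.standard_normal(n1 * n2)

l1, Q1 = np.linalg.eigh(K1)
l2, Q2 = np.linalg.eigh(K2)
Y = y.reshape(n1, n2)                        # row-major vec convention
# (Q1 (x) Q2)^T y, elementwise division by eigenvalues, then map back:
S = (Q1.T @ Y @ Q2) / (np.outer(l1, l2) + s2)
alpha = (Q1 @ S @ Q2.T).ravel()              # = (K1 (x) K2 + s2*I)^{-1} y
```

This uses the identity (A ⊗ B) vec(X) = vec(A X Bᵀ) for row-major flattening, so the full n1·n2 × n1·n2 matrix is never formed.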
[LG-87] CoPeP: Benchmarking Continual Pretraining for Protein Language Models
链接: https://arxiv.org/abs/2603.00253
作者: Darshan Patil,Pranshu Malviya,Mathieu Reymond,Quentin Fournier,Sarath Chandar
类目: Machine Learning (cs.LG)
*备注: 29 pages, 25 figures
Abstract:Protein language models (pLMs) have recently gained significant attention for their ability to uncover relationships between sequence, structure, and function from evolutionary statistics, thereby accelerating therapeutic drug discovery. These models learn from large protein databases that are continuously updated by the biology community and whose dynamic nature motivates the application of continual learning, not only to keep up with the ever-growing data, but also as an opportunity to take advantage of the temporal meta-information that is created during this process. As a result, we introduce the Continual Pretraining of Protein Language Models (CoPeP) benchmark, a novel benchmark for evaluating continual learning approaches on pLMs. Specifically, we curate a sequence of protein datasets derived from the UniProt Knowledgebase spanning a decade and define metrics to assess pLM performance across 31 protein understanding tasks. We evaluate several methods from the continual learning literature, including replay, unlearning, and plasticity-based methods, some of which have never been applied to models and data of this scale. Our findings reveal that incorporating temporal meta-information improves perplexity by up to 7% even when compared to training on data from all tasks jointly. Moreover, even at scale, several continual learning methods outperform naive continual pretraining. The CoPeP benchmark offers an exciting opportunity to study these methods at scale in an impactful real-world application.
[LG-88] A medical coding language model trained on clinical narratives from a population-wide cohort of 1.8 million patients
链接: https://arxiv.org/abs/2603.00221
作者: Joakim Edin,Sedrah Butt Balaganeshan,Annike Kjølby Kristensen,Lars Maaløe,Ioannis Louloudis,Søren Brunak
类目: Machine Learning (cs.LG)
*备注:
Abstract:Medical coding translates clinical documentation into standardized codes for billing, research, and public health, but manual coding is time-consuming and error-prone. Existing automation efforts rely on small datasets that poorly represent real-world patient heterogeneity. We trained a language model on 5.8 million electronic health records from 1.8 million patients across nearly all specialties in Eastern Denmark (2006–2016) to predict ICD-10 codes from clinical notes, medications, and laboratory results. Evaluated on 270,000 held-out patients, the model achieved a micro F1 of 71.8% and a top-10 recall of 95.5%. Performance varied by specialty (F1: 53–91%), with higher scores in specialties with well-defined diagnostic criteria. Codes appearing predominantly as secondary diagnoses had markedly lower F1 scores. For three such codes (suicide-related behaviors, weight disorders, and hypertension), the model identified thousands of uncoded cases, of which 76-86% were confirmed valid upon manual review, suggesting systematic under-coding rather than model error. These findings suggest under-coding of secondary diagnoses in Eastern Denmark during this period, with potential implications for epidemiological research, public health surveillance, and understanding of multimorbidity. Similar time constraints and reimbursement structures in other healthcare systems suggest this may not be isolated to this dataset. The model can automate coding for approximately 50% of cases and provide accurate suggestions for most others, and may offer a practical solution to help capture missed secondary conditions.
[LG-89] Diagnostics for Individual-Level Prediction Instability in Machine Learning for Healthcare
链接: https://arxiv.org/abs/2603.00192
作者: Elizabeth W. Miller,Jeffrey D. Blume
类目: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注:
Abstract:In healthcare, predictive models increasingly inform patient-level decisions, yet little attention is paid to the variability in individual risk estimates and its impact on treatment decisions. For overparameterized models, now standard in machine learning, a substantial source of variability often goes undetected. Even when the data and model architecture are held fixed, randomness introduced by optimization and initialization can lead to materially different risk estimates for the same patient. This problem is largely obscured by standard evaluation practices, which rely on aggregate performance metrics (e.g., log-loss, accuracy) that are agnostic to individual-level stability. As a result, models with indistinguishable aggregate performance can nonetheless exhibit substantial procedural arbitrariness, which can undermine clinical trust. We propose an evaluation framework that quantifies individual-level prediction instability by using two complementary diagnostics: empirical prediction interval width (ePIW), which captures variability in continuous risk estimates, and empirical decision flip rate (eDFR), which measures instability in threshold-based clinical decisions. We apply these diagnostics to simulated data and the GUSTO-I clinical dataset. Across observed settings, we find that for flexible machine-learning models, randomness arising solely from optimization and initialization can induce individual-level variability comparable to that produced by resampling the entire training dataset. Neural networks exhibit substantially greater instability in individual risk predictions compared to logistic regression models. Risk estimate instability near clinically relevant decision thresholds can alter treatment recommendations. These findings suggest that stability diagnostics should be incorporated into routine model validation to assess clinical reliability.
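Illustrative versions of the two diagnostics can be sketched as below; the names ePIW and eDFR come from the abstract, but the exact formulas here (central 95% interval width, pairwise flip probability) are my assumptions, not necessarily the paper's definitions:

```python
import numpy as np

rng = np.random.default_rng(0)

# K refits of the same model, differing only in random seed, give K risk
# estimates per patient (simulated here as noisy copies of a base risk).
K, n, threshold = 20, 500, 0.5
base = rng.uniform(0.1, 0.9, size=n)
preds = np.clip(base + 0.05 * rng.standard_normal((K, n)), 0, 1)

# ePIW: width of the central 95% interval of the K estimates, per patient.
lo, hi = np.percentile(preds, [2.5, 97.5], axis=0)
epiw = hi - lo

# eDFR: probability that two random seeds disagree on the treat/no-treat call.
decisions = preds >= threshold                 # (K, n) boolean
p_treat = decisions.mean(axis=0)
edfr = 2 * p_treat * (1 - p_treat)             # pairwise flip probability
```

Patients whose base risk sits near the threshold get eDFR values approaching the 0.5 maximum, which is exactly the near-threshold instability the abstract warns about.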
[LG-90] Wideband Power Amplifier Behavioral Modeling Using an Amplitude Conditioned LSTM
链接: https://arxiv.org/abs/2603.00101
作者: Abdelrahman Abdelsalam,You Fei
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 7 Pages, 6 Figures
Abstract:Wideband power amplifiers exhibit complex nonlinear and memory effects that challenge traditional behavioral modeling approaches. This paper proposes a novel amplitude conditioned long short-term memory (AC-LSTM) network that introduces explicit amplitude-dependent gating to enhance the modeling of wideband PA dynamics. The architecture incorporates a Feature-wise Linear Modulation (FiLM) layer that conditions the LSTM’s forget gate on the instantaneous input amplitude, providing a physics-aware inductive bias for capturing amplitude-dependent memory effects. Experimental validation using a 100 MHz 5G NR signal and a GaN PA demonstrates that the proposed AC-LSTM achieves a normalized mean square error (NMSE) of -41.25 dB, representing a 1.15 dB improvement over standard LSTM and a 7.45 dB improvement over augmented real-valued time-delay neural network (ARVTDNN) baselines. The model also closely matches the measured PA’s spectral characteristics with an adjacent channel power ratio (ACPR) of -28.58 dB. These results show the effectiveness of amplitude conditioning for improving both time-domain accuracy and spectral fidelity in wideband PA behavioral modeling.
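A FiLM-conditioned forget gate in the spirit of the abstract can be sketched as follows; this is my reading of the described mechanism (amplitude → scale γ and shift β applied to the gate pre-activation), not the authors' code, and the dimensions and weight scales are toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_in, d_h = 4, 8
Wf = rng.standard_normal((d_h, d_in)) * 0.1   # input-to-gate weights
Uf = rng.standard_normal((d_h, d_h)) * 0.1    # recurrent gate weights
Wg = rng.standard_normal((d_h, 1)) * 0.1      # FiLM: amplitude -> gamma
Wb = rng.standard_normal((d_h, 1)) * 0.1      # FiLM: amplitude -> beta

x_t = rng.standard_normal(d_in)               # baseband input features
h_prev = np.zeros(d_h)
amp = np.array([np.linalg.norm(x_t)])         # instantaneous input amplitude

gamma = 1.0 + Wg @ amp                        # FiLM scale (near identity)
beta = Wb @ amp                               # FiLM shift
f_t = sigmoid(gamma * (Wf @ x_t + Uf @ h_prev) + beta)  # modulated forget gate
```

Making γ and β functions of amplitude lets the effective memory depth vary with drive level, which is the physics-aware bias the paper attributes to amplitude-dependent PA memory effects.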
[LG-91] A Representation-Consistent Gated Recurrent Framework for Robust Medical Time-Series Classification
链接: https://arxiv.org/abs/2603.00067
作者: Maitri Krishna Sai
类目: Machine Learning (cs.LG)
*备注: 7 pages, 1 figure. Preprint
Abstract:Medical time-series data are characterized by irregular sampling, high noise levels, missing values, and strong inter-feature dependencies. Recurrent neural networks (RNNs), particularly gated architectures such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), are widely used for modeling such data due to their ability to capture temporal dependencies. However, standard gated recurrent models do not explicitly constrain the evolution of latent representations over time, leading to representation drift and instability under noisy or incomplete inputs. In this work, we propose a representation-consistent gated recurrent framework (RC-GRF) that introduces a principled regularization strategy to enforce temporal consistency in hidden-state representations. The proposed framework is model-agnostic and can be integrated into existing gated recurrent architectures without modifying their internal gating mechanisms. We provide a theoretical analysis demonstrating how the consistency constraint bounds hidden-state divergence and improves stability. Extensive experiments on medical time-series classification benchmarks show that the proposed approach improves robustness, reduces variance, and enhances generalization performance, particularly in noisy and low-sample settings.
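A minimal sketch of a temporal-consistency regulariser in the spirit of the abstract (the exact RC-GRF loss is not given here, so the squared-difference form and weight are my assumptions): penalise large jumps between consecutive hidden states, and add the penalty to the task loss with a weight λ:

```python
import numpy as np

rng = np.random.default_rng(0)

T, d = 16, 32
H = rng.standard_normal((T, d))               # hidden states h_1..h_T

def consistency_penalty(H):
    diffs = H[1:] - H[:-1]                    # h_t - h_{t-1}
    return (diffs ** 2).sum(axis=1).mean()    # mean squared step size

lam = 0.1                                     # regularisation weight (toy value)
penalty = consistency_penalty(H)              # erratic states: large penalty
# A slowly drifting trajectory (constant small steps) is penalised far less.
smooth_penalty = consistency_penalty(np.cumsum(np.ones((T, d)) * 0.01, axis=0))
```

In training, `task_loss + lam * penalty` would be minimised; the gating mechanism itself is untouched, matching the claim that the framework is model-agnostic.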
[LG-92] The Hidden Costs of Domain Fine-Tuning: PII-Bearing Data Degrades Safety and Increases Leakage
链接: https://arxiv.org/abs/2603.00061
作者: Jayesh Choudhari,Piyush Kumar Singh
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Domain fine-tuning is a common path to deploy small instruction-tuned language models as customer-support assistants, yet its effects on safety-aligned behavior and privacy are not well understood. In real deployments, such assistants receive a mixture of benign in-domain requests and out-of-domain user queries that are emotional, philosophical, or adversarial. Even when the target domain is benign, specialization may shift model behavior in ways that weaken refusal, increase harmful compliance, and induce privacy leakage. We present a controlled empirical study of how training data composition (presence vs. removal of PII) and fine-tuning configuration (role-swapping (RS)) shape safety and out-of-domain behavior in open-source chat models up to 8B parameters. We fine-tune each model on 5,000 real booking-support message pairs under three settings: NoPII-NoRS, PII-NoRS, and PII-RS (role-swapped). We evaluate safety using SORRY-Bench adversarial prompts and assess out-of-domain behavior using a suite of philosophical questions. Across models, domain fine-tuning causes a large distributional shift from high-quality refusals toward harmful compliance on SORRY-Bench, with the most severe degradation when PII is present in the fine-tuning data. For example, macro-averaged strong refusal drops from 42.6% in base models to single digits after fine-tuning, while PII-bearing runs additionally exhibit double-digit rates of harmful responses with PII leakage. On philosophical queries, fine-tuned models frequently exhibit domain anchoring and, when trained with PII, leak sensitive identifiers in irrelevant contexts. Role-swapping partially mitigates PII leakage but does not reliably restore refusal behavior.
[LG-93] Dual-space posterior sampling for Bayesian inference in constrained inverse problems
链接: https://arxiv.org/abs/2603.00393
作者: Ali Siahkoohi,Kamal Aghazade,Ali Gholami
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Inverse problems constrained by partial differential equations are often ill-conditioned due to noisy and incomplete data or inherent non-uniqueness. A prominent example is full waveform inversion, which estimates Earth’s subsurface properties by fitting seismic measurements subject to the wave equation, where ill-conditioning is inherent to noisy, band-limited, finite-aperture data and shadow zones. Casting the inverse problem into a Bayesian framework allows for a more comprehensive description of its solution, where instead of a single estimate, the posterior distribution characterizes non-uniqueness and can be sampled to quantify uncertainty. However, no clear procedure exists for translating hard physical constraints, such as the wave equation, into prior distributions amenable to existing sampling techniques. To address this, we perform posterior sampling in the dual space using an augmented Lagrangian formulation, which translates hard constraints into penalties amenable to sampling algorithms while ensuring their exact satisfaction. We achieve this by seamlessly integrating the alternating direction method of multipliers (ADMM) with Stein variational gradient descent (SVGD) – a particle-based sampler – where the constraint is relaxed at each iteration and multiplier updates progressively enforce satisfaction. This enables constrained posterior sampling while inheriting the favorable conditioning properties of dual-space solvers, where partial constraint relaxation allows productive updates even when the current model is far from the true solution. We validate the method on a stylized Rosenbrock conditional inference problem and on frequency-domain full waveform inversion for a Gaussian anomaly model and the Marmousi II benchmark, demonstrating well-calibrated uncertainty estimates and posterior contraction with increasing data coverage.
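A minimal SVGD demo illustrates the particle-based half of the ADMM+SVGD scheme; the 1-D standard-normal target, kernel bandwidth, and step sizes here are my toy choices, not the paper's setup. Each particle follows the Stein variational update, a kernel-weighted drift toward high density plus a repulsive term that keeps particles spread out:

```python
import numpy as np

def svgd_step(x, grad_logp, h=1.0, eps=0.05):
    diff = x[:, None] - x[None, :]                # diff[i, j] = x_i - x_j
    k = np.exp(-diff ** 2 / (2 * h ** 2))         # RBF kernel k(x_j, x_i)
    drift = (k * grad_logp(x)[None, :]).mean(axis=1)
    repulsion = (diff / h ** 2 * k).mean(axis=1)  # grad_{x_j} k: spreads particles
    return x + eps * (drift + repulsion)

grad_logp = lambda x: -x                          # target: standard normal
x = np.linspace(2.0, 4.0, 30)                     # particles start far from 0
for _ in range(500):
    x = svgd_step(x, grad_logp)
```

The repulsion term is what distinguishes SVGD from plain gradient ascent on log p: without it, all particles would collapse onto the mode instead of covering the posterior.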
[LG-94] A Monte Carlo estimator of flow fields for sampling and noise problems
链接: https://arxiv.org/abs/2603.00252
作者: Michael S. Albergo,Gurtej Kanwar
类目: High Energy Physics - Lattice (hep-lat); Machine Learning (cs.LG)
*备注: 10 pages, 5 figures. Proceedings of the 42nd International Symposium on Lattice Field Theory (LATTICE2025)
Abstract:Learned field transformations may help address ubiquitous critical slowing down and signal-to-noise problems in lattice field theory. In the context of an annealed sequence of distributions, field transformations are defined by integrating flow fields that exactly solve a local transport problem. These proceedings discuss a new Monte Carlo approach to evaluating these flow fields, which can then be used directly in such contexts or as a means of generating unbiased training data for machine learning approaches. By defining the Monte Carlo estimator using coupled Langevin noise, the statistical noise in the required integrals is significantly mitigated. Demonstrations of the method include a U(1) transport problem and an SU(N) glueball correlator.
[LG-95] Tipping the Balance: Impact of Class Imbalance Correction on the Performance of Clinical Risk Prediction Models
链接: https://arxiv.org/abs/2603.00208
作者: Amalie Koch Andersen,Hadi Mehdizavareh,Arijit Khan,Tobias Becher,Simone Britsch,Markward Britsch,Morten Bøttcher,Simon Winther,Palle Duun Rohde,Morten Hasselstrøm Jensen,Simon Lebech Cichosz
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:
Abstract:Objective: ML-based clinical risk prediction models are increasingly used to support decision-making in healthcare. While class-imbalance correction techniques are commonly applied to improve model performance in settings with rare outcomes, their impact on probabilistic calibration remains insufficiently understood. This study evaluated the effect of widely used resampling strategies on both discrimination and calibration across real-world clinical prediction tasks. Methods: Ten clinical datasets spanning diverse medical domains and including 605,842 patients were analyzed. Multiple machine-learning model families, including linear models and several non-linear approaches, were evaluated. Models were trained on the original data and under three commonly used 1:1 class-imbalance correction strategies (SMOTE, RUS, ROS). Performance was assessed on held-out data using discrimination and calibration metrics. Results: Across all datasets and model families, resampling had no positive impact on predictive performance. Changes in the Receiver Operating Characteristic Area Under Curve (ROC-AUC) relative to models trained on the original data were small and inconsistent (ROS: -0.002, p > 0.05; RUS: -0.004, p > 0.05; SMOTE: -0.01, p > 0.05), with no resampling strategy demonstrating a systematic improvement. In contrast, resampling in general degraded the calibration performance. Models trained using imbalance correction exhibited higher Brier scores (0.029 to 0.080, p < 0.05), reflecting poorer probabilistic accuracy, and marked deviations in calibration intercept and slope, indicating systematic distortions of predicted risk despite preserved rank-based performance. Conclusion: In a diverse set of real-world clinical prediction tasks, commonly used class-imbalance correction techniques did not provide generalizable improvements in discrimination and were associated with degraded calibration.
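The calibration harm described above can be illustrated on synthetic data (numbers and model are mine, not from the study): a model fit on 1:1 resampled data behaves roughly as if the outcome prevalence were 50%, which under the standard case-control odds-shift argument inflates predicted risks and the Brier score:

```python
import numpy as np

rng = np.random.default_rng(0)

# Rare outcome: true per-patient risks with mean prevalence ~10%.
n, prev = 20_000, 0.10
p_true = np.clip(rng.beta(2, 18, size=n), 1e-4, 1 - 1e-4)
y = (rng.uniform(size=n) < p_true).astype(float)

def brier(p, y):
    return np.mean((p - y) ** 2)

# A calibrated model predicts p_true; a model trained on 1:1 resampled data
# acts as if prevalence were 50%, i.e. odds scaled by (1 - prev) / prev.
shift = (1 - prev) / prev
odds = p_true / (1 - p_true) * shift
p_resampled = odds / (1 + odds)

brier_calibrated = brier(p_true, y)
brier_resampled = brier(p_resampled, y)           # much worse
citl = p_resampled.mean() - y.mean()              # calibration-in-the-large error
```

Rank-based metrics like ROC-AUC are untouched by this monotone odds shift, which is exactly why the study finds degraded calibration alongside unchanged discrimination.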
[LG-96] The Partition Principle Revisited: Non-Equal Volume Designs Achieve Minimal Expected Star Discrepancy
链接: https://arxiv.org/abs/2603.00202
作者: Xiaoda Xu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注: This article serves as a clarification of the previous claims that purported to solve the open problem but encountered difficulties. My latest findings have reached a superior level
Abstract:We study the expected star discrepancy under a newly designed class of non-equal volume partitions. The main contributions are twofold. First, we establish a strong partition principle for the star discrepancy, showing that our newly designed non-equal volume partitions yield stratified sampling point sets with lower expected star discrepancy than classical jittered sampling. Specifically, we prove that \mathbb{E}(D^*_N(Z)) < \mathbb{E}(D^*_N(Y)), where Y and Z represent jittered sampling and our non-equal volume partition sampling, respectively. Second, we derive explicit upper bounds for the expected star discrepancy under our non-equal volume partition models, which improve upon existing bounds for jittered sampling. Our results provide a theoretical foundation for using non-equal volume partitions in high-dimensional numerical integration.
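For reference, the baseline Y in the abstract is classical jittered sampling, which can be sketched in two dimensions as follows (the paper's improved non-equal-volume partitions are not reproduced here): split the unit square into m × m equal cells and draw one uniform point per cell:

```python
import numpy as np

rng = np.random.default_rng(0)

m = 8
# Integer (row, col) indices of all m*m cells, shape (m*m, 2).
cells = np.stack(np.meshgrid(np.arange(m), np.arange(m), indexing="ij"),
                 axis=-1).reshape(-1, 2)
# One uniform point inside each cell, mapped back to [0, 1)^2.
points = (cells + rng.uniform(size=(m * m, 2))) / m
```

Compared with i.i.d. uniform points, stratification guarantees every cell is hit exactly once, which is what lowers the expected star discrepancy; the paper's contribution is showing that deliberately unequal cell volumes lower it further.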
[LG-97] RSS map-assisted MIMO channel estimation in the upper mid-band under pilot constraints
链接: https://arxiv.org/abs/2603.00112
作者: Alireza Javid,Nuria González-Prelcic
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Submitted to TMLCN
Abstract:Accurate wireless channel estimation is critical for next-generation wireless systems, enabling precise precoding for effective user separation, reduced interference across cells, and high-resolution sensing, among other benefits. Traditional model-based channel estimation methods suffer, however, from performance degradation in complex environments with a limited number of pilots, while purely data-driven approaches lack physical interpretability, require extensive data collection, and are usually site-specific. This paper presents a novel physics-informed neural network (PINN) framework that synergistically combines model-based channel estimation with a deep network to exploit prior information about environmental propagation characteristics and achieve superior performance under pilot-constrained scenarios. The proposed approach employs an enhanced U-Net architecture with transformer modules and cross-attention mechanisms to fuse initial channel estimates with RSS maps to provide refined channel estimates. Comprehensive evaluation using realistic ray-tracing data from urban environments demonstrates significant performance improvements, achieving over 5 dB gain in NMSE compared to state-of-the-art methods, with particularly strong performance in pilot-limited scenarios and robustness across different frequencies and environments with only minimal fine-tuning. We further extend the decoder for multi-step temporal prediction, enabling accurate forecasting of several future channel snapshots from a single estimate, useful for proactive beamforming and scheduling in mobile scenarios. The proposed framework maintains practical computational complexity, making it viable for massive MIMO systems in upper-mid band frequencies. Unlike black-box neural approaches, the physics-informed design provides a more interpretable channel estimation method.
[LG-98] CASCADE: Cross-scale Advective Super-resolution with Climate Assimilation and Downscaling Evolution
链接: https://arxiv.org/abs/2603.00109
作者: Alexander Kovalenko
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:
Abstract:Super-resolution of geophysical fields presents unique challenges beyond natural image enhancement: fine-scale structures must respect physical dynamics, conserve mass and energy, and evolve coherently in time. These constraints are especially critical for extreme events, where rare, localized, high-intensity features drive impacts and where temporally inconsistent “hallucinated” detail can misrepresent hazards. We introduce CASCADE (Cross-scale Advective Super-resolution with Climate Assimilation and Downscaling Evolution), a framework that reframes spatiotemporal super-resolution as an explicit transport process across scales. Rather than hallucinating high-frequency content per pixel, CASCADE reconstructs fine structure by iteratively advecting coarse information along learned, flow-conditioned velocity fields through semi-Lagrangian warping. The architecture decomposes motion into resolved (large-scale) and subgrid (unresolved) components, mirroring the closure problem in numerical weather prediction, and enforces low-resolution consistency through an assimilation-style innovation step. Evaluated on SEVIR radar data for 4x super-resolution of severe convective storms, CASCADE outperforms strong baselines across both continuous metrics (PSNR, SSIM, MAE) and threshold-based skill scores (CSI, HSS, POD) while providing interpretable diagnostics through visualizable velocity and correction fields. By encoding advection as the fundamental operator rather than learning it implicitly, CASCADE produces temporally coherent, physically consistent, and mass-conserving reconstructions well suited to advection-dominated extremes in atmospheric and oceanic applications.
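The semi-Lagrangian warping step at the heart of CASCADE can be sketched in a generic 2-D form (an illustration only, assuming unit grid spacing and a backward trace; the function name and signature are mine, not the authors'):

```python
import numpy as np

def semi_lagrangian_warp(field, vx, vy, dt=1.0):
    """Backward semi-Lagrangian advection of a 2-D scalar field:
    for every grid point, trace back along the velocity field and
    sample the field at the departure point by bilinear interpolation."""
    h, w = field.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Departure points of the backward trace, clamped to the grid
    xd = np.clip(xs - dt * vx, 0, w - 1)
    yd = np.clip(ys - dt * vy, 0, h - 1)
    x0 = np.floor(xd).astype(int); y0 = np.floor(yd).astype(int)
    x1 = np.minimum(x0 + 1, w - 1); y1 = np.minimum(y0 + 1, h - 1)
    fx, fy = xd - x0, yd - y0
    # Bilinear interpolation at the departure points
    top = field[y0, x0] * (1 - fx) + field[y0, x1] * fx
    bot = field[y1, x0] * (1 - fx) + field[y1, x1] * fx
    return top * (1 - fy) + bot * fy
```

CASCADE applies such warps iteratively with learned, flow-conditioned velocity fields split into resolved and subgrid components; this sketch shows only the transport operator itself.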
[LG-99] Alpha-RF: Automated RF-Filter-Circuit Design with Neural Simulator and Reinforcement Learning
链接: https://arxiv.org/abs/2603.00104
作者: Nhat Tran,Chenjie Hao,Alexander Stameroff,Anh-Vu Pham,Yubei Chen
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:Accurate, high-performance radio-frequency (RF) filter circuits are ubiquitous in radio-frequency communication and sensing systems for accepting and rejecting signals at desired frequencies. The conventional RF filter design process involves manual calculation of design parameters, followed by intuition-guided iterations to achieve the desired response for a set of filter specifications. This process is time-consuming due to time- and resource-intensive electromagnetic simulations using full-wave numerical PDE solvers, and it is highly sensitive to domain expertise, requiring many years of professional training. To address these bottlenecks, we propose an automatic RF filter circuit design tool using a neural simulator and reinforcement learning. First, we train a neural simulator to replace the PDE electromagnetic simulator, reducing simulation time from an average of 4 minutes to under 100 milliseconds while maintaining high precision. This dramatic acceleration enables us to leverage deep reinforcement learning and train an amortized inference policy that performs automatic design in the imagined space of the neural simulator. The resulting automatic circuit-design agent achieves superhuman design results and reduces the average design cycle from days to a few seconds. Even more surprisingly, we demonstrate that the neural simulator can generalize to design spaces far from the training dataset, suggesting it has learned the underlying physics: Maxwell's equations. We also demonstrate that the reinforcement learning agent has discovered many expert-like design intuitions. This work marks a step toward using neural simulators and reinforcement learning in RF circuit design, and the proposed method is generally applicable to many other closely related design problems and domains.
[LG-100] Using Artificial Neural Networks to Predict Claim Duration in a Work Injury Compensation Environment
链接: https://arxiv.org/abs/2603.00100
作者: Anthony Almudevar
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: 8 pages; 9 figures; 6 tables
Abstract:Currently, work injury compensation boards in Canada track injury information using a standard system of codes (under the National Work Injury Statistics Program (NWISP)). These codes capture the medical nature and original cause of the injury in some detail, hence they potentially contain information which may be used to predict the severity of an injury and the resulting time loss from work. Claim duration measurements and forecasts are central to the operation of a work injury compensation program. However, due to the complexity of the codes, traditional statistical modelling techniques are of limited value. We describe an artificial neural network implementation of Cox proportional hazards regression due to Ripley (1998 thesis), which is used as the basis for a model for the prediction of claim duration within a work injury compensation environment. The model accepts as input the injury codes, as well as basic demographic and workplace information. The output consists of a claim duration prediction in the form of a distribution. The input represents information available when a claim is first filed, and may therefore be used in a claims management setting. We also describe the model selection procedure, as well as a procedure for accepting inputs with missing covariates.
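The Cox partial likelihood underlying such a neural implementation can be sketched for the case of distinct event times (a generic textbook version, not the paper's code; the network would supply the risk scores, and the function name is mine):

```python
import numpy as np

def cox_partial_loglik(risk, time, event):
    """Cox partial log-likelihood for risk scores r_i (e.g. outputs of a
    neural network). Each observed event contributes
        r_i - log(sum of exp(r_j) over the risk set {j : t_j >= t_i}).
    Assumes distinct event times (no tie handling)."""
    order = np.argsort(-time)                     # decreasing time
    risk, event = risk[order], event[order]
    log_risk_set = np.logaddexp.accumulate(risk)  # running log-sum-exp
    return float(np.sum((risk - log_risk_set)[event.astype(bool)]))
```

Training the network then amounts to maximizing this quantity over the network weights, with the predicted claim-duration distribution recovered from the fitted hazards.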
[LG-101] Profiling vs. Case-specific Evidence: A Probabilistic Analysis
链接: https://arxiv.org/abs/2603.00098
作者: Marcello Di Bello,Nicolò Cangiotti,Michele Loi
类目: Other Statistics (stat.OT); Computers and Society (cs.CY); Machine Learning (cs.LG); General Economics (econ.GN); Probability (math.PR)
*备注: 16 pages
Abstract:The use of profiling evidence in criminal trials is a longstanding controversy in legal epistemology and evidence law theory. Many scholars, even when they oppose its use at trial, still assume that profiling evidence can be probative of guilt. We reject that assumption. Profiling evidence may support a generic hypothesis, but is not evidence that the defendant is guilty of the specific crime of which they are accused. We contrast profiling evidence with case-specific evidence, which speaks more directly to the facts of the case. Our critique departs from others by grounding the argument in a probabilistic analysis of evidentiary value. We also explore the implications of our account for debates about stereotyping.
[LG-102] A comparative study of transformer models and recurrent neural networks for path-dependent composite materials
链接: https://arxiv.org/abs/2603.00092
作者: Petter Uvdal,Mohsen Mirkhalaf
类目: Computational Physics (physics.comp-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:
Abstract:Accurate modeling of Short Fiber Reinforced Composites (SFRCs) remains computationally expensive for full-field simulations. Data-driven surrogate models using Artificial Neural Networks (ANNs) have been proposed as an efficient alternative to numerical modeling, where Recurrent Neural Networks (RNNs) are increasingly being used for path-dependent multiscale modeling by predicting the homogenized response of a Representative Volume Element (RVE). Transformer models have recently been developed as well; they offer scalability and efficient parallelization, yet have not been systematically compared with RNNs in this field. In this study, we perform a systematic comparison between RNNs and transformer models trained on sequences of homogenized responses of SFRC RVEs. We study the effect of two types of hyperparameters, namely architectural hyperparameters (such as the number of GRU layers, hidden size, number of attention heads, and encoder blocks) and training hyperparameters (such as learning rate and batch size). Both sets of hyperparameters are tuned using Bayesian optimization. We then analyze scaling laws with respect to dataset size and inference accuracy in interpolation and extrapolation regimes. The results show that while transformer models remain competitive in terms of accuracy on large datasets, the RNNs demonstrate better accuracy on small datasets. Under extrapolation the difference is stark: the RNN remains accurate while the transformer model performs poorly. On the other hand, the transformer model is 7 times faster at inference, requiring 0.5 ms per prediction compared to 3.5 ms per prediction for the RNN model.
[LG-103] The minimal width of universal p-adic ReLU neural networks
链接: https://arxiv.org/abs/2603.00064
作者: Sándor Z. Kiss,Ambrus Pál
类目: Number Theory (math.NT); Machine Learning (cs.LG)
*备注: 16 pages, comments welcome!
Abstract:We determine the minimal width of p-adic neural networks with the universal approximation property for continuous \mathbb{Q}_p-valued functions on compact open subsets, with respect to the L_q norms and the C_1 norm, where the activation function is a natural p-adic analogue of the ReLU function.
[LG-104] Bilevel Optimization with Lower-Level Uniform Convexity: Theory and Algorithm
链接: https://arxiv.org/abs/2603.00027
作者: Yuman Wu,Xiaochuan Gong,Jie Hao,Mingrui Liu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:Bilevel optimization is a hierarchical framework where an upper-level optimization problem is constrained by a lower-level problem, commonly used in machine learning applications such as hyperparameter optimization. Existing bilevel optimization methods typically assume strong convexity or Polyak-Łojasiewicz (PL) conditions for the lower-level function to establish non-asymptotic convergence to a solution with small hypergradient. However, these assumptions may not hold in practice, and recent work (Chen et al., 2024) has shown that bilevel optimization is inherently intractable for general convex lower-level functions when the goal is finding small hypergradients. In this paper, we identify a tractable class of bilevel optimization problems that interpolates between lower-level strong convexity and general convexity via lower-level uniform convexity. For uniformly convex lower-level functions with exponent p \geq 2, we establish a novel implicit differentiation theorem characterizing the hyperobjective's smoothness property. Building on this, we design a new stochastic algorithm, termed UniBiO, with provable convergence guarantees, based on an oracle that provides stochastic gradient and Hessian-vector product information for the bilevel problem. Our algorithm achieves an \widetilde{O}(\epsilon^{-5p+6}) oracle complexity bound for finding \epsilon-stationary points. Notably, our complexity bounds match the optimal rates in terms of the \epsilon dependency for strongly convex lower-level functions (p = 2), up to logarithmic factors. Our theoretical findings are validated through experiments on synthetic tasks and data hyper-cleaning, demonstrating the effectiveness of our proposed algorithm.
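For orientation, the classical implicit-differentiation hypergradient that lower-level uniform convexity generalizes can be sketched for the strongly convex case (p = 2). This is the textbook formula with a truncated Neumann series, not the UniBiO algorithm; all callables and parameters here are illustrative assumptions:

```python
import numpy as np

def hypergradient(x, y_star, grad_x_f, grad_y_f, hess_yy_g, jac_xy_g,
                  n_neumann=100, eta=0.4):
    """Implicit-differentiation hypergradient
        dF/dx = grad_x f - (d/dx grad_y g) [hess_yy g]^{-1} grad_y f,
    with the inverse-Hessian-vector product approximated by the
    truncated Neumann series H^{-1} v ~ eta * sum_k (I - eta H)^k v
    (valid when H is positive definite and eta * ||H|| < 2)."""
    H = hess_yy_g(x, y_star)
    v = grad_y_f(x, y_star)
    p, acc = v.copy(), v.copy()
    for _ in range(n_neumann):
        p = p - eta * (H @ p)   # next Neumann term (I - eta H)^k v
        acc = acc + p
    ihvp = eta * acc            # approximate H^{-1} v
    return grad_x_f(x, y_star) - jac_xy_g(x, y_star) @ ihvp
```

When the lower-level function is merely uniformly convex (p > 2), the Hessian can degenerate at the minimizer, which is exactly why the paper needs a new implicit differentiation theorem and algorithm.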
[LG-105] Riemannian Dueling Optimization
链接: https://arxiv.org/abs/2603.00023
作者: Yuxuan Ren,Abhishek Roy,Shiqian Ma
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:Dueling optimization considers optimizing an objective with access to only a comparison oracle of the objective function. It finds important applications in emerging fields such as recommendation systems and robotics. Existing works on dueling optimization mainly focused on unconstrained problems in the Euclidean space. In this work, we study dueling optimization over Riemannian manifolds, which covers important applications that cannot be solved by existing dueling optimization algorithms. In particular, we propose a Riemannian Dueling Normalized Gradient Descent (RDNGD) method and establish its iteration complexity when the objective function is geodesically L-smooth or geodesically (strongly) convex. We also propose a projection-free algorithm, named Riemannian Dueling Frank-Wolfe (RDFW) method, to deal with the situation where projection is prohibited. We establish the iteration and oracle complexities for RDFW. We illustrate the effectiveness of the proposed algorithms through numerical experiments on both synthetic and real applications.
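A Euclidean sketch of the comparison-oracle ("dueling") setting may help fix ideas; it deliberately ignores the Riemannian machinery (retractions, geodesic convexity) that RDNGD actually relies on, and every name here is my own:

```python
import numpy as np

def dueling_descent(compare, x0, n_iters=500, step=0.2, rng=None):
    """Zeroth-order descent from pairwise comparisons only:
    compare(a, b) returns True iff f(a) <= f(b); the objective value
    itself is never observed. Probe a random unit direction, move to
    whichever of x +/- s*u the oracle prefers, keep improving moves."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)
    for t in range(1, n_iters + 1):
        u = rng.standard_normal(x.shape)
        u /= np.linalg.norm(u)          # normalized probe direction
        s = step / np.sqrt(t)           # decaying step size
        cand = x + s * u if compare(x + s * u, x - s * u) else x - s * u
        if compare(cand, x):            # accept only improving moves
            x = cand
    return x
```

On a manifold, the Euclidean update x + s*u would be replaced by a retraction of a tangent-space direction, which is the setting RDNGD and RDFW analyze.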