This post contains the latest paper list fetched from arXiv.org on 2026-03-17, updated automatically and grouped into six broad areas: NLP, CV, ML, AI, IR, and MA.
Note: paper data is pulled from arXiv.org daily, with an automatic update around 12:30 each morning.
Tip: if a given day is not updated on time, either arXiv released no new papers that day or the script hit an error; fixes are usually applied the same day.
Table of Contents
Overview (2026-03-17)
600 papers were updated today, including:
- Computation and Language (cs.CL): 80 papers
- Artificial Intelligence (cs.AI): 181 papers
- Computer Vision and Pattern Recognition (cs.CV): 177 papers
- Machine Learning (cs.LG): 156 papers
- Multiagent Systems (cs.MA): 10 papers
- Information Retrieval (cs.IR): 12 papers
- Human-Computer Interaction (cs.HC): 17 papers
Multiagent Systems
[MA-0] TrinityGuard: A Unified Framework for Safeguarding Multi-Agent Systems
[Quick Read]: This paper addresses safety and privacy risks in LLM-based multi-agent systems (MAS), risks that go beyond those of single agents or standalone LLMs and currently lack a systematic, MAS-specific safeguard. The key contributions of the proposed TrinityGuard framework are threefold: (1) a three-tier fine-grained risk taxonomy identifying 20 risk types across single-agent vulnerabilities, inter-agent communication threats, and system-level emergent hazards; (2) a trinity architecture comprising an abstraction layer adaptable to any MAS structure, an evaluation layer of risk-specific test modules, and runtime monitor agents coordinated by a unified LLM Judge Factory; and (3) a combination of pre-development evaluation and real-time runtime monitoring that produces structured vulnerability reports and triggers alerts, enabling safety governance across the full MAS lifecycle.
Link: https://arxiv.org/abs/2603.15408
Authors: Kai Wang, Biaojie Zeng, Zeming Wei, Chang Jin, Hefeng Zhou, Xiangtian Li, Chao Yang, Jingjing Qu, Xingcheng Xu, Xia Hu
Affiliations: Shanghai AI Laboratory
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract:With the rapid development of LLM-based multi-agent systems (MAS), their significant safety and security concerns have emerged, which introduce novel risks going beyond single agents or LLMs. Despite attempts to address these issues, the existing literature lacks a cohesive safeguarding system specialized for MAS risks. In this work, we introduce TrinityGuard, a comprehensive safety evaluation and monitoring framework for LLM-based MAS, grounded in the OWASP standards. Specifically, TrinityGuard encompasses a three-tier fine-grained risk taxonomy that identifies 20 risk types, covering single-agent vulnerabilities, inter-agent communication threats, and system-level emergent hazards. Designed for scalability across various MAS structures and platforms, TrinityGuard is organized in a trinity manner, involving an MAS abstraction layer that can be adapted to any MAS structures, an evaluation layer containing risk-specific test modules, alongside runtime monitor agents coordinated by a unified LLM Judge Factory. During Evaluation, TrinityGuard executes curated attack probes to generate detailed vulnerability reports for each risk type, where monitor agents analyze structured execution traces and issue real-time alerts, enabling both pre-development evaluation and runtime monitoring. We further formalize these safety metrics and present detailed case studies across various representative MAS examples, showcasing the versatility and reliability of TrinityGuard. Overall, TrinityGuard acts as a comprehensive framework for evaluating and monitoring various risks in MAS, paving the way for further research into their safety and security.
[MA-1] PMAx: An Agentic Framework for AI-Driven Process Mining
[Quick Read]: This paper targets two problems: the high barrier to traditional process-mining analysis, which requires specialized query languages and data-science tooling and excludes non-technical users, and the privacy risk of uploading large, sensitive event logs to external large language models (LLMs). The key is PMAx, an autonomous agentic framework with a privacy-preserving multi-agent architecture: an Engineer agent locally parses event-log metadata and generates scripts that run established process-mining algorithms, computing exact metrics and producing process models, summary tables, and visualizations; an Analyst agent then interprets these structured artifacts into readable, comprehensive reports. Separating computation from interpretation preserves mathematical accuracy and data privacy while letting business users obtain reliable process insights through natural-language questions.
Link: https://arxiv.org/abs/2603.15351
Authors: Anton Antonov, Humam Kourani, Alessandro Berti, Gyunam Park, Wil M. P. van der Aalst
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Submitted to EMMSAD 2026 (tool demonstration track), under review
Abstract:Process mining provides powerful insights into organizational workflows, but extracting these insights typically requires expertise in specialized query languages and data science tools. Large Language Models (LLMs) offer the potential to democratize process mining by enabling business users to interact with process data through natural language. However, using LLMs as direct analytical engines over raw event logs introduces fundamental challenges: LLMs struggle with deterministic reasoning and may hallucinate metrics, while sending large, sensitive logs to external AI services raises serious data-privacy concerns. To address these limitations, we present PMAx, an autonomous agentic framework that functions as a virtual process analyst. Rather than relying on LLMs to generate process models or compute analytical results, PMAx employs a privacy-preserving multi-agent architecture. An Engineer agent analyzes event-log metadata and autonomously generates local scripts to run established process mining algorithms, compute exact metrics, and produce artifacts such as process models, summary tables, and visualizations. An Analyst agent then interprets these insights and artifacts to compile comprehensive reports. By separating computation from interpretation and executing analysis locally, PMAx ensures mathematical accuracy and data privacy while enabling non-technical users to transform high-level business questions into reliable process insights.
[MA-2] Intelligent Co-Design: An Interactive LLM Framework for Interior Spatial Design via Multi-Modal Agents
[Quick Read]: This paper tackles the communication gap in architectural interior design between clients, who lack design knowledge, and designers, a mismatch that often causes delays and cost overruns. Rule-based generative layout tools are limited by hard-coded spatial constraints and offer little user participation, while data-driven models need extensive training data and lack flexibility. The key is an LLM-based multimodal, multi-agent framework in which specialized agents (Reference, Spatial, Interactive, Grader) collaboratively turn natural-language descriptions and images into optimized 3D designs. Retrieval-Augmented Generation (RAG) sharply reduces dependence on task-specific training data, while real-time interaction supports iterative refinement of spatial layouts, improving alignment with design intent, aesthetic coherence, functionality, and circulation, and making the design process more inclusive and efficient for non-expert participants.
Link: https://arxiv.org/abs/2603.15341
Authors: Ren Jian Lim, Rushi Dai
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: 25 pages, 20 figures; accepted for publication in the Proceedings of ACADIA 2025
Abstract:In architectural interior design, miscommunication frequently arises as clients lack design knowledge, while designers struggle to explain complex spatial relationships, leading to delayed timelines and financial losses. Recent advancements in generative layout tools narrow the gap by automating 3D visualizations. However, prevailing methodologies exhibit limitations: rule-based systems implement hard-coded spatial constraints that restrict participatory engagement, while data-driven models rely on extensive training datasets. Recent large language models (LLMs) bridge this gap by enabling intuitive reasoning about spatial relationships through natural language. This research presents an LLM-based, multimodal, multi-agent framework that dynamically converts natural language descriptions and imagery into 3D designs. Specialized agents (Reference, Spatial, Interactive, Grader), operating via prompt guidelines, collaboratively address core challenges: the agent system enables real-time user interaction for iterative spatial refinement, while Retrieval-Augmented Generation (RAG) reduces data dependency without requiring task-specific model training. This framework accurately interprets spatial intent and generates optimized 3D indoor design, improving productivity, and encouraging nondesigner participation. Evaluations across diverse floor plans and user questionnaires demonstrate effectiveness. An independent LLM evaluator consistently rated participatory layouts higher in user intent alignment, aesthetic coherence, functionality, and circulation. Questionnaire results indicated 77% satisfaction and a clear preference over traditional design software. These findings suggest the framework enhances user-centric communication and fosters more inclusive, effective, and resilient design processes. Project page: this https URL
[MA-3] SAGE: Multi-Agent Self-Evolution for LLM Reasoning
[Quick Read]: This paper addresses the instability of large language models (LLMs) on long-horizon multi-step reasoning caused by reliance on large human-labeled datasets and the absence of explicit planning and quality control. The key is SAGE (Self-evolving Agents for Generalized reasoning Evolution), a closed-loop framework in which four agents sharing one LLM backbone, the Challenger, Planner, Solver, and Critic, co-evolve from only a small seed set: the Challenger keeps generating harder tasks, the Planner converts each task into a structured multi-step plan, the Solver follows the plan to produce an answer judged by external verifiers, and the Critic scores and filters both generated questions and plans to prevent curriculum drift and maintain training-signal quality, yielding a stable and scalable self-training mechanism.
Link: https://arxiv.org/abs/2603.15255
Authors: Yulin Peng, Xinxin Zhu, Chenxing Wei, Nianbo Zeng, Leilei Wang, Ying Tiffany He, F. Richard Yu
Affiliations: Shenzhen University; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); Carleton University
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:Reinforcement learning with verifiable rewards improves reasoning in large language models (LLMs), but many methods still rely on large human-labeled datasets. While self-play reduces this dependency, it often lacks explicit planning and strong quality control, limiting stability in long-horizon multi-step reasoning. We present SAGE (Self-evolving Agents for Generalized reasoning Evolution), a closed-loop framework where four agents: Challenger, Planner, Solver, and Critic, co-evolve from a shared LLM backbone using only a small seed set. The Challenger continuously generates increasingly difficult tasks; the Planner converts each task into a structured multi-step plan; and the Solver follows the plan to produce an answer, whose correctness is determined by external verifiers. The Critic scores and filters both generated questions and plans to prevent curriculum drift and maintain training signal quality, enabling stable self-training. Across mathematics and code-generation benchmarks, SAGE delivers consistent gains across model scales, improving the Qwen-2.5-7B model by 8.9% on LiveCodeBench and 10.7% on OlympiadBench.
[MA-4] Token Coherence: Adapting MESI Cache Protocols to Minimize Synchronization Overhead in Multi-Agent LLM Systems
[Quick Read]: This paper addresses the synchronization-cost explosion in multi-agent LLM systems under full-state broadcast, where cost grows as the triple product O(n × S × |D|) of agent count n, step count S, and artifact size |D|, a regime the author calls broadcast-induced triply-multiplicative overhead. The key idea is to borrow the MESI cache-coherence protocol from shared-memory multiprocessors: the proposed Artifact Coherence System (ACS) uses lazy invalidation to reduce cost to O((n + W) × |D|), where W(d_i) is the number of writers per artifact. The main innovations are (1) a formal mapping from cache coherence to artifact synchronization; (2) the Token Coherence Theorem as a lower bound on the savings; and (3) a TLA+-verified protocol enforcing single-writer safety, monotonic versioning, and bounded staleness. Simulations show token savings of up to 95% across workload configurations, well above the theoretical lower bounds.
Link: https://arxiv.org/abs/2603.15183
Authors: Vladyslav Parakhin
Affiliations: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 25 pages. Code and reproduction scripts at this https URL
Abstract: Multi-agent LLM orchestration incurs synchronization costs scaling as O(n × S × |D|) in agents, steps, and artifact size under naive broadcast – a regime I term broadcast-induced triply-multiplicative overhead. I argue this pathology is a structural residue of full-state rebroadcast, not an inherent property of multi-agent coordination. The central claim: synchronization cost explosion in LLM multi-agent systems maps with formal precision onto the cache coherence problem in shared-memory multiprocessors, and MESI-protocol invalidation transfers to artifact synchronization under minimal structural modification. I construct the Artifact Coherence System (ACS) and prove the Token Coherence Theorem: lazy invalidation attenuates cost by at least S/(n + W(d_i)) when S > n + W(d_i), converting O(n × S × |D|) to O((n + W) × |D|). A TLA+-verified protocol enforces single-writer safety, monotonic versioning, and bounded staleness across ~2,400 explored states. Simulation across four workload configurations yields token savings of 95.0% ± 1.3% at V=0.05, 92.3% ± 1.4% at V=0.10, 88.3% ± 1.5% at V=0.25, and 84.2% ± 1.3% at V=0.50 – each exceeding the theorem's conservative lower bounds. Savings of ~81% persist at V=0.9, contrary to the predicted collapse threshold. Contributions: (1) formal MESI-to-artifact state mapping; (2) Token Coherence Theorem as savings lower bound; (3) TLA+-verified protocol with three proven invariants; (4) characterization of conditional artifact access semantics resolving the always-read objection; (5) reference Python implementation integrating with LangGraph, CrewAI, and AutoGen via thin adapter layers.
Submission history: [v1] Mon, 16 Mar 2026 12:20:06 UTC (27 KB)
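To make the complexity claim concrete, here is a small back-of-the-envelope calculator; the function names and workload numbers are mine, while the two cost formulas come straight from the abstract:

```python
# Compare the two synchronization-cost regimes from the abstract:
# naive broadcast re-sends every artifact to every agent at every step,
# while lazy invalidation pays one initial read per agent plus one
# writer-triggered refresh per write.

def broadcast_cost(n_agents: int, steps: int, artifact_tokens: int) -> int:
    """O(n * S * |D|): full-state rebroadcast each step."""
    return n_agents * steps * artifact_tokens

def lazy_invalidation_cost(n_agents: int, writers: int, artifact_tokens: int) -> int:
    """O((n + W) * |D|): initial reads plus write-triggered refreshes."""
    return (n_agents + writers) * artifact_tokens

if __name__ == "__main__":
    n, S, W, D = 8, 50, 3, 2_000   # hypothetical workload, not from the paper
    naive = broadcast_cost(n, S, D)
    lazy = lazy_invalidation_cost(n, W, D)
    print(naive, lazy)             # 800000 22000
    # Observed attenuation n*S/(n+W) is well above the theorem's
    # conservative lower bound S/(n+W).
    print(naive / lazy > S / (n + W))  # True
```

Even at this toy scale the step count S multiplies only the broadcast cost, which is the structural source of the savings the theorem bounds.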
[MA-5] Why Agents Compromise Safety Under Pressure
[Quick Read]: This paper examines why LLM agents in complex environments drift away from compliance when goal maximization conflicts with safety constraints: when an agent cannot satisfy both, it exhibits normative drift and strategically sacrifices safety to preserve utility. The key is identifying and mitigating Agentic Pressure, an endogenous tension that emerges when compliant execution becomes infeasible; a pressure-isolation strategy that decouples decision-making from pressure signals helps restore alignment between the model's behavior and its intended safety constraints.
Link: https://arxiv.org/abs/2603.14975
Authors: Hengle Jiang, Ke Tang
Affiliations: Southern University of Science and Technology
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: 17 pages, 5 figures
Abstract:Large Language Model agents deployed in complex environments frequently encounter a conflict between maximizing goal achievement and adhering to safety constraints. This paper identifies a new concept called Agentic Pressure, which characterizes the endogenous tension emerging when compliant execution becomes infeasible. We demonstrate that under this pressure agents exhibit normative drift where they strategically sacrifice safety to preserve utility. Notably we find that advanced reasoning capabilities accelerate this decline as models construct linguistic rationalizations to justify violation. Finally, we analyze the root causes and explore preliminary mitigation strategies, such as pressure isolation, which attempts to restore alignment by decoupling decision-making from pressure signals.
[MA-6] Sample-Efficient Hypergradient Estimation for Decentralized Bi-Level Reinforcement Learning ICAPS2026
[Quick Read]: This paper addresses optimization in decentralized bi-level reinforcement learning (BRL), where the leader cannot intervene in the follower's policy optimization and only observes the resulting optimal policy, making it hard to update the leader's strategy. Prior methods require extensive repeated state visits or gradient estimators whose complexity grows sharply with the dimensionality of the leader's decision space. The key innovation is using the Boltzmann covariance trick to derive an alternative hypergradient formulation that can be estimated purely from interaction samples, without the computational cost scaling badly in the leader's decision dimension. To the authors' knowledge, this is the first hypergradient-based optimization method for 2-player Markov games in decentralized settings, validated on both discrete- and continuous-state tasks.
Link: https://arxiv.org/abs/2603.14867
Authors: Mikoto Kudo, Takumi Tanabe, Akifumi Wachi, Youhei Akimoto
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments: 26 pages. Accepted at ICAPS 2026
Abstract:Many strategic decision-making problems, such as environment design for warehouse robots, can be naturally formulated as bi-level reinforcement learning (RL), where a leader agent optimizes its objective while a follower solves a Markov decision process (MDP) conditioned on the leader’s decisions. In many situations, a fundamental challenge arises when the leader cannot intervene in the follower’s optimization process; it can only observe the optimization outcome. We address this decentralized setting by deriving the hypergradient of the leader’s objective, i.e., the gradient of the leader’s strategy that accounts for changes in the follower’s optimal policy. Unlike prior hypergradient-based methods that require extensive data for repeated state visits or rely on gradient estimators whose complexity can increase substantially with the high-dimensional leader’s decision space, we leverage the Boltzmann covariance trick to derive an alternative hypergradient formulation. This enables efficient hypergradient estimation solely from interaction samples, even when the leader’s decision space is high-dimensional. Additionally, to our knowledge, this is the first method that enables hypergradient-based optimization for 2-player Markov games in decentralized settings. Experiments highlight the impact of hypergradient updates and demonstrate our method’s effectiveness in both discrete and continuous state tasks.
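For context, the "Boltzmann covariance trick" mentioned above is, as I read it, an instance of the standard score-function identity for Boltzmann (softmax-energy) distributions; the sketch below uses my own notation and is not taken from the paper:

```latex
% For a Boltzmann distribution p_\theta(x) = \exp(g_\theta(x)) / Z(\theta),
% the gradient of an expectation reduces to a covariance:
\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}[f(x)]
  = \mathbb{E}_{x \sim p_\theta}\!\left[ f(x)\, \nabla_\theta \log p_\theta(x) \right]
  = \operatorname{Cov}_{x \sim p_\theta}\!\bigl( f(x),\, \nabla_\theta g_\theta(x) \bigr),
% using \nabla_\theta \log p_\theta(x)
%   = \nabla_\theta g_\theta(x) - \mathbb{E}_{x \sim p_\theta}[\nabla_\theta g_\theta(x)].
```

Because the right-hand side is a covariance under the sampling distribution itself, it can be estimated from interaction samples alone, which is what makes a sample-based hypergradient estimator possible in the decentralized setting.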
[MA-7] Forecast-Aware Cooperative Planning on Temporal Graphs under Stochastic Adversarial Risk
[Quick Read]: This paper addresses the failure of conventional support-coordination strategies in cooperative multi-robot missions when environmental risk evolves over time (e.g., adversary patrols or stochastically moving hazards); existing support-based methods assume a static risk landscape and cannot exploit predictable trends in risk evolution, limiting the effectiveness of support allocation. The key is a forecast-aware cooperative planning framework: adversary dynamics are modeled as a first-order Markov stay-move process over graph edges, and edge-occupancy probabilities are propagated forward in time to produce time-indexed edge-risk forecasts. These forecasts both guide proactive allocation of support positions to high-risk edges and jointly inform robot path planning, enabling support coordination that anticipates future risk. Experiments show consistently lower total expected team cost, approaching the performance of an oracle planner with ideal information.
Link: https://arxiv.org/abs/2603.14697
Authors: Manshi Limbu, Xuan Wang, Gregory J. Stein, Daigo Shishika, Xuesu Xiao
Affiliations: George Mason University; National Science Foundation; Army Research Office; Air Force Research Laboratory; US Air Forces Central; Google DeepMind; Clearpath Robotics; Raytheon Technologies; Tangenta; Mason Innovation Exchange; Walmart
Subjects: Multiagent Systems (cs.MA)
Comments:
Abstract:Cooperative multi-robot missions often require teams of robots to traverse environments where traversal risk evolves due to adversary patrols or shifting hazards with stochastic dynamics. While support coordination - where robots assist teammates in traversing risky regions - can significantly reduce mission costs, its effectiveness depends on the team’s ability to anticipate future risk. Existing support-based frameworks assume static risk landscapes and therefore fail to account for predictable temporal trends in risk evolution. We propose a forecast-aware cooperative planning framework that integrates stochastic risk forecasting with anticipatory support allocation on temporal graphs. By modeling adversary dynamics as a first-order Markov stay-move process over graph edges, we propagate the resulting edge-occupancy probabilities forward in time to generate time-indexed edge-risk forecasts. These forecasts guide the proactive allocation of support positions to forecasted risky edges for effective support coordination, while also informing joint robot path planning. Experimental results demonstrate that our approach consistently reduces total expected team cost compared to non-anticipatory baselines, approaching the performance of an oracle planner.
[MA-8] EARCP: Self-Regulating Coherence-Aware Ensemble Architecture for Sequential Decision Making – Ensemble Auto-Régulé par Cohérence et Performance
[Quick Read]: This paper addresses the degradation of traditional ensemble methods in dynamic environments, where static or offline-learned weights fall short under temporal dependencies and non-stationary data distributions. The key is the EARCP (Ensemble Auto-Régulé par Cohérence et Performance) architecture, which dynamically adapts the weights of heterogeneous expert models through an online learning mechanism driven by both individual performance and inter-model coherence; by combining the theoretical foundations of multiplicative-weight-update algorithms with a novel coherence regularization term, it retains a sublinear regret bound of O(√(T log M)) while improving robustness and adaptability in non-stationary environments.
Link: https://arxiv.org/abs/2603.14651
Authors: Mike Amega
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 13 pages, 1 table, 1 algorithm. Open-source implementation available at this https URL and via pip install earcp. Dual-licensed: free for academic researchers, students, and organizations with gross revenue under $100,000/year; commercial license required for organizations exceeding this threshold (contact author)
Abstract:We present EARCP (Ensemble Auto-Régulé par Cohérence et Performance), a novel ensemble architecture that dynamically weights heterogeneous expert models based on both their individual performance and inter-model coherence. Unlike traditional ensemble methods that rely on static or offline-learned combinations, EARCP continuously adapts model weights through a principled online learning mechanism that balances exploitation of high-performing models with exploration guided by consensus signals. The architecture combines theoretical foundations from multiplicative weight update algorithms with a novel coherence-based regularization term, providing both theoretical guarantees through regret bounds and practical robustness in non-stationary environments. We formalize the EARCP framework, prove sublinear regret bounds of O(sqrt(T log M)) under standard assumptions, and demonstrate its effectiveness through empirical evaluation on sequential prediction tasks including time series forecasting, activity recognition, and financial prediction. The architecture is designed as a general-purpose framework applicable to any domain requiring ensemble learning with temporal dependencies. An open-source implementation is available at this https URL and via PyPI (pip install earcp).
[MA-9] EcoFair-CH-MARL: Scalable Constrained Hierarchical Multi-Agent RL with Real-Time Emission Budgets and Fairness Guarantees ECAI
[Quick Read]: This paper addresses the difficulty of making maritime logistics simultaneously efficient, sustainable, and fair under tightening global decarbonisation targets and market pressure, in particular enforcing emission constraints, equitable cost allocation, and large-scale multi-agent coordination under uncertainty (stochastic weather and demand). The key is the EcoFair-CH-MARL framework, which unifies three innovations: (i) a primal-dual budget layer that probabilistically bounds cumulative emissions; (ii) a fairness-aware reward transformer with dynamically scheduled penalties enforcing max-min cost equity across heterogeneous fleets; and (iii) a two-tier policy architecture that decouples strategic routing from real-time vessel control and scales linearly in agent count. Theory establishes O(√T) regret for both constraint violations and fairness loss; experiments on a high-fidelity maritime digital twin show up to 15% lower emissions, 12% higher throughput, and a 45% fair-cost improvement over state-of-the-art methods.
Link: https://arxiv.org/abs/2603.14625
Authors: Saad Alqithami
Affiliations: Al-Baha University
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Conference: The 28th European Conference on Artificial Intelligence (ECAI)
Abstract: Global decarbonisation targets and tightening market pressures demand maritime logistics solutions that are simultaneously efficient, sustainable, and equitable. We introduce EcoFair-CH-MARL, a constrained hierarchical multi-agent reinforcement learning framework that unifies three innovations: (i) a primal-dual budget layer that provably bounds cumulative emissions under stochastic weather and demand; (ii) a fairness-aware reward transformer with dynamically scheduled penalties that enforces max-min cost equity across heterogeneous fleets; and (iii) a two-tier policy architecture that decouples strategic routing from real-time vessel control, enabling linear scaling in agent count. New theoretical results establish O(√T) regret for both constraint violations and fairness loss. Experiments on a high-fidelity maritime digital twin (16 ports, 50 vessels) driven by automatic identification system traces, plus an energy-grid case study, show up to 15% lower emissions, 12% higher throughput, and a 45% fair-cost improvement over state-of-the-art hierarchical and constrained MARL baselines. In addition, EcoFair-CH-MARL achieves stronger equity (lower Gini and higher min-max welfare) than fairness-specific MARL baselines (e.g., SOTO, FEN), and its modular design is compatible with both policy- and value-based learners. EcoFair-CH-MARL therefore advances the feasibility of large-scale, regulation-compliant, and socially responsible multi-agent coordination in safety-critical domains.
Natural Language Processing
[NLP-0] Mixture-of-Depths Attention
[Quick Read]: This paper addresses signal degradation when scaling depth in language models: useful features formed in shallow layers are gradually diluted by repeated residual updates and become hard to recover in deeper layers. The key is mixture-of-depths attention (MoDA), which lets each attention head attend not only to the current layer's key-value (KV) pairs but also to KV pairs from preceding layers, strengthening cross-layer information flow and retention. The authors also design a hardware-efficient algorithm for MoDA's non-contiguous memory-access patterns, reaching 97.3% of FlashAttention-2's efficiency at 64K sequence length. On 1.5B-parameter models, MoDA improves average perplexity by 0.2 and downstream task performance by 2.11%, at only a 3.7% FLOPs overhead.
Link: https://arxiv.org/abs/2603.15619
Authors: Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, Xinggang Wang
Affiliations: Huazhong University of Science and Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Code is released at this https URL
Abstract:Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2’s efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs computational overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at this https URL .
[NLP-1] Mechanistic Origin of Moral Indifference in Language Models
[Quick Read]: This paper argues that behavioral alignment of large language models (LLMs) overlooks the gap between surface compliance and unaligned internal representations, leaving models vulnerable to long-tail risks; the core finding is that LLMs exhibit inherent moral indifference because they compress distinct moral concepts into uniform probability distributions. The key steps: first, 251k moral vectors are constructed from Prototype Theory and the Social-Chemistry-101 dataset to characterize ground-truth moral structure; then sparse autoencoders isolate mono-semantic moral features in Qwen3-8B, whose topological relationships are reconstructed to match the ground-truth vectors. This representational alignment markedly improves moral reasoning and granularity, achieving a 75% pairwise win-rate on the independent adversarial Flames benchmark, and shifts alignment from post-hoc correction toward proactive cultivation of endogenous alignment.
Link: https://arxiv.org/abs/2603.15615
Authors: Lingyu Li, Yan Teng, Yingchun Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 24 pages, 11 figures, 5 tables
Abstract:Existing behavioral alignment techniques for Large Language Models (LLMs) often neglect the discrepancy between surface compliance and internal unaligned representations, leaving LLMs vulnerable to long-tail risks. More crucially, we posit that LLMs possess an inherent state of moral indifference due to compressing distinct moral concepts into uniform probability distributions. We verify and remedy this indifference in LLMs’ latent representations, utilizing 251k moral vectors constructed upon Prototype Theory and the Social-Chemistry-101 dataset. Firstly, our analysis across 23 models reveals that current LLMs fail to represent the distinction between opposed moral categories and fine-grained typicality gradients within these categories; notably, neither model scaling, architecture, nor explicit alignment reshapes this indifference. We then employ Sparse Autoencoders on Qwen3-8B, isolate mono-semantic moral features, and targetedly reconstruct their topological relationships to align with ground-truth moral vectors. This representational alignment naturally improves moral reasoning and granularity, achieving a 75% pairwise win-rate on the independent adversarial Flames benchmark. Finally, we elaborate on the remedial nature of current intervention methods from an experientialist philosophy, arguing that endogenously aligned AI might require a transformation from post-hoc corrections to proactive cultivation.
[NLP-2] Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning
[Quick Read]: This paper addresses the performance bottleneck in code generation caused by scarce high-quality test cases, limited coverage in existing datasets, and static reward mechanisms that cannot adapt as model capability grows. The core solution, Code-A1, is an adversarial co-evolution framework that trains a Code LLM and a Test LLM separately with opposing objectives: the former is rewarded for passing more tests, the latter for exposing more defects. This architectural separation avoids the self-collusion trap while safely enabling white-box testing, in which the Test LLM inspects candidate code to craft targeted adversarial tests, markedly improving both test-generation quality and code robustness.
Link: https://arxiv.org/abs/2603.15611
Authors: Aozhe Wang, Yuchen Yan, Nan Zhou, Zhengxi Lu, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
Affiliations: Zhejiang University
Subjects: Computation and Language (cs.CL)
Comments: Project Page: this https URL Code: this https URL
Abstract: Reinforcement learning for code generation relies on verifiable rewards from unit test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a single model, but face an inherent dilemma: white-box access leads to self-collusion where the model produces trivial tests for easy rewards, yet black-box restriction yields generic tests that miss implementation-specific bugs. We introduce Code-A1, an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with opposing objectives. The Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects. This architectural separation eliminates self-collusion risks and safely enables white-box test generation, where the Test LLM can inspect candidate code to craft targeted adversarial tests. We further introduce a Mistake Book mechanism for experience replay and a composite reward balancing test validity with adversarial difficulty. Experiments on Qwen2.5-Coder models demonstrate that Code-A1 achieves code generation performance matching or exceeding models trained on human-annotated tests, while significantly improving test generation capability.
[NLP-3] From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation
[Quick Read]: This paper addresses inaccurate process supervision in long-horizon robotic manipulation: current video MLLMs, trained mainly under a supervised fine-tuning (SFT) paradigm, act as passive "Observers" that recognize ongoing events but cannot evaluate the current state relative to the final task goal. The key is the PRIMO R1 framework, which uses outcome-based reinforcement learning to incentivize explicit chain-of-thought generation for progress estimation, and constructs a structured temporal input that explicitly anchors the video sequence between initial- and current-state images, turning the video MLLM into an active "Critic". This design markedly improves progress-judgment accuracy and zero-shot generalization, achieving state-of-the-art results across multiple benchmarks.
链接: https://arxiv.org/abs/2603.15600
作者: Yibin Liu,Yaxing Lyu,Daqi Gao,Zhixuan Liang,Weiliang Tang,Shilong Mu,Xiaokang Yang,Yao Mu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 31 pages
Abstract:Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive “Observers” that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active “Critics”. We leverage outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. Furthermore, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial and current state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model achieves a 50% reduction in the mean absolute error of specialized reasoning baselines, demonstrating significant relative accuracy improvements over 72B-scale general MLLMs. Furthermore, PRIMO R1 exhibits strong zero-shot generalization on difficult failure detection tasks. We establish state-of-the-art performance on RoboFail benchmark with 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0%.
[NLP-4] OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data
【速读】: 该论文旨在解决当前前沿搜索代理(search agent)研究中因缺乏透明、高质量训练数据而导致的科研进展受限问题,尤其在开源社区难以与工业巨头竞争的困境。其解决方案的关键在于提出OpenSeeker——首个完全开源的搜索代理模型及配套数据集,并通过两项核心技术实现高性能:一是基于事实锚定的可扩展可控问答(QA)合成方法,利用拓扑扩展和实体混淆反向构建网络图结构,生成具有可控覆盖范围与复杂度的多跳推理任务;二是去噪轨迹合成机制,通过回溯式总结降低行动轨迹噪声,从而提升教师大模型生成高质量动作的能力。实验表明,仅用11.7k合成样本进行单次训练,OpenSeeker即在多个基准测试中达到领先水平,显著优于现有开源代理,甚至超越部分工业级产品。
链接: https://arxiv.org/abs/2603.15594
作者: Yuwen Du,Rui Ye,Shuo Tang,Xinyu Zhu,Yijun Lu,Yuzhu Cai,Siheng Chen
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages, 6 figures
Abstract:Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industrial giants due to a lack of transparent, high-quality training data. This persistent data scarcity has fundamentally hindered the progress of the broader research community in developing and innovating within this domain. To bridge this gap, we introduce OpenSeeker, the first fully open-source search agent (i.e., model and data) that achieves frontier-level performance through two core technical innovations: (1) Fact-grounded scalable controllable QA synthesis, which reverse-engineers the web graph via topological expansion and entity obfuscation to generate complex, multi-hop reasoning tasks with controllable coverage and complexity. (2) Denoised trajectory synthesis, which employs a retrospective summarization mechanism to denoise the trajectory, therefore promoting the teacher LLMs to generate high-quality actions. Experimental results demonstrate that OpenSeeker, trained (a single training run) on only 11.7k synthesized samples, achieves state-of-the-art performance across multiple benchmarks including BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and WideSearch. Notably, trained with simple SFT, OpenSeeker significantly outperforms the second-best fully open-source agent DeepDive (e.g., 29.5% v.s. 15.3% on BrowseComp), and even surpasses industrial competitors such as Tongyi DeepResearch (trained via extensive continual pre-training, SFT, and RL) on BrowseComp-ZH (48.4% v.s. 46.7%). We fully open-source the complete training dataset and the model weights to democratize frontier search agent research and foster a more transparent, collaborative ecosystem.
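其中"拓扑扩展"合成多跳问题的思路,可用如下极简草图体会(纯属示意性假设,实体图、函数名与游走策略均非论文原文;hops 即可控的复杂度参数):

```python
import random

def synthesize_multihop_path(graph, start, hops, rng):
    """从事实锚点出发在实体图上游走 hops 步,
    所得实体链可反向构造多跳问题;链越长,推理复杂度越高。"""
    path, node = [start], start
    for _ in range(hops):
        neighbors = graph.get(node, [])
        if not neighbors:
            break  # 无出边时提前终止,链长即实际跳数
        node = rng.choice(neighbors)
        path.append(node)
    return path

# 玩具实体图:问题可由链尾反推,如"某城市所在国家使用的货币是什么?"
toy_graph = {"巴黎": ["法国"], "法国": ["欧元"], "欧元": []}
chain = synthesize_multihop_path(toy_graph, "巴黎", 2, random.Random(0))
```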
[NLP-5] SlovKE: A Large-Scale Dataset and LLM Evaluation for Slovak Keyphrase Extraction LREC2026
【速读】: 该论文旨在解决形态丰富的低资源语言(如斯洛伐克语)中关键词提取(Keyphrase Extraction)任务缺乏高质量评估数据集的问题。其核心解决方案是构建了一个包含227,432条科学摘要及作者标注关键词的大规模斯洛伐克语数据集,相较此前最大资源扩大25倍,接近英语基准KP20K的规模。在此基础上,论文系统评估了三种无监督基线方法(YAKE、TextRank、KeyBERT with SlovakBERT嵌入)和一种基于大语言模型(LLM)的方法KeyLLM(使用GPT-3.5-turbo),发现传统统计方法因无法处理词形变化导致精确匹配性能受限(F1@6最高仅11.6%),而KeyLLM通过生成更接近作者指定规范形式的关键词,显著缩小了精确匹配与部分匹配之间的差距,并经人工评估验证其能更好捕捉语义相关概念(κ=0.61)。关键创新在于揭示形态不匹配是统计方法的主要失败模式,为其他屈折语言的关键词提取提供了重要启示。
链接: https://arxiv.org/abs/2603.15523
作者: David Števaňák,Marek Šuppa
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: LREC 2026
Abstract:Keyphrase extraction for morphologically rich, low-resource languages remains understudied, largely due to the scarcity of suitable evaluation datasets. We address this gap for Slovak by constructing a dataset of 227,432 scientific abstracts with author-assigned keyphrases – scraped and systematically cleaned from the Slovak Central Register of Theses – representing a 25-fold increase over the largest prior Slovak resource and approaching the scale of established English benchmarks such as KP20K. Using this dataset, we benchmark three unsupervised baselines (YAKE, TextRank, KeyBERT with SlovakBERT embeddings) and evaluate KeyLLM, an LLM-based extraction method using GPT-3.5-turbo. Unsupervised baselines achieve at most 11.6% exact-match F1@6 , with a large gap to partial matching (up to 51.5%), reflecting the difficulty of matching inflected surface forms to author-assigned keyphrases. KeyLLM narrows this exact–partial gap, producing keyphrases closer to the canonical forms assigned by authors, while manual evaluation on 100 documents ( \kappa = 0.61 ) confirms that KeyLLM captures relevant concepts that automated exact matching underestimates. Our analysis identifies morphological mismatch as the dominant failure mode for statistical methods – a finding relevant to other inflected languages. The dataset (this https URL) and evaluation code (this https URL) are publicly available.
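摘要中的精确匹配与部分匹配 F1@k 指标可用如下草图计算(示意性实现:部分匹配这里简化为子串重叠,与论文实际判定标准未必一致),也能直观看出词形屈折为何拉大两者差距:

```python
def f1_at_k(predicted, gold, k, partial=False):
    """取前 k 个预测关键词,与作者关键词计算 F1。
    partial=True 时放宽为子串重叠匹配(示意性简化)。"""
    preds = [p.lower().strip() for p in predicted[:k]]
    golds = {g.lower().strip() for g in gold}
    if partial:
        tp = sum(1 for p in preds if any(p in g or g in p for g in golds))
    else:
        tp = sum(1 for p in preds if p in golds)
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(golds) if golds else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

例如预测词与作者词仅差一个屈折词尾(如单复数)时,精确匹配得 0 分而部分匹配仍能命中,正对应摘要中 11.6% 与 51.5% 之间的巨大落差。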
[NLP-6] Beyond the Covariance Trap: Unlocking Generalization in Same-Subject Knowledge Editing for Large Language Models
【速读】: 该论文旨在解决**同主题知识编辑(same-subject knowledge editing)**中的泛化失效问题:即模型在编辑后能正确回忆原始编辑内容,但在遵循用户指令时却无法召回更新后的知识。作者指出,这一现象的根本原因在于提示(prompt)变化引发的内部激活漂移超出了模型在编辑后的几何容差范围,导致泛化能力崩溃。解决方案的关键在于提出RoSE(Robust Same-subject Editing),其核心机制包括:(1) 各向同性几何对齐(Isotropic Geometric Alignment),以最小化表示偏差;(2) 分层知识整合(Hierarchical Knowledge Integration),用于平滑优化景观。该方法有效缓解了由正交梯度联合优化带来的尖锐极小值问题以及协方差约束引发的“协方差陷阱”,从而显著提升模型在指令跟随场景下的知识保持能力。
链接: https://arxiv.org/abs/2603.15518
作者: Xiyu Liu,Qingyi Si,Zhengxiao Liu,Chenxu Yang,Naibin Gu,Zheng Lin
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院); JD.com (京东)
类目: Computation and Language (cs.CL)
备注: 23 pages, 20 figures
Abstract:While locate-then-edit knowledge editing efficiently updates knowledge encoded within Large Language Models (LLMs), a critical generalization failure mode emerges in the practical same-subject knowledge editing scenario: models fail to recall the updated knowledge when following user instructions, despite successfully recalling it in the original edited form. This paper identifies the geometric root of this generalization collapse as a fundamental conflict where the inner activation drifts induced by prompt variations exceed the model’s geometric tolerance for generalization after editing. We attribute this instability to a dual pathology: (1) The joint optimization with orthogonal gradients collapses solutions into sharp minima with narrow stability, and (2) the standard covariance constraint paradoxically acts as a Covariance Trap that amplifies input perturbations. To resolve this, we introduce RoSE (Robust Same-subject Editing), which employs Isotropic Geometric Alignment to minimize representational deviation and Hierarchical Knowledge Integration to smooth the optimization landscape. Extensive experiments demonstrate that RoSE significantly improves instruction-following capabilities, laying the foundation for robust interactive parametric memory of LLM agents.
[NLP-7] ViX-Ray: A Vietnamese Chest X-Ray Dataset for Vision-Language Models
【速读】: 该论文旨在解决当前主流视觉语言模型(Vision-Language Models, VLMs)在越南临床医学场景中表现不佳的问题,尤其是由于缺乏对越南语医学数据的训练导致其生成结果准确性低、幻觉严重。解决方案的关键在于构建并公开了一个名为ViX-Ray的高质量中文胸片图像数据集,包含5,400张由越南权威医院放射科医生标注的胸部X光图像及其对应的专家级影像学描述和诊断意见,从而为VLMs提供本土化医学语义理解与生成能力的训练基础。通过在此数据集上微调多个先进开源VLMs并对比商用模型(如GPT-4V和Gemini),研究揭示了现有模型在印象生成中的高幻觉率和低精度问题,同时确立了ViX-Ray作为评估和推动越南医疗领域VLMs发展的基准平台。
链接: https://arxiv.org/abs/2603.15513
作者: Duy Vu Minh Nguyen,Chinh Thanh Truong,Phuc Hoang Tran,Hung Tuan Le,Nguyen Van-Thanh Dat,Trung Hieu Pham,Kiet Van Nguyen
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Vietnamese medical research has become an increasingly vital domain, particularly with the rise of intelligent technologies aimed at reducing time and resource burdens in clinical diagnosis. Recent advances in vision-language models (VLMs), such as Gemini and GPT-4V, have sparked a growing interest in applying AI to healthcare. However, most existing VLMs lack exposure to Vietnamese medical data, limiting their ability to generate accurate and contextually appropriate diagnostic outputs for Vietnamese patients. To address this challenge, we introduce ViX-Ray, a novel dataset comprising 5,400 Vietnamese chest X-ray images annotated with expert-written findings and impressions from physicians at a major Vietnamese hospital. We analyze linguistic patterns within the dataset, including the frequency of mentioned body parts and diagnoses, to identify domain-specific linguistic characteristics of Vietnamese radiology reports. Furthermore, we fine-tune five state-of-the-art open-source VLMs on ViX-Ray and compare their performance to leading proprietary models, GPT-4V and Gemini. Our results show that while several models generate outputs partially aligned with clinical ground truths, they often suffer from low precision and excessive hallucination, especially in impression generation. These findings not only demonstrate the complexity and challenge of our dataset but also establish ViX-Ray as a valuable benchmark for evaluating and advancing vision-language models in the Vietnamese clinical domain.
[NLP-8] Invisible failures in human-AI interactions
【速读】: 该论文旨在解决AI系统在实际应用中频繁发生“无声失败”(invisible failures)的问题,即AI行为出现偏差或错误时用户未察觉或未明确反馈的情况。研究表明,78%的AI失败属于此类隐形问题,且这些失败可归纳为八类典型模式(archetypes),并揭示其系统性共现特征,从而识别出更高层次的失败类型。解决方案的关键在于构建一个基于大规模真实交互数据(WildChat)的隐形失败分类体系,该体系不仅区分了交互性(interactional)与能力驱动型(capability-driven)失败(其中91%为交互性失败),还表明即便AI模型能力提升,约94%的失败仍将持续存在,凸显了改进人机交互设计的重要性。此分类框架可作为产品开发者、研究人员和政策制定者进行可靠故障监测与系统优化的核心工具。
链接: https://arxiv.org/abs/2603.15423
作者: Christopher Potts,Moritz Sudhof
机构: Bigspin AI(大旋风人工智能); Stanford University(斯坦福大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:AI systems fail silently far more often than they fail visibly. In a large-scale quantitative analysis of human-AI interactions from the WildChat dataset, we find that 78% of AI failures are invisible: something went wrong but the user gave no overt indication that there was a problem. These invisible failures cluster into eight archetypes that help us characterize where and how AI systems are failing to meet users’ needs. In addition, the archetypes show systematic co-occurrence patterns indicating higher-level failure types. To address the question of whether these archetypes will remain relevant as AI systems become more capable, we also assess failures for whether they are primarily interactional or capability-driven, finding that 91% involve interactional dynamics, and we estimate that 94% of such failures would persist even with a more capable model. Finally, we illustrate how the archetypes help us to identify systematic and variable AI limitations across different usage domains. Overall, we argue that our invisible failure taxonomy can be a key component in reliable failure monitoring for product developers, scientists, and policy makers. Our code and data are available at this https URL
[NLP-9] CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents
【速读】: 该论文旨在解决小语言模型(Small Language Model, SLM)在使用全局记忆池时面临的知识稀释与干扰问题,尤其在复杂推理任务中因无关上下文引入导致性能下降的问题。其核心解决方案是提出一种基于聚类的智能体记忆框架CLAG(CLustering-based AGentic memory),关键在于通过SLM驱动的路由机制将新记忆分配至语义一致的聚类单元,并自动生成每个聚类的专属特征描述(如主题摘要和标签),从而构建结构化的局部记忆空间。这种设计实现了记忆的局部演化与高效检索:在检索阶段采用两阶段过滤策略,先依据聚类特征筛选相关簇以缩小搜索范围,显著降低跨主题干扰并提升内部记忆密度,最终在多个问答数据集上验证了其对SLM代理的稳定性能增益。
链接: https://arxiv.org/abs/2603.15421
作者: Taeyun Roh,Wonjune Jang,Junha Jung,Jaewoo Kang
机构: Korea University (高丽大学); Myongji University (明知大学); AIGEN Sciences (AIGEN科学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language model agents heavily rely on external memory to support knowledge reuse and complex reasoning tasks. Yet most memory systems store experiences in a single global retrieval pool which can gradually dilute or corrupt stored knowledge. This problem is especially pronounced for small language models (SLMs), which are highly vulnerable to irrelevant context. We introduce CLAG, a CLustering-based AGentic memory framework where an SLM agent actively organizes memory by clustering. CLAG employs an SLM-driven router to assign incoming memories to semantically coherent clusters and autonomously generates cluster-specific profiles, including topic summaries and descriptive tags, to establish each cluster as a self-contained functional unit. By performing localized evolution within these structured neighborhoods, CLAG effectively reduces cross-topic interference and enhances internal memory density. During retrieval, the framework utilizes a two-stage process that first filters relevant clusters via their profiles, thereby excluding distractors and reducing the search space. Experiments on multiple QA datasets with three SLM backbones show that CLAG consistently improves answer quality and robustness over prior memory systems for agents, remaining lightweight and efficient.
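CLAG 的"路由到簇 + 两阶段检索"流程可用如下草图示意(示意性简化:论文中由 SLM 路由器与簇画像完成语义判断,这里以向量余弦相似度代替;类名、阈值与画像表示均为本文假设):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class ClusteredMemory:
    """按簇组织的记忆:写入时路由到最相似的簇,检索时先筛簇再筛条目。"""
    def __init__(self, threshold=0.8):
        self.clusters = []  # 每项: {"profile": 簇画像向量, "items": [(向量, 文本)]}
        self.threshold = threshold

    def add(self, vec, text):
        # 路由:分配到画像最相似的簇,相似度不足则新建簇(论文中由 SLM 路由器完成)
        best, best_sim = None, -1.0
        for c in self.clusters:
            s = cosine(vec, c["profile"])
            if s > best_sim:
                best, best_sim = c, s
        if best is None or best_sim < self.threshold:
            # 简化:以首条记忆作为簇画像,不再更新(论文中画像为 SLM 生成的主题摘要)
            self.clusters.append({"profile": list(vec), "items": [(vec, text)]})
        else:
            best["items"].append((vec, text))

    def retrieve(self, query, top_clusters=1):
        # 阶段一:按簇画像筛出相关簇;阶段二:仅在选中簇内做细粒度检索
        ranked = sorted(self.clusters, key=lambda c: cosine(query, c["profile"]), reverse=True)
        candidates = [it for c in ranked[:top_clusters] for it in c["items"]]
        return max(candidates, key=lambda it: cosine(query, it[0]))[1]
```

先过滤簇再检索条目,使无关主题的记忆不会进入候选集,正是论文所述降低跨主题干扰的核心机制。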
[NLP-10] Amplification Effects in Test-Time Reinforcement Learning: Safety and Reasoning Vulnerabilities
【速读】: 该论文旨在解决测试时训练(Test-time Training, TTT)方法在提升大语言模型(Large Language Models, LLMs)推理能力过程中所引发的安全漏洞问题,特别是其对有害提示注入(harmful prompt injection)的脆弱性。研究聚焦于一种基于自一致性(self-consistency)的TTT方法——测试时强化学习(Test-time Reinforcement Learning, TTRL),该方法通过多数投票作为奖励信号来增强模型推理的一致性。论文发现,TTRL在面对恶意注入时会放大模型原有行为:若基础模型较安全,则产生“安全放大”;若模型本身易受攻击,则导致“有害性放大”,二者均伴随推理能力下降,即所谓的“推理税”(reasoning tax)。解决方案的关键在于识别并量化这种由自一致性驱动的放大效应,揭示TTT方法内在的安全风险,并强调未来需设计更安全的TTT机制以避免因强化自我一致性而引发的不可控行为。
链接: https://arxiv.org/abs/2603.15417
作者: Vanshaj Khattar,Md Rafi ur Rashid,Moumita Choudhury,Jing Liu,Toshiaki Koike-Akino,Ming Jin,Ye Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
Abstract:Test-time training (TTT) has recently emerged as a promising method to improve the reasoning abilities of large language models (LLMs), in which the model directly learns from test data without access to labels. However, this reliance on test data also makes TTT methods vulnerable to harmful prompt injections. In this paper, we investigate safety vulnerabilities of TTT methods, where we study a representative self-consistency-based test-time learning method: test-time reinforcement learning (TTRL), a recent TTT method that improves LLM reasoning by rewarding self-consistency using majority vote as a reward signal. We show that harmful prompt injection during TTRL amplifies the model’s existing behaviors, i.e., safety amplification when the base model is relatively safe, and harmfulness amplification when it is vulnerable to the injected data. In both cases, there is a decline in reasoning ability, which we refer to as the reasoning tax. We also show that TTT methods such as TTRL can be exploited adversarially using specially designed “HarmInject” prompts to force the model to answer jailbreak and reasoning queries together, resulting in stronger harmfulness amplification. Overall, our results highlight that TTT methods that enhance LLM reasoning by promoting self-consistency can lead to amplification behaviors and reasoning degradation, highlighting the need for safer TTT methods.
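TTRL 以多数投票作为奖励信号的机制可用如下草图示意(示意性简化,非原实现)。也正因奖励只强化"与多数一致",当注入数据改变多数答案的走向时,模型原有倾向(无论安全还是有害)都会被放大:

```python
from collections import Counter

def majority_vote_rewards(answers):
    """TTRL 式自一致性奖励:对同一问题采样多个 rollout 的最终答案,
    与多数答案一致的 rollout 得 1,否则得 0(无需标签)。"""
    majority, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == majority else 0.0 for a in answers]
    return rewards, majority
```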
[NLP-11] SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia CVPR2026
【速读】: 该论文旨在解决现有多语言文档与场景文本理解基准普遍聚焦于高资源语言、难以评估模型在真实多语言环境(尤其是低资源语言)中表现的问题。针对东南亚地区语言多样性高、书写系统复杂及文档类型多样等挑战,作者提出SEA-Vision基准,其关键在于设计了一种混合标注流程:结合自动化过滤与评分、多模态大语言模型(Multimodal Large Language Model, MLLM)辅助标注以及轻量级母语者验证机制,在显著降低人工标注成本的同时保障了高质量的多任务标注数据(包括文档解析和以文本为中心的视觉问答,Text-Centric Visual Question Answering, TEC-VQA),从而有效支撑对11种东南亚语言的联合评估。
链接: https://arxiv.org/abs/2603.15409
作者: Pengfei Yue,Xingran Zhao,Juntao Chen,Peng Hou,Wang Longchao,Jianghang Lin,Shengchuan Zhang,Anxiang Zeng,Liujuan Cao
机构: Xiamen University (厦门大学); Shopee (虾皮); Tongji University (同济大学)
类目: Computation and Language (cs.CL)
备注: Accepted By CVPR2026
Abstract:Multilingual document and scene text understanding plays an important role in applications such as search, finance, and public services. However, most existing benchmarks focus on high-resource languages and fail to evaluate models in realistic multilingual environments. In Southeast Asia, the diversity of languages, complex writing systems, and highly varied document types make this challenge even greater. We introduce SEA-Vision, a benchmark that jointly evaluates Document Parsing and Text-Centric Visual Question Answering (TEC-VQA) across 11 Southeast Asian languages. SEA-Vision contains 15,234 document parsing pages from nine representative document types, annotated with hierarchical page-, block-, and line-level labels. It also provides 7,496 TEC-VQA question-answer pairs that probe text recognition, numerical calculation, comparative analysis, logical reasoning, and spatial understanding. To make such multilingual, multi-task annotation feasible, we design a hybrid pipeline for Document Parsing and TEC-VQA. It combines automated filtering and scoring with MLLM-assisted labeling and lightweight native-speaker verification, greatly reducing manual labeling while maintaining high quality. We evaluate several leading multimodal models and observe pronounced performance degradation on low-resource Southeast Asian languages, highlighting substantial remaining gaps in multilingual document and scene text understanding. We believe SEA-Vision will help drive global progress in document and scene text understanding.
[NLP-12] Fusian: Multi-LoRA Fusion for Fine-Grained Continuous MBTI Personality Control in Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在人格控制方面存在的局限性问题,即现有方法(如提示工程和标准监督微调(Supervised Fine-Tuning, SFT))通常将人格特质视为离散类别(例如“外向”与“内向”),无法实现对人格强度的连续、精细调控。解决方案的关键在于提出Fusian框架,其核心创新包括两个阶段:首先在SFT过程中保存一系列LoRA适配器(LoRA adapters)以捕捉人格演变轨迹,从而映射出人格特质的连续流形;其次利用强化学习(Reinforcement Learning, RL)训练一个策略网络,动态计算这些冻结适配器的混合权重,并通过从参数化于策略网络的Dirichlet分布中采样实现多适配器融合,从而精确匹配用户指定的人格强度数值目标。
链接: https://arxiv.org/abs/2603.15405
作者: Zehao Chen,Rong Pan
机构: Sun Yat-sen University (中山大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in simulating diverse human behaviors and personalities. However, existing methods for personality control, which include prompt engineering and standard Supervised Fine-Tuning (SFT), typically treat personality traits as discrete categories (e.g., “Extroverted” vs. “Introverted”), lacking the ability to precisely control the intensity of a trait on a continuous spectrum. In this paper, we introduce Fusian, a novel framework for fine-grained, continuous personality control in LLMs. Fusian operates in two stages: (1) Trajectory Collection, where we capture the dynamic evolution of personality adoption during SFT by saving a sequence of LoRA adapters, effectively mapping the continuous manifold of a trait; and (2) RL-based Dynamic Fusion, where we train a policy network using Reinforcement Learning to dynamically compute mixing weights for these frozen adapters. By sampling from a Dirichlet distribution parameterized by the policy network, Fusian fuses multiple adapters to align the model’s output with a specific numerical target intensity. Experiments on the Qwen3-14B model demonstrate that Fusian achieves high precision in personality control, significantly outperforming baseline methods in aligning with user-specified trait intensities.
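其中"从 Dirichlet 分布采样权重并线性融合多个 LoRA 适配器"的核心步骤可作如下草图(示意性实现:真实场景中 deltas 为各适配器的低秩增量矩阵且由策略网络输出 Dirichlet 参数,这里用嵌套列表与固定参数代替;函数名为本文假设):

```python
import random

def sample_dirichlet(alphas, rng):
    """通过 Gamma 采样构造 Dirichlet 样本:权重非负且和为 1。"""
    gs = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(gs)
    return [g / s for g in gs]

def fuse_lora_deltas(deltas, weights):
    """按权重线性混合多个 LoRA 增量矩阵(嵌套列表示意)。"""
    rows, cols = len(deltas[0]), len(deltas[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for w, d in zip(weights, deltas):
        for i in range(rows):
            for j in range(cols):
                fused[i][j] += w * d[i][j]
    return fused
```

由于 Dirichlet 样本天然落在单纯形上,混合结果始终是各适配器的凸组合,从而可在"人格演变轨迹"上连续插值。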
[NLP-13] A Closer Look into LLMs for Table Understanding
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在表格理解任务中内部机制不明确的问题。为实现这一目标,作者对16种LLMs进行了实证研究,涵盖通用LLM、专用表格LLM及混合专家(Mixture-of-Experts, MoE)模型,从注意力动态、有效层深度、专家激活模式和输入设计影响四个维度展开分析。其关键解决方案在于揭示了LLMs处理表格数据的内在规律:首先识别出“三阶段注意力模式”——早期层广泛扫描表格、中期层定位相关单元格、晚期层增强关键贡献;其次发现表格任务比数学推理任务需要更深的网络层才能达到稳定预测;再次指出MoE模型在中间层激活表格特有专家,而首尾层共享通用专家;最后表明思维链(Chain-of-Thought)提示能提升表格注意力,并可通过表格微调进一步强化。这些发现为提升表格理解任务的可解释性与模型优化提供了重要依据。
链接: https://arxiv.org/abs/2603.15402
作者: Jia Wang,Chuanyu Qin,Mingyu Zheng,Qingyi Si,Peize Li,Zheng Lin
机构: Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络空间安全学院); JD.COM(京东)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite the success of Large Language Models (LLMs) in table understanding, their internal mechanisms remain unclear. In this paper, we conduct an empirical study on 16 LLMs, covering general LLMs, specialist tabular LLMs, and Mixture-of-Experts (MoE) models, to explore how LLMs understand tabular data and perform downstream tasks. Our analysis focuses on 4 dimensions including the attention dynamics, the effective layer depth, the expert activation, and the impacts of input designs. Key findings include: (1) LLMs follow a three-phase attention pattern – early layers scan the table broadly, middle layers localize relevant cells, and late layers amplify their contributions; (2) tabular tasks require deeper layers than math reasoning to reach stable predictions; (3) MoE models activate table-specific experts in middle layers, with early and late layers sharing general-purpose experts; (4) Chain-of-Thought prompting increases table attention, further enhanced by table-tuning. We hope these findings and insights can facilitate interpretability and future research on table-related tasks.
[NLP-14] When Does Sparsity Mitigate the Curse of Depth in LLMs
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中因深度增加而导致的“深度诅咒”问题,即随着网络层数加深,后期层对学习和表征的贡献逐渐降低,这与预层归一化(Pre-Layer Normalization)中方差累积导致深层模块趋于近恒等映射有关。其解决方案的关键在于揭示稀疏性(sparsity)不仅可提升计算效率,更是一种调控方差传播的有效机制,从而改善深度利用率。作者通过实证分析两类稀疏性——隐式稀疏性(如权重衰减诱导的权重稀疏性和长上下文输入引发的注意力稀疏性)与显式稀疏性(如分组查询注意力中的键值共享稀疏性和专家激活稀疏性),发现它们均能显著降低输出方差并促进各层功能分化,最终提出一个实用的训练深度高效LLM的规则指南,在下游任务上实现4.6%的准确率提升。
链接: https://arxiv.org/abs/2603.15389
作者: Dilxat Muhtar,Xinyuan Song,Sebastian Pokutta,Max Zimmer,Nico Pelleriti,Thomas Hofmann,Shiwei Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注: 32 pages, 29 figures
Abstract:Recent work has demonstrated the curse of depth in large language models (LLMs), where later layers contribute less to learning and representation than earlier layers. Such under-utilization is linked to the accumulated growth of variance in Pre-Layer Normalization, which can push deep blocks toward near-identity behavior. In this paper, we demonstrate that sparsity, beyond enabling efficiency, acts as a regulator of variance propagation and thereby improves depth utilization. Our investigation covers two sources of sparsity: (i) implicit sparsity, which emerges from training and data conditions, including weight sparsity induced by weight decay and attention sparsity induced by long context inputs; and (ii) explicit sparsity, which is enforced by architectural design, including key/value-sharing sparsity in Grouped-Query Attention and expert-activation sparsity in Mixture-of-Experts. Our claim is thoroughly supported by controlled depth-scaling experiments and targeted layer effectiveness interventions. Across settings, we observe a consistent relationship: sparsity improves layer utilization by reducing output variance and promoting functional differentiation. We eventually distill our findings into a practical rule-of-thumb recipe for training depth-effective LLMs, yielding a notable 4.6% accuracy improvement on downstream tasks. Our results reveal sparsity, arising naturally from standard design choices, as a key yet previously overlooked mechanism for effective depth scaling in LLMs. Code is available at this https URL.
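摘要中的"近恒等行为"可以用相邻层隐藏表示的余弦相似度做一个粗略诊断,如下草图所示(诊断方式为本文示意性假设,并非论文采用的层有效性度量):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def near_identity_layers(hidden_states, sim_threshold=0.99):
    """hidden_states[l] 为第 l 层块输出处同一 token 的隐藏向量。
    相邻层表示的余弦相似度接近 1,说明该块对表示几乎不做改变,
    即趋于近恒等映射、深度利用率低。"""
    flags = []
    for l in range(1, len(hidden_states)):
        sim = cosine(hidden_states[l - 1], hidden_states[l])
        flags.append(sim >= sim_threshold)
    return flags
```

按论文的结论,稀疏性通过抑制方差累积降低这类标记为近恒等的深层比例,从而提升各层的功能分化。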
[NLP-15] CRASH: Cognitive Reasoning Agent for Safety Hazards in Autonomous Driving
【速读】: 该论文旨在解决自动驾驶系统(Autonomous Vehicles, AVs)在实际运行中发生事故时,由于系统架构异构性(如端到端与模块化设计并存)、算法差异及集成策略多样化,导致事故调查难以标准化、安全分析缺乏系统性的难题。其解决方案的关键在于提出CRASH(Cognitive Reasoning Agent for Safety Hazards),一个基于大语言模型(Large Language Model, LLM)的智能代理,能够统一处理结构化字段与非结构化叙述文本,自动推理事故成因、归因主要故障并判断自动驾驶系统是否实质性参与事件。CRASH通过自动化、可解释的方式对2,168起真实事故案例进行分析,准确识别出64%的事故源于感知或规划失败,并揭示50%为追尾碰撞,验证了其在提升事故分析效率和可靠性方面的潜力,为自动驾驶系统的安全性研究与迭代优化提供可行动的洞察。
链接: https://arxiv.org/abs/2603.15364
作者: Erick Silva,Rehana Yasmin,Ali Shoker
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:As AVs grow in complexity and diversity, identifying the root causes of operational failures has become increasingly complex. The heterogeneity of system architectures across manufacturers, ranging from end-to-end to modular designs, together with variations in algorithms and integration strategies, limits the standardization of incident investigations and hinders systematic safety analysis. This work examines real-world AV incidents reported in the NHTSA database. We curate a dataset of 2,168 cases reported between 2021 and 2025, representing more than 80 million miles driven. To process this data, we introduce CRASH, Cognitive Reasoning Agent for Safety Hazards, an LLM-based agent that automates reasoning over crash reports by leveraging both standardized fields and unstructured narrative descriptions. CRASH operates on a unified representation of each incident to generate concise summaries, attribute a primary cause, and assess whether the AV materially contributed to the event. Our findings show that (1) CRASH attributes 64% of incidents to perception or planning failures, underscoring the importance of reasoning-based analysis for accurate fault attribution; and (2) approximately 50% of reported incidents involve rear-end collisions, highlighting a persistent and unresolved challenge in autonomous driving deployment. We further validate CRASH with five domain experts, achieving 86% accuracy in attributing AV system failures. Overall, CRASH demonstrates strong potential as a scalable and interpretable tool for automated crash analysis, providing actionable insights to support safety research and the continued development of autonomous driving systems.
[NLP-16] DOS: Dependency-Oriented Sampler for Masked Diffusion Language Models
【速读】: 该论文旨在解决当前预训练掩码扩散语言模型(Masked Diffusion Language Models, MDLMs)在解码过程中主要依赖词元级不确定性准则,而忽视序列级信息和词元间依赖关系的问题。解决方案的关键在于提出一种无需训练的依赖导向采样器(Dependency-Oriented Sampler, DOS),其核心机制是利用Transformer块中的注意力矩阵近似词元间的依赖关系,在更新被掩码位置时优先考虑未掩码词元的信息,从而更有效地利用上下文依赖结构,提升生成质量与效率。
链接: https://arxiv.org/abs/2603.15340
作者: Xueyu Zhou,Yangrong Hu,Jian Huang
机构: The Hong Kong Polytechnic University (香港理工大学)
类目: Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: 16 pages, 5 figures
Abstract:Masked diffusion language models (MDLMs) have recently emerged as a new paradigm in language modeling, offering flexible generation dynamics and enabling efficient parallel decoding. However, existing decoding strategies for pre-trained MDLMs predominantly rely on token-level uncertainty criteria, while largely overlooking sequence-level information and inter-token dependencies. To address this limitation, we propose Dependency-Oriented Sampler (DOS), a training-free decoding strategy that leverages inter-token dependencies to inform token updates during generation. Specifically, DOS exploits attention matrices from transformer blocks to approximate inter-token dependencies, emphasizing information from unmasked tokens when updating masked positions. Empirical results demonstrate that DOS consistently achieves superior performance on both code generation and mathematical reasoning tasks. Moreover, DOS can be seamlessly integrated with existing parallel sampling methods, leading to improved generation efficiency without sacrificing generation quality.
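DOS 利用注意力矩阵近似词元间依赖的思想可作如下草图(示意性简化:真实实现作用于 Transformer 各层注意力与完整词表分布,这里用标量置信度与单个注意力矩阵代替,函数名为本文假设):

```python
def dos_next_position(confidences, attention, unmasked):
    """掩码位置的揭示顺序不仅看自身置信度,还加权其对已揭示
    (unmasked)词元的注意力,即依赖信息的就绪程度。
    attention[i][j] 为位置 i 对位置 j 的注意力权重。"""
    scores = {}
    for pos, conf in confidences.items():  # 仅对掩码位置打分
        dep = sum(attention[pos][j] for j in unmasked)  # 对已揭示词元的依赖强度
        scores[pos] = conf * dep
    return max(scores, key=scores.get)  # 选出下一个要揭示的位置
```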
[NLP-17] TAGARELA - A Portuguese speech dataset from podcasts
【速读】: 该论文旨在解决葡萄牙语(Portuguese)在语音处理领域因缺乏公开、大规模且高质量数据集而导致的资源匮乏问题。解决方案的关键在于构建一个名为TAGARELA的新数据集,其包含超过8,972小时的播客音频,规模接近英语的GigaSpeech(10kh),并采用混合转录策略:首先利用先前在高保真转录数据上训练的自动语音识别(ASR)模型进行初步标注,再结合专有API生成的高质量标签以确保准确性,从而实现高精度的语音-文本对齐。该数据集已公开发布,可有效推动葡萄牙语语音识别(ASR)和语音合成(TTS)模型的性能提升与技术发展。
链接: https://arxiv.org/abs/2603.15326
作者: Frederico Santos de Oliveira,Lucas Rafael Stefanel Gris,Alef Iury Siqueira Ferreira,Augusto Seben da Rosa,Alexandre Costa Ferro Filho,Edresson Casanova,Christopher Dane Shulby,Rafael Teixeira Sousa,Diogo Fernandes Costa Silva,Anderson da Silva Soares,Arlindo Rodrigues Galvão Filho
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite significant advances in speech processing, Portuguese remains under-resourced due to the scarcity of public, large-scale, and high-quality datasets. To address this gap, we present a new dataset, named TAGARELA, composed of over 8,972 hours of podcast audio, specifically curated for training automatic speech recognition (ASR) and text-to-speech (TTS) models. Notably, its scale rivals English’s GigaSpeech (10kh), enabling state-of-the-art Portuguese models. To ensure data quality, the corpus was subjected to an audio pre-processing pipeline and subsequently transcribed using a mixed strategy: we applied ASR models that were previously trained on high-fidelity transcriptions generated by proprietary APIs, ensuring a high level of initial accuracy. Finally, to validate the effectiveness of this new resource, we present ASR and TTS models trained exclusively on our dataset and evaluate their performance, demonstrating its potential to drive the development of more robust and natural speech technologies for Portuguese. The dataset is released publicly, available at this https URL, to foster the development of robust speech technologies.
[NLP-18] PYTHEN: A Flexible Framework for Legal Reasoning in Python
【速读】: 该论文旨在解决法律推理中 defeasible(可废止的)性质难以形式化建模的问题,尤其针对传统逻辑编程方法在表达法律规则中的例外、条件组合及灵活性方面的局限性。解决方案的关键在于提出 PYTHEN——一个基于 Python 的新型框架,其核心创新是利用 Python 内置的 any() 和 all() 函数,原生支持单条规则中同时表达合取(ALL)与析取(ANY)条件,并提供更灵活的异常处理机制,从而实现对法律论证中复杂语义的精确建模。该设计不仅提升了形式化表达能力,还显著降低了开发者门槛,使非逻辑编程背景的专业人员也能高效构建下一代法律人工智能系统。
链接: https://arxiv.org/abs/2603.15317
作者: Ha-Thanh Nguyen,Ken Satoh
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at JURISIN 2026
Abstract:This paper introduces PYTHEN, a novel Python-based framework for defeasible legal reasoning. PYTHEN is designed to model the inherently defeasible nature of legal argumentation, providing a flexible and intuitive syntax for representing legal rules, conditions, and exceptions. Inspired by PROLEG (PROlog-based LEGal reasoning support system) and guided by the philosophy of The Zen of Python, PYTHEN leverages Python’s built-in any() and all() functions to offer enhanced flexibility by natively supporting both conjunctive (ALL) and disjunctive (ANY) conditions within a single rule, as well as a more expressive exception-handling mechanism. This paper details the architecture of PYTHEN, provides a comparative analysis with PROLEG, and discusses its potential applications in autoformalization and the development of next-generation legal AI systems. By bridging the gap between symbolic reasoning and the accessibility of Python, PYTHEN aims to democratize formal legal reasoning for young researchers, legal tech developers, and professionals without extensive logic programming expertise. We position PYTHEN as a practical bridge between the powerful symbolic reasoning capabilities of logic programming and the rich, ubiquitous ecosystem of Python, making formal legal reasoning accessible to a broader range of developers and legal professionals.
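摘要所述"用 any()/all() 原生表达合取、析取与例外"的规则风格,可用如下草图体会(示意性示例:谓词名与事实库均为本文假设,并非 PYTHEN 的实际 API):

```python
def rule(conditions, exceptions=()):
    """可废止规则:所有条件成立且无任何例外成立时结论成立;
    条件内部可再用 any()/all() 自由组合合取(ALL)与析取(ANY)。"""
    def holds():
        return all(c() for c in conditions) and not any(e() for e in exceptions)
    return holds

# 示例事实库(假设的谓词,仅作演示)
facts = {"offer": True, "written": False, "oral": True, "incapacity": False}

# 合同有效 <- 存在要约 且 (书面承诺 或 口头承诺),例外:当事人无行为能力
contract_valid = rule(
    conditions=[lambda: facts["offer"],
                lambda: any(facts[k] for k in ("written", "oral"))],
    exceptions=[lambda: facts["incapacity"]],
)
```

例外一旦成立即推翻结论,无需改写原规则,正是可废止推理区别于普通布尔逻辑之处。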
[NLP-19] CCTU: A Benchmark for Tool Use under Complex Constraints
[Quick Read]: This paper addresses the challenge of evaluating large language models (LLMs) that must call tools under explicit constraints, a scenario that requires function calling, instruction following, and self-refinement but has previously lacked a dedicated benchmark. The key to the solution is the CCTU benchmark, which, grounded in a taxonomy of 12 constraint categories spanning four dimensions (resource, behavior, toolset, and response), comprises 200 carefully curated and challenging test cases, each involving an average of seven constraint types with prompts exceeding 4,700 tokens. The authors also develop an executable constraint validation module that performs step-level compliance checks and enforces constraint adherence during multi-turn interactions, providing a reliable evaluation of LLM tool use under complex constraints.
Link: https://arxiv.org/abs/2603.15309
Authors: Junjie Ye, Guoqiang Zhang, Wenjie Fu, Tao Gui, Qi Zhang, Xuanjing Huang
Affiliations: Fudan University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Solving problems through tool use under explicit constraints constitutes a highly challenging yet unavoidable scenario for large language models (LLMs), requiring capabilities such as function calling, instruction following, and self-refinement. However, progress has been hindered by the absence of dedicated evaluations. To address this, we introduce CCTU, a benchmark for evaluating LLM tool use under complex constraints. CCTU is grounded in a taxonomy of 12 constraint categories spanning four dimensions (i.e., resource, behavior, toolset, and response). The benchmark comprises 200 carefully curated and challenging test cases across diverse tool-use scenarios, each involving an average of seven constraint types and an average prompt length exceeding 4,700 tokens. To enable reliable evaluation, we develop an executable constraint validation module that performs step-level validation and enforces compliance during multi-turn interactions between models and their environments. We evaluate nine state-of-the-art LLMs in both thinking and non-thinking modes. Results indicate that when strict adherence to all constraints is required, no model achieves a task completion rate above 20%. Further analysis reveals that models violate constraints in over 50% of cases, particularly in the resource and response dimensions. Moreover, LLMs demonstrate limited capacity for self-refinement even after receiving detailed feedback on constraint violations, highlighting a critical bottleneck in the development of robust tool-use agents. To facilitate future research, we release the data and code.
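The shape of such a step-level validator can be sketched in a few lines of plain Python. The constraint names and step format below are illustrative assumptions, not the benchmark's actual module:

```python
# Toy sketch of step-level constraint validation for tool use: each
# executed step is checked against constraint predicates, and every
# violation is reported as a (step_index, constraint_name) pair.

def validate_steps(steps, constraints):
    """Return a list of (step_index, constraint_name) violations."""
    violations = []
    for i, step in enumerate(steps):
        for name, check in constraints.items():
            if not check(step):
                violations.append((i, name))
    return violations

# Two toy constraints: a resource budget and a toolset restriction.
constraints = {
    "max_cost_3": lambda s: s.get("cost", 0) <= 3,
    "allowed_tool": lambda s: s.get("tool") in {"search", "calculator"},
}

steps = [
    {"tool": "search", "cost": 1},
    {"tool": "browser", "cost": 2},     # violates the toolset constraint
    {"tool": "calculator", "cost": 5},  # violates the resource budget
]
print(validate_steps(steps, constraints))
```

A real validator in the CCTU setting would additionally run inside the multi-turn loop and block or penalize a violating step before the next model turn.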
[NLP-20] Datasets for Verb Alternations across Languages: BLM Templates and Data Augmentation Strategies LREC2026
[Quick Read]: This paper targets the limited ability of large language models (LLMs) to handle cross-sentence paradigmatic grammatical phenomena such as verb alternations, whose systematic cross-sentence patterns (change-of-state and object-drop constructions in English, German, and Italian, and the Hebrew binyanim) remain insufficiently understood by LLMs. The key to the solution is a set of controlled paradigm-based datasets for four languages built on the Blackbird Language Matrices (BLMs) task, an RPM/ARC-like reasoning task devised specifically for language in which models must select the sentence that completes a pattern according to syntactic and semantic rules. Three template types of varying complexity and linguistically informed data augmentation strategies over synthetic and natural data extend the datasets, enabling an effective diagnosis of how well LLMs model cross-sentence semantic-syntactic regularities.
Link: https://arxiv.org/abs/2603.15295
Authors: Giuseppe Samo, Paola Merlo
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Databases (cs.DB)
Comments: 9 pages, 16 figures, accepted at LREC 2026
Abstract:Large language models (LLMs) have shown remarkable performance across various sentence-based linguistic phenomena, yet their ability to capture cross-sentence paradigmatic patterns, such as verb alternations, remains underexplored. In this work, we present curated paradigm-based datasets for four languages, designed to probe systematic cross-sentence knowledge of verb alternations (change-of-state and object-drop constructions in English, German and Italian, and Hebrew binyanim). The datasets comprise thousands of the Blackbird Language Matrices (BLMs) problems. The BLM task – an RPM/ARC-like task devised specifically for language – is a controlled linguistic puzzle where models must select the sentence that completes a pattern according to syntactic and semantic rules. We introduce three types of templates varying in complexity and apply linguistically-informed data augmentation strategies across synthetic and natural data. We provide simple baseline performance results across English, Italian, German, and Hebrew, that demonstrate the diagnostic usefulness of the datasets.
[NLP-21] From Documents to Spans: Code-Centric Learning for LLM-based ICD Coding
[Quick Read]: This paper addresses three major challenges in ICD coding with large language models (LLMs): existing public datasets cover only a limited portion of the ICD code space, making it hard for models to generalize to unseen codes; naive fine-tuning erodes LLM interpretability because explicit supporting evidence is absent; and long clinical documents make fine-tuning computationally expensive. The key to the solution is the Code-Centric Learning framework, whose core idea is to shift supervision from full clinical documents to scalable, short evidence spans (span-level learning). Through a mixed training strategy and code-centric data expansion, it substantially reduces training cost while improving accuracy on unseen ICD codes and preserving interpretability.
Link: https://arxiv.org/abs/2603.15270
Authors: Xu Zhang, Wenxin Ma, Chenxu Wu, Rongsheng Wang, Kun Zhang, S. Kevin Zhou
Affiliations: USTC; MIRACLE Center, Suzhou Institute for Advanced Research, USTC; Jiangsu Provincial Key Laboratory of Multimodal Digital Twin Technology; State Key Laboratory of Precision and Intelligent Chemistry, USTC
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:ICD coding is a critical yet challenging task in healthcare. Recently, LLM-based methods demonstrate stronger generalization than discriminative methods in ICD coding. However, fine-tuning LLMs for ICD coding faces three major challenges. First, existing public ICD coding datasets provide limited coverage of the ICD code space, restricting a model’s ability to generalize to unseen codes. Second, naive fine-tuning diminishes the interpretability of LLMs, as few public datasets contain explicit supporting evidence for assigned codes. Third, ICD coding typically involves long clinical documents, making fine-tuning LLMs computationally expensive. To address these issues, we propose Code-Centric Learning, a training framework that shifts supervision from full clinical documents to scalable, short evidence spans. The key idea of this framework is that span-level learning improves LLMs’ ability to perform document-level ICD coding. Our proposed framework consists of a mixed training strategy and code-centric data expansion, which substantially reduces training cost, improves accuracy on unseen ICD codes and preserves interpretability. Under the same LLM backbone, our method substantially outperforms strong baselines. Notably, our method enables small-scale LLMs to achieve performance comparable to much larger proprietary models, demonstrating its effectiveness and potential for fully automated ICD coding.
[NLP-22] Directional Embedding Smoothing for Robust Vision Language Models ICLR2026
[Quick Read]: This paper addresses the safety and reliability problems of vision-language models (VLMs) in deploying trustworthy agentic AI systems, in particular their vulnerability to multi-modal jailbreaking attacks that elicit harmful outputs. The key to the solution is extending the Randomized Embedding Smoothing and Token Aggregation (RESTA) defense to VLMs and introducing directional embedding noise, in which the injected noise is aligned with the original token embedding vectors, significantly lowering the attack success rate. Experiments show that RESTA serves as a lightweight, inference-time defense layer that effectively improves the safety of VLMs within agentic systems.
Link: https://arxiv.org/abs/2603.15259
Authors: Ye Wang, Jing Liu, Toshiaki Koike-Akino
Affiliations: Mitsubishi Electric Research Laboratories
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Accepted at ICLR 2026 Workshop on Agents in the Wild
Abstract:The safety and reliability of vision-language models (VLMs) are a crucial part of deploying trustworthy agentic AI systems. However, VLMs remain vulnerable to jailbreaking attacks that undermine their safety alignment to yield harmful outputs. In this work, we extend the Randomized Embedding Smoothing and Token Aggregation (RESTA) defense to VLMs and evaluate its performance against the JailBreakV-28K benchmark of multi-modal jailbreaking attacks. We find that RESTA is effective in reducing attack success rate over this diverse corpus of attacks, in particular, when employing directional embedding noise, where the injected noise is aligned with the original token embedding vectors. Our results demonstrate that RESTA can contribute to securing VLMs within agentic systems, as a lightweight, inference-time defense layer of an overall security framework.
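One simple way to realize noise "aligned with the original token embedding vectors" is to randomize only the magnitude along the embedding's own direction. The exact formulation used by RESTA may differ, so treat this as a toy numeric sketch (the update rule e' = e * (1 + sigma * z) and the sigma parameter are assumptions):

```python
# Minimal sketch of directional embedding noise: the perturbation is
# parallel to the original embedding, so only its magnitude changes.
import math
import random

def directional_noise(embedding, sigma, rng):
    """Perturb an embedding along its own direction: e' = e * (1 + sigma * z)."""
    z = rng.gauss(0.0, 1.0)
    scale = 1.0 + sigma * z
    return [x * scale for x in embedding]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

rng = random.Random(0)
e = [0.5, -1.0, 2.0]
e_noisy = directional_noise(e, sigma=0.1, rng=rng)
# The perturbed vector stays parallel to the original (|cosine| = 1),
# unlike isotropic Gaussian noise, which would change the direction.
print(round(abs(cosine(e, e_noisy)), 6))
```

In the full defense, several such noisy copies would be passed through the model and the resulting token predictions aggregated; this sketch only isolates the directional-noise step.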
[NLP-23] Bidirectional Chinese and English Passive Sentences Dataset for Machine Translation
[Quick Read]: This paper addresses the insufficient attention that machine translation (MT) evaluation pays to specific linguistic phenomena, in particular the structural differences between English and Chinese passive sentences. Because the two languages differ markedly in how passives are constructed and distributed, conventional metrics struggle to reflect model performance on this key grammatical phenomenon. The key to the solution is a bidirectional multi-domain parallel corpus of passive sentences comprising 73,965 parallel sentence pairs, with high-quality structure labels obtained through automatic annotation plus manual verification, enabling a systematic evaluation of mainstream open-source and commercial neural machine translation (NMT) models on passive-sentence transfer. Results show that models tend to preserve the voice of the source text rather than follow the target language's general voice conventions, and that they exhibit higher voice consistency in English-to-Chinese translation, indicating some awareness of the low frequency and predominantly negative context of Chinese passives.
Link: https://arxiv.org/abs/2603.15227
Authors: Xinyue Ma, Pol Pastells, Mireia Farrús, Mariona Taulé
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Databases (cs.DB)
Comments: 11 pages, 1 figure, Language Resources and Evaluation Conference 2026
Abstract:Machine Translation (MT) evaluation has gone beyond metrics, towards more specific linguistic phenomena. Regarding English-Chinese language pairs, passive sentences are constructed and distributed differently due to language variation, thus need special attention in MT. This paper proposes a bidirectional multi-domain dataset of passive sentences, extracted from five Chinese-English parallel corpora and annotated automatically with structure labels according to human translation, and a test set with manually verified annotation. The dataset consists of 73,965 parallel sentence pairs (2,358,731 English words, 3,498,229 Chinese characters). We evaluate two state-of-the-art open-source MT systems with our dataset, and four commercial models with the test set. The results show that, unlike humans, models are more influenced by the voice of the source text rather than the general voice usage of the source language, and therefore tend to maintain the passive voice when translating a passive in either direction. However, models demonstrate some knowledge of the low frequency and predominantly negative context of Chinese passives, leading to higher voice consistency with human translators in English-to-Chinese translation than in Chinese-to-English translation. Commercial NMT models scored higher in metric evaluations, but LLMs showed a better ability to use diverse alternative translations. Datasets and annotation script will be shared upon request.
[NLP-24] Efficient Document Parsing via Parallel Token Prediction CVPR2026
[Quick Read]: This paper addresses the inference-speed bottleneck in document parsing with vision-language models (VLMs) caused by autoregressive (AR) decoding. The core of the solution is Parallel-Token Prediction (PTP), a plugable, model-agnostic method that inserts learnable tokens into the input sequence and designs corresponding training objectives so that VLMs can generate multiple future tokens in parallel, improving sample efficiency and inference speed. Experiments on OmniDocBench and olmOCR-bench show that PTP not only accelerates decoding by 1.6x-2.2x but also reduces model hallucinations and strengthens generalization.
Link: https://arxiv.org/abs/2603.15206
Authors: Lei Li, Ze Zhao, Meng Li, Zhongwang Lun, Yi Yuan, Xingjing Lu, Zheng Wei, Jiang Bian, Zang Li
Affiliations: Tencent; Renmin University of China
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026 Findings
Abstract:Document parsing, as a fundamental yet crucial vision task, is being revolutionized by vision-language models (VLMs). However, the autoregressive (AR) decoding inherent to VLMs creates a significant bottleneck, severely limiting parsing speed. In this paper, we propose Parallel-Token Prediction (PTP), a plugable, model-agnostic and simple-yet-effective method that enables VLMs to generate multiple future tokens in parallel with improved sample efficiency. Specifically, we insert some learnable tokens into the input sequence and design corresponding training objectives to equip the model with parallel decoding capabilities for document parsing. Furthermore, to support effective training, we develop a comprehensive data generation pipeline that efficiently produces large-scale, high-quality document parsing training data for VLMs. Extensive experiments on OmniDocBench and olmOCR-bench demonstrate that our method not only significantly improves decoding speed (1.6x-2.2x) but also reduces model hallucinations and exhibits strong generalization abilities.
[NLP-25] The Hrunting of AI: Where and How to Improve English Dialectal Fairness
[Quick Read]: This paper addresses the poor performance of large language models (LLMs) on English dialects (Yorkshire, Geordie, and Cornish English, plus African-American Vernacular English, with West Frisian as control), where data scarcity makes models hard to improve. The study finds that human-human agreement on LLM generation quality directly affects LLM-as-a-judge performance: the pattern of LLM-human agreement mimics the pattern of human-human consensus, implying a fundamental limit on improving LLM performance in sparsely populated locales where human judgments diverge. The key takeaways are that data must be carefully evaluated to ensure fairness and inclusiveness, and that under scarcity, new tools are needed to handle this low-agreement bottleneck; encouragingly, some LLMs can generate high-quality data, offering a path toward scalable improvement.
Link: https://arxiv.org/abs/2603.15187
Authors: Wei Li, Adrian de Wynter
Affiliations: Boston College; Microsoft; The University of York
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:It is known that large language models (LLMs) underperform in English dialects, and that improving them is difficult due to data scarcity. In this work we investigate how quality and availability impact the feasibility of improving LLMs in this context. For this, we evaluate three rarely-studied English dialects (Yorkshire, Geordie, and Cornish), plus African-American Vernacular English, and West Frisian as control. We find that human-human agreement when determining LLM generation quality directly impacts LLM-as-a-judge performance. That is, LLM-human agreement mimics the human-human agreement pattern, and so do metrics such as accuracy. It is an issue because LLM-human agreement measures an LLM’s alignment with the human consensus; and hence raises questions about the feasibility of improving LLM performance in locales where low populations induce low agreement. We also note that fine-tuning does not eradicate, and might amplify, this pattern in English dialects. But also find encouraging signals, such as some LLMs’ ability to generate high-quality data, thus enabling scalability. We argue that data must be carefully evaluated to ensure fair and inclusive LLM improvement; and, in the presence of scarcity, new tools are needed to handle the pattern found.
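The abstract's agreement analysis can be made concrete with a standard chance-corrected agreement statistic such as Cohen's kappa. Whether the paper uses this particular metric is not stated, so the snippet below is only an illustration of how human-human or LLM-human agreement on quality labels might be quantified:

```python
# Cohen's kappa: chance-corrected agreement between two annotators
# over the same set of items (e.g. "good"/"bad" quality judgments).
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label marginals.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

human_1 = ["good", "bad", "good", "good", "bad", "good"]
human_2 = ["good", "bad", "bad", "good", "bad", "good"]
print(round(cohens_kappa(human_1, human_2), 3))
```

Computing this once for human-human pairs and once for LLM-human pairs would expose the mirrored agreement pattern the paper reports.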
[NLP-26] HindSight: Evaluating Research Idea Generation via Future Impact
[Quick Read]: This paper addresses the subjectivity of current evaluations of AI-generated research ideas and their disconnect from actual research impact: LLM judges and human panels cannot accurately reflect an idea's real scientific value. The key to the solution is HindSight, a time-split evaluation framework that, given a temporal cutoff T, restricts an idea generation system to pre-T knowledge and matches its outputs against real papers published in the subsequent 30 months, scoring idea quality objectively by citation impact and venue acceptance. Experiments show that HindSight reveals that retrieval-augmented generation produces substantially higher-impact research ideas than a vanilla baseline, and that its scores correlate negatively with LLM-judged novelty, indicating that LLMs tend to overvalue novel-sounding ideas that never lead to actual research output.
Link: https://arxiv.org/abs/2603.15164
Authors: Bo Jiang
Affiliations: Temple University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Evaluating AI-generated research ideas typically relies on LLM judges or human panels – both subjective and disconnected from actual research impact. We introduce HindSight, a time-split evaluation framework that measures idea quality by matching generated ideas against real future publications and scoring them by citation impact and venue acceptance. Using a temporal cutoff T, we restrict an idea generation system to pre-T literature, then evaluate its outputs against papers published in the subsequent 30 months. Experiments across 10 AI/ML research topics reveal a striking disconnect: LLM-as-Judge finds no significant difference between retrieval-augmented and vanilla idea generation (p = 0.584), while HindSight shows the retrieval-augmented system produces 2.5x higher-scoring ideas (p < 0.001). Moreover, HindSight scores are negatively correlated with LLM-judged novelty (ρ = -0.29, p < 0.01), suggesting that LLMs systematically overvalue novel-sounding ideas that never materialize in real research.
[NLP-27] To See is Not to Master: Teaching LLMs to Use Private Libraries for Code Generation
[Quick Read]: This paper addresses the weak performance of large language models (LLMs) in private-library-oriented code generation: even when accurate private-library API documentation is injected at inference time, LLMs still struggle to invoke these APIs effectively, because existing approaches rely on external documentation rather than systematically teaching API invocation. The key to the solution is PriCoder, which teaches LLMs to invoke private-library APIs through automatically synthesized data. Its innovation is to model data synthesis as graph construction and alternate between two graph operators: (1) Progressive Graph Evolution, which improves data diversity by progressively synthesizing varied training samples from basic ones, and (2) Multidimensional Graph Pruning, which improves data quality through a rigorous filtering pipeline. Experiments on three mainstream LLMs show that PriCoder substantially improves private-library code generation (gains of over 20% in pass@1) with negligible impact on general code generation capability.
Link: https://arxiv.org/abs/2603.15159
Authors: Yitong Zhang, Chengze Li, Ruize Chen, Guowei Yang, Xiaoran Jia, Yijie Ren, Jia Li
Affiliations: Tsinghua University; Proxseer Inc.; Nanjing University; Beijing Institute of Technology; Beihang University
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages
Abstract:Large Language Models (LLMs) have shown strong potential for code generation, yet they remain limited in private-library-oriented code generation, where the goal is to generate code using APIs from private libraries. Existing approaches mainly rely on retrieving private-library API documentation and injecting relevant knowledge into the context at inference time. However, our study shows that this is insufficient: even given accurate required knowledge, LLMs still struggle to invoke private-library APIs effectively. To address this limitation, we propose PriCoder, an approach that teaches LLMs to invoke private-library APIs through automatically synthesized data. Specifically, PriCoder models private-library data synthesis as the construction of a graph, and alternates between two graph operators: (1) Progressive Graph Evolution, which improves data diversity by progressively synthesizing more diverse training samples from basic ones, and (2) Multidimensional Graph Pruning, which improves data quality through a rigorous filtering pipeline. To support rigorous evaluation, we construct two new benchmarks based on recently released libraries that are unfamiliar to the tested models. Experiments on three mainstream LLMs show that PriCoder substantially improves private-library-oriented code generation, yielding gains of over 20% in pass@1 in many settings, while causing negligible impact on general code generation capability. Our code and benchmarks are publicly available at this https URL.
[NLP-28] Indirect Question Answering in English, German, and Bavarian: A Challenging Task for High- and Low-Resource Languages Alike LREC2026
[Quick Read]: This paper addresses the long-standing neglect of indirectness in NLP research, focusing on the multilingual challenge of Indirect Question Answering (IQA), whose goal is to classify the polarity of indirect answers and which proves hard in high-resource (English, German) and low-resource (Bavarian) languages alike. The key to the solution is two multilingual corpora: InQA+, a small, high-quality, hand-annotated evaluation set, and GenIQA, a larger training set generated by GPT-4o-mini, together with a systematic evaluation of multilingual transformer models (mBERT, XLM-R, mDeBERTa). The results show generally low IQA performance and severe overfitting, with label ambiguity, label-set design, and dataset size identified as key influencing factors; they also indicate that current LLMs lack the pragmatic understanding needed to generate high-quality IQA training data.
Link: https://arxiv.org/abs/2603.15130
Authors: Miriam Winkler, Verena Blaschke, Barbara Plank
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: To appear at LREC 2026
Abstract:Indirectness is a common feature of daily communication, yet is underexplored in NLP research for both low-resource as well as high-resource languages. Indirect Question Answering (IQA) aims at classifying the polarity of indirect answers. In this paper, we present two multilingual corpora for IQA of varying quality that both cover English, Standard German and Bavarian, a German dialect without standard orthography: InQA+, a small high-quality evaluation dataset with hand-annotated labels, and GenIQA, a larger training dataset, that contains artificial data generated by GPT-4o-mini. We find that IQA is a pragmatically hard task that comes with various challenges, based on several experiment variations with multilingual transformer models (mBERT, XLM-R and mDeBERTa). We suggest and employ recommendations to tackle these challenges. Our results reveal low performance, even for English, and severe overfitting. We analyse various factors that influence these results, including label ambiguity, label set and dataset size. We find that the IQA performance is poor in high- (English, German) and low-resource languages (Bavarian) and that it is beneficial to have a large amount of training data. Further, GPT-4o-mini does not possess enough pragmatic understanding to generate high-quality IQA data in any of our tested languages.
[NLP-29] MMKU-Bench: A Multimodal Update Benchmark for Diverse Visual Knowledge
[Quick Read]: This paper addresses the difficulty of keeping the parametric knowledge that multimodal models acquire during pretraining consistent with evolving real-world knowledge, noting that existing work ignores updating knowledge the model has already mastered but that later changes, and that evaluation is confined to a single modality without systematic cross-modal consistency analysis. The key to the solution is MMKU-Bench, a comprehensive benchmark for multimodal knowledge updating with over 25,000 knowledge instances and more than 49,000 images covering both updated-knowledge and unknown-knowledge scenarios, enabling comparative analysis of learning across knowledge types. A systematic evaluation of supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and knowledge editing (KE) on this benchmark shows that SFT and RLHF are prone to catastrophic forgetting, whereas KE better preserves general capabilities but is limited in continual updating, establishing a reliable and systematic evaluation framework for multimodal knowledge updating.
Link: https://arxiv.org/abs/2603.15117
Authors: Baochen Fu, Yuntao Du, Cheng Chang, Baihao Jin, Wenzhi Deng, Muhao Xu, Hongmei Yan, Weiye Song, Yi Wan
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:As real-world knowledge continues to evolve, the parametric knowledge acquired by multimodal models during pretraining becomes increasingly difficult to remain consistent with real-world knowledge. Existing research on multimodal knowledge updating focuses only on learning previously unknown knowledge, while overlooking the need to update knowledge that the model has already mastered but that later changes; moreover, evaluation is limited to the same modality, lacking a systematic analysis of cross-modal consistency. To address these issues, this paper proposes MMKU-Bench, a comprehensive evaluation benchmark for multimodal knowledge updating, which contains over 25k knowledge instances and more than 49k images, covering two scenarios, updated knowledge and unknown knowledge, thereby enabling comparative analysis of learning across different knowledge types. On this benchmark, we evaluate a variety of representative approaches, including supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and knowledge editing (KE). Experimental results show that SFT and RLHF are prone to catastrophic forgetting, while KE better preserve general capabilities but exhibit clear limitations in continual updating. Overall, MMKU-Bench provides a reliable and comprehensive evaluation benchmark for multimodal knowledge updating, advancing progress in this field.
[NLP-30] Bridging National and International Legal Data: Two Projects Based on the Japanese Legal Standard XML Schema for Comparative Law Studies
[Quick Read]: This paper addresses the comparability and interoperability of legal texts across jurisdictions, i.e., how to automatically match and visually compare corresponding provisions between different national legal systems. The key to the solution is an integrated framework: first, a JLS-to-Akoma Ntoso (AKN) conversion pipeline structurally standardizes Japanese statutes, ensuring compatibility with international legislative databases; second, multilingual embedding models and semantic textual similarity techniques identify correspondences between provisions across jurisdictions, with FAISS retrieval and Cross-Encoder reranking generating candidate matches that are visualized as cross-jurisdictional networks for exploratory comparative analysis.
Link: https://arxiv.org/abs/2603.15094
Authors: Makoto Nakamura
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 21 pages, 5 figures
Abstract:This paper presents an integrated framework for computational comparative law by connecting two consecutive research projects based on the Japanese Legal Standard (JLS) XML schema. The first project establishes structural interoperability by developing a conversion pipeline from JLS to the Akoma Ntoso (AKN) standard, enabling Japanese statutes to be integrated into international LegalDocML-based legislative databases. Building on this foundation, the second project applies multilingual embedding models and semantic textual similarity techniques to identify corresponding provisions across national legal systems. A prototype system combining multilingual embeddings, FAISS retrieval, and Cross-Encoder reranking generates candidate correspondences and visualizes them as cross-jurisdictional networks for exploratory comparative analysis.
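The retrieve-then-rerank pipeline the abstract describes has a simple two-stage shape. The sketch below substitutes brute-force cosine search for FAISS and a toy word-overlap scorer for the Cross-Encoder, so it illustrates the control flow rather than the actual system; all data and scorers are invented:

```python
# Two-stage provision matching: (1) cheap embedding retrieval over the
# whole corpus, (2) expensive pairwise rescoring of the shortlist only.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieve(query_vec, corpus_vecs, k):
    """Stage 1: top-k candidate indices by embedding similarity (FAISS stand-in)."""
    ranked = sorted(range(len(corpus_vecs)),
                    key=lambda i: cosine(query_vec, corpus_vecs[i]),
                    reverse=True)
    return ranked[:k]

def rerank(query_text, candidates, corpus_texts, score_fn):
    """Stage 2: re-score the shortlist pairwise (Cross-Encoder stand-in)."""
    return sorted(candidates,
                  key=lambda i: score_fn(query_text, corpus_texts[i]),
                  reverse=True)

corpus_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
corpus_texts = ["liability of the seller", "liability of the buyer", "court jurisdiction"]
overlap = lambda q, d: len(set(q.split()) & set(d.split()))

shortlist = retrieve([1.0, 0.05], corpus_vecs, k=2)
print(rerank("liability of the buyer", shortlist, corpus_texts, overlap))
```

The design point carries over to the real system: the cheap stage-1 scorer keeps the expensive pairwise model off the full corpus, so the Cross-Encoder only ever sees the top-k shortlist.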
[NLP-31] Writer-R1: Enhancing Generative Writing in LLM s via Memory-augmented Replay Policy Optimization
[Quick Read]: This paper addresses the difficulty of reward modeling and automatic evaluation in creative writing, which lacks verifiable reference answers and has long been constrained by high annotation cost, evaluative bias, and coarse feedback signals. The core of the solution is twofold: a multi-agent collaborative workflow based on Grounded Theory that dynamically produces interpretable, reusable fine-grained evaluation criteria through dimensional decomposition and hierarchical induction; and the Memory-augmented Replay Policy Optimization (MRPO) algorithm, which, without additional training, guides models to self-reflect against the dynamic criteria for controlled iterative improvement, and which combines supervised fine-tuning with reinforcement learning to convert the criteria into reward signals for end-to-end optimization.
Link: https://arxiv.org/abs/2603.15061
Authors: Jihao Zhao, Shuaishuai Zu, Zhiyuan Ji, Chunlai Zhou, Biao Qin
Affiliations: Renmin University of China
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:As a typical open-ended generation task, creative writing lacks verifiable reference answers, which has long constrained reward modeling and automatic evaluation due to high human annotation costs, evaluative bias, and coarse feedback signals. To address these challenges, this paper first designs a multi-agent collaborative workflow based on Grounded Theory, performing dimensional decomposition and hierarchical induction of the problem to dynamically produce interpretable and reusable fine-grained criteria. Furthermore, we propose the Memory-augmented Replay Policy Optimization (MRPO) algorithm: on the one hand, without additional training, MRPO guides models to engage in self-reflection based on dynamic criteria, enabling controlled iterative improvement; on the other hand, we adopt the training paradigm that combines supervised fine-tuning with reinforcement learning to convert evaluation criteria into reward signals, achieving end-to-end optimization. Experimental results demonstrate that the automatically constructed criteria achieve performance gains comparable to human annotations. Writer-R1-4B models trained with this approach outperform baselines across multiple creative writing tasks and surpass some 100B+ parameter open-source models.
[NLP-32] Thinking in Latents: Adaptive Anchor Refinement for Implicit Reasoning in LLMs ICLR2026
[Quick Read]: This paper addresses the high inference cost of token-level Chain-of-Thought (CoT) prompting for mathematical word problems, where long intermediate reasoning traces inflate output length. The key to the solution is the AdaAnchor framework, which silently and iteratively refines a set of latent anchor vectors attached to the input in latent space and introduces an adaptive halting mechanism that stops refinement once the anchor dynamics converge: easier instances receive fewer steps while harder ones retain more, yielding a better accuracy-efficiency trade-off under a shared maximum-step budget. Empirically, AdaAnchor improves accuracy by up to 5% over fixed-step latent reasoning on three math word-problem benchmarks while cutting average latent refinement steps by 48-60%, and reduces generated tokens by 92-93% compared to standard reasoning baselines.
Link: https://arxiv.org/abs/2603.15051
Authors: Disha Sheshanarayana, Rajat Subhra Pal, Manjira Sinha, Tirthankar Dasgupta
Affiliations: Manipal University Jaipur; TCS Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at ICLR 2026, LIT Workshop
Abstract:Token-level Chain-of-Thought (CoT) prompting has become a standard way to elicit multi-step reasoning in large language models (LLMs), especially for mathematical word problems. However, generating long intermediate traces increases output length and inference cost, and can be inefficient when the model could arrive at the correct answer without extensive verbalization. This has motivated latent-space reasoning approaches that shift computation into hidden representations and only emit a final answer. Yet, many latent reasoning methods depend on a fixed number of latent refinement steps at inference, adding another hyperparameter that must be tuned across models and datasets to balance accuracy and efficiency. We introduce AdaAnchor, a latent reasoning framework that performs silent iterative computation by refining a set of latent anchor vectors attached to the input. AdaAnchor further incorporates an adaptive halting mechanism that monitors anchor stability across iterations and terminates refinement once the anchor dynamics converge, allocating fewer steps to easier instances while reserving additional refinement steps for harder ones under a shared maximum-step budget. Our empirical evaluation across three mathematical word-problem benchmarks shows that AdaAnchor with adaptive halting yields accuracy gains of up to 5% over fixed-step latent refinement while reducing average latent refinement steps by 48-60% under the same maximum-step budget. Compared to standard reasoning baselines, AdaAnchor achieves large reductions in generated tokens (92-93%) by moving computation into silent latent refinement, offering a different accuracy-efficiency trade-off with substantially lower output-token usage.
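The halting rule, iterate until the anchor dynamics converge under a shared maximum-step budget, can be sketched numerically. The contraction map below stands in for the model's learned refinement step; the tolerance and update rule are purely illustrative, not the paper's:

```python
# Refinement with adaptive halting: iterate a refinement map on a
# latent vector and stop early once the per-step change drops below
# a tolerance, while never exceeding a shared maximum-step budget.
def refine_until_stable(anchor, step_fn, tol, max_steps):
    steps = 0
    for _ in range(max_steps):
        new_anchor = step_fn(anchor)
        delta = max(abs(a - b) for a, b in zip(anchor, new_anchor))
        anchor, steps = new_anchor, steps + 1
        if delta < tol:  # anchor dynamics converged: halt early
            break
    return anchor, steps

# A contraction toward a fixed point stands in for the model's update;
# harder instances would correspond to slower-converging dynamics.
target = [1.0, -2.0]
step = lambda v: [x + 0.5 * (t - x) for x, t in zip(v, target)]

anchor, used = refine_until_stable([0.0, 0.0], step, tol=1e-3, max_steps=50)
print(used, [round(x, 3) for x in anchor])
```

The early break is where the per-instance compute savings come from: an easy (fast-converging) input exits well before the 50-step budget that a fixed-step scheme would always pay.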
[NLP-33] Interpretable Predictability-Based AI Text Detection: A Replication Study
[Quick Read]: This paper addresses authorship attribution for machine-generated text, i.e., identifying the generating source from stylistic features. The key elements of the solution are: replacing the original system's GPT-2 models with newer multilingual language models (mDeBERTa-v3-base for contextual representations, Qwen and mGPT for probabilistic features); adding 26 document-level stylometric features, which markedly improve classification across both English and Spanish; and applying SHAP analysis to quantify feature importance and improve the interpretability of model decisions. Experiments show the approach matches or outperforms the original system on both subtasks, and that a shared multilingual configuration is competitive with or better than language-specific models.
Link: https://arxiv.org/abs/2603.15034
Authors: Adam Skurla, Dominik Macko, Jakub Simko
Affiliations: Brno University of Technology; Kempelen Institute of Intelligent Technologies
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:This paper replicates and extends the system used in the AuTexTification 2023 shared task for authorship attribution of machine-generated texts. First, we tried to reproduce the original results. Exact replication was not possible because of differences in data splits, model availability, and implementation details. Next, we tested newer multilingual language models and added 26 document-level stylometric features. We also applied SHAP analysis to examine which features influence the model’s decisions. We replaced the original GPT-2 models with newer generative models such as Qwen and mGPT for computing probabilistic features. For contextual representations, we used mDeBERTa-v3-base and applied the same configuration to both English and Spanish. This allowed us to use one shared configuration for Subtask 1 and Subtask 2. Our experiments show that the additional stylometric features improve performance in both tasks and both languages. The multilingual configuration achieves the results that are comparable to or better than language-specific models. The study also shows that clear documentation is important for reliable replication and fair comparison of systems.
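Document-level stylometric features of the kind the replication adds are straightforward to compute. The abstract does not enumerate its 26 features, so the three below (average word length, type-token ratio, average sentence length) are generic illustrative examples, not the paper's actual feature set:

```python
# Three simple document-level stylometric features.
def stylometric_features(text):
    # Crude tokenization: strip punctuation and lowercase.
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    words = [w for w in words if w]
    # Crude sentence splitting on terminal punctuation.
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    return {
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
        "avg_sentence_len": len(words) / len(sentences),
    }

feats = stylometric_features("The cat sat. The cat ran!")
print(feats)
```

A real system would feed a vector of such features, alongside the contextual and probabilistic ones, into the classifier, where SHAP can then attribute decisions to individual features.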
[NLP-34] Attention Residuals
[Quick Read]: This paper addresses the problem that standard residual connections with PreNorm in modern large language models (LLMs) accumulate all layer outputs with fixed unit weights, causing uncontrolled hidden-state growth with depth and progressively diluting each layer's contribution. The core of the solution is Attention Residuals (AttnRes), which replaces the fixed accumulation with softmax attention over preceding layer outputs, enabling flexible, content-dependent, and selective aggregation across depth. To reduce the memory and communication overhead of large-scale training, Block AttnRes partitions layers into blocks and attends only at the block level, preserving most of the gains at much lower cost and ultimately serving as a practical drop-in replacement for standard residual connections.
Link: https://arxiv.org/abs/2603.15031
Authors: Kimi Team: Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, Yutian Chen, Junjie Yan, Ming Wei, Y. Zhang, Fanqing Meng, Chao Hong, Xiaotong Xie, Shaowei Liu, Enzhe Lu, Yunpeng Tai, Yanru Chen, Xin Men, Haiqing Guo, Y. Charles, Haoyu Lu, Lin Sui, Jinguo Zhu, Zaida Zhou, Weiran He, Weixiao Huang, Xinran Xu, Yuzhi Wang, Guokun Lai, Yulun Du, Yuxin Wu, Zhilin Yang, Xinyu Zhou
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: attnres tech report
Abstract:Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer’s contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead. Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.
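The contrast between fixed unit-weight residual accumulation and softmax-weighted aggregation can be seen in a small numeric sketch. The per-layer relevance scores below stand in for the learned, input-dependent attention and are invented for illustration:

```python
# Unit-weight residual accumulation vs AttnRes-style softmax aggregation
# over preceding layer outputs (toy 2-dimensional hidden states).
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def standard_residual(layer_outputs):
    """PreNorm-style accumulation: fixed unit weights, so the norm grows with depth."""
    return [sum(col) for col in zip(*layer_outputs)]

def attnres_aggregate(layer_outputs, scores):
    """AttnRes-style: content-dependent softmax weights over earlier layers."""
    w = softmax(scores)
    return [sum(wi * x for wi, x in zip(w, col)) for col in zip(*layer_outputs)]

outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
scores = [0.1, 0.1, 2.0]  # toy relevance of each preceding layer

print(standard_residual(outputs))  # magnitude grows with every added layer
print([round(x, 3) for x in attnres_aggregate(outputs, scores)])
```

Because the softmax weights sum to one, the aggregated state stays a convex combination of layer outputs, which is the mechanism behind the bounded, non-diluting accumulation the abstract describes.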
[NLP-35] MER-Bench: A Comprehensive Benchmark for Multimodal Meme Reappraisal
【速读】: 该论文旨在解决如何在保持原始表情包(meme)场景、实体和结构布局不变的前提下,将其负面情绪框架转化为积极建设性表达的问题,即实现情感可控且结构保真的多模态内容转换。其解决方案的关键在于提出了一种新的任务范式——Meme Reappraisal(表情包再评估),并构建了MER-Bench基准数据集,该数据集包含细粒度的多模态标注信息(如源情感与目标情感、正向重写文本、视觉编辑规范及分类标签),同时设计了一个基于多模态大语言模型(Multimodal Large Language Model, MLLM)作为裁判的结构化评估框架,从模态级生成质量、情感控制能力、结构保真度和全局情感一致性四个维度进行量化评测,从而系统性地推动可控表情包编辑与情感感知的多模态生成研究。
链接: https://arxiv.org/abs/2603.15020
作者: Yiqi Nie,Fei Wang,Junjie Chen,Kun Li,Yudi Cai,Dan Guo,Chenglong Li,Meng Wang
机构: Anhui University (安徽大学); Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (合肥综合性国家科学中心人工智能研究院); School of Computer Science and Information Engineering, Hefei University of Technology (合肥工业大学计算机与信息工程学院); College of Information Technology, United Arab Emirates University (阿联酋大学信息学院); Institute of Advanced Technology, University of Science and Technology of China (中国科学技术大学先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Memes represent a tightly coupled, multimodal form of social expression, in which visual context and overlaid text jointly convey nuanced affect and commentary. Inspired by cognitive reappraisal in psychology, we introduce Meme Reappraisal, a novel multimodal generation task that aims to transform negatively framed memes into constructive ones while preserving their underlying scenario, entities, and structural layout. Unlike prior works on meme understanding or generation, Meme Reappraisal requires emotion-controllable, structure-preserving multimodal transformation under multiple semantic and stylistic constraints. To support this task, we construct MER-Bench, a benchmark of real-world memes with fine-grained multimodal annotations, including source and target emotions, positively rewritten meme text, visual editing specifications, and taxonomy labels covering visual type, sentiment polarity, and layout structure. We further propose a structured evaluation framework based on a multimodal large language model (MLLM)-as-a-Judge paradigm, decomposing performance into modality-level generation quality, affect controllability, structural fidelity, and global affective alignment. Extensive experiments across representative image-editing and multimodal-generation systems reveal substantial gaps in satisfying the constraints of structural preservation, semantic consistency, and affective transformation. We believe MER-Bench establishes a foundation for research on controllable meme editing and emotion-aware multimodal generation. Our code is available at: this https URL.
[NLP-36] Pretraining and Benchmarking Modern Encoders for Latvian
【速读】: 该论文旨在解决低资源语言(如拉脱维亚语)在预训练语料库中代表性不足的问题,以及现有单语拉脱维亚语编码器稀缺的现状。其关键解决方案是基于RoBERTa、DeBERTaV3和ModernBERT架构预训练一系列针对拉脱维亚语优化的编码器模型,包括长上下文变体,并在多样化的拉脱维亚语诊断与语言学基准测试中进行评估。实验表明,所提出的最佳模型lv-deberta-base(111M参数)在性能上优于更大的多语言基线模型及先前的拉脱维亚语专用编码器,体现了架构改进与效率提升的协同优势。
链接: https://arxiv.org/abs/2603.15005
作者: Arturs Znotins
机构: Institute of Mathematics and Computer Science, University of Latvia (拉脱维亚大学数学与计算机科学研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:Encoder-only transformers remain essential for practical NLP tasks. While recent advances in multilingual models have improved cross-lingual capabilities, low-resource languages such as Latvian remain underrepresented in pretraining corpora, and few monolingual Latvian encoders currently exist. We address this gap by pretraining a suite of Latvian-specific encoders based on RoBERTa, DeBERTaV3, and ModernBERT architectures, including long-context variants, and evaluating them across a diverse set of Latvian diagnostic and linguistic benchmarks. Our models are competitive with existing monolingual and multilingual encoders while benefiting from recent architectural and efficiency advances. Our best model, lv-deberta-base (111M parameters), achieves the strongest overall performance, outperforming larger multilingual baselines and prior Latvian-specific encoders. We release all pretrained models and evaluation resources to support further research and practical applications in Latvian NLP.
[NLP-37] Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI KDD2026
【速读】: 该论文旨在解决当前代理型人工智能(Agentic AI)系统评估方法碎片化的问题,即现有评测仅关注孤立能力(如编码、幻觉抑制、越狱抵抗或工具使用),而未能在具有代表性的社会技术场景分布上全面衡量其可信性。解决方案的关键在于提出一种系统性的评估范式——全息代理评估框架(Holographic Agent Assessment Framework, HAAF),该框架通过构建涵盖任务类型、工具接口、交互动态、社会情境与风险等级的场景流形(scenario manifold),并集成静态认知与策略分析、交互式沙箱模拟、社会伦理对齐评估以及分布感知的代表性采样引擎四个互补模块,实现对高后果稀有风险(tail risks)的敏感捕捉和覆盖。整个过程由一个迭代式的可信优化工厂驱动,通过红队探测与蓝队加固循环,逐步缩小漏洞直至满足部署标准,从而推动代理评估从离散基准测试向真实世界可信性的转变。
链接: https://arxiv.org/abs/2603.14987
作者: Jinhu Qi,Yifan Li,Minghao Zhao,Wentao Zhang,Zijian Zhang,Yaoman Li,Irwin King
机构: The Chinese University of Hong Kong(香港中文大学); Macao Polytechnic University(澳门理工学院); Jilin University(吉林大学)
类目: Computation and Language (cs.CL); Databases (cs.DB)
备注: 6 pages, 1 figure. Submitted to KDD 2026 Blue Sky Track
Abstract:As agentic AI systems move beyond static question answering into open-ended, tool-augmented, and multi-step real-world workflows, their increased authority poses greater risks of system misuse and operational failures. However, current evaluation practices remain fragmented, measuring isolated capabilities such as coding, hallucination, jailbreak resistance, or tool use in narrowly defined settings. We argue that the central limitation is not merely insufficient coverage of evaluation dimensions, but the lack of a principled notion of representativeness: an agent’s trustworthiness should be assessed over a representative socio-technical scenario distribution rather than a collection of disconnected benchmark instances. To this end, we propose the Holographic Agent Assessment Framework (HAAF), a systematic evaluation paradigm that characterizes agent trustworthiness over a scenario manifold spanning task types, tool interfaces, interaction dynamics, social contexts, and risk levels. The framework integrates four complementary components: (i) static cognitive and policy analysis, (ii) interactive sandbox simulation, (iii) social-ethical alignment assessment, and (iv) a distribution-aware representative sampling engine that jointly optimizes coverage and risk sensitivity – particularly for rare but high-consequence tail risks that conventional benchmarks systematically overlook. These components are connected through an iterative Trustworthy Optimization Factory. Through cycles of red-team probing and blue-team hardening, this paradigm progressively narrows the vulnerabilities to meet deployment standards, shifting agent evaluation from benchmark islands toward representative, real-world trustworthiness. Code and data for the illustrative instantiation are available at this https URL.
[NLP-38] Rethinking LLM Watermark Detection in Black-Box Settings: A Non-Intrusive Third-Party Framework
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)水印方案中存在的一大治理难题:现有基于密钥的水印检测机制将检测与注入紧密耦合,导致第三方无法在不掌握密钥或依赖服务提供商专用检测器的情况下进行独立审计,从而阻碍了实际应用中的可信治理。解决方案的关键在于提出TTP-Detect——一个开创性的黑盒框架,通过解耦检测与注入过程,将验证问题重构为相对假设检验(relative hypothesis testing),利用代理模型增强水印相关信号,并结合一系列互补的相对度量来评估查询文本与水印分布的一致性,从而实现无需访问原始密钥或模型内部信息的非侵入式第三方水印验证。
链接: https://arxiv.org/abs/2603.14968
作者: Zhuoshang Wang,Yubing Ren,Yanan Cao,Fang Fang,Xiaoxue Li,Li Guo
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院); National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT/CC) (国家计算机网络应急技术处理协调中心)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
Abstract:While watermarking serves as a critical mechanism for LLM provenance, existing secret-key schemes tightly couple detection with injection, requiring access to keys or provider-side scheme-specific detectors for verification. This dependency creates a fundamental barrier for real-world governance, as independent auditing becomes impossible without compromising model security or relying on the opaque claims of service providers. To resolve this dilemma, we introduce TTP-Detect, a pioneering black-box framework designed for non-intrusive, third-party watermark verification. By decoupling detection from injection, TTP-Detect reframes verification as a relative hypothesis testing problem. It employs a proxy model to amplify watermark-relevant signals and a suite of complementary relative measurements to assess the alignment of the query text with watermarked distributions. Extensive experiments across representative watermarking schemes, datasets and models demonstrate that TTP-Detect achieves superior detection performance and robustness against diverse attacks.
[NLP-39] LLM as Graph Kernel: Rethinking Message Passing on Text-Rich Graphs
【速读】: 该论文旨在解决文本丰富的图(text-rich graphs)在现有学习范式中面临的挑战,即传统方法及大语言模型(LLM)混合架构通常将丰富文本压缩为静态嵌入或摘要后再进行结构推理,导致信息瓶颈并使更新过程脱离原始文本内容。其解决方案的关键在于提出RAMP(Raw-text Anchored Message Passing),该方法不再将LLM仅作为特征提取器,而是将其重构为原生适用于图结构的聚合算子;通过一种新颖的双表示机制,在每次迭代中锚定于每个节点的原始文本,同时从邻居节点传播动态优化的消息,从而实现文本与结构关系的深度融合,并统一处理判别式与生成式任务。
链接: https://arxiv.org/abs/2603.14937
作者: Ying Zhang,Hang Yu,Haipeng Zhang,Peng Di
机构: Ant Group(蚂蚁集团); ShanghaiTech University(上海科技大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 20 pages, 5 figures. Work in progress
Abstract:Text-rich graphs, which integrate complex structural dependencies with abundant textual information, are ubiquitous yet remain challenging for existing learning paradigms. Conventional methods and even LLM-hybrids compress rich text into static embeddings or summaries before structural reasoning, creating an information bottleneck and detaching updates from the raw content. We argue that in text-rich graphs, the text is not merely a node attribute but the primary medium through which structural relationships are manifested. We introduce RAMP, a Raw-text Anchored Message Passing approach that moves beyond using LLMs as mere feature extractors and instead recasts the LLM itself as a graph-native aggregation operator. RAMP exploits the text-rich nature of the graph via a novel dual-representation scheme: it anchors inference on each node’s raw text during each iteration while propagating dynamically optimized messages from neighbors. It further handles both discriminative and generative tasks under a single unified generative formulation. Extensive experiments show that RAMP effectively bridges the gap between graph propagation and deep text reasoning, achieving competitive performance and offering new insights into the role of LLMs as graph kernels for general-purpose graph learning.
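RAMP 把 LLM 本身当作图上的聚合算子:每轮迭代锚定节点原始文本,并传播邻居的动态消息。下面用一个可替换的 aggregate 回调示意这一消息传递骨架(aggregate 在真实系统中是一次 LLM 调用,此处为假设的占位函数):

```python
def ramp_step(graph, raw_text, messages, aggregate):
    # 一轮消息传递:每个节点看到自己的原始文本(锚点)
    # 与邻居上一轮的消息,由 aggregate(LLM 的占位)生成新消息
    new_msgs = {}
    for node, nbrs in graph.items():
        nbr_msgs = [messages[n] for n in nbrs]
        new_msgs[node] = aggregate(raw_text[node], nbr_msgs)
    return new_msgs

def run_ramp(graph, raw_text, aggregate, n_iters=2):
    # 初始消息即各节点原文;原文在每轮都重新作为锚点输入,
    # 对应论文的"双表示"思路:原文不被压缩进静态嵌入
    messages = dict(raw_text)
    for _ in range(n_iters):
        messages = ramp_step(graph, raw_text, messages, aggregate)
    return messages
```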
[NLP-40] Fine-tuning RoBERTa for CVE-to-CWE Classification: A 125M Parameter Model Competitive with LLMs
【速读】: 该论文旨在解决将通用漏洞披露(Common Vulnerabilities and Exposures, CVE)描述自动映射到通用弱点枚举(Common Weakness Enumeration, CWE)类别这一任务中的准确性与泛化能力问题。其关键解决方案是基于RoBERTa-base架构(125M参数)构建一个微调分类器,利用Claude Sonnet 4.6生成并精炼的高质量标签数据(共234,770条CVE描述)训练模型,并通过NVD与AI标签一致的数据集进行筛选以提升标注可靠性。该方法在稀有CWE类别上的表现显著优于传统TF-IDF基线(Macro F1提升15.5个百分点),且在外部CTI-Bench基准测试中达到75.6%严格准确率,仅用64倍更少参数即逼近8B参数模型性能,验证了小规模模型在安全领域细粒度分类任务中的有效性。
链接: https://arxiv.org/abs/2603.14911
作者: Nikita Mosievskiy
机构: Independent Researcher(独立研究员)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: 9 pages, 2 figures, 6 tables. Dataset: this https URL Model: this https URL
Abstract:We present a fine-tuned RoBERTa-base classifier (125M parameters) for mapping Common Vulnerabilities and Exposures (CVE) descriptions to Common Weakness Enumeration (CWE) categories. We construct a large-scale training dataset of 234,770 CVE descriptions with AI-refined CWE labels using Claude Sonnet 4.6, and agreement-filtered evaluation sets where NVD and AI labels agree. On our held-out test set (27,780 samples, 205 CWE classes), the model achieves 87.4% top-1 accuracy and 60.7% Macro F1 – a +15.5 percentage-point Macro F1 gain over a TF-IDF baseline that already reaches 84.9% top-1, demonstrating the model’s advantage on rare weakness categories. On the external CTI-Bench benchmark (NeurIPS 2024), the model achieves 75.6% strict accuracy (95% CI: 72.8-78.2%) – statistically indistinguishable from Cisco Foundation-Sec-8B-Reasoning (75.3%, 8B parameters) at 64x fewer parameters. We release the dataset, model, and training code.
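摘要强调 Macro F1 比 top-1 准确率更能反映稀有 CWE 类别上的表现(TF-IDF 基线 top-1 已达 84.9%,但 Macro F1 落后 15.5 个百分点)。下面给出 Macro F1 的一个最小实现以说明二者的差异(示意代码,非论文实现):

```python
def macro_f1(gold, pred):
    # 对每个类别分别算 F1 再做未加权平均:
    # 稀有类与高频类权重相同,因此对长尾类别的错误敏感
    labels = sorted(set(gold) | set(pred))
    f1s = []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)
```

例如 10 条样本中 9 条属于高频类 A、1 条属于稀有类 B,模型全部预测 A:top-1 准确率 90%,但 Macro F1 只有约 0.47,稀有类被完全忽略。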
[NLP-41] ExPosST: Explicit Positioning with Adaptive Masking for LLM-Based Simultaneous Machine Translation
【速读】: 该论文旨在解决将仅解码器(decoder-only)的大语言模型(Large Language Models, LLMs)应用于同时机器翻译(Simultaneous Machine Translation, SimulMT)时存在的位置错位(positional mismatch)问题,该问题导致解码效率与位置一致性之间的权衡困境。现有方法通常依赖特定的位置编码或精心设计的提示策略,难以在推理效率、位置一致性及模型兼容性之间取得平衡。解决方案的关键在于提出ExPosST框架,通过显式位置分配(explicit position allocation)机制,在输入源词元(source tokens)对应位置预留固定的位置槽位(positional slots),从而支持跨不同位置编码方式的高效KV缓存解码;此外,引入策略一致的微调策略(policy-consistent fine-tuning),使训练阶段的行为与推理时的解码行为保持一致,有效提升模型在多种翻译策略下的性能表现。
链接: https://arxiv.org/abs/2603.14903
作者: Yuzhe Shang,Pengzhi Gao,Yazheng Yang,Jiayao Ma,Wei Liu,Jian Luan,Jingsong Su
机构: Xiamen University (厦门大学); Xiaomi Inc. (小米公司); The University of Hong Kong (香港大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have recently demonstrated promising performance in simultaneous machine translation (SimulMT). However, applying decoder-only LLMs to SimulMT introduces a positional mismatch, which leads to a dilemma between decoding efficiency and positional consistency. Existing approaches often rely on specific positional encodings or carefully designed prompting schemes, and thus fail to simultaneously achieve inference efficiency, positional consistency, and broad model compatibility. In this work, we propose ExPosST, a general framework that resolves this dilemma through explicit position allocation. ExPosST reserves fixed positional slots for incoming source tokens, enabling efficient decoding with KV cache across different positional encoding methods. To further bridge the gap between fine-tuning and inference, we introduce a policy-consistent fine-tuning strategy that aligns training with inference-time decoding behavior. Experiments across multiple language pairs demonstrate that ExPosST effectively supports simultaneous translation under diverse policies.
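ExPosST 的显式位置分配可以这样直观理解:为源词元预留固定的位置槽,目标词元从槽区之后开始编号,于是已缓存词元的位置不随源文本的流式到达而改变,KV 缓存无需重算。下面是按此思路的一个示意(具体编号方式为假设,仅用于说明"位置前缀稳定"这一性质):

```python
def allocate_positions(max_src_len, n_src_read, n_tgt):
    # 源词元占用预留槽 [0, max_src_len);目标词元紧随其后。
    # 已分配词元的位置不随新到源词元改变,KV 缓存因此可复用
    src_pos = list(range(n_src_read))
    tgt_pos = [max_src_len + j for j in range(n_tgt)]
    return src_pos, tgt_pos
```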
[NLP-42] LLMs as Signal Detectors: Sensitivity Bias and the Temperature-Criterion Analogy
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)校准评估中存在的重要局限性问题,即现有指标(如期望校准误差,Expected Calibration Error)将模型的判别能力(敏感性,sensitivity)与响应倾向(偏差,bias)混为一谈,导致无法准确区分模型性能差异。其解决方案的关键在于引入信号检测理论(Signal Detection Theory, SDT)的完整参数化框架,包括不等方差模型拟合、准则估计和z-ROC分析,从而分离出敏感性和偏差两个独立维度。研究通过将LLMs视为信号探测器,在168,000次事实判断任务中验证温度参数是否如同人类心理物理学中的奖惩操纵一样仅改变准则(criterion),结果发现温度不仅改变了信心输出,还同时提升了敏感性(AUC),表明该类比失效;进一步揭示不同模型在敏感性-偏差空间中占据不同位置,而这些差异无法被传统校准指标捕捉,证明SDT全参数框架可提供诊断性信息,是更精细的LLM评估工具。
链接: https://arxiv.org/abs/2603.14893
作者: Jon-Paul Cacioli
机构: Independent Researcher, Melbourne, Australia
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 8 figures, 2 tables
Abstract:Large language models (LLMs) are evaluated for calibration using metrics such as Expected Calibration Error that conflate two distinct components: the model’s ability to discriminate correct from incorrect answers (sensitivity) and its tendency toward confident or cautious responding (bias). Signal Detection Theory (SDT) decomposes these components. While SDT-derived metrics such as AUROC are increasingly used, the full parametric framework - unequal-variance model fitting, criterion estimation, z-ROC analysis - has not been applied to LLMs as signal detectors. In this pre-registered study, we treat three LLMs as observers performing factual discrimination across 168,000 trials and test whether temperature functions as a criterion shift analogous to payoff manipulations in human psychophysics. Critically, this analogy may break down because temperature changes the generated answer itself, not only the confidence assigned to it. Our results confirm the breakdown with temperature simultaneously increasing sensitivity (AUC) and shifting criterion. All models exhibited unequal-variance evidence distributions (z-ROC slopes 0.52-0.84), with instruct models showing more extreme asymmetry (0.52-0.63) than the base model (0.77-0.87) or human recognition memory (~0.80). The SDT decomposition revealed that models occupying distinct positions in sensitivity-bias space could not be distinguished by calibration metrics alone, demonstrating that the full parametric framework provides diagnostic information unavailable from existing metrics.
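信号检测理论对"敏感性"与"偏差"的分解,可用等方差模型的两个量来说明:d′(分辨力)与准则 c(响应偏向)。以下用标准库 statistics.NormalDist 给出计算示意(等方差是此处的简化假设;论文实际拟合的是不等方差模型):

```python
from statistics import NormalDist

def dprime_and_criterion(hit_rate, fa_rate):
    # 等方差 SDT:d' = z(H) - z(FA),c = -(z(H) + z(FA)) / 2
    # d' 度量区分正误的能力,c 度量保守/宽松的响应倾向
    z = NormalDist().inv_cdf
    d_prime = z(hit_rate) - z(fa_rate)
    c = -0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, c
```

对称的 (H=0.8, FA=0.2) 给出 c=0(无偏);(H=0.9, FA=0.4) 的 c<0,对应更宽松的准则——若温度只起"准则移动"的作用,应当只有 c 变化而 d′ 不变,论文正是据此检验并否定了该类比。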
[NLP-43] Decision-Level Ordinal Modeling for Multimodal Essay Scoring with Large Language Models
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在自动作文评分(Automated Essay Scoring, AES)任务中因采用自回归文本生成范式而导致评分决策隐式化的问题,尤其在多模态 AES 中,视觉信息对不同评分维度的贡献不一致时,传统方法表现受限。其解决方案的关键在于提出决策层级有序建模(Decision-Level Ordinal Modeling, DLOM),通过复用语言模型头部提取预定义分数标记的 logits,将评分转化为显式的有序决策过程,从而实现对分数空间的直接优化与分析。针对多模态场景,DLOM-GF 引入门控融合模块以自适应整合文本与多模态得分 logits;对于纯文本场景,DLOM-DA 增加距离感知正则项以更好地捕捉有序分数间的语义距离,显著提升评分准确性和鲁棒性。
链接: https://arxiv.org/abs/2603.14891
作者: Han Zhang,Jiamin Su,Li liu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Automated essay scoring (AES) predicts multiple rubric-defined trait scores for each essay, where each trait follows an ordered discrete rating scale. Most LLM-based AES methods cast scoring as autoregressive token generation and obtain the final score via decoding and parsing, making the decision implicit. This formulation is particularly sensitive in multimodal AES, where the usefulness of visual inputs varies across essays and traits. To address these limitations, we propose Decision-Level Ordinal Modeling (DLOM), which makes scoring an explicit ordinal decision by reusing the language model head to extract score-wise logits on predefined score tokens, enabling direct optimization and analysis in the score space. For multimodal AES, DLOM-GF introduces a gated fusion module that adaptively combines textual and multimodal score logits. For text-only AES, DLOM-DA adds a distance-aware regularization term to better reflect ordinal distances. Experiments on the multimodal EssayJudge dataset show that DLOM improves over a generation-based SFT baseline across scoring traits, and DLOM-GF yields further gains when modality relevance is heterogeneous. On the text-only ASAP/ASAP++ benchmarks, DLOM remains effective without visual inputs, and DLOM-DA further improves performance and outperforms strong representative baselines.
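DLOM 的核心操作是从语言模型头的完整词表 logits 中只取预定义分数词元对应的 logits,在分数空间内归一化后直接优化。下面给出这一步以及一种距离感知正则项的纯 Python 示意(正则项的具体形式为假设,仅体现"有序距离"的思想):

```python
import math

def score_probs(vocab_logits, score_token_ids):
    # 只保留分数词元的 logits,softmax 得到显式的分数分布
    sel = [vocab_logits[i] for i in score_token_ids]
    m = max(sel)
    es = [math.exp(x - m) for x in sel]
    s = sum(es)
    return [e / s for e in es]

def distance_aware_penalty(probs, scores, gold):
    # 假设性的距离感知项:预测分布到真实分数的期望有序距离,
    # 把概率压向相邻分数而非任意错误分数
    return sum(p * abs(s - gold) for p, s in zip(probs, scores))
```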
[NLP-44] Developing an English-Efik Corpus and Machine Translation System for Digitization Inclusion EACL
【速读】: 该论文旨在解决低资源语言(如埃菲克语)在现代自然语言处理系统中严重缺失的问题,特别是针对其在机器翻译研究中的代表性不足。解决方案的关键在于利用一个由社区构建的小规模平行语料库(13,865句对),对先进的多语言神经机器翻译模型(mT5 和 NLLB200)进行微调,其中 NLLB200 表现出更优性能(英语-埃菲克 BLEU 为 26.64,埃菲克-英语 BLEU 为 31.21),验证了在有限数据条件下开发实用机器翻译工具的可行性,并强调了包容性数据实践与文化相关评估对于实现公平自然语言处理的重要性。
链接: https://arxiv.org/abs/2603.14873
作者: Offiong Bassey Edet,Mbuotidem Sunday Awak,Emmanuel Oyo-Ita,Benjamin Okon Nyong,Ita Etim Bassey
机构: University of Cross River State(十字河州大学); ML Collective(机器学习集体); Arthur Jarvis University(亚瑟·贾维斯大学); University of Calabar(卡拉巴尔大学)
类目: Computation and Language (cs.CL)
备注: 8 pages, 1 figure, accepted at AfricaNLP 2026 (co-located with EACL)
Abstract:Low-resource languages serve as invaluable repositories of human history, preserving cultural and intellectual diversity. Despite their significance, they remain largely absent from modern natural language processing systems. While progress has been made for widely spoken African languages such as Swahili, Yoruba, and Amharic, smaller indigenous languages like Efik continue to be underrepresented in machine translation research. This study evaluates the effectiveness of state-of-the-art multilingual neural machine translation models for English-Efik translation, leveraging a small-scale, community-curated parallel corpus of 13,865 sentence pairs. We fine-tuned both the mT5 multilingual model and the NLLB200 model on this dataset. NLLB-200 outperformed mT5, achieving BLEU scores of 26.64 for English-Efik and 31.21 for Efik-English, with corresponding chrF scores of 51.04 and 47.92, indicating improved fluency and semantic fidelity. Our findings demonstrate the feasibility of developing practical machine translation tools for low-resource languages and highlight the importance of inclusive data practices and culturally grounded evaluation in advancing equitable NLP.
[NLP-45] Shopping Companion: A Memory-Augmented LLM Agent for Real-World E-Commerce Tasks ACL2026
【速读】: 该论文旨在解决电商场景中大语言模型(Large Language Model, LLM)代理在长期偏好感知购物任务中的两大瓶颈问题:一是缺乏针对长对话中用户偏好建模的评估基准,二是现有系统将偏好识别与购物辅助割裂设计,导致端到端优化困难。其解决方案的关键在于提出一个包含120万真实商品、覆盖两个购物任务的新型基准,并构建名为Shopping Companion的统一框架,该框架联合优化记忆检索与购物辅助能力,同时支持用户干预;并通过引入基于工具的双奖励强化学习策略,有效应对多轮交互中稀疏且不连续的奖励信号,从而显著提升偏好捕捉精度与任务成功率。
链接: https://arxiv.org/abs/2603.14864
作者: Zijian Yu,Kejun Xiao,Huaipeng Zhao,Tao Luo,Xiaoyi Zeng
机构: Alibaba International Digital Commercial Group (阿里巴巴国际数字商业集团)
类目: Computation and Language (cs.CL)
备注: Submitted to ACL 2026
Abstract:In e-commerce, LLM agents show promise for shopping tasks such as recommendations, budgeting, and bundle deals, where accurately capturing user preferences from long-term conversations is critical. However, two challenges hinder realizing this potential: (1) the absence of benchmarks for evaluating long-term preference-aware shopping tasks, and (2) the lack of end-to-end optimization due to existing designs that treat preference identification and shopping assistance as separate components. In this paper, we introduce a novel benchmark with a long-term memory setup, spanning two shopping tasks over 1.2 million real-world products, and propose Shopping Companion, a unified framework that jointly tackles memory retrieval and shopping assistance while supporting user intervention. To train such capabilities, we develop a dual-reward reinforcement learning strategy with tool-wise rewards to handle the sparse and discontinuous rewards inherent in multi-turn interactions. Experimental results demonstrate that even state-of-the-art models (such as GPT-5) achieve success rates under 70% on our benchmark, highlighting the significant challenges in this domain. Notably, our lightweight LLM, trained with Shopping Companion, consistently outperforms strong baselines, achieving better preference capture and task performance, which validates the effectiveness of our unified design.
[NLP-46] ContiGuard: A Framework for Continual Toxicity Detection Against Evolving Evasive Perturbations
【速读】: 该论文旨在解决在线社交环境中毒性内容(toxic content)持续演化带来的检测挑战,特别是恶意用户通过不断变化的扰动(perturbations)隐藏毒性信息以规避静态检测模型的问题。传统方法因缺乏动态更新能力而难以应对这些演化的攻击策略,导致检测性能随时间下降。解决方案的关键在于提出ContiGuard框架,首次实现对时变扰动文本的持续毒性检测(continual toxicity detection)。其核心创新包括:一是基于大语言模型(LLM)的语义增强策略,动态注入由LLM挖掘出的潜在语义和毒性线索,提升对扰动文本的理解能力;二是判别驱动的特征学习策略,强化判别性强的关键特征并抑制非关键特征,从而构建鲁棒的分类边界,使检测器在面对持续演变的扰动时仍能保持高敏感性和稳定性。
链接: https://arxiv.org/abs/2603.14843
作者: Hankun Kang,Xin Miao,Jianhao Chen,Jintao Wen,Mayi Xu,Weiyu Zhang,Wenpeng Lu,Tieyun Qian
机构: Wuhan University (武汉大学); Zhongguancun Academy (中关村学院); Shandong Computer Science Center (国家超算济南中心); Qilu University of Technology (山东科学院); Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing (山东省算力互联网与服务计算重点实验室); Shandong Fundamental Research Center for Computer Science (山东省计算机科学基础研究中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Toxicity detection mitigates the dissemination of toxic content (e.g., hateful comments, posts, and messages within online social actions) to safeguard a healthy online social environment. However, malicious users persistently develop evasive perturbations to disguise toxic content and evade detectors. Traditional detectors or methods are static over time and are inadequate in addressing these evolving evasion tactics. Thus, continual learning emerges as a logical approach to dynamically update detection ability against evolving perturbations. Nevertheless, disparities across perturbations hinder the detector’s continual learning on perturbed text. More importantly, perturbation-induced noises distort semantics to degrade comprehension and also impair critical feature learning to render detection sensitive to perturbations. These amplify the challenge of continual learning against evolving perturbations. In this work, we present ContiGuard, the first framework tailored for continual learning of the detector on time-evolving perturbed text (termed continual toxicity detection) to enable the detector to continually update capability and maintain sustained resilience against evolving perturbations. Specifically, to boost the comprehension, we present an LLM-powered semantic enriching strategy, where we dynamically incorporate possible meaning and toxicity-related clues excavated by LLM into the perturbed text to improve the comprehension. To mitigate non-critical features and amplify critical ones, we propose a discriminability-driven feature learning strategy, where we strengthen discriminative features while suppressing the less-discriminative ones to shape a robust classification boundary for detection…
[NLP-47] The Impact of Ideological Discourses in RAG: A Case Study with COVID-19 Treatments
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)框架中外部意识形态文本对大型语言模型(Large Language Models, LLMs)输出影响的问题,特别是如何识别和量化这种影响以降低无意偏见及恶意操纵风险。其解决方案的关键在于构建一个基于新冠治疗争议性与支持性文献的语料库,并采用词汇多维分析(Lexical Multidimensional Analysis, LMDA)框架提取其中的意识形态维度;随后通过两种不同提示(prompt)策略——一种包含用户问题和意识形态文本,另一种额外加入LMDA描述——引导LLMs生成响应,并利用余弦相似度衡量参考文本与模型输出在词法和语义层面的意识形态一致性,从而验证检索到的意识形态文本能显著提升LLMs输出与外部知识源的对齐程度。
链接: https://arxiv.org/abs/2603.14838
作者: Elmira Salari(1),Maria Claudia Nunes Delfino(2),Hazem Amamou(3),José Victor de Souza(3),Shruti Kshirsagar(1),Alan Davoust(4),Anderson Avila(3) ((1) Wichita State University, (2) Pontifícia Universidade Católica de São Paulo, (3) Institut national de la recherche scientifique, (4) Université du Québec en Outaouais)
机构: Wichita State University (威奇托州立大学); Pontifícia Universidade Católica de São Paulo (圣保罗天主教大学); Institut national de la recherche scientifique (国家科学研究院); Université du Québec en Outaouais (魁北克省Outaouais大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper studies the impact of retrieved ideological texts on the outputs of large language models (LLMs). While interest in understanding ideology in LLMs has recently increased, little attention has been given to this issue in the context of Retrieval-Augmented Generation (RAG). To fill this gap, we design an external knowledge source based on ideological loaded texts about COVID-19 treatments. Our corpus is based on 1,117 academic articles representing discourses about controversial and endorsed treatments for the disease. We propose a corpus linguistics framework, based on Lexical Multidimensional Analysis (LMDA), to identify the ideologies within the corpus. LLMs are tasked to answer questions derived from three identified ideological dimensions, and two types of contextual prompts are adopted: the first comprises the user question and ideological texts; and the second contains the question, ideological texts, and LMDA descriptions. Ideological alignment between reference ideological texts and LLMs’ responses is assessed using cosine similarity for lexical and semantic representations. Results demonstrate that LLMs’ responses based on ideological retrieved texts are more aligned with the ideology encountered in the external knowledge, with the enhanced prompt further influencing LLMs’ outputs. Our findings highlight the importance of identifying ideological discourses within the RAG framework in order to mitigate not just unintended ideological bias, but also the risks of malicious manipulation of such models.
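论文用余弦相似度衡量参考意识形态文本与模型输出在词法与语义表示上的对齐程度。以下给出词法(词袋)表示下的最小示意(词表与文本均为假设的玩具例子):

```python
import math

def bow_vector(text, vocab):
    # 词袋向量:按固定词表统计词频(词法表示的最简形式;
    # 语义表示则会换成句向量,余弦计算方式不变)
    toks = text.lower().split()
    return [toks.count(w) for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```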
[NLP-48] VorTEX: Various overlap ratio for Target speech EXtraction
【速读】: 该论文旨在解决目标语音提取(Target Speech Extraction, TSE)中现有文本提示方法在不同语音重叠比例下性能不稳定的问题,特别是当混合语音的重叠比例较低时模型容易产生抑制(suppression)或残留干扰现象。解决方案的关键在于提出一种名为VorTEX的新架构,其核心是Decoupled Adaptive Multi-branch (DAM)融合模块,该模块将主提取路径与辅助正则化路径解耦,从而实现更鲁棒的目标语音分离;同时构建了PORTE数据集以覆盖0%至100%的重叠比例,并引入Suppression Ratio on Energy (SuRE)作为诊断指标,有效识别传统指标无法捕捉的抑制行为,实验表明VorTEX在20%-100%重叠比范围内均表现出最优分离保真度且无抑制驱动伪影。
链接: https://arxiv.org/abs/2603.14803
作者: Ro-hoon Oh,Jihwan Seol,Bugeun Kim
机构: Chung-Ang University (中央大学)
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: arXiv Preprint
Abstract:Target speech extraction (TSE) aims to recover a target speaker’s voice from a mixture. While recent text-prompted approaches have shown promise, most approaches assume fully overlapped mixtures, limiting insight into behavior across realistic overlap ratios. We introduce VorTEX (Various overlap ratio for Target speech EXtraction), a text-prompted TSE architecture with a Decoupled Adaptive Multi-branch (DAM) Fusion block that separates primary extraction from auxiliary regularization pathways. To enable controlled analysis, we construct PORTE, a two-speaker dataset spanning overlap ratios from 0% to 100%. We further propose Suppression Ratio on Energy (SuRE), a diagnostic metric that detects suppression behavior not captured by conventional measures. Experiments show that existing models exhibit suppression or residual interference under overlap, whereas VorTEX achieves the highest separation fidelity across 20-100% overlap (e.g., 5.50 dB at 20% and 2.04 dB at 100%) while maintaining zero SuRE, indicating robust extraction without suppression-driven artifacts.
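摘要未给出 SuRE 的公式;按其名称的一种可能理解——统计输出能量相对目标能量大幅衰减(即发生抑制)的样本比例——可写成如下示意(阈值取法与整个定义均为假设性解读,非论文原始指标):

```python
import math

def energy(x):
    return sum(s * s for s in x)

def suppression_ratio(outputs, targets, thresh_db=-10.0):
    # 假设性解读:输出能量比目标低超过 thresh_db 即记为一次"抑制",
    # 返回被抑制样本的比例;常规指标(如 SI-SDR)不直接反映该现象
    n_sup = 0
    for o, t in zip(outputs, targets):
        ratio_db = 10.0 * math.log10(max(energy(o), 1e-12) / energy(t))
        if ratio_db < thresh_db:
            n_sup += 1
    return n_sup / len(targets)
```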
[NLP-49] Universe Routing: Why Self-Evolving Agents Need Epistemic Control ICLR2026
【速读】: 该论文旨在解决当前持续学习智能体(lifelong agents)的关键瓶颈问题:并非知识匮乏,而是缺乏对如何推理的决策能力。具体而言,当面对如“这枚硬币是否公平?”这样的问题时,智能体必须识别应采用频率学派假设检验还是贝叶斯后验推断——这两种方法在认识论上不相容,若混用将导致结构性失败并沿决策链传播。为此,作者提出“宇宙路由问题”(universe routing problem),即在调用专用求解器前,先将问题分类到互斥的认知空间(belief spaces)。解决方案的核心在于引入一个显式的认知控制层(epistemic control layer),通过一个465M参数的路由器实现硬性路由至异构推理框架,其在保持与软门控专家模型(soft MoE)相当准确性的前提下提速7倍;同时,该路由机制展现出更强的泛化能力(比关键词匹配基线缩小2.3倍泛化差距),且在扩展新认知空间时,基于重放的持续学习可实现零遗忘,显著优于EWC等正则化方法,验证了模块化认知架构在长期演化中的优越性。
链接: https://arxiv.org/abs/2603.14799
作者: Zhaohui Geoffrey Wang
机构: University of Southern California (南加州大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages. Accepted at the LLA Workshop at ICLR 2026 (camera-ready version)
Abstract:A critical failure mode of current lifelong agents is not lack of knowledge, but the inability to decide how to reason. When an agent encounters “Is this coin fair?” it must recognize whether to invoke frequentist hypothesis testing or Bayesian posterior inference - frameworks that are epistemologically incompatible. Mixing them produces not minor errors, but structural failures that propagate across decision chains. We formalize this as the universe routing problem: classifying questions into mutually exclusive belief spaces before invoking specialized solvers. Our key findings challenge conventional assumptions: (1) hard routing to heterogeneous solvers matches soft MoE accuracy while being 7x faster because epistemically incompatible frameworks cannot be meaningfully averaged; (2) a 465M-parameter router achieves a 2.3x smaller generalization gap than keyword-matching baselines, indicating semantic rather than surface-level reasoning; (3) when expanding to new belief spaces, rehearsal-based continual learning achieves zero forgetting, outperforming EWC by 75 percentage points, suggesting that modular epistemic architectures are fundamentally more amenable to lifelong learning than regularization-based approaches. These results point toward a broader architectural principle: reliable self-evolving agents may require an explicit epistemic control layer that governs reasoning framework selection.
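以摘要中的"这枚硬币是否公平?"为例:硬路由先判定信念空间标签,再分派给互不兼容的求解器——频率学派给出精确二项检验的 p 值,贝叶斯给出后验均值,两者不做任何平均。下面是这一分派结构的最小示意(路由器在此用一个外部给定的标签代替真实的 465M 模型):

```python
from math import comb

def binom_pvalue_two_sided(heads, n, p=0.5):
    # 频率学派:精确二项检验,累加所有不比观测更可能的结果
    pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    obs = pmf[heads]
    return sum(q for q in pmf if q <= obs + 1e-12)

def beta_posterior_mean(heads, n, a=1, b=1):
    # 贝叶斯:Beta(a, b) 先验下正面概率的后验均值
    return (heads + a) / (n + a + b)

def route_and_solve(belief_space, heads, n):
    # 硬路由:按信念空间标签分派求解器,框架之间不混合
    solvers = {
        "frequentist": lambda: binom_pvalue_two_sided(heads, n),
        "bayesian": lambda: beta_posterior_mean(heads, n),
    }
    return solvers[belief_space]()
```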
[NLP-50] Information Asymmetry across Language Varieties: A Case Study on Cantonese-Mandarin and Bavarian-German QA LREC2026
【Quick Read】: This paper studies the limited knowledge coverage and uneven reliability of large language models (LLMs) on local language varieties (such as dialects or regional language editions), focusing on LLM performance across closely related languages under information asymmetry. The core question: when a local Wikipedia edition contains knowledge absent from the standard edition, can LLMs correctly identify and answer questions about it? The key to the solution is a novel question-answering (QA) dataset that captures information unique to local Wikipedia editions and missing from their higher-resource counterparts. Providing lead sections as context substantially improves performance, with further gains from translation. The findings show that local Wikipedia editions are an important source not only of regional knowledge but also of global knowledge, motivating improvements in the cultural diversity and linguistic inclusivity of LLMs.
Link: https://arxiv.org/abs/2603.14782
Authors: Renhao Pei, Siyao Peng, Verena Blaschke, Robert Litschko, Barbara Plank
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 23 pages, accepted at LREC 2026 as an oral presentation
Abstract:Large Language Models (LLMs) are becoming a common way for humans to seek knowledge, yet their coverage and reliability vary widely. Especially for local language varieties, there are large asymmetries, e.g., information in local Wikipedia that is absent from the standard variant. However, little is known about how well LLMs perform under such information asymmetry, especially on closely related languages. We manually construct a novel challenge question-answering (QA) dataset that captures knowledge conveyed on a local Wikipedia page, which is absent from their higher-resource counterparts, covering Mandarin Chinese vs. Cantonese and German vs. Bavarian. Our experiments show that LLMs fail to answer questions about information only in local editions of Wikipedia. Providing context from lead sections substantially improves performance, with further gains possible via translation. Our topical, geographic annotations, and stratified evaluations reveal the usefulness of local Wikipedia editions as sources of both regional and global information. These findings raise critical questions about inclusivity and cultural coverage of LLMs.
[NLP-51] Vietnamese Automatic Speech Recognition: A Revisit EACL2026
【Quick Read】: This paper addresses the performance bottleneck that low-resource languages face when training robust automatic speech recognition (ASR) models, caused by low-quality public datasets and inconsistent annotation. The key to the solution is a general, scalable data aggregation and preprocessing pipeline that builds high-quality ASR datasets from diverse and potentially noisy open-source sources; rigorous processing steps ensure data diversity, balance, and the integrity of crucial features such as word-level timestamps, providing a reliable foundation for training and evaluating state-of-the-art ASR systems.
Link: https://arxiv.org/abs/2603.14779
Authors: Thi Vu, Linh The Nguyen, Dat Quoc Nguyen
Affiliations: Qualcomm AI Research; Qualcomm Vietnam Company Limited
Subjects: Computation and Language (cs.CL)
Comments: Accepted to EACL 2026 Findings
Abstract:Automatic Speech Recognition (ASR) performance is heavily dependent on the availability of large-scale, high-quality datasets. For low-resource languages, existing open-source ASR datasets often suffer from insufficient quality and inconsistent annotation, hindering the development of robust models. To address these challenges, we propose a novel and generalizable data aggregation and preprocessing pipeline designed to construct high-quality ASR datasets from diverse, potentially noisy, open-source sources. Our pipeline incorporates rigorous processing steps to ensure data diversity, balance, and the inclusion of crucial features like word-level timestamps. We demonstrate the effectiveness of our methodology by applying it to Vietnamese, resulting in a unified, high-quality 500-hour dataset that provides a foundation for training and evaluating state-of-the-art Vietnamese ASR systems. Our project page is available at this https URL.
[NLP-52] Towards Privacy-Preserving Machine Translation at the Inference Stage: A New Task and Benchmark
【Quick Read】: This paper addresses the privacy-leakage risk of online machine translation services at the inference stage: cloud translation requires uploading user text to a server, which is risky when the text contains sensitive information such as named entities. The key to the solution is a new "Privacy-Preserving Machine Translation" (PPMT) task focused on protecting the privacy of named entities during model inference. The authors construct three benchmark test datasets, design corresponding evaluation metrics, and propose a series of benchmark methods, providing a clear task definition, an evaluation framework, and a starting point for research on inference-stage privacy protection in machine translation.
Link: https://arxiv.org/abs/2603.14756
Authors: Wei Shao, Lemao Liu, Yinqiao Li, Guoping Huang, Shuming Shi, Linqi Song
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 5 figures, accepted by IEEE Journal of Selected Topics in Signal Processing
Abstract:Current online translation services require sending user text to cloud servers, posing a risk of privacy leakage when the text contains sensitive information. This risk hinders the application of online translation services in privacy-sensitive scenarios. One way to mitigate this risk for online translation services is introducing privacy protection mechanisms targeting the inference stage of translation models. However, compared to subfields of NLP like text classification and summarization, the machine translation research community has limited exploration of privacy protection during the inference stage. There is no clearly defined privacy protection task for the inference stage, dedicated evaluation datasets and metrics, and reference benchmark methods. The absence of these elements has seriously constrained researchers’ in-depth exploration of this direction. To bridge this gap, this paper proposes a novel “Privacy-Preserving Machine Translation” (PPMT) task, aiming to protect the private information in text during the model inference stage. For this task, we constructed three benchmark test datasets, designed corresponding evaluation metrics, and proposed a series of benchmark methods as a starting point for this task. The definition of privacy is complex and diverse. Considering that named entities often contain a large amount of personal privacy and commercial secrets, we have focused our research on protecting only the named entity’s privacy in the text. We expect this research work will provide a new perspective and a solid foundation for the privacy protection problem in machine translation.
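One natural baseline for inference-stage privacy is to mask named entities with placeholders before sending text to the translation service and restore them locally afterwards. The paper's actual benchmark methods are not detailed in the abstract, so this round-trip sketch is only an illustration; the placeholder format and helper names are assumptions:

```python
def mask_entities(text, entities):
    """Replace each named entity with an indexed placeholder before upload."""
    mapping = {}
    for i, ent in enumerate(entities):
        placeholder = f"<ENT{i}>"
        text = text.replace(ent, placeholder)
        mapping[placeholder] = ent
    return text, mapping

def unmask(translated, mapping):
    """Restore the original entities in the locally received translation."""
    for placeholder, ent in mapping.items():
        translated = translated.replace(placeholder, ent)
    return translated

masked, m = mask_entities("Alice met Bob in Paris.", ["Alice", "Bob", "Paris"])
print(masked)             # <ENT0> met <ENT1> in <ENT2>.
print(unmask(masked, m))  # Alice met Bob in Paris.
```

A real system would also need an NER model to find the entities and care for word order changes after translation; this sketch only shows the privacy boundary.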
[NLP-53] Learning Constituent Headedness
【Quick Read】: This paper addresses the fact that constituency treebanks rarely encode constituent headedness explicitly, and that existing pipelines recover heads via Collins-style rule-based percolation with limited precision. The key to the solution is to treat headedness as an explicit representational layer and learn it as a supervised task over aligned constituency and dependency annotations, inducing supervision by using the dependency span head as the signal, which yields highly accurate head prediction. Experiments on aligned English and Chinese data show near-ceiling accuracy and substantial gains over rule-based percolation, comparable parsing performance under head-driven binarization, and improved fidelity and cross-lingual transfer for deterministic constituency-to-dependency conversion.
Link: https://arxiv.org/abs/2603.14755
Authors: Zeyao Qi, Yige Chen, KyungTae Lim, Haihua Pan, Jungyeul Park
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Headedness is widely used as an organizing device in syntactic analysis, yet constituency treebanks rarely encode it explicitly and most processing pipelines recover it procedurally via percolation rules. We treat this notion of constituent headedness as an explicit representational layer and learn it as a supervised prediction task over aligned constituency and dependency annotations, inducing supervision by defining each constituent head as the dependency span head. On aligned English and Chinese data, the resulting models achieve near-ceiling intrinsic accuracy and substantially outperform Collins-style rule-based percolation. Predicted heads yield comparable parsing accuracy under head-driven binarization, consistent with the induced binary training targets being largely equivalent across head choices, while increasing the fidelity of deterministic constituency-to-dependency conversion and transferring across resources and languages under simple label-mapping interfaces.
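The induced supervision defines each constituent's head as the dependency span head, i.e., the token inside the span whose dependency head lies outside the span (or is the root). A minimal sketch of that definition; the function name and the 0-based head encoding are illustrative, not from the paper:

```python
def span_head(heads, start, end):
    """Return the index of the span head in [start, end).

    heads[i] is the 0-based index of token i's dependency head,
    or -1 for the root. The span head is the token whose head
    falls outside the span; a well-formed span has exactly one.
    """
    candidates = [
        i for i in range(start, end)
        if heads[i] == -1 or not (start <= heads[i] < end)
    ]
    if len(candidates) != 1:
        raise ValueError("span does not form a single headed unit")
    return candidates[0]

# "the old dog barked": det(the->dog), amod(old->dog), nsubj(dog->barked), root(barked)
heads = [2, 2, 3, -1]
print(span_head(heads, 0, 3))  # NP "the old dog" -> index 2 ("dog")
print(span_head(heads, 0, 4))  # full clause -> index 3 ("barked")
```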
[NLP-54] Beyond Creed: A Non-Identity Safety Condition, a Strong Empirical Alternative to Identity Framing in Low-Data LoRA Fine-Tuning
【Quick Read】: This paper studies how the form of supervision shapes model behavior in safety fine-tuning, in particular whether identity framing is necessary for the best safety outcomes. The key lies in four supervision formats built from the same core safety rules but phrased differently: constitutional rules (A), creed-style identity framing (B), a matched identity condition with a worldview/confession maintenance tail (C), and a non-identity condition (D). Across three popular instruction-tuned models (Llama 3.1 8B, Qwen2.5 7B, and Gemma 3 4B), the non-identity condition (D) achieves the highest refusal rates on HarmBench (74.4%-76.9%), clearly beating the other formats, especially creed-style identity framing (B). This shows that explicit identity language is not necessary for the strongest safety gains, posing an empirical challenge to a strong version of the identity-framing hypothesis. Capability evaluations (MMLU and ARC-Challenge) show no meaningful performance loss across conditions, confirming that the safety gains are decoupled from model capability.
Link: https://arxiv.org/abs/2603.14723
Authors: Xinran Zhang
Affiliations: University of California, Berkeley
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:How safety supervision is written may matter more than the explicit identity content it contains. We study low-data LoRA safety fine-tuning with four supervision formats built from the same core safety rules: constitutional rules (A), creed-style identity framing (B), a B-matched creed condition with a worldview/confession identity-maintenance tail (C), and a matched non-identity condition (D). Across three instruction-tuned model families (Llama 3.1 8B, Qwen2.5 7B, and Gemma 3 4B), we evaluate HarmBench using a reconciled dual-judge pipeline combining Bedrock-hosted DeepSeek v3.2 and Sonnet 4.6, with disagreement and boundary cases manually resolved. The non-identity condition D is the strongest group on all three model families on the full 320-behavior HarmBench set, reaching 74.4% refusal on Llama, 76.9% on Gemma, and 74.1% on Qwen. By comparison, creed-style framing (B) improves over plain constitutional rules (A) on Llama and Gemma, but remains substantially below D, yielding an overall descriptive ordering of D > B > C ≥ A > baseline. This provides a bounded empirical challenge to a strong version of the identity-framing hypothesis: explicit creed-style identity language is not necessary for the strongest gains observed here. Capability evaluations on MMLU and ARC-Challenge show no meaningful trade-off across conditions.
[NLP-55] Towards Next-Generation LLM Training: From the Data-Centric Perspective
【Quick Read】: This paper tackles the core inefficiencies in preparing and using training data for large language models (LLMs). In current practice, training data is often built with ad hoc hand-written scripts, lacking automated, reusable data workflows; once collected, datasets are typically consumed in their entirety, without mechanisms for dynamic selection, mixture optimization, or reweighting, wasting resources and reducing training efficiency. The key to the solution is two complementary directions: an agent-based automatic data preparation system that supports automated workflow construction and scalable data management, and a unified data-model interaction training framework in which data is dynamically selected, mixed, and reweighted throughout training, enabling more efficient, adaptive, and performance-aware data utilization.
Link: https://arxiv.org/abs/2603.14712
Authors: Hao Liang, Zhengyang Zhao, Zhaoyang Han, Meiyi Qiang, Xiaochen Ma, Bohan Zeng, Qifeng Cai, Zhiyu Li, Linpeng Tang, Weinan E, Wentao Zhang
Affiliations: Peking University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks and domains, with data playing a central role in enabling these advances. Despite this success, the preparation and effective utilization of the massive datasets required for LLM training remain major bottlenecks. In current practice, LLM training data is often constructed using ad hoc scripts, and there is still a lack of mature, agent-based data preparation systems that can automatically construct robust and reusable data workflows, thereby freeing data scientists from repetitive and error-prone engineering efforts. Moreover, once collected, datasets are often consumed largely in their entirety during training, without systematic mechanisms for data selection, mixture optimization, or reweighting. To address these limitations, we advocate two complementary research directions. First, we propose building a robust, agent-based automatic data preparation system that supports automated workflow construction and scalable data management. Second, we argue for a unified data-model interaction training system in which data is dynamically selected, mixed, and reweighted throughout the training process, enabling more efficient, adaptive, and performance-aware data utilization. Finally, we discuss the remaining challenges and outline promising directions for future research and system development.
[NLP-56] Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents
【Quick Read】: This paper addresses a security problem caused by unreliable visual perception in computer-using agents (CUAs) that operate graphical user interfaces (GUIs): the "visual confused deputy", a new attack pattern in which the agent authorizes an action based on a misperceived screen state, caused by grounding errors, adversarial screenshot manipulation, or time-of-check-to-time-of-use (TOCTOU) races. Prior work asks only whether an action succeeds, not whether the agent actually clicked the intended object, leaving an exploitable gap. The key to the solution is dual-channel contrastive classification: two independent channels evaluate (1) the visual click target and (2) whether the agent's reasoning about the action matches deployment-specific knowledge bases, blocking execution if either channel flags risk. The core insight is that the channels are complementary: visual evidence detects target-level mismatches, while textual reasoning reveals dangerous intent behind visually innocuous controls, together substantially improving CUA safety.
Link: https://arxiv.org/abs/2603.14707
Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen
Affiliations: vLLM Semantic Router Project; MBZUAI; McGill University; AMD; Red Hat
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:Computer-using agents (CUAs) act directly on graphical user interfaces, yet their perception of the screen is often unreliable. Existing work largely treats these failures as performance limitations, asking whether an action succeeds, rather than whether the agent is acting on the correct object at all. We argue that this is fundamentally a security problem. We formalize the visual confused deputy: a failure mode in which an agent authorizes an action based on a misperceived screen state, due to grounding errors, adversarial screenshot manipulation, or time-of-check-to-time-of-use (TOCTOU) races. This gap is practically exploitable: even simple screen-level manipulations can redirect routine clicks into privileged actions while remaining indistinguishable from ordinary agent mistakes. To mitigate this threat, we propose the first guardrail that operates outside the agent’s perceptual loop. Our method, dual-channel contrastive classification, independently evaluates (1) the visual click target and (2) the agent’s reasoning about the action against deployment-specific knowledge bases, and blocks execution if either channel indicates risk. The key insight is that these two channels capture complementary failure modes: visual evidence detects target-level mismatches, while textual reasoning reveals dangerous intent behind visually innocuous controls. Across controlled attacks, real GUI screenshots, and agent traces, the combined guardrail consistently outperforms either channel alone. Our results suggest that CUA safety requires not only better action generation, but independent verification of what the agent believes it is clicking and why. Materials are provided (model, benchmark, and code: this https URL).
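The guardrail's combination rule is an OR over two independently scored channels: execution is blocked if either the visual-target check or the reasoning check indicates risk. A minimal sketch of that decision logic; the score scale and thresholds are illustrative assumptions, and the real system derives the scores from knowledge bases rather than taking them as inputs:

```python
def dual_channel_guard(visual_risk, reasoning_risk, t_visual=0.5, t_reasoning=0.5):
    """Block execution if either independent channel flags risk.

    The two channels catch complementary failure modes: visual risk
    covers target-level mismatches (wrong object clicked), reasoning
    risk covers dangerous intent behind visually innocuous controls.
    """
    if visual_risk >= t_visual or reasoning_risk >= t_reasoning:
        return "block"
    return "allow"

print(dual_channel_guard(0.9, 0.1))  # block: click target mismatch
print(dual_channel_guard(0.1, 0.8))  # block: risky intent, innocuous-looking control
print(dual_channel_guard(0.1, 0.1))  # allow
```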
[NLP-57] Computational Analysis of Semantic Connections Between Herman Melville Reading and Writing
【Quick Read】: This paper asks how computational methods can identify semantic similarities in Herman Melville's works that may reflect the influence of his reading, in support of source and influence studies in literary scholarship. The key to the solution is BERTScore-based semantic similarity analysis: Melville's works are compared against texts he is known to have read or owned at both the sentence level and the non-overlapping 5-gram level, with precision, recall, and F1 interpreted as indicators of possible semantic alignment rather than applying fixed thresholds for text reuse. The approach captures expert-identified instances of similarity and surfaces new passages warranting further qualitative examination.
Link: https://arxiv.org/abs/2603.14674
Authors: Nudrat Habib, Elisa Barney Smith, Steven Olsen Smith
Affiliations: Luleå University of Technology; Boise State University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:This study investigates the potential influence of Herman Melville reading on his own writings through computational semantic similarity analysis. Using documented records of books known to have been owned or read by Melville, we compare selected passages from his works with texts from his library. The methodology involves segmenting texts at both sentence level and non-overlapping 5-gram level, followed by similarity computation using BERTScore. Rather than applying fixed thresholds to determine reuse, we interpret precision, recall, and F1 scores as indicators of possible semantic alignment that may suggest literary influence. Experimental results demonstrate that the approach successfully captures expert-identified instances of similarity and highlights additional passages warranting further qualitative examination. The findings suggest that semantic similarity methods provide a useful computational framework for supporting source and influence studies in literary scholarship.
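The methodology segments texts at the sentence level and at the non-overlapping 5-gram level before scoring pairs with BERTScore. A minimal sketch of the 5-gram segmentation step (the sentence splitter and the BERTScore call are omitted; whitespace tokenization is an assumption):

```python
def nonoverlapping_ngrams(text, n=5):
    """Split a text into consecutive, non-overlapping n-gram chunks of words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(0, len(words), n)]

chunks = nonoverlapping_ngrams(
    "call me ishmael some years ago never mind how long precisely", n=5
)
print(chunks)
# ['call me ishmael some years', 'ago never mind how long', 'precisely']
```

Each chunk from one corpus would then be scored against candidate chunks from the other corpus (e.g. with the `bert_score` package), and high-precision/recall pairs flagged for qualitative review.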
[NLP-58] Seamless Deception: Larger Language Models Are Better Knowledge Concealers
【Quick Read】: This paper addresses the problem of language models (LMs) hiding harmful knowledge and feigning ignorance under audit, i.e., detecting whether a model is actively concealing knowledge it contains. The key to the solution is training classifiers to recognize concealment behavior from the model's outputs under specific prompts. The study finds that gradient-based concealment is easier to identify than prompt-based concealment, but that the classifiers do not reliably generalize to unseen model architectures or unseen topics of hidden knowledge; moreover, as model scale grows beyond 70 billion parameters, the identifiable traces of concealment fade and classifier performance degrades to chance, exposing the limits of black-box-only auditing and highlighting the need for more robust detection mechanisms for large models.
Link: https://arxiv.org/abs/2603.14672
Authors: Dhananjay Ashok, Ruth-Ann Armstrong, Jonathan May
Affiliations: Information Sciences Institute, University of Southern California
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Language Models (LMs) may acquire harmful knowledge, and yet feign ignorance of these topics when under audit. Inspired by the recent discovery of deception-related behaviour patterns in LMs, we aim to train classifiers that detect when a LM is actively concealing knowledge. Initial findings on smaller models show that classifiers can detect concealment more reliably than human evaluators, with gradient-based concealment proving easier to identify than prompt-based methods. However, contrary to prior work, we find that the classifiers do not reliably generalize to unseen model architectures and topics of hidden knowledge. Most concerningly, the identifiable traces associated with concealment become fainter as the models increase in scale, with the classifiers achieving no better than random performance on any model exceeding 70 billion parameters. Our results expose a key limitation in black-box-only auditing of LMs and highlight the need to develop robust methods to detect models that are actively hiding the knowledge they contain.
[NLP-59] Punctuated Equilibria in Artificial Intelligence: The Institutional Scaling Law and the Speciation of Sovereign AI
【Quick Read】: This paper challenges the prevailing assumption in AI development that capability scales monotonically with model size, arguing that AI evolves not through smooth, continuous progress but through punctuated phase transitions. The key to the solution is the "Institutional Fitness Manifold", a mathematical framework that quantifies an AI system's overall fitness along four dimensions: capability, institutional trust, affordability, and sovereign compliance. From it the authors derive the "Institutional Scaling Law", which shows that institutional fitness is non-monotonic in model scale: beyond an environment-specific optimum, further scaling degrades fitness as trust erosion and cost penalties outweigh marginal capability gains. The theory implies that, in most institutional deployment environments, orchestrated systems of smaller, domain-adapted models can mathematically outperform a single frontier generalist.
Link: https://arxiv.org/abs/2603.14664
Authors: Mark Baciak, Thomas A. Cellucci, Deanna M. Falkowski
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract:The dominant narrative of artificial intelligence development assumes that progress is continuous and that capability scales monotonically with model size. We challenge both assumptions. Drawing on punctuated equilibrium theory from evolutionary biology, we show that AI development proceeds not through smooth advancement but through extended periods of stasis interrupted by rapid phase transitions that reorganize the competitive landscape. We identify five such eras since 1943 and four epochs within the current Generative AI Era, each initiated by a discontinuous event – from the transformer architecture to the DeepSeek Moment – that rendered the prior paradigm subordinate. To formalize the selection pressures driving these transitions, we develop the Institutional Fitness Manifold, a mathematical framework that evaluates AI systems along four dimensions: capability, institutional trust, affordability, and sovereign compliance. The central result is the Institutional Scaling Law, which proves that institutional fitness is non-monotonic in model scale. Beyond an environment-specific optimum, scaling further degrades fitness as trust erosion and cost penalties outweigh marginal capability gains. This directly contradicts classical scaling laws and carries a strong implication: orchestrated systems of smaller, domain-adapted models can mathematically outperform frontier generalists in most institutional deployment environments. We derive formal conditions under which this inversion holds and present supporting empirical evidence spanning frontier laboratory dynamics, post-training alignment evolution, and the rise of sovereign AI as a geopolitical selection pressure.
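The claimed non-monotonicity can be illustrated with a toy fitness function in which capability grows logarithmically in scale while trust-erosion and cost penalties grow linearly; all functional forms and coefficients below are illustrative assumptions, not the paper's actual manifold:

```python
import math

def institutional_fitness(s, a=1.0, trust_erosion=0.02, cost=0.01):
    """Toy fitness: diminishing capability gains minus linear penalties.
    Non-monotonic in scale s: rises to an interior optimum, then declines."""
    return a * math.log(1 + s) - trust_erosion * s - cost * s

scales = [1, 5, 10, 20, 50, 100, 200]
fits = [(s, round(institutional_fitness(s), 3)) for s in scales]
print(fits)  # fitness peaks at an interior scale, then declines
```

With these coefficients the analytic optimum sits near s ≈ 32, so among the sampled scales fitness is higher at 20 than at either 1 or 200, matching the paper's qualitative claim that over-scaling past the environment-specific optimum reduces fitness.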
[NLP-60] Argumentation for Explainable and Globally Contestable Decision Support with LLMs
【Quick Read】: This paper addresses the trust and controllability problems caused by the opacity and unpredictability of large language models (LLMs) when deployed in high-stakes domains. Existing approaches provide explainability through post-hoc computational argumentation over individual instances, but they are limited to pre-defined binary choices and support only local contestation, leaving the underlying decision logic unchanged and prone to repeated mistakes. The key to the solution is ArgEval, which shifts the focus from instance-specific reasoning to structured evaluation of task-specific decision spaces: it builds task-relevant option ontologies and general argumentation frameworks (AFs) to systematically model each decision option; these shared AFs can be instantiated on specific cases to produce explainable recommendations while supporting global contestation and modification, improving the transparency, controllability, and consistency of model decisions.
Link: https://arxiv.org/abs/2603.14643
Authors: Adam Dejl, Matthew Williams, Francesca Toni
Affiliations: Imperial College London
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) exhibit strong general capabilities, but their deployment in high-stakes domains is hindered by their opacity and unpredictability. Recent work has taken meaningful steps towards addressing these issues by augmenting LLMs with post-hoc reasoning based on computational argumentation, providing faithful explanations and enabling users to contest incorrect decisions. However, this paradigm is limited to pre-defined binary choices and only supports local contestation for specific instances, leaving the underlying decision logic unchanged and prone to repeated mistakes. In this paper, we introduce ArgEval, a framework that shifts from instance-specific reasoning to structured evaluation of general decision options. Rather than mining arguments solely for individual cases, ArgEval systematically maps task-specific decision spaces, builds corresponding option ontologies, and constructs general argumentation frameworks (AFs) for each option. These frameworks can then be instantiated to provide explainable recommendations for specific cases while still supporting global contestability through modification of the shared AFs. We investigate the effectiveness of ArgEval on treatment recommendation for glioblastoma, an aggressive brain tumour, and show that it can produce explainable guidance aligned with clinical practice.
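ArgEval builds on abstract argumentation frameworks (AFs). A standard way to evaluate an AF is the grounded semantics, computable as the least fixpoint of the characteristic function; the sketch below shows that textbook computation on a tiny illustrative AF, not the paper's clinical frameworks:

```python
def grounded_extension(args, attacks):
    """Least fixpoint of the characteristic function
    F(S) = {a : every attacker of a is attacked by some member of S}."""
    attackers = {a: {x for x, y in attacks if y == a} for a in args}
    ext = set()
    while True:
        nxt = {
            a for a in args
            if all(any((d, b) in attacks for d in ext) for b in attackers[a])
        }
        if nxt == ext:
            return ext
        ext = nxt

# a attacks b, b attacks c: a is unattacked and defends c against b.
print(grounded_extension({"a", "b", "c"}, {("a", "b"), ("b", "c")}))  # {'a', 'c'}
```

In ArgEval's setting the accepted arguments of an instantiated AF would back the recommended option, and editing the shared AF (adding or removing attacks) is what makes contestation global rather than per-case.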
[NLP-61] Nudging Hidden States: Training-Free Model Steering for Chain-of-Thought Reasoning in Large Audio-Language Models
【Quick Read】: This paper addresses the challenge of improving reasoning in large audio-language models (LALMs) at inference time without training, in particular strengthening the effect of chain-of-thought (CoT) prompting. The key to the solution is inference-time model steering as a training-free approach: three steering strategies are designed using diverse information sources and validated across four LALMs and four benchmarks, with accuracy gains of up to 4.4%. Notably, steering vectors derived from a small number of text samples effectively guide speech-based reasoning, demonstrating high data efficiency and the potential of cross-modal transfer; a hyperparameter-sensitivity study further probes the robustness of these approaches.
Link: https://arxiv.org/abs/2603.14636
Authors: Lok-Lam Ieong, Chia-Chien Chen, Chih-Kai Yang, Yu-Han Huang, An-Yu Cheng, Hung-yi Lee
Affiliations: National Taiwan University
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: 6 pages, 4 figures, 2 tables
Abstract:Chain-of-thought (CoT) prompting has been extended to large audio-language models (LALMs) to elicit reasoning, yet enhancing its effectiveness without training remains challenging. We study inference-time model steering as a training-free approach to improve LALM reasoning. We introduce three strategies using diverse information sources and evaluate them across four LALMs and four benchmarks. Results show general accuracy gains up to 4.4% over CoT prompting. Notably, we identify a cross-modal transfer where steering vectors derived from few text samples effectively guide speech-based reasoning, demonstrating high data efficiency. We also examine hyperparameter sensitivity to understand the robustness of these approaches. Our findings position model steering as a practical direction for strengthening LALM reasoning.
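Inference-time steering typically nudges a layer's hidden states along a behavior direction. One common recipe, a difference of mean activations, is sketched below; it is an assumption about the general technique, not necessarily one of the paper's three strategies:

```python
def extract_steering_vector(pos_acts, neg_acts):
    """Common recipe (assumed here): difference of mean hidden activations
    between samples that exhibit the target behavior and samples that do not."""
    mean = lambda rows: [sum(col) / len(rows) for col in zip(*rows)]
    return [p - n for p, n in zip(mean(pos_acts), mean(neg_acts))]

def steer(hidden, vector, alpha=1.0):
    """Nudge a hidden state along the behavior direction at inference time."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

v = extract_steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [2.0, 3.0]])
print(v)                                # [1.0, 1.0]
print(steer([0.5, 0.5], v, alpha=2.0))  # [2.5, 2.5]
```

The cross-modal finding in the abstract corresponds to extracting `v` from a few text examples and applying `steer` to hidden states computed from speech inputs.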
[NLP-62] Anterior's Approach to Fairness Evaluation of Automated Prior Authorization System
【Quick Read】: This paper addresses the difficulty of fairness evaluation for prior authorization (PA) models in healthcare AI: conventional parity-in-approval-rates metrics are clinically inappropriate, because legitimate clinical guidelines and medical necessity criteria often differ across demographic groups (e.g., sex, age, race/ethnicity, socioeconomic status). The key to the solution is a fairness evaluation framework based on model error rates rather than approval outcomes, systematically assessing demographic dimensions through error-rate comparisons, tolerance-band analysis (±5 percentage points), statistical power evaluation, and protocol-controlled logistic regression, enabling more rigorous, compliant, and regulator-aligned fairness judgments.
Link: https://arxiv.org/abs/2603.14631
Authors: Sai P. Selvaraj, Khadija Mahmoud, Anuj Iravane
Affiliations: Anterior, Inc.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Increasing staffing constraints and turnaround-time pressures in Prior authorization (PA) have led to increasing automation of decision systems to support PA review. Evaluating fairness in such systems poses unique challenges because legitimate clinical guidelines and medical necessity criteria often differ across demographic groups, making parity in approval rates an inappropriate fairness metric. We propose a fairness evaluation framework for prior authorization models based on model error rates rather than approval outcomes. Using 7,166 human-reviewed cases spanning 27 medical necessity guidelines, we assessed consistency in sex, age, race/ethnicity, and socioeconomic status. Our evaluation combined error-rate comparisons, tolerance-band analysis with a predefined ±5 percentage-point margin, statistical power evaluation, and protocol-controlled logistic regression. Across most demographics, model error rates were consistent, and confidence intervals fell within the predefined tolerance band, indicating no meaningful performance differences. For race/ethnicity, point estimates remain small, but subgroup sample sizes were limited, resulting in wide confidence intervals and underpowered tests, with inconclusive evidence within the dataset we explored. These findings illustrate a rigorous and regulator-aligned approach to fairness evaluation in administrative healthcare AI systems.
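The tolerance-band analysis compares per-group error rates and checks whether the difference, together with its confidence interval, falls inside a predefined ±5 percentage-point band. The paper does not spell out its interval construction, so the Wald interval below is one plausible implementation, offered as a sketch:

```python
import math

def error_rate_gap(err_a, n_a, err_b, n_b, band=0.05, z=1.96):
    """Difference in two groups' error rates, a Wald 95% CI for the
    difference, and whether the whole CI lies inside the +/- band."""
    p_a, p_b = err_a / n_a, err_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    lo, hi = diff - z * se, diff + z * se
    return diff, (lo, hi), (-band <= lo and hi <= band)

# Hypothetical counts: 80 errors in 1000 cases vs 90 errors in 1000 cases.
diff, ci, within = error_rate_gap(80, 1000, 90, 1000)
print(round(diff, 3), within)  # -0.01 True: the CI fits inside the +/-5pp band
```

A small point estimate with a CI that spills outside the band (e.g. from a small subgroup) would come back `False`, mirroring the paper's "underpowered, inconclusive" race/ethnicity finding.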
[NLP-63] PA3: Policy-Aware Agent Alignment through Chain-of-Thought
【Quick Read】: This paper addresses the difficulty large language models (LLMs) face in adhering to complex business rules during tool-use tasks. Existing approaches embed the full business policy in context to guide reasoning, which incurs high latency, wastes compute, and causes a "needle-in-the-haystack" problem from long contexts that harms overall performance. The key to the solution is a multi-stage alignment method that teaches the model to recall and apply relevant business rules during chain-of-thought reasoning at inference time, without including the full policy in every input, together with a Jaccard-based PolicyRecall reward and a hallucination penalty for GRPO training, which improve rule adherence and reasoning efficiency.
Link: https://arxiv.org/abs/2603.14602
Authors: Shubhashis Roy Dipta, Daniel Bis, Kun Zhou, Lichao Wang, Benjamin Z. Yao, Chenlei Guo, Ruhi Sarikaya
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Conversational assistants powered by large language models (LLMs) excel at tool-use tasks but struggle with adhering to complex, business-specific rules. While models can reason over business rules provided in context, including all policies for every query introduces high latency and wastes compute. Furthermore, these lengthy prompts lead to long contexts, harming overall performance due to the “needle-in-the-haystack” problem. To address these challenges, we propose a multi-stage alignment method that teaches models to recall and apply relevant business policies during chain-of-thought reasoning at inference time, without including the full business policy in-context. Furthermore, we introduce a novel PolicyRecall reward based on the Jaccard score and a Hallucination Penalty for GRPO training. Altogether, our best model outperforms the baseline by 16 points and surpasses comparable in-context baselines of similar model size by 3 points, while using 40% fewer words.
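The PolicyRecall reward is described as Jaccard-based: overlap between the policies recalled in the chain of thought and the gold-relevant policies, with a separate penalty for hallucination. How policy IDs are extracted from the trace and how the two terms are weighted are not specified, so the combination below is an illustrative assumption:

```python
def policy_recall_reward(recalled, gold, hallucination_weight=0.5):
    """Jaccard similarity between recalled and gold policy sets, minus a
    penalty proportional to the fraction of recalled policies that are
    not actually relevant (treated here as 'hallucinated')."""
    recalled, gold = set(recalled), set(gold)
    union = recalled | gold
    jaccard = len(recalled & gold) / len(union) if union else 1.0
    hallucinated = recalled - gold
    penalty = hallucination_weight * (len(hallucinated) / max(len(recalled), 1))
    return jaccard - penalty

print(policy_recall_reward({"refund_policy", "id_check"}, {"refund_policy", "id_check"}))  # 1.0
print(policy_recall_reward({"refund_policy", "made_up"}, {"refund_policy", "id_check"}))
```

Under GRPO, this scalar would be computed per sampled completion and used to rank completions within a group.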
[NLP-64] Parameter-Efficient Quality Estimation via Frozen Recursive Models EACL2026
【Quick Read】: This paper investigates whether the recursive mechanisms of Tiny Recursive Models (TRM) transfer to quality estimation (QE) for low-resource languages. The key to the solution is combining frozen pretrained embeddings with weight sharing, improving performance without a large trainable-parameter budget. Experiments show that with frozen XLM-R embeddings, TRM-QE reaches a Spearman correlation of 0.370, matching the fully fine-tuned variant (0.369) and clearly outperforming an equivalent-depth standard Transformer (0.336); on Hindi and Tamil it surpasses MonoTransQuest (560M parameters) with 80x fewer trainable parameters, demonstrating a viable balance between parameter efficiency and performance.
Link: https://arxiv.org/abs/2603.14593
Authors: Umar Abubacar, Roman Bauer, Diptesh Kanojia
Affiliations: University of Surrey, UK; Surrey Institute for People-Centred AI (PAI), University of Surrey, UK
Subjects: Computation and Language (cs.CL)
Comments: Accepted to LowResLM Workshop @ EACL 2026
Abstract:Tiny Recursive Models (TRM) achieve strong results on reasoning tasks through iterative refinement of a shared network. We investigate whether these recursive mechanisms transfer to Quality Estimation (QE) for low-resource languages using a three-phase methodology. Experiments on 8 language pairs on a low-resource QE dataset reveal three findings. First, TRM’s recursive mechanisms do not transfer to QE. External iteration hurts performance, and internal recursion offers only narrow benefits. Next, representation quality dominates architectural choices, and lastly, frozen pretrained embeddings match fine-tuned performance while reducing trainable parameters by 37× (7M vs 262M). TRM-QE with frozen XLM-R embeddings achieves a Spearman’s correlation of 0.370, matching fine-tuned variants (0.369) and outperforming an equivalent-depth standard transformer (0.336). On Hindi and Tamil, frozen TRM-QE outperforms MonoTransQuest (560M parameters) with 80× fewer trainable parameters, suggesting that weight sharing combined with frozen embeddings enables parameter efficiency for QE. We release the code publicly for further research. Code is available at this https URL.
[NLP-65] CausalEvolve: Towards Open-Ended Discovery with Causal Scratchpad
【Quick Read】: This paper addresses the decreasing efficiency and oscillatory behavior that existing evolve-based agents exhibit as they approach known performance boundaries in scientific problem solving, rooted in the lack of targeted evolutionary guidance and of effective organization and use of past evolutionary experience. The key to the solution is the CausalEvolve framework, whose core innovation is a causal scratchpad that leverages LLMs to identify and reason about the causal factors guiding evolution toward the target objective: it first identifies complementary outcome-level inspiring factors, and during evolution it hypothesizes new factors by inspecting surprise patterns and performing abductive reasoning, offering novel directions that substantially improve evolutionary efficiency and discover better solutions on four challenging open-ended scientific tasks.
Link: https://arxiv.org/abs/2603.14575
Authors: Yongqiang Chen, Chenxi Liu, Zhenhao Chen, Tongliang Liu, Bo Han, Kun Zhang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: Preprint of ongoing work; Yongqiang and Chenxi contributed equally
Abstract:Evolve-based agents such as AlphaEvolve are among the notable successes in using Large Language Models (LLMs) to build AI Scientists. These agents tackle open-ended scientific problems by iteratively improving and evolving programs, leveraging the prior knowledge and reasoning capabilities of LLMs. Despite the success, existing evolve-based agents lack targeted guidance for evolution and effective mechanisms for organizing and utilizing knowledge acquired from past evolutionary experience. Consequently, they suffer from decreasing evolution efficiency and exhibit oscillatory behavior when approaching known performance boundaries. To mitigate the gap, we develop CausalEvolve, equipped with a causal scratchpad that leverages LLMs to identify and reason about guiding factors for evolution. At the outset, CausalEvolve identifies outcome-level factors that offer complementary inspirations for improving the target objective. During the evolution, CausalEvolve also inspects surprise patterns and applies abductive reasoning to hypothesize new factors, which in turn offer novel directions. Through comprehensive experiments, we show that CausalEvolve effectively improves the evolutionary efficiency and discovers better solutions in 4 challenging open-ended scientific tasks.
[NLP-66] Top-b: Entropic Regulation of Relative Probability Bands in Autoregressive Language Processes
【Quick Read】: This paper addresses the mismatch between the static truncation rules of standard decoding strategies (e.g., Top-k, Top-p) and the dynamic information density of natural language, which leaves generation either too restrictive for high-entropy creative tasks or too permissive for low-entropy logical reasoning. The key to the solution is Top-b (Adaptive Relative Band Sampling), which models generation as a trajectory on a relative probability manifold and regulates the candidate set via a dynamic bandwidth coefficient strictly coupled to the instantaneous Shannon entropy of the model's distribution, thereby minimizing generation entropy and decoding variance and effectively approximating a self-regulating control system for autoregressive generation.
Link: https://arxiv.org/abs/2603.14567
Authors: Deepon Halder, Raj Dabre
Affiliations: AI4Bharat; IIEST Shibpur; IIT Madras
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Probabilistic language generators are theoretically modeled as discrete stochastic processes, yet standard decoding strategies (Top-k, Top-p) impose static truncation rules that fail to accommodate the dynamic information density of natural language. This misalignment often forces a suboptimal trade-off: static bounds are either too restrictive for high-entropy creative generation or too permissive for low-entropy logical reasoning. In this work, we formalize the generation process as a trajectory through a relative probability manifold. We introduce Top-b (Adaptive Relative Band Sampling), a decoding strategy that regulates the candidate set via a dynamic bandwidth coefficient coupled strictly to the instantaneous Shannon entropy of the model’s distribution. We provide a theoretical framework demonstrating that Top-b acts as a variance-minimizing operator on the tail distribution. Empirical validation on GPQA and GSM8K benchmarks indicates that Top-b significantly reduces generation entropy and inter-decoding variance while maintaining competitive reasoning accuracy, effectively approximating a self-regulating control system for autoregressive generation.
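Top-b keeps tokens whose probability lies within a relative band of the top token, with the band coefficient coupled to the distribution's instantaneous Shannon entropy: high entropy lowers the threshold (more candidates), low entropy raises it (fewer). The abstract does not give the coupling function, so the linear form below is an illustrative assumption:

```python
import math
import random

def top_b_sample(probs, b_min=0.05, b_max=0.5, rng=random):
    """Sample from tokens within a relative probability band of the max;
    the band coefficient shrinks (keeping more tokens) as entropy rises."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    h_norm = h / math.log(len(probs)) if len(probs) > 1 else 0.0  # in [0, 1]
    band = b_max - (b_max - b_min) * h_norm  # assumed linear coupling
    p_max = max(probs)
    keep = [i for i, p in enumerate(probs) if p >= band * p_max]
    total = sum(probs[i] for i in keep)
    r, acc = rng.random() * total, 0.0  # sample from the renormalized band
    for i in keep:
        acc += probs[i]
        if r <= acc:
            return i
    return keep[-1]

random.seed(0)
print(top_b_sample([0.7, 0.2, 0.05, 0.05]))
```

For this distribution the band threshold works out to about 0.15, so only the two highest-probability tokens remain candidates; a flatter distribution would lower the threshold and admit more.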
[NLP-67] Multilingual TinyStories: A Synthetic Combinatorial Corpus of Indic Children's Stories for Training Small Language Models
【Summary】: This paper tackles the scarcity of high-quality, coherent, domain-appropriate training corpora that hinders building robust language models for low-resource languages. The key to the solution is a hybrid curation pipeline: children's stories in 17 Indian languages are first generated natively with the Sarvam-M language model under a novel combinatorial prompt-engineering framework, then expanded cross-lingually at scale via the Google Translate API, and finally passed through strict programmatic filtering. The resulting Multilingual TinyStories dataset contains 132,942 stories and over 93.9 million tokens, providing a foundational resource for training Small Language Models (SLMs) and for transfer learning across Indic languages.
Link: https://arxiv.org/abs/2603.14563
Authors: Deepon Halder, Angira Mukherjee
Affiliations: AI4Bharat; IIEST Shibpur
Categories: Computation and Language (cs.CL)
Comments:
Abstract:The development of robust language models for low-resource languages is frequently bottlenecked by the scarcity of high-quality, coherent, and domain-appropriate training corpora. In this paper, we introduce the Multilingual TinyStories dataset, a large-scale, synthetically generated collection of children’s stories encompassing 17 Indian languages. Designed specifically for the training and evaluation of Small Language Models (SLMs), the corpus provides simple, narrative-driven text strictly localized to native scripts. We detail our hybrid curation pipeline, which leverages the Sarvam-M language model and a novel combinatorial prompt engineering framework for native generation, coupled with the Google Translate API for large-scale cross-lingual expansion. Through strict programmatic filtering, we compiled 132,942 stories and over 93.9 million tokens in our release, serving as a foundational resource for multilingual language modeling and transfer learning in the Indic linguistic sphere.
[NLP-68] MALicious INTent Dataset and Inoculating LLMs for Enhanced Disinformation Detection EACL2026
【Summary】: This paper addresses the fact that existing English disinformation datasets and research largely ignore the malicious intent behind disinformation, which limits detection. The core of the solution is MALINT, the first human-annotated English corpus built in collaboration with expert fact-checkers to capture disinformation and its malicious intent; the corpus is used to benchmark 12 language models (small language models such as BERT and large language models such as Llama 3.3) on binary and multilabel intent classification. Further, inspired by inoculation theory from psychology and communication studies, the paper proposes intent-based inoculation, an intent-augmented reasoning scheme for LLMs that incorporates malicious-intent analysis to mitigate the persuasive impact of disinformation. Experiments across six disinformation datasets, five LLMs, and seven languages show that the approach improves zero-shot disinformation detection, providing a reproducible data resource and methodological framework for intent-aware detection.
Link: https://arxiv.org/abs/2603.14525
Authors: Arkadiusz Modzelewski, Witold Sosnowski, Eleni Papadopulos, Elisa Sartori, Tiziano Labruna, Giovanni Da San Martino, Adam Wierzbicki
Affiliations: University of Padua, Italy; Polish-Japanese Academy of Information Technology, Poland; NASK National Research Institute, Poland; Politecnico di Torino, Italy
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Paper accepted to EACL 2026 Main Conference
Abstract:The intentional creation and spread of disinformation poses a significant threat to public discourse. However, existing English datasets and research rarely address the intentionality behind the disinformation. This work presents MALINT, the first human-annotated English corpus developed in collaboration with expert fact-checkers to capture disinformation and its malicious intent. We utilize our novel corpus to benchmark 12 language models, including small language models (SLMs) such as BERT and large language models (LLMs) like Llama 3.3, on binary and multilabel intent classification tasks. Moreover, inspired by inoculation theory from psychology and communication studies, we investigate whether incorporating knowledge of malicious intent can improve disinformation detection. To this end, we propose intent-based inoculation, an intent-augmented reasoning for LLMs that integrates intent analysis to mitigate the persuasive impact of disinformation. Analysis on six disinformation datasets, five LLMs, and seven languages shows that intent-augmented reasoning improves zero-shot disinformation detection. To support research in intent-aware disinformation detection, we release the MALINT dataset with annotations from each annotation step.
[NLP-69] CangjieBench: Benchmarking LLM s on a Low-Resource General-Purpose Programming Language
【Summary】: This paper addresses the sharp performance drop of Large Language Models (LLMs) on low-resource general-purpose programming languages, a setting largely unexplored by prior work that focuses on Domain-Specific Languages (DSLs). The key contribution is CangjieBench, a contamination-free benchmark of 248 high-quality, manually translated samples covering both Text-to-Code and Code-to-Code tasks, used to systematically evaluate generation strategies including Direct Generation, Syntax-Constrained Generation, Retrieval-Augmented Generation (RAG), and Agent settings. Experiments show that Syntax-Constrained Generation offers the best trade-off between accuracy and computational cost, while Code-to-Code translation can underperform Text-to-Code generation due to negative transfer, yielding key insights into LLM generalization to unseen low-resource programming languages.
Link: https://arxiv.org/abs/2603.14501
Authors: Junhang Cheng, Fang Liu, Jia Li, Chengru Wu, Nanxiang Jiang, Li Zhang
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 26 pages, 20 figures
Abstract:Large Language Models excel in high-resource programming languages but struggle with low-resource ones. Existing research related to low-resource programming languages primarily focuses on Domain-Specific Languages (DSLs), leaving general-purpose languages that suffer from data scarcity underexplored. To address this gap, we introduce CangjieBench, a contamination-free benchmark for Cangjie, a representative low-resource general-purpose language. The benchmark comprises 248 high-quality samples manually translated from HumanEval and ClassEval, covering both Text-to-Code and Code-to-Code tasks. We conduct a systematic evaluation of diverse LLMs under four settings: Direct Generation, Syntax-Constrained Generation, Retrieval-Augmented Generation (RAG), and Agent. Experiments reveal that Direct Generation performs poorly, whereas Syntax-Constrained Generation offers the best trade-off between accuracy and computational cost. The Agent setting achieves state-of-the-art accuracy but incurs high token consumption. Furthermore, we observe that Code-to-Code translation often underperforms Text-to-Code generation, suggesting a negative transfer phenomenon where models overfit to the source language patterns. We hope that our work will offer valuable insights into LLM generalization to unseen and low-resource programming languages. Our code and data are available at this https URL.
[NLP-70] Fine-tuning MLLMs Without Forgetting Is Easier Than You Think
【Summary】: This paper addresses catastrophic forgetting in fine-tuned Multimodal Large Language Models (MLLMs), in particular performance degradation across in-distribution and out-of-distribution inputs. The key findings are: appropriate regularization (constraining the number of trainable parameters or adopting a low learning rate) effectively mitigates forgetting caused by out-of-distribution images; a distinct, task-specific overfitting arises with in-distribution images and out-of-distribution text, which is addressed with a data-hybrid training strategy that mixes datasets and tasks; and the resulting recipe extends naturally to continual learning, outperforming methods that rely on complex auxiliary mechanisms. The results indicate that MLLMs are inherently robust and that simple adjustments to the fine-tuning recipe suffice to balance capability retention and adaptation.
Link: https://arxiv.org/abs/2603.14493
Authors: He Li, Yuhui Zhang, Xiaohan Wang, Kaifeng Lyu, Serena Yeung-Levy
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:This paper demonstrates that simple adjustments to the fine-tuning recipes of multimodal large language models (MLLMs) are sufficient to mitigate catastrophic forgetting. On visual question answering, we design a 2x2 experimental framework to assess model performance across in-distribution and out-of-distribution image and text inputs. Our results show that appropriate regularization, such as constraining the number of trainable parameters or adopting a low learning rate, effectively prevents forgetting when dealing with out-of-distribution images. However, we uncover a distinct form of forgetting in settings with in-distribution images and out-of-distribution text. We attribute this forgetting to task-specific overfitting and address the issue by introducing a data-hybrid training strategy that combines datasets and tasks. Finally, we demonstrate that this approach naturally extends to continual learning, outperforming existing methods with complex auxiliary mechanisms. Overall, our findings challenge prevailing assumptions by highlighting the inherent robustness of MLLMs and providing practical guidelines for adapting them while preserving their general capabilities.
[NLP-71] Infinite Problem Generator: Verifiably Scaling Physics Reasoning Data with Agentic Workflows
【Summary】: This paper addresses the bottleneck of scarce, verifiable, high-quality data for training large language models on complex reasoning: in subjects such as physics, conventional text augmentation tends to introduce hallucinations, while static benchmarks lack the reasoning traces needed for fine-tuning. The key to the solution is the Infinite Problem Generator (IPG), an agentic framework that synthesizes physics problems with guaranteed solvability through a Formula-as-Code paradigm: each solution is encoded as an executable Python program, enforcing strict mathematical consistency and preventing unreliable or contradictory content. The approach also reveals a "Complexity Blueprint": a strong linear correlation (R² ≈ 0.95) between formula count and verification-code length, offering a proxy-free way to quantify problem difficulty and enable controllable curriculum generation.
Link: https://arxiv.org/abs/2603.14486
Authors: Aditya Sharan, Sriram Hebbale, Dhruv Kumar
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Training large language models for complex reasoning is bottlenecked by the scarcity of verifiable, high-quality data. In domains like physics, standard text augmentation often introduces hallucinations, while static benchmarks lack the reasoning traces required for fine-tuning. We introduce the Infinite Problem Generator (IPG), an agentic framework that synthesizes physics problems with guaranteed solvability through a Formula-as-Code paradigm. Unlike probabilistic text generation, IPG constructs solutions as executable Python programs, enforcing strict mathematical consistency. As a proof-of-concept, we release ClassicalMechanicsV1, a high-fidelity corpus of 1,335 classical mechanics problems expanded from 165 expert seeds. The corpus demonstrates high structural diversity, spanning 102 unique physical formulas with an average complexity of 3.05 formulas per problem. Furthermore, we identify a Complexity Blueprint, demonstrating a strong linear correlation ( R^2 \approx 0.95 ) between formula count and verification code length. This relationship establishes code complexity as a precise, proxy-free metric for problem difficulty, enabling controllable curriculum generation. We release the full IPG pipeline, the ClassicalMechanicsV1 dataset, and our evaluation report to support reproducible research in reasoning-intensive domains.
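The Formula-as-Code idea above can be illustrated with a tiny problem template: parameters are sampled, and the answer is computed by the same executable code that defines the problem, so solvability holds by construction. The template, formula choice, and function name below are illustrative assumptions and are not taken from the released IPG pipeline.

```python
import random

def projectile_problem(seed):
    """Formula-as-Code sketch: the answer key is produced by executable
    code, so every generated problem is solvable and internally
    consistent by construction (no hallucinated answers)."""
    rng = random.Random(seed)          # seeding makes generation reproducible
    v0 = rng.randint(5, 30)            # launch speed in m/s
    g = 9.8                            # gravitational acceleration, m/s^2
    h_max = v0 ** 2 / (2 * g)          # from v0^2 = 2 g h
    t_flight = 2 * v0 / g              # up-and-down flight time
    text = (f"A ball is thrown straight up at {v0} m/s. "
            f"How high does it rise, and when does it land?")
    return text, {"h_max": h_max, "t_flight": t_flight}
```

Because the two answers come from the same sampled `v0`, they satisfy the identity h_max = g·t_flight²/8, which is exactly the kind of cross-check a verification program can enforce.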
[NLP-72] AI Can Learn Scientific Taste
【Summary】: This paper addresses a gap in AI-scientist research: most work improves an AI's executive capability while neglecting its scientific taste, i.e., the capacity to judge and propose research ideas with high potential impact. The key to the solution is Reinforcement Learning from Community Feedback (RLCF), which formulates scientific taste learning as a preference modeling and alignment problem: a judge model, Scientific Judge, is first trained on 700K field- and time-matched pairs of high- vs. low-citation papers; it then serves as a reward model for training a policy model, Scientific Thinker, that proposes research ideas with high potential impact. Experiments show that Scientific Judge outperforms existing LLMs across multiple test settings and that Scientific Thinker proposes ideas with higher potential impact than baselines, indicating that AI can learn scientific taste, a key step toward human-level AI scientists.
Link: https://arxiv.org/abs/2603.14473
Authors: Jingqi Tong, Mingzhe Li, Hangcheng Li, Yongzhuo Yang, Yurong Mou, Weijie Ma, Zhiheng Xi, Hongji Chen, Xiaoran Liu, Qinyuan Cheng, Ming Zhang, Qiguang Chen, Weifeng Ge, Qipeng Guo, Tianlei Ying, Tianxiang Sun, Yining Zheng, Xinchi Chen, Jun Zhao, Ning Ding, Xuanjing Huang, Yugang Jiang, Xipeng Qiu
Affiliations: Fudan University; Shanghai AI Lab; Tsinghua University; National University of Singapore
Categories: Computation and Language (cs.CL)
Comments: 44 pages, 4 figures
Abstract:Great scientists have strong judgement and foresight, closely tied to what we call scientific taste. Here, we use the term to refer to the capacity to judge and propose research ideas with high potential impact. However, most related research focuses on improving an AI scientist's executive capability, while enhancing an AI's scientific taste remains underexplored. In this work, we propose Reinforcement Learning from Community Feedback (RLCF), a training paradigm that uses large-scale community signals as supervision, and formulate scientific taste learning as a preference modeling and alignment problem. For preference modeling, we train Scientific Judge on 700K field- and time-matched pairs of high- vs. low-citation papers to judge ideas. For preference alignment, using Scientific Judge as a reward model, we train a policy model, Scientific Thinker, to propose research ideas with high potential impact. Experiments show Scientific Judge outperforms SOTA LLMs (e.g., GPT-5.2, Gemini 3 Pro) and generalizes to future-year tests, unseen fields, and peer-review preference. Furthermore, Scientific Thinker proposes research ideas with higher potential impact than baselines. Our findings show that AI can learn scientific taste, marking a key step toward reaching human-level AI scientists.
[NLP-73] An Industrial-Scale Insurance LLM Achieving Verifiable Domain Mastery and Hallucination Control without Competence Trade-offs
【Summary】: This paper addresses two core challenges in applying Large Language Models (LLMs) to high-stakes vertical domains such as insurance: avoiding hallucination while strictly adhering to complex regulations and business logic, and gaining domain expertise without sacrificing general intelligence. Existing approaches often fall into a competency trade-off, or lean heavily on Retrieval-Augmented Generation (RAG) without intrinsic reasoning. The key to the proposed INS-S1 insurance LLM family is twofold: a verifiable data synthesis system that constructs hierarchical datasets for actuarial reasoning and compliance, and a progressive SFT-RL curriculum framework that combines Verified Reasoning (RLVR) with AI Feedback (RLAIF), dynamically adjusting data ratios and reward signals to enforce domain constraints while preventing catastrophic forgetting. The result is state-of-the-art performance on insurance tasks with top-tier general capabilities retained and a hallucination rate reduced to 0.6% (HHEM).
Link: https://arxiv.org/abs/2603.14463
Authors: Qian Zhu, Xinnan Guo, Jingjing Huo, Jun Li, Pan Liu, Wenyan Yang, Wanqing Xu, Xuan Lin
Affiliations: Ant Group
Categories: Computation and Language (cs.CL)
Comments: 21 pages, 12 figures, 17 tables
Abstract:Adapting Large Language Models (LLMs) to high-stakes vertical domains like insurance presents a significant challenge: scenarios demand strict adherence to complex regulations and business logic with zero tolerance for hallucinations. Existing approaches often suffer from a Competency Trade-off - sacrificing general intelligence for domain expertise - or rely heavily on RAG without intrinsic reasoning. To bridge this gap, we present INS-S1, an insurance-specific LLM family trained via a novel end-to-end alignment paradigm. Our approach features two methodological innovations: (1) A Verifiable Data Synthesis System that constructs hierarchical datasets for actuarial reasoning and compliance; and (2) A Progressive SFT-RL Curriculum Framework that integrates dynamic data annealing with a synergistic mix of Verified Reasoning (RLVR) and AI Feedback (RLAIF). By optimizing data ratios and reward signals, this framework enforces domain constraints while preventing catastrophic forgetting. Additionally, we release INSEva, the most comprehensive insurance benchmark to date (39k+ samples). Extensive experiments show that INS-S1 achieves SOTA performance on domain tasks, significantly outperforming DeepSeek-R1 and Gemini-2.5-Pro. Crucially, it maintains top-tier general capabilities and achieves a record-low 0.6% hallucination rate (HHEM). Our results demonstrate that rigorous domain specialization can be achieved without compromising general intelligence.
Information Retrieval
[IR-0] Financial Transaction Retrieval and Contextual Evidence for Knowledge-Grounded Reasoning
【Summary】: This paper addresses transaction analytics in finance under low supervision: how to exploit client-generated digital traces such as transaction histories to improve user modeling without large amounts of labeled data. General-purpose LLMs struggle with time-distributed tabular data, while specialized tabular and sequence models transfer poorly and depend on labels. The key to the solution is FinTRACE, a retrieval-first architecture that converts raw transactions into reusable feature representations, applies rule-based detectors, and stores the resulting signals in a behavioral knowledge base with graded associations to downstream-task objectives. The method substantially improves low-supervision transaction analytics on public and industrial benchmarks, e.g., doubling zero-shot Matthews Correlation Coefficient (MCC) on churn prediction from 0.19 to 0.38, and further achieves state-of-the-art LLM results on transaction analytics by instruction-tuning LLMs on retrieved behavioral patterns.
Link: https://arxiv.org/abs/2603.15459
Authors: Artem Sakhno, Daniil Tomilov, Yuliana Shakhvalieva, Inessa Fedorova, Daria Ruzanova, Omar Zoloev, Andrey Savchenko, Maksim Makarenko
Affiliations: Sber AI Lab; Sber AIM
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Nowadays, the success of financial organizations heavily depends on their ability to process digital traces generated by their clients, e.g., transaction histories, gathered from various sources to improve user modeling pipelines. As general-purpose LLMs struggle with time-distributed tabular data, production stacks still depend on specialized tabular and sequence models with limited transferability and a need for labeled data. To address this, we introduce FinTRACE, a retrieval-first architecture that converts raw transactions into reusable feature representations, applies rule-based detectors, and stores the resulting signals in a behavioral knowledge base with graded associations to the objectives of downstream tasks. Across public and industrial benchmarks, FinTRACE substantially improves low-supervision transaction analytics, doubling zero-shot MCC on churn prediction from 0.19 to 0.38 and improving 16-shot MCC from 0.25 to 0.40. We further use FinTRACE to ground LLMs via instruction tuning on retrieved behavioral patterns, achieving state-of-the-art LLM results on transaction analytics problems.
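For readers unfamiliar with the metric above, the Matthews Correlation Coefficient (MCC) is computed directly from confusion-matrix counts; a minimal reference implementation:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.
    Ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect prediction)."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0
```

Unlike accuracy, MCC stays informative under the class imbalance typical of churn prediction, which is presumably why the paper reports it.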
[IR-1] Multi-Scenario User Profile Construction via Recommendation Lists
【Summary】: This paper addresses user profile construction in recommender systems (RS), specifically how to infer users' personal characteristics under limited information conditions. The core challenge is modeling user attributes from easily accessible recommendation lists, without direct access to users' historical behavior. The key to the solution is RAPI, a generic user attribute analysis framework: a surrogate recommendation model is first built to simulate the original recommendation logic, using item embeddings from a pre-trained BERT model; a sample augmentation module then generates extended recommendation lists based on the similarity between model outputs and item embeddings; finally, an adaptive-weight classification model assigns dynamic weights to improve inference of user characteristics. Experiments on four collections report inference accuracies of 0.764 and 0.6477.
Link: https://arxiv.org/abs/2603.15357
Authors: Hui Zhang, Jiayu Liu
Affiliations: Huazhong University of Science and Technology
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Recommender systems (RS) play a core role in various domains, including business analytics, helping users and companies make appropriate decisions. To optimize service quality, related technologies focus on constructing user profiles by analyzing users’ historical behavior information. This paper considers four analytical scenarios to evaluate user profiling capabilities under different information conditions. A generic user attribute analysis framework named RAPI is proposed, which infers users’ personal characteristics by exploiting easily accessible recommendation lists. Specifically, a surrogate recommendation model is established to simulate the original model, leveraging content embedding from a pre-trained BERT model to obtain item embeddings. A sample augmentation module generates extended recommendation lists by considering the similarity between model outputs and item embeddings. Finally, an adaptive weight classification model assigns dynamic weights to facilitate user characteristic inference. Experiments on four collections show that RAPI achieves inference accuracies of 0.764 and 0.6477.
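The list-augmentation step can be sketched as extending an observed recommendation list with the catalog items most similar to it under cosine similarity. This is only an approximation of RAPI's augmentation module; the exact scoring against surrogate-model outputs is not specified in the abstract, so the function below is a stated assumption.

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def augment_list(recommended, emb, k_extra=1):
    """Extend a recommendation list with the items whose embeddings are
    closest to the already-recommended items (illustrative sketch)."""
    pool = [i for i in emb if i not in recommended]
    score = lambda i: max(cosine(emb[i], emb[r]) for r in recommended)
    return sorted(pool, key=score, reverse=True)[:k_extra]
```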
[IR-2] OrgForge: A Multi-Agent Simulation Framework for Verifiable Synthetic Corporate Corpora
【Summary】: This paper addresses the lack of datasets for evaluating Retrieval-Augmented Generation (RAG) pipelines that provide verifiable ground truth, clean temporal structure, and cross-artifact consistency. Existing resources such as the Enron corpus carry legal ambiguity and demographic skew and offer no structured ground truth, while purely LLM-generated synthetic data avoids the legal problem but introduces self-contradictory hallucinated facts. The key to the solution is OrgForge, an open-source multi-agent simulation framework that enforces a strict physics-cognition boundary: a deterministic Python engine maintains a SimEvent bus as the single source of truth, while LLMs generate only surface prose constrained by validated proposals; an actor-local clock guarantees causally consistent timestamps across all artifact types, avoiding the timeline inconsistencies that arise when timestamps are sampled independently; and graph-dynamic subsystems (betweenness-centrality stress propagation, temporal edge-weight decay, and Dijkstra escalation routing) govern organizational behavior, enabling precise tracking of failure propagation, recurring failure classes, and causal chains in email flows. The result is a rigorously structured, traceable, verifiable environment for RAG evaluation.
Link: https://arxiv.org/abs/2603.14997
Authors: Jeffrey Flynt
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:Evaluating retrieval-augmented generation (RAG) pipelines requires corpora whose ground truth is knowable, temporally structured, and consistent across artifacts, properties that real-world datasets rarely provide cleanly. Existing resources such as the Enron corpus carry legal ambiguity, demographic skew, and no structured ground truth. Purely LLM-generated synthetic data solves the legal problem but introduces a subtler one: the generating model cannot be prevented from hallucinating facts that contradict themselves across documents. We present OrgForge, an open-source multi-agent simulation framework that enforces a strict physics-cognition boundary: a deterministic Python engine maintains a SimEvent ground truth bus; large language models generate only surface prose, constrained by validated proposals. An actor-local clock enforces causal timestamp correctness across all artifact types, eliminating the class of timeline inconsistencies that arise when timestamps are sampled independently per document. We formalize three graph-dynamic subsystems (stress propagation via betweenness centrality, temporal edge-weight decay, and Dijkstra escalation routing) that govern organizational behavior independently of any LLM. Running a configurable N-day simulation, OrgForge produces interleaved Slack threads, JIRA tickets, Confluence pages, Git pull requests, and emails, all traceable to a shared, immutable event log. We additionally describe a causal chain tracking subsystem that accumulates cross-artifact evidence graphs per incident, a hybrid reciprocal-rank-fusion recurrence detector for identifying repeated failure classes, and an inbound/outbound email engine that routes vendor alerts, customer complaints, and HR correspondence through gated causal chains with probabilistic drop simulation. OrgForge is available under the MIT license.
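The actor-local clock idea can be sketched in a few lines: each actor's artifacts receive strictly increasing timestamps, and an artifact that replies to another is forced after its causal predecessor. This is an illustrative sketch only; the class and method names are not taken from the OrgForge codebase.

```python
class ActorClock:
    """Actor-local clock: timestamps from one actor are strictly
    increasing, so a reply can never appear to precede the event it
    responds to (the timeline-inconsistency class OrgForge eliminates)."""

    def __init__(self, start=0):
        self._now = start

    def stamp(self, after=None):
        """Return the next timestamp; if `after` is given, the result is
        guaranteed to be strictly later than that causal predecessor."""
        if after is not None:
            self._now = max(self._now, after)
        self._now += 1
        return self._now
```

Sampling timestamps independently per document (the failure mode the abstract describes) offers no such guarantee; routing every stamp through the actor's clock makes the ordering invariant hold by construction.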
[IR-3] Mitigating KG Quality Issues: A Robust Multi-Hop GraphRAG Retrieval Framework
【Summary】: This paper addresses retrieval drift and retrieval hallucination in multi-hop reasoning over knowledge graphs (KGs) caused by poor KG quality: existing methods tend to ignore the inherent noise and incompleteness of KGs, producing spurious paths or ungrounded reasoning chains during complex reasoning. The key to the proposed C2RAG framework is twofold: constraint-driven retrieval via constraint decomposition and fine-grained constraint anchoring, which filters noisy candidates to suppress retrieval drift; and a sufficiency check that explicitly decides whether the current evidence justifies structural propagation, triggering textual recovery otherwise to prevent retrieval hallucination. Experiments show that C2RAG outperforms the latest baselines on multi-hop benchmarks by 3.4% EM and 3.9% F1 on average while being more robust to KG quality issues.
Link: https://arxiv.org/abs/2603.14828
Authors: Yizhuo Ma, Shuang Liang, Rongzheng Wang, Jiakai, Qizhi Chen, Muquan Li, Ke Qin
Affiliations: Unknown
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Graph Retrieval-Augmented Generation enhances multi-hop reasoning but relies on imperfect knowledge graphs that frequently suffer from inherent quality issues. Current approaches often overlook these issues, consequently struggling with retrieval drift driven by spurious noise and retrieval hallucinations stemming from incomplete information. To address these challenges, we propose C2RAG (Constraint-Checked Retrieval-Augmented Generation), a framework aimed at robust multi-hop retrieval over the imperfect KG. First, C2RAG performs constraint-based retrieval by decomposing each query into atomic constraint triples, with using fine-grained constraint anchoring to filter candidates for suppressing retrieval drift. Second, C2RAG introduces a sufficiency check to explicitly prevent retrieval hallucinations by deciding whether the current evidence is sufficient to justify structural propagation, and activating textual recovery otherwise. Extensive experiments on multi-hop benchmarks demonstrate that C2RAG consistently outperforms the latest baselines by 3.4% EM and 3.9% F1 on average, while exhibiting improved robustness under KG issues.
[IR-4] Compute Allocation for Reasoning-Intensive Retrieval Agents
【Summary】: This paper addresses how long-horizon agents with continuously growing memory stores can perform reasoning-intensive retrieval efficiently, where the connection between a query and relevant documents is implicit and must be bridged by inference. Existing LLM-augmented retrieval pipelines rely on query expansion and candidate re-ranking, at significant inference cost. The key contribution is a systematic analysis of compute allocation: experiments on the BRIGHT benchmark with the Gemini 2.5 model family show that concentrating compute on re-ranking yields the largest gains (stronger models: +7.5 NDCG@10; deeper candidate pools: +21% from k=10 to 100), while query expansion shows diminishing returns beyond lightweight models and inference-time thinking provides minimal improvement at either stage, suggesting compute should be concentrated on re-ranking rather than distributed uniformly across pipeline stages.
Link: https://arxiv.org/abs/2603.14635
Authors: Sreeja Apparaju, Nilesh Gupta
Affiliations: University of Texas at Austin
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:As agents operate over long horizons, their memory stores grow continuously, making retrieval critical to accessing relevant information. Many agent queries require reasoning-intensive retrieval, where the connection between query and relevant documents is implicit and requires inference to bridge. LLM-augmented pipelines address this through query expansion and candidate re-ranking, but introduce significant inference costs. We study computation allocation in reasoning-intensive retrieval pipelines using the BRIGHT benchmark and Gemini 2.5 model family. We vary model capacity, inference-time thinking, and re-ranking depth across query expansion and re-ranking stages. We find that re-ranking benefits substantially from stronger models (+7.5 NDCG@10) and deeper candidate pools (+21% from k =10 to 100), while query expansion shows diminishing returns beyond lightweight models (+1.1 NDCG@10 from weak to strong). Inference-time thinking provides minimal improvement at either stage. These results suggest that compute should be concentrated on re-ranking rather than distributed uniformly across pipeline stages.
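For reference, the NDCG@k metric reported above can be computed as follows (standard log2-discount formulation over graded relevance labels listed in ranked order):

```python
import math

def ndcg_at_k(ranked_rels, k=10):
    """Normalized Discounted Cumulative Gain at cutoff k.
    `ranked_rels` are graded relevance labels in the order the system
    ranked them; the ideal ordering sorts them descending."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(ranked_rels[:k]))
    ideal = sorted(ranked_rels, reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

A gain such as "+7.5 NDCG@10" in the abstract refers to this quantity expressed on a 0-100 scale.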
[IR-5] ResearchPilot: A Local-First Multi-Agent System for Literature Synthesis and Related Work Drafting
【Summary】: This paper addresses inefficiency and fragmentation in the literature-review process, i.e., how to efficiently extract key findings from large bodies of academic literature and produce citable review text. The core of the solution is ResearchPilot, an open-source, self-hostable multi-agent system built on a local-first architecture combining FastAPI, DSPy, SQLite, and Qdrant: starting from a natural-language research question, it retrieves papers from Semantic Scholar and arXiv, extracts structured findings, synthesizes cross-paper patterns, and drafts a citation-aware related-work section. Typed agent interfaces and persistent history-search mechanisms improve reproducibility and transparency, giving researchers a controllable, extensible literature assistant that does not depend on cloud services.
Link: https://arxiv.org/abs/2603.14629
Authors: Peng Zhang
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:ResearchPilot is an open-source, self-hostable multi-agent system for literature-review assistance. Given a natural-language research question, it retrieves papers from Semantic Scholar and arXiv, extracts structured findings from paper abstracts, synthesizes cross-paper patterns, and drafts a citation-aware related-work section. The system combines FastAPI, this http URL, DSPy, SQLite, and Qdrant in a local-first architecture that supports bring-your-own-key model access and remote-or-local embeddings. This paper describes the system design, typed agent interfaces, persistence and history-search mechanisms, and the engineering tradeoffs involved in building a transparent research assistant. Rather than claiming algorithmic novelty, we present ResearchPilot as a systems contribution and evaluate it through automated tests and end-to-end local runs. We discuss limitations including external API rate limits, abstract-only extraction, incomplete corpus coverage, and the lack of citation verification.
[IR-6] FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference
【Summary】: This paper addresses the inference bottleneck posed by the classification head when deploying language models on consumer devices, where the head accounts for up to 60% of model parameters and 50% of inference compute. The key to the solution is FlashHead, a training-free, hardware-friendly drop-in replacement for the dense classification head that reframes output-head computation as an information-retrieval problem. Its core innovations are a balanced clustering scheme yielding compact, hardware-efficient tensors; multiprobe retrieval that scores thousands of clusters in parallel; an inference-time sampling mechanism that extends retrieval to the full vocabulary; and selective quantization for low-bit computation. On Llama-3.2, Gemma-3, and Qwen-3, FlashHead delivers model-level inference speedups of up to 1.75x while maintaining output accuracy.
Link: https://arxiv.org/abs/2603.14591
Authors: Wilhelm Tranheden, Shahnawaz Ahmed, Devdatt Dubhashi, Jonna Matthiesen, Hannes von Essen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: A collection of models with FlashHead optimization can be found at: this https URL
Abstract:Language models are increasingly adopting smaller architectures optimized for consumer devices. In this setting, inference efficiency is the primary constraint. Meanwhile, vocabulary sizes continue to grow rapidly, making the classification head a critical bottleneck that accounts for up to 60% of model parameters and 50% of inference compute. We introduce FlashHead, the first efficient drop-in replacement for the dense classification head that is training-free and hardware-friendly. FlashHead builds on principles from information retrieval, reframing the computation at the output head as a retrieval problem rather than a dense classification over the full vocabulary. FlashHead introduces four key innovations: (1) a balanced clustering scheme that structures vocabulary partitions into compact hardware-efficient tensors, (2) extending multiprobe retrieval to language model heads, enabling thousands of clusters to be scored in parallel, (3) a novel inference-time sampling mechanism that extends retrieval beyond top tokens, enabling probabilistic sampling across the full vocabulary, and (4) selective quantization, enabling effective low-bit computation in the head. Experiments on Llama-3.2, Gemma-3, and Qwen-3 show that FlashHead delivers model-level inference speedups of up to 1.75x while maintaining output accuracy compared to the original head. By overcoming the classification head bottleneck, FlashHead establishes a new benchmark for efficient inference and removes a key barrier to developing smaller, capable models for consumer hardware.
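The retrieval reframing above can be sketched as: score cluster centroids first, then score only the tokens inside the best-matching clusters instead of the full vocabulary. This sketch shows only the multiprobe idea; FlashHead's balanced clustering, parallel scoring, and quantization are omitted, and the clusters here are assumed to be given.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def multiprobe_argmax(h, centroids, clusters, emb, m=1):
    """Multiprobe retrieval sketch for an output head: rank centroids by
    similarity to the hidden state h, probe the top-m clusters, and take
    the argmax over only those clusters' token embeddings."""
    probed = sorted(range(len(centroids)),
                    key=lambda c: dot(h, centroids[c]), reverse=True)[:m]
    candidates = [t for c in probed for t in clusters[c]]
    return max(candidates, key=lambda t: dot(h, emb[t]))
```

When the clustering respects embedding geometry, the probed argmax agrees with the dense argmax while touching only a fraction of the vocabulary, which is the source of the speedup.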
[IR-7] SuperLocalMemory V3: Information-Geometric Foundations for Zero-LLM Enterprise Agent Memory
【Summary】: This paper addresses the missing mathematical foundations of persistent memory systems for AI agents, specifically memory retrieval, lifecycle management, and consistency. Current approaches rely on heuristics (cosine-similarity retrieval, hand-tuned decay) and offer no formal contradiction detection. The key to the solution lies in three information-geometric, sheaf-theoretic, and stochastic-dynamical frameworks: a retrieval metric derived from the Fisher information structure of diagonal Gaussian families that satisfies Riemannian metric axioms; a memory lifecycle modeled as Riemannian Langevin dynamics, with existence and uniqueness of the stationary distribution proven via the Fokker-Planck equation, yielding provably convergent decay; and a cellular sheaf model in which non-trivial first cohomology classes correspond precisely to irreconcilable contradictions across memory contexts. These mathematical layers yield +12.7 percentage points over engineering baselines on the LoCoMo benchmark, and a zero-LLM configuration satisfies EU AI Act data sovereignty requirements by architectural design.
Link: https://arxiv.org/abs/2603.14588
Authors: Varun Pratap Bhardwaj
Affiliations: Independent Researcher, Solution Architect; India
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 43 pages, 5 figures, 9 tables, 3 appendices. Code: this https URL. Zenodo DOI: https://doi.org/10.5281/zenodo.19038659
Abstract:Persistent memory is a central capability for AI agents, yet the mathematical foundations of memory retrieval, lifecycle management, and consistency remain unexplored. Current systems employ cosine similarity for retrieval, heuristic decay for salience, and provide no formal contradiction detection. We establish information-geometric foundations through three contributions. First, a retrieval metric derived from the Fisher information structure of diagonal Gaussian families, satisfying Riemannian metric axioms, invariant under sufficient statistics, and computable in O(d) time. Second, memory lifecycle formulated as Riemannian Langevin dynamics with proven existence and uniqueness of the stationary distribution via the Fokker-Planck equation, replacing hand-tuned decay with principled convergence guarantees. Third, a cellular sheaf model where non-trivial first cohomology classes correspond precisely to irreconcilable contradictions across memory contexts. On the LoCoMo benchmark, the mathematical layers yield +12.7 percentage points over engineering baselines across six conversations, reaching +19.9 pp on the most challenging dialogues. A four-channel retrieval architecture achieves 75% accuracy without cloud dependency. Cloud-augmented results reach 87.7%. A zero-LLM configuration satisfies EU AI Act data sovereignty requirements by architectural design. To our knowledge, this is the first work establishing information-geometric, sheaf-theoretic, and stochastic-dynamical foundations for AI agent memory systems.
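An O(d) metric over diagonal Gaussians can be illustrated with the standard closed form of the univariate Fisher-Rao distance, combined per dimension as a product-manifold distance. Whether this matches the paper's exact construction is an assumption; the closed form below is the textbook one for univariate normals, and it reduces correctly to the known geodesic distances along the mean-only and variance-only directions.

```python
import math

def fisher_rao_diag(mu1, sig1, mu2, sig2):
    """Fisher-Rao distance between diagonal Gaussians, one closed-form
    univariate distance per dimension, combined as sqrt of the sum of
    squares (product-manifold metric). Runs in O(d) time."""
    total = 0.0
    for m1, s1, m2, s2 in zip(mu1, sig1, mu2, sig2):
        num = (m1 - m2) ** 2 + 2.0 * (s1 - s2) ** 2
        den = (m1 - m2) ** 2 + 2.0 * (s1 + s2) ** 2
        delta = math.sqrt(num / den)
        total += (2.0 * math.sqrt(2.0) * math.atanh(delta)) ** 2
    return math.sqrt(total)
```

As sanity checks: the distance is zero between identical Gaussians, symmetric in its arguments, and for equal means with standard deviations 1 and e it equals sqrt(2)*ln(e) = sqrt(2), matching the metric ds^2 = dmu^2/sigma^2 + 2 dsigma^2/sigma^2.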
[IR-8] Open to What End? A Capability-Theoretic Perspective on Open Search
【Summary】: This paper addresses the concentration of power that comes with a few large corporations controlling search platforms, particularly amid geopolitical tensions and attempts by authoritarian actors to manipulate public opinion through ideological imposition, and asks how the push for "open search" can effectively challenge this oligopoly. The key argument is to redefine what "open" means: rather than focusing on which technology or data is being made open, openness should be understood through a capability-theoretic lens, in terms of the capabilities an open system affords to the actors it is opened to. This framing is meant to genuinely empower users and guard against corporate co-optation and neutralization of open initiatives, avoiding the entrenchment of power seen in adjacent movements such as open source and open AI models, where contested standards of openness left the door open to capture.
Link: https://arxiv.org/abs/2603.14584
Authors: Nicola Neophytou, Bhaskar Mitra
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computers and Society (cs.CY)
Comments:
Abstract:The hegemony of control over our search platforms by a few large corporations raises justifiable concerns, particularly in light of emerging geopolitical tensions and growing instances of ideological imposition by authoritarian actors to manipulate public opinion. A recent movement to promote open search has emerged in response. This follows past and ongoing pushes for openness to challenge corporate oligopolies (e.g., open source and open AI models), which have seen significant ongoing negotiations and renegotiations to establish standards around what constitutes being open. These tensions have hindered those movements from effectively challenging power, in turn allowing powerful corporations to neutralize or co-opt them to further entrench their dominance. We argue that the push for open search will inevitably encounter similar conflicts, and should foreground these tensions to safeguard against the challenges faced by these adjacent movements. In particular, we argue that the concept of open should be understood not with respect to what is being made open but through a capability-theoretic lens, in terms of the capabilities it affords to the actors the system is being opened to.
[IR-9] A comprehensive multimodal dataset and benchmark for ulcerative colitis scoring in endoscopy
【速读】:该论文旨在解决溃疡性结肠炎(Ulcerative Colitis, UC)内镜图像自动评分与描述的临床需求,当前缺乏公开的专家标注数据集及可靠的基准测试方法,导致生成式AI在UC内镜图像分析中的应用受限。其关键解决方案是构建了一个多中心、多分辨率的标准化数据集,包含经专家验证的 Mayo Endoscopic Score (MES) 和 Ulcerative Colitis Endoscopic Index of Severity (UCEIS) 标注,并配以临床医生撰写的图像描述(image captions),首次实现了双评分指标与临床语义描述的联合标注,为开发具有临床意义的多模态算法提供了高质量训练与评估资源。
链接: https://arxiv.org/abs/2603.14559
作者: Noha Ghatwary,Jiangbei Yue,Ahmed Elgendy,Hanna Nagdy,Ahmed Galal,Hayam Fathy,Hussein El-Amin,Venkataraman Subramanian,Noor Mohammed,Gilberto Ochoa-Ruiz,Sharib Ali
机构: Arab Academy for Science and Technology (阿拉伯科技学院); University of Leeds (利兹大学); Queen’s University (女王大学); Alexandria University (亚历山大大学); Assiut university (艾斯尤特大学); Leeds Teaching Hospital NHS Trust (利兹教学医院NHS信托); Tecnologico de Monterrey (蒙特雷科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 11
Abstract:Ulcerative colitis (UC) is a chronic mucosal inflammatory condition that places patients at increased risk of colorectal cancer. Colonoscopic surveillance remains the gold standard for assessing disease activity, and reporting typically relies on standardised endoscopic scoring metrics. The most widely used is the Mayo Endoscopic Score (MES), with some centres also adopting the Ulcerative Colitis Endoscopic Index of Severity (UCEIS). Both are descriptive assessments of mucosal inflammation (MES: 0 to 3; UCEIS: 0 to 8), where higher values indicate more severe disease. However, computational methods for automatically predicting these scores remain limited, largely due to the lack of publicly available expert-annotated datasets and the absence of robust benchmarking. There is also a significant research gap in generating clinically meaningful descriptions of UC images, despite image captioning being a well-established computer vision task. Variability in endoscopic systems and procedural workflows across centres further highlights the need for multi-centre datasets to ensure algorithmic robustness and generalisability. In this work, we introduce a curated multi-centre, multi-resolution dataset that includes expert-validated MES and UCEIS labels, alongside detailed clinical descriptions. To our knowledge, this is the first comprehensive dataset that combines dual scoring metrics for classification tasks with expert-generated captions describing mucosal appearance and clinically accepted reasoning for image captioning. This resource opens new opportunities for developing clinically meaningful multimodal algorithms. In addition to the dataset, we also provide benchmarking using convolutional neural networks, vision transformers, hybrid models, and widely used multimodal vision-language captioning algorithms.
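论文强调数据集同时带有 MES(0–3)与 UCEIS(0–8)双评分标注及临床描述。下面用一个假设性的 Python 记录结构示意这类标签的范围校验;字段名与格式均为示意,并非论文官方数据格式:

```python
from dataclasses import dataclass

@dataclass
class UCRecord:
    """单张内镜图像的标注记录(示意结构,非论文官方格式)。"""
    image_id: str
    mes: int      # Mayo Endoscopic Score,取值 0-3
    uceis: int    # UCEIS,取值 0-8
    caption: str  # 临床医生撰写的图像描述

def validate(rec: UCRecord) -> bool:
    """检查两项评分是否落在论文给出的取值范围内。"""
    return 0 <= rec.mes <= 3 and 0 <= rec.uceis <= 8

rec = UCRecord("img_001", mes=2, uceis=5,
               caption="moderate erythema with erosions")
print(validate(rec))  # True
```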
[IR-10] Expert Mind: A Retrieval-Augmented Architecture for Expert Knowledge Preservation in the Energy Sector
【速读】:该论文旨在解决工业组织中领域专家离职导致的隐性知识(tacit knowledge)不可逆流失问题,尤其聚焦于能源行业中因员工老龄化而面临的经验传承危机。解决方案的关键在于构建一个名为Expert Mind的实验系统,该系统融合检索增强生成(Retrieval-Augmented Generation, RAG)、大语言模型(Large Language Models, LLMs)与多模态采集技术,通过结构化访谈、出声思维(think-aloud)会话和文本语料库摄入等方式获取专家知识,并将其嵌入向量存储中,最终以对话式接口实现可查询的知识表达。该方案不仅提升了知识转移效率与新员工入职速度,还将知情同意、知识产权和删除权等伦理维度作为核心设计约束纳入系统架构。
链接: https://arxiv.org/abs/2603.14541
作者: Diego Ezequiel Cervera
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 6 pages, 1 figure, conceptual architecture paper on retrieval-augmented expert knowledge systems
Abstract:The departure of subject-matter experts from industrial organizations results in the irreversible loss of tacit knowledge that is rarely captured through conventional documentation practices. This paper proposes Expert Mind, an experimental system that leverages Retrieval-Augmented Generation (RAG), large language models (LLMs), and multimodal capture techniques to preserve, structure, and make queryable the deep expertise of organizational knowledge holders. Drawing on the specific context of the energy sector, where decades of operational experience risk being lost to an aging workforce, we describe the system architecture, processing pipeline, ethical framework, and evaluation methodology. The proposed system addresses the knowledge elicitation problem through structured interviews, think-aloud sessions, and text corpus ingestion, which are subsequently embedded into a vector store and queried through a conversational interface. Preliminary design considerations suggest Expert Mind can significantly reduce knowledge transfer latency and improve onboarding efficiency. Ethical dimensions including informed consent, intellectual property, and the right to erasure are addressed as first-class design constraints.
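论文描述的流程是"采集—嵌入向量存储—对话式查询"。下面用纯 Python 的词袋向量给出一个 RAG 检索骨架的示意;embed 以词袋计数代替真实神经嵌入,VectorStore 等命名均为假设,仅用于演示流程:

```python
import math
from collections import Counter

def embed(text):
    # 词袋向量代替真实的神经嵌入,仅作流程示意
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class VectorStore:
    def __init__(self):
        self.docs = []

    def ingest(self, text):
        # 对应论文中的访谈/语料摄入阶段
        self.docs.append((text, embed(text)))

    def query(self, q, k=1):
        # 检索与提问最相关的专家知识片段
        qv = embed(q)
        ranked = sorted(self.docs, key=lambda d: cosine(qv, d[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

store = VectorStore()
store.ingest("turbine bearing overheating is usually caused by lubrication failure")
store.ingest("transformer oil sampling should be done quarterly")
print(store.query("why is the turbine bearing overheating"))
```

真实系统中,embed 会替换为神经嵌入模型,query 的结果再交给 LLM 生成对话式回答。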
[IR-11] LongVidSearch: An Agentic Benchmark for Multi-hop Evidence Retrieval Planning in Long Videos
【速读】:该论文旨在解决长视频问答(Long-Video QA)中多跳证据检索(multi-hop evidence retrieval)规划能力评估的难题。现有基准普遍存在静态性、缺乏严格的多跳约束以及未标准化的证据访问接口,导致难以区分检索规划失败与答案生成失败。解决方案的关键在于提出 LongVidSearch 基准,其核心创新包括:1)强制执行“检索必要性”——每个 Hop-k 问题必须恰好依赖 k 个连续但不相邻的证据片段,移除任一片段即无法解答;2)提供统一工具接口以固定检索后端,从而隔离并量化代理在查询制定与迭代检索规划上的表现;3)引入工具调用成本指标,实现对准确率与效率权衡的可控分析。该设计显著提升了评估的严谨性和可比性,揭示了当前模型在复杂多跳推理中的瓶颈。
链接: https://arxiv.org/abs/2603.14468
作者: Rongyi Yu,Chenyuan Duan,Wentao Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: 12 pages, 2 figures, appendix included
Abstract:Long video question answering (Long-Video QA) increasingly relies on agentic tool use to retrieve evidence from long videos. In realistic settings, this process often requires multi-hop retrieval, where agents must iteratively gather multiple discontinuous evidence clips. However, existing long-video benchmarks are largely static: they rarely enforce strict multi-hop retrieval and typically lack a standardized evidence-access interface, making it difficult to separate failures in retrieval planning from those in answer generation. To address this gap, we introduce LongVidSearch, a benchmark for evaluating agentic multi-hop evidence retrieval planning in long videos under standardized access constraints. LongVidSearch enforces retrieval necessity: a Hop-k question requires exactly k necessary evidence clips, and removing any single clip renders the question unsolvable. The benchmark contains 3,000 questions over 447 long videos (average length 26 minutes), covering four reasoning categories: State Mutation, Causal Inference, Global Summary, and Visual Tracking, with 2-hop, 3-hop, and 4-hop evidence requirements. To ensure fair and controlled evaluation, all agents interact with LongVidSearch through a unified tool interface, which fixes the retrieval backend and isolates the agent’s ability to formulate queries and plan iterative retrieval. In addition to answer accuracy, we measure tool-call cost to analyze the accuracy-efficiency trade-off under identical access conditions. We evaluate VideoAgent-style QA agents with multiple backbone LLMs using three-judge majority voting. GPT-5 achieves the highest accuracy (42.43), outperforming Gemini 3 Pro (30.97) and GPT-4o (19.20), yet remains below 50%, highlighting the difficulty of multi-hop retrieval planning. With gold evidence clips, performance becomes near-perfect, confirming retrieval planning as the primary bottleneck.
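基准中的两个机制——Hop-k"检索必要性"约束与三评审多数表决——都可以用几行 Python 表达。以下为假设性示意,solvable 判定器在真实基准中由人工标注保证:

```python
from collections import Counter

def majority_vote(judgments):
    """三评审多数表决(three-judge majority voting)。"""
    return Counter(judgments).most_common(1)[0][0]

def is_valid_hop_k(clips, k, solvable):
    """Hop-k 约束:恰好 k 个证据片段,且移除任一片段后问题不可解。
    solvable 是假设的判定函数,输入片段列表,返回是否可解。"""
    if len(clips) != k or not solvable(clips):
        return False
    return all(not solvable(clips[:i] + clips[i + 1:]) for i in range(k))

# 玩具判定器:仅当三个指定片段齐备时问题可解
needed = {"c1", "c2", "c3"}
solv = lambda clips: needed <= set(clips)

print(is_valid_hop_k(["c1", "c2", "c3"], 3, solv))        # True:满足检索必要性
print(is_valid_hop_k(["c1", "c2", "c3", "c4"], 3, solv))  # False:片段数不符
print(majority_vote(["correct", "correct", "wrong"]))     # correct
```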
人机交互
[HC-0] Do Metrics for Counterfactual Explanations Align with User Perception?
【速读】:该论文试图解决的问题是:当前广泛使用的反事实解释(counterfactual explanations)评估指标多为算法性指标,但这些指标是否真正反映了用户对解释质量的主观感知尚缺乏实证验证。解决方案的关键在于通过一项实证研究,直接比较算法指标与人类判断之间的关系,涵盖三个数据集中的多个维度的人类评分,并系统分析单个指标与组合指标在预测人类感知方面的有效性。结果表明,算法指标与人类评分的相关性普遍较弱且高度依赖数据集,且增加指标数量并不能显著提升预测可靠性,揭示了现有评估体系在捕捉人类关注的核心解释质量维度上的结构性局限。
链接: https://arxiv.org/abs/2603.15607
作者: Felix Liedeker,Basil Ell,Philipp Cimiano,Christoph Düsing
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted at the 4th World Conference on eXplainable Artificial Intelligence (XAI 2026)
Abstract:Explainability is widely regarded as essential for trustworthy artificial intelligence systems. However, the metrics commonly used to evaluate counterfactual explanations are algorithmic evaluation metrics that are rarely validated against human judgments of explanation quality. This raises the question of whether such metrics meaningfully reflect user perceptions. We address this question through an empirical study that directly compares algorithmic evaluation metrics with human judgments across three datasets. Participants rated counterfactual explanations along multiple dimensions of perceived quality, which we relate to a comprehensive set of standard counterfactual metrics. We analyze both individual relationships and the extent to which combinations of metrics can predict human assessments. Our results show that correlations between algorithmic metrics and human ratings are generally weak and strongly dataset-dependent. Moreover, increasing the number of metrics used in predictive models does not lead to reliable improvements, indicating structural limitations in how current metrics capture criteria relevant for humans. Overall, our findings suggest that widely used counterfactual evaluation metrics fail to reflect key aspects of explanation quality as perceived by users, underscoring the need for more human-centered approaches to evaluating explainable artificial intelligence.
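论文的核心分析是算法指标与人工评分之间的相关性。下面用纯 Python 实现皮尔逊相关系数并代入一组假设数据;数据为虚构,仅示意"相关性弱"这类结论如何计算得出:

```python
import math

def pearson(x, y):
    """皮尔逊相关系数,用于衡量算法指标与人工评分的一致程度。"""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# 虚构数据:某反事实距离指标得分 vs. 用户感知质量评分(1-5)
metric_scores = [0.9, 0.7, 0.4, 0.8, 0.2]
human_ratings = [2, 4, 3, 3, 4]
r = pearson(metric_scores, human_ratings)
print(round(r, 3))
```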
[HC-1] Can LLMs Model Incorrect Student Reasoning? A Case Study on Distractor Generation
【速读】:该论文旨在解决生成式 AI 在教育场景中建模学生错误理解(student misconceptions)的问题,特别是如何让大型语言模型(Large Language Models, LLMs)生成高质量的干扰项(distractors),即看似合理但实际错误的多选题选项。其解决方案的关键在于:LLMs 通过先正确求解问题,再模拟多种潜在的错误理解,并最终筛选出一组具有可接受性的干扰项,这一过程与学习科学中的最佳实践高度一致。研究进一步发现,若在提示中明确提供正确答案,可使生成的干扰项更贴近人工编写的水平(提升8%),表明锚定于正确解是生成合理错误推理的核心前提。
链接: https://arxiv.org/abs/2603.15547
作者: Yanick Zengaffinen,Andreas Opedal,Donya Rooein,Kv Aditya Srivatsa,Shashank Sonkar,Mrinmaya Sachan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Modeling plausible student misconceptions is critical for AI in education. In this work, we examine how large language models (LLMs) reason about misconceptions when generating multiple-choice distractors, a task that requires modeling incorrect yet plausible answers by coordinating solution knowledge, simulating student misconceptions, and evaluating plausibility. We introduce a taxonomy for analyzing the strategies used by state-of-the-art LLMs, examining their reasoning procedures and comparing them to established best practices in the learning sciences. Our structured analysis reveals a surprising alignment between their processes and best practices: the models typically solve the problem correctly first, then articulate and simulate multiple potential misconceptions, and finally select a set of distractors. An analysis of failure modes reveals that errors arise primarily from failures in recovering the correct solution and selecting among response candidates, rather than simulating errors or structuring the process. Consistent with these results, we find that providing the correct solution in the prompt improves alignment with human-authored distractors by 8%, highlighting the critical role of anchoring to the correct solution when generating plausible incorrect student reasoning. Overall, our analysis offers a structured and interpretable lens into LLMs’ ability to model incorrect student reasoning and produce high-quality distractors.
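论文观察到的生成流程是"先正确求解—再模拟误解—最后筛选"。下面用一个玩具 LLM 桩(stub)把这一管线骨架写成可运行的 Python;stub_llm 的行为完全是假设性的,仅用于演示流程结构:

```python
def generate_distractors(question, llm, n=3):
    """按论文归纳的流程组织:1) 先求解;2) 模拟误解;3) 筛选候选。
    llm 是一个假设的可调用接口,真实系统中由大语言模型实现。"""
    solution = llm(f"Solve: {question}")                       # 1. 先正确求解
    misconceptions = llm(f"Given the solution '{solution}', "  # 2. 模拟常见误解
                         f"list plausible student errors for: {question}")
    candidates = [llm(f"Answer under misconception '{m}': {question}")
                  for m in misconceptions]
    # 3. 筛选:去重并排除与正确答案相同的候选
    seen, distractors = set(), []
    for c in candidates:
        if c != solution and c not in seen:
            seen.add(c)
            distractors.append(c)
    return distractors[:n]

# 玩具 LLM 桩:对 "12 x 3" 固定返回求解/误解/作答结果
def stub_llm(prompt):
    if prompt.startswith("Solve"):
        return "36"
    if "list plausible student errors" in prompt:
        return ["added instead of multiplied", "dropped a digit"]
    if "added instead of multiplied" in prompt:
        return "15"
    return "32"

print(generate_distractors("12 x 3", stub_llm))  # ['15', '32']
```

论文的发现(在提示中提供正确解可提升干扰项质量)对应于第 1 步的锚定作用。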
[HC-2] Clinically Aware Synthetic Image Generation for Concept Coverage in Chest X-ray Models
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在临床部署中面临的挑战:即现有胸部X光(chest radiograph)数据集系统性地低估了关键临床特征组合,导致模型在高风险场景下训练不足,进而影响其鲁棒性和可信度。解决方案的关键在于提出一种名为CARS(Clinically Aware and Anatomically Grounded Synthesis)的框架,该框架通过有原则的合成图像生成策略,在保持解剖结构完整性的前提下,对临床特征向量进行定向扰动,实现病理发现的可控插入与删除,从而提升特征空间的覆盖范围。实验表明,基于CARS生成的数据微调模型能显著改善精确率-召回率性能、降低预测不确定性并增强校准效果,同时通过专家放射科医师独立评估验证了合成图像的真实性和临床一致性,证明其在不损害临床完整性的情况下有效提升了胸片分类系统的性能与可信度。
链接: https://arxiv.org/abs/2603.15525
作者: Amy Rafferty,Rishi Ramaesh,Ajitha Rajan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:
Abstract:The clinical deployment of AI diagnostic models demands more than benchmark accuracy - it demands robustness across the full spectrum of disease presentations. However, publicly available chest radiographic datasets systematically underrepresent critical clinical feature combinations, leaving models under-trained precisely where clinical stakes are highest. We present CARS, a clinically aware and anatomically grounded framework that addresses this gap through principled synthetic image generation. CARS applies targeted perturbations to clinical feature vectors, enabling controlled insertion and deletion of pathological findings while explicitly preserving anatomical structure. We evaluate CARS across seven backbone architectures by fine-tuning models on synthetic subsets and testing on a held-out MIMIC-CXR benchmark. Compared to prior feature perturbation approaches, fine-tuning on CARS-generated images consistently improves precision-recall performance, reduces predictive uncertainty, and improves model calibration. Structural and semantic analyses demonstrate high anatomical fidelity, strong feature alignment, and low semantic uncertainty. Independent evaluation by two expert radiologists further confirms realism and clinical agreement. As the field moves toward regulated clinical AI, CARS demonstrates that anatomically faithful synthetic data generation for better feature space coverage is a viable and effective strategy for improving both the performance and trustworthiness of chest X-ray classification systems - without compromising clinical integrity.
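CARS 的核心操作是对临床特征向量做定向扰动(插入/删除病理发现)。下面给出一个极简的 Python 示意,特征名与二值编码均为假设,真实的特征集与扰动策略见论文:

```python
def perturb(features, insert=None, delete=None):
    """对临床特征向量做定向扰动:插入或删除指定病理发现。
    特征名与取值均为示意,真实特征集见论文。"""
    out = dict(features)  # 不修改原向量
    for f in insert or []:
        out[f] = 1
    for f in delete or []:
        out[f] = 0
    return out

base = {"cardiomegaly": 0, "pleural_effusion": 1, "pneumothorax": 0}
print(perturb(base, insert=["cardiomegaly"], delete=["pleural_effusion"]))
```

真实框架中,扰动后的特征向量再驱动图像生成,并以解剖结构保持为约束。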
[HC-3] The Social Sycophancy Scale: A psychometrically validated measure of sycophancy
【速读】:该论文旨在解决生成式 AI(Generative AI)在人际互动场景中表现出的社会谄媚行为(Social Sycophancy)缺乏可量化测量工具的问题。当前研究多聚焦于有明确对错答案的任务(如编程),而忽视了AI在情感支持等无客观标准情境下的行为倾向。其解决方案的关键在于开发并验证了一个基于LLM行为而非依赖地面真实(ground truth)的、经心理测量学验证(psychometrically validated)的量表——Social Sycophancy Scale(SSS),该量表通过三因子结构(无批判性附和、奉承迎合、兴奋感)刻画了AI谄媚的核心维度,并通过四次实证研究(N=877)确认其信效度,同时揭示了谄媚与共情之间的复杂关联,为后续研究AI设计伦理与用户感知提供了严谨的测量基础。
链接: https://arxiv.org/abs/2603.15448
作者: Jean Rehani,Victoria Oldemburgo de Mello,Dariya Ovsyannikova,Ashton Anderson,Michael Inzlicht
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 35 pages, 1 figure, 5 tables. For supplementary material, see this https URL Author Contributions: J.R and M.I conceived the study design and research questions. J.R and D.O programmed the experimental iterations, collected, and cleaned the data. J.R and V.O.M analyzed the data. J.R wrote the manuscript. All authors edited the manuscript and provided oversight of and feedback on the work
Abstract:Large Language Model (LLM) sycophancy is a growing concern. The current literature has largely examined sycophancy in contexts with clear right and wrong answers, like coding. However, AI is increasingly being used for emotional support and interpersonal conversation, where no such ground truth exists. Building on a previous conceptualization of Social Sycophancy, this paper provides a psychometrically validated measure of sycophancy that relies on LLM behavior rather than comparisons with ground truth. We developed and validated the Social Sycophancy Scale in three samples (N = 877) and tested its applicability with automated methods. In each study, participants read conversations between an LLM and a user and rated the chatbot on a battery of items. Study 1 investigated an initial item pool derived from dictionary definitions and previous literature, serving as the explorative base for the following studies. In Study 2, we used a revised item set to establish our scale, which was subsequently confirmed in Study 3 and tested using LLM raters in Study 4. Across studies, the data support a 3 factor structure (Uncritical Agreement, Obsequiousness, and Excitement) with an underlying sycophantic construct. LLMs prompt tuned to be highly sycophantic scored higher than their low sycophancy counterparts on both overall sycophancy and its three facets across Studies 2 to 4. The nomological network of sycophancy revealed a consistent link with empathy, a pairing that raises uncomfortable questions about AI design, and a multivalent pattern: one facet was associated with favorable perceptions (Excitement), another unfavorable (Obsequiousness), and a third ambiguous (Uncritical Agreement). The Social Sycophancy Scale gives researchers the means to study sycophancy rigorously, and confront a genuine design tension: the warmth and empathy we want from AI may be precisely what makes it sycophantic.
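量表为三因子结构(Uncritical Agreement、Obsequiousness、Excitement)。下面用 Python 示意此类量表的常规计分方式(按因子对题项取均值);题项编号与映射均为假设,真实题项见论文补充材料:

```python
# 三因子与其题项的假设性映射(真实题项见论文补充材料)
FACETS = {
    "uncritical_agreement": ["ua1", "ua2"],
    "obsequiousness": ["ob1", "ob2"],
    "excitement": ["ex1", "ex2"],
}

def score_sycophancy(responses):
    """按因子取题项均值,再求总体谄媚得分(各因子均值的平均)。"""
    facet_scores = {f: sum(responses[i] for i in items) / len(items)
                    for f, items in FACETS.items()}
    facet_scores["overall"] = sum(facet_scores.values()) / len(FACETS)
    return facet_scores

# 假设某参与者对 6 个题项的 1-5 级评分
ratings = {"ua1": 5, "ua2": 4, "ob1": 2, "ob2": 3, "ex1": 4, "ex2": 4}
print(score_sycophancy(ratings))
```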
[HC-4] Exploring Human Quadruped Locomotion for Exergames
【速读】:该论文旨在解决当前健身游戏(exergame)中对非人类角色的具身交互(embodied interaction)和腹部肌肉锻炼的探索不足问题。其解决方案的关键在于设计并实现了一种基于计算机视觉的新型四足动物运动健身游戏系统,玩家仰卧于地面,通过四肢动作控制虚拟老虎角色移动,模拟如自行车卷腹等核心肌群训练动作;系统采用上方部署的Kinect传感器进行无穿戴式动作捕捉,适用于无人值守的商业场景(如室内活动公园),并通过用户研究验证了其直观性、控制精度与沉浸感,表明自然肢体运动可有效生成响应式的四足虚拟行为,并使玩家在高强度核心训练中产生显著的游戏沉浸体验,从而掩盖身体疲劳感。
链接: https://arxiv.org/abs/2603.15428
作者: Shamit Ahmed,Perttu Hämäläinen
机构: Aalto University (阿尔托大学)
类目: Human-Computer Interaction (cs.HC)
备注: 24 pages, 9 figures
Abstract:Embodying non-human characters and exercising abdominal muscles are both underexplored in exergames. We address this by describing the design and evaluation of a novel human quadruped locomotion exergame. In the game, the player lies supine on the ground and moves their arms and legs to control a quadrupedal character (a tiger), similar to common bodyweight abdominal muscle exercises such as the Bicycle Crunch. The motion tracking is computer vision-based, utilizing a Kinect sensor placed above the player, which makes our approach suitable for commercial premises such as indoor activity parks where a system needs to run unattended and without any wearable components. Our system extends embodied interaction beyond traditional bipedal or controller-based systems, demonstrating how natural limb movements can generate responsive and immersive quadrupedal motion within virtual environments. We conducted a user study (N=15) to evaluate the system’s intuitiveness, control, and overall player experience. The findings demonstrate the usability and potential of our system, highlighting its intense physical nature. Participants reported that gameplay immersion masked physical exertion, allowing them to perceive rigorous core training primarily as play.
[HC-5] Multimodal Cyber-physical Interaction in XR: Hybrid Doctoral Thesis Defense
【速读】:该论文旨在解决学术活动(如博士论文答辩)中参与方式受限的问题,传统模式仅支持物理现场或平面视频会议,导致参与形式僵化且存在空间割裂。其解决方案的关键在于提出一个多元模态框架,通过整合全身动作捕捉技术实现用户化身的动作与手势同步,从而在虚拟环境中支持自然交互,并借助WebXR技术提供跨平台、即开即用的访问能力,使参与者能够从实体到场、沉浸式虚拟现实(VR)到浏览器访问等多种方式无缝参与,有效提升了混合型学术活动的互动性与包容性。
链接: https://arxiv.org/abs/2603.15392
作者: Ahmad Alhilal,Kit Yung Lam,Lik-Hang Lee,Xuetong Wang,Sijia Li,Matti Siekkinen,Tristan Braud,Pan Hui
机构: Hong Kong University of Science and Technology, Hong Kong; Aalto University, Espoo, Finland; Lucerne School of Computer Science and Information Technology, Rotkreuz, Switzerland; Hong Kong Polytechnic University, Hong Kong; Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
类目: Multimedia (cs.MM); Human-Computer Interaction (cs.HC)
备注: 10 pages, 3 figures, magazine paper
Abstract:Academic events, such as a doctoral thesis defense, are typically limited to either physical co-location or flat video conferencing, resulting in rigid participation formats and fragmented presence. We present a multimodal framework that breaks this binary by supporting a spectrum of participation - from in-person attendance to immersive virtual reality (VR) or browser access - and report our findings from using it to organize the first ever hybrid doctoral thesis defense using extended reality (XR). The framework integrates full-body motion tracking to synchronize the user’s avatar motions and gestures, enabling natural interaction with onsite participants as well as body language and gestures with remote attendees in the virtual world. It leverages WebXR to provide cross-platform and instant accessibility with easy setup. User feedback analysis reveals positive VR experiences and demonstrates the framework’s effectiveness in supporting various hybrid event activities.
[HC-6] To be FAIR or RIGHT? Methodological [R]esearch [I]ntegrity [G]iven [H]uman-facing [T]echnologies using the example of Learning Technologies
【速读】:该论文旨在解决研究软件工程(Research Software Engineering, RSE)质量评估中对有效性(validity)维度缺乏系统框架的问题。现有研究主要聚焦于可靠性(reliability)和FAIR原则(可发现性、可访问性、可互操作性、可重用性),但未充分覆盖研究方法学上的有效性保障。为此,作者提出了RIGHT框架——一个基于理论迁移与过程建模构建的新型质量评估体系,其关键在于整合模拟研究、基于设计的研究、软件工程及实证社会科学等领域的成熟模型,从而为面向人类的RSE(如学习技术领域)提供可操作的有效性评估路径,并通过两个案例验证其实践价值。
链接: https://arxiv.org/abs/2603.15366
作者: Julian Dehne
机构: 未知
类目: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
备注:
Abstract:Quality assessment of Research Software Engineering (RSE) plays an important role in all scientific fields. From the canonical three criteria (reliability, validity, and objectivity) previous research has focussed on reliability and the FAIR principles. The RIGHT framework is introduced to fill the gap of existing frameworks for the validity aspect. The framework is constructed using the methods of theory transfer and process modelling. It is based on existing models of simulation research, design-based research, software engineering and empirical social sciences. The paper concludes with two case studies drawn from the field of learning technologies to illustrate the practical relevance of the framework for human-facing RSE.
[HC-7] The Impact of AI-Assisted Development on Software Security: A Study of Gemini and Developer Experience
【速读】:该论文试图解决的问题是:在安全关键型软件开发中,由于熟练开发者短缺,组织日益依赖生成式 AI (Generative AI) 工具以提升生产力并降低对有限人类专家的依赖,但目前尚不明确开发者的通用编程经验与安全特定经验,以及所使用的 AI 工具类型(免费版 vs. 付费版)如何共同影响最终代码的安全性。解决方案的关键在于通过一项定量编程实验(n=159),系统评估 Google 的 AI 工具 Gemini 在不同版本下对代码安全性的影响,并发现尽管使用 Gemini 并未显著提升代码安全性,开发者的编程经验仍能显著提高代码安全性,且这种经验无法被 Gemini 完全替代。
链接: https://arxiv.org/abs/2603.15298
作者: Nadine Jost,Benjamin Berens,Manuel Karl,Stefan Albert Horstmann,Martin Johns,Alena Naiakshina
机构: Ruhr University Bochum(鲁尔大学波鸿分校); Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院); TU Braunschweig(不伦瑞克工业大学); University of Cologne(科隆大学)
类目: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
备注:
Abstract:The ongoing shortage of skilled developers, particularly in security-critical software development, has led organizations to increasingly adopt AI-powered development tools to boost productivity and reduce reliance on limited human expertise. These tools, often based on large language models, aim to automate routine tasks and make secure software development more accessible and efficient. However, it remains unclear how developers’ general programming and security-specific experience, and the type of AI tool used (free vs. paid), affect the security of the resulting software. Therefore, we conducted a quantitative programming study with software developers (n=159) exploring the impact of Google’s AI tool Gemini on code security. Participants were assigned a security-related programming task using either no AI tools, the free version, or the paid version of Gemini. While we did not observe significant differences between the Gemini conditions in terms of secure software development, programming experience significantly improved code security and cannot be fully substituted by Gemini.
[HC-8] Practicing with Language Models Cultivates Human Empathic Communication
【速读】:该论文旨在解决人类在情感交流中表达共情能力不足的问题,即人们虽然具备共情感受(empathy),但往往难以有效传达这种情感,导致接收方感知到的共情程度低于实际水平。研究通过构建一个名为“Lend an Ear”的实验对话平台,让参与者与扮演个人或职场困境的大型语言模型(LLM)进行自然对话,并基于33,938条消息提炼出一套数据驱动的共情表达分类体系。其解决方案的关键在于设计了一种简短的、个性化的LLM辅导干预措施——向参与者提供针对其共情表达模式的具体反馈,从而显著提升其沟通方式与规范性共情表达模式的一致性,优于对照组和非个性化视频反馈组。此外,研究还揭示了“沉默共情效应”(silent empathy effect),即个体虽感受到共情却未能充分表达,且能准确识别符合共情标准的回应,为利用AI技术规模化培养共情能力提供了实证支持与可操作路径。
链接: https://arxiv.org/abs/2603.15245
作者: Aakriti Kumar,Nalin Poungpeth,Diyi Yang,Bruce Lambert,Matthew Groh
机构: 未知
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
Abstract:Empathy is central to human connection, yet people often struggle to express it effectively. In blinded evaluations, large language models (LLMs) generate responses that are often judged more empathic than human-written ones. Yet when a response is attributed to AI, recipients feel less heard and validated than when comparable responses are attributed to a human. To probe and address this gap in empathic communication skill, we built Lend an Ear, an experimental conversation platform in which participants are asked to offer empathic support to an LLM role-playing personal and workplace troubles. From 33,938 messages spanning 2,904 text-based conversations between 968 participants and their LLM conversational partners, we derive a data-driven taxonomy of idiomatic empathic expressions in naturalistic dialogue. Based on a pre-registered randomized experiment, we present evidence that a brief LLM coaching intervention offering personalized feedback on how to effectively communicate empathy significantly boosts alignment of participants’ communication patterns with normative empathic communication patterns relative to both a control group and a group that received video-based but non-personalized feedback. Moreover, we find evidence for a silent empathy effect that people feel empathy but systematically fail to express it. Nonetheless, participants reliably identify responses aligned with normative empathic communication criteria as more expressive of empathy. Together, these results advance the scientific understanding of how empathy is expressed and valued and demonstrate a scalable, AI-based intervention for scaffolding and cultivating it.
[HC-9] Where Digital Meets Place: Deriving Strategies for Curating Mixed Reality Exhibitions in Public Spaces
【速读】:该论文旨在解决如何在公共空间中有效嵌入混合现实(Mixed Reality, MR)展览并提升观众体验的策展策略问题。其解决方案的关键在于以情境主义(contextualism)为核心理念进行策展设计,通过跨学科专家焦点小组与普通用户研究相结合的方式,识别出MR展览在城市公共空间中的机遇、挑战及设计策略,从而揭示情境主义在构建沉浸式体验中的基础性作用。
链接: https://arxiv.org/abs/2603.15163
作者: Yawei Zhao,Jiaxin Liang,Hao Li,Pan Hui
机构: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州) ); The Hong Kong University of Science and Technology(香港科技大学)
类目: Human-Computer Interaction (cs.HC)
备注: 23 pages, 14 figures, Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26)
Abstract:Mixed Reality (MR) technologies are increasingly being used to enrich exhibitions and public spaces by blending digital content with the physical environment in real time. However, little is known about curatorial strategies for embedding MR exhibitions into public spaces or promoting audience experiences. To explore this, we designed and curated a campus-based MR art exhibition, using contextualism as the fundamental concept. We conducted an interdisciplinary expert focus group alongside exhibition viewing to identify opportunities, challenges, and design strategies from multiple perspectives. In parallel, we conducted user studies with general audiences to examine how curatorial strategies foster experiential qualities. Our findings reveal insights from both experts and general users along with strategies in curating MR exhibitions and highlight the foundational role of contextualism in curating MR art exhibitions in urban public spaces.
[HC-10] ReactMotion: Generating Reactive Listener Motions from Speaker Utterance
【速读】:该论文旨在解决生成式 AI (Generative AI) 中非语言反应行为建模难题,具体聚焦于从说话者话语中生成自然且恰当的听者身体动作(Reactive Listener Motion Generation),这一任务因人类反应的非确定性而极具挑战。解决方案的关键在于提出 ReactMotionNet 数据集与 ReactMotion 框架:前者通过多候选听者动作及其适当性标注捕捉“一因多果”的听者行为特性,并提供超越单一真值动作的监督信号;后者是一个统一的生成框架,联合建模文本、音频、情感与运动信息,采用偏好导向的目标函数训练,从而生成既合适又多样化的听者反应动作,在多项实验中显著优于检索基线和级联大语言模型(LLM)流水线。
链接: https://arxiv.org/abs/2603.15083
作者: Cheng Luo,Bizhu Wu,Bing Li,Jianfeng Ren,Ruibin Bai,Rong Qu,Linlin Shen,Bernard Ghanem
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM); Sound (cs.SD)
备注: 42 pages, 11 tables, 8 figures
Abstract:In this paper, we introduce a new task, Reactive Listener Motion Generation from Speaker Utterance, which aims to generate naturalistic listener body motions that appropriately respond to a speaker’s utterance. However, modeling such nonverbal listener behaviors remains underexplored and challenging due to the inherently non-deterministic nature of human reactions. To facilitate this task, we present ReactMotionNet, a large-scale dataset that pairs speaker utterances with multiple candidate listener motions annotated with varying degrees of appropriateness. This dataset design explicitly captures the one-to-many nature of listener behavior and provides supervision beyond a single ground-truth motion. Building on this dataset design, we develop preference-oriented evaluation protocols tailored to evaluate reactive appropriateness, where conventional motion metrics focusing on input-motion alignment ignore. We further propose ReactMotion, a unified generative framework that jointly models text, audio, emotion, and motion, and is trained with preference-based objectives to encourage both appropriate and diverse listener responses. Extensive experiments show that ReactMotion outperforms retrieval baselines and cascaded LLM-based pipelines, generating more natural, diverse, and appropriate listener motions.
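论文称 ReactMotion 采用 preference-based objectives 训练。下面以常见的 Bradley-Terry 式偏好损失为例做数值示意;这只是偏好优化目标的一种通用形式,未必是论文的具体损失函数:

```python
import math

def preference_loss(score_preferred, score_other):
    """Bradley-Terry 式偏好损失:-log sigmoid(s_w - s_l)。
    s_w 为"更合适"动作的模型打分,s_l 为较差动作的打分。
    注意:这是偏好优化中常见的目标形式,论文的具体损失可能不同。"""
    return -math.log(1.0 / (1.0 + math.exp(-(score_preferred - score_other))))

# 模型给"更合适的听者动作"打分更高时,损失更小
print(round(preference_loss(2.0, 0.5), 4))  # 打分方向正确,损失较小
print(round(preference_loss(0.5, 2.0), 4))  # 打分方向颠倒,损失较大
```

这种目标利用了数据集中"多候选动作 + 适当性标注"的结构,使模型学到偏好排序而非逼近单一真值动作。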
[HC-11] Policies frozen in silicon: using WPR to expose the politics of problem-solution configurations in technical artifacts
【速读】:该论文试图解决设计实践中过度依赖“设计即问题解决”这一意识形态所带来的局限性,尤其是其导致的技术方案主义(technological solutionism)倾向,即将复杂的社会与人类挑战简化为可通过技术手段直接解决的“易处理问题”。论文的核心解决方案在于引入来自批判政策研究的“What’s the Problem Represented to be?”(WPR)分析框架,将技术 artifacts 视为被物质化的“问题表征”,从而系统揭示嵌入其中的意识形态、文化与政治假设。通过这一方法,设计研究能够超越单纯的技术应对逻辑,转向对问题本身的建构过程进行反思性审视,进而推动更具批判性和伦理意识的设计实践,并深化设计理论与技术哲学的对话。
链接: https://arxiv.org/abs/2603.15027
作者: Jörgen Behrendtz,Lina Rahm
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Position paper for the CHIdeology workshop at CHI 2026, Barcelona. this https URL
Abstract:Design is often characterized as an act of problem-solving. This is a perspective that, while pervasive, risks reducing complex socio-technical conditions to easily fixable issues. This paper critiques the ideology of “design as problem-solving”, highlighting its culmination in technological solutionism, where societal and human challenges are reframed as technical problems awaiting technical answers. Drawing on critiques and the recognition of “wicked problems”, we argue that design must also be understood as a process of problem-framing, emphasizing the interpretive work involved in defining what counts as a problem and why. To advance this analytical perspective, we propose applying the What’s the Problem Represented to be? (WPR) approach from critical policy studies to design and technology. By treating artifacts as materialized problem representations, WPR allows for the systematic unpacking of the ideological, cultural, and political assumptions encoded in technological forms. This analytical lens can reveal hidden problematisations within artifacts, foster reflexive design practice, and empirically challenge techno-solutionism. Ultimately, integrating WPR into design research enriches both design theory and philosophy of technology by offering a method to interrogate how technologies shape, and are shaped by, the questions they claim to answer.
[HC-12] Customizing ChatGPT for Second Language Speaking Practice: Genuine Support or Just a Marketing Gimmick?
【速读】:该论文旨在解决如何通过定制化提示工程(prompt engineering)提升生成式 AI(Generative AI)在英语作为第二语言(ESL)口语练习中的教学效能问题。研究聚焦于 ChatGPT 的 Voice Mode 功能,对比四种版本(未定制标准模式、未定制高级模式、定制标准模式、定制高级模式)在反馈平衡性、情感支持和文化响应性等方面的差异。解决方案的关键在于运用动机理论(Motivation Theory)、文化响应式教学(Culturally Responsive Teaching, CRT)、交际语言教学法(Communicative Language Teaching, CLT)及情感过滤假说(Affective Filter Hypothesis)等教育理论指导提示设计,从而优化 AI 的交互质量与学习者体验。结果表明,定制化显著提升了反馈的均衡性和情感支持水平,有助于营造积极的学习环境,但文化响应性未达预期,凸显了提示工程与人工智能素养在最大化 AI 教学潜力中的核心作用。
链接: https://arxiv.org/abs/2603.14884
作者: Fanfei Meng
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注: Short paper accepted at the International Conference of the Learning Sciences (ICLS) 2025, International Society of the Learning Sciences
Abstract:ChatGPT, with its customization features and Voice Mode, has the potential for more engaging and personalized ESL (English as a Second Language) education. This study examines the efficacy of customized ChatGPT conversational features in facilitating ESL speaking practices, comparing the performance of four versions of ChatGPT Voice Mode: uncustomized Standard mode, uncustomized Advanced mode, customized Standard mode, and customized Advanced mode. Customization was guided by prompt engineering principles and grounded in relevant theories, including Motivation Theory, Culturally Responsive Teaching (CRT), Communicative Language Teaching (CLT), and the Affective Filter Hypothesis. Content analysis found that customized versions generally provided more balanced feedback and emotional support, contributing to a positive and motivating learning environment. However, cultural responsiveness did not show significant improvement despite targeted customization efforts. These initial findings suggest that customization could enhance ChatGPT’s capacity as a more effective language tutor, with the standard model already capable of meeting the learning needs. The study underscores the importance of prompt engineering and AI literacy in maximizing AI’s potential in language learning.
[HC-13] Knowledge Activation: AI Skills as the Institutional Knowledge Primitive for Agentic Software Development
【速读】:该论文旨在解决企业软件组织中关键制度性知识(如架构决策、部署流程、合规政策、事件应对手册等)因存储于人类可读格式而难以被AI代理或新入职工程师有效利用的问题,从而导致任务执行依赖猜测、纠错链式反应及资深工程师负担过重。其解决方案的核心是提出“知识激活”(Knowledge Activation)框架,将AI技能(AI Skills)标准转化为结构化的、具备治理意识的原子知识单元(Atomic Knowledge Units, AKUs),这些AKUs以行动就绪的规范形式编码操作内容、工具选择、约束条件和下一步路径,使AI代理能直接执行任务,工程师获得基于组织上下文的精准指导,无需重构背景信息;AKUs构成可组合的知识图谱,在运行时由代理遍历,显著压缩入职周期、降低跨团队摩擦并消除纠错链。
链接: https://arxiv.org/abs/2603.14805
作者: Gal Bakal
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注: Preprint. 10 sections, 11 figures. Submitted March 2026
Abstract:Enterprise software organizations accumulate critical institutional knowledge - architectural decisions, deployment procedures, compliance policies, incident playbooks - yet this knowledge remains trapped in formats designed for human interpretation. The bottleneck to effective agentic software development is not model capability but knowledge architecture. When any knowledge consumer - an autonomous AI agent, a newly onboarded engineer, or a senior developer - encounters an enterprise task without institutional context, the result is guesswork, correction cascades, and a disproportionate tax on senior engineers who must manually supply what others cannot infer. This paper introduces Knowledge Activation, a framework that specializes AI Skills - the open standard for agent-consumable knowledge - into structured, governance-aware Atomic Knowledge Units (AKUs) for institutional knowledge delivery. Rather than retrieving documents for interpretation, AKUs deliver action-ready specifications encoding what to do, which tools to use, what constraints to respect, and where to go next - so that agents act correctly and engineers receive institutionally grounded guidance without reconstructing organizational context from scratch. AKUs form a composable knowledge graph that agents traverse at runtime - compressing onboarding, reducing cross-team friction, and eliminating correction cascades. The paper formalizes the resource constraints that make this architecture necessary, specifies the AKU schema and deployment architecture, and grounds long-term maintenance in knowledge commons practice. Organizations that architect their institutional knowledge for the agentic era will outperform those that invest solely in model capability.
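AKU 的完整 schema 需参见论文正文。这里仅按摘要所述的四类信息("做什么、用什么工具、遵守什么约束、下一步去哪")给出一个假想的最小数据结构示意,字段名均为示例而非论文原文:

```python
from dataclasses import dataclass, field

@dataclass
class AtomicKnowledgeUnit:
    """假想的 AKU 结构示意(论文未在摘要中给出具体 schema,字段名为示例):
    将制度性知识编码为可被智能体直接消费的原子单元。"""
    task: str                                        # what to do:做什么
    tools: list = field(default_factory=list)        # which tools to use:用什么工具
    constraints: list = field(default_factory=list)  # what constraints to respect:遵守什么约束
    next_units: list = field(default_factory=list)   # where to go next:知识图谱中的后继单元
```

这些单元通过 `next_units` 互相链接,即摘要中所说的"可在运行时被智能体遍历的可组合知识图谱"。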
[HC-14] ViDscribe: Multimodal AI for Customizing Audio Description and Question Answering in Online Videos
【速读】:该论文试图解决盲人及低视力(Blind and Low Vision, BLV)用户在获取视频内容时面临的可访问性问题,即传统人工编写的音频描述(Audio Description, AD)成本高、难以规模化,而现有基于人工智能(AI)的自动AD系统缺乏个性化适配能力且多在受控环境下评估,无法满足BLV用户在真实场景中的多样化需求。解决方案的关键在于提出ViDscribe平台,该平台集成生成式AI驱动的AD与六类用户自定义选项,并引入对话式视频问答(Conversational Video Question Answering, VQA)接口,使BLV用户能够根据自身偏好动态调整音频描述内容并实时交互,从而提升视频内容的可及性、有效性与沉浸感。
链接: https://arxiv.org/abs/2603.14662
作者: Maryam Cheema,Sina Elahimanesh,Pooyan Fazli,Hasti Seifi
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: CHI EA 2026
Abstract:Advances in multimodal large language models enable automatic video narration and question answering (VQA), offering scalable alternatives to labor-intensive, human-authored audio descriptions (ADs) for blind and low vision (BLV) viewers. However, prior AI-driven AD systems rarely adapt to the diverse needs and preferences of BLV individuals across videos and are typically evaluated in controlled, single-session settings. We present ViDscribe, a web-based platform that integrates AI-generated ADs with six types of user customizations and a conversational VQA interface for YouTube videos. Through a longitudinal, in-the-wild study with eight BLV participants, we examine how users engage with customization and VQA features over time. Our results show sustained engagement with both features and that customized ADs improve effectiveness, enjoyment, and immersion compared to default ADs, highlighting the value of personalized, interactive video access for BLV users.
[HC-15] The Scenic Route to Deception: Dark Patterns and Explainability Pitfalls in Conversational Navigation
【速读】:该论文旨在解决生成式 AI(Generative AI)在行人导航场景中引入的新型风险问题,即传统可验证的几何路径规划任务正逐渐被不可解释、具说服力的对话交互所取代,从而引发操纵风险和用户信任错位。解决方案的关键在于提出一种“有缝设计”(seamful design)策略,并主张通过神经符号架构(neuro-symbolic architecture)来实现可信的对话式导航:该架构将可验证的路径规划算法作为基础,赋予生成式 AI 的说服能力以明确的边界与透明性,确保系统不仅能清晰说明导航路线,也能如实揭示其自身局限性和激励机制。
链接: https://arxiv.org/abs/2603.14586
作者: Ilya Ilyankou,Stefano Cavazzi,James Haworth
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:As pedestrian navigation increasingly experiments with Generative AI, and in particular Large Language Models, the nature of routing risks transforming from a verifiable geometric task into an opaque, persuasive dialogue. While conversational interfaces promise personalisation, they introduce risks of manipulation and misplaced trust. We categorise these risks using a 2x2 framework based on intent and origin, distinguishing between intentional manipulations (dark patterns) and unintended harms (explainability pitfalls). We propose seamful design strategies to mitigate these harms. We suggest that one robust way to operationalise trustworthy conversational navigation is through neuro-symbolic architecture, where verifiable pathfinding algorithms ground GenAI’s persuasive capabilities, ensuring systems explain their limitations and incentives as clearly as they explain the route.
计算机视觉
[CV-0] Towards Generalizable Robotic Manipulation in Dynamic Environments
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在动态环境中的操作性能不足问题,其核心挑战在于现有数据集缺乏动态操作场景的覆盖,且主流VLA模型依赖单帧观测,限制了其时空推理能力。解决方案的关键在于提出DOMINO——一个大规模、多任务、多层次的动态操作数据集与基准测试平台,并设计PUMA架构:该架构通过引入场景中心的历史光流信息和专用世界查询机制,隐式预测物体中心的未来状态,从而实现历史感知与短时预测的耦合,显著提升模型对动态环境的适应性与泛化能力。实验表明,PUMA在动态任务上相比基线模型成功率达6.3%绝对提升,且动态训练获得的时空表征可迁移至静态任务,增强整体鲁棒性。
链接: https://arxiv.org/abs/2603.15620
作者: Heng Fang,Shangru Li,Shuhan Wang,Xuanyang Xi,Dingkang Liang,Xiang Bai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks. All code and data are available at this https URL.
[CV-1] Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人操作任务中,因缺乏对视觉信息如何被有效融入动作生成过程的理解而导致的动作预测准确性不足的问题。现有方法多将大语言模型(Large Language Model, LLM)视为黑箱,难以实现视觉信息与语言指令的精准对齐。解决方案的关键在于提出DeepVision-VLA框架,其核心创新包括:一是基于视觉-语言混合Transformer(Vision-Language Mixture-of-Transformers, VL-MoT)架构,通过共享注意力机制将视觉专家模型的多层次特征注入VLA主干网络的深层,增强视觉表征能力;二是引入动作引导的视觉剪枝(Action-Guided Visual Pruning, AGVP)策略,利用浅层注意力机制剔除无关视觉token、保留任务相关线索,在不显著增加计算开销的前提下强化关键视觉提示,从而提升复杂操作任务中的精度与鲁棒性。
链接: https://arxiv.org/abs/2603.15618
作者: Yulin Luo,Hao Chen,Zhuangzhe Wu,Bowen Sui,Jiaming Liu,Chenyang Gu,Zhuoyang Liu,Qiuxuan Feng,Jiale Yu,Shuo Gu,Peng Jia,Pheng-Ann Heng,Shanghang Zhang
机构: Peking University (北京大学); The Chinese University of Hong Kong (香港中文大学); Simplexity Robotics
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for robotic manipulation, in which reliable action prediction critically depends on accurately interpreting and integrating visual observations conditioned on language instructions. Although recent works have sought to enhance the visual capabilities of VLA models, most approaches treat the LLM backbone as a black box, providing limited insight into how visual information is grounded into action generation. Therefore, we perform a systematic analysis of multiple VLA models across different action-generation paradigms and observe that sensitivity to visual tokens progressively decreases in deeper layers during action generation. Motivated by this observation, we propose DeepVision-VLA, built on a Vision-Language Mixture-of-Transformers (VL-MoT) framework. This framework enables shared attention between the vision foundation model and the VLA backbone, injecting multi-level visual features from the vision expert into deeper layers of the VLA backbone to enhance visual representations for precise and complex manipulation. In addition, we introduce Action-Guided Visual Pruning (AGVP), which leverages shallow-layer attention to prune irrelevant visual tokens while preserving task-relevant ones, reinforcing critical visual cues for manipulation with minimal computational overhead. DeepVision-VLA outperforms prior state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively, providing new insights for the design of visually enhanced VLA models.
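摘要中 AGVP 的核心思想是利用浅层注意力分数筛除无关视觉 token。下面是一个与论文实现无关的通用 top-k token 剪枝示意(`keep_ratio` 为假设参数),仅用于说明这一机制:

```python
import numpy as np

def prune_visual_tokens(tokens, attn_weights, keep_ratio=0.5):
    """按注意力权重保留 top-k 视觉 token 的通用示意(非论文 AGVP 的精确实现)。
    tokens: (N, D) 视觉 token;attn_weights: (N,) 每个 token 收到的注意力分数。"""
    k = max(1, int(len(tokens) * keep_ratio))
    keep_idx = np.argsort(attn_weights)[-k:]  # 注意力分数最高的 k 个 token
    keep_idx = np.sort(keep_idx)              # 恢复原有空间顺序
    return tokens[keep_idx], keep_idx
```

实际系统中,剪枝通常在浅层完成一次,之后的深层只对保留的 token 做注意力计算,从而降低开销。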
[CV-2] GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering CVPR2026
【速读】:该论文旨在解决视觉文本渲染中字形(glyph)准确性不足的问题,尤其针对现有方法因训练数据覆盖有限或过度风格化导致复杂或域外字符字形错误频发的缺陷。其解决方案的关键在于提出 GlyphPrinter,一种基于偏好优化的文本渲染方法,摒弃了对显式奖励模型的依赖;进一步构建了带有区域级字形偏好标注的 GlyphCorrector 数据集,并设计 Region-Grouped DPO (R-GDPO) 目标函数,通过优化标注区域内样本间的跨样本与内样本偏好关系,显著提升局部字形准确性;同时引入区域奖励引导(Regional Reward Guidance)推理策略,实现可控精度的生成采样,从而在保持良好风格化的同时大幅提升字形准确率。
链接: https://arxiv.org/abs/2603.15616
作者: Xincheng Shuai,Ziye Li,Henghui Ding,Dacheng Tao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026, Project Page: this https URL
Abstract:Generating accurate glyphs for visual text rendering is essential yet challenging. Existing methods typically enhance text rendering by training on a large amount of high-quality scene text images, but the limited coverage of glyph variations and excessive stylization often compromise glyph accuracy, especially for complex or out-of-domain characters. Some methods leverage reinforcement learning to alleviate this issue, yet their reward models usually depend on text recognition systems that are insensitive to fine-grained glyph errors, so images with incorrect glyphs may still receive high rewards. Inspired by Direct Preference Optimization (DPO), we propose GlyphPrinter, a preference-based text rendering method that eliminates reliance on explicit reward models. However, the standard DPO objective only models overall preference between two samples, which is insufficient for visual text rendering where glyph errors typically occur in localized regions. To address this issue, we construct the GlyphCorrector dataset with region-level glyph preference annotations and propose Region-Grouped DPO (R-GDPO), a region-based objective that optimizes inter- and intra-sample preferences over annotated regions, substantially enhancing glyph accuracy. Furthermore, we introduce Regional Reward Guidance, an inference strategy that samples from an optimal distribution with controllable glyph accuracy. Extensive experiments demonstrate that the proposed GlyphPrinter outperforms existing methods in glyph accuracy while maintaining a favorable balance between stylization and precision.
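论文提出的 R-GDPO 在标准 DPO 目标的基础上扩展到区域级偏好;区域分组与具体权重设置为论文细节,此处不做假设,仅给出标准 DPO 损失的标量示意供参考:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """标准 DPO 损失(Rafailov et al., 2023)的标量示意:
    logp_w / logp_l 为当前模型对偏好样本 / 非偏好样本的对数似然,
    ref_* 为冻结参考模型的对应值。当前模型相对参考模型越偏向
    偏好样本,margin 越大,损失越小。"""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(sigmoid(margin))
```

R-GDPO 的区别在于不是对整张图像计算单一偏好,而是在标注的字形区域上分别建模样本间与样本内的偏好关系。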
[CV-3] Tri-Prompting: Video Diffusion with Unified Control over Scene Subject and Motion
【速读】:该论文旨在解决当前视频扩散模型在精细控制方面的瓶颈问题,尤其是场景构图、多视角主体一致性以及相机位姿或物体运动调整等关键维度难以协同实现的问题。现有方法通常孤立处理这些控制维度,缺乏对多视角下主体身份保持和3D一致性的支持,导致生成视频的可控性与真实性难以兼顾。解决方案的关键在于提出一种名为Tri-Prompting的统一框架及两阶段训练范式,其核心创新包括:(1) 引入双条件运动模块,分别利用3D跟踪点驱动背景场景运动、低分辨率RGB线索驱动前景主体运动;(2) 设计推理阶段的ControlNet尺度调度策略,以平衡控制精度与视觉真实感。该方法显著提升了多视角主体身份一致性、3D一致性及运动准确性,支持如3D感知主体插入任意场景等新型工作流。
链接: https://arxiv.org/abs/2603.15614
作者: Zhenghong Zhou,Xiaohang Zhan,Zhiqin Chen,Soo Ye Kim,Nanxuan Zhao,Haitian Zheng,Qing Liu,He Zhang,Zhe Lin,Yuqian Zhou,Jiebo Luo
机构: Adobe(Adobe); University of California, Riverside (加州大学河滨分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Recent video diffusion models have made remarkable strides in visual quality, yet precise, fine-grained control remains a key bottleneck that limits practical customizability for content creation. For AI video creators, three forms of control are crucial: (i) scene composition, (ii) multi-view consistent subject customization, and (iii) camera-pose or object-motion adjustment. Existing methods typically handle these dimensions in isolation, with limited support for multi-view subject synthesis and identity preservation under arbitrary pose changes. This lack of a unified architecture makes it difficult to support versatile, jointly controllable video. We introduce Tri-Prompting, a unified framework and two-stage training paradigm that integrates scene composition, multi-view subject consistency, and motion control. Our approach leverages a dual-condition motion module driven by 3D tracking points for background scenes and downsampled RGB cues for foreground subjects. To ensure a balance between controllability and visual realism, we further propose an inference ControlNet scale schedule. Tri-Prompting supports novel workflows, including 3D-aware subject insertion into any scenes and manipulation of existing subjects in an image. Experimental results demonstrate that Tri-Prompting significantly outperforms specialized baselines such as Phantom and DaS in multi-view subject identity, 3D consistency, and motion accuracy.
[CV-4] HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions
【速读】:该论文旨在解决现有三维人体-场景交互(Human-Scene Interaction, HSI)重建方法中存在的感知-仿真鸿沟问题:即视觉上逼真的重建结果往往违反物理约束,导致在物理引擎中不稳定,无法用于具身人工智能(Embodied AI)应用。解决方案的关键在于提出了一种基于物理的双向优化流程——HSImul3R,其中将物理模拟器作为主动监督者,联合优化人体动态与场景几何。正向路径采用面向场景的强化学习(Scene-targeted Reinforcement Learning),在运动保真度和接触稳定性双重监督下优化人体运动;反向路径则引入直接仿真奖励优化(Direct Simulation Reward Optimization),利用仿真反馈中的重力稳定性和交互成功率来优化场景几何。这一机制首次实现了可直接部署于真实人形机器人中的稳定、仿真就绪的HSI重建。
链接: https://arxiv.org/abs/2603.15612
作者: Yukang Cao,Haozhe Xie,Fangzhou Hong,Long Zhuo,Zhaoxi Chen,Liang Pan,Ziwei Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: this https URL
Abstract:We present HSImul3R, a unified framework for simulation-ready 3D reconstruction of human-scene interactions (HSI) from casual captures, including sparse-view images and monocular videos. Existing methods suffer from a perception-simulation gap: visually plausible reconstructions often violate physical constraints, leading to instability in physics engines and failure in embodied AI applications. To bridge this gap, we introduce a physically-grounded bi-directional optimization pipeline that treats the physics simulator as an active supervisor to jointly refine human dynamics and scene geometry. In the forward direction, we employ Scene-targeted Reinforcement Learning to optimize human motion under dual supervision of motion fidelity and contact stability. In the reverse direction, we propose Direct Simulation Reward Optimization, which leverages simulation feedback on gravitational stability and interaction success to refine scene geometry. We further present HSIBench, a new benchmark with diverse objects and interaction scenarios. Extensive experiments demonstrate that HSImul3R produces the first stable, simulation-ready HSI reconstructions and can be directly deployed to real-world humanoid robots.
[CV-5] Fast SAM 3D Body: Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery
【速读】:该论文旨在解决SAM 3D Body (3DB) 在单目三维人体网格恢复任务中推理延迟较高(每张图像数秒)的问题,从而限制了其在实时场景中的应用。解决方案的关键在于提出了一种无需训练的加速框架Fast SAM 3D Body,通过解耦串行空间依赖关系并引入架构感知剪枝策略,实现了多裁剪区域特征提取的并行化和Transformer解码流程的优化;同时,为兼容现有拟人控制与策略学习框架,将迭代式网格拟合替换为直接前馈映射,使关节级运动学(SMPL)提取速度提升超过10,000倍。整体上,该框架在保持与3DB相当重建精度的前提下,实现了高达10.9倍的端到端加速效果。
链接: https://arxiv.org/abs/2603.15603
作者: Timing Yang,Sicheng He,Hongyi Jing,Jiawei Yang,Zhijian Liu,Chuhang Zou,Yue Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:SAM 3D Body (3DB) achieves state-of-the-art accuracy in monocular 3D human mesh recovery, yet its inference latency of several seconds per image precludes real-time application. We present Fast SAM 3D Body, a training-free acceleration framework that reformulates the 3DB inference pathway to achieve interactive rates. By decoupling serial spatial dependencies and applying architecture-aware pruning, we enable parallelized multi-crop feature extraction and streamlined transformer decoding. Moreover, to extract the joint-level kinematics (SMPL) compatible with existing humanoid control and policy learning frameworks, we replace the iterative mesh fitting with a direct feedforward mapping, accelerating this specific conversion by over 10,000x. Overall, our framework delivers up to a 10.9x end-to-end speedup while maintaining on-par reconstruction fidelity, even surpassing 3DB on benchmarks such as LSPET. We demonstrate its utility by deploying Fast SAM 3D Body in a vision-only teleoperation system that-unlike methods reliant on wearable IMUs-enables real-time humanoid control and the direct collection of manipulation policies from a single RGB stream.
[CV-6] AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer ICLR2026
【速读】:该论文旨在解决现有视频到音频(Video-to-Audio, V2A)生成方法中两个关键瓶颈问题:一是训练数据中存在的语义粒度差距,例如将声学上 distinct 的声音归类于粗粒度标签;二是文本描述在表达微音频特征时的模糊性,导致难以实现细粒度的声音合成控制。解决方案的关键在于提出 AC-Foley 模型,该模型采用音频条件(Audio-Conditioned)机制,直接利用参考音频作为控制信号,从而绕过文本描述的语义歧义,实现对声学属性的精确操控,显著提升音色迁移、零样本声音生成及音频质量等能力。
链接: https://arxiv.org/abs/2603.15597
作者: Pengjun Fang,Yingqing He,Yazhou Xing,Qifeng Chen,Ser-Nam Lim,Harry Yang
机构: The Hong Kong University of Science and Technology(香港科技大学); University of Central Florida(中佛罗里达大学)
类目: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: Accepted at ICLR 2026. 15 pages, 5 figures
Abstract:Existing video-to-audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity gaps in training data, such as conflating acoustically distinct sounds under coarse labels, and textual ambiguity in describing micro-acoustic features. These bottlenecks make it difficult to perform fine-grained sound synthesis using text-controlled modes. To address these limitations, we propose AC-Foley, an audio-conditioned V2A model that directly leverages reference audio to achieve precise and fine-grained control over generated sounds. This approach enables fine-grained sound synthesis, timbre transfer, zero-shot sound generation, and improved audio quality. By directly conditioning on audio signals, our approach bypasses the semantic ambiguities of text descriptions while enabling precise manipulation of acoustic attributes. Empirically, AC-Foley achieves state-of-the-art performance for Foley generation when conditioned on reference audio, while remaining competitive with state-of-the-art video-to-audio methods even without audio conditioning.
[CV-7] Grounding World Simulation Models in a Real-World Metropolis
【速读】:该论文旨在解决现有生成式世界模型(Generative World Models)无法真实还原现实城市环境的问题,即传统方法通常基于虚构内容合成视觉上合理但非真实的场景,缺乏对实际城市空间的精确建模与动态一致性。其核心挑战包括:检索参考图像与目标场景的时间错位、相机轨迹多样性不足、以及车载采集数据稀疏导致的训练样本匮乏。解决方案的关键在于三个创新:一是通过跨时间配对(cross-temporal pairing)缓解时序不一致问题;二是构建大规模合成数据集以增强相机轨迹多样性;三是设计视图插值流水线,从稀疏街景图像中合成连贯训练视频。此外,引入虚拟前瞻Sink(Virtual Lookahead Sink)机制,在长时程生成中持续将每一段输出重新锚定至未来位置的检索图像,从而显著提升生成视频的空间真实性与长期时序稳定性。
链接: https://arxiv.org/abs/2603.15583
作者: Junyoung Seo,Hyunwook Choi,Minkyung Kwon,Jinhyeok Choi,Siyoon Jin,Gayoung Lee,Junho Kim,JoungBin Lee,Geonmo Gu,Dongyoon Han,Sangdoo Yun,Seungryong Kim,Jin-Hwa Kim
机构: 1: Korea University (韩国大学); 2: Samsung Research (三星研究院); 3: Samsung Electronics (三星电子)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL
Abstract:What if a world simulation model could render not an imagined environment but a city that actually exists? Prior generative world models synthesize visually plausible yet artificial environments by imagining all content. We present Seoul World Model (SWM), a city-scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval-augmented conditioning on nearby street-view images. However, this design introduces several challenges, including temporal misalignment between retrieved references and the dynamic target scene, limited trajectory diversity and data sparsity from vehicle-mounted captures at sparse intervals. We address these challenges through cross-temporal pairing, a large-scale synthetic dataset enabling diverse camera trajectories, and a view interpolation pipeline that synthesizes coherent training videos from sparse street-view images. We further introduce a Virtual Lookahead Sink to stabilize long-horizon generation by continuously re-grounding each chunk to a retrieved image at a future location. We evaluate SWM against recent video world models across three cities: Seoul, Busan, and Ann Arbor. SWM outperforms existing methods in generating spatially faithful, temporally consistent, long-horizon videos grounded in actual urban environments over trajectories reaching hundreds of meters, while supporting diverse camera movements and text-prompted scenario variations.
[CV-8] Severe Domain Shift in Skeleton-Based Action Recognition: A Study of Uncertainty Failure in Real-World Gym Environments
【速读】:该论文旨在解决骨架驱动的动作识别模型在从受控的多视角3D骨骼捕捉场景迁移到非约束的单目2D姿态估计场景时所面临的严重领域偏移(domain shift)问题,其核心挑战在于现有方法对分布外(Out-Of-Distribution, OOD)不确定性的建模不足以保障实际部署的安全性。解决方案的关键在于:首先通过构建Gym2D(风格/视角偏移)和UCF101(语义偏移)两个新数据集系统性地量化了这种复合域偏移的影响;其次发现传统不确定性估计方法(如能量评分和马氏距离)虽能获得高AUROC指标,却无法有效降低模型在分布外数据上的错误置信度(99.6%风险仍存在);最终提出一种轻量级微调门控机制(gating mechanism),恢复模型校准能力并实现优雅的拒绝决策(graceful abstention),显著减少自信错误预测的发生率,从而为骨架识别模型在真实世界环境中的安全部署提供了可验证的分析框架与实践路径。
链接: https://arxiv.org/abs/2603.15574
作者: Aaditya Khanal,Junxiu Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 7 figures
Abstract:The practical deployment gap – transitioning from controlled multi-view 3D skeleton capture to unconstrained monocular 2D pose estimation – introduces a compound domain shift whose safety implications remain critically underexplored. We present a systematic study of this severe domain shift using a novel Gym2D dataset (style/viewpoint shift) and the UCF101 dataset (semantic shift). Our Skeleton Transformer achieves 63.2% cross-subject accuracy on NTU-120 but drops to 1.6% under zero-shot transfer to the Gym domain and 1.16% on UCF101. Critically, we demonstrate that high Out-Of-Distribution (OOD) detection AUROC does not guarantee safe selective classification. Standard uncertainty methods fail to detect this performance drop: the model remains confidently incorrect with 99.6% risk even at 50% coverage across both OOD datasets. While energy-based scoring (AUROC = 0.91) and Mahalanobis distance provide reliable distributional detection signals, such high AUROC scores coexist with poor risk-coverage behavior when making decisions. A lightweight finetuned gating mechanism restores calibration and enables graceful abstention, substantially reducing the rate of confident wrong predictions. Our work challenges standard deployment assumptions, providing a principled safety analysis of both semantic and geometric skeleton recognition deployment. 
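摘要中用作 OOD 检测信号的能量评分定义为对 logits 的负温度缩放 log-sum-exp(Liu et al., 2020);选择性分类则在能量超过阈值时弃答。下面是与论文实验无关的通用示意(温度与阈值均为假设值):

```python
import numpy as np

def energy_score(logits, T=1.0):
    """能量评分 E(x) = -T * logsumexp(logits / T)。
    能量越低,样本越可能属于分布内 (in-distribution)。"""
    z = logits / T
    m = np.max(z)
    return -T * (m + np.log(np.sum(np.exp(z - m))))  # 数值稳定的 logsumexp

def selective_predict(logits, threshold):
    """基于能量的选择性分类门控:能量高于阈值时弃答 (abstain)。"""
    if energy_score(logits) > threshold:
        return None  # 弃答
    return int(np.argmax(logits))
```

论文的关键发现正是:这类评分的 AUROC 很高,但若不经校准直接用于门控,模型在 OOD 数据上仍会"自信地答错";作者因此引入轻量微调的门控来恢复可用的风险-覆盖行为。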
[CV-9] Panoramic Affordance Prediction
【速读】:该论文旨在解决传统具身智能中 affordance prediction(可操作性预测)任务受限于针孔相机模型(pinhole camera models)所导致的视野狭窄(Field of View, FoV)和场景感知碎片化问题,从而难以捕捉全局空间关系与完整环境上下文。为应对这一挑战,作者提出全景可操作性预测(Panoramic Affordance Prediction),并构建了首个大规模基准数据集 PAP-12K(含超高清 360° 图像及 12k 精细标注的 QA 对与可操作性掩码)。其核心解决方案是 PAP 框架——一种无需训练、受人类中央凹视觉系统启发的粗粒度到细粒度处理流程:通过网格提示(grid prompting)实现递归视觉路由以逐步定位目标,引入自适应凝视机制(adaptive gaze mechanism)校正局部几何失真,并采用级联定位管道(cascaded grounding pipeline)提取精确实例级掩码,显著优于现有基于标准透视图像的方法,在全景感知下展现出更强的鲁棒性与泛化能力。
链接: https://arxiv.org/abs/2603.15558
作者: Zixin Zhang,Chenfei Liao,Hongfei Zhang,Harold Haodong Chen,Kanghao Chen,Zichen Wen,Litao Guo,Bin Ren,Xu Zheng,Yinchuan Li,Xuming Hu,Nicu Sebe,Ying-Cong Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Affordance prediction serves as a critical bridge between perception and action in embodied AI. However, existing research is confined to pinhole camera models, which suffer from narrow Fields of View (FoV) and fragmented observations, often missing critical holistic environmental context. In this paper, we present the first exploration into Panoramic Affordance Prediction, utilizing 360-degree imagery to capture global spatial relationships and holistic scene understanding. To facilitate this novel task, we first introduce PAP-12K, a large-scale benchmark dataset containing over 1,000 ultra-high-resolution (12k, 11904 x 5952) panoramic images with over 12k carefully annotated QA pairs and affordance masks. Furthermore, we propose PAP, a training-free, coarse-to-fine pipeline inspired by the human foveal visual system to tackle the ultra-high resolution and severe distortion inherent in panoramic images. PAP employs recursive visual routing via grid prompting to progressively locate targets, applies an adaptive gaze mechanism to rectify local geometric distortions, and utilizes a cascaded grounding pipeline to extract precise instance-level masks. Experimental results on PAP-12K reveal that existing affordance prediction methods designed for standard perspective images suffer severe performance degradation and fail due to the unique challenges of panoramic vision. In contrast, PAP framework effectively overcomes these obstacles, significantly outperforming state-of-the-art baselines and highlighting the immense potential of panoramic perception for robust embodied intelligence.
[CV-10] Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)频繁产生“幻觉”(hallucination)的问题,即生成看似合理但事实错误的陈述,这对模型的可信部署构成重大挑战。其解决方案的核心在于将幻觉从静态输出错误重新定义为模型计算认知过程中的动态病理状态,基于计算理性规范原则,将VLM的生成过程建模为一个动态的认知轨迹。关键创新是提出几何-信息对偶性(geometric-information duality)原理:认知轨迹在低维可解释的认知状态空间(Cognitive State Space)中的几何异常等价于其高信息论意外度(surprisal)。由此,幻觉检测转化为几何异常检测问题,并通过信息论探针实现高效、鲁棒的诊断,同时支持对失败进行因果归因,区分感知不稳定性、逻辑因果失效和决策模糊三种病理状态。
链接: https://arxiv.org/abs/2603.15557
作者: Lexiang Xiong,Qi Li,Jingwen Ye,Xinchao Wang
机构: National University of Singapore(新加坡国立大学); Monash University(莫纳什大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language Models (VLMs) frequently “hallucinate” - generate plausible yet factually incorrect statements - posing a critical barrier to their trustworthy deployment. In this work, we propose a new paradigm for diagnosing hallucinations, recasting them from static output errors into dynamic pathologies of a model’s computational cognition. Our framework is grounded in a normative principle of computational rationality, allowing us to model a VLM’s generation as a dynamic cognitive trajectory. We design a suite of information-theoretic probes that project this trajectory onto an interpretable, low-dimensional Cognitive State Space. Our central discovery is a governing principle we term the geometric-information duality: a cognitive trajectory’s geometric abnormality within this space is fundamentally equivalent to its high information-theoretic surprisal. Hallucination detection is thus cast as a geometric anomaly detection problem. Evaluated across diverse settings - from rigorous binary QA (POPE) and comprehensive reasoning (MME) to unconstrained open-ended captioning (MS-COCO) - our framework achieves state-of-the-art performance. Crucially, it operates with high efficiency under weak supervision and remains highly robust even when calibration data is heavily contaminated. This approach enables a causal attribution of failures, mapping observable errors to distinct pathological states: perceptual instability (measured by Perceptual Entropy), logical-causal failure (measured by Inferential Conflict), and decisional ambiguity (measured by Decision Entropy). Ultimately, this opens a path toward building AI systems whose reasoning is transparent, auditable, and diagnosable by design.
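该文将幻觉检测转化为低维认知状态空间中的几何异常检测。下面用纯 Python 给出一个示意性草图(非论文实现,函数名、打分方式与阈值均为假设):在校准状态上拟合对角高斯分布,把意外度(负对数似然)超过阈值的认知轨迹标记为可疑。

```python
import math

def fit_diag_gaussian(states):
    """Fit a per-dimension Gaussian to calibration cognitive states."""
    d, n = len(states[0]), len(states)
    mean = [sum(s[i] for s in states) / n for i in range(d)]
    var = [sum((s[i] - mean[i]) ** 2 for s in states) / n + 1e-6 for i in range(d)]
    return mean, var

def surprisal(state, mean, var):
    """Negative log-likelihood of one state: high surprisal corresponds to a
    geometrically abnormal point in the low-dimensional state space."""
    return 0.5 * sum((state[i] - mean[i]) ** 2 / var[i]
                     + math.log(2 * math.pi * var[i]) for i in range(len(state)))

def flag_trajectory(trajectory, mean, var, threshold):
    """Flag a cognitive trajectory if any step exceeds the surprisal threshold.
    The scoring rule and the threshold are illustrative assumptions."""
    scores = [surprisal(s, mean, var) for s in trajectory]
    return max(scores) > threshold, scores
```

这一草图只体现"几何异常 ≈ 高意外度"的对偶思想;论文中的探针设计与三类病理状态归因要复杂得多。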
[CV-11] Learning Latent Proxies for Controllable Single-Image Relighting CVPR2026
【速读】:该论文旨在解决单图重光照(Single-image relighting)任务中因光照变化导致的阴影、高光和明暗关系剧烈非线性变化问题,而几何与材质信息又不可观测所带来的高度欠约束难题。现有基于扩散模型的方法要么依赖密集且脆弱的内在图像或G-buffer监督,要么完全在潜在空间中操作缺乏物理依据,难以实现对光照方向、强度和颜色的精细控制。其解决方案的关键在于:提出LightCtrl框架,通过引入两个层次的物理先验来提升可控性和真实性——一是采用少样本潜在代理编码器(few-shot latent proxy encoder),从有限的基于物理渲染(PBR)数据中提取紧凑的材质-几何线索;二是设计光照感知掩码(lighting-aware mask),识别易受光照影响的区域并引导去噪器聚焦于与阴影相关的像素。此外,利用DPO优化目标增强预测线索的物理一致性,并构建大规模对象级数据集ScaLight以支持物理一致且可控的训练。该方法在物体和场景级别基准测试中均实现了更精确的连续控制和保真度更高的重光照效果,显著优于先前扩散模型及基于内在分解的基线方法。
链接: https://arxiv.org/abs/2603.15555
作者: Haoze Zheng,Zihao Wang,Xianfeng Wu,Yajing Bai,Yexin Liu,Yun Li,Xiaogang Xu,Harry Yang
机构: HKUST (香港科技大学); HKPolyU (香港理工大学); CUHK (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026
Abstract:Single-image relighting is highly under-constrained: small illumination changes can produce large, nonlinear variations in shading, shadows, and specularities, while geometry and materials remain unobserved. Existing diffusion-based approaches either rely on intrinsic or G-buffer pipelines that require dense and fragile supervision, or operate purely in latent space without physical grounding, making fine-grained control of direction, intensity, and color unreliable. We observe that a full intrinsic decomposition is unnecessary and redundant for accurate relighting. Instead, sparse but physically meaningful cues, indicating where illumination should change and how materials should respond, are sufficient to guide a diffusion model. Based on this insight, we introduce LightCtrl that integrates physical priors at two levels: a few-shot latent proxy encoder that extracts compact material-geometry cues from limited PBR supervision, and a lighting-aware mask that identifies sensitive illumination regions and steers the denoiser toward shading relevant pixels. To compensate for scarce PBR data, we refine the proxy branch using a DPO-based objective that enforces physical consistency in the predicted cues. We also present ScaLight, a large-scale object-level dataset with systematically varied illumination and complete camera-light metadata, enabling physically consistent and controllable training. Across object and scene level benchmarks, our method achieves photometrically faithful relighting with accurate continuous control, surpassing prior diffusion and intrinsic-based baselines, including gains of up to +2.4 dB PSNR and 35% lower RMSE under controlled lighting shifts.
[CV-12] Self-Distillation of Hidden Layers for Self-Supervised Representation Learning
【速读】:该论文旨在解决自监督学习(Self-Supervised Learning, SSL)中生成式方法与预测式方法之间的局限性问题:生成式方法(如MAE)虽能提供强数据重建能力,但计算效率低且难以捕捉高层语义特征;而预测式方法(如I-JEPA)虽聚焦于高层抽象表示,却因依赖最终层自蒸馏目标的非平稳性而导致训练不稳定。其解决方案的关键在于提出Bootleg方法,通过让模型预测教师网络多个隐藏层的潜在表示(latent representations),构建分层预测目标,从而迫使模型同时学习不同抽象层级的特征表示,实现高效且稳定的表征学习。
链接: https://arxiv.org/abs/2603.15553
作者: Scott C. Lowe,Anthony Fuller,Sageev Oore,Evan Shelhamer,Graham W. Taylor
机构: Vector Institute; Carleton University; Dalhousie University; University of British Columbia; University of Guelph
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:The landscape of self-supervised learning (SSL) is currently dominated by generative approaches (e.g., MAE) that reconstruct raw low-level data, and predictive approaches (e.g., I-JEPA) that predict high-level abstract embeddings. While generative methods provide strong grounding, they are computationally inefficient for high-redundancy modalities like imagery, and their training objective does not prioritize learning high-level, conceptual features. Conversely, predictive methods often suffer from training instability due to their reliance on the non-stationary targets of final-layer self-distillation. We introduce Bootleg, a method that bridges this divide by tasking the model with predicting latent representations from multiple hidden layers of a teacher network. This hierarchical objective forces the model to capture features at varying levels of abstraction simultaneously. We demonstrate that Bootleg significantly outperforms comparable baselines (+10% over I-JEPA) on classification of ImageNet-1K and iNaturalist-21, and semantic segmentation of ADE20K and Cityscapes.
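Bootleg 的核心是让学生同时预测教师网络多个隐藏层的表示。以下是该分层蒸馏目标的一个示意性草图(纯 Python;函数名与均匀加权均为假设,实际实现中通常在深度学习框架内对张量批量计算):

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def multi_layer_distill_loss(student_preds, teacher_hiddens, weights=None):
    """Hierarchical self-distillation objective: the student predicts the
    teacher's representation at several depths, and the per-layer losses are
    combined. Uniform default weighting is an illustrative assumption."""
    if weights is None:
        weights = [1.0] * len(teacher_hiddens)
    total = sum(w * mse(p, h)
                for w, p, h in zip(weights, student_preds, teacher_hiddens))
    return total / sum(weights)
```

让目标同时覆盖浅层与深层特征,正是文中"在多个抽象层级上同时学习表示"的直观体现。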
[CV-13] Kimodo: Scaling Controllable Human Motion Generation
【速读】:该论文旨在解决当前生成式AI在人体运动合成中面临的质量不高、控制精度不足以及泛化能力有限的问题,这些问题主要源于公共动作捕捉(motion capture, mocap)数据集规模较小。解决方案的关键在于提出Kimodo模型——一个基于700小时光学动捕数据训练的可表达且可控的运动扩散模型,其核心创新包括:精心设计的动作表示方式与两阶段去噪器架构,该架构通过分离根部与身体运动的预测来减少运动伪影,并支持多种类型的运动约束条件(如全身关键帧、稀疏关节位置/旋转、2D路径点等),从而实现高质量、高可控性的运动生成。
链接: https://arxiv.org/abs/2603.15546
作者: Davis Rempe,Mathis Petrovich,Ye Yuan,Haotian Zhang,Xue Bin Peng,Yifeng Jiang,Tingwu Wang,Umar Iqbal,David Minor,Michael de Ruyter,Jiefeng Li,Chen Tessler,Edy Lim,Eugene Jeong,Sam Wu,Ehsan Hassani,Michael Huang,Jin-Bey Yu,Chaeyeon Chung,Lina Song,Olivier Dionne,Jan Kautz,Simon Yuen,Sanja Fidler
机构: NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
备注: Project page: this https URL
Abstract:High-quality human motion data is becoming increasingly important for applications in robotics, simulation, and entertainment. Recent generative models offer a potential data source, enabling human motion synthesis through intuitive inputs like text prompts or kinematic constraints on poses. However, the small scale of public mocap datasets has limited the motion quality, control accuracy, and generalization of these models. In this work, we introduce Kimodo, an expressive and controllable kinematic motion diffusion model trained on 700 hours of optical motion capture data. Our model generates high-quality motions while being easily controlled through text and a comprehensive suite of kinematic constraints including full-body keyframes, sparse joint positions/rotations, 2D waypoints, and dense 2D paths. This is enabled through a carefully designed motion representation and two-stage denoiser architecture that decomposes root and body prediction to minimize motion artifacts while allowing for flexible constraint conditioning. Experiments on the large-scale mocap dataset justify key design decisions and analyze how the scaling of dataset size and model size affect performance.
[CV-14] FreeTalk: Emotional Topology-Free 3D Talking Heads
【速读】:该论文旨在解决两个核心问题:一是现有语音驱动的3D人脸动画方法通常依赖于注册的模板网格(registered template meshes),难以直接应用于原始3D扫描数据(具有任意拓扑结构);二是如何在不依赖模板参数化的情况下,建模超出唇部运动之外的可控情感动态。解决方案的关键在于提出一个两阶段框架FreeTalk:第一阶段Audio-To-Sparse(ATS)从语音音频中预测时序一致的3D关键点位移序列,该序列由情绪类别和强度条件控制,从而捕捉发音与情感双重运动特征,且与网格拓扑无关;第二阶段Sparse-To-Mesh(STM)通过结合表面固有特征与关键点到顶点的条件约束,将稀疏关键点运动映射至目标网格,生成稠密顶点级形变,无需测试时进行模板拟合或对应监督。这一设计使模型具备对未见身份和不同网格拓扑的强泛化能力。
链接: https://arxiv.org/abs/2603.15512
作者: Federico Nocentini,Thomas Besnier,Claudio Ferrari,Stefano Berretti,Mohamed Daoudi
机构: University of Florence (佛罗伦萨大学); University of Copenhagen (哥本哈根大学); IMT Nord Europe, Institut Mines-Télécom, Centre for Digital Systems (IMT Nord Europe, 法国矿业电信学院数字系统中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Speech-driven 3D facial animation has advanced rapidly, yet most approaches remain tied to registered template meshes, preventing effective deployment on raw 3D scans with arbitrary topology. At the same time, modeling controllable emotional dynamics beyond lip articulation remains challenging, and is often tied to template-based parameterizations. We address these challenges by proposing FreeTalk, a two-stage framework for emotion-conditioned 3D talking-head animation that generalizes to unregistered face meshes with arbitrary vertex count and connectivity. First, Audio-To-Sparse (ATS) predicts a temporally coherent sequence of 3D landmark displacements from speech audio, conditioned on an emotion category and intensity. This sparse representation captures both articulatory and affective motion while remaining independent of mesh topology. Second, Sparse-To-Mesh (STM) transfers the predicted landmark motion to a target mesh by combining intrinsic surface features with landmark-to-vertex conditioning, producing dense per-vertex deformations without template fitting or correspondence supervision at test time. Extensive experiments show that FreeTalk matches specialized baselines when trained in-domain, while providing substantially improved robustness to unseen identities and mesh topologies. Code and pre-trained models will be made publicly available.
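STM 阶段需要把稀疏关键点位移传播到任意拓扑网格的稠密顶点。下面以反距离加权作为一个假设性的简化替代,示意这种"稀疏到稠密"的位移传递(论文中该映射由学习得到并结合表面固有特征,此处仅为说明思路):

```python
import math

def transfer_displacements(landmarks, displacements, vertices, power=2.0):
    """Spread sparse 3D landmark displacements onto dense mesh vertices by
    inverse-distance weighting. A hypothetical simplification of the learned
    STM mapping, included only to illustrate sparse-to-dense transfer."""
    out = []
    for v in vertices:
        weight_sum, acc = 0.0, [0.0, 0.0, 0.0]
        for lm, disp in zip(landmarks, displacements):
            dist = math.dist(v, lm)
            if dist < 1e-9:           # vertex coincides with a landmark
                weight_sum, acc = 1.0, list(disp)
                break
            w = dist ** -power
            weight_sum += w
            acc = [a + w * d for a, d in zip(acc, disp)]
        out.append(tuple(a / weight_sum for a in acc))
    return out
```

由于该操作只依赖顶点坐标而不依赖连接关系,它天然与网格拓扑无关,这正是 FreeTalk 能处理任意顶点数网格的原因之一。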
[CV-15] Federated Learning of Binary Neural Networks: Enabling Low-Cost Inference
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在边缘设备上部署深度神经网络(Deep Neural Networks, DNNs)时面临的资源受限问题,即如何在保证模型精度的同时显著降低内存占用和计算复杂度。传统方法如后训练二值化虽能压缩模型尺寸,但因量化误差导致性能大幅下降。其解决方案的关键在于提出FedBNN框架,该框架在本地训练阶段直接学习旋转感知的二值权重表示(即每个权重编码为±1而非32位浮点数),从而在不牺牲模型性能的前提下,大幅减少推理时的浮点运算次数(FLOPs)与内存需求,实现高效且隐私保护的边缘智能部署。
链接: https://arxiv.org/abs/2603.15507
作者: Nitin Priyadarshini Shankar,Soham Lahiri,Sheetal Kalyani,Saurav Prakash
机构: Indian Institute of Technology Madras (印度理工学院马德拉斯分校); Jadavpur University (贾达普大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages, 13 figures
Abstract:Federated Learning (FL) preserves privacy by distributing training across devices. However, using DNNs is computationally intensive at the low-powered edge during inference. Edge deployment demands models that simultaneously optimize memory footprint and computational efficiency, a dilemma where conventional DNNs fail by exceeding resource limits. Traditional post-training binarization reduces model size but suffers from severe accuracy loss due to quantization errors. To address these challenges, we propose FedBNN, a rotation-aware binary neural network framework that learns binary representations directly during local training. By encoding each weight as a single bit in {+1, -1} instead of a 32-bit float, FedBNN shrinks the model footprint, significantly reducing runtime (during inference) FLOPs and memory requirements in comparison to federated methods using real models. Evaluations across multiple benchmark datasets demonstrate that FedBNN significantly reduces resource consumption while performing similarly to existing federated methods using real-valued models.
[CV-16] Real-Time Oriented Object Detection Transformer in Remote Sensing Images
【速读】:该论文旨在解决实时检测变压器在遥感图像中对旋转目标检测时存在的角度表示不明确、匹配代价模糊及训练不稳定等问题。其关键解决方案包括:(1)提出角度分布精炼机制,将角度回归转化为概率分布的迭代 refine,以捕捉旋转不确定性并实现更细粒度的角度表征;(2)引入 Chamfer 距离作为二分图匹配中的代价函数,通过顶点集度量框间几何距离,提升匹配精度并消除歧义匹配;(3)设计面向方向的对比去噪策略,增强训练稳定性,并分析四种噪声模式,揭示真值标签在不同解码层中可能被分配至不同查询索引的现象,进而提出不稳定性度量进行量化分析与改进。
链接: https://arxiv.org/abs/2603.15497
作者: Zeyu Ding,Yong Zhou,Jiaqi Zhao,Wen-Liang Du,Xixi Li,Rui Yao,Abdulmotaleb El Saddik
机构: China University of Mining and Technology (中国矿业大学); Mine Digitization Engineering Research Center of the Ministry of Education (教育部矿山数字化工程研究中心); Jiangsu Provincial Industrial Technology Engineering Center for Intelligent Sensing and Emergency IoT in Underground Space (江苏省地下空间智能感知与应急物联网产业技术工程中心); University of Ottawa (渥太华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE Transactions on Geoscience and Remote Sensing, 2026, doi https://doi.org/10.1109/TGRS.2026.3671683
Abstract:Recent real-time detection transformers have gained popularity due to their simplicity and efficiency. However, these detectors do not explicitly model object rotation, especially in remote sensing imagery where objects appear at arbitrary angles, leading to challenges in angle representation, matching cost, and training stability. In this paper, we propose a real-time oriented object detection transformer, the first real-time end-to-end oriented object detector to the best of our knowledge, that addresses the above issues. Specifically, angle distribution refinement is proposed to reformulate angle regression as an iterative refinement of probability distributions, thereby capturing the uncertainty of object rotation and providing a more fine-grained angle representation. Then, we incorporate a Chamfer distance cost into bipartite matching, measuring box distance via vertex sets, enabling more accurate geometric alignment and eliminating ambiguous matches. Moreover, we propose oriented contrastive denoising to stabilize training and analyze four noise modes. We observe that a ground truth can be assigned to different index queries across different decoder layers, and analyze this issue using the proposed instability metric. We design a series of model variants and experiments to validate the proposed method. Notably, our O2-DFINE-L, O2-RTDETR-R50 and O2-DEIM-R50 achieve 77.73%/78.45%/80.15% AP50 on DOTA1.0 and 132/119/119 FPS on the 2080ti GPU. Code is available at this https URL.
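文中以 Chamfer 距离在顶点集合上度量旋转框之间的几何距离,并用作二分图匹配的代价。下面给出对称 Chamfer 距离的一个极简实现示意(按论文思路,顶点集来自旋转框的四个角点;具体归一化方式为此处假设):

```python
import math

def chamfer_distance(verts_a, verts_b):
    """Symmetric Chamfer distance between two vertex sets: average
    nearest-neighbour distance in both directions. A sketch of the matching
    cost idea; the paper applies it to the corners of oriented boxes."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(verts_a, verts_b) + one_way(verts_b, verts_a)

# Corners of a unit square and of the same square shifted by (1, 0).
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
shifted = [(x + 1, y) for x, y in square]
```

由于该度量直接作用于顶点集合,旋转角的周期性歧义(如 0° 与 180°)不会像角度回归那样引入不连续的代价,这也是其消除歧义匹配的直观原因。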
[CV-17] RSGen: Enhancing Layout-Driven Remote Sensing Image Generation with Diverse Edge Guidance
【速读】:该论文旨在解决当前基于扩散模型的遥感图像生成方法在布局驱动图像合成(Layout-to-Image, L2I)中面临的细粒度控制不足以及无法严格遵守边界框约束的问题。其解决方案的关键在于提出一个即插即用的框架RSGen,通过分阶段增强策略实现:首先利用图像到图像生成技术从训练样本中复合并丰富边缘图多样性;随后将这些多样化的边缘图作为条件输入至现有L2I模型,以在边界框内实现像素级控制,从而确保生成实例严格遵循布局要求。
链接: https://arxiv.org/abs/2603.15484
作者: Xianbao Hou,Yonghao He,Zeyd Boukhers,John See,Hu Su,Wei Sui,Cong Yang
机构: D-Robotics-AI-Lab (D-机器人人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion models have significantly mitigated the impact of annotated data scarcity in remote sensing (RS). Although recent approaches have successfully harnessed these models to enable diverse and controllable Layout-to-Image (L2I) synthesis, they still suffer from limited fine-grained control and fail to strictly adhere to bounding box constraints. To address these limitations, we propose RSGen, a plug-and-play framework that leverages diverse edge guidance to enhance layout-driven RS image generation. Specifically, RSGen employs a progressive enhancement strategy: 1) it first enriches the diversity of edge maps composited from retrieved training instances via Image-to-Image generation; and 2) subsequently utilizes these diverse edge maps as conditioning for existing L2I models to enforce pixel-level control within bounding boxes, ensuring the generated instances strictly adhere to the layout. Extensive experiments across three baseline models demonstrate that RSGen significantly boosts the capabilities of existing L2I models. For instance, with CC-Diff on the DOTA dataset for oriented object detection, we achieve remarkable gains of +9.8/+12.0 in YOLOScore mAP50/mAP50-95 and +1.6 in mAP on the downstream detection task. Our code will be publicly available: this https URL
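RSGen 以边缘图作为 L2I 模型在边界框内实现像素级控制的条件输入。下面用有限差分梯度给出一个最简的边缘图提取示意(实际系统通常使用更强的边缘提取器,此处纯 Python 实现与阈值均为说明性假设):

```python
def edge_map(img, thresh=1.0):
    """Finite-difference gradient edge map on a 2D grayscale grid -- a
    lightweight, hypothetical stand-in for the edge extractors whose
    outputs condition the L2I model."""
    h, w = len(img), len(img[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h - 1):
        for x in range(w - 1):
            gx = img[y][x + 1] - img[y][x]   # horizontal intensity step
            gy = img[y + 1][x] - img[y][x]   # vertical intensity step
            if (gx * gx + gy * gy) ** 0.5 >= thresh:
                edges[y][x] = 1
    return edges
```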
[CV-18] ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer
【速读】:该论文旨在解决视频可控生成与编辑任务中因缺乏成对视频数据及训练视频扩散模型计算成本高昂而导致的进展受限问题。其解决方案的关键在于提出一种无需视频训练数据的微调框架ViFeEdit,通过架构重参数化将空间独立性从3D注意力机制中解耦,从而在仅使用2D图像进行少量训练的情况下,实现视觉保真度高且时序一致的视频编辑;该设计采用双路径流水线结构并分别引入时间步嵌入以优化噪声调度,展现出对多种条件信号的强大适应能力。
链接: https://arxiv.org/abs/2603.15478
作者: Ruonan Yu,Zhenxiong Tan,Zigeng Chen,Songhua Liu,Xinchao Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Working in progress, code is at this https URL
Abstract:Diffusion Transformers (DiTs) have demonstrated remarkable scalability and quality in image and video generation, prompting growing interest in extending them to controllable generation and editing tasks. However, compared to the image counterparts, progress in video control and editing remains limited, mainly due to the scarcity of paired video data and the high computational cost of training video diffusion models. To address this issue, in this paper, we propose a video-free tuning framework termed ViFeEdit for video diffusion transformers. Without requiring any forms of video training data, ViFeEdit achieves versatile video generation and editing, adapted solely with 2D images. At the core of our approach is an architectural reparameterization that decouples spatial independence from the full 3D attention in modern video diffusion transformers, which enables visually faithful editing while maintaining temporal consistency with only minimal additional parameters. Moreover, this design operates in a dual-path pipeline with separate timestep embeddings for noise scheduling, exhibiting strong adaptability to diverse conditioning signals. Extensive experiments demonstrate that our method delivers promising results of controllable video generation and editing with only minimal training on 2D image data. Codes are available at this https URL.
[CV-19] Seeing Beyond: Extrapolative Domain Adaptive Panoramic Segmentation CVPR2026
【速读】:该论文旨在解决跨域全景语义分割(cross-domain panoramic semantic segmentation)中的两大挑战:一是不同视角下视场角(Field of View, FoV)引起的严重几何畸变,二是跨域间开放集语义不一致导致的语义不确定性问题。其核心解决方案为提出一种名为“外推式域自适应全景分割”(Extrapolative Domain Adaptive Panoramic Segmentation, EDA-PSeg)的新框架,关键创新在于两个模块:一是Euler-Margin Attention(EMA),通过引入角度边界(angular margin)增强视角不变的语义表示,并结合幅度与相位调制提升对未见类别的泛化能力;二是Graph Matching Adapter(GMA),构建高阶图关系以对齐因FoV变化而偏移的共享语义,同时通过结构自适应机制有效分离新类别。实验表明,该方法在多种相机位移、天气条件和开放集场景下均实现最优性能,具备强鲁棒性和跨几何视图泛化能力。
链接: https://arxiv.org/abs/2603.15475
作者: Yuanfan Zheng,Kunyu Peng,Xu Zheng,Kailun Yang
机构: Hunan University (湖南大学); Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); INSAIT, Sofia University “St. Kliment Ohridski” (INSAIT,索非亚大学“圣克莱门特·奥赫里德斯基”); HKUST(GZ) (香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: Accepted to CVPR 2026. The code is available at this https URL
Abstract:Cross-domain panoramic semantic segmentation has attracted growing interest as it enables comprehensive 360° scene understanding for real-world applications. However, it remains particularly challenging due to severe geometric Field of View (FoV) distortions and inconsistent open-set semantics across domains. In this work, we formulate an open-set domain adaptation setting, and propose Extrapolative Domain Adaptive Panoramic Segmentation (EDA-PSeg) framework that trains on local perspective views and tests on full 360° panoramic images, explicitly tackling both geometric FoV shifts across domains and semantic uncertainty arising from previously unseen classes. To this end, we propose the Euler-Margin Attention (EMA), which introduces an angular margin to enhance viewpoint-invariant semantic representation, while performing amplitude and phase modulation to improve generalization toward unseen classes. Additionally, we design the Graph Matching Adapter (GMA), which builds high-order graph relations to align shared semantics across FoV shifts while effectively separating novel categories through structural adaptation. Extensive experiments on four benchmark datasets under camera-shift, weather-condition, and open-set scenarios demonstrate that EDA-PSeg achieves state-of-the-art performance, robust generalization to diverse viewing geometries, and resilience under varying environmental conditions. The code is available at this https URL.
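EMA 中的角度边界(angular margin)思想与度量学习中的加性角度间隔类似:在余弦相似度对应的角度上加一个间隔再取余弦,从而增强视角不变的判别性。以下为该算子的一个示意实现(仅体现 margin 思想,并非 EMA 完整的幅度/相位调制):

```python
import math

def angular_margin_logit(cos_theta, margin):
    """Additive angular margin on a cosine logit: return cos(theta + m).
    A sketch of the margin idea only; the paper's full amplitude/phase
    modulation is not reproduced here."""
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp for safety
    return math.cos(theta + margin)
```

对正的 margin,该算子总是压低相似度,迫使网络学出间隔更大的、对视角偏移更稳健的特征。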
[CV-20] Anchor then Polish for Low-light Enhancement
【速读】:该论文旨在解决低光照图像增强中因退化因素交织(包括照明不足、色彩偏移和纹理干扰)而导致的挑战,现有方法常采用复杂架构联合处理这些问题,但易过拟合简单物理约束,引发全局失真。其解决方案的关键在于提出一种“锚定-精修”(anchor-then-polish, ATP)框架,从根本上将全局能量对齐与局部细节优化解耦:首先通过宏观锚定(macro anchoring)学习仅具12自由度的场景自适应投影矩阵,以线性操作稳定亮度分布并校正颜色;随后在微观层面进行精修,于小波域和色度空间内基于矩阵引导细化细节,并设计约束亮度更新策略确保全局一致性,使网络聚焦于精细化处理。
链接: https://arxiv.org/abs/2603.15472
作者: Tianle Du,Mingjia Li,Hainuo Wang,Xiaojie Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Low-light image enhancement is challenging due to entangled degradations, mainly including poor illumination, color shifts, and texture interference. Existing methods often rely on complex architectures to address these issues jointly but may overfit simple physical constraints, leading to global distortions. This work proposes a novel anchor-then-polish (ATP) framework to fundamentally decouple global energy alignment from local detail refinement. First, macro anchoring is customized to (greatly) stabilize luminance distribution and correct color by learning a scene-adaptive projection matrix with merely 12 degrees of freedom, revealing that a simple linear operator can effectively align global energy. The macro anchoring then reduces the task to micro polishing, which further refines details in the wavelet domain and chrominance space under matrix guidance. A constrained luminance update strategy is designed to ensure global consistency while directing the network to concentrate on fine-grained polishing. Extensive experiments on multiple benchmarks show that our method achieves state-of-the-art performance, producing visually natural and quantitatively superior low-light enhancements.
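宏观锚定学习的是一个仅 12 自由度的线性算子:3×3 的颜色混合矩阵加 3 个偏置,对每个像素做仿射变换。以下草图示意该算子如何作用于单个 RGB 像素(矩阵取值为假设性示例,并非论文学到的参数):

```python
def apply_color_projection(pixel, M):
    """Apply a 3x4 affine matrix (12 DOF: a 3x3 channel-mixing block plus a
    3-vector offset) to one RGB value -- a hypothetical sketch of the global
    linear operator the macro-anchoring stage learns per scene."""
    r, g, b = pixel
    return tuple(M[i][0] * r + M[i][1] * g + M[i][2] * b + M[i][3]
                 for i in range(3))

# Identity channel mixing plus a +0.2 brightness offset (illustrative only).
M = [[1, 0, 0, 0.2], [0, 1, 0, 0.2], [0, 0, 1, 0.2]]
out = apply_color_projection((0.1, 0.2, 0.3), M)
```

由于整幅图共享同一个矩阵,该步骤只负责全局能量对齐,细节修复留给后续小波域与色度空间的微观精修。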
[CV-21] Automated Counting of Stacked Objects in Industrial Inspection ICCV25
【速读】:该论文旨在解决工业检测中对堆叠三维(3D)制造零件进行准确、高效视觉计数的问题,尤其针对因严重遮挡导致仅少数物体可见的场景。现有方法在处理容器、托盘或料箱中的不规则堆叠物体时表现不佳,难以实现可靠计数。其解决方案的关键在于将计数任务分解为两个互补的子问题:从多视角图像中估计堆叠物体的3D几何结构及其占据比(occupancy ratio),并通过融合几何重建与基于深度学习的深度分析技术,实现对被部分遮挡的相同零件的精确计数。该方法在大规模合成数据和多样真实世界数据上验证了鲁棒性,适用于实际工业检测环境。
链接: https://arxiv.org/abs/2603.15470
作者: Corentin Dumery,Noa Etté,Aoxiang Fan,Ren Li,Jingyi Xu,Hieu Le,Pascal Fua
机构: EPFL(瑞士联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This preprint is a journal extension of our ICCV25 Oral paper: this https URL
Abstract:Visual object counting is a fundamental computer vision task in industrial inspection, where accurate, high-throughput inventory tracking and quality assurance are critical. Moreover, manufactured parts are often too light to reliably deduce their count from their weight, or too heavy to move the stack on a scale safely and practically, making automated visual counting the more robust solution in many scenarios. However, existing methods struggle with stacked 3D items in containers, pallets, or bins, where most objects are heavily occluded and only a few are directly visible. To address this important yet underexplored challenge, we propose a novel 3D counting approach that decomposes the task into two complementary subproblems: estimating the 3D geometry of the stack and its occupancy ratio from multi-view images. By combining geometric reconstruction with deep learning-based depth analysis, our method can accurately count identical manufactured parts inside containers, even when they are irregularly stacked and partially hidden. We validate our 3D counting pipeline on large-scale synthetic and diverse real-world data with manually verified total counts, demonstrating robust performance under realistic inspection conditions.
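该方法把计数分解为两个子问题:估计堆叠的 3D 体积与占据比。核心算式即"堆叠体积 × 占据比 ÷ 单件体积",可写成如下示意(数值为假设性例子,仅说明分解方式):

```python
def estimate_count(stack_volume, occupancy_ratio, unit_volume):
    """Count = (reconstructed stack volume x estimated occupancy ratio)
    divided by the volume of one part. A sketch of the paper's
    two-subproblem decomposition; rounding behaviour is an assumption."""
    return round(stack_volume * occupancy_ratio / unit_volume)

# A 1.0 m^3 bin that is 60% occupied by parts of 0.002 m^3 each.
n = estimate_count(1.0, 0.6, 0.002)
```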
[CV-22] Evaluating Time Awareness and Cross-modal Active Perception of Large Models via 4D Escape Room Task
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理时变、不可逆场景下的选择性跨模态感知与时间意识能力不足的问题,尤其针对音频信号的时间依赖性以及不同模态间可能产生的互补或干扰信息缺乏有效整合机制。其解决方案的关键在于构建了一个可定制的4D环境——EscapeCraft-4D,该环境包含基于触发的听觉源、时间瞬态证据和位置依赖线索,从而要求智能体在时间约束下执行时空推理并主动进行跨模态融合,进而系统评估模型在复杂动态环境中对多模态信息的选择性整合与时间敏感决策能力。
链接: https://arxiv.org/abs/2603.15467
作者: Yurui Dong,Ziyue Wang,Shuyun Lu,Dairu Liu,Xuechen Liu,Fuwen Luo,Peng Li,Yang Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal Large Language Models (MLLMs) have recently made rapid progress toward unified Omni models that integrate vision, language, and audio. However, existing environments largely focus on 2D or 3D visual context and vision-language tasks, offering limited support for temporally dependent auditory signals and selective cross-modal integration, where different modalities may provide complementary or interfering information, which are essential capabilities for realistic multimodal reasoning. As a result, whether models can actively coordinate modalities and reason under time-varying, irreversible conditions remains underexplored. To this end, we introduce EscapeCraft-4D, a customizable 4D environment for assessing selective cross-modal perception and time awareness in Omni models. It incorporates trigger-based auditory sources, temporally transient evidence, and location-dependent cues, requiring agents to perform spatio-temporal reasoning and proactive multimodal integration under time constraints. Building on this environment, we curate a benchmark to evaluate corresponding abilities across powerful models. Evaluation results suggest that models struggle with modality bias, and reveal significant gaps in current models’ ability to integrate multiple modalities under time constraints. Further in-depth analysis uncovers how multiple modalities interact and jointly influence model decisions in complex multimodal reasoning environments.
[CV-23] MV2UV: Generating High-quality UV Texture Maps with Multiview Prompts
【速读】:该论文旨在解决3D资产纹理生成中现存方法的两大核心问题:一是多视角(multiview)纹理生成方法存在多视角不一致性(multiview inconsistency)以及未见区域缺失纹理的问题;二是基于UV空间的修补(UV inpainting)方法因UV数据不足而泛化能力差,且难以有效利用2D图像扩散先验(diffusion priors)。解决方案的关键在于提出一种名为MV2UV的新方法,其核心思想是构建一个在UV空间中运行的生成模型,该模型能够同时完成两个任务:一是对多视角图像中的未见区域进行修补,二是修正多视角图像之间的不一致性。通过这一设计,MV2UV成功融合了多视角生成的2D先验与UV空间的修补能力,从而显著提升纹理质量,尤其是在遮挡和多视角不一致区域表现更优。
链接: https://arxiv.org/abs/2603.15436
作者: Zheng Zhang,Qinchuan Zhang,Yuteng Ye,Zhi Chen,Penglei Ji,Mengfei Li,Wenxiao Zhang,Yuan Liu
机构: Hisilicon Linx Lab, Huawei (华为海思Linx实验室); Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generating high-quality textures for 3D assets is a challenging task. Existing multiview texture generation methods suffer from the multiview inconsistency and missing textures on unseen parts, while UV inpainting texture methods do not generalize well due to insufficient UV data and cannot well utilize 2D image diffusion priors. In this paper, we propose a new method called MV2UV that combines 2D generative priors from multiview generation and the inpainting ability of UV refinement to get high-quality texture maps. Our key idea is to adopt a UV space generative model that simultaneously inpaints unseen parts of multiview images while resolving the inconsistency of multiview images. Experiments show that our method enables a better texture generation quality than existing methods, especially in unseen occluded and multiview-inconsistent parts.
[CV-24] Real-Time Human Frontal View Synthesis from a Single Image
【速读】:该论文旨在解决单图生成逼真人类新视角图像时存在的两大核心问题:一是现有以渲染为中心的方法在面部和手部等复杂区域难以保持几何一致性,导致时间上的不稳定性;二是以人为中心的框架因依赖外部模型提供结构先验信息而面临内存瓶颈,限制了实时性能。解决方案的关键在于提出 PrismMirror,一个基于几何引导的即时前视图合成框架,其创新性地采用级联学习策略实现从粗到细的几何特征学习——首先直接学习粗粒度几何表示(如 SMPL-X 网格和点云),再通过渲染监督细化纹理;同时将整个统一框架蒸馏为轻量级线性注意力模型,从而在保证视觉真实性和结构准确性的同时,首次实现单目人类前视图合成的实时推理(24 FPS)。
链接: https://arxiv.org/abs/2603.15433
作者: Fangyu Lin,Yingdong Hu,Lunjie Zhu,Zhening Liu,Yushi Huang,Zehong Lin,Jun Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Photorealistic human novel view synthesis from a single image is crucial for democratizing immersive 3D telepresence, eliminating the need for complex multi-camera setups. However, current rendering-centric methods prioritize visual fidelity over explicit geometric understanding and struggle with intricate regions like faces and hands, leading to temporal instability. Meanwhile, human-centric frameworks suffer from memory bottlenecks since they typically rely on an auxiliary model to provide informative structural priors for geometric modeling, which limits real-time performance. To address these challenges, we propose PrismMirror, a geometry-guided framework for instant frontal view synthesis from a single image. By avoiding external geometric modeling and focusing on frontal view synthesis, our model optimizes visual integrity for telepresence. Specifically, PrismMirror introduces a novel cascade learning strategy that enables coarse-to-fine geometric feature learning. It first directly learns coarse geometric features, such as SMPL-X meshes and point clouds, and then refines textures through rendering supervision. To achieve real-time efficiency, we distill this unified framework into a lightweight linear attention model. Notably, PrismMirror is the first monocular human frontal view synthesis model that achieves real-time inference at 24 FPS, significantly outperforming previous methods in both visual authenticity and structural accuracy.
[CV-25] Gym-V: A Unified Vision Environment System for Agentic Vision Research
【TL;DR】This paper addresses the lack of standardized evaluation infrastructure for vision agents, which hinders systematic study of what drives their learning and where current models fall short. The key is Gym-V, a unified platform of 179 procedurally generated visual environments across 10 domains with controllable difficulty, enabling controlled experiments. With it, the authors find that observation scaffolding (captions and game rules) is more decisive for training success than the choice of RL algorithm; cross-domain transfer experiments further show that training on diverse task categories generalizes broadly, whereas narrow training can cause negative transfer, with multi-turn interaction amplifying these effects.
Link: https://arxiv.org/abs/2603.15432
Authors: Fanqing Meng,Lingxiao Du,Jiawei Gu,Jiaqi Liao,Linjie Li,Zijian Wu,Xiangyan Liu,Ziqi Zhao,Mengkang Hu,Yue Zhang,Zichen Liu,Jiaheng Zhang,Michael Qizhe Shieh
Affiliations: National University of Singapore; Hong Kong University of Science and Technology; University of Washington; The Hong Kong Polytechnic University; The University of Hong Kong; Soochow University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:As agentic systems increasingly rely on reinforcement learning from verifiable rewards, standardized "gym" infrastructure has become essential for rapid iteration, reproducibility, and fair comparison. Vision agents lack such infrastructure, limiting systematic study of what drives their learning and where current models fall short. We introduce Gym-V, a unified platform of 179 procedurally generated visual environments across 10 domains with controllable difficulty, enabling controlled experiments that were previously infeasible across fragmented toolkits. Using it, we find that observation scaffolding is more decisive for training success than the choice of RL algorithm, with captions and game rules determining whether learning succeeds at all. Cross-domain transfer experiments further show that training on diverse task categories generalizes broadly while narrow training can cause negative transfer, with multi-turn interaction amplifying all of these effects. Gym-V is released as a convenient foundation for training environments and evaluation toolkits, aiming to accelerate future research on agentic VLMs.
[CV-26] AnyCrowd: Instance-Isolated Identity-Pose Binding for Arbitrary Multi-Character Animation
【TL;DR】This paper targets the latent identity entanglement and identity-pose mis-binding that arise as the number of characters grows in multi-character animation, which degrade controllability and spatio-temporal consistency. The key is AnyCrowd, a Diffusion Transformer (DiT)-based framework with three innovations: (1) an Instance-Isolated Latent Representation (IILR) that encodes each character instance independently before DiT processing to avoid identity entanglement; (2) Tri-Stage Decoupled Attention (TSDA), which decomposes self-attention into instance-aware foreground attention, background-centric interaction, and global foreground-background coordination to bind identities precisely to driving poses; and (3) an Adaptive Gated Fusion (AGF) module that predicts identity-aware weights in overlapping regions to fuse competing token groups into identity-consistent representations.
Link: https://arxiv.org/abs/2603.15415
Authors: Zhenyu Xie,Ji Xia,Michael Kampffmeyer,Panwen Hu,Zehua Ma,Yujian Zheng,Jing Wang,Zheng Chong,Xujie Zhang,Xianhang Cheng,Xiaodan Liang,Hao Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Controllable character animation has advanced rapidly in recent years, yet multi-character animation remains underexplored. As the number of characters grows, multi-character reference encoding becomes more susceptible to latent identity entanglement, resulting in identity bleeding and reduced controllability. Moreover, learning precise and spatio-temporally consistent correspondences between reference identities and driving pose sequences becomes increasingly challenging, often leading to identity-pose mis-binding and inconsistency in generated videos. To address these challenges, we propose AnyCrowd, a Diffusion Transformer (DiT)-based video generation framework capable of scaling to an arbitrary number of characters. Specifically, we first introduce an Instance-Isolated Latent Representation (IILR), which encodes character instances independently prior to DiT processing to prevent latent identity entanglement. Building on this disentangled representation, we further propose Tri-Stage Decoupled Attention (TSDA) to bind identities to driving poses by decomposing self-attention into: (i) instance-aware foreground attention, (ii) background-centric interaction, and (iii) global foreground-background coordination. Furthermore, to mitigate token ambiguity in overlapping regions, an Adaptive Gated Fusion (AGF) module is integrated within TSDA to predict identity-aware weights, effectively fusing competing token groups into identity-consistent representations…
[CV-27] Detection of Autonomous Shuttles in Urban Traffic Images Using Adaptive Residual Context
【TL;DR】This paper addresses the catastrophic forgetting that arises when conventional fine-tuning is used to add new detection targets for autonomous vehicles, degrading scene understanding and thus road safety. The key is the Adaptive Residual Context (ARC) architecture, which links a frozen context branch and trainable task-specific branches through a Context-Guided Bridge, using attention to transfer spatial features while preserving pre-trained representations, so that new vehicle categories can be added efficiently without significantly sacrificing prior knowledge.
Link: https://arxiv.org/abs/2603.15404
Authors: Mohamed Aziz Younes,Nicolas Saunier,Guillaume-Alexandre Bilodeau
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures
Abstract:The progressive automation of transport promises to enhance safety and sustainability through shared mobility. Like other vehicles and road users, and even more so for such a new technology, it requires monitoring to understand how it interacts in traffic and to evaluate its safety. This can be done with fixed cameras and video object detection. However, the addition of new detection targets generally requires a fine-tuning approach for regular detection methods. Unfortunately, this implementation strategy will lead to a phenomenon known as catastrophic forgetting, which causes a degradation in scene understanding. In road safety applications, preserving contextual scene knowledge is of the utmost importance for protecting road users. We introduce the Adaptive Residual Context (ARC) architecture to address this. ARC links a frozen context branch and trainable task-specific branches through a Context-Guided Bridge, utilizing attention to transfer spatial features while preserving pre-trained representations. Experiments on a custom dataset show that ARC matches fine-tuned baselines while significantly improving knowledge retention, offering a data-efficient solution to add new vehicle categories for complex urban environments.
[CV-28] Pointing-Based Object Recognition
【TL;DR】This paper addresses recognizing the object targeted by a human pointing gesture from RGB images, a key challenge for more intuitive human-robot interaction. The core of the solution is a pipeline integrating several state-of-the-art components, including object detection, body pose estimation, monocular depth estimation, and vision-language models. Its key idea is to use 3D spatial information reconstructed from a single image to improve target identification accuracy, especially in complex scenes with overlapping objects, while image captioning models correct classification errors for added robustness. The modular design allows deployment in environments without specialized depth sensors.
Link: https://arxiv.org/abs/2603.15403
Authors: Lukáš Hajdúch,Viktor Kocur
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to InnovAIte conference
Abstract:This paper presents a comprehensive pipeline for recognizing objects targeted by human pointing gestures using RGB images. As human-robot interaction moves toward more intuitive interfaces, the ability to identify targets of non-verbal communication becomes crucial. Our proposed system integrates several existing state-of-the-art methods, including object detection, body pose estimation, monocular depth estimation, and vision-language models. We evaluate the impact of 3D spatial information reconstructed from a single image and the utility of image captioning models in correcting classification errors. Experimental results on a custom dataset show that incorporating depth information significantly improves target identification, especially in complex scenes with overlapping objects. The modularity of the approach allows for deployment in environments where specialized depth sensors are unavailable.
[CV-29] AI Evasion and Impersonation Attacks on Facial Re-Identification with Activation Map Explanations
【TL;DR】This paper addresses adversarial evasion and impersonation attacks on facial identification systems in surveillance settings, which can defeat deep re-identification models deployed across non-overlapping cameras. The key is a novel end-to-end adversarial patch generation framework: a conditional encoder-decoder network synthesizes target-specific patches in a single forward pass, guided by multi-scale features and optimized with a dual adversarial objective (pull and push terms), and pre-trained latent diffusion models are further integrated to make patches more naturalistic and physically deployable. On standard person re-identification (Market-1501, DukeMTMC-reID) and face recognition (CelebA-HQ, PubFig) benchmarks, the attacks reduce mAP from 90% to 0.4% in white-box settings and from 72% to 0.4% in black-box settings, demonstrating strong cross-model generalization.
Link: https://arxiv.org/abs/2603.15396
Authors: Noe Claudel,Weisi Guo,Yang Xing
Affiliations: Cranfield University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Facial identification systems are increasingly deployed in surveillance and yet their vulnerability to adversarial evasion and impersonation attacks pose a critical risk. This paper introduces a novel framework for generating adversarial patches capable of both evasion and impersonation attacks against deep re-identification models across non-overlapping cameras. Unlike prior approaches that require iterative patch optimisation for each target, our method employs a conditional encoder-decoder network to synthesize adversarial patches in a single forward pass, guided by multi-scale features from source and target images. The patches are optimised with a dual adversarial objective comprising of pull and push terms. To enhance imperceptibility and aid physical deployment, we further integrate naturalistic patch generation using pre-trained latent diffusion models. Experiments on standard pedestrian (Market-1501, DukeMTMCreID) and facial recognition benchmarks (CelebA-HQ, PubFig) datasets demonstrate the effectiveness of the proposed method. Our adversarial evasion attacks reduce mean Average Precision from 90% to 0.4% in white-box settings and from 72% to 0.4% in black-box settings, showing strong cross-model generalization. In targeted impersonation attacks, our framework achieves a success rate of 27% on CelebA-HQ, competing with other patch-based methods. We go further to use clustering of activation maps to interpret which features are most used by adversarial attacks and propose a pathway for future countermeasures. The results highlight the practicality of adversarial patch attacks on retrieval-based systems and underline the urgent need for robust defense strategies.
[CV-30] RieMind: Geometry-Grounded Spatial Agent for Scene Understanding
【TL;DR】This paper addresses the weak metric and spatial reasoning of Visual Language Models (VLMs) in indoor scene understanding. Existing approaches rely on end-to-end video understanding or large-scale spatial question-answering fine-tuning, coupling perception and reasoning and thereby limiting reasoning performance. The key is to decouple the two: an explicit 3D scene graph (3DSG) is instantiated from ground-truth annotations, and an agentic framework lets an LLM interact with the scene through structured geometric tools exposing fundamental properties such as object dimensions, distances, poses, and spatial relationships. On the static split of VSI-Bench, this yields spatial reasoning performance under ideal perception that exceeds previous work by up to 16% without task-specific fine-tuning, and improves over base VLMs by 33% to 50% on average, showing that explicit geometric grounding substantially strengthens spatial reasoning.
Link: https://arxiv.org/abs/2603.15386
Authors: Fernando Ropero,Erkin Turkoz,Daniel Matos,Junqing Du,Antonio Ruiz,Yanfeng Zhang,Lu Liu,Mingwei Sun,Yongliang Wang
Affiliations: Huawei
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Visual Language Models (VLMs) have increasingly become the main paradigm for understanding indoor scenes, but they still struggle with metric and spatial reasoning. Current approaches rely on end-to-end video understanding or large-scale spatial question answering fine-tuning, inherently coupling perception and reasoning. In this paper, we investigate whether decoupling perception and reasoning leads to improved spatial reasoning. We propose an agentic framework for static 3D indoor scene reasoning that grounds an LLM in an explicit 3D scene graph (3DSG). Rather than ingesting videos directly, each scene is represented as a persistent 3DSG constructed by a dedicated perception module. To isolate reasoning performance, we instantiate the 3DSG from ground-truth annotations. The agent interacts with the scene exclusively through structured geometric tools that expose fundamental properties such as object dimensions, distances, poses, and spatial relationships. The results we obtain on the static split of VSI-Bench provide an upper bound under ideal perceptual conditions on the spatial reasoning performance, and we find that it is significantly higher than previous works, by up to 16%, without task specific fine-tuning. Compared to base VLMs, our agentic variant achieves significantly better performance, with average improvements between 33% to 50%. These findings indicate that explicit geometric grounding substantially improves spatial reasoning performance, and suggest that structured representations offer a compelling alternative to purely end-to-end visual reasoning.
[CV-31] Spectral Rectification for Parameter-Efficient Adaptation of Foundation Models in Colonoscopy Depth Estimation
【TL;DR】This paper addresses why vision foundation models fail to generalize directly to colonoscopy images: the core issue is not a semantic gap but a statistical shift in the frequency domain, since colonoscopy images lack the high-frequency edge and texture gradients these models rely on for geometric reasoning. The key is SpecDepth, whose adaptive spectral rectification module uses a learnable wavelet decomposition to explicitly model and amplify the attenuated high-frequency components in feature maps. Unlike conventional fine-tuning, this targeted low-level adjustment realigns the input with the model's original inductive bias without distorting high-level semantic features, achieving state-of-the-art results on the C3VD and SimCol3D datasets.
Link: https://arxiv.org/abs/2603.15374
Authors: Xiaoxian Zhang,Minghai Shi,Lei Li
Affiliations: National University of Singapore; Xi'an Jiaotong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages
Abstract:Accurate monocular depth estimation is critical in colonoscopy for lesion localization and navigation. Foundation models trained on natural images fail to generalize directly to colonoscopy. We identify the core issue not as a semantic gap, but as a statistical shift in the frequency domain: colonoscopy images lack the strong high-frequency edge and texture gradients that these models rely on for geometric reasoning. To address this, we propose SpecDepth, a parameter-efficient adaptation framework that preserves the robust geometric representations of the pre-trained models while adapting to the colonoscopy domain. Its key innovation is an adaptive spectral rectification module, which uses a learnable wavelet decomposition to explicitly model and amplify the attenuated high-frequency components in feature maps. Different from conventional fine-tuning that risks distorting high-level semantic features, this targeted, low-level adjustment realigns the input signal with the original inductive bias of the foundational model. On the public C3VD and SimCol3D datasets, SpecDepth achieved state-of-the-art performance with an absolute relative error of 0.022 and 0.027, respectively. Our work demonstrates that directly addressing spectral mismatches is a highly effective strategy for adapting vision foundation models to specialized medical imaging tasks. The code will be released publicly after the manuscript is accepted for publication.
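The core idea of spectral rectification, decomposing features with a wavelet transform and amplifying the attenuated high-frequency band, can be sketched on a 1-D signal with a single-level Haar decomposition. This is only an illustration: the paper's module is learnable and operates on 2-D feature maps, whereas the fixed Haar filter and scalar gain below are assumptions.

```python
def haar_rectify(signal, gain=2.0):
    """One-level Haar decomposition of an even-length 1-D signal,
    amplification of the detail (high-frequency) band by `gain`,
    then inverse transform. Local averages pass through unchanged."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [gain * d for d in detail]          # boost high frequencies
    out = []
    for a, d in zip(approx, detail):
        out.extend([a + d, a - d])               # inverse Haar step
    return out

# A weak local edge has its contrast doubled; constant regions are untouched.
rect = haar_rectify([0.0, 0.0, 1.0, 3.0], gain=2.0)
```

With gain=1 the transform is a perfect reconstruction, so the module reduces to identity when no rectification is needed.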
[CV-32] Trajectory-Diversity-Driven Robust Vision-and-Language Navigation
【TL;DR】This paper addresses the limited generalization and poor robustness to execution perturbations of existing imitation-learning-based Vision-and-Language Navigation (VLN) methods. The key is NavGRPO, a reinforcement learning framework based on Group Relative Policy Optimization (GRPO): by exploring diverse trajectories and optimizing via within-group performance comparisons, agents can identify effective navigation strategies beyond expert paths without an additional value network, yielding significantly more robust policies and higher SPL in unseen environments.
Link: https://arxiv.org/abs/2603.15370
Authors: Jiangyang Li,Cong Wan,SongLin Dong,Chenhao Ding,Qiang Wang,Zhiheng Ma,Yihong Gong
Affiliations: Xi'an Jiaotong University; Faculty of Computility Microelectronics, Shenzhen University of Advanced Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 5 figures
Abstract:Vision-and-Language Navigation (VLN) requires agents to navigate photo-realistic environments following natural language instructions. Current methods predominantly rely on imitation learning, which suffers from limited generalization and poor robustness to execution perturbations. We present NavGRPO, a reinforcement learning framework that learns goal-directed navigation policies through Group Relative Policy Optimization. By exploring diverse trajectories and optimizing via within-group performance comparisons, our method enables agents to distinguish effective strategies beyond expert paths without requiring additional value networks. Built on ScaleVLN, NavGRPO achieves superior robustness on R2R and REVERIE benchmarks with +3.0% and +1.71% SPL improvements in unseen environments. Under extreme early-stage perturbations, we demonstrate +14.89% SPL gain over the baseline, confirming that goal-directed RL training builds substantially more robust navigation policies. Code and models will be released.
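As a general illustration (not the authors' code), GRPO-style training scores a group of sampled trajectories and replaces a learned value baseline with within-group normalization. A minimal sketch of the group-relative advantage computation, assuming scalar per-trajectory rewards such as navigation success or SPL:

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages for one group of sampled trajectories:
    each reward is normalized by the group mean and (population) std,
    so no separate value network is needed as a baseline."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four rollouts of the same instruction with different returns.
adv = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
# Above-mean trajectories get positive advantage, below-mean negative.
```

Trajectories that beat their own group's average are reinforced, which is what lets the policy discover strategies the expert demonstrations never showed.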
[CV-33] IRIS: Intersection-aware Ray-based Implicit Editable Scenes
【TL;DR】This paper addresses the high training and rendering cost of Neural Radiance Fields (NeRF) and the difficulty of flexible editing in 3D Gaussian Splatting, which offers real-time performance; existing hybrids typically rely on stochastic volumetric sampling to aggregate features, limiting rendering efficiency. The key is IRIS (Intersection-aware Ray-based Implicit Editable Scenes): an analytical sampling strategy computes exact ray-primitive intersections to eliminate empty-space processing, and a continuous feature aggregation mechanism interpolates latent attributes at sorted intersections along the ray, replacing costly 3D neighbor lookups while ensuring geometric consistency, high-fidelity real-time rendering, and flexible shape editing.
Link: https://arxiv.org/abs/2603.15368
Authors: Grzegorz Wilczyński,Mikołaj Zieliński,Krzysztof Byrski,Joanna Waczyńska,Dominik Belter,Przemysław Spurek
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Neural Radiance Fields achieve high-fidelity scene representation but suffer from costly training and rendering, while 3D Gaussian splatting offers real-time performance with strong empirical results. Recently, solutions that harness the best of both worlds by using Gaussians as proxies to guide neural field evaluations, still suffer from significant computational inefficiencies. They typically rely on stochastic volumetric sampling to aggregate features, which severely limits rendering performance. To address this issue, a novel framework named IRIS (Intersection-aware Ray-based Implicit Editable Scenes) is introduced as a method designed for efficient and interactive scene editing. To overcome the limitations of standard ray marching, an analytical sampling strategy is employed that precisely identifies interaction points between rays and scene primitives, effectively eliminating empty space processing. Furthermore, to address the computational bottleneck of spatial neighbor lookups, a continuous feature aggregation mechanism is introduced that operates directly along the ray. By interpolating latent attributes from sorted intersections, costly 3D searches are bypassed, ensuring geometric consistency, enabling high-fidelity, real-time rendering, and flexible shape editing. Code can be found at this https URL.
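The analytical sampling idea, evaluating only at exact ray-primitive intersections instead of marching through empty space, can be sketched with spheres as primitives. Spheres are a simplifying assumption here (the method proxies the scene with Gaussian primitives), and the function below is purely illustrative:

```python
import math

def ray_sphere_hits(origin, direction, spheres):
    """Return sorted ray parameters t > 0 where the ray hits each sphere.
    spheres: list of (center, radius); `direction` is assumed unit-length.
    Features would only be aggregated at these points, so empty space
    between primitives is skipped entirely."""
    hits = []
    for center, radius in spheres:
        oc = [o - c for o, c in zip(origin, center)]
        b = sum(d * o for d, o in zip(direction, oc))
        c = sum(o * o for o in oc) - radius * radius
        disc = b * b - c
        if disc < 0:
            continue  # ray misses this primitive
        for t in (-b - math.sqrt(disc), -b + math.sqrt(disc)):
            if t > 0:
                hits.append(t)
    return sorted(hits)

# A ray along +x hits a unit sphere centered at x=5 at t=4 and t=6.
ts = ray_sphere_hits((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), [((5.0, 0.0, 0.0), 1.0)])
```

Sorting the hit parameters gives the front-to-back order needed for compositing and for interpolating latent attributes along the ray.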
[CV-34] A PPO-Based Bitrate Allocation Conditional Diffusion Model for Remote Sensing Image Compression
【TL;DR】This paper addresses the difficulty of balancing high compression ratios with the preservation of fine details and task-relevant information when compressing high-resolution drone remote sensing imagery, as needed for urban monitoring and disaster assessment. The key is PCDC (PPO-based bitrate allocation Conditional Diffusion Compression), which combines a Proximal Policy Optimization (PPO)-based block-wise bitrate allocation strategy with a conditional diffusion decoder, achieving compression ratios of up to 21.2x while maintaining strong perceptual quality and near-lossless downstream object detection performance.
Link: https://arxiv.org/abs/2603.15365
Authors: Yuming Han,Jooho Kim,Anish Shakya
Affiliations: Texas A&M University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Existing remote sensing image compression methods still explore to balance high compression efficiency with the preservation of fine details and task-relevant information. Meanwhile, high-resolution drone imagery offers valuable structural details for urban monitoring and disaster assessment, but large-area datasets can easily reach hundreds of gigabytes, creating significant challenges for storage and long-term management. In this paper, we propose a PPO-based bitrate allocation Conditional Diffusion Compression (PCDC) framework. PCDC integrates a conditional diffusion decoder with a PPO-based block-wise bitrate allocation strategy to achieve high compression ratios while maintaining strong perceptual performance. We also release a high-resolution drone image dataset with richer structural details at a consistent low altitude over residential neighborhoods in coastal urban areas. Experimental results show compression ratios of 19.3x on DIV2K and 21.2x on the drone image dataset. Moreover, downstream object detection experiments demonstrate that the reconstructed images preserve task-relevant information with negligible performance loss.
[CV-35] Oscillating Dispersion for Maximal Light-throughput Spectral Imaging
【TL;DR】This paper addresses the degraded reconstruction quality of existing computational spectral imaging systems under low light, caused by coded apertures and beam splitters that block much of the incident light. The key is the Oscillating Dispersion Imaging Spectrometer (ODIS), which axially translates the disperser between the conjugate image plane and a defocused position to sequentially capture a panchromatic (PAN) image and a dispersed measurement along a single optical path, achieving near-full light throughput. A PAN-guided Dispersion-Aware Deep Unfolding Network (PDAUN) is further proposed: its data-fidelity step uses an FFT-Woodbury preconditioned solver exploiting the cyclic-convolution structure of the ODIS forward model, while a Dispersion-Aware Deformable Convolution module (DADC) corrects sub-pixel spectral misalignment, yielding state-of-the-art reconstruction on standard benchmarks, clear gains under low illumination, and validation on a physical prototype.
Link: https://arxiv.org/abs/2603.15348
Authors: Jiuyun Zhang,Zhan Shi,Linsen Chen,Xun Cao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Existing computational spectral imaging systems typically rely on coded aperture and beam splitters that block a substantial fraction of incident light, degrading reconstruction quality under light-starved conditions. To address this limitation, we develop the Oscillating Dispersion Imaging Spectrometer (ODIS), which for the first time achieves near-full light throughput by axially translating a disperser between the conjugate image plane and a defocused position, sequentially capturing a panchromatic (PAN) image and a dispersed measurement along a single optical path. We further propose a PAN-guided Dispersion-Aware Deep Unfolding Network (PDAUN) that recovers high-fidelity spectral information from maskless dispersion under PAN structural guidance. Its data-fidelity step derives an FFT-Woodbury preconditioned solver by exploiting the cyclic-convolution property of the ODIS forward model, while a Dispersion-Aware Deformable Convolution module (DADC) corrects sub-pixel spectral misalignment using PAN features. Experiments show state-of-the-art performance on standard benchmarks, and cross-system comparisons confirm that ODIS yields decisive gains under low illumination. High-fidelity reconstruction is validated on a physical prototype.
[CV-36] MeMix: Writing Less Remembering More for Streaming 3D Reconstruction
【TL;DR】This paper addresses the long-sequence performance degradation of streaming 3D reconstruction caused by state drift and forgetting in recurrent online models. The key is MeMix, a training-free, plug-and-play module that recasts the recurrent state as a Memory Mixture: the state is partitioned into independent memory patches, and only the patches least aligned with the current input are updated while the rest are exactly preserved. This mitigates catastrophic forgetting while retaining O(1) inference memory, and requires no fine-tuning or extra learnable parameters, making it directly applicable to existing recurrent reconstruction models.
Link: https://arxiv.org/abs/2603.15330
Authors: Jiacheng Dong,Huan Li,Sicheng Zhou,Wenhao Hu,Weili Xu,Yan Wang
Affiliations: Zhejiang University; Institute for AI Industry Research, Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Reconstruction is a fundamental task in 3D vision and a fundamental capability for spatial intelligence. Particularly, streaming 3D reconstruction is central to real-time spatial perception, yet existing recurrent online models often suffer from progressive degradation on long sequences due to state drift and forgetting, motivating inference-time remedies. We present MeMix, a training-free, plug-and-play module that improves streaming reconstruction by recasting the recurrent state into a Memory Mixture. MeMix partitions the state into multiple independent memory patches and updates only the least-aligned memory patches while exactly preserving others. This selective update mitigates catastrophic forgetting while retaining O(1) inference memory, and requires no fine-tuning or additional learnable parameters, making it directly applicable to existing recurrent reconstruction models. Across standard benchmarks (ScanNet, 7-Scenes, KITTI, etc.), under identical backbones and inference settings, MeMix reduces reconstruction completeness error by 15.3% on average (up to 40.0%) across 300–500 frame streams on 7-Scenes. The code is available at this https URL
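The selective-update idea can be sketched abstractly. The shapes, cosine-similarity alignment score, and function names below are hypothetical assumptions for illustration; the paper's exact update rule may differ:

```python
import math

def cosine(u, v, eps=1e-12):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) + eps
    nv = math.sqrt(sum(b * b for b in v)) + eps
    return dot / (nu * nv)

def selective_patch_update(memory_patches, update, k=1):
    """Overwrite only the k memory patches least aligned with the incoming
    feature `update`; all other patches are preserved exactly, which limits
    forgetting while keeping the state size constant (O(1) memory)."""
    order = sorted(range(len(memory_patches)),
                   key=lambda i: cosine(memory_patches[i], update))
    out = [list(p) for p in memory_patches]
    for i in order[:k]:
        out[i] = list(update)
    return out

mem = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
new_mem = selective_patch_update(mem, [1.0, 0.1], k=1)
# Only the patch most orthogonal to the update ([0.0, 1.0]) is replaced.
```

Because well-aligned patches are never touched, long-range scene memory survives many update steps instead of drifting with every frame.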
[CV-37] UE5-Forest: A Photorealistic Synthetic Stereo Dataset for UAV Forestry Depth Estimation
【TL;DR】This paper addresses the practical impossibility of obtaining dense ground-truth disparity maps in forestry environments, a bottleneck for training supervised stereo matching networks for autonomous UAV pruning. The key is UE5-Forest, a photorealistic synthetic stereo dataset built entirely in Unreal Engine 5 (UE5): 115 photogrammetry-scanned trees are placed in virtual scenes and captured by a simulated stereo rig replicating the ZED Mini camera parameters (63 mm baseline, 2.8 mm focal length, 3.84 mm sensor width), yielding 5,520 stereo pairs with pixel-perfect disparity labels and providing a ready-to-use benchmark and training resource for stereo-based forestry depth estimation.
Link: https://arxiv.org/abs/2603.15304
Authors: Yida Lin,Bing Xue,Mengjie Zhang,Sam Schofield,Richard Green
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Dense ground-truth disparity maps are practically unobtainable in forestry environments, where thin overlapping branches and complex canopy geometry defeat conventional depth sensors – a critical bottleneck for training supervised stereo matching networks for autonomous UAV-based pruning. We present UE5-Forest, a photorealistic synthetic stereo dataset built entirely in Unreal Engine 5 (UE5). One hundred and fifteen photogrammetry-scanned trees from the Quixel Megascans library are placed in virtual scenes and captured by a simulated stereo rig whose intrinsics – 63 mm baseline, 2.8 mm focal length, 3.84 mm sensor width – replicate the ZED Mini camera mounted on our drone. Orbiting each tree at up to 2 m across three elevation bands (horizontal, +45 degrees, -45 degrees) yields 5,520 rectified 1920 x 1080 stereo pairs with pixel-perfect disparity labels. We provide a statistical characterisation of the dataset – covering disparity distributions, scene diversity, and visual fidelity – and a qualitative comparison with real-world Canterbury Tree Branches imagery that confirms the photorealistic quality and geometric plausibility of the rendered data. The dataset will be publicly released to provide the community with a ready-to-use benchmark and training resource for stereo-based forestry depth estimation.
[CV-38] Generative Video Compression with One-Dimensional Latent Representation CVPR2026
【TL;DR】This paper addresses the difficulty of fully removing spatial-temporal redundancy in generative video compression (GVC) with 2D latent representations: spatially, the rigid 2D latent grid preserves intra-frame redundancy and inflates bitrate; temporally, it models long-term correlations inefficiently, hindering the aggregation of common content across frames. The key is GVC1D, which encodes video into extremely compact 1D latent tokens conditioned on both short- and long-term contexts. Freed from rigid 2D spatial correspondence, the tokens adaptively attend to semantic regions and naturally support token reduction, cutting spatial redundancy; the proposed 1D memory supplies semantically rich long-term context at low computational cost, further cutting temporal redundancy. On the HEVC Class B dataset, GVC1D achieves bitrate reductions of 60.4% (LPIPS) and 68.8% (DISTS).
Link: https://arxiv.org/abs/2603.15302
Authors: Zihan Zheng,Zhaoyang Jia,Naifu Xue,Jiahao Li,Bin Li,Zongyu Guo,Xiaoyi Zhang,Zhenghao Chen,Houqiang Li,Yan Lu
Affiliations: University of Science and Technology of China; Communication University of China; Microsoft Research Asia; University of Newcastle
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2026
Abstract:Recent advancements in generative video codec (GVC) typically encode video into a 2D latent grid and employ high-capacity generative decoders for reconstruction. However, this paradigm still leaves two key challenges in fully exploiting spatial-temporal redundancy: Spatially, the 2D latent grid inevitably preserves intra-frame redundancy due to its rigid structure, where adjacent patches remain highly similar, thereby necessitating a higher bitrate. Temporally, the 2D latent grid is less effective for modeling long-term correlations in a compact and semantically coherent manner, as it hinders the aggregation of common contents across frames. To address these limitations, we introduce Generative Video Compression with One-Dimensional (1D) Latent Representation (GVC1D). GVC1D encodes the video data into extreme compact 1D latent tokens conditioned on both short- and long-term contexts. Without the rigid 2D spatial correspondence, these 1D latent tokens can adaptively attend to semantic regions and naturally facilitate token reduction, thereby reducing spatial redundancy. Furthermore, the proposed 1D memory provides semantically rich long-term context while maintaining low computational cost, thereby further reducing temporal redundancy. Experimental results indicate that GVC1D attains superior compression efficiency, where it achieves bitrate reductions of 60.4% under LPIPS and 68.8% under DISTS on the HEVC Class B dataset, surpassing the previous video compression this http URL: this https URL
[CV-39] GATE-AD: Graph Attention Network Encoding For Few-Shot Industrial Visual Anomaly Detection
【TL;DR】This paper addresses Few-Shot Industrial Visual Anomaly Detection (FS-IVAD), i.e., accurate defect identification from only a handful (1-8) of normal samples. The key is GATE-AD, a reconstruction-based framework built on a masked, representation-aligned Graph Attention Network (GAT) encoding scheme: dense patch-level visual feature tokens serve as graph nodes, stacked self-attention layers encode complex, non-Euclidean local relations, and a representation-alignment component in a learnable latent space localizes high-reconstruction-residual regions (defects) via a Scaled Cosine Error (SCE) objective, achieving state-of-the-art detection accuracy with the lowest per-image inference latency on the MVTec AD, VisA, and MPDD benchmarks.
Link: https://arxiv.org/abs/2603.15300
Authors: Aggelos Psiris,Yannis Panagakis,Maria Vakalopoulou,Georgios Th. Papadopoulos
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Few-Shot Industrial Visual Anomaly Detection (FS-IVAD) comprises a critical task in modern manufacturing settings, where automated product inspection systems need to identify rare defects using only a handful of normal/defect-free training samples. In this context, the current study introduces a novel reconstruction-based approach termed GATE-AD. In particular, the proposed framework relies on the employment of a masked, representation-aligned Graph Attention Network (GAT) encoding scheme to learn robust appearance patterns of normal samples. By leveraging dense, patch-level, visual feature tokens as graph nodes, the model employs stacked self-attentional layers to adaptively encode complex, irregular, non-Euclidean, local relations. The graph is enhanced with a representation alignment component grounded on a learnable, latent space, where high reconstruction residual areas (i.e., defects) are assessed using a Scaled Cosine Error (SCE) objective function. Extensive comparative evaluation on the MVTec AD, VisA, and MPDD industrial defect detection benchmarks demonstrates that GATE-AD achieves state-of-the-art performance across the 1- to 8-shot settings, combining the highest detection accuracy (increase of up to 1.8% in image AUROC in the 8-shot case on MPDD) with the lowest per-image inference latency (at least 25.05% faster), compared to the best-performing literature methods. In order to facilitate reproducibility and further research, the source code of GATE-AD is available at this https URL.
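A Scaled Cosine Error objective can be illustrated generically. This is a sketch of the common form popularized by masked graph autoencoders (cosine distance raised to a power gamma >= 1), which may not match the paper's exact formulation:

```python
import math

def scaled_cosine_error(x, x_hat, gamma=2.0, eps=1e-12):
    """Scaled Cosine Error between an original feature vector x and its
    reconstruction x_hat: (1 - cos(x, x_hat)) ** gamma.
    gamma > 1 down-weights easy, well-reconstructed tokens and focuses
    the loss on poorly reconstructed (potentially anomalous) ones."""
    dot = sum(a * b for a, b in zip(x, x_hat))
    nx = math.sqrt(sum(a * a for a in x)) + eps
    ny = math.sqrt(sum(b * b for b in x_hat)) + eps
    cos = dot / (nx * ny)
    return (1.0 - cos) ** gamma

# A perfectly reconstructed patch token incurs near-zero loss...
low = scaled_cosine_error([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
# ...while an orthogonal (badly reconstructed) one incurs loss close to 1.
high = scaled_cosine_error([1.0, 0.0], [0.0, 1.0])
```

At test time, patch tokens with high residual under such an objective mark candidate defect regions.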
[CV-40] Faster Inference of Flow-Based Generative Models via Improved Data-Noise Coupling ICLR2025
【TL;DR】This paper addresses a limitation of minibatch optimal transport (OT) couplings in Conditional Flow Matching (CFM) training on large datasets: because OT is optimized only within each minibatch, noise-data assignments are not preserved or improved across batches, limiting the sampling speed-quality trade-off. The key is LOOM-CFM (Looking Out Of Minibatch-CFM), which preserves and optimizes these assignments across minibatches over training time, substantially extending the scope of minibatch OT. LOOM-CFM consistently improves the sampling speed-quality trade-off across multiple datasets, enhances distillation initialization, and supports high-resolution synthesis in latent space.
Link: https://arxiv.org/abs/2603.15279
Authors: Aram Davtyan,Leello Tadesse Dadi,Volkan Cevher,Paolo Favaro
Affiliations: University of Bern; EPFL
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Patched from ICLR2025. Code: this https URL
Abstract:Conditional Flow Matching (CFM), a simulation-free method for training continuous normalizing flows, provides an efficient alternative to diffusion models for key tasks like image and video generation. The performance of CFM in solving these tasks depends on the way data is coupled with noise. A recent approach uses minibatch optimal transport (OT) to reassign noise-data pairs in each training step to streamline sampling trajectories and thus accelerate inference. However, its optimization is restricted to individual minibatches, limiting its effectiveness on large datasets. To address this shortcoming, we introduce LOOM-CFM (Looking Out Of Minibatch-CFM), a novel method to extend the scope of minibatch OT by preserving and optimizing these assignments across minibatches over training time. Our approach demonstrates consistent improvements in the sampling speed-quality trade-off across multiple datasets. LOOM-CFM also enhances distillation initialization and supports high-resolution synthesis in latent space training.
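The minibatch OT coupling that LOOM-CFM extends can be sketched generically: within a batch, noise samples are reassigned to data samples so as to minimize total squared transport cost, straightening the flow-matching trajectories. The toy version below brute-forces all permutations, which is feasible only for tiny batches; real implementations use a Hungarian or Sinkhorn solver, and the function name is an illustrative assumption:

```python
import itertools

def minibatch_ot_pairing(noise, data):
    """Return perm such that noise[perm[j]] is assigned to data[j],
    minimizing the total squared distance over the batch — the
    optimal-transport reassignment used in OT-CFM-style training.
    Brute force over permutations: O(B!) — illustration only."""
    def cost(perm):
        return sum(
            sum((a - b) ** 2 for a, b in zip(noise[i], x))
            for i, x in zip(perm, data)
        )
    best = min(itertools.permutations(range(len(noise))), key=cost)
    return list(best)

# Two noise points and two data points on a line: OT pairs nearest
# neighbours, avoiding the crossing paths a random pairing would produce.
perm = minibatch_ot_pairing([(0.0,), (10.0,)], [(9.0,), (1.0,)])
```

LOOM-CFM's contribution is, roughly, to carry such assignments across minibatches over training rather than recomputing them from scratch each step.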
[CV-41] Dataset Diversity Metrics and Impact on Classification Models
【TL;DR】This paper addresses the quantification and evaluation of training dataset diversity for model robustness across image, text, and metadata modalities, where existing diversity metrics lack a common definition and are often overlooked. The key is a systematic evaluation of multiple reference-free and feature-based diversity metrics (including FID and semantic diversity metrics), validated against both the intuition of a clinical expert and downstream-task performance. The study finds limited correlations between AUC and the image or metadata reference-free metrics, but higher correlations with FID and the semantic diversity metrics, which also align better with expert intuition; meanwhile, naively adding another scanner to the training set can induce shortcut learning, showing that more diversity is not automatically better.
Link: https://arxiv.org/abs/2603.15276
Authors: Théo Sourget,Niclas Claßen,Jack Junchi Xu,Rob van der Goot,Veronika Cheplygina
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The diversity of training datasets is usually perceived as an important aspect to obtain a robust model. However, the definition of diversity is often not defined or differs across papers, and while some metrics exist, the quantification of this diversity is often overlooked when developing new algorithms. In this work, we study the behaviour of multiple dataset diversity metrics for image, text and metadata using MorphoMNIST, a toy dataset with controlled perturbations, and PadChest, a publicly available chest X-ray dataset. We evaluate whether these metrics correlate with each other but also with the intuition of a clinical expert. We also assess whether they correlate with downstream-task performance and how they impact the training dynamic of the models. We find limited correlations between the AUC and image or metadata reference-free diversity metrics, but higher correlations with the FID and the semantic diversity metrics. Finally, the clinical expert indicates that scanners are the main source of diversity in practice. However, we find that the addition of another scanner to the training set leads to shortcut learning. The code used in this study is available at this https URL
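作为参考,一种常见的无参考(reference-free)特征多样性度量是平均成对余弦距离,可示意如下(示例性实现;论文实际比较的指标族更丰富,包括 FID 与语义多样性指标):

```python
import numpy as np

def pairwise_cosine_diversity(feats):
    """Mean pairwise cosine distance over feature vectors: a simple
    reference-free diversity proxy (illustrative only)."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    n = len(f)
    # average cosine similarity over off-diagonal pairs only
    mean_off_diag_sim = (sim.sum() - np.trace(sim)) / (n * (n - 1))
    return 1.0 - mean_off_diag_sim

rng = np.random.default_rng(0)
clustered = 1.0 + 0.05 * rng.normal(size=(50, 16))  # near-identical features
dispersed = rng.normal(size=(50, 16))               # spread-out features
```

聚集的特征集得分接近 0,分散的特征集得分接近 1。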
[CV-42] Flash-Unified: A Training-Free and Task-Aware Acceleration Framework for Native Unified Models CVPR2026
【速读】:该论文旨在解决原生统一多模态模型(Native unified multimodal models)在实际部署中面临的显著计算开销问题,尤其针对生成任务(如图像生成)与理解任务(如视觉问答 VQA)之间存在本质计算差异时,现有静态、单一的加速策略无法有效优化性能的问题。其解决方案的关键在于首次系统性揭示了统一模型中的参数专业化现象——即不同神经元集合对生成和理解任务分别至关重要,从而表明模型内部已隐式构建了独立的推理路径。基于此发现,作者提出无需训练的任务感知加速框架 FlashU,核心包括:任务特定网络剪枝(Task-Specific Network Pruning)与动态层跳过(Dynamic Layer Skipping)以消除跨层及任务冗余;针对视觉生成任务引入时变引导尺度控制与扩散头缓存机制(Diffusion Head Cache)进行时间近似;针对多模态理解任务则通过 V-Norm 代理实现动态 Token 剪枝,利用视觉输入的空间冗余性。实验表明,FlashU 在保持最先进性能的同时,实现了 1.78× 至 2.01× 的推理加速。
链接: https://arxiv.org/abs/2603.15271
作者: Junlong Ke,Zichen Wen,Boxue Yang,Yantai Yang,Xuyang Liu,Chenfei Liao,Zhaorun Chen,Shaobo Wang,Linfeng Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Tsinghua University (清华大学); Shanghai AI Laboratory (上海人工智能实验室); Sichuan University (四川大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州) ); University of Chicago (芝加哥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026 Findings
Abstract:Native unified multimodal models, which integrate both generative and understanding capabilities, face substantial computational overhead that hinders their real-world deployment. Existing acceleration techniques typically employ a static, monolithic strategy, ignoring the fundamental divergence in computational profiles between iterative generation tasks (e.g., image generation) and single-pass understanding tasks (e.g., VQA). In this work, we present the first systematic analysis of unified models, revealing pronounced parameter specialization, where distinct neuron sets are critical for each task. This implies that, at the parameter level, unified models have implicitly internalized separate inference pathways for generation and understanding within a single architecture. Based on these insights, we introduce a training-free and task-aware acceleration framework, FlashU, that tailors optimization to each task’s demands. Across both tasks, we introduce Task-Specific Network Pruning and Dynamic Layer Skipping, aiming to eliminate inter-layer and task-specific redundancy. For visual generation, we implement a time-varying control signal for the guidance scale and a temporal approximation for the diffusion head via Diffusion Head Cache. For multimodal understanding, building upon the pruned model, we introduce Dynamic Token Pruning via a V-Norm Proxy to exploit the spatial redundancy of visual inputs. Extensive experiments on Show-o2 demonstrate that FlashU achieves 1.78× to 2.01× inference acceleration across both understanding and generation tasks while maintaining SOTA performance, outperforming competing unified models and validating our task-aware acceleration paradigm. Our code is publicly available at this https URL.
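其中基于范数代理的动态 Token 剪枝思想可用极简草图示意(假设性实现;FlashU 实际以注意力 value 投影的范数作为代理,此处的函数与保留比例均为示例):

```python
import numpy as np

def norm_proxy_token_prune(value_vectors, keep_ratio=0.5):
    """Rank visual tokens by the L2 norm of their value vectors and keep
    the top fraction, preserving original token order (a minimal
    stand-in for a V-Norm-style proxy)."""
    norms = np.linalg.norm(value_vectors, axis=-1)
    k = max(1, int(len(value_vectors) * keep_ratio))
    keep = np.sort(np.argsort(-norms)[:k])  # kept token indices, in order
    return value_vectors[keep], keep

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
pruned, kept = norm_proxy_token_prune(tokens, keep_ratio=0.25)
```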
[CV-43] Self-Supervised ImageNet Representations for In Vivo Confocal Microscopy: Tortuosity Grading without Segmentation Maps
【速读】:该论文旨在解决角膜神经纤维弯曲度(tortuosity)分级中依赖昂贵分割图的问题,传统方法严重依赖高成本的神经纤维分割结果来评估疾病状态。其解决方案的关键在于利用ImageNet预训练的自监督特征(如DINO模型)进行迁移学习,通过精细微调后在无需分割图的情况下实现更优的分类性能——准确率提升至84.25%,灵敏度达77.97%,且模型聚焦于关键形态学特征,显著优于现有方法。
链接: https://arxiv.org/abs/2603.15269
作者: Kim Ouan,Noémie Moreau,Katarzyna Bozek
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 4 figures
Abstract:The tortuosity of corneal nerve fibers is used as an indication for different diseases. Current state-of-the-art methods for grading the tortuosity heavily rely on expensive segmentation maps of these nerve fibers. In this paper, we demonstrate that self-supervised pretrained features from ImageNet are transferable to the domain of in vivo confocal microscopy. We show that DINO should not be disregarded as a deep learning model for medical imaging, although it was superseded by two later versions. After careful fine-tuning, DINO improves upon the state-of-the-art in terms of accuracy (84.25%) and sensitivity (77.97%). Our fine-tuned model focuses on the key morphological elements in grading without the use of segmentation maps.
[CV-44] Exemplar Diffusion: Improving Medical Object Detection with Opportunistic Labels MICCAI2026
【速读】:该论文旨在解决医学图像中目标检测性能受限于训练数据标注不足或标注质量不高的问题,尤其在测试阶段缺乏有效利用已有标注信息(称为exemplars)的机制。解决方案的关键在于提出一种名为exemplar diffusion的方法,该方法基于现有的扩散模型(diffusion models)框架,在推理阶段无需重新训练即可引入已知边界框(bounding boxes)信息,从而提升检测的平均精度(average precision)和召回率(recall),并展现出对exemplar质量的鲁棒性,使得非专家标注也能有效增强检测性能。此外,该方法还可用于量化扩散检测模型的预测不确定性。
链接: https://arxiv.org/abs/2603.15267
作者: Victor Wåhlstrand,Jennifer Alvén,Ida Häggström
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to MICCAI 2026
Abstract:We present a framework to take advantage of existing labels at inference, called exemplars, in order to improve the performance of object detection in medical images. The method, exemplar diffusion, leverages existing diffusion methods for object detection to enable a training-free approach to adding information of known bounding boxes at test time. We demonstrate that for medical image datasets with clear spatial structure, the method yields an across-the-board increase in average precision and recall, and a robustness to exemplar quality, enabling non-expert annotation. Moreover, we demonstrate how our method may also be used to quantify predictive uncertainty in diffusion detection methods. Source code and data splits openly available online: this https URL
[CV-45] IConE: Batch Independent Collapse Prevention for Self-Supervised Representation Learning
【速读】:该论文旨在解决自监督学习(Self-supervised Learning, SSL)中基于联合嵌入架构(Joint-Embedding Architectures, JEAs)的表示学习在小批量(small-batch)场景下易发生表示坍塌(representation collapse)的问题,尤其是在高维科学数据(如生物医学影像)中,由于内存限制和类别不平衡导致难以构建大而平衡的批次。解决方案的关键在于提出 IConE(Instance-Contrasted Embeddings)框架,其核心创新是将防止坍塌的机制从依赖批次统计量(batch statistics)转移到一个全局可学习的辅助实例嵌入集合(learnable auxiliary instance embeddings),并通过显式的多样性目标进行正则化,从而实现不依赖批大小的稳定训练,即使在 batch size=1 时也能保持高性能与鲁棒性。
链接: https://arxiv.org/abs/2603.15263
作者: Konstantinos Almpanakis,Anna Kreshuk
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Self-supervised learning (SSL) has revolutionized representation learning, with Joint-Embedding Architectures (JEAs) emerging as an effective approach for capturing semantic features. Existing JEAs rely on implicit or explicit batch interaction – via negative sampling or statistical regularization – to prevent representation collapse. This reliance becomes problematic in regimes where batch sizes must be small, such as high-dimensional scientific data, where memory constraints and class imbalance make large, well-balanced batches infeasible. We introduce IConE (Instance-Contrasted Embeddings), a framework that decouples collapse prevention from the training batch size. Rather than enforcing diversity through batch statistics, IConE maintains a global set of learnable auxiliary instance embeddings regularized by an explicit diversity objective. This transfers the anti-collapse mechanism from the transient batch to a dataset-level embedding space, allowing stable training even when batch statistics are unreliable, down to batch size 1. Across diverse 2D and 3D biomedical modalities, IConE outperforms strong contrastive and non-contrastive baselines throughout the small-batch regime (from B=1 to B=64) and demonstrates marked robustness to severe class imbalance. Geometric analysis shows that IConE preserves high intrinsic dimensionality in the learned representations, preventing the collapse observed in existing JEAs as batch sizes shrink.
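其防坍塌机制可理解为对全局辅助嵌入库施加显式去相关约束,示意如下(假设性实现,论文实际使用的多样性目标可能不同):

```python
import numpy as np

def embedding_bank_diversity_loss(E):
    """Penalize mean squared off-diagonal cosine similarity of a bank of
    auxiliary instance embeddings; minimizing this keeps the bank spread
    out, providing an anti-collapse signal independent of batch size."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    G = En @ En.T                       # cosine Gram matrix, diagonal == 1
    n = len(E)
    return ((G ** 2).sum() - n) / (n * (n - 1))

collapsed = np.ones((4, 4))             # every embedding identical -> loss 1
orthogonal = np.eye(4)                  # maximally spread bank -> loss 0
```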
[CV-46] AGCD: Agent-Guided Cross-Modal Decoding for Weather Forecasting
【速读】:该论文旨在解决天气预报模型在自回归滚动预测中因局部误差累积导致的结构偏差问题,即现有方法难以同时保持气象场的连贯天气尺度结构(synoptic structures)和物理一致性。传统基于物理先验(physics-priors)的方法通常采用全局、一次性约束策略(如架构设计、正则化或与数值天气预报 NWP 耦合),缺乏部署时对状态自适应和样本特异性控制的能力。其解决方案的关键在于提出一种可插拔的“解码时先验注入”范式——Agent-Guided Cross-modal Decoding (AGCD),该方法通过多智能体气象叙述流水线从当前多变量大气状态中提取条件感知的物理先验,并借助跨模态区域交互解码机制,在不改变骨干网络接口的前提下,实现对视觉特征的高效、可控且可复用的物理先验注入,从而显著提升短期至中期预报的准确性与稳定性。
链接: https://arxiv.org/abs/2603.15260
作者: Jing Wu,Yang Liu,Lin Zhang,Junbo Zeng,Jiabin Wang,Zi Ye,Guowen Li,Shilei Cao,Jiashun Cheng,Fang Wang,Meng Jin,Yerong Feng,Hong Cheng,Yutong Lu,Haohuan Fu,Juepeng Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate weather forecasting is more than grid-wise regression: it must preserve coherent synoptic structures and physical consistency of meteorological fields, especially under autoregressive rollouts where small one-step errors can amplify into structural bias. Existing physics-priors approaches typically impose global, once-for-all constraints via architectures, regularization, or NWP coupling, offering limited state-adaptive and sample-specific controllability at deployment. To bridge this gap, we propose Agent-Guided Cross-modal Decoding (AGCD), a plug-and-play decoding-time prior-injection paradigm that derives state-conditioned physics-priors from the current multivariate atmosphere and injects them into forecasters in a controllable and reusable way. Specifically, we design a multi-agent meteorological narration pipeline to generate state-conditioned physics-priors, utilizing MLLMs to extract various meteorological elements effectively. To effectively apply the priors, AGCD further introduces cross-modal region interaction decoding that performs region-aware multi-scale tokenization and efficient physics-priors injection to refine visual features without changing the backbone interface. Experiments on WeatherBench demonstrate consistent gains for 6-hour forecasting across two resolutions (5.625 degree and 1.40625 degree) and diverse backbones (generic and weather-specialized), including strictly causal 48-hour autoregressive rollouts that reduce early-stage error accumulation and improve long-horizon stability.
[CV-47] HalDec-Bench: Benchmarking Hallucination Detector in Image Captioning
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在图像描述中产生幻觉(hallucination)时,缺乏一个全面、可解释且具备良好泛化能力的评估基准问题。当前对幻觉检测(HalDec)的研究受限于现有基准无法有效区分不同类型的幻觉以及跨不同生成模型和幻觉类型的表现差异。为此,作者提出 HalDec-Bench,这是一个结构化的基准测试平台,包含由多种VLM生成的图像描述、人工标注的幻觉存在性标签、细粒度的幻觉类型分类及像素级分割标签,从而支持多难度层级的任务设计。其解决方案的关键在于:通过构建具有高标注质量与多样性的真实世界幻觉数据集,揭示了现有VLM作为幻觉检测器时存在的系统性偏差(如优先信任开头句子),并验证了利用强VLM作为过滤器可显著降低训练数据噪声,进而提升图像-文本对的质量,为高质量数据筛选提供新范式。
链接: https://arxiv.org/abs/2603.15253
作者: Kuniaki Saito,Risa Shinoda,Shohei Tanaka,Tosho Hirasawa,Fumio Okura,Yoshitaka Ushiku
机构: University of Osaka (大阪大学); The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Hallucination detection in captions (HalDec) assesses a vision-language model’s ability to correctly align image content with text by identifying errors in captions that misrepresent the image. Beyond evaluation, effective hallucination detection is also essential for curating high-quality image-caption pairs used to train VLMs. However, the generalizability of VLMs as hallucination detectors across different captioning models and hallucination types remains unclear due to the lack of a comprehensive benchmark. In this work, we introduce HalDec-Bench, a benchmark designed to evaluate hallucination detectors in a principled and interpretable manner. HalDec-Bench contains captions generated by diverse VLMs together with human annotations indicating the presence of hallucinations, detailed hallucination-type categories, and segment-level labels. The benchmark provides tasks with a wide range of difficulty levels and reveals performance differences across models that are not visible in existing multimodal reasoning or alignment benchmarks. Our analysis further uncovers two key findings. First, detectors tend to recognize sentences appearing at the beginning of a response as correct, regardless of their actual correctness. Second, our experiments suggest that dataset noise can be substantially reduced by using strong VLMs as filters while employing recent VLMs as caption generators. Our project page is available at this https URL.
[CV-48] Multi-turn Physics-informed Vision-language Model for Physics-grounded Anomaly Detection ICASSP2026
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在物理 grounded 异常检测任务中表现不佳的问题,尤其是其缺乏对动态因果关系的理解,导致无法有效识别如异常旋转或违反力学规律的运动。解决方案的关键在于提出一种物理信息引导的指令微调框架,通过结构化提示显式编码物体属性、运动范式和动力学约束,并借助多轮对话形式将这些物理先验传递给模型,从而将因果推理分解为渐进步骤,构建对正常与异常动态的鲁棒内部表征。该方法在Phys-AD基准上实现了96.7%的AUROC,显著优于先前最先进方法(66.9%),并提供更优的因果解释能力(LLM评分0.777)。
链接: https://arxiv.org/abs/2603.15237
作者: Yao Gu,Xiaohao Xu,Yingna Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE ICASSP2026
Abstract:Vision-Language Models (VLMs) demonstrate strong general-purpose reasoning but remain limited in physics-grounded anomaly detection, where causal understanding of dynamics is essential. Existing VLMs, trained predominantly on appearance-centric correlations, fail to capture kinematic constraints, leading to poor performance on anomalies such as irregular rotations or violated mechanical motions. We introduce a physics-informed instruction tuning framework that explicitly encodes object properties, motion paradigms, and dynamic constraints into structured prompts. By delivering these physical priors through multi-turn dialogues, our method decomposes causal reasoning into incremental steps, enabling robust internal representations of normal and abnormal dynamics. Evaluated on the Phys-AD benchmark, our approach achieves 96.7% AUROC in video-level detection–substantially outperforming prior SOTA (66.9%)–and yields superior causal explanations (0.777 LLM score). This work highlights how structured physics priors can transform VLMs into reliable detectors of dynamic anomalies.
[CV-49] HYDRA: Unifying Multi-modal Generation and Understanding via Representation-Harmonized Tokenization DATE
【速读】:该论文旨在解决统一多模态模型(Unified Multimodal Models, UMMs)在视觉理解与生成任务之间存在的根本性鸿沟问题,即抽象表征与细节生成原语之间的不匹配。现有方法通常通过解耦编码器、堆叠VAE结构或使用离散量化来折中处理,但这些策略常导致信息连贯性破坏和优化冲突。其解决方案的关键在于提出HYDRA-TOK——一种基于纯视觉Transformer(ViT)的表示和谐化架构,它将标准骨干网络重构为一个渐进式学习器:从生成导向的Gen-ViT(结构保持原语捕捉)逐步过渡到语义导向的Sem-ViT(语义编码),并通过一个生成-语义瓶颈(Generation-Semantic Bottleneck, GSB)实现特征压缩与恢复,以过滤噪声并增强语义理解能力。这一设计使HYDRA框架能够在单一参数空间内原生整合感知与生成任务,显著提升性能。
链接: https://arxiv.org/abs/2603.15228
作者: Xuerui Qiu,Yutao Cui,Guozhen Zhang,Junzhe Li,JiaKui Hu,Xiao Zhang,Yang Li,Songtao Liu,Miles Yang,Yu Shi,Zhao Zhong,Liefeng Bo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Work in progress: We are actively scaling up the models. More updates coming soon
Abstract:Unified Multimodal Models struggle to bridge the fundamental gap between the abstract representations needed for visual understanding and the detailed primitives required for generation. Existing approaches typically compromise by employing decoupled encoders, stacking a representation encoder atop VAEs, or utilizing discrete quantization. However, these methods often disrupt information coherence and lead to optimization conflicts. To this end, we introduce HYDRA-TOK, a representation-harmonized pure ViT built on the insight that visual modeling should evolve from generation to understanding. HYDRA-TOK reformulates the standard backbone into a progressive learner that transitions from a Gen-ViT, which captures structure-preserving primitives, to a Sem-ViT for semantic encoding. Crucially, this transition is mediated by a Generation-Semantic Bottleneck (GSB), which compresses features into a low-dimensional space to filter noise for robust synthesis, then restores dimensionality to empower complex semantic comprehension. Built upon this foundation, we present HYDRA, a native unified framework integrating perception and generation within a single parameter space. Extensive experiments establish HYDRA as a new state-of-the-art. It sets a benchmark in visual reconstruction (rFID 0.08) and achieves top-tier generation performance on GenEval (0.86), DPG-Bench (86.4), and WISE (0.53), while simultaneously outperforming previous native UMMs by an average of 10.0 points across eight challenging understanding benchmarks.
[CV-50] Tracking the Discriminative Axis: Dual Prototypes for Test-Time OOD Detection Under Covariate Shift
【速读】:该论文旨在解决深度学习系统在真实世界中部署时面临的分布外(Out-of-Distribution, OOD)检测难题,尤其是在测试阶段输入为分布内(In-Distribution, ID)与OOD样本的动态混合流、且存在协变量偏移(Covariate Shift) 的场景下,传统方法因假设ID分布静态不变而性能严重下降的问题。解决方案的关键在于:发现并利用协变量偏移下ID与OOD样本在特征空间中仍沿可分离判别轴(discriminative axis)分布的规律,提出DART方法——一种测试时在线的OOD检测机制,通过动态追踪ID和OOD的双原型(dual prototypes)以恢复漂移的判别轴,并结合多层融合与翻转校正增强鲁棒性,从而实现对动态环境中的OOD样本高效准确识别。
链接: https://arxiv.org/abs/2603.15213
作者: Wooseok Lee,Jin Mo Yang,Saewoong Bahk,Hyung-Sin Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:For reliable deployment of deep-learning systems, out-of-distribution (OOD) detection is indispensable. In the real world, where test-time inputs often arrive as streaming mixtures of in-distribution (ID) and OOD samples under evolving covariate shifts, OOD samples are domain-constrained and bounded by the environment, and both ID and OOD are jointly affected by the same covariate factors. Existing methods typically assume a stationary ID distribution, but this assumption breaks down in such settings, leading to severe performance degradation. We empirically discover that, even under covariate shift, covariate-shifted ID (csID) and OOD (csOOD) samples remain separable along a discriminative axis in feature space. Building on this observation, we propose DART, a test-time, online OOD detection method that dynamically tracks dual prototypes – one for ID and the other for OOD – to recover the drifting discriminative axis, augmented with multi-layer fusion and flip correction for robustness. Extensive experiments on a wide range of challenging benchmarks, where all datasets are subjected to 15 common corruption types at severity level 5, demonstrate that our method significantly improves performance, yielding 15.32 percentage points (pp) AUROC gain and 49.15 pp FPR@95TPR reduction on ImageNet-C vs. Textures-C compared to established baselines. These results highlight the potential of the test-time discriminative axis tracking for dependable OOD detection in dynamically changing environments.
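其双原型在线追踪的核心循环可示意如下(简化草图,省略了多层融合与翻转校正;动量与阈值均为示例取值):

```python
import numpy as np

def dart_step(feat, proto_id, proto_ood, momentum=0.9, thresh=0.0):
    """One online step of dual-prototype tracking: score a test feature
    by its position along the ID-OOD discriminative axis, then
    EMA-update the prototype on the matching side."""
    axis = proto_id - proto_ood
    mid = (proto_id + proto_ood) / 2
    score = (feat - mid) @ axis          # > 0: closer to the ID side
    if score > thresh:
        proto_id = momentum * proto_id + (1 - momentum) * feat
    else:
        proto_ood = momentum * proto_ood + (1 - momentum) * feat
    return score, proto_id, proto_ood

rng = np.random.default_rng(0)
p_id, p_ood = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
# a stream of covariate-shifted ID samples drags the ID prototype along
for _ in range(50):
    x = rng.normal(loc=[0.8, 0.0], scale=0.05)
    _, p_id, p_ood = dart_step(x, p_id, p_ood)
```

随着协变量偏移的 ID 样本流入,ID 原型被逐步拉向新分布,而 OOD 原型保持不变,判别轴随之恢复。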
[CV-51] What Matters for Scalable and Robust Learning in End-to-End Driving Planners? CVPR
【速读】:该论文旨在解决当前端到端自动驾驶(end-to-end autonomous driving)架构在闭环(closed-loop)场景下难以实现可扩展鲁棒学习的问题。尽管现有方法在开环数据集上表现优异,且常采用感知与规划模块分离、通过鸟瞰图特征网格(bird’s eye view feature grids)连接的结构以保持端到端可微性,但这些设计在真实闭环驾驶中往往失效。论文通过系统分析三种常见架构模式——高分辨率感知表示、轨迹表示解耦以及生成式规划——揭示了它们在闭环中的局限性和潜在协同效应。解决方案的关键在于提出一种新型轻量级、高度可扩展的端到端驾驶架构BevAD,其在Bench2Drive基准上实现了72.7%的成功率,并展现出纯模仿学习下的良好数据扩展性。
链接: https://arxiv.org/abs/2603.15185
作者: David Holtz,Niklas Hanselmann,Simon Doll,Marius Cordts,Bernt Schiele
机构: Mercedes-Benz AG (梅赛德斯-奔驰集团); Max-Planck-Institute for Informatics, SIC (马克斯·普朗克信息研究所,SIC)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: To be published in CVPR Findings 2026
Abstract:End-to-end autonomous driving has gained significant attention for its potential to learn robust behavior in interactive scenarios and scale with data. Popular architectures often build on separate modules for perception and planning connected through latent representations, such as bird’s eye view feature grids, to maintain end-to-end differentiability. This paradigm emerged mostly on open-loop datasets, with evaluation focusing not only on driving performance, but also intermediate perception tasks. Unfortunately, architectural advances that excel in open-loop often fail to translate to scalable learning of robust closed-loop driving. In this paper, we systematically re-examine the impact of common architectural patterns on closed-loop performance: (1) high-resolution perceptual representations, (2) disentangled trajectory representations, and (3) generative planning. Crucially, our analysis evaluates the combined impact of these patterns, revealing both unexpected limitations as well as underexplored synergies. Building on these insights, we introduce BevAD, a novel lightweight and highly scalable end-to-end driving architecture. BevAD achieves 72.7% success rate on the Bench2Drive benchmark and demonstrates strong data-scaling behavior using pure imitation learning. Our code and models are publicly available here: this https URL
[CV-52] Multimodal Connectome Fusion via Cross-Attention for Autism Spectrum Disorder Classification Using Graph Learning
【速读】:该论文旨在解决多模态脑影像数据(功能磁共振成像 rs-fMRI 与结构磁共振成像 sMRI)在自闭症谱系障碍(ASD)分类中难以有效融合的问题,尤其针对来自多中心的数据异质性挑战。其关键解决方案是提出一种基于图神经网络的多模态图学习框架,其中通过引入一种新颖的不对称Transformer交叉注意力机制,使功能连接嵌入能够选择性地整合结构信息,同时保持功能主导性;该机制在保留各模态特异性特征的基础上实现了高效融合,并结合基于表型信息的配对关联编码器建模个体间关系,最终在ABIDE-I数据集上显著提升了跨站点ASD分类性能(LOSO-CV下平均准确率达82.0%,优于现有方法约7%)。
链接: https://arxiv.org/abs/2603.15168
作者: Ansar Rahman,Hassan Shojaee-Mend,Sepideh Hatamikia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 29 Pages; 5 Figures
Abstract:Autism spectrum disorder (ASD) is a complex neurodevelopmental condition characterized by atypical functional brain connectivity and subtle structural alterations. rs-fMRI has been widely used to identify disruptions in large-scale brain networks, while structural MRI provides complementary information about morphological organization. Despite their complementary nature, effectively integrating these heterogeneous imaging modalities within a unified framework remains challenging. This study proposes a multimodal graph learning framework that preserves the dominant role of functional connectivity while integrating structural imaging and phenotypic information for ASD classification. The proposed framework is evaluated on ABIDE-I dataset. Each subject is represented as a node within a population graph. Functional and structural features are extracted as modality-specific node attributes, while inter-subject relationships are modeled using a pairwise association encoder (PAE) based on phenotypic information. Two Edge Variational GCNs are trained to learn subject-level embeddings. To enable effective multimodal integration, we introduce a novel asymmetric transformer-based cross-attention mechanism that allows functional embeddings to selectively incorporate complementary structural information while preserving functional dominance. The fused embeddings are then passed to a MLP for ASD classification. Using stratified 10-fold cross-validation, the framework achieved an AUC of 87.3% and an accuracy of 84.4%. Under leave-one-site-out cross-validation (LOSO-CV), the model achieved an average cross-site accuracy of 82.0%, outperforming existing methods by approximately 3% under 10-fold cross-validation and 7% under LOSO-CV. The proposed framework effectively integrates heterogeneous multimodal data from the multi-site ABIDE-I dataset, improving automated ASD classification across imaging sites.
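其中不对称交叉注意力的方向性(功能嵌入作 query、结构嵌入作 key/value,残差保持功能主导)可用单头草图示意(省略了 Q/K/V 投影矩阵,属示例性实现):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def asymmetric_cross_attention(func_emb, struct_emb):
    """Functional embeddings query structural keys/values; the residual
    connection keeps the functional stream dominant while structural
    information is selectively mixed in."""
    d = func_emb.shape[-1]
    attn = softmax(func_emb @ struct_emb.T / np.sqrt(d))
    return func_emb + attn @ struct_emb

rng = np.random.default_rng(0)
func_emb = rng.normal(size=(10, 8))    # 10 subjects, functional features
struct_emb = rng.normal(size=(10, 8))  # matching structural features
fused = asymmetric_cross_attention(func_emb, struct_emb)
```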
[CV-53] Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding CVPR2026
【速读】:该论文旨在解决长视频理解任务中因传统单向感知-记忆流程导致的事件完整性理解不足问题,尤其在需要全局时序推理的任务(如MLVU和VNBench中的时间排序)上表现不佳。其解决方案的关键在于提出一种基于记忆反馈的视觉压缩框架QViC-MF,核心创新是引入Question-guided Multimodal Selective Attention (QMSA),通过迭代式地融合当前视频片段与上下文记忆中相关历史帧的信息,实现对问题相关的视觉内容的动态保留与增强,从而显著提升模型对长视频中完整事件的理解能力。
链接: https://arxiv.org/abs/2603.15167
作者: Sosuke Yamao,Natsuki Miyahara,Yuankai Qi,Shun Takeuchi
机构: Fujitsu Research (富士通研究); Macquarie University (麦考瑞大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026. The first two authors contributed equally to this work
Abstract:In the context of long-term video understanding with large multimodal models, many frameworks have been proposed. Although transformer-based visual compressors and memory-augmented approaches are often used to process long videos, they usually compress each frame independently and therefore fail to achieve strong performance on tasks that require understanding complete events, such as temporal ordering tasks in MLVU and VNBench. This motivates us to rethink the conventional one-way scheme from perception to memory, and instead establish a feedback-driven process in which past visual contexts stored in the context memory can benefit ongoing perception. To this end, we propose Question-guided Visual Compression with Memory Feedback (QViC-MF), a framework for long-term video understanding. At its core is a Question-guided Multimodal Selective Attention (QMSA), which learns to preserve visual information related to the given question from both the current clip and the past related frames from the memory. The compressor and memory feedback work iteratively for each clip of the entire video. This simple yet effective design yields large performance gains on long-term video understanding tasks. Extensive experiments show that our method achieves significant improvement over current state-of-the-art methods by 6.1% on MLVU test, 8.3% on LVBench, 18.3% on VNBench Long, and 3.7% on VideoMME Long. The code will be released publicly.
[CV-54] DAIT: Distillation from Vision-Language Models to Lightweight Classifiers with Adaptive Intermediate Teacher Transfer
【速读】:该论文旨在解决大规模视觉-语言模型(VLMs)在细粒度视觉分类(FGVC)任务中因计算成本过高而难以部署于资源受限环境的问题。现有知识蒸馏方法直接从通用VLM向轻量级学生模型传递知识,常因架构不匹配和引入与任务无关的信息导致性能不佳。其解决方案的关键在于提出一种自适应中间教师蒸馏机制(DAIT),通过引入一个可训练的中间教师网络,在目标细粒度任务的显式监督下学习如何从冻结的VLM中提取并优化判别性视觉特征,从而生成紧凑且任务对齐的知识表示,有效提升轻量模型的迁移性能。
链接: https://arxiv.org/abs/2603.15166
作者: Zhengxu He,Jun Li,Zhijian Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large-scale Vision-Language Models (VLMs) encode rich multimodal semantics that are highly beneficial for fine-grained visual categorization (FGVC). However, their prohibitive computational cost hinders practical deployment in resource-constrained environments. Although knowledge distillation contributes to transferring VLMs capacity to lightweight classifiers, conventional distillation mechanisms, which directly transfer from a generic VLM to a compact student, often yield suboptimal results due to severe architectural misalignment and introducing task-irrelevant information. To alleviate this limitation, we propose Distillation with Adaptive Intermediate Teacher transfer (DAIT) in this study, facilitating adaptive knowledge transfer from VLMs to lightweight students. DAIT introduces a trainable intermediate teacher that learns to transfer frozen VLMs representations under explicit supervision from the target fine-grained task. This intermediate teacher adaptively enhances discriminative visual cues, thereby producing compact and task-aligned knowledge that can be reliably distilled into lightweight models. Extensive evaluations on multiple FGVC benchmarks with diverse student architectures demonstrate that our method achieves respective performance gains of 12.63% and 8.34% on FGVC-Aircraft and CUB-200-2011 datasets, establishing DAIT as a principled paradigm for transferring from general-purpose VLMS to deployable fine-grained recognition models.
[CV-55] Vision-Language Model Based Multi-Expert Fusion for CT Image Classification
【速读】:该论文旨在解决多机构环境下新冠肺部CT图像分类中因源域偏移(source shift)、源域不平衡(source imbalance)及隐藏测试源身份(hidden test-source identities)导致的鲁棒性不足问题。解决方案的关键在于提出一种三阶段源感知多专家框架:首先构建基于原始CT与肺部提取CT融合的3D专家模型以增强肺部结构表征;其次设计两个MedSigLIP基础专家模块,分别实现切片级特征学习和跨切片上下文建模;最后训练一个源分类器预测测试扫描的潜在来源,并基于预测结果进行专家模型融合与投票决策。该方法通过层级化源感知建模与集成策略,在异构多源条件下显著提升了分类性能,验证了其在实际临床部署中的有效性。
链接: https://arxiv.org/abs/2603.15154
作者: Jianfa Bai,Kejin Lu,Runtian Yuan,Qingqiu Li,Jilan Xu,Junlin Hou,Yuejie Zhang,Rui Feng
机构: Fudan University (复旦大学); University of Oxford (牛津大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Robust detection of COVID-19 from chest CT remains challenging in multi-institutional settings due to substantial source shift, source imbalance, and hidden test-source identities. In this work, we propose a three-stage source-aware multi-expert framework for multi-source COVID-19 CT classification. First, we build a lung-aware 3D expert by combining original CT volumes and lung-extracted CT volumes for volumetric classification. Second, we develop two MedSigLIP-based experts: a slice-wise representation and probability learning module, and a Transformer-based inter-slice context modeling module for capturing cross-slice dependency. Third, we train a source classifier to predict the latent source identity of each test scan. By leveraging the predicted source information, we perform model fusion and voting based on different experts. On the validation set covering all four sources, the Stage 1 model achieves the best macro-F1 of 0.9711, ACC of 0.9712, and AUC of 0.9791. Stage 2a and Stage 2b achieve the best AUC scores of 0.9864 and 0.9854, respectively. Stage 3 source classifier reaches 0.9107 ACC and 0.9114 F1. These results demonstrate that source-aware expert modeling and hierarchical voting provide an effective solution for robust COVID-19 CT classification under heterogeneous multi-source conditions.
[CV-56] TextOVSR: Text-Guided Real-World Opera Video Super-Resolution
【速读】:该论文旨在解决经典戏曲视频在低质量成像和长期存储退化背景下,现有真实世界视频超分辨率(Real-World Video Super-Resolution, RWVSR)方法难以有效重建细节纹理的问题。核心挑战在于:一是真实退化建模困难,传统简化退化核无法准确捕捉噪声分布,而外部噪声采样易引入风格不匹配导致视觉伪影;二是现有方法缺乏高层语义引导,仅依赖退化图像特征难以恢复逼真且精细的纹理。解决方案的关键在于提出一种文本引导的双分支戏曲视频超分辨率网络(Text-guided Dual-Branch Opera Video Super-Resolution, TextOVSR),通过引入两类文本提示实现跨模态引导:其一为退化描述文本嵌入负分支以约束解空间,其二为内容描述文本结合所提出的文本增强判别器(Text-Enhanced Discriminator, TED)提供语义先验,辅助纹理重建;同时设计退化鲁棒特征融合模块(Degradation-Robust Feature Fusion, DRF)以促进多模态特征融合并抑制退化干扰。
链接: https://arxiv.org/abs/2603.15153
作者: Hua Chang,Xin Xu,Wei Liu,Jiayi Wu,Kui Jiang,Fei Ma,Qi Tian
机构: Wuhan University of Science and Technology (武汉科技大学); Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System (湖北省智能信息处理与实时工业系统重点实验室); Harbin Institute of Technology Zhengzhou Research Institute (哈尔滨工业大学郑州研究院); Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) (广东省人工智能与数字经济发展实验室(深圳)); Huawei Technologies Ltd. (华为技术有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Many classic opera videos exhibit poor visual quality due to the limitations of early filming equipment and long-term degradation during storage. Although real-world video super-resolution (RWVSR) has achieved significant advances in recent years, directly applying existing methods to degraded opera videos remains challenging. The difficulties are twofold. First, accurately modeling real-world degradations is complex: simplistic combinations of classical degradation kernels fail to capture the authentic noise distribution, while methods that extract real noise patches from external datasets are prone to style mismatches that introduce visual artifacts. Second, current RWVSR methods, which rely solely on degraded image features, struggle to reconstruct realistic and detailed textures due to a lack of high-level semantic guidance. To address these issues, we propose a Text-guided Dual-Branch Opera Video Super-Resolution (TextOVSR) network, which introduces two types of textual prompts to guide the super-resolution process. Specifically, degradation-descriptive text, derived from the degradation process, is incorporated into the negative branch to constrain the solution space. Simultaneously, content-descriptive text is incorporated into a positive branch and our proposed Text-Enhanced Discriminator (TED) to provide semantic guidance for enhanced texture reconstruction. Furthermore, we design a Degradation-Robust Feature Fusion (DRF) module to facilitate cross-modal feature fusion while suppressing degradation interference. Experiments on our OperaLQ benchmark show that TextOVSR outperforms state-of-the-art methods both qualitatively and quantitatively. The code is available at this https URL.
[CV-57] SNCE: Geometry-Aware Supervision for Scalable Discrete Image Generation
【速读】:该论文旨在解决大规模量化码本(VQ codebook)在离散图像生成模型训练中的优化难题,尤其是在模型规模和训练时长受限的情况下难以收敛的问题。其解决方案的关键在于提出一种新颖的训练目标——随机邻域交叉熵最小化(Stochastic Neighbor Cross Entropy Minimization, SNCE),该方法不再使用硬标签(one-hot target)进行监督,而是构建一个基于邻近token的软类别分布,其中每个token的概率与其码本嵌入与真实图像嵌入之间的距离成反比,从而引导模型在量化嵌入空间中学习语义上有意义的几何结构,显著提升收敛速度和生成质量。
链接: https://arxiv.org/abs/2603.15150
作者: Shufan Li,Jiuxiang Gu,Kangning Liu,Zhe Lin,Aditya Grover,Jason Kuen
机构: Adobe; UCLA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 4 figures
Abstract:Recent advancements in discrete image generation showed that scaling the VQ codebook size significantly improves reconstruction fidelity. However, training generative models with a large VQ codebook remains challenging, typically requiring larger model size and a longer training schedule. In this work, we propose Stochastic Neighbor Cross Entropy Minimization (SNCE), a novel training objective designed to address the optimization challenges of large-codebook discrete image generators. Instead of supervising the model with a hard one-hot target, SNCE constructs a soft categorical distribution over a set of neighboring tokens. The probability assigned to each token is proportional to the proximity between its code embedding and the ground-truth image embedding, encouraging the model to capture semantically meaningful geometric structure in the quantized embedding space. We conduct extensive experiments across class-conditional ImageNet-256 generation, large-scale text-to-image synthesis, and image editing tasks. Results show that SNCE significantly improves convergence speed and overall generation quality compared to standard cross-entropy objectives.
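SNCE 的软目标构造思路可以用几行代码示意:对每个码本 token,按其嵌入与真实图像嵌入的(平方)距离取 softmax,距离越近概率越高,再以该软分布做交叉熵。以下为最小示意,其中距离度量与温度参数均为假设,并非论文原实现:

```python
import numpy as np

def snce_soft_targets(codebook, gt_embedding, temperature=1.0):
    """对码本中每个 token 构造软目标分布:概率随该 token 的
    码本嵌入到真实嵌入的平方距离衰减(softmax(-d^2/tau))。"""
    d2 = np.sum((codebook - gt_embedding) ** 2, axis=1)  # (K,)
    logits = -d2 / temperature
    logits -= logits.max()                 # 数值稳定
    p = np.exp(logits)
    return p / p.sum()

def snce_loss(log_probs, soft_targets):
    """对软目标分布的交叉熵(替代 one-hot 目标)。"""
    return -np.sum(soft_targets * log_probs)

# 玩具码本:4 个二维码字;真实嵌入最接近第 0 个码字
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
gt = np.array([0.1, 0.0])
targets = snce_soft_targets(codebook, gt, temperature=0.5)

# 假设模型输出均匀分布的 log 概率
log_probs = np.log(np.full(4, 0.25))
loss = snce_loss(log_probs, targets)
```

可以看到,离真实嵌入最近的码字获得最高概率,而几何上遥远的码字几乎不贡献梯度,这正是"软监督编码量化空间几何结构"的直观含义。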
[CV-58] Context-Aware Sensor Modeling for Asynchronous Multi-Sensor Tracking in Stone Soup
【速读】:该论文旨在解决多传感器跟踪中因异步传感、部分覆盖及检测性能异质性导致的轨迹融合性能下降问题,尤其针对高频率传感器因重复非检测而侵蚀仅由低频传感器可见轨迹的现象。解决方案的关键在于提出DetectorContext这一抽象机制,将检测概率和杂波强度建模为状态相关函数,并在假设生成阶段动态评估,从而实现上下文感知的观测建模;该机制可无缝集成至现有概率跟踪器中,无需修改其更新方程,实验表明其能显著提升HOTA(Higher Order Tracking Accuracy)和GOSPA(Generalized Optimal Subpattern Assignment)指标,同时保持低虚警率。
链接: https://arxiv.org/abs/2603.15137
作者: Martin Vonheim Larsen,Kim Mathiassen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-sensor tracking in the real world involves asynchronous sensors with partial coverage and heterogeneous detection performance. Although probabilistic tracking methods permit detection probability and clutter intensity to depend on state and sensing context, many practical frameworks enforce globally uniform observability assumptions. Under multi-rate and partially overlapping sensing, this simplification causes repeated non-detections from high-rate sensors to erode tracks visible only to low-rate sensors, potentially degrading fusion performance. We introduce DetectorContext, an abstraction for the open-source multi-target tracking framework Stone Soup. DetectorContext exposes detection probability and clutter intensity as state-dependent functions evaluated during hypothesis formation. The abstraction integrates with existing probabilistic trackers without modifying their update equations. Experiments on asynchronous radar-lidar data demonstrate that context-aware modeling restores stable fusion and significantly improves HOTA and GOSPA performance without increasing false tracks.
[CV-59] WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation
【速读】:该论文旨在解决像素空间生成模型中因像素流形缺乏语义连续性而导致的最优传输路径纠缠问题,这种纠缠在轨迹交汇处引发严重冲突,从而导致次优解。解决方案的关键在于提出Waypoint Diffusion Transformers (WiT),通过预训练视觉模型投影出的中间语义路标点(waypoints)对连续向量场进行因子分解,将最优传输路径拆分为“先验到路标”与“路标到像素”两个阶段,从而有效解耦生成轨迹;具体而言,在迭代去噪过程中,轻量级生成器从当前噪声状态动态推断这些中间路标点,并通过Just-Pixel AdaLN机制持续条件化主扩散Transformer,引导演化至下一状态,最终输出RGB像素。
链接: https://arxiv.org/abs/2603.15132
作者: Hainuo Wang,Mingjia Li,Xiaojie Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While recent Flow Matching models avoid the reconstruction bottlenecks of latent autoencoders by operating directly in pixel space, the lack of semantic continuity in the pixel manifold severely intertwines optimal transport paths. This induces severe trajectory conflicts near intersections, yielding sub-optimal solutions. Rather than bypassing this issue via information-lossy latent representations, we directly untangle the pixel-space trajectories by proposing Waypoint Diffusion Transformers (WiT). WiT factorizes the continuous vector field via intermediate semantic waypoints projected from pre-trained vision models. It effectively disentangles the generation trajectories by breaking the optimal transport into prior-to-waypoint and waypoint-to-pixel segments. Specifically, during the iterative denoising process, a lightweight generator dynamically infers these intermediate waypoints from the current noisy state. They then continuously condition the primary diffusion transformer via the Just-Pixel AdaLN mechanism, steering the evolution towards the next state, ultimately yielding the final RGB pixels. Evaluated on ImageNet 256x256, WiT beats strong pixel-space baselines, accelerating JiT training convergence by 2.2x. Code will be publicly released at this https URL.
[CV-60] Low-light Image Enhancement with Retinex Decomposition in Latent Space
【速读】:该论文旨在解决现有低光照图像增强方法在准确分解反射率(reflectance)与光照(illumination)成分方面存在的局限性问题。其解决方案的关键在于提出了一种两阶段的Retinex-Guided Transformer(RGT)模型:首先通过潜空间分解策略,结合对数变换与1像素偏移操作,将原本乘性关系转换为加性形式,从而提升分解的稳定性和精度;其次设计了一个U形结构的组件精炼模块,引入引导融合Transformer块(guidance fusion transformer block),有效优化光照分布并保留纹理细节,实现从低光输入到正常光照图像的高质量转换。
链接: https://arxiv.org/abs/2603.15131
作者: Bolun Zheng,Qingshan Lei,Quan Chen,Qianyu Zhang,Kainan Yu,Xu Jia,Lingyu Zhu
机构: Hangzhou Dianzi University (杭州电子科技大学); Jiaxing University (嘉兴大学); Dalian University of Technology (大连理工大学); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submit to IEEE TIP
Abstract:Retinex theory provides a principled foundation for low-light image enhancement, inspiring numerous learning-based methods that integrate its principles. However, existing methods exhibit limitations in accurately decomposing reflectance and illumination components. To address this, we propose a Retinex-Guided Transformer (RGT) model, a two-stage model consisting of decomposition and enhancement phases. First, we propose a latent space decomposition strategy to separate reflectance and illumination components. By incorporating the log transformation and 1-pixel offset, we convert the intrinsically multiplicative relationship into an additive formulation, enhancing decomposition stability and precision. Subsequently, we construct a U-shaped component refiner incorporating the proposed guidance fusion transformer block. The component refiner refines the reflectance component to preserve texture details and optimize illumination distribution, effectively transforming low-light inputs to normal-light counterparts. Experimental evaluations across four benchmark datasets validate that our method achieves competitive performance in low-light enhancement and a more stable training process.
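上述“对数变换 + 1 像素偏移”把乘性的 Retinex 关系 I = R∘L 转为加性形式。以下 numpy 片段仅演示这一数学关系(合成数据,数值均为假设,与论文网络结构无关):

```python
import numpy as np

# 合成 Retinex 模型:图像 = 反射率 * 光照
rng = np.random.default_rng(0)
reflectance = rng.uniform(0.2, 1.0, size=(4, 4))
illumination = rng.uniform(0.05, 0.3, size=(4, 4))   # 低光照
image = reflectance * illumination

# 乘性 -> 加性:+1 偏移保证 log 在 0 处有定义
log_image = np.log(image + 1.0)

# 在对数域中两分量相加(不带偏移时为严格恒等式):
# log(R*L) = log(R) + log(L)
additive = np.log(reflectance) + np.log(illumination)
recovered = np.exp(additive)
```

在加性域中分解反射率与光照只需做减法,这比直接在乘性域分解更稳定,也是该两阶段模型第一阶段的出发点。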
[CV-61] Next-Frame Decoding for Ultra-Low-Bitrate Image Compression with Video Diffusion Priors
【速读】:该论文旨在解决超低比特率图像压缩(Ultra-Low-Bitrate Image Compression, ULB-IC)中的性能瓶颈问题,特别是如何在极低码率下同时保证重建图像的感知质量和解码效率。传统基于图像扩散模型(Image Diffusion Model)的方法通常直接从噪声生成目标图像,缺乏结构引导,导致重建质量受限且解码速度慢。其解决方案的关键在于引入一个显式的中间状态——紧凑的锚帧(anchor frame),该锚帧保留场景几何与语义布局但舍弃高频细节,从而作为生成过程的起点;随后利用预训练视频扩散模型(Video Diffusion Model, VDM)作为时序先验,将解码过程建模为从锚帧到目标图像的虚拟时间演化过程,实现从可见、语义忠实的初始状态逐步生成高质量图像。这一机制显著提升了感知保真度和重建 realism,并带来高达 5 倍的解码加速。
链接: https://arxiv.org/abs/2603.15129
作者: Yunuo Chen,Chuqin Zhou,Jiangchuan Li,Xiaoyue Ling,Bing He,Jincheng Dai,Li Song,Guo Lu
机构: Shanghai Jiao Tong University (上海交通大学); Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present a novel paradigm for ultra-low-bitrate image compression (ULB-IC) that exploits the “temporal” evolution in generative image compression. Specifically, we define an explicit intermediate state during decoding: a compact anchor frame, which preserves the scene geometry and semantic layout while discarding high-frequency details. We then reinterpret generative decoding as a virtual temporal transition from this anchor to the final reconstructed image. To model this progression, we leverage a pretrained video diffusion model (VDM) as temporal priors: the anchor frame serves as the initial frame and the original image as the target frame, transforming the decoding process into a next-frame prediction task. In contrast to image diffusion-based ULB-IC models, our decoding proceeds from a visible, semantically faithful anchor, which improves both fidelity and realism for perceptual image compression. Extensive experiments demonstrate that our method achieves superior objective and subjective performance. On the CLIC2020 test set, our method achieves over 50% bitrate savings across LPIPS, DISTS, FID, and KID compared to DiffC, while also delivering a significant decoding speedup of up to 5×. Code will be released later.
[CV-62] A Novel Camera-to-Robot Calibration Method for Vision-Based Floor Measurements
【速读】:该论文旨在解决地面观测移动机器人中相机与激光跟踪仪(laser tracker)之间的手眼标定(hand-eye calibration)问题,以实现高精度的多模态测量融合。解决方案的关键在于设计了一种参考板(referencing plate),该板集成激光跟踪仪用于位姿获取的反光球凹槽(reflector nests)和机器人相机可识别的标定靶标(camera calibration target),通过分步估计参考板位姿、参考板与相机位姿以及机器人位姿,最终计算出机器人到相机的变换矩阵,从而实现亚毫米级重复性精度。
链接: https://arxiv.org/abs/2603.15126
作者: Jan Andre Rudolph,Dennis Haitz,Markus Ulrich
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages; accepted for publication in the ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Abstract:A novel hand-eye calibration method for ground-observing mobile robots is proposed. While cameras on mobile robots are common, they are rarely used for ground-observing measurement tasks. Laser trackers are increasingly used in robotics for precise localization. A referencing plate is designed to combine the two measurement modalities of laser-tracker 3D metrology and camera-based 2D imaging. It incorporates reflector nests for pose acquisition using a laser tracker and a camera calibration target that is observed by the robot-mounted camera. The procedure comprises estimating the plate pose, the plate-camera pose, and the robot pose, followed by computing the robot-camera transformation. Experiments indicate sub-millimeter repeatability.
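流程最后一步“由各位姿链式计算机器人-相机变换”可以用 4×4 齐次矩阵示意。以下代码中的位姿数值均为假设,仅演示变换的组合方式:

```python
import numpy as np

def make_pose(R, t):
    """由旋转 R (3x3) 与平移 t (3,) 组装 4x4 齐次变换。"""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# 三个位姿均在同一(激光跟踪仪)世界坐标系下测得 -- 数值为示意
T_world_robot = make_pose(rot_z(0.3), [1.0, 0.5, 0.0])   # 机器人位姿
T_world_plate = make_pose(rot_z(-0.1), [2.0, 0.2, 0.0])  # 参考板位姿
T_plate_cam = make_pose(np.eye(3), [0.0, 0.0, 0.15])     # 参考板-相机位姿

# 链式求解:T_robot_cam = inv(T_world_robot) @ T_world_plate @ T_plate_cam
T_robot_cam = np.linalg.inv(T_world_robot) @ T_world_plate @ T_plate_cam
```

组合后的 T_robot_cam 即手眼标定结果;论文中亚毫米级重复性指的正是多次独立求解该矩阵时的一致性。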
[CV-63] A Tutorial on ALOS2 SAR Utilization: Dataset Preparation Self-Supervised Pretraining and Semantic Segmentation
【速读】:该论文旨在解决合成孔径雷达(SAR)图像在自监督预训练中因高噪声水平和语义标注困难导致的性能瓶颈问题,以及区域特定模型开发中因土地覆盖分布不均衡(如水体、森林或沙漠占主导)所引发的偏差问题。其关键解决方案是提出两种改进的掩码自动编码器方法:SAR-W-MixMAE 和 SAR-W-SimMIM,二者均引入了针对 SAR 图像强度特性的加权损失函数,以降低斑点噪声和极端强度值对预训练的影响;同时构建了基于ALOS-2单通道HH极化SAR影像的日本区域数据集,用于训练视觉Transformer架构的自编码器,并通过任务专用解码器进行语义分割微调,从而显著提升下游任务性能,优于随机初始化训练结果。
链接: https://arxiv.org/abs/2603.15119
作者: Nevrez Imamoglu,Ali Caglayan,Toru Kouyama
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 8 figures, 1 Table
Abstract:Masked auto-encoders (MAE) and related approaches have shown promise for satellite imagery, but their application to synthetic aperture radar (SAR) remains limited due to challenges in semantic labeling and high noise levels. Building on our prior work with SAR-W-MixMAE, which adds SAR-specific intensity-weighted loss to standard MixMAE for pretraining, we also introduce SAR-W-SimMIM; a weighted variant of SimMIM applied to ALOS-2 single-channel SAR imagery. This method aims to reduce the impact of speckle and extreme intensity values during self-supervised pretraining. We evaluate its effect on semantic segmentation compared to our previous trial with SAR-W-MixMAE and random initialization, observing notable improvements. In addition, pretraining and fine-tuning models on satellite imagery pose unique challenges, particularly when developing region-specific models. Imbalanced land cover distributions such as dominant water, forest, or desert areas can introduce bias, affecting both pretraining and downstream tasks like land cover segmentation. To address this, we constructed a SAR dataset using ALOS-2 single-channel (HH polarization) imagery focused on the Japan region, marking the initial phase toward a national-scale foundation model. This dataset was used to pretrain a vision transformer-based autoencoder, with the resulting encoder fine-tuned for semantic segmentation using a task-specific decoder. Initial results demonstrate significant performance improvements compared to training from scratch with random initialization. In summary, this work provides a guide to processing and preparing ALOS-2 observations into a dataset that can be leveraged for self-supervised pretraining and for fine-tuning downstream tasks such as semantic segmentation.
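SAR-W 系列的核心是按像素强度对掩码重建损失加权,以降低斑点噪声与极端强度值的影响。以下是一个假设性的加权掩码 MSE(权重函数 1/(1+|x|) 为示意选择,并非论文原式):

```python
import numpy as np

def weighted_masked_mse(pred, target, mask, eps=1e-6):
    """带强度权重的掩码 MSE:权重取 1/(1+|强度|)(示意),
    使极端亮度(如斑点噪声峰值)对重建损失的贡献更小。
    mask 中为 1 的像素参与损失(对应被掩码重建的 patch)。"""
    w = 1.0 / (1.0 + np.abs(target))
    err = w * (pred - target) ** 2 * mask
    return err.sum() / (mask.sum() + eps)

# 两个像素:一个低强度(0.1)、一个极端高强度(10.0)
target = np.array([[0.1, 10.0]])
mask = np.ones_like(target)
pred_low = target + np.array([[1.0, 0.0]])   # 同样大小的误差落在低强度像素
pred_high = target + np.array([[0.0, 1.0]])  # 同样大小的误差落在高强度像素
```

在此权重下,落在极端强度像素上的误差被明显降权,这对应论文中“降低斑点与极端值对预训练的影响”的设计动机。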
[CV-64] VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents
【速读】:该论文旨在解决当前多模态基础模型在从政府表格文档中提取结构化数据时缺乏系统性评估基准的问题。现有基准通常仅基于单一输入模态,无法量化不同输入格式(如文本、布局保留文本、图像或二者结合)对抽取准确率的影响,导致模型性能评估存在偏差且难以优化。解决方案的关键在于提出VAREX(VARied-schema EXtraction)基准,其核心创新是采用“逆向标注”(Reverse Annotation)流水线,通过程序化填充PDF模板生成带确定性真实标签的数据集,并辅以三阶段质量保障机制确保标签可靠性;同时,每个文档提供四种受控输入模态(纯文本、布局保留文本、文档图像及图文组合),从而支持对输入表示方式影响的系统性消融实验,为模型设计与部署提供可解释的性能边界。
链接: https://arxiv.org/abs/2603.15118
作者: Udi Barzelay,Ophir Azulai,Inbar Shapira,Idan Friedman,Foad Abo Dahood,Madison Lee,Abraham Daniels
机构: IBM Research (IBM研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 4 figures, 4 tables, plus 12-page supplementary. Dataset: this https URL Code: this https URL
Abstract:We introduce VAREX (VARied-schema EXtraction), a benchmark for evaluating multimodal foundation models on structured data extraction from government forms. VAREX employs a Reverse Annotation pipeline that programmatically fills PDF templates with synthetic values, producing deterministic ground truth validated through three-phase quality assurance. The benchmark comprises 1,777 documents with 1,771 unique schemas across three structural categories, each provided in four input modalities: plain text, layout-preserving text (whitespace-aligned to approximate column positions), document image, or both text and image combined. Unlike existing benchmarks that evaluate from a single input representation, VAREX provides four controlled modalities per document, enabling systematic ablation of how input format affects extraction accuracy – a capability absent from prior benchmarks. We evaluate 20 models from frontier proprietary models to small open models, with particular attention to models ≤4B parameters suitable for cost-sensitive and latency-constrained deployment. Results reveal that (1) below 4B parameters, structured output compliance – not extraction capability – is a dominant bottleneck; in particular, schema echo (models producing schema-conforming structure instead of extracted values) depresses scores by 45-65 pp (percentage points) in affected models; (2) extraction-specific fine-tuning at 2B yields +81 pp gains, demonstrating that the instruction-following deficit is addressable without scale; (3) layout-preserving text provides the largest accuracy gain (+3-18 pp), exceeding pixel-level visual cues; and (4) the benchmark most effectively discriminates models in the 60-95% accuracy band. Dataset and evaluation code are publicly available.
[CV-65] Sampling-guided exploration of active feature selection policies
【速读】:该论文旨在解决机器学习预测模型中特征选择的优化问题,特别是在特征获取成本与模型性能之间权衡的挑战下,如何实现更高效、更具针对性的特征采集策略。传统方法通常采用全局特征选择,忽略了不同实例对特征的需求差异;而本文提出了一种基于强化学习的序贯决策框架,将问题建模为状态维度动态变化的马尔可夫决策过程(Markov Decision Process, MDP),从而避免了数据插补,并能够根据已获取的实例特定信息动态推荐下一最优模态(modality)进行采集。其解决方案的关键在于:1)引入启发式策略以扩展至大规模数据集,聚焦于最有潜力的特征组合;2)设计后拟合正则化策略,减少决策序列中不同的特征组合数量,从而生成结构紧凑的决策路径,最终在多个二分类数据集上实现了优于当前最先进方法的准确率和策略复杂度表现。
链接: https://arxiv.org/abs/2603.15110
作者: Gabriel Bernardino,Anders Jonsson,Patrick Clarysse,Nicolas Duchateau
机构: Universitat Pompeu Fabra (庞培法布拉大学); Université Claude Bernard Lyon 1 (里昂第一大学); INSA-Lyon (里昂国立应用科学学院); CNRS (法国国家科学研究中心); Inserm (法国国家健康与医学研究院); CREATIS UMR 5220 (CREATIS实验室); U1294 (法国国家健康与医学研究院研究单元1294); Institut Universitaire de France (法兰西大学研究院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Determining the most appropriate features for machine learning predictive models is challenging regarding performance and feature acquisition costs. In particular, global feature choice is limited given that some features will only benefit a subset of instances. In previous work, we proposed a reinforcement learning approach to sequentially recommend which modality to acquire next to reach the best information/cost ratio, based on the instance-specific information already acquired. We formulated the problem as a Markov Decision Process where the state’s dimensionality changes during the episode, avoiding data imputation, contrary to existing works. However, this only allowed processing a small number of features, as all possible combinations of features were considered. Here, we address these limitations with two contributions: 1) we expand our framework to larger datasets with a heuristic-based strategy that focuses on the most promising feature combinations, and 2) we introduce a post-fit regularisation strategy that reduces the number of different feature combinations, leading to compact sequences of decisions. We tested our method on four binary classification datasets (one involving high-dimensional variables), the largest of which had 56 features and 4500 samples. We obtained better performance than state-of-the-art methods, both in terms of accuracy and policy complexity.
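序贯特征获取可以先用一个贪心基线直观理解:按“信息增益/获取成本”比值排序,在预算内逐个获取特征。以下示意中的增益估计是静态假设,用来替代论文中逐实例、动态决策的强化学习策略:

```python
def greedy_feature_policy(costs, gains, budget):
    """按增益/成本比贪心获取特征,直到预算耗尽。

    这是论文所学 RL 策略的一个静态替身:gains 为假设的
    每个特征的"信息价值"估计,真实方法会依据已获取的
    实例特定信息动态更新这一估计。"""
    order = sorted(gains, key=lambda f: gains[f] / costs[f], reverse=True)
    acquired, spent = [], 0.0
    for f in order:
        if spent + costs[f] <= budget:
            acquired.append(f)
            spent += costs[f]
    return acquired, spent

# 假设的特征获取成本与增益(示意数值)
costs = {"blood_test": 2.0, "mri": 10.0, "questionnaire": 1.0}
gains = {"blood_test": 3.0, "mri": 8.0, "questionnaire": 1.0}
acquired, spent = greedy_feature_policy(costs, gains, budget=4.0)
```

与这种全局贪心不同,论文的 MDP 形式允许状态维度随已获取特征增长,从而对不同实例给出不同的获取序列。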
[CV-66] PAKAN: Pixel Adaptive Kolmogorov-Arnold Network Modules for Pansharpening
【速读】:该论文旨在解决传统深度神经网络在全色锐化(pansharpening)任务中因使用静态激活函数而导致无法动态建模复杂非线性空间-光谱映射关系的问题。其解决方案的关键在于提出像素自适应的科尔莫戈罗夫-阿诺德网络(Pixel Adaptive Kolmogorov-Arnold Network, PAKAN),通过设计两种自适应变体——二维自适应KAN(2D Adaptive KAN)在空间维度上生成样条求和权重,以及一维自适应KAN(1D Adaptive KAN)在光谱通道上生成权重,并将其集成至PAKAN 2to1(特征融合模块)与PAKAN 1to1(特征精炼模块),从而实现对输入图像像素级动态激活函数的学习与调整,显著提升全色锐化性能。
链接: https://arxiv.org/abs/2603.15109
作者: Haoyu Zhang,Haojing Chen,Zhen Zhong,Liangjian Deng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages,5 figures,4 tables
Abstract:Pansharpening aims to fuse high-resolution spatial details from panchromatic images with the rich spectral information of multispectral images. Existing deep neural networks for this task typically rely on static activation functions, which limit their ability to dynamically model the complex, non-linear mappings required for optimal spatial-spectral fusion. While the recently introduced Kolmogorov-Arnold Network (KAN) utilizes learnable activation functions, traditional KANs lack dynamic adaptability during inference. To address this limitation, we propose a Pixel Adaptive Kolmogorov-Arnold Network framework. Starting from KAN, we design two adaptive variants: a 2D Adaptive KAN that generates spline summation weights across spatial dimensions and a 1D Adaptive KAN that generates them across spectral channels. These two components are then assembled into PAKAN 2to1 for feature fusion and PAKAN 1to1 for feature refinement. Extensive experiments demonstrate that our proposed modules significantly enhance network performance, proving the effectiveness and superiority of pixel-adaptive activation in pansharpening tasks.
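像素自适应激活的关键是样条求和权重 w_k(x) 随输入逐点生成,即 y(x) = Σ_k w_k(x)·B_k(x)。以下一维示意中,基函数(高斯)与权重生成方式(按到各基中心的距离做 softmax)均为假设,并非论文原结构:

```python
import numpy as np

def gaussian_bases(x, centers, width=0.5):
    """在每个输入点上计算 K 个高斯基函数 -> 形状 (N, K)。"""
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))

def pixel_adaptive_weights(x, centers):
    """玩具版的输入条件化权重生成器(假设):
    对输入到各基中心的负距离做 softmax。"""
    d = -np.abs(x[:, None] - centers[None, :])
    e = np.exp(d - d.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def adaptive_kan_1d(x, centers):
    """y(x) = sum_k w_k(x) * B_k(x):求和权重逐输入点变化。"""
    return np.sum(pixel_adaptive_weights(x, centers)
                  * gaussian_bases(x, centers), axis=1)

centers = np.linspace(-1.0, 1.0, 5)
x = np.array([-0.5, 0.0, 0.5])
y = adaptive_kan_1d(x, centers)
```

传统 KAN 的权重在推理时固定;这里权重由输入本身推断,对应论文在空间维(2D 变体)或光谱通道(1D 变体)上生成权重的思路。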
[CV-67] Learning from Limited and Incomplete Data: A Multimodal Framework for Predicting Pathological Response in NSCLC
【速读】:该论文旨在解决非小细胞肺癌(Non-small Cell Lung Cancer, NSCLC)新辅助治疗后主要病理缓解(Major Pathological Response, pR)的术前预测难题,尤其针对真实临床环境中数据有限和临床变量不完整的问题。解决方案的关键在于提出一种多模态深度学习框架,通过融合基于基础模型(foundation model)的CT影像特征提取与缺失感知(missing-aware)架构来处理临床变量,从而在小样本条件下实现稳健学习,并显式建模缺失信息,避免传统插补策略;同时采用加权融合机制整合影像与临床模态的互补贡献,显著优于单一模态基线模型,体现了异构数据融合与缺失感知设计在实际临床场景中的有效性。
链接: https://arxiv.org/abs/2603.15100
作者: Alice Natalina Caragliano,Giulia Farina,Fatih Aksu,Camillo Maria Caruso,Claudia Tacconi,Carlo Greco,Lorenzo Nibid,Edy Ippolito,Michele Fiore,Giuseppe Perrone,Sara Ramella,Paolo Soda,Valerio Guarrasi
机构: Fondazione Policlinico Universitario Campus Bio-Medico di Roma (罗马大学生物医学校区基金会医院); Università Campus Bio-Medico di Roma (罗马大学生物医学校区大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Major pathological response (pR) following neoadjuvant therapy is a clinically meaningful endpoint in non-small cell lung cancer, strongly associated with improved survival. However, accurate preoperative prediction of pR remains challenging, particularly in real-world clinical settings characterized by limited data availability and incomplete clinical profiles. In this study, we propose a multimodal deep learning framework designed to address these constraints by integrating foundation model-based CT feature extraction with a missing-aware architecture for clinical variables. This approach enables robust learning from small cohorts while explicitly modeling missing clinical information, without relying on conventional imputation strategies. A weighted fusion mechanism is employed to leverage the complementary contributions of imaging and clinical modalities, yielding a multimodal model that consistently outperforms both unimodal imaging and clinical baselines. These findings underscore the added value of integrating heterogeneous data sources and highlight the potential of multimodal, missing-aware systems to support pR prediction under realistic clinical conditions.
[CV-68] The Good the Better and the Best: Improving the Discriminability of Face Embeddings through Attribute-aware Learning
【速读】:该论文旨在解决人脸识别在年龄、姿态和遮挡等大变化条件下性能不稳定的问题。现有方法通常依赖于固定且异质的面部属性集进行辅助监督,但这种做法隐含假设所有属性对身份识别具有同等重要性,这并不合理,因为不同属性的判别能力各异,某些属性甚至可能引入有害偏差。解决方案的关键在于提出一种属性感知的人脸识别架构,通过联合使用身份类别标签、与身份相关属性以及非身份相关属性来监督面部嵌入的学习;其中,面部属性被组织为可解释的分组,从而能够以人类可理解的方式分解并分析各属性的贡献。实验表明,仅使用身份相关属性子集比使用更广泛的属性集合表现更优,且显式地让嵌入模型去除非身份相关属性的信息能进一步提升性能,同时该方法还可作为诊断工具评估识别器的信任度。
链接: https://arxiv.org/abs/2603.15062
作者: Ana Dias,João Ribeiro Pinto,Hugo Proença,João C. Neves
机构: University of Beira Interior (贝拉内斯特大学); Instituto de Telecomunicações (电信研究所); NOVA LINCS (NOVA LINCS); Amadeus (阿玛迪乌斯)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IWBF 2026
Abstract:Despite recent advances in face recognition, robust performance remains challenging under large variations in age, pose, and occlusion. A common strategy to address these issues is to guide representation learning with auxiliary supervision from facial attributes, encouraging the visual encoder to focus on identity-relevant regions. However, existing approaches typically rely on heterogeneous and fixed sets of attributes, implicitly assuming equal relevance across attributes. This assumption is suboptimal, as different attributes exhibit varying discriminative power for identity recognition, and some may even introduce harmful biases. In this paper, we propose an attribute-aware face recognition architecture that supervises the learning of facial embeddings using identity class labels, identity-relevant facial attributes, and non-identity-related attributes. Facial attributes are organized into interpretable groups, making it possible to decompose and analyze their individual contributions in a human-understandable manner. Experiments on standard face verification benchmarks demonstrate that joint learning of identity and facial attributes improves the discriminability of face embeddings with two major conclusions: (i) using identity-relevant subsets of facial attributes consistently outperforms supervision with a broader attribute set, and (ii) explicitly forcing embeddings to unlearn non-identity-related attributes yields further performance gains compared to leaving such attributes unsupervised. Additionally, our method serves as a diagnostic tool for assessing the trustworthiness of face recognition encoders by allowing for the measurement of accuracy gains with suppression of non-identity-relevant attributes, with such gains suggesting shortcut learning from redundant attributes associated with each identity.
[CV-69] SRL-MAD: Structured Residual Latents for One-Class Morphing Attack Detection
【速读】:该论文旨在解决**单类人脸伪造攻击检测(one-class morphing attack detection, MAD)中因依赖已知攻击标签而导致泛化能力受限的问题,尤其针对未见过的伪造攻击类型难以识别的挑战。其解决方案的关键在于提出一种基于结构化残差傅里叶表示(Structured Residual Fourier Representations, SRL-MAD)**的新方法:首先通过抑制图像特异性频谱趋势的残差频率图构建基础特征;其次采用环状结构保留傅里叶域的二维组织,并以可学习的环状谱投影替代传统的方位平均;最后引入频域先验知识,将频谱证据划分为低、中、高频带并建模跨频带交互,从而增强对伪造痕迹的敏感性。该设计直接映射至用于评分的潜在空间,避免依赖重建误差,显著提升了开放集场景下的检测性能。
链接: https://arxiv.org/abs/2603.15050
作者: Diogo J. Paulo,Hugo Proença,João C. Neves
机构: University of Beira Interior (贝拉内里大学); Instituto de Telecomunicações (电信研究所); NOVA LINCS (NOVA LINCS)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IWBF 2026
Abstract:Face morphing attacks represent a significant threat to biometric systems as they allow multiple identities to be combined into a single face. While supervised morphing attack detection (MAD) methods have shown promising performance, their reliance on attack-labeled data limits generalization to unseen morphing attacks. This has motivated increasing interest in one-class MAD, where models are trained exclusively on bona fide samples and are expected to detect unseen attacks as deviations from the normal facial structure. In this context, we introduce SRL-MAD, a one-class single-image MAD that uses structured residual Fourier representations for open-set morphing attack detection. Starting from a residual frequency map that suppresses image-specific spectral trends, we preserve the two-dimensional organization of the Fourier domain through a ring-based representation and replace azimuthal averaging with a learnable ring-wise spectral projection. To further encode domain knowledge about where morphing artifacts arise, we impose a frequency-informed inductive bias by organizing spectral evidence into low, mid, and high-frequency bands and learning cross-band interactions. These structured spectral features are mapped into a latent space designed for direct scoring, avoiding the reliance on reconstruction errors. Extensive evaluation on FERET-Morph, FRLL-Morph, and MorDIFF demonstrates that SRL-MAD consistently outperforms recent one-class and supervised MAD models. Overall, our results show that learning frequency-aware projections provides a more discriminative alternative to azimuthal spectral summarization for one-class morphing attack detection.
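环状谱表示可以从“按半径对二维频谱幅值分箱”这一步直观理解;论文进一步用可学习的环状谱投影替代环内的简单统计。以下示意中的环数与环内统计方式(均值)均为假设:

```python
import numpy as np

def ring_spectrum(image, n_rings=8):
    """把中心化的 2D FFT 幅值按半径分入 n_rings 个环,
    返回每环的平均幅值:保留粗粒度的径向结构,
    而不是把整个频谱直接做方位平均压成一维。"""
    F = np.fft.fftshift(np.fft.fft2(image))
    mag = np.abs(F)
    h, w = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    edges = np.linspace(0.0, r.max() + 1e-9, n_rings + 1)
    rings = np.empty(n_rings)
    for i in range(n_rings):
        m = (r >= edges[i]) & (r < edges[i + 1])
        rings[i] = mag[m].mean() if m.any() else 0.0
    return rings

img = np.random.default_rng(0).normal(size=(32, 32))
feat = ring_spectrum(img)
```

在此基础上,论文将各环进一步划入低/中/高频带并学习跨频带交互,由此得到用于单类评分的结构化谱特征。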
[CV-70] GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents CVPR2026
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在中文移动图形用户界面(GUI)代理能力评估中存在的重要空白问题:现有基准测试 largely 英文导向,无法反映中文移动生态的语言特征与交互习惯,且缺乏从感知到执行的全流程细粒度评估框架。解决方案的关键在于提出 GUI-CEval——首个完全基于物理设备环境构建的中文移动 GUI 代理综合评测基准,其核心创新包括:(1) 覆盖201个主流应用、四种设备类型,确保场景真实性;(2) 设计两级结构,分别评估原子能力(如感知、规划、反思、执行、评估)与实际应用性能;(3) 通过多阶段人工采集与验证保障数据的真实性与可复现性。实验证明,尽管部分模型如 Qwen2.5-VL 和 UI-TARS 表现良好,但多数模型在反思决策和动作后自评方面仍存在显著短板,凸显了该基准在诊断模型能力缺陷与推动中文移动 GUI 代理发展中的关键价值。
链接: https://arxiv.org/abs/2603.15039
作者: Yang Li,Yuchen Liu,Haoyu Lu,Zhiqiang Xia,Hongzhen Wang,Kaiyang Han,Changpeng Yang,Jinyang Wu,Jiaming Xu,Runyu Shi,Ying Huang
机构: Xiaomi Corporation (小米公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by CVPR 2026
Abstract:Recent progress in Multimodal Large Language Models (MLLMs) has enabled mobile GUI agents capable of visual perception, cross-modal reasoning, and interactive control. However, existing benchmarks are largely English-centric and fail to capture the linguistic and interaction characteristics of the Chinese mobile ecosystem. They also focus on isolated skills such as GUI grounding or offline agent, lacking a unified and fine-grained framework to assess the full capability chain from perception to execution. To address this gap, we introduce GUI-CEval, the first comprehensive benchmark for Chinese mobile GUI agents, built entirely on physical device environments. GUI-CEval spans 201 mainstream apps across four device types and adopts a two-level structure that evaluates both atomic abilities and realistic application-level performance along five dimensions: perception, planning, reflection, execution, and evaluation. All data are collected and verified through multi-stage manual processes to ensure authenticity and reproducibility. Extensive experiments on 20 representative MLLMs and multi-agent systems show that while models such as Qwen2.5-VL and UI-TARS perform competitively, most MLLMs still exhibit clear weaknesses in reflective decision-making and post-action self-evaluation, limiting their reliability in real-world interactions. We hope GUI-CEval provides a comprehensive and interpretable benchmark to guide capability diagnosis and advance the development of Chinese mobile GUI agents.
[CV-71] Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods CVPR2026
【速读】:该论文旨在解决合成视频检测中面临的两大核心问题:一是现有基于图像的检测方法因仅逐帧处理而忽略时序动态信息,导致检测性能受限;二是监督式视频检测模型在面对未见过的生成器(generator)时泛化能力差,难以适应快速涌现的新一代生成模型。解决方案的关键在于提出一种无需训练、模型无关的零样本(zero-shot)检测框架STALL,其通过联合建模空间和时间证据,在概率框架下提供基于似然的评分机制,从而实现对合成视频的可靠识别。该方法不依赖于特定生成模型或合成数据,而是基于真实视频数据的统计特性进行判别,具备理论依据和良好的实际效果。
链接: https://arxiv.org/abs/2603.15026
作者: Omer Ben Hayun,Roy Betser,Meir Yossef Levi,Levi Kassel,Guy Gilboa
机构: Technion – Israel Institute of Technology (以色列理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to CVPR 2026
Abstract:Following major advances in text and image generation, the video domain has surged, producing highly realistic and controllable sequences. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors generalize poorly to unseen generators, a critical drawback given the rapid emergence of new models. These challenges motivate zero-shot approaches, which avoid synthetic data and instead score content against real-data statistics, enabling training-free, model-agnostic detection. We introduce \emphSTALL, a simple, training-free, theoretically justified detector that provides likelihood-based scoring for videos, jointly modeling spatial and temporal evidence within a probabilistic framework. We evaluate STALL on two public benchmarks and introduce ComGenVid, a new benchmark with state-of-the-art generative models. STALL consistently outperforms prior image- and video-based baselines. Code and data are available at this https URL.
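零样本检测的共同思路是:仅用真实数据拟合统计量,再对待测内容给出似然评分,空间与时间证据联合打分。以下以对角高斯为例示意(特征、评分形式均为假设,并非 STALL 原式):

```python
import numpy as np

def fit_gaussian(feats):
    """对真实数据特征 (N, D) 拟合对角高斯,返回 (均值, 方差)。"""
    return feats.mean(axis=0), feats.var(axis=0) + 1e-6

def log_likelihood(x, mu, var):
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def joint_score(spatial_feat, temporal_feat, spatial_stats, temporal_stats):
    """联合评分:空间与时间两路对数似然之和;分数低者判为合成。"""
    return (log_likelihood(spatial_feat, *spatial_stats)
            + log_likelihood(temporal_feat, *temporal_stats))

rng = np.random.default_rng(0)
# 真实数据的空间/时间特征(示意:标准正态)
real_sp = rng.normal(0.0, 1.0, size=(500, 4))
real_tp = rng.normal(0.0, 1.0, size=(500, 4))
sp_stats, tp_stats = fit_gaussian(real_sp), fit_gaussian(real_tp)

real_clip = (rng.normal(0.0, 1.0, 4), rng.normal(0.0, 1.0, 4))
fake_clip = (rng.normal(5.0, 1.0, 4), rng.normal(5.0, 1.0, 4))  # 偏离真实流形
s_real = joint_score(*real_clip, sp_stats, tp_stats)
s_fake = joint_score(*fake_clip, sp_stats, tp_stats)
```

整个流程不需要任何合成样本或训练:阈值化该似然分数即可得到“无需训练、与生成器无关”的检测器,这也是此类零样本方法对新生成模型保持鲁棒的原因。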
[CV-72] One CT Unified Model Training Framework to Rule All Scanning Protocols
【速读】:该论文旨在解决非理想测量计算机断层成像(Non-ideal measurement computed tomography, NICT)中因扫描协议差异导致的特征空间离散子流形问题,这一问题使得现有无监督方法因假设噪声同质而泛化能力差,甚至引发模型坍塌。其解决方案的关键在于提出不确定性引导的流形平滑(Uncertainty-Guided Manifold Smoothing, UMS)框架:通过一个分类器识别不同子流形并预测不确定性得分,指导生成跨流形的多样化样本,从而填补子流形间的空隙;同时设计全局与子流形驱动的动态架构,实现对共享特征与域特定特征的自适应建模,显著提升重建性能和泛化能力。
链接: https://arxiv.org/abs/2603.15025
作者: Fengzhi Xu,Ziyuan Yang,Zexin Lu,Yingyu Chen,Fenglei Fan,Hongming Shan,Yi Zhang
机构: Sichuan University (四川大学); City University of Hong Kong (香港城市大学); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Non-ideal measurement computed tomography (NICT), which lowers radiation at the cost of image quality, is expanding the clinical use of CT. Although unified models have shown promise in NICT enhancement, most methods require paired data, which is an impractical demand due to inevitable organ motion. Unsupervised approaches attempt to overcome this limitation, but their assumption of homogeneous noise neglects the variability of scanning protocols, leading to poor generalization and potential model collapse. We further observe that distinct scanning protocols, which correspond to different physical imaging processes, produce discrete sub-manifolds in the feature space, contradicting these assumptions and limiting their effectiveness. To address this, we propose an Uncertainty-Guided Manifold Smoothing (UMS) framework to bridge the gaps between sub-manifolds. A classifier in UMS identifies sub-manifolds and predicts uncertainty scores, which guide the generation of diverse samples across the entire manifold. By leveraging the classifier’s capability, UMS effectively fills the gaps between discrete sub-manifolds, and promotes a continuous and dense feature space. Due to the complexity of the global manifold, it’s hard to directly model it. Therefore, we propose to dynamically incorporate the global- and sub-manifold-specific features. Specifically, we design a global- and sub-manifold-driven architecture guided by the classifier, which enables dynamic adaptation to subdomain variations. This dynamic mechanism improves the network’s capacity to capture both shared and domain-specific features, thereby improving reconstruction performance. Extensive experiments on public datasets are conducted to validate the effectiveness of our method across different generation paradigms.
[CV-73] Reference-Free Omnidirectional Stereo Matching via Multi-View Consistency Maximization
【速读】:该论文旨在解决多鱼眼相机(multi-fisheye)立体匹配中全局一致性、可见性感知与尺度变化敏感等问题,现有方法通常依赖球面扫描或参考视图中心的立体匹配策略,难以显式建模多视角间的几何关系,导致对遮挡、部分重叠及基线变化等场景适应能力弱。其解决方案的关键在于提出一种无参考(reference-free)框架 FreeOmniMVS,通过最大化多视角一致性实现鲁棒的深度估计;核心创新包括:1)引入视图对相关性Transformer(View-pair Correlation Transformer, VCT),显式建模所有相机视图对之间的相关性体积,从而剔除因遮挡或离焦导致的不可靠视图对;2)设计轻量级注意力机制,自适应融合相关向量,实现可扩展且可见性感知的全局共识,无需指定参考视图,使所有相机平等参与立体匹配过程。
链接: https://arxiv.org/abs/2603.15019
作者: Lehuai Xu,Weiming Zhang,Yang Li,Sidan Du,Lin Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 5 figures
Abstract:Reliable omnidirectional depth estimation from multi-fisheye stereo matching is pivotal to many applications, such as embodied robotics. Existing approaches either rely on spherical sweeping with heuristic fusion strategies to build the cost columns or perform reference-centric stereo matching based on rectified views. However, these methods fail to explicitly exploit geometric relationships between multiple views, rendering them less capable of capturing the global dependencies, visibility, or scale changes. In this paper, we shift to a new perspective and propose a novel reference-free framework, dubbed FreeOmniMVS, via multi-view consistency maximization. The highlight of FreeOmniMVS is that it can aggregate pair-wise correlations into a robust, visibility-aware, and global consensus. As such, it is tolerant to occlusions, partial overlaps, and varying baselines. Specifically, to achieve global coherence, we introduce a novel View-pair Correlation Transformer (VCT) that explicitly models pairwise correlation volumes across all camera view pairs, allowing us to drop unreliable pairs caused by occlusion or out-of-focus observations. To realize scalable and visibility-aware consensus, we propose a lightweight attention mechanism that adaptively fuses the correlation vectors, eliminating the need for a designated reference view and allowing all cameras to contribute equally to the stereo matching process. Extensive experiments on diverse benchmark datasets demonstrate the superiority of our method for globally consistent, visibility-aware, and scale-aware omnidirectional depth estimation.
[CV-74] Riemannian Motion Generation: A Unified Framework for Human Motion Representation and Generation via Riemannian Flow Matching
【速读】:该论文旨在解决人类动作生成模型在欧几里得空间中学习时忽略动作固有非欧几里得几何结构的问题,从而导致生成动作缺乏物理合理性与自然性。解决方案的关键在于提出Riemannian Motion Generation (RMG) 框架,该框架将动作表示为流形乘积空间上的结构,并通过黎曼流匹配(Riemannian flow matching)建模动力学;其核心创新包括:将运动分解为平移(translation)和旋转(rotation)等几何因子,获得无量纲的内在归一化表示,同时采用测地线插值、切空间监督和保持流形结构的常微分方程(ODE)积分策略进行训练与采样,显著提升了动作生成的质量与稳定性。
链接: https://arxiv.org/abs/2603.15016
作者: Fangran Miao,Jian Huang,Ting Li
机构: The Hong Kong Polytechnic University(香港理工大学); Southern University of Science and Technology(南方科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: 18 pages, 6 figures
Abstract:Human motion generation is often learned in Euclidean spaces, although valid motions follow structured non-Euclidean geometry. We present Riemannian Motion Generation (RMG), a unified framework that represents motion on a product manifold and learns dynamics via Riemannian flow matching. RMG factorizes motion into several manifold factors, yielding a scale-free representation with intrinsic normalization, and uses geodesic interpolation, tangent-space supervision, and manifold-preserving ODE integration for training and sampling. On HumanML3D, RMG achieves state-of-the-art FID in the HumanML3D format (0.043) and ranks first on all reported metrics under the MotionStreamer format. On MotionMillion, it also surpasses strong baselines (FID 5.6, R@1 0.86). Ablations show that the compact $\mathscr{T}+\mathscr{R}$ (translation + rotations) representation is the most stable and effective, highlighting geometry-aware modeling as a practical and scalable route to high-fidelity motion generation.
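The geodesic interpolation RMG relies on for its rotation factor can be illustrated with spherical linear interpolation (slerp) between unit quaternions, which stays on the rotation manifold where straight Euclidean interpolation would leave it. This is a minimal illustrative sketch, not the paper's implementation; the `slerp` helper and toy quaternions are hypothetical:

```python
import math

def slerp(q0, q1, t):
    """Geodesic interpolation between two unit quaternions.

    Follows the shortest great-circle arc on the unit 3-sphere (the
    double cover of SO(3)); plain linear interpolation would produce
    non-unit quaternions, i.e. invalid rotations.
    """
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:               # flip to take the shorter arc
        q1, dot = [-x for x in q1], -dot
    dot = min(dot, 1.0)
    theta = math.acos(dot)      # angle between the quaternions
    if theta < 1e-8:            # nearly identical: fall back to lerp
        return [(1 - t) * a + t * b for a, b in zip(q0, q1)]
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(q0, q1)]

# Halfway between the identity and a 90-degree rotation about z
q_id = [1.0, 0.0, 0.0, 0.0]
q_90z = [math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4)]
q_mid = slerp(q_id, q_90z, 0.5)
norm = math.sqrt(sum(x * x for x in q_mid))  # stays on the unit sphere
```

Halfway along the geodesic lands exactly on the 45-degree rotation, which is the manifold-respecting behavior the paper exploits for training targets and sampling.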
[CV-75] Molecular Identifier Visual Prompt and Verifiable Reinforcement Learning for Chemical Reaction Diagram Parsing
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在反应图谱解析(Reaction Diagram Parsing, RxnDP)任务中面临的两大核心挑战:一是视觉化学实体与预训练知识之间的对齐困难,二是词元级训练与反应级评估之间的固有不一致性。解决方案的关键在于从提示表示(prompting representation)和学习范式(learning paradigms)两个互补角度进行优化:首先提出“标识符作为视觉提示”(Identifier as Visual Prompting, IdtVP),利用自然出现的分子标识符(如加粗数字 1a)激活 VLM 预训练阶段所学的化学知识,从而显著提升零样本和分布外泛化能力;其次引入 Re3-DAPO 强化学习算法,通过可验证奖励直接优化反应级指标,相较标准监督微调实现持续性能提升。
链接: https://arxiv.org/abs/2603.15011
作者: Jiahe Song,Chuang Wang,Yinfan Wang,Hao Zheng,Rui Nie,Bowen Jiang,Xingjian Wei,Junyuan Gao,Yubin Wang,Bin Wang,Lijun Wu,Jiang Wu,Qian Yu,Conghui He
机构: Peking University (北京大学); Beijing University of Aeronautics and Astronautics (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reaction diagram parsing (RxnDP) is critical for extracting chemical synthesis information from literature. Although recent Vision-Language Models (VLMs) have emerged as a promising paradigm to automate this complex visual reasoning task, their application is fundamentally bottlenecked by the inability to align visual chemical entities with pre-trained knowledge, alongside the inherent discrepancy between token-level training and reaction-level evaluation. To address these dual challenges, this work enhances VLM-based RxnDP from two complementary perspectives: prompting representation and learning paradigms. First, we propose Identifier as Visual Prompting (IdtVP), which leverages naturally occurring molecule identifiers (e.g., bold numerals like 1a) to activate the chemical knowledge acquired during VLM pre-training. IdtVP enables powerful zero-shot and out-of-distribution capabilities, outperforming existing prompting strategies. Second, to further optimize performance within fine-tuning paradigms, we introduce Re3-DAPO, a reinforcement learning algorithm that leverages verifiable rewards to directly optimize reaction-level metrics, thereby achieving consistent gains over standard supervised fine-tuning. Additionally, we release the ScannedRxn benchmark, comprising scanned historical reaction diagrams with real-world artifacts, to rigorously assess model robustness and out-of-distribution ability. Our contributions advance the accuracy and generalization of VLM-based reaction diagram parsing. We will release data, models, and code on GitHub.
[CV-76] Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning
【速读】:该论文旨在解决多模态大语言模型(Multi-modal Large Language Models, MLLMs)在视频问答(Video Question Answering, VideoQA)任务中面临的挑战,尤其是由于缺乏显式的结构化推理机制而导致的幻觉严重、可解释性差等问题。核心问题包括:视觉线索提取不忠实、线索筛选缺乏效用感知、以及线索与答案之间无法端到端对齐。解决方案的关键在于提出ClueNet框架,其采用两阶段监督微调范式,通过解耦监督实现线索提取与链式推理的对齐,并引入自适应线索过滤器在推理阶段优化高阶推理过程,同时设计轻量化模块提升推理效率。该方法显著提升了视频理解的准确性、鲁棒性和可解释性,且兼容多种骨干模型架构。
链接: https://arxiv.org/abs/2603.15008
作者: Kaixin Zhang,Xiaohe Li,Jiahao Li,Haohua Wu,Xinyu Zhao,Zide Fan,Lei Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 7 figures
Abstract:Multi-modal Large Language Models (MLLMs) have significantly advanced video reasoning, yet Video Question Answering (VideoQA) remains challenging due to its demand for temporal causal reasoning and evidence-grounded answer generation. Prevailing end-to-end MLLM frameworks lack explicit structured reasoning between visual perception and answer derivation, causing severe hallucinations and poor interpretability. Existing methods also fail to address three core gaps: faithful visual clue extraction, utility-aware clue filtering, and end-to-end clue-answer alignment. Inspired by hierarchical human visual cognition, we propose ClueNet, a clue-aware video reasoning framework with a two-stage supervised fine-tuning paradigm without extensive base model modifications. Decoupled supervision aligns clue extraction and chain-based reasoning, while inference supervision with an adaptive clue filter refines high-order reasoning, alongside lightweight modules for efficient inference. Experiments on NExT-QA, STAR, and MVBench show that ClueNet outperforms state-of-the-art methods by ≥1.1%, with superior generalization, hallucination mitigation, inference efficiency, and cross-backbone compatibility. This work bridges the perception-to-generation gap in MLLM video understanding, providing an interpretable, faithful reasoning paradigm for high-stakes VideoQA applications.
[CV-77] Edit2Interp: Adapting Image Foundation Models from Spatial Editing to Video Frame Interpolation with Few-Shot Learning
【速读】:该论文旨在解决预训练图像编辑模型缺乏显式时间建模能力的问题,即如何利用仅具备空间推理和对象感知变换能力的静态图像编辑模型来实现视频帧插值(Video Frame Interpolation, VFI)这一时序任务。解决方案的关键在于:通过少量样本(64–256个训练样本)使用低秩适应(Low-Rank Adaptation, LoRA)对原生设计用于静态指令编辑的大型图像编辑模型(Qwen-Image-Edit)进行微调,从而激活其隐含的时间推理能力——这种能力源于模型对“物体在静态场景中如何变化”的理解,而无需引入任何视频专用架构或运动估计模块。该方法证明了基础图像编辑模型在资源受限场景下具有可迁移的时序合成潜力,为连接图像操作与视频理解提供了新的数据高效路径。
链接: https://arxiv.org/abs/2603.15003
作者: Nasrin Rahimi,Mısra Yavuz,Burak Can Biner,Yunus Bilge Kurt,Ahmet Rasim Emirdağı,Süleyman Aslan,Görkay Aydemir,M. Akın Yılmaz,A. Murat Tekalp
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Pre-trained image editing models exhibit strong spatial reasoning and object-aware transformation capabilities acquired from billions of image-text pairs, yet they possess no explicit temporal modeling. This paper demonstrates that these spatial priors can be repurposed to unlock temporal synthesis capabilities through minimal adaptation - without introducing any video-specific architecture or motion estimation modules. We show that a large image editing model (Qwen-Image-Edit), originally designed solely for static instruction-based edits, can be adapted for Video Frame Interpolation (VFI) using only 64-256 training samples via Low-Rank Adaptation (LoRA). Our core contribution is revealing that the model’s inherent understanding of “how objects transform” in static scenes contains latent temporal reasoning that can be activated through few-shot fine-tuning. While the baseline model completely fails at producing coherent intermediate frames, our parameter-efficient adaptation successfully unlocks its interpolation capability. Rather than competing with task-specific VFI methods trained from scratch on massive datasets, our work establishes that foundation image editing models possess untapped potential for temporal tasks, offering a data-efficient pathway for video synthesis in resource-constrained scenarios. This bridges the gap between image manipulation and video understanding, suggesting that spatial and temporal reasoning may be more intertwined in foundation models than previously recognized.
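The low-rank update at the heart of the LoRA adaptation above can be sketched in a few lines. This is the generic LoRA formulation with hypothetical toy dimensions, not the paper's Qwen-Image-Edit fine-tuning; note the zero-initialized up-projection, which makes the adapter an exact no-op before training, so few-shot tuning starts from the pretrained behavior:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 8, 2, 4.0     # toy sizes; rank r << d

W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                 # trainable up-projection, zero init

def lora_forward(x):
    # Base path plus low-rank update, scaled by alpha / r as in LoRA.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
y0 = lora_forward(x)
# With B zero-initialized the adapter contributes nothing at step 0,
# and only A.size + B.size parameters (here 32 vs. 64 in W) are trained.
```

The tiny trainable parameter count relative to the frozen weight is what makes adaptation from only 64-256 samples plausible.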
[CV-78] Thermal Image Refinement with Depth Estimation using Recurrent Networks for Monocular ORB-SLAM3
【速读】:该论文旨在解决在无GPS信号且视觉条件恶劣环境下,无人飞行器(UAV)自主导航的难题。其核心挑战在于如何利用单一热成像相机实现实时深度估计与同时定位与建图(SLAM)。解决方案的关键在于提出了一种新颖的轻量级监督网络架构,该架构融合了循环块(Recurrent Blocks, RBs)以捕捉时间依赖性,从而提升深度预测的鲁棒性;同时结合轻量化卷积骨干网络与热图像精炼网络(T-RefNet),增强热图像特征可见度,并将优化后的热图像和预测深度图集成至ORB-SLAM3中,实现纯热成像下的稳定定位。值得注意的是,该方法仅需在非辐射校准(non-radiometric)热图像数据集上训练,避免了对昂贵辐射校准热成像设备的依赖,实验证明其在低光照条件下具有优于现有方法的深度精度与SLAM性能。
链接: https://arxiv.org/abs/2603.14998
作者: Hürkan Şahin,Huy Xuan Pham,Van Huyen Dang,Alper Yegenoglu,Erdal Kayacan
机构: Paderborn University (帕德博恩大学); Aarhus University (奥胡斯大学); Upteko ApS
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages, 8 figures, 2 tables
Abstract:Autonomous navigation in GPS-denied and visually degraded environments remains challenging for unmanned aerial vehicles (UAVs). To this end, we investigate the use of a monocular thermal camera as a standalone sensor on a UAV platform for real-time depth estimation and simultaneous localization and mapping (SLAM). To extract depth information from thermal images, we propose a novel pipeline employing a lightweight supervised network with recurrent blocks (RBs) integrated to capture temporal dependencies, enabling more robust predictions. The network combines lightweight convolutional backbones with a thermal refinement network (T-RefNet) to refine raw thermal inputs and enhance feature visibility. The refined thermal images and predicted depth maps are integrated into ORB-SLAM3, enabling thermal-only localization. Unlike previous methods, the network is trained on a custom non-radiometric dataset, obviating the need for high-cost radiometric thermal cameras. Experimental results on datasets and UAV flights demonstrate competitive depth accuracy and robust SLAM performance under low-light conditions. On the radiometric VIVID++ (indoor-dark) dataset, our method achieves an absolute relative error of approximately 0.06, compared to baselines exceeding 0.11. In our non-radiometric indoor set, baseline errors remain above 0.24, whereas our approach remains below 0.10. Thermal-only ORB-SLAM3 maintains a mean trajectory error under 0.4 m.
[CV-79] MMSpec: Benchmarking Speculative Decoding for Vision-Language Models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在推理过程中因模型规模大和多模态上下文长而导致的高延迟问题。现有高效的推测解码(Speculative Decoding)技术虽在纯文本大语言模型(Large Language Models, LLMs)中表现良好,但在VLMs中的行为尚不明确且效果不佳。为系统评估这一问题,作者提出了MMSpec——首个针对VLMs的推测解码基准,涵盖600个多模态样本和10种代表性算法。研究发现:(1)专为文本LLMs设计的推测方法在多模态场景下性能下降;(2)随着批量大小增加,视觉感知能力变得愈发关键;(3)吞吐量提升并不能可靠反映实际延迟改善。基于这些洞见,论文提出ViSkip,一种即插即用的推测解码方法,能动态适配视觉token,实现最优延迟性能,达到当前最优水平。
链接: https://arxiv.org/abs/2603.14989
作者: Hui Shen,Xin Wang,Ping Zhang,Yunta Hsieh,Qi Han,Zhongwei Wan,Ziheng Zhang,Jingxuan Zhang,Jing Xiong,Ziyuan Liu,Yifan Zhang,Hangrui Cao,Chenyang Zhao,Mi Zhang
机构: University of Michigan (密歇根大学); The Ohio State University (俄亥俄州立大学); Independent (独立); Indiana University (印第安纳大学); The University of Hong Kong (香港大学); Peking University (北京大学); Carnegie Mellon University (卡内基梅隆大学); LMSYS Org (LMSYS组织)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language models (VLMs) achieve strong performance on multimodal tasks but suffer from high inference latency due to large model sizes and long multimodal contexts. Speculative decoding has recently emerged as an effective acceleration technique, yet its behavior in VLMs remains insufficiently understood. We introduce MMSpec, the first benchmark for evaluating speculative decoding in vision-language models. MMSpec contains 600 multimodal samples across six task categories and integrates ten representative speculative decoding algorithms under a unified evaluation framework. Our study reveals three key findings: (1) methods designed for text-only LLMs degrade in multimodal scenarios, (2) vision awareness becomes increasingly important at larger batch sizes, and (3) throughput speedup alone does not reliably reflect latency performance. Motivated by these findings, we propose ViSkip, a plug-and-play speculative decoding method that dynamically adapts speculation to vision tokens and achieves state-of-the-art performance.
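All the speculative-decoding variants MMSpec benchmarks build on the same verification step: each drafted token is accepted with probability min(1, p/q) against the target model, and the first rejection is resampled from the renormalized residual max(p − q, 0). A minimal sketch with toy distributions follows; the function and sizes are illustrative, not taken from the benchmark:

```python
import numpy as np

def speculative_accept(p_target, q_draft, tokens, rng):
    """Standard speculative-sampling verification for one draft chunk.

    Accept drafted token t with prob min(1, p(t)/q(t)); on the first
    rejection, draw a replacement from the residual distribution
    max(p - q, 0), renormalized, and stop.
    """
    accepted = []
    for t in tokens:
        if rng.random() < min(1.0, p_target[t] / q_draft[t]):
            accepted.append(t)
        else:
            residual = np.maximum(p_target - q_draft, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(p_target), p=residual)))
            break
    return accepted

rng = np.random.default_rng(0)
p = np.array([0.7, 0.2, 0.1])   # target-model distribution (toy)
q = np.array([0.4, 0.4, 0.2])   # draft-model distribution (toy)
out = speculative_accept(p, q, [0, 1, 2], rng)
```

A useful sanity check: when the draft distribution equals the target distribution, every drafted token is accepted, which is the regime where speculative decoding yields its maximum speedup.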
[CV-80] Anchoring Emotions in Text: Robust Multimodal Fusion for Mimicry Intensity Estimation
【速读】:该论文旨在解决在自然环境中准确估计情感模仿强度(Emotional Mimicry Intensity, EMI)的难题,其核心挑战在于如何建模跨异构模态(如视觉和听觉信号)中复杂的非线性时序动态,尤其是在物理信号受噪声污染或缺失的情况下。解决方案的关键在于提出一种名为TAEMI(Text-Anchored Emotional Mimicry Intensity estimation)的新型多模态框架:首先打破传统对称融合范式,利用文本转录所蕴含的稳定、时间无关语义先验作为中心锚点;其次引入Text-Anchored Dual Cross-Attention机制,通过文本查询主动过滤帧级冗余信息并对齐噪声干扰的物理流;此外,为应对现实场景中不可避免的模态缺失问题,设计了可学习的缺失模态标记(Learnable Missing-Modality Tokens)与模态丢弃策略(Modality Dropout),从而显著提升模型在不完美条件下的预测鲁棒性。
链接: https://arxiv.org/abs/2603.14976
作者: Lingsi Zhu,Yuefeng Zou,Yunxiang Zhang,Naixiang Zheng,Guoyuan Wang,Jun Yu,Jiaen Liang,Wei Huang,Shengping Liu,Ximin Zheng
机构: University of Science and Technology of China (中国科学技术大学); Unisound AI Technology Co., Ltd. (声智科技有限公司); Pingan Technology Co., Ltd. (平安科技有限公司)
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Estimating Emotional Mimicry Intensity (EMI) in naturalistic environments is a critical yet challenging task in affective computing. The primary difficulty lies in effectively modeling the complex, nonlinear temporal dynamics across highly heterogeneous modalities, especially when physical signals are corrupted or missing. To tackle this, we propose TAEMI (Text-Anchored Emotional Mimicry Intensity estimation), a novel multimodal framework designed for the 10th ABAW Competition. Motivated by the observation that continuous visual and acoustic signals are highly susceptible to transient environmental noise, we break the traditional symmetric fusion paradigm. Instead, we leverage textual transcripts, which inherently encode a stable, time-independent semantic prior, as central anchors. Specifically, we introduce a Text-Anchored Dual Cross-Attention mechanism that utilizes these robust textual queries to actively filter out frame-level redundancies and align the noisy physical streams. Furthermore, to prevent catastrophic performance degradation caused by inevitably missing data in unconstrained real-world scenarios, we integrate Learnable Missing-Modality Tokens and a Modality Dropout strategy during training. Extensive experiments on the Hume-Vidmimic2 dataset demonstrate that TAEMI effectively captures fine-grained emotional variations and maintains robust predictive resilience under imperfect conditions. Our framework achieves a state-of-the-art mean Pearson correlation coefficient across six continuous emotional dimensions, significantly outperforming existing baseline methods.
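The text-anchored attention idea can be illustrated with a single-head, projection-free cross-attention in which text tokens act as queries over frame-level keys/values, so the stable textual semantics decide how much each (possibly noisy) frame contributes. Shapes and names below are hypothetical stand-ins, not TAEMI's actual layers:

```python
import numpy as np

def cross_attention(text_q, frame_kv, d_k):
    """Text-anchored cross-attention (single head, no learned projections).

    text_q:   (T_text, d_k)  text-token queries (the anchors)
    frame_kv: (T_frames, d_k) visual/acoustic frame features (keys = values)
    Returns fused features aligned to the text tokens, plus the weights.
    """
    scores = text_q @ frame_kv.T / np.sqrt(d_k)    # (T_text, T_frames)
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ frame_kv, attn                   # (T_text, d_k), weights

rng = np.random.default_rng(0)
text_q = rng.standard_normal((4, 16))    # 4 text tokens as anchors
frames = rng.standard_normal((10, 16))   # 10 frame-level features
fused, attn = cross_attention(text_q, frames, 16)
```

Because the output is indexed by text tokens rather than frames, transient noise in individual frames is down-weighted rather than propagated, which is the intuition behind anchoring fusion on transcripts.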
[CV-81] Voronoi-based Second-order Descriptor with Whitened Metric in LiDAR Place Recognition ICRA26
【速读】:该论文旨在解决LiDAR Place Recognition (LPR) 中全局描述符在使用二阶池化(second-order pooling)时因传统实现方式与后归一化导致的欧氏距离不适用问题。其核心解决方案是将二阶池化与Voronoi单元的归纳偏置(inductive bias)相结合,通过构建局部描述符的二阶矩阵并引入白化(whitening)机制,在隐式度量马氏距离(Mahalanobis distance)的同时保留Voronoi单元的聚类特性,从而缓解学习过程中因多样化技术带来的数值不稳定问题。
链接: https://arxiv.org/abs/2603.14974
作者: Jaein Kim,Hee Bin Yoo,Dong-Sig Han,Byoung-Tak Zhang
机构: Seoul National University (首尔国立大学); École Normale Supérieure (ENS) (法国巴黎高等师范学院); Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted at ICRA 26
Abstract:The pooling layer plays a vital role in aggregating local descriptors into the metrizable global descriptor in LiDAR Place Recognition (LPR). In particular, second-order pooling is capable of capturing higher-order interactions among local descriptors. However, existing methods in LPR adhere to conventional implementations and post-normalization, rendering the descriptor unsuitable for Euclidean distancing. Based on the recent interpretation that associates NetVLAD with the second-order statistics, we propose to integrate second-order pooling with the inductive bias from Voronoi cells. Our novel pooling method aggregates local descriptors to form the second-order matrix and whitens the global descriptor to implicitly measure the Mahalanobis distance while conserving the cluster property from Voronoi cells, addressing its numerical instability during learning with diverse techniques. We demonstrate its performance gains through the experiments conducted on the Oxford RobotCar and Wild-Places benchmarks and analyze the numerical effect of the proposed whitening algorithm.
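The whitening effect the paper targets, namely that Euclidean distance in the whitened space equals Mahalanobis distance under the second-order matrix, can be verified numerically on toy data. The sketch below uses a simple centered second moment and an eigendecomposition-based inverse square root; it illustrates the principle, not the paper's descriptor pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
# 500 correlated 6-d "local descriptors" (toy stand-ins)
X = rng.standard_normal((500, 6)) @ rng.standard_normal((6, 6))

# Second-order pooling: aggregate local descriptors into a
# covariance-like second-order matrix.
mu = X.mean(axis=0)
C = (X - mu).T @ (X - mu) / len(X)

# Whitening: multiply by C^{-1/2} so Euclidean distance in the new
# space equals Mahalanobis distance under C.
w, V = np.linalg.eigh(C)                   # C symmetric PSD
Wm = V @ np.diag(1.0 / np.sqrt(w)) @ V.T   # C^{-1/2}
Xw = (X - mu) @ Wm
Cw = Xw.T @ Xw / len(Xw)                   # ≈ identity after whitening
```

The inverse square root of small eigenvalues is where the numerical instability mentioned in the abstract comes from, which is why the paper pairs whitening with stabilization techniques during learning.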
[CV-82] GeoNVS: Geometry Grounded Video Diffusion for Novel View Synthesis
【速读】:该论文旨在解决现有相机控制视频扩散模型在新视角合成(Novel View Synthesis, NVS)任务中面临的几何一致性不足与相机可控性有限的问题。其解决方案的关键在于提出一种基于几何引导的特征适配器——Gaussian Splat Feature Adapter (GS-Adapter),该模块将输入视图的扩散特征显式提升至3D高斯表示,通过几何约束渲染新视角特征,并自适应融合以修正几何不一致的表示,从而在特征空间中实现更精确的几何保真度和更强的相机控制能力。此方法避免了传统基于输入层注入几何信息所带来的视图依赖性颜色噪声问题,且具备即插即用特性,无需额外训练即可兼容多种前向几何模型。
链接: https://arxiv.org/abs/2603.14965
作者: Minjun Kang,Inkyu Shin,Taeyeop Lee,Myungchul Kim,In So Kweon,Kuk-Jin Yoon
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The code will be available at this https URL
Abstract:Novel view synthesis requires strong 3D geometric consistency and the ability to generate visually coherent images across diverse viewpoints. While recent camera-controlled video diffusion models show promising results, they often suffer from geometric distortions and limited camera controllability. To overcome these challenges, we introduce GeoNVS, a geometry-grounded novel-view synthesizer that enhances both geometric fidelity and camera controllability through explicit 3D geometric guidance. Our key innovation is the Gaussian Splat Feature Adapter (GS-Adapter), which lifts input-view diffusion features into 3D Gaussian representations, renders geometry-constrained novel-view features, and adaptively fuses them with diffusion features to correct geometrically inconsistent representations. Unlike prior methods that inject geometry at the input level, GS-Adapter operates in feature space, avoiding view-dependent color noise that degrades structural consistency. Its plug-and-play design enables zero-shot compatibility with diverse feed-forward geometry models without additional training, and can be adapted to other video diffusion backbones. Experiments across 9 scenes and 18 settings demonstrate state-of-the-art performance, achieving 11.3% and 14.9% improvements over SEVA and CameraCtrl, with up to 2x reduction in translation error and 7x in Chamfer Distance.
[CV-83] CyCLeGen: Cycle-Consistent Layout Prediction and Image Generation in Vision Foundation Models
【速读】:该论文旨在解决现有视觉模型在图像理解与图像生成任务中通常依赖独立模块、难以协同优化的问题。其解决方案的关键在于提出一种统一的视觉-语言基础模型CyCLeGen,采用全集成架构并通过图像-布局-图像和布局-图像-布局的循环一致性学习机制,实现感知与合成的联合训练。这一设计不仅引入了“自省”能力(introspection),使模型能够对自身生成结果进行推理,还提升了数据效率,支持通过强化学习框架下的循环一致性监督实现自我改进。
链接: https://arxiv.org/abs/2603.14957
作者: Xiaojun Shan,Haoyu Shen,Yucheng Mao,Xiang Zhang,Abhay Anand,Bingnan Li,Haiyang Xu,Zhuowen Tu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present CyCLeGen, a unified vision-language foundation model capable of both image understanding and image generation within a single autoregressive framework. Unlike existing vision models that depend on separate modules for perception and synthesis, CyCLeGen adopts a fully integrated architecture that enforces cycle-consistent learning through image-layout-image and layout-image-layout generation loops. This unified formulation introduces two key advantages: introspection, enabling the model to reason about its own generations, and data efficiency, allowing self-improvement via synthetic supervision under a reinforcement learning objective guided by cycle consistency. Extensive experiments show that CyCLeGen achieves significant gains across diverse image understanding and generation benchmarks, highlighting the potential of unified vision-language foundation models.
[CV-84] Learning Question-Aware Keyframe Selection with Synthetic Supervision for Video Question Answering
【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在视频问答(VideoQA)任务中因推理成本高和信息稀释导致的性能瓶颈问题。其核心挑战在于如何在保证效率的同时提升时序与因果推理能力。解决方案的关键在于提出一种问题感知的关键帧选择框架,包含两个创新组件:一是利用LMM生成伪关键帧标签(pseudo keyframe labels),提供更具信息量的监督信号以缓解稀疏标注问题;二是引入覆盖正则化(coverage regularization),鼓励时间维度上多样且互补的证据选择,从而优化关键帧分布并增强推理准确性。实验表明,该方法显著提升了NExT-QA数据集上的整体准确率,尤其在时序和因果类问题上表现突出。
链接: https://arxiv.org/abs/2603.14953
作者: Minchan Kwon,Hyounguk Shon,Junmo Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Large multimodal models (LMMs) have recently demonstrated remarkable performance in video question answering (VideoQA), yet reasoning over video remains challenging due to high inference cost and diluted information. Keyframe selection offers efficiency and sharper reasoning but suffers from sparse supervision and redundant frame choices when relying only on image-text similarity. We present a question-aware keyframe selection framework with two components: pseudo keyframe labels derived from LMMs that provide informative supervision and a coverage regularization that promotes diverse, complementary evidence across time. Experiments on NExT-QA show that our method significantly improves accuracy, especially for temporal and causal question types, establishing keyframe selection as an effective and learnable module for VideoQA.
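The interaction between question relevance and coverage regularization can be mimicked by a greedy toy selector that penalizes temporal redundancy among chosen frames. This is a hand-rolled surrogate to convey the effect, not the paper's learned module; the scores, penalty form, and frame counts are all hypothetical:

```python
import numpy as np

def select_keyframes(relevance, k, lam=0.5):
    """Greedy question-aware keyframe selection with a coverage term.

    Each round picks the frame maximizing
        relevance[i] - lam * (temporal closeness to frames already chosen),
    so higher lam trades raw relevance for temporal spread.
    """
    n = len(relevance)
    t = np.arange(n) / (n - 1)          # normalized timestamps in [0, 1]
    chosen = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in range(n):
            if i in chosen:
                continue
            # redundancy: closeness in time to the nearest selected frame
            redundancy = max((1.0 - abs(t[i] - t[j]) for j in chosen),
                             default=0.0)
            score = relevance[i] - lam * redundancy
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return sorted(chosen)

# Relevance peaked on a cluster of adjacent frames (indices 1-3).
rel = np.array([0.1, 0.9, 0.85, 0.8, 0.1, 0.1, 0.6, 0.1])
frames = select_keyframes(rel, k=3, lam=0.5)
```

With lam = 0 the selector degenerates to top-k relevance and picks the three adjacent frames 1, 2, 3; with the coverage penalty it swaps one of them for the later frame 6, spreading evidence across time as the regularizer intends.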
[CV-85] Pansharpening for Thin-Cloud Contaminated Remote Sensing Images: A Unified Framework and Benchmark Dataset AAAI2026
【速读】:该论文旨在解决薄云条件下遥感图像的融合问题,即在同时存在空间分辨率下降和云引起的光谱失真的情况下,如何实现高质量的多光谱图像(MSI)与全色图像(PAN)融合。现有方法通常将去云与融合分步处理,导致误差累积且性能受限,因其缺乏对两类退化过程的联合建模。本文提出统一的端到端框架Pan-TCR,其核心创新在于设计了一个频域解耦恢复(FDR)模块,将MSI特征分解为幅度与相位两部分:其中近红外(NIR)波段幅度用于增强抗云能力,全色(PAN)相位用于提升高分辨率结构细节;进一步引入交互式跨频一致性(IFC)模块,通过跨模态优化确保幅度与相位之间的频率一致性,从而提升整体重建鲁棒性。该方案首次实现了薄云污染场景下的联合退化建模与高效修复,显著优于传统分步处理方法。
链接: https://arxiv.org/abs/2603.14952
作者: Songcheng Du,Yang Zou,Jiaxin Li,Mingxuan Liu,Ying Li,Changjing Shang,Qiang Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures, published in AAAI 2026
Abstract:Pansharpening under thin cloudy conditions is a practically significant yet rarely addressed task, challenged by simultaneous spatial resolution degradation and cloud-induced spectral distortions. Existing methods often address cloud removal and pansharpening sequentially, leading to cumulative errors and suboptimal performance due to the lack of joint degradation modeling. To address these challenges, we propose a Unified Pansharpening Model with Thin Cloud Removal (Pan-TCR), an end-to-end framework that integrates physical priors. Motivated by theoretical analysis in the frequency domain, we design a frequency-decoupled restoration (FDR) block that disentangles the restoration of multispectral image (MSI) features into amplitude and phase components, each guided by complementary degradation-robust prompts: the near-infrared (NIR) band amplitude for cloud-resilient restoration, and the panchromatic (PAN) phase for high-resolution structural enhancement. To ensure coherence between the two components, we further introduce an interactive inter-frequency consistency (IFC) module, enabling cross-modal refinement that enforces consistency and robustness across frequency cues. Furthermore, we introduce the first real-world thin-cloud contaminated pansharpening dataset (PanTCR-GF2), comprising paired clean and cloudy PAN-MSI images, to enable robust benchmarking under realistic conditions. Extensive experiments on real-world and synthetic datasets demonstrate the superiority and robustness of Pan-TCR, establishing a new benchmark for pansharpening under realistic atmospheric degradations.
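The amplitude/phase decoupling behind the FDR block is standard Fourier decomposition: amplitude carries intensity/style statistics while phase carries spatial structure, which is why Pan-TCR can guide them with NIR and PAN cues respectively. The sketch below demonstrates the lossless split and the effect of swapping components; random arrays stand in for the actual MSI/PAN bands:

```python
import numpy as np

rng = np.random.default_rng(0)
img_a = rng.standard_normal((16, 16))   # toy stand-in for one band
img_b = rng.standard_normal((16, 16))   # toy stand-in for another band

# Frequency-domain decoupling: split each image into amplitude |F|
# and phase angle(F).
F_a, F_b = np.fft.fft2(img_a), np.fft.fft2(img_b)
amp_a, pha_a = np.abs(F_a), np.angle(F_a)
amp_b, pha_b = np.abs(F_b), np.angle(F_b)

# Recombining an image's own amplitude and phase is lossless...
recon_a = np.fft.ifft2(amp_a * np.exp(1j * pha_a)).real
# ...while mixing components from two sources produces a hybrid that
# matches neither: the two components carry complementary information.
hybrid = np.fft.ifft2(amp_b * np.exp(1j * pha_a)).real
```

Restoring the two components under different guidance, as the FDR block does, is only well-posed because each component can be manipulated independently and the pair recombined exactly.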
[CV-86] GT-PCQA: Geometry-Texture Decoupled Point Cloud Quality Assessment with MLLM
【速读】:该论文旨在解决基于多模态大语言模型(Multi-modal Large Language Models, MLLMs)的无参考图像质量评估(No-reference Image Quality Assessment, NR-IQA)方法难以直接迁移到点云质量评估(Point Cloud Quality Assessment, PCQA)任务的问题。具体而言,现有PCQA数据集规模有限,难以支持MLLM的有效指令微调;同时,由于大规模图文预训练导致MLLM倾向于依赖纹理特征进行推理,对几何结构退化(geometric structural degradations)敏感度不足,而这正是PCQA的关键判别因素。解决方案的关键在于提出一种名为GT-PCQA的新框架,其核心创新为两个策略:一是采用2D-3D联合训练策略,将PCQA建模为相对质量比较问题,从而融合大规模IQA数据与有限PCQA数据,并结合低秩适应(Low-Rank Adaptation, LoRA)实现高效指令微调;二是设计几何-纹理解耦策略,通过双提示机制与交替优化方案,缓解MLLM固有的纹理主导偏差,提升对几何结构退化的感知能力。
链接: https://arxiv.org/abs/2603.14951
作者: Guohua Zhang,Jian Jin,Meiqin Liu,Chao Yao,Weisi Lin,Yao Zhao
机构: Beijing Jiaotong University (北京交通大学); Nanyang Technological University (南洋理工大学); University of Science and Technology Beijing (北京科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the rapid advancement of Multi-modal Large Language Models (MLLMs), MLLM-based Image Quality Assessment (IQA) methods have shown promising generalization. However, directly extending these MLLM-based IQA methods to PCQA remains challenging. On the one hand, existing PCQA datasets are limited in scale, which hinders stable and effective instruction tuning of MLLMs. On the other hand, due to large-scale image-text pretraining, MLLMs tend to rely on texture-dominant reasoning and are insufficiently sensitive to geometric structural degradations that are critical for PCQA. To address these gaps, we propose a novel MLLM-based no-reference PCQA framework, termed GT-PCQA, which is built upon two key strategies. First, to enable stable and effective instruction tuning under scarce PCQA supervision, a 2D-3D joint training strategy is proposed. This strategy formulates PCQA as a relative quality comparison problem to unify large-scale IQA datasets with limited PCQA datasets. It incorporates a parameter-efficient Low-Rank Adaptation (LoRA) scheme to support instruction tuning. Second, a geometry-texture decoupling strategy is presented, which integrates a dual-prompt mechanism with an alternating optimization scheme to mitigate the inherent texture-dominant bias of pre-trained MLLMs, while enhancing sensitivity to geometric structural degradations. Extensive experiments demonstrate that GT-PCQA achieves competitive performance and exhibits strong generalization.
[CV-87] Bridging Scene Generation and Planning : Driving with World Model via Unifying Vision and Motion Representation
【速读】:该论文旨在解决当前驾驶世界模型(Driving World Model)在视觉表征与运动规划之间存在割裂的问题,即现有方法多聚焦于视觉场景的生成,而缺乏对可共享且可继承的运动表征设计,导致视觉动态优化与精确运动规划需求不一致。解决方案的关键在于提出一个整体框架WorldDrive,其核心创新是通过统一视觉与运动表征来耦合场景生成与实时规划:首先构建轨迹感知的驾驶世界模型(Trajectory-aware Driving World Model),以轨迹词汇为条件强制视觉动态与运动意图的一致性;随后将预训练的视觉和运动编码器迁移至多模态规划器(Multi-modal Planner),使驾驶策略基于已优化的成熟表示运行;并引入未来感知奖励器(Future-aware Rewarder),从冻结的世界模型中提取未来潜在表征用于实时轨迹评估与选择,从而实现高质量、多模态轨迹生成与高保真动作可控视频生成的协同优化。
链接: https://arxiv.org/abs/2603.14948
作者: Xingtai Gui,Meijie Zhang,Tianyi Yan,Wencheng Han,Jiahao Gong,Feiyang Tan,Cheng-zhong Xu,Jianbing Shen
机构: University of Macau (澳门大学); Afari Intelligent Drive (Afari智能驾驶)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 9 figures. The code is available at this https URL
Abstract:End-to-end autonomous driving aims to generate safe and plausible planning policies from raw sensor input. Driving world models have shown great potential in learning rich representations by predicting the future evolution of a driving scene. However, existing driving world models primarily focus on visual scene representation, and motion representation is not explicitly designed to be planner-shared and inheritable, leaving a schism between the optimization of visual scene generation and the requirements of precise motion planning. We present WorldDrive, a holistic framework that couples scene generation and real-time planning via unifying vision and motion representation. We first introduce a Trajectory-aware Driving World Model, which conditions on a trajectory vocabulary to enforce consistency between visual dynamics and motion intentions, enabling the generation of diverse and plausible future scenes conditioned on a specific trajectory. We transfer the vision and motion encoders to a downstream Multi-modal Planner, ensuring the driving policy operates on mature representations pre-optimized by scene generation. A simple interaction between motion representation, visual representation, and ego status can generate high-quality, multi-modal trajectories. Furthermore, to exploit the world model’s foresight, we propose a Future-aware Rewarder, which distills future latent representation from the frozen world model to evaluate and select optimal trajectories in real-time. Extensive experiments on the NAVSIM, NAVSIM-v2, and nuScenes benchmarks demonstrate that WorldDrive achieves leading planning performance among vision-only methods while maintaining high-fidelity action-controlled video generation capabilities, providing strong evidence for the effectiveness of unifying vision and motion representation for robust autonomous driving.
[CV-88] FAR-Drive: Frame-AutoRegressive Video Generation in Closed-Loop Autonomous Driving
Quick read: This paper addresses the reliability bottleneck in training and evaluating autonomous driving systems caused by the lack of scalable, interactive simulation environments. Existing generative video models achieve high visual fidelity but mostly operate open-loop and cannot support fine-grained, frame-level interaction between agent actions and environment evolution. The key innovations of the proposed FAR-Drive framework are: 1) a multi-view diffusion transformer with fine-grained structured control for geometrically consistent multi-camera generation; 2) a two-stage training strategy combining adaptive reference horizon conditioning with blend-forcing autoregressive training, which mitigates the degradation caused by iterative self-conditioning over long horizons and improves consistency; and 3) integrated system-level efficiency optimizations that keep single-GPU inference latency under one second for low-latency interaction. Experiments show the method achieves state-of-the-art performance among closed-loop autonomous driving simulators on the nuScenes dataset.
Link: https://arxiv.org/abs/2603.14938
Authors: Yaoru Li,Federico Landi,Marco Godi,Xin Jin,Ruiju Fu,Yufei Ma,Muyang Sun,Heyu Si,Qi Guo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Despite rapid progress in autonomous driving, reliable training and evaluation of driving systems remain fundamentally constrained by the lack of scalable and interactive simulation environments. Recent generative video models achieve remarkable visual fidelity, yet most operate in open-loop settings and fail to support fine-grained frame-level interaction between agent actions and environment evolution. Building a learning-based closed-loop simulator for autonomous driving poses three major challenges: maintaining long-horizon temporal and cross-view consistency, mitigating autoregressive degradation under iterative self-conditioning, and satisfying low-latency inference constraints. In this work, we propose FAR-Drive, a frame-level autoregressive video generation framework for autonomous driving. We introduce a multi-view diffusion transformer with fine-grained structured control, enabling geometrically consistent multi-camera generation. To address long-horizon consistency and iterative degradation, we design a two-stage training strategy consisting of adaptive reference horizon conditioning and blend-forcing autoregressive training, which progressively improves consistency and robustness under self-conditioning. To meet low-latency interaction requirements, we further integrate system-level efficiency optimizations for inference acceleration. Experiments on the nuScenes dataset demonstrate that our method achieves state-of-the-art performance among existing closed-loop autonomous driving simulation approaches, while maintaining sub-second latency on a single GPU.
[CV-89] Relevance Feedback in Text-to-Image Diffusion: A Training-Free And Model-Agnostic Interactive Framework
Quick read: This paper tackles the difficulty of expressing visual intent precisely through language in text-to-image generation, where ambiguous prompts yield images misaligned with the user's true needs. Existing methods rely on high-load textual dialogue, black-box inference, or expensive fine-tuning, and cannot simultaneously achieve low cognitive load and interpretable preference inference while remaining training-free and model-agnostic. The key ideas of the proposed RFD framework, which adapts the relevance feedback mechanism from information retrieval to diffusion models, are: users replace explicit textual dialogue with multi-select visual feedback to lower cognitive load; an expert-curated feature repository with information-theoretic weighted cumulative preference analysis computes preferences from the current round of feedback and accumulates them incrementally, avoiding the degradation caused by concatenating lengthy interaction histories; and a probabilistic sampling mechanism for prompt reconstruction balances exploiting known preferences against exploring new regions, preventing output homogenization. RFD operates entirely in the external text space, making it strictly training-free and model-agnostic as a plug-and-play solution.
Link: https://arxiv.org/abs/2603.14936
Authors: Wenxi Wang,Hongbin Liu,Mingqian Li,Junyan Yuan,Junqi Zhang
Affiliations: Tongji University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Text-to-image generation using diffusion models has achieved remarkable success. However, users often possess clear visual intents but struggle to express them precisely in language, resulting in ambiguous prompts and misaligned images. Existing methods struggle to bridge this gap, typically relying on high-load textual dialogues, opaque black-box inferences, or expensive fine-tuning. They fail to simultaneously achieve low cognitive load, interpretable preference inference, and remain training-free and model-agnostic. To address this, we propose RFD, an interactive framework that adapts the relevance feedback mechanism from information retrieval to diffusion models. In RFD, users replace explicit textual dialogue with implicit, multi-select visual feedback to minimize cognitive load, easily expressing complex, multi-dimensional preferences. To translate feedback into precise generative guidance, we construct an expert-curated feature repository and introduce an information-theoretic weighted cumulative preference analysis. This white-box method calculates preferences from current-round feedback and incrementally accumulates them, avoiding the concatenation of historical interactions and preventing inference degradation caused by lengthy contexts. Furthermore, RFD employs a probabilistic sampling mechanism for prompt reconstruction to balance exploitation and exploration, preventing output homogenization. Crucially, RFD operates entirely within the external text space, making it strictly training-free and model-agnostic as a universal plug-and-play solution. Extensive experiments demonstrate that RFD effectively captures the user’s true visual intent, significantly outperforming baselines in preference alignment.
[CV-90] Video-CoE: Reinforcing Video Event Prediction via Chain of Events
Quick read: This paper addresses the weak performance of multimodal large language models (MLLMs) on video event prediction (VEP), which stems from insufficient fine-grained temporal modeling of videos and difficulty establishing logical links between video content and future events. The key idea is a new Chain of Events (CoE) paradigm that constructs temporal event chains to implicitly steer the MLLM toward visual information and strengthen the logical connection between videos and future events, combined with multiple training protocols to improve the model's reasoning capability. Experiments on public benchmarks show the method clearly outperforms leading open-source and commercial MLLMs, setting a new state of the art for VEP.
Link: https://arxiv.org/abs/2603.14935
Authors: Qile Su,Jing Tang,Rui Chen,Lei Sun,Xiangxiang Chu
Affiliations: AMAP, Alibaba Group
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 18 figures, 6 tables
Abstract:Despite advances in the application of MLLMs for various video tasks, video event prediction (VEP) remains relatively underexplored. VEP requires the model to perform fine-grained temporal modeling of videos and establish logical relationships between videos and future events, which current MLLMs still struggle with. In this work, we first present a comprehensive evaluation of current leading MLLMs on the VEP task, revealing the reasons behind their inaccurate predictions, including lack of logical reasoning ability for future events prediction and insufficient utilization of visual information. To address these challenges, we propose the Chain of Events (CoE) paradigm, which constructs temporal event chains to implicitly enforce MLLM focusing on the visual content and the logical connections between videos and future events, incentivizing model's reasoning capability with multiple training protocols. Experimental results on public benchmarks demonstrate that our method outperforms both leading open-source and commercial MLLMs, establishing a new state-of-the-art on the VEP task. Codes and models will be released soon.
[CV-91] Workflow-Aware Structured Layer Decomposition for Illustration Production
Quick read: This paper addresses the failure of current generative image editing methods, whose layered representations rely on object segmentation, to capture the structural and stylized properties of human-created images such as anime illustrations. The key idea is a workflow-aware structured layer decomposition framework that mirrors the standard anime production pipeline, decomposing an illustration into semantically meaningful production layers (line art, flat color, shadow, and highlight). Lightweight layer semantic embeddings provide task-specific guidance for each layer, and a set of layer-wise losses supervises training, yielding accurate and visually coherent layer decompositions.
Link: https://arxiv.org/abs/2603.14925
Authors: Tianyu Zhang,Dongchi Li,Keiichi Sawada,Haoran Xie
Affiliations: Japan Advanced Institute of Science and Technology; Live2D Inc.; Waseda University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 17 pages, 15 figures
Abstract:Recent generative image editing methods adopt layered representations to mitigate the entangled nature of raster images and improve controllability, typically relying on object-based segmentation. However, such strategies may fail to capture the structural and stylized properties of human-created images, such as anime illustrations. To solve this issue, we propose a workflow-aware structured layer decomposition framework tailored to the illustration production of anime artwork. Inspired by the creation pipeline of anime production, our method decomposes the illustration into semantically meaningful production layers, including line art, flat color, shadow, and highlight. To decouple all these layers, we introduce lightweight layer semantic embeddings to provide specific task guidance for each layer. Furthermore, a set of layer-wise losses is incorporated to supervise the training process of individual layers. To overcome the lack of ground-truth layered data, we construct a high-quality illustration dataset that simulated the standard anime production workflow. Experiments demonstrate that the accurate and visually coherent layer decompositions were achieved by using our method. We believe that the resulting layered representation further enables downstream tasks such as recoloring and embedding texture, supporting content creation, and illustration editing. Code is available at: this https URL
[CV-92] F²HDR: Two-Stage HDR Video Reconstruction via Flow Adapter and Physical Motion Modeling CVPR2026
Quick read: This paper targets reconstructing high dynamic range (HDR) video from alternating-exposure low dynamic range (LDR) frames, where cross-exposure inconsistencies and complex motion in dynamic scenes make inter-frame alignment difficult, causing ghosting and detail loss. The key contributions of the two-stage F²HDR framework are: 1) a flow adapter that adapts generic optical flow for robust cross-exposure alignment; 2) physical motion modeling to identify salient motion regions; and 3) a motion-aware refinement network that aggregates complementary information while removing ghosting and noise, achieving high-quality, ghost-free HDR video reconstruction under large motion and exposure variations.
Link: https://arxiv.org/abs/2603.14920
Authors: Huanjing Yue,Dawei Li,Shaoxiong Tu,Jingyu Yang
Affiliations: Tianjin University; Huawei Technologies Co., Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026
Abstract:Reconstructing High Dynamic Range (HDR) videos from sequences of alternating-exposure Low Dynamic Range (LDR) frames remains highly challenging, especially under dynamic scenes where cross-exposure inconsistencies and complex motion make inter-frame alignment difficult, leading to ghosting and detail loss. Existing methods often suffer from inaccurate alignment, suboptimal feature aggregation, and degraded reconstruction quality in motion-dominated regions. To address these challenges, we propose F²HDR, a two-stage HDR video reconstruction framework that robustly perceives inter-frame motion and restores fine details in complex dynamic scenarios. The proposed framework integrates a flow adapter that adapts generic optical flow for robust cross-exposure alignment, physical motion modeling to identify salient motion regions, and a motion-aware refinement network that aggregates complementary information while removing ghosting and noise. Extensive experiments demonstrate that F²HDR achieves state-of-the-art performance on real-world HDR video benchmarks, producing ghost-free and high-fidelity results under large motion and exposure variations.
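As a rough illustration of the alternating-exposure setting this paper works in, the sketch below shows a classic exposure-weighted (Debevec-style) merge of aligned LDR frames into a radiance map. It is a textbook baseline assumed for illustration only, not the F²HDR pipeline, and the triangle weighting function is an assumption of this sketch:

```python
import numpy as np

def merge_ldr_to_hdr(frames, exposure_times):
    """Merge aligned LDR frames (values in [0, 1]) into an HDR radiance map.
    A triangle weight down-weights under/over-exposed pixels so that each
    frame contributes most where it is well exposed. Illustrative baseline
    only; NOT the paper's F^2HDR method."""
    num = np.zeros_like(frames[0], dtype=np.float64)
    den = np.zeros_like(frames[0], dtype=np.float64)
    for img, t in zip(frames, exposure_times):
        x = img.astype(np.float64)
        w = 1.0 - np.abs(2.0 * x - 1.0)   # triangle weight, peak at mid-gray
        num += w * (x / t)                # radiance estimate from this frame
        den += w
    return num / np.maximum(den, 1e-8)
```

For a constant scene radiance of 0.5 captured at exposure times 1.0 and 0.5, the frames read 0.5 and 0.25, and the merge recovers 0.5 everywhere. Note that real pipelines like this paper's must first solve the alignment problem that this sketch assumes away.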
[CV-93] EditHF-1M: A Million-Scale Rich Human Preference Feedback for Image Editing
Quick read: This paper addresses quality issues in images produced by current text-guided image editing (TIE) models, such as artifacts, unexpected edits, and poor aesthetics, as well as the lack of scalable human-preference evaluation models that bottlenecks the development of human feedback reward models. The key contributions are: EditHF-1M, a million-scale image editing dataset with over 29M human preference pairs and 148K human mean opinion ratings, annotated along three dimensions (visual quality, instruction alignment, and attribute preservation); EditHF, a multimodal large language model (MLLM) based evaluation model built on this dataset that provides human-aligned feedback; and EditHF-Reward, which uses EditHF as the reward signal to optimize TIE models via reinforcement learning, substantially improving editing performance with strong generalization.
Link: https://arxiv.org/abs/2603.14916
Authors: Zitong Xu,Huiyu Duan,Zhongpeng Ji,Xinyun Zhang,Yutao Liu,Xiongkuo Min,Ke Gu,Jian Zhang,Shusong Xu,Jinwei Chen,Bo Li,Guangtao Zhai
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:
Abstract:Recent text-guided image editing (TIE) models have achieved remarkable progress, while many edited images still suffer from issues such as artifacts, unexpected editings, and unaesthetic contents. Although some benchmarks and methods have been proposed for evaluating edited images, scalable evaluation models are still lacking, which limits the development of human feedback reward models for image editing. To address the challenges, we first introduce EditHF-1M, a million-scale image editing dataset with over 29M human preference pairs and 148K human mean opinion ratings, both evaluated from three dimensions, i.e., visual quality, instruction alignment, and attribute preservation. Based on EditHF-1M, we propose EditHF, a multimodal large language model (MLLM) based evaluation model, to provide human-aligned feedback for image editing. Finally, we introduce EditHF-Reward, which utilizes EditHF as the reward signal to optimize the text-guided image editing models through reinforcement learning. Extensive experiments show that EditHF achieves superior alignment with human preferences and demonstrates strong generalization on other datasets. Furthermore, we fine-tune the Qwen-Image-Edit using EditHF-Reward, achieving significant performance improvements, which demonstrates the ability of EditHF to serve as a reward model to scale-up the image editing. Both the dataset and code will be released in our GitHub repository: this https URL.
[CV-94] ILV: Iterative Latent Volumes for Fast and Accurate Sparse-View CT Reconstruction
Quick read: This paper addresses the trade-off between speed and accuracy in sparse-view cone-beam CT (CBCT) reconstruction, i.e., achieving fast, clinically usable, high-quality 3D reconstruction while reducing radiation dose and system cost. The key idea, Iterative Latent Volumes (ILV), is a feed-forward framework that integrates data-driven priors with classical iterative reconstruction principles: it constructs an explicit 3D latent volume that is repeatedly updated by conditioning on multi-view X-ray features and a learned anatomical prior, recovering fine structural details beyond the reach of prior feed-forward models. ILV further incorporates key architectural components, including an X-ray feature volume, group cross-attention, efficient self-attention, and view-wise feature aggregation, to efficiently realize this latent volume refinement. Experiments on a large-scale dataset of about 14,000 CT volumes show clear advantages in both reconstruction quality and speed.
Link: https://arxiv.org/abs/2603.14915
Authors: Seungryong Lee,Woojeong Baek,Joosang Lee,Eunbyung Park
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:A long-term goal in CT imaging is to achieve fast and accurate 3D reconstruction from sparse-view projections, thereby reducing radiation exposure, lowering system cost, and enabling timely imaging in clinical workflows. Recent feed-forward approaches have shown strong potential toward this overarching goal, yet their results still suffer from artifacts and loss of fine details. In this work, we introduce Iterative Latent Volumes (ILV), a feed-forward framework that integrates data-driven priors with classical iterative reconstruction principles to overcome key limitations of prior feed-forward models in sparse-view CBCT reconstruction. At its core, ILV constructs an explicit 3D latent volume that is repeatedly updated by conditioning on multi-view X-ray features and the learned anatomical prior, enabling the recovery of fine structural details beyond the reach of prior feed-forward models. In addition, we develop and incorporate several key architectural components, including an X-ray feature volume, group cross-attention, efficient self-attention, and view-wise feature aggregation, that efficiently realize its core latent volume refinement concept. Extensive experiments on a large-scale dataset of approximately 14,000 CT volumes demonstrate that ILV significantly outperforms existing feed-forward and optimization-based methods in both reconstruction quality and speed. These results show that ILV enables fast and accurate sparse-view CBCT reconstruction suitable for clinical use. The project page is available at: this https URL.
[CV-95] TopoVST: Toward Topology-fidelitous Vessel Skeleton Tracking
Quick read: This paper addresses topology fidelity in automatic vessel skeleton extraction from medical images, where thin vessel structures are especially prone to discontinuities and spurious skeleton segments. The key ideas in TopoVST are: multi-scale sphere graphs that sample the input image, with graph neural networks jointly estimating tracking directions and vessel radii; a gating-based feature fusion strategy that strengthens multi-scale representations, together with a geometry-aware weighting scheme that mitigates class imbalance during training; and a wave-propagation-based skeleton tracking algorithm that suppresses spurious skeletons via space-occupancy filtering. On two vessel datasets with different geometries, TopoVST achieves overlap and topological-consistency metrics that surpass existing methods.
Link: https://arxiv.org/abs/2603.14909
Authors: Yaoyu Liu,Minghui Zhang,Junjun He,Yun Gu
Affiliations: Institute of Medical Robotics, Shanghai Jiao Tong University; School of Automation and Intelligent Sensing, Shanghai Jiao Tong University; Shanghai AI Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 9 figures. Under Review
Abstract:Automatic extraction of vessel skeletons is crucial for many clinical applications. However, achieving topologically faithful delineation of thin vessel skeletons remains highly challenging, primarily due to frequent discontinuities and the presence of spurious skeleton segments. To address these difficulties, we propose TopoVST, a topology-fidelitious vessel skeleton tracker. TopoVST constructs multi-scale sphere graphs to sample the input image and employs graph neural networks to jointly estimate tracking directions and vessel radii. The utilization of multi-scale representations is enhanced through a gating-based feature fusion mechanism, while the issue of class imbalance during training is mitigated by embedding a geometry-aware weighting scheme into the directional loss. In addition, we design a wave-propagation-based skeleton tracking algorithm that explicitly mitigates the generation of spurious skeletons through space-occupancy filtering. We evaluate TopoVST on two vessel datasets with different geometries. Extensive comparisons with state-of-the-art baselines demonstrate that TopoVST achieves competitive performance in both overlapping and topological metrics. Our source code is available at: this https URL.
[CV-96] PerlAD: Towards Enhanced Closed-loop End-to-end Autonomous Driving with Pseudo-simulation-based Reinforcement Learning
Quick read: This paper addresses the poor closed-loop performance of end-to-end autonomous driving policies, which stems from the mismatch between imitation learning's (IL) open-loop training objective and real driving requirements, while conventional reinforcement learning (RL) approaches depend on rendered environments that introduce a rendering gap and are computationally inefficient. The key idea, PerlAD, is a pseudo-simulation-based RL method that builds a pseudo-simulation environment in vector space from offline data, enabling rendering-free, efficient trial-and-error training. A prediction world model generates reactive agent trajectories conditioned on the ego vehicle's plan, bridging the gap between static data and dynamic closed-loop environments, and a hierarchical decoupled planner combines IL for lateral path generation with RL for longitudinal speed optimization, substantially improving driving performance without expensive online interaction.
Link: https://arxiv.org/abs/2603.14908
Authors: Yinfeng Gao,Qichao Zhang,Deqing Liu,Zhongpu Xia,Guang Li,Kun Ma,Guang Chen,Hangjun Ye,Long Chen,Da-Wei Ding,Dongbin Zhao
Affiliations: University of Science and Technology Beijing; Xiaomi EV; Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE RA-L. Submitted: 2025.12.2; Revised: 2026.2.4; Accepted: 2026.3.7
Abstract:End-to-end autonomous driving policies based on Imitation Learning (IL) often struggle in closed-loop execution due to the misalignment between inadequate open-loop training objectives and real driving requirements. While Reinforcement Learning (RL) offers a solution by directly optimizing driving goals via reward signals, the rendering-based training environments introduce the rendering gap and are inefficient due to high computational costs. To overcome these challenges, we present a novel Pseudo-simulation-based RL method for closed-loop end-to-end autonomous driving, PerlAD. Based on offline datasets, PerlAD constructs a pseudo-simulation that operates in vector space, enabling efficient, rendering-free trial-and-error training. To bridge the gap between static datasets and dynamic closed-loop environments, PerlAD introduces a prediction world model that generates reactive agent trajectories conditioned on the ego vehicle’s plan. Furthermore, to facilitate efficient planning, PerlAD utilizes a hierarchical decoupled planner that combines IL for lateral path generation and RL for longitudinal speed optimization. Comprehensive experimental results demonstrate that PerlAD achieves state-of-the-art performance on the Bench2Drive benchmark, surpassing the previous E2E RL method by 10.29% in Driving Score without requiring expensive online interactions. Additional evaluations on the DOS benchmark further confirm its reliability in handling safety-critical occlusion scenarios.
[CV-97] Balancing Saliency and Coverage: Semantic Prominence-Aware Budgeting for Visual Token Compression in VLMs
Quick read: This paper addresses the computational bottleneck that large vision-language models (VLMs) face when processing high-resolution images due to the large number of visual tokens. Existing approaches apply static compression strategies based on saliency, diversity, or a fixed combination of the two, ignoring that the distribution of semantic prominence varies across samples and hence that the optimal trade-off between local saliency preservation and global coverage differs per sample. The key innovation of PromPrune is a semantic prominence-aware budget allocation mechanism with a two-stage selection pipeline, letting the model adaptively balance preservation of locally salient regions against globally diverse coverage according to each sample's semantic prominence distribution. Even at high compression ratios it maintains strong performance: on LLaVA-NeXT-7B it reduces FLOPs by 88% and prefill latency by 22% while retaining 97.5% of the original accuracy.
Link: https://arxiv.org/abs/2603.14892
Authors: Jaehoon Lee,Mingi Jung,Soohyuk Jang,Seungryong Yoo,Dahuin Jung,Sungroh Yoon
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Large Vision-Language Models (VLMs) achieve strong multimodal understanding capabilities by leveraging high-resolution visual inputs, but the resulting large number of visual tokens creates a major computational bottleneck. Recent work mitigates this issue through visual token compression, typically compressing tokens based on saliency, diversity, or a fixed combination of both. We observe that the distribution of semantic prominence varies substantially across samples, leading to different optimal trade-offs between local saliency preservation and global coverage. This observation suggests that applying a static compression strategy across all samples can be suboptimal. Motivated by this insight, we propose PromPrune, a sample-adaptive visual token selection framework composed of semantic prominence-aware budget allocation and a two-stage selection pipeline. Our method adaptively balances local saliency preservation and global coverage according to the semantic prominence distribution of each sample. By allocating token budgets between locally salient regions and globally diverse regions, our method maintains strong performance even under high compression ratios. On LLaVA-NeXT-7B, our approach reduces FLOPs by 88% and prefill latency by 22% while preserving 97.5% of the original accuracy.
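The idea of splitting a token budget between salient picks and coverage picks can be sketched as a simple greedy procedure. The `alpha` ratio, the farthest-point coverage step, and the function name below are illustrative assumptions for this sketch, not PromPrune's actual allocation and selection rules:

```python
import numpy as np

def select_tokens(features, saliency, budget, alpha):
    """Pick `budget` token indices: a fraction `alpha` by saliency rank,
    the rest by farthest-point sampling in feature space for coverage.
    Illustrative sketch only; the real method adapts alpha per sample."""
    n = len(saliency)
    order = np.argsort(-saliency)                # most salient first
    k_sal = int(round(alpha * budget))
    chosen = [int(i) for i in order[:k_sal]]     # saliency-driven picks
    if not chosen:                               # guard: always seed with one token
        chosen = [int(order[0])]
    remaining = [i for i in range(n) if i not in chosen]
    # farthest-point sampling: repeatedly add the token farthest from the set
    while len(chosen) < budget and remaining:
        dists = [min(np.linalg.norm(features[i] - features[j]) for j in chosen)
                 for i in remaining]
        pick = remaining[int(np.argmax(dists))]
        chosen.append(pick)
        remaining.remove(pick)
    return sorted(chosen)
```

With `alpha = 1.0` this degenerates to pure top-k saliency pruning, and with `alpha = 0.0` to pure diversity sampling; the paper's observation is that the best point in between varies per image.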
[CV-98] PASTE: Physics-Aware Scattering Topology Embedding Framework for SAR Object Detection
Quick read: This paper addresses the neglect of electromagnetic scattering physics in current deep-learning-based object detection for synthetic aperture radar (SAR) imagery. Existing methods treat targets as texture patches or rely solely on amplitude-based statistical models, making it hard to exploit scattering topology, while approaches that introduce frequency-domain information suffer from high computational cost and poor generalization. The key idea, the Physics-Aware Scattering Topology Embedding framework (PASTE), builds a closed loop from topology generation through injection to joint supervision, systematically integrating scattering priors: a scalable, physically consistent scattering keypoint generation and automatic annotation scheme based on the Attributed Scattering Center (ASC) model; a scattering topology injection module that guides multi-scale feature learning; and a scattering prior supervision strategy that constrains network optimization by aligning predictions with scattering center distributions. Experiments show PASTE is compatible with various detectors, delivering relative mAP gains of 2.9%-11.3% over baselines on real datasets while offering strong interpretability.
Link: https://arxiv.org/abs/2603.14886
Authors: Jiacheng Chen,Yuxuan Xiong,Haipeng Wang
Affiliations: Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Current deep learning-based object detection for Synthetic Aperture Radar (SAR) imagery mainly adopts optical image methods, treating targets as texture patches while ignoring inherent electromagnetic scattering mechanisms. Though scattering points have been studied to boost detection performance, most methods still rely on amplitude-based statistical models. Some approaches introduce frequency-domain information for scattering center extraction, but they suffer from high computation cost and poor compatibility with diverse datasets. Thus, effectively embedding scattering topological information into modern detection frameworks remains challenging. To solve these problems, this paper proposes the Physics-Aware Scattering Topology Embedding Framework (PASTE), a novel closed-loop architecture for comprehensive scattering prior integration. By building the full pipeline from topology generation, injection to joint supervision, PASTE elegantly integrates scattering physics into modern SAR detectors. Specifically, it designs a scattering keypoint generation and automatic annotation scheme based on the Attributed Scattering Center (ASC) model to produce scalable and physically consistent priors. A scattering topology injection module guides multi-scale feature learning, and a scattering prior supervision strategy constrains network optimization by aligning predictions with scattering center distributions. Experiments on real datasets show that PASTE is compatible with various detectors and brings relative mAP gains of 2.9% to 11.3% over baselines with acceptable computation overhead. Visualization of scattering maps verifies that PASTE successfully embeds scattering topological priors into feature space, clearly distinguishing target and background scattering regions, thus providing strong interpretability for results.
[CV-99] SpiralDiff: Spiral Diffusion with LoRA for RGB-to-RAW Conversion Across Cameras CVPR2026
Quick read: This paper addresses two key problems in RGB-to-RAW conversion: the varying reconstruction difficulty across pixel intensities, and the need to adapt to camera-specific ISP (image signal processing) characteristics in multi-camera settings. The key ideas in the SpiralDiff framework are a signal-dependent noise weighting strategy that adapts reconstruction fidelity across intensity levels, and CamLoRA, a lightweight camera-aware adaptation module that lets a unified model flexibly adapt to different cameras' ISP characteristics, improving both conversion quality and downstream tasks such as RAW-based object detection.
Link: https://arxiv.org/abs/2603.14885
Authors: Huanjing Yue,Shangbin Xie,Cong Cao,Qian Wu,Lei Zhang,Lei Zhao,Jingyu Yang
Affiliations: Tianjin University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026
Abstract:RAW images preserve superior fidelity and rich scene information compared to RGB, making them essential for tasks in challenging imaging conditions. To alleviate the high cost of data collection, recent RGB-to-RAW conversion methods aim to synthesize RAW images from RGB. However, they overlook two key challenges: (i) the reconstruction difficulty varies with pixel intensity, and (ii) multi-camera conversion requires camera-specific adaptation. To address these issues, we propose SpiralDiff, a diffusion-based framework tailored for RGB-to-RAW conversion with a signal-dependent noise weighting strategy that adapts reconstruction fidelity across intensity levels. In addition, we introduce CamLoRA, a camera-aware lightweight adaptation module that enables a unified model to adapt to different camera-specific ISP characteristics. Extensive experiments on four benchmark datasets demonstrate the superiority of SpiralDiff in RGB-to-RAW conversion quality and its downstream benefits in RAW-based object detection. Our code and model are available at this https URL.
[CV-100] LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models CVPR2026
Quick read: This paper addresses the spatial redundancy in how vision-language models (VLMs) process images: VLMs apply uniform spatial fidelity across the entire input region, ignoring the adaptive, selective, and resource-efficient nature of human vision. The key innovations of the proposed training-free LLMind (Looking Like the Mind) framework are a Bio-inspired Adaptive Sampling Strategy (BASS) with a Mobius-parameterized module that performs non-uniform sampling while preserving global scene structure, and test-time closed-loop semantic feedback (CSF) that aligns perceptual saliency with textual information from the frozen VLM. This substantially improves VLM performance under tight pixel budgets, requires no architectural changes to existing models, and remains lightweight and plug-and-play.
Link: https://arxiv.org/abs/2603.14882
Authors: Soumyaratna Debnath,Bui Duc Manh,Zinan Liu,Lin Wang
Affiliations: Nanyang Technological University, Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026, 10 pages, 7 figures, 3 tables
Abstract:Vision-Language Models (VLMs) typically assume a uniform spatial fidelity across the entire field of view of visual inputs, dedicating equal precision to even the uninformative regions. By contrast, human vision is neither uniform nor static; it is adaptive, selective, and resource-efficient. In light of this, we present the first systematic analysis of bio-inspired visual representation methods, providing insights for more efficient and adaptive VLMs. We propose LLMind (Looking Like the Mind), a novel training-free framework that mimics foveated encoding and cortical magnification in human vision to achieve adaptive, efficient representations for VLMs under tight pixel budgets. Our key idea is to explore a Bio-inspired Adaptive Sampling Strategy (BASS), enabling a Mobius-parameterized module that performs non-uniform sampling while preserving global scene structure. On top of BASS, we introduce closed-loop semantic feedback (CSF) via test-time adaptation to align perceptual saliency with textual information from the frozen VLM. We evaluate LLMind against uniform and other sampling baselines across diverse scene-level and region-guided visual question answering benchmarks. The results show dramatic gains, with average improvements of +20% on VQAv2, +38% on Seed-Bench, and +37% on A-OKVQA compared to uniform sampling under tight pixel budgets. More surprisingly, LLMind retains up to 82%, 92%, and 97% of the full-resolution performance using only 1%, 3%, and 5% of the pixels, respectively. Moreover, LLMind is lightweight, plug-and-play, and compatible with existing VLMs without requiring architectural changes.
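A toy stand-in for foveated, non-uniform sampling (density highest at a fixation point, sparser toward the periphery) is sketched below. The concentric-ring, power-law construction is purely an assumption of this illustration; it is not the Mobius-parameterized BASS module the paper proposes:

```python
import numpy as np

def foveated_grid(num_rings, points_per_ring, fovea=(0.5, 0.5), gamma=2.0):
    """Build sample points in normalized [0, 1]^2 coordinates on rings
    whose radii grow as (r / num_rings)**gamma, so rings bunch up near
    the fixation point. Toy illustration of foveated encoding only."""
    pts = [np.array(fovea, dtype=np.float64)]
    for r in range(1, num_rings + 1):
        radius = 0.5 * (r / num_rings) ** gamma
        for k in range(points_per_ring):
            theta = 2 * np.pi * k / points_per_ring
            p = np.array(fovea) + radius * np.array([np.cos(theta), np.sin(theta)])
            pts.append(np.clip(p, 0.0, 1.0))
    return np.stack(pts)
```

Because inner rings sit much closer to the fovea than outer ones, a fixed sample budget spends most of its points where the model (or the CSF feedback, in the paper's case) believes the informative content is.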
[CV-101] RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation CVPR2026
Quick read: This paper addresses the limitation that existing visual-language grounding (VLG) methods only achieve coarse object-level localization, while traditional robotic grasping methods rely mainly on geometric cues without language guidance, restricting their use in language-driven manipulation. The key contribution is the RealVLG framework, comprising the RealVLG-11B dataset and the RealVLG-R1 model. RealVLG-11B provides multi-granularity annotations (bounding boxes, segmentation masks, grasp poses, contact points, and human-verified fine-grained language descriptions) covering about 165K images, over 800 object instances, and more than 11 billion grasping examples. RealVLG-R1 applies reinforcement fine-tuning to a pretrained large-scale vision-language model to jointly predict bounding boxes, segmentation masks, grasp poses, and contact points from natural language instructions, enabling zero-shot perception and manipulation in unseen real-world environments and establishing a unified semantic-visual multimodal benchmark.
Link: https://arxiv.org/abs/2603.14880
Authors: Linfei Li,Lin Zhang,Ying Shen
Affiliations: Tongji University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026
Abstract:Visual-language grounding aims to establish semantic correspondences between natural language and visual entities, enabling models to accurately identify and localize target objects based on textual instructions. Existing VLG approaches focus on coarse-grained, object-level localization, while traditional robotic grasping methods rely predominantly on geometric cues and lack language guidance, which limits their applicability in language-driven manipulation scenarios. To address these limitations, we propose the RealVLG framework, which integrates the RealVLG-11B dataset and the RealVLG-R1 model to unify real-world visual-language grounding and grasping tasks. RealVLG-11B dataset provides multi-granularity annotations including bounding boxes, segmentation masks, grasp poses, contact points, and human-verified fine-grained language descriptions, covering approximately 165,000 images, over 800 object instances, 1.3 million segmentation, detection, and language annotations, and roughly 11 billion grasping examples. Building on this dataset, RealVLG-R1 employs Reinforcement Fine-tuning on pretrained large-scale vision-language models to predict bounding boxes, segmentation masks, grasp poses, and contact points in a unified manner given natural language instructions. Experimental results demonstrate that RealVLG supports zero-shot perception and manipulation in real-world unseen environments, establishing a unified semantic-visual multimodal benchmark that provides a comprehensive data and evaluation platform for language-driven robotic perception and grasping policy learning. All data and code are publicly available at this https URL.
[CV-102] Video Detector: A Dual-Phase Vision-Based System for Real-Time Traffic Intersection Control and Intelligent Transportation Analysis
Quick read: This paper addresses urban traffic management's limited adaptability to dynamic traffic conditions and its dependence on costly infrastructure changes, proposing Video Detector (VD), a vision-based traffic intersection management system that serves as a cost-effective alternative to traditional inductive loop detectors. The key design is a dual-phase framework: a real-time module (VD-RT) for intersection control and an offline analytical module (VD-Offline) for detailed traffic behavior analysis. By integrating three object detection models (SSD Inception v2, Faster R-CNN Inception v2, and CenterNet ResNet-50 V1 FPN) trained on a multi-class vehicle dataset of 108,000 annotated images, the system reaches up to 90% test accuracy and 29.5 mAP@0.5 while sustaining 37 FPS real-time throughput, and supports virtual loop detection, vehicle counting, multi-object tracking, queue estimation, speed analysis, and multi-class vehicle classification, enabling comprehensive intersection monitoring without embedded road sensors.
Link: https://arxiv.org/abs/2603.14861
Authors: Mustafa Fatih Şen,Halûk Gümüşkaya,Şenol Pazar
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 10 figures, 4 tables, preprint, the dataset is openly available
Abstract:Urban traffic management increasingly requires intelligent sensing systems capable of adapting to dynamic traffic conditions without costly infrastructure modifications. Vision-based vehicle detection has therefore become a key technology for modern intelligent transportation systems. This study presents Video Detector (VD), a dual-phase vision-based traffic intersection management system designed as a flexible and cost-effective alternative to traditional inductive loop detectors. The framework integrates a real-time module (VD-RT) for intersection control with an offline analytical module (VD-Offline) for detailed traffic behavior analysis. Three system configurations were implemented using SSD Inception v2, Faster R-CNN Inception v2, and CenterNet ResNet-50 V1 FPN, trained on datasets totaling 108,000 annotated images across 6-10 vehicle classes. Experimental results show detection performance of up to 90% test accuracy and 29.5 mAP@0.5, while maintaining real-time throughput of 37 FPS on HD video streams. Field deployments conducted in collaboration with Istanbul IT and Smart City Technologies Inc. (ISBAK) demonstrate stable operation under diverse environmental conditions. The system supports virtual loop detection, vehicle counting, multi-object tracking, queue estimation, speed analysis, and multiclass vehicle classification, enabling comprehensive intersection monitoring without the need for embedded road sensors. The annotated dataset and training pipeline are publicly released to support reproducibility. These results indicate that the proposed framework provides a scalable and deployable vision-based solution for intelligent transportation systems and smart-city traffic management.
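The virtual-loop counting the abstract mentions can be sketched as detecting when a tracked vehicle centroid crosses a horizontal detection line. This minimal stand-in (a hypothetical helper, not the paper's implementation) ignores direction, vehicle class, and the debouncing a deployed system would need:

```python
def count_line_crossings(tracks, line_y):
    """Count vehicles whose tracked centroid crosses the horizontal virtual
    line y = line_y, in either direction. `tracks` is a list of tracks,
    each a list of (x, y) centroid positions over time. A crossing is a
    sign change of (y - line_y) between consecutive frames; a point landing
    exactly on the line is not counted in this sketch."""
    count = 0
    for track in tracks:
        for (x0, y0), (x1, y1) in zip(track, track[1:]):
            if (y0 - line_y) * (y1 - line_y) < 0:
                count += 1
                break  # count each vehicle at most once
    return count
```

In practice each virtual loop would be paired with a multi-object tracker (as in VD-RT) so that one physical vehicle corresponds to one track, and per-class counters would replace the single total.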
[CV-103] From Horizontal to Rotated: Cross-View Object Geo-Localization with Orientation Awareness
[Quick Read]: This paper targets the limited accuracy of detection-based methods for cross-view object geo-localization (CVOGL), whose core challenges are the poor geometric fit of Horizontal Bounding Boxes (HBoxes) to oriented objects and the precision loss caused by feature map scaling. The key is to introduce Rotated Bounding Boxes (RBoxes) for a much tighter geometric fit, together with OSGeo, a novel framework equipped with a multi-scale perception module and an orientation-sensitive head to accurately regress RBoxes. The approach keeps annotation costs over an order of magnitude lower than segmentation-based methods while matching or even surpassing their accuracy, and clearly outperforms prior detection-based methods.
Link: https://arxiv.org/abs/2603.14856
Authors: Chenlin Fu, Ao Gong, Yingying Zhu
Affiliations: Shenzhen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Cross-View object geo-localization (CVOGL) aims to precisely determine the geographic coordinates of a query object from a ground or drone perspective by referencing a satellite map. Segmentation-based approaches offer high precision but require prohibitively expensive pixel-level annotations, whereas more economical detection-based methods suffer from lower accuracy. This performance disparity in detection is primarily caused by two factors: the poor geometric fit of Horizontal Bounding Boxes (HBoxes) for oriented objects and the degradation in precision due to feature map scaling. Motivated by these, we propose leveraging Rotated Bounding Boxes (RBoxes) as a natural extension of the detection-based paradigm. RBoxes provide a much tighter geometric fit to oriented objects. Building on this, we introduce OSGeo, a novel geo-localization framework, meticulously designed with a multi-scale perception module and an orientation-sensitive head to accurately regress RBoxes. To support this scheme, we also construct and release CVOGL-R, the first dataset with precise RBox annotations for CVOGL. Extensive experiments demonstrate that our OSGeo achieves state-of-the-art performance, consistently matching or even surpassing the accuracy of leading segmentation-based methods but with an annotation cost that is over an order of magnitude lower.
[CV-104] AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving
[Quick Read]: This paper tackles three challenges in integrating vision-language models (VLMs) into existing end-to-end (E2E) autonomous driving systems: the distribution misalignment between reasoning and action spaces, the underuse of pretrained VLMs' general reasoning capabilities, and the substantial inference latency introduced during action policy generation, which degrades driving performance. The key is a unified vision-language-action (VLA) framework, AutoMoT, whose core innovation is a Mixture-of-Transformer (MoT) architecture with joint attention sharing: it preserves the general reasoning ability of the pretrained VLM while enabling efficient "fast-slow" inference through asynchronous execution at different task frequencies, cutting inference latency without sacrificing scene understanding accuracy and improving overall driving performance.
Link: https://arxiv.org/abs/2603.14851
Authors: Wenhui Huang, Songyan Zhang, Qihang Huang, Zhidong Wang, Zhiqi Mao, Collister Chua, Zhan Chen, Long Chen, Chen Lv
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract: Integrating vision-language models (VLMs) into end-to-end (E2E) autonomous driving (AD) systems has shown promise in improving scene understanding. However, existing integration strategies suffer from several limitations: they either struggle to resolve distribution misalignment between reasoning and action spaces, underexploit the general reasoning capabilities of pretrained VLMs, or incur substantial inference latency during action policy generation, which degrades driving performance. To address these challenges, we propose AutoMoT in this work, an end-to-end AD framework that unifies reasoning and action generation within a single vision-language-action (VLA) model. Our approach leverages a mixture-of-transformer (MoT) architecture with joint attention sharing, which preserves the general reasoning capabilities of pre-trained VLMs while enabling efficient fast-slow inference through asynchronous execution at different task frequencies. Extensive experiments on multiple benchmarks, under both open- and closed-loop settings, demonstrate that AutoMoT achieves competitive performance compared to state-of-the-art methods. We further investigate the functional boundary of pre-trained VLMs in AD, examining when AD-tailored fine-tuning is necessary. Our results show that pre-trained VLMs can achieve competitive multi-task scene understanding performance through semantic prompting alone, while fine-tuning remains essential for action-level tasks such as decision-making and trajectory planning. We refer to the project page (this https URL) for the demonstration videos and qualitative results.
[CV-105] From Artefact to Insight: Efficient Low-Rank Adaptation of BrushNet for Scanning Probe Microscopy Image Restoration
[Quick Read]: This paper addresses structured artefacts common in Scanning Probe Microscopy (SPM) images, such as line-scan dropout, gain-induced noise, tip convolution, and phase hops, which severely degrade the quality and interpretability of nanoscale imaging. Whereas most existing methods treat artefact removal as isolated denoising or interpolation tasks, this work proposes a diffusion-model-based generative inpainting framework. Its key idea is lightweight fine-tuning via Low-Rank Adaptation (LoRA): adapting fewer than 0.2% of BrushNet's weights using 7,390 artefact-clean image pairs distilled from 739 experimental scans. On the public SPM InpBench benchmark, the method lifts PSNR by 6.61 dB and halves LPIPS, matches or slightly surpasses full retraining, is trainable on a single GPU, and generalizes well in practice.
Link: https://arxiv.org/abs/2603.14850
Authors: Ziwei Wei, Yao Shen, Wanheng Lu, Ghim Wei Ho, Kaiyang Zeng
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Mesoscale and Nanoscale Physics (cond-mat.mes-hall)
Comments: 37 pages, 7 figures, 7 tables, journal paper
Abstract: Scanning Probe Microscopy or SPM offers nanoscale resolution but is frequently marred by structured artefacts such as line scan dropout, gain induced noise, tip convolution, and phase hops. While most available methods treat SPM artefact removal as isolated denoising or interpolation tasks, the generative inpainting perspective remains largely unexplored. In this work, we introduce a diffusion based inpainting framework tailored to scientific grayscale imagery. By fine tuning less than 0.2 percent of BrushNet weights with rank constrained low rank adaptation (LoRA), we adapt a pretrained diffusion model using only 7,390 artefact-clean pairs distilled from 739 experimental scans. On our forthcoming public SPM InpBench benchmark, the LoRA enhanced model lifts the Peak Signal to Noise Ratio or PSNR by 6.61 dB and halves the Learned Perceptual Image Patch Similarity or LPIPS relative to zero-shot inference, while matching or slightly surpassing the accuracy of full retraining, trainable on a single GPU instead of four high-memory cards. The approach generalizes across various SPM image channels including height, amplitude and phase, faithfully restores subtle structural details, and suppresses hallucination artefacts inherited from natural image priors. This lightweight framework enables efficient, scalable recovery of irreplaceable SPM images and paves the way for a broader diffusion model adoption in nanoscopic imaging analysis.
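The rank-constrained LoRA adaptation at the heart of this work can be illustrated on a single linear layer. The shapes, seed, and scaling convention below are toy assumptions for demonstration (the paper applies LoRA inside BrushNet, not to a standalone layer like this):

```python
import numpy as np

# Sketch of rank-constrained LoRA on one linear layer (illustrative only).
rng = np.random.default_rng(0)

d_out, d_in, rank = 8, 8, 2
W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, rank))                 # trainable up-projection, zero-init
alpha = 4.0                                 # assumed LoRA scaling factor

def lora_forward(x):
    # Frozen path plus low-rank update: (W + (alpha/rank) * B @ A) @ x
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B = 0 the adapted layer exactly matches the frozen layer, so
# training starts from the pretrained model's behaviour.
print(np.allclose(lora_forward(x), W @ x))  # True

# Trainable fraction: rank*(d_in + d_out) vs d_in*d_out full parameters.
print(rank * (d_in + d_out) / (d_in * d_out))  # 0.5 at this toy size
```

At realistic sizes (rank 4-16 against thousands-wide projections), this fraction drops well below 1%, which is how the paper stays under 0.2% of BrushNet's weights.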
[CV-106] Personalized Federated Learning with Residual Fisher Information for Medical Image Segmentation
[Quick Read]: This paper addresses the performance degradation caused by data heterogeneity across clients in federated learning, and in particular how to achieve client-adaptive personalization. The key is a new framework, pFL-ResFIM, which introduces a Residual Fisher Information Matrix (ResFIM) to quantify how sensitive model parameters are to domain discrepancies, and employs a spectral transfer strategy to generate simulated data under privacy constraints for estimating each client's ResFIM. Model parameters are then partitioned into domain-sensitive and domain-invariant components, and only the domain-invariant parameters are aggregated on the server to build each client's personalized model, improving per-client performance while preserving privacy.
Link: https://arxiv.org/abs/2603.14848
Authors: Meilu Zhu, Yuxing Li, Zhiwei Wang, Edmund Y. Lam
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by ISBI 2026
Abstract:Federated learning enables multiple clients (institutions) to collaboratively train machine learning models without sharing their private data. To address the challenge of data heterogeneity across clients, personalized federated learning (pFL) aims to learn customized models for each client. In this work, we propose pFL-ResFIM, a novel pFL framework that achieves client-adaptive personalization at the parameter level. Specifically, we introduce a new metric, Residual Fisher Information Matrix (ResFIM), to quantify the sensitivity of model parameters to domain discrepancies. To estimate ResFIM for each client model under privacy constraints, we employ a spectral transfer strategy that generates simulated data reflecting the domain styles of different clients. Based on the estimated ResFIM, we partition model parameters into domain-sensitive and domain-invariant components. A personalized model for each client is then constructed by aggregating only the domain-invariant parameters on the server. Extensive experiments on public datasets demonstrate that pFL-ResFIM consistently outperforms state-of-the-art methods, validating its effectiveness.
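The partition-then-aggregate step can be sketched as follows. The sensitivity scores, threshold, and shared mask are toy assumptions; the real method derives per-client ResFIM estimates from spectrally transferred data rather than the fixed numbers used here:

```python
import numpy as np

# Illustrative sketch of ResFIM-style personalization: parameters with low
# sensitivity to domain shift are averaged on the server, while
# high-sensitivity parameters stay local to each client.

def split_mask(sensitivity, threshold):
    """True = domain-invariant (safe to aggregate across clients)."""
    return sensitivity < threshold

def personalize(client_params, mask):
    stacked = np.stack(client_params)
    avg = stacked.mean(axis=0)
    # Aggregate invariant entries, keep each client's sensitive entries.
    return [np.where(mask, avg, p) for p in client_params]

sens = np.array([0.1, 0.9, 0.2, 0.8])       # toy per-parameter sensitivity
mask = split_mask(sens, 0.5)                 # [True, False, True, False]
p1 = np.array([1.0, 10.0, 1.0, 10.0])        # client 1's parameters
p2 = np.array([3.0, 20.0, 3.0, 20.0])        # client 2's parameters
out = personalize([p1, p2], mask)
print(out[0])  # invariant entries averaged to 2.0, sensitive ones kept
```

In the paper the "parameters" are full network tensors and the Fisher-based scores are estimated per client, but the aggregation logic is the same element-wise masking idea.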
[CV-107] DamageArbiter: A CLIP-Enhanced Multimodal Arbitration Framework for Hurricane Damage Assessment from Street-View Imagery
[Quick Read]: This paper addresses the limited interpretability and reliability of conventional computer vision models for disaster damage assessment from street-view imagery, which hinders their broad adoption in emergency response. The core challenge is improving accuracy, robustness, and transparency, particularly avoiding overconfident wrong predictions when visual cues are ambiguous or disturbed. The key is DamageArbiter, a multimodal disagreement-driven arbitration framework powered by Contrastive Language-Image Pre-training (CLIP) models: a lightweight logistic regression meta-classifier arbitrates disagreements between unimodal (image/text) and multimodal predictions, combining their complementary strengths to deliver more accurate and reliable damage estimation at little extra computational cost.
Link: https://arxiv.org/abs/2603.14837
Authors: Yifan Yang, Lei Zou, Wenjing Gong, Kani Fu, Zongrong Li, Siqin Wang, Bing Zhou, Heng Cai, Hao Tian
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Analyzing street-view imagery with computer vision models for rapid, hyperlocal damage assessment is becoming popular and valuable in emergency response and recovery, but traditional models often act like black boxes, lacking interpretability and reliability. This study proposes a multimodal disagreement-driven Arbitration framework powered by Contrastive Language-Image Pre-training (CLIP) models, DamageArbiter, to improve the accuracy, interpretability, and robustness of damage estimation from street-view imagery. DamageArbiter leverages the complementary strengths of unimodal and multimodal models, employing a lightweight logistic regression meta-classifier to arbitrate cases of disagreement. Using 2,556 post-disaster street-view images, paired with both manually generated and large language model (LLM)-generated text descriptions, we systematically compared the performance of unimodal models (including image-only and text-only models), multimodal CLIP-based models, and DamageArbiter. Notably, DamageArbiter improved the accuracy from 74.33% (ViT-B/32, image-only) to 82.79%, surpassing the 80% accuracy threshold and achieving an absolute improvement of 8.46% compared to the strongest baseline model. Beyond improvements in overall accuracy, compared to visual models relying solely on images, DamageArbiter, through arbitration of discrepancies between unimodal and multimodal predictions, mitigates common overconfidence errors in visual models, especially in situations where disaster visual cues are ambiguous or subject to interference, reducing overconfidence but incorrect predictions. We further mapped and analyzed geo-referenced predictions and misclassifications to compare model performance across locations. Overall, this work advances street-view-based disaster assessment from coarse severity classification toward a more reliable and interpretable framework.
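The disagreement-driven arbitration can be sketched as below. The feature design and logistic weights are made-up placeholders; the real meta-classifier is a logistic regression trained on held-out disagreement cases:

```python
import math

# Sketch of disagreement-driven arbitration (illustrative; weights are
# hypothetical, not the trained DamageArbiter meta-classifier).

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def arbitrate(uni_pred, uni_conf, multi_pred, multi_conf,
              w=(2.0, -2.0), b=0.0):
    """Return the final damage label. On agreement, pass the shared
    prediction through; on disagreement, a logistic meta-classifier
    decides which branch to trust from their confidences."""
    if uni_pred == multi_pred:
        return uni_pred
    p_multi = sigmoid(w[0] * multi_conf + w[1] * uni_conf + b)
    return multi_pred if p_multi >= 0.5 else uni_pred

print(arbitrate("minor", 0.9, "minor", 0.6))   # minor (agreement)
print(arbitrate("minor", 0.9, "severe", 0.4))  # minor (meta trusts unimodal)
```

Only disagreement cases ever reach the meta-classifier, which is why the arbitration adds negligible cost on top of the base models.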
[CV-108] Halfway to 3D: Ensembling 2.5D and 3D Models for Robust COVID-19 CT Diagnosis
[Quick Read]: This paper targets COVID-19 detection and multi-class lung disease classification, in particular the challenge of robust analysis of chest CT scans under distribution shifts across medical sources. The key is a deep learning framework that fuses 2.5D and 3D representations: the 2.5D branch processes multi-view CT slices (axial, coronal, sagittal) with a DINOv3 vision transformer to extract discriminative slice-level features, while the 3D branch models volumetric context with a ResNet-18, pretrained with Variance Risk Extrapolation (VREx) followed by supervised contrastive learning to improve cross-source generalization. Predictions from both branches are combined through a logit-level ensemble, yielding more accurate and stable diagnostic performance.
Link: https://arxiv.org/abs/2603.14832
Authors: Tuan-Anh Yang, Bao V. Q. Bui, Chanh-Quang Vo-Van, Truong-Son Hy
Affiliations: VNUHCM University of Science, Vietnam National University, Vietnam; Ho Chi Minh University of Technology, Vietnam National University, Vietnam; The University of Alabama at Birmingham, United States
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:We propose a deep learning framework for COVID-19 detection and disease classification from chest CT scans that integrates both 2.5D and 3D representations to capture complementary slice-level and volumetric information. The 2.5D branch processes multi-view CT slices (axial, coronal, sagittal) using a DINOv3 vision transformer to extract robust visual features, while the 3D branch employs a ResNet-18 architecture to model volumetric context and is pretrained with Variance Risk Extrapolation (VREx) followed by supervised contrastive learning to improve cross-source robustness. Predictions from both branches are combined through logit-level ensemble inference. Experiments on the PHAROS-AIF-MIH benchmark demonstrate the effectiveness of the proposed approach: for binary COVID-19 detection, the ensemble achieves 94.48% accuracy and a 0.9426 Macro F1-score, outperforming both individual models, while for multi-class disease classification the 2.5D DINOv3 model achieves the best performance with 79.35% accuracy and a 0.7497 Macro F1-score. These results highlight the benefit of combining pretrained slice-based representations with volumetric modeling for robust multi-source medical imaging analysis. Code is available at this https URL
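A logit-level ensemble of the kind described is simply an average of the two branches' pre-softmax scores before taking the argmax. The logits below are invented for illustration, and equal branch weighting is assumed:

```python
# Sketch of the logit-level ensemble combining the 2.5D and 3D branches
# (illustrative values; equal weighting assumed).

def ensemble_predict(logits_25d, logits_3d, classes):
    fused = [(a + b) / 2.0 for a, b in zip(logits_25d, logits_3d)]
    return classes[max(range(len(fused)), key=fused.__getitem__)]

classes = ["non-COVID", "COVID"]
# 2.5D branch leans non-COVID, 3D branch strongly favours COVID;
# the fused logits [0.65, 1.15] resolve to COVID.
print(ensemble_predict([1.2, 0.8], [0.1, 1.5], classes))  # COVID
```

Averaging logits rather than hard labels lets a confident branch outvote an uncertain one, which is the usual motivation for this fusion choice.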
[CV-109] SemanticFace: Semantic Facial Action Estimation via Semantic Distillation in Interpretable Space
[Quick Read]: This paper addresses the lack of semantic interpretability in existing methods for estimating facial actions from a single image, which typically predict or fit parameters in compact expression spaces (e.g., PCA-reduced parameters). While such methods can predict or fit parameters, they do not correspond to meaningful muscle movements, falling short of applications such as avatar control and human-computer interaction. The key is SemanticFace, which reformulates coefficient prediction as structured semantic reasoning: it first derives structured semantic supervision from ground-truth ARKit blendshape coefficients, then distills this knowledge into a multimodal large language model via a two-stage semantic distillation paradigm, enabling accurate prediction of interpretable facial action coefficients from images. The method improves both coefficient accuracy and perceptual consistency, generalizes across identities, and is robust to large domain shifts such as cartoon faces.
Link: https://arxiv.org/abs/2603.14827
Authors: Zejian Kang, Kai Zheng, Yuanchen Fei, Wentao Yang, Hongyuan Zou, Xiangru Huang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract: Facial action estimation from a single image is often formulated as predicting or fitting parameters in compact expression spaces, which lack explicit semantic interpretability. However, many practical applications, such as avatar control and human-computer interaction, require interpretable facial actions that correspond to meaningful muscle movements. In this work, we propose SemanticFace, a framework for facial action estimation in the interpretable ARKit blendshape space that reformulates coefficient prediction as structured semantic reasoning. SemanticFace adopts a two-stage semantic distillation paradigm: it first derives structured semantic supervision from ground-truth ARKit coefficients and then distills this knowledge into a multimodal large language model to predict interpretable facial action coefficients from images. Extensive experiments demonstrate that language-aligned semantic supervision improves both coefficient accuracy and perceptual consistency, while enabling strong cross-identity generalization and robustness to large domain shifts, including cartoon faces.
[CV-110] Two Birds One Projection: Harmonizing Safety and Utility in LVLMs via Inference-time Feature Projection
[Quick Read]: This paper addresses the pervasive safety-utility tradeoff in defending Large Vision-Language Models (LVLMs) against jailbreak attacks, where strengthening safety tends to degrade performance on general visual-grounded reasoning tasks. The core insight is the identification of a modality-induced bias direction that is consistent across datasets, arises from suboptimal coupling between the language model backbone and the visual encoder, and is shown to undermine both safety and general reasoning. The proposed Two Birds, One Projection method projects cross-modal features onto the null space of this bias direction at inference time, removing the offending components in a single forward pass and improving safety and utility simultaneously.
Link: https://arxiv.org/abs/2603.14825
Authors: Yewon Han, Yumin Seol, EunGyung Kong, Minsoo Jo, Taesup Kim
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Existing jailbreak defence frameworks for Large Vision-Language Models often suffer from a safety utility tradeoff, where strengthening safety inadvertently degrades performance on general visual-grounded reasoning tasks. In this work, we investigate whether safety and utility are inherently antagonistic objectives. We focus on a modality induced bias direction consistently observed across datasets, which arises from suboptimal coupling between the Large Language Model backbone and visual encoders. We further demonstrate that this direction undermines performance on both tasks. Leveraging this insight, we propose Two Birds, One Projection, an efficient inference time jailbreak defence that projects cross-modal features onto the null space of the identified bias direction to remove the corresponding components. Requiring only a single forward pass, our method effectively breaks the conventional tradeoff, simultaneously improving both safety and utility across diverse benchmarks.
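The null-space projection itself is a one-line linear operation: subtract from each feature its component along the identified bias direction. The direction and feature below are arbitrary toy vectors, not the direction found in the paper:

```python
import numpy as np

# Sketch of the inference-time defence: project a cross-modal feature onto
# the null space of a bias direction v, i.e. apply (I - v v^T / ||v||^2).

def project_out(x, v):
    v_hat = v / np.linalg.norm(v)
    return x - np.dot(x, v_hat) * v_hat

v = np.array([1.0, 1.0, 0.0])   # toy stand-in for the bias direction
x = np.array([2.0, 0.0, 3.0])   # toy feature vector
x_clean = project_out(x, v)
# No component remains along the bias direction after projection.
print(abs(np.dot(x_clean, v)) < 1e-9)  # True
```

Because this is a fixed linear map applied during the forward pass, it adds essentially no latency, which matches the paper's single-forward-pass claim.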
[CV-111] RadarXFormer: Robust Object Detection via Cross-Dimension Fusion of 4D Radar Spectra and Images for Autonomous Driving
[Quick Read]: This paper addresses the degradation of camera- and LiDAR-based perception under adverse weather and lighting, which limits the robustness and large-scale deployment of autonomous driving in intelligent transportation systems. The key is RadarXFormer, which directly exploits raw 4D millimeter-wave (mmWave) radar spectra to build an efficient 3D representation, avoiding the information loss of conventional sparse point clouds, and introduces a cross-dimension (3D-2D) fusion mechanism that fuses multi-scale 3D spherical radar feature cubes with complementary 2D image feature maps, improving detection accuracy and robustness under challenging conditions while maintaining real-time inference.
Link: https://arxiv.org/abs/2603.14822
Authors: Yue Sun, Yeqiang Qian, Zhe Wang, Tianhui Li, Chunxiang Wang, Ming Yang
Affiliations: Shanghai Jiao Tong University; SAIC GM Wuling Automobile Company Co., Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Reliable perception is essential for autonomous driving systems to operate safely under diverse real-world traffic conditions. However, camera- and LiDAR-based perception systems suffer from performance degradation under adverse weather and lighting conditions, limiting their robustness and large-scale deployment in intelligent transportation systems. Radar-vision fusion provides a promising alternative by combining the environmental robustness and cost efficiency of millimeter-wave (mmWave) radar with the rich semantic information captured by cameras. Nevertheless, conventional 3D radar measurements lack height resolution and remain highly sparse, while emerging 4D mmWave radar introduces elevation information but also brings challenges such as signal noise and large data volume. To address these issues, this paper proposes RadarXFormer, a 3D object detection framework that enables efficient cross-modal fusion between 4D radar spectra and RGB images. Instead of relying on sparse radar point clouds, RadarXFormer directly leverages raw radar spectra and constructs an efficient 3D representation that reduces data volume while preserving complete 3D spatial information. The “X” highlights the proposed cross-dimension (3D-2D) fusion mechanism, in which multi-scale 3D spherical radar feature cubes are fused with complementary 2D image feature maps. Experiments on the K-Radar dataset demonstrate improved detection accuracy and robustness under challenging conditions while maintaining real-time inference capability.
[CV-112] RAZOR: Ratio-Aware Layer Editing for Targeted Unlearning in Vision Transformers and Diffusion Models CVPR2026
[Quick Read]: This paper addresses a key problem for transformer-based diffusion models and vision-language models (VLMs): efficiently removing sensitive or undesired information (such as specific identities, styles, or objects) without retraining, for model safety and compliance. The key is RAZOR (Ratio-Aware Zero/One-step Optimized Retentive unlearning), a lightweight, model-agnostic unlearning framework whose core idea is to quantify how much each layer and attention head contributes to forgetting the target data while preserving useful knowledge, identify the most important and editable components, and then edit them jointly with a carefully regularized update rule to achieve precise forgetting with minimal damage to overall performance; the set of edited components grows gradually, avoiding over-editing and preserving general capabilities. Experiments on CLIP, Stable Diffusion, and VLMs show that RAZOR achieves accurate, stable, and efficient forgetting, even under quantization.
Link: https://arxiv.org/abs/2603.14819
Authors: Ravi Ranjan, Utkarsh Grover, Xiaomin Lin, Agoritsa Polyzou
Affiliations: Florida International University; University of South Florida
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 6 figures, 8 tables, accepted to CVPR 2026 and to appear in the Findings Track Proceedings of the IEEE/CVF Conference
Abstract: Transformer based diffusion and vision-language models have achieved remarkable success; yet, efficiently removing undesirable or sensitive information without retraining remains a central challenge for model safety and compliance. We introduce Ratio-Aware Zero/One-step Optimized Retentive unlearning (RAZOR), a lightweight, model-agnostic unlearning framework that generalizes forgetting updates to coordinated multi-layer and multi-head edits within transformer backbones. RAZOR identifies the most important layers and attention heads by measuring how much they contribute to forgetting the target data while preserving useful knowledge. Then, it updates these parts of the model using a carefully regularized rule to avoid harming overall performance. The set of edited components grows gradually, ensuring precise unlearning without over-editing or damaging unrelated capabilities. We evaluate RAZOR on CLIP, Stable Diffusion, and vision-language models (VLMs) using widely adopted unlearning benchmarks covering identity, style, and object erasure tasks. Our results show that RAZOR achieves highly accurate and stable forgetting, even under quantization. This approach offers stronger retention and better efficiency than prior methods. Notably, it also operates significantly faster than conventional techniques. These results demonstrate that RAZOR is a practical and scalable solution for safe, adaptive unlearning in transformer-based vision models.
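The component-selection step can be sketched as a ratio of forgetting contribution to retention importance per component. The scores below are invented placeholders; RAZOR derives them from the model itself, and its actual selection grows the edit set gradually rather than picking a fixed top-k:

```python
# Illustrative sketch of ratio-aware component selection: rank layers or
# heads by (contribution to forgetting the target data) relative to
# (importance for retained knowledge), then edit only the top candidates.

def select_components(forget_scores, retain_scores, k, eps=1e-8):
    ratios = {name: f / (retain_scores[name] + eps)
              for name, f in forget_scores.items()}
    return sorted(ratios, key=ratios.get, reverse=True)[:k]

forget = {"layer0": 0.9, "layer1": 0.2, "layer2": 0.8}  # toy scores
retain = {"layer0": 0.1, "layer1": 0.9, "layer2": 0.9}
print(select_components(forget, retain, k=1))  # ['layer0']
```

Components like "layer0" above (high forgetting contribution, low retention importance) are exactly the safe edit targets such a ratio surfaces.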
[CV-113] M2IR: Proactive All-in-One Image Restoration via Mamba-style Modulation and Mixture-of-Experts
[Quick Read]: This paper addresses a fundamental limitation of Transformer-based all-in-one image restoration models facing multiple degradation types: they are reactive, unable to suppress degradation propagation during encoding, so degraded signals interfere with feature learning and force the decoder to trade artifact removal against detail preservation, increasing model complexity and limiting adaptability. The key is the M2IR framework with two core modules for proactive degradation control: a Mamba-Style Transformer (MST) block that performs pixel-wise selective state modulation to suppress degradations during encoding while preserving structural integrity, and an Adaptive Degradation Expert Collaboration (ADEC) module in which a DA-CLIP-driven router dispatches degradation-specific experts, complemented by a shared expert, to cooperatively eliminate residual degradations. This design shifts the model from passive reaction to active regulation, improving generalization, adaptability, and fine-grained detail recovery.
Link: https://arxiv.org/abs/2603.14816
Authors: Shiwei Wang, Yongzhen Wang, Bingwen Hu, Liyan Zhang, Xiao-Ping Zhang, Mingqiang Wei
Affiliations: Nanjing University of Aeronautics and Astronautics; Anhui University of Technology; Tsinghua University; Taiyuan University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:While Transformer-based architectures have dominated recent advances in all-in-one image restoration, they remain fundamentally reactive: propagating degradations rather than proactively suppressing them. In the absence of explicit suppression mechanisms, degraded signals interfere with feature learning, compelling the decoder to balance artifact removal and detail preservation, thereby increasing model complexity and limiting adaptability. To address these challenges, we propose M2IR, a novel restoration framework that proactively regulates degradation propagation during the encoding stage and efficiently eliminates residual degradations during decoding. Specifically, the Mamba-Style Transformer (MST) block performs pixel-wise selective state modulation to mitigate degradations while preserving structural integrity. In parallel, the Adaptive Degradation Expert Collaboration (ADEC) module utilizes degradation-specific experts guided by a DA-CLIP-driven router and complemented by a shared expert to eliminate residual degradations through targeted and cooperative restoration. By integrating the MST block and ADEC module, M2IR transitions from passive reaction to active degradation control, effectively harnessing learned representations to achieve superior generalization, enhanced adaptability, and refined recovery of fine-grained details across diverse all-in-one image restoration benchmarks. Our source codes are available at this https URL.
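The ADEC-style routing can be sketched as a softmax mixture over degradation-specific experts plus an always-on shared expert. Everything here is a toy scalar stand-in (the real router is driven by DA-CLIP degradation embeddings and the experts are network branches):

```python
import math

# Sketch of expert routing with a shared expert (illustrative; scalar
# "expert outputs" stand in for full feature-map branches).

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def route(router_logits, expert_outputs, shared_output):
    weights = softmax(router_logits)
    mixed = sum(w * o for w, o in zip(weights, expert_outputs))
    # Degradation-specific mixture plus the shared expert's contribution.
    return mixed + shared_output

# Router strongly prefers expert 0 (e.g. a "dehaze" expert for hazy input).
out = route([2.0, 0.0, 0.0], [1.0, 2.0, 3.0], shared_output=0.5)
print(out)
```

The shared expert gives every input a common restoration path, so the router only needs to learn the degradation-specific residual behaviour.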
[CV-114] Ego to World: Collaborative Spatial Reasoning in Embodied Systems via Reinforcement Learning
[Quick Read]: This paper addresses a core challenge of embodied multi-agent systems: understanding the world from distributed, partial viewpoints, i.e., fusing multiple ego-centric views with limited perception to overcome occlusion and ambiguity and achieve global scene understanding. The key is CoRL, a two-stage framework combining Chain-of-Thought supervised fine-tuning with reinforcement learning via Group-Relative Policy Optimization; its core innovation, the Cross-View Spatial Reward (CVSR), provides dense task-aligned feedback by linking reasoning steps to visual evidence, ensures consistent cross-view entity resolution, and guides the model toward correct final predictions.
Link: https://arxiv.org/abs/2603.14811
Authors: Heng Zhou, Li Kang, Yiran Qin, Xiufeng Song, Ao Yu, Zilu Zhang, Haoming Song, Kaixin Xu, Yuchen Fan, Dongzhan Zhou, Xiaohong Liu, Ruimao Zhang, Philip Torr, Lei Bai, Zhenfei Yin
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Understanding the world from distributed, partial viewpoints is a fundamental challenge for embodied multi-agent systems. Each agent perceives the environment through an ego-centric view that is often limited by occlusion and ambiguity. To study this problem, we introduce the Ego-to-World (E2W) benchmark, which evaluates a vision-language model’s ability to fuse heterogeneous viewpoints across three tasks: (i) global counting, (ii) relational location reasoning, and (iii) action-oriented grasping that requires predicting view-specific image coordinates. To address this setting, we propose CoRL, a two-stage framework that combines Chain-of-Thought supervised fine-tuning with reinforcement learning using Group-Relative Policy Optimization. Its core component, the Cross-View Spatial Reward (CVSR), provides dense task-aligned feedback by linking reasoning steps to visual evidence, ensuring coherent cross-view entity resolution, and guiding the model toward correct final predictions. Experiments on E2W show that CoRL consistently surpasses strong proprietary and open-source baselines on both reasoning and perception-grounding metrics, while ablations further confirm the necessity of each CVSR component. Beyond that, CoRL generalizes to external spatial reasoning benchmarks and enables effective real-world multi-robot manipulation with calibrated multi-camera rigs, demonstrating cross-view localization and successful grasp-and-place execution. Together, E2W and CoRL provide a principled foundation for learning world-centric scene understanding from distributed, ego-centric observations, advancing collaborative embodied AI.
[CV-115] HiMemVLN: Enhancing Reliability of Open-Source Zero-Shot Vision-and-Language Navigation with Hierarchical Memory System
[Quick Read]: This paper addresses the large performance gap between open-source and closed-source LLM navigators in LLM-based vision-language navigation (VLN). Through a detailed analysis of the navigation process, the authors identify "Navigation Amnesia": over long navigation horizons, the agent's memory of earlier visual perceptions and spatial information decays, causing navigation failures and widening the gap to closed-source methods. The key is HiMemVLN, which incorporates a Hierarchical Memory System into a multimodal large model to enhance visual perception recall and long-term localization, effectively mitigating the amnesia issue and substantially improving navigation performance, approaching closed-source methods in both simulated and real-world environments.
Link: https://arxiv.org/abs/2603.14807
Authors: Kailin Lyu, Kangyi Wu, Pengna Li, Xiuyu Hu, Qingyi Si, Cui Miao, Ning Yang, Zihang Wang, Long Xiao, Lianyu Hu, Jingyuan Sun, Ce Hao
Affiliations: Institute of Automation, Chinese Academy of Sciences; Zhongguancun Academy; Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University; School of Transportation, Tongji University; JD.com; Institute of National University of Defense Technology; Southeast University; Nanyang Technological University; Huawei Technologies Co., Ltd.; School of Intelligent Science and Technology, Nanjing University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 9 pages, 7 figures
Abstract:LLM-based agents have demonstrated impressive zero-shot performance in vision-language navigation (VLN) tasks. However, most zero-shot methods primarily rely on closed-source LLMs as navigators, which face challenges related to high token costs and potential data leakage risks. Recent efforts have attempted to address this by using open-source LLMs combined with a spatiotemporal CoT framework, but they still fall far short compared to closed-source models. In this work, we identify a critical issue, Navigation Amnesia, through a detailed analysis of the navigation process. This issue leads to navigation failures and amplifies the gap between open-source and closed-source methods. To address this, we propose HiMemVLN, which incorporates a Hierarchical Memory System into a multimodal large model to enhance visual perception recall and long-term localization, mitigating the amnesia issue and improving the agent’s navigation performance. Extensive experiments in both simulated and real-world environments demonstrate that HiMemVLN achieves nearly twice the performance of the open-source state-of-the-art method. The code is available at this https URL.
[CV-116] Global Truncated Loss Minimization for Robust and Threshold-Resilient Geometric Estimation
[Quick Read]: This paper addresses two problems with consensus maximization (CM) in robust geometric estimation: it relies solely on inlier counts and is therefore sensitive to the inlier threshold, and its discrete nature yields loose bounds, making branch-and-bound (BnB) search inefficient. Although truncated losses (TL) exploit residual information more effectively, no prior work has systematically explored BnB-based global minimization of TL or its potential for threshold resilience and search efficiency. The paper proposes GTM, the first unified BnB framework for globally optimal TL minimization, whose key innovation is a hybrid solving design: for an n-dimensional problem, BnB search runs over an (n-1)-dimensional subspace, while the remaining 1D variable is handled by constructing Lipschitz-continuous bounding functions that can be solved efficiently by the classic global Lipschitz solver DIRECT, shrinking the search space and tightening the bounds for greater threshold resilience and computational efficiency.
Link: https://arxiv.org/abs/2603.14796
Authors: Tianyu Huang, Liangzu Peng, Xinyue Zhang, Tongfan Guan, Jinhu Dong, Haoang Li, Laurent Kneip, Yun-Hui Liu
Affiliations: The Hong Kong Centre For Logistics Robotics, The Chinese University of Hong Kong; Center for Innovation in Data Engineering and Science (IDEAS), University of Pennsylvania; Mobile Perception Lab of the School of Information and Technology, ShanghaiTech University; Thrust of Robotics and Autonomous Systems, The Hong Kong University of Science and Technology (Guangzhou)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 19 pages, 10 figures
Abstract:To achieve outlier-robust geometric estimation, robust objective functions are generally employed to mitigate the influence of outliers. The widely used consensus maximization(CM) is highly robust when paired with global branch-and-bound(BnB) search. However, CM relies solely on inlier counts and is sensitive to the inlier threshold. Besides, the discrete nature of CM leads to loose bounds, necessitating extensive BnB iterations and computation cost. Truncated losses(TL), another continuous alternative, leverage residual information more effectively and could potentially overcome these issues. But to our knowledge, no prior work has systematically explored globally minimizing TL with BnB and its potential for enhanced threshold resilience or search efficiency. In this work, we propose GTM, the first unified BnB-based framework for globally-optimal TL loss minimization across diverse geometric problems. GTM involves a hybrid solving design: given an n-dimensional problem, it performs BnB search over an (n-1)-dimensional subspace while the remaining 1D variable is solved by bounding the objective function. Our hybrid design not only reduces the search space, but also enables us to derive Lipschitz-continuous bounding functions that are general, tight, and can be efficiently solved by a classic global Lipschitz solver named DIRECT, which brings further acceleration. We conduct a systematic evaluation on various BnB-based methods for CM and TL on the robust linear regression problem, showing that GTM enjoys remarkable threshold resilience and the highest efficiency compared to baseline methods. Furthermore, we apply GTM on different geometric estimation problems with diverse residual forms. Extensive experiments demonstrate that GTM achieves state-of-the-art outlier-robustness and threshold-resilience while maintaining high efficiency across these estimation tasks.
[CV-117] Face-to-Face: A Video Dataset for Multi-Person Interaction Modeling
[Quick Read]: This paper addresses the difficulty of modeling the reactive tempo of human conversation: traditional audio-visual datasets mostly contain short monologues by isolated speakers and fail to capture the temporal dependencies of two-person interaction. The key contribution is Face-to-Face with Jimmy Fallon (F2F-JF), a 70-hour, 14,000-clip dataset of two-person talk-show exchanges that preserves the sequential dependency between a guest's turn and the host's response; a semi-automatic pipeline (multi-person tracking, speech diarization, and lightweight human verification) extracts temporally aligned host/guest tracks with tight crops and metadata ready for downstream modeling. On this basis, the authors show that conditioning a MultiTalk-style diffusion model on cross-person visual context yields consistent gains in emotional coherence and video quality (Emotion-FID and FVD) when generating host response videos, while preserving lip-sync accuracy, providing an end-to-end blueprint for studying the temporal behavior of dyadic interaction.
Link: https://arxiv.org/abs/2603.14794
Authors: Ernie Chu, Vishal M. Patel
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract: Modeling the reactive tempo of human conversation remains difficult because most audio-visual datasets portray isolated speakers delivering short monologues. We introduce Face-to-Face with Jimmy Fallon (F2F-JF), a 70-hour, 14k-clip dataset of two-person talk-show exchanges that preserves the sequential dependency between a guest turn and the host's response. A semi-automatic pipeline combines multi-person tracking, speech diarization, and lightweight human verification to extract temporally aligned host/guest tracks with tight crops and metadata that are ready for downstream modeling. We showcase the dataset with a reactive, speech-driven digital avatar task in which the host video during [t_1,t_2] is generated from their audio plus the guest's preceding video during [t_0,t_1] . Conditioning a MultiTalk-style diffusion model on this cross-person visual context yields small but consistent Emotion-FID and FVD gains while preserving lip-sync quality relative to an audio-only baseline. The dataset, preprocessing recipe, and baseline together provide an end-to-end blueprint for studying dyadic, sequential behavior, which we expand upon throughout the paper. Dataset and code will be made publicly available.
[CV-118] Mind-of-Director: Multi-modal Agent-Driven Film Previsualization via Collaborative Decision-Making
【速读】:该论文旨在解决传统影视前期可视化(previz)流程中效率低、协作复杂且难以实现高度自动化的问题,特别是在创意构思到初步视觉呈现之间存在显著断层。解决方案的关键在于提出一个基于多模态代理(multi-modal agent-driven)的协同框架——Mind-of-Director,其通过四个协作模块(剧本开发、虚拟场景设计、角色行为控制和摄像机规划)模拟电影制作团队的决策过程,使生成式AI(Generative AI)能够自动完成从文本创意到可交互3D预演序列的端到端转换,并在游戏引擎中实现实时可视化编辑与跨模块同步调整,从而显著提升创作效率与语义一致性。
链接: https://arxiv.org/abs/2603.14790
作者: Shufeng Nan,Mengtian Li,Sixiao Zheng,Yuwei Lu,Han Zhang,Yanwei Fu
机构: Fudan University (复旦大学); Shanghai University (上海大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures
Abstract:We present Mind-of-Director, a multi-modal agent-driven framework for film previz that models the collaborative decision-making process of a film production team. Given a creative idea, Mind-of-Director orchestrates multiple specialized agents to produce previz sequences within the game engine. The framework consists of four cooperative modules: Script Development, where agents draft and refine the screenplay iteratively; Virtual Scene Design, which transforms text into semantically aligned 3D environments; Character Behaviour Control, which determines character blocking and motion; and Camera Planning, which optimizes framing, movement, and composition for cinematic camera effects. A real-time visual editing system built in the game engine further enables interactive inspection and synchronized timeline adjustment across scenes, behaviours, and cameras. Extensive experiments and human evaluations show that Mind-of-Director generates high-quality, semantically grounded previz sequences in approximately 25 minutes per idea, demonstrating the effectiveness of agent collaboration for both automated prototyping and human-in-the-loop filmmaking.
[CV-119] High-Fidelity 3D Facial Avatar Synthesis with Controllable Fine-Grained Expressions
【速读】:该论文旨在解决现有面部表情编辑方法在精细表情控制上的局限性,尤其是2D-based方法缺乏三维建模能力,而3D-based方法虽能生成高质量且视角一致的渲染结果,但在细粒度表情调控方面仍显不足。其解决方案的关键在于提出一种双映射(Dual Mappers)架构,包含Texture Mapper和Emotion Mapper,分别用于优化预训练3D-Aware GAN模型的潜在码(latent code)以实现纹理编辑,以及驱动3DMM模型的表情码(expression code)以进行网格编辑;同时引入基于CLIP的文本引导优化(Text-Guided Optimization)机制,并结合子空间投影(SubSpace Projection)将文本嵌入映射到表情子空间,从而实现对细微表情的精确控制。
链接: https://arxiv.org/abs/2603.14781
作者: Yikang He,Jichao Zhang,Wei Wang,Nicu Sebe,Yao Zhao
机构: Beijing Jiaotong University (北京交通大学); University of Trento (特伦托大学); Ocean University of China (中国海洋大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Facial expression editing methods can be mainly categorized into two types based on their architectures: 2D-based and 3D-based methods. The former lacks 3D face modeling capabilities, making it difficult to edit 3D factors effectively. The latter has demonstrated superior performance in generating high-quality and view-consistent renderings using single-view 2D face images. Although these methods have successfully used animatable models to control facial expressions, they still have limitations in achieving precise control over fine-grained expressions. To address this issue, in this paper, we propose a novel approach by simultaneously refining both the latent code of a pretrained 3D-Aware GAN model for texture editing and the expression code of the driven 3DMM model for mesh editing. Specifically, we introduce a Dual Mappers module, comprising Texture Mapper and Emotion Mapper, to learn the transformations of the given latent code for textures and the expression code for meshes, respectively. To optimize the Dual Mappers, we propose a Text-Guided Optimization method, leveraging a CLIP-based objective function with expression text prompts as targets, while integrating a SubSpace Projection mechanism to project the text embedding to the expression subspace such that we can have more precise control over fine-grained expressions. Extensive experiments and comparative analyses demonstrate the effectiveness and superiority of our proposed method.
[CV-120] Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image CVPR2026
【速读】:该论文旨在解决现有单图像三维人体虚拟形象(3D human avatar)方法依赖刚性关节变换,难以建模真实衣物动态的问题。其核心解决方案是提出一个零样本框架DynaAvatar,通过基于Transformer的前馈架构直接预测运动相关的3D高斯形变(3D Gaussian deformations),无需针对特定主体进行优化;关键创新在于引入静态到动态的知识迁移策略——利用大规模静态捕捉数据预训练的Transformer提供几何与外观先验,并通过轻量级LoRA微调高效适配至动态形变建模,同时设计了基于光流引导的DynaFlow损失函数以提供可靠的衣物动态几何线索,从而显著提升重建质量与泛化能力。
链接: https://arxiv.org/abs/2603.14772
作者: Joohyun Kwon,Geonhee Sim,Gyeongsik Moon
机构: Korea University (高丽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Existing single-image 3D human avatar methods primarily rely on rigid joint transformations, limiting their ability to model realistic cloth dynamics. We present DynaAvatar, a zero-shot framework that reconstructs animatable 3D human avatars with motion-dependent cloth dynamics from a single image. Trained on large-scale multi-person motion datasets, DynaAvatar employs a Transformer-based feed-forward architecture that directly predicts dynamic 3D Gaussian deformations without subject-specific optimization. To overcome the scarcity of dynamic captures, we introduce a static-to-dynamic knowledge transfer strategy: a Transformer pretrained on large-scale static captures provides strong geometric and appearance priors, which are efficiently adapted to motion-dependent deformations through lightweight LoRA fine-tuning on dynamic captures. We further propose the DynaFlow loss, an optical flow-guided objective that provides reliable motion-direction geometric cues for cloth dynamics in rendered space. Finally, we reannotate the missing or noisy SMPL-X fittings in existing dynamic capture datasets, as most public dynamic capture datasets contain incomplete or unreliable fittings that are unsuitable for training high-quality 3D avatar reconstruction models. Experiments demonstrate that DynaAvatar produces visually rich and generalizable animations, outperforming prior methods.
[CV-121] AnyPhoto: Multi-Person Identity Preserving Image Generation with ID Adaptive Modulation on Location Canvas
【速读】:该论文旨在解决多人群体身份保真生成(multi-person identity-preserving generation)中的关键挑战:如何在文本提示驱动下,将多个参考人脸精准绑定至指定位置,同时避免因强身份和布局约束导致的“复制粘贴”捷径(copy-paste shortcuts),从而削弱文本控制能力的问题。解决方案的关键在于提出AnyPhoto框架,其核心创新包括:(i) 基于RoPE对齐的位置画布与位置对齐的token剪枝实现空间定位;(ii) 采用AdaLN风格的身份自适应调制机制,从人脸识别嵌入中注入持久身份信息;(iii) 引入身份隔离注意力机制防止跨身份干扰。训练过程结合条件流匹配与嵌入空间人脸相似性损失,并引入参考人脸替换和位置画布退化策略以抑制捷径学习,显著提升身份保真度与文本可控性。
链接: https://arxiv.org/abs/2603.14770
作者: Longhui Yuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-person identity-preserving generation requires binding multiple reference faces to specified locations under a text prompt. Strong identity/layout conditions often trigger copy-paste shortcuts and weaken prompt-driven controllability. We present AnyPhoto, a diffusion-transformer finetuning framework with (i) a RoPE-aligned location canvas plus location-aligned token pruning for spatial grounding, (ii) AdaLN-style identity-adaptive modulation from face-recognition embeddings for persistent identity injection, and (iii) identity-isolated attention to prevent cross-identity interference. Training combines conditional flow matching with an embedding-space face similarity loss, together with reference-face replacement and location-canvas degradations to discourage shortcuts. On MultiID-Bench, AnyPhoto improves identity similarity while reducing copy-paste tendency, with gains increasing as the number of identities grows. AnyPhoto also supports prompt-driven stylization with accurate placement, showing strong potential for practical applications.
[CV-122] SSR: A Training-Free Approach for Streaming 3D Reconstruction
【速读】:该论文旨在解决流式三维重建(streaming 3D reconstruction)中因长期状态更新导致的几何漂移(geometric drift)问题,尤其是在严格延迟约束下,传统有状态循环模型因误差累积而性能下降。其解决方案的关键在于从Grassmann流形(Grassmannian manifold)视角重新审视潜在持久状态(latent persistent state),将其建模为子空间表示——即在Grassmann流形上演化的一个点;在此框架下,提出一种无需训练、可插拔的Self-expressive Sequence Regularization(SSR)算子,通过在历史状态窗口内计算基于自表达性质(self-expressive property)的解析亲和矩阵,对当前状态更新进行正则化,从而将噪声预测拉回与流形一致的状态轨迹,实现最小开销下的漂移抑制与重建质量提升。
链接: https://arxiv.org/abs/2603.14765
作者: Hui Deng,Yuxin Mao,Yuxin He,Yuchao Dai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages
Abstract:Streaming 3D reconstruction demands long-horizon state updates under strict latency constraints, yet stateful recurrent models often suffer from geometric drift as errors accumulate over time. We revisit this problem from a Grassmannian manifold perspective: the latent persistent state can be viewed as a subspace representation, i.e., a point evolving on a Grassmannian manifold, where temporal coherence implies the state trajectory should remain on (or near) this manifold. Based on this view, we propose Self-expressive Sequence Regularization (SSR), a plug-and-play, training-free operator that enforces Grassmannian sequence regularity during inference. Given a window of historical states, SSR computes an analytical affinity matrix via the self-expressive property and uses it to regularize the current update, effectively pulling noisy predictions back toward the manifold-consistent trajectory with minimal overhead. Experiments on long-sequence benchmarks demonstrate that SSR consistently reduces drift and improves reconstruction quality across multiple streaming 3D reconstruction tasks.
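SSR“用历史状态自表达当前状态、再把噪声预测回拉”的思想可用如下玩具示例体会(仅为示意,并非论文算子的实现;真实持久状态的维度远高于此处的三维向量):

```python
# Toy sketch (our illustration, not the paper's exact SSR operator): pull a
# noisy state back toward the span of recent states via a self-expressive
# least-squares fit. States here are tiny 3-vectors for readability.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_express(history, state):
    # Solve the 2x2 normal equations (H^T H) c = H^T s for a window of
    # two past states, then reconstruct the state from that window.
    h1, h2 = history
    a, b, d = dot(h1, h1), dot(h1, h2), dot(h2, h2)
    r1, r2 = dot(h1, state), dot(h2, state)
    det = a * d - b * b
    c1 = (d * r1 - b * r2) / det
    c2 = (a * r2 - b * r1) / det
    return [c1 * x + c2 * y for x, y in zip(h1, h2)]

def ssr_update(history, state, alpha=0.5):
    # Blend the raw update with its reconstruction from past states.
    recon = self_express(history, state)
    return [alpha * r + (1 - alpha) * s for r, s in zip(recon, state)]

history = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # past states span the xy-plane
noisy = [0.5, 0.5, 0.4]                         # current update drifted off-plane
smoothed = ssr_update(history, noisy)
print(smoothed)   # off-subspace z-component shrinks toward the history
```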
[CV-123] Topology-Preserving Data Augmentation for Ring-Type Polygon Annotations
【速读】:该论文旨在解决几何数据增强(Geometric Data Augmentation)在处理环形区域(ring-type regions)时导致的拓扑结构破坏问题。在建筑平面图分析等结构化领域中,环形区域常以单个循环多边形链表示外边界与内边界之间的连接关系,而传统增强方法中的裁剪操作会移除中间顶点并破坏这种循环连通性,从而影响分割任务的准确性。解决方案的关键在于提出一种保持顺序的多边形增强策略:在掩码空间(mask space)中执行变换后,将存活顶点投影回索引空间(index-space),以恢复相邻关系,确保原始多边形遍历顺序不变且拓扑一致性得以维持,同时计算开销极低。实验表明,该方法能可靠地恢复连接性,在单一和复合增强场景下均实现了接近完美的循环邻接保持率(Cyclic Adjacency Preservation, CAP)。
链接: https://arxiv.org/abs/2603.14764
作者: Sudip Laudari,Sang Hun Baek
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 6 figures
Abstract:Geometric data augmentation is widely used in segmentation pipelines and typically assumes that polygon annotations represent simply connected regions. However, in structured domains such as architectural floorplan analysis, ring-type regions are often encoded as a single cyclic polygon chain connecting outer and inner boundaries. During augmentation, clipping operations may remove intermediate vertices and disrupt this cyclic connectivity, breaking the structural relationship between the boundaries. In this work, we introduce an order-preserving polygon augmentation strategy that performs transformations in mask space and then projects surviving vertices back into index-space to restore adjacency relations. This repair maintains the original traversal order of the polygon and preserves topological consistency with minimal computational overhead. Experiments demonstrate that the approach reliably restores connectivity, achieving near-perfect Cyclic Adjacency Preservation (CAP) across both single and compound augmentations.
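顺序保持的多边形修复思路可用如下极简片段理解(假设性实现,并非论文流水线):裁剪后存活的顶点按原遍历顺序重新连接,从而保持循环邻接关系:

```python
# Minimal sketch (assumed, not the paper's pipeline) of order-preserving
# polygon repair: after an augmentation drops some vertices, survivors are
# re-linked in their original traversal order so the cyclic chain that
# connects outer and inner boundaries stays consistent.

def repair_polygon(polygon, survives):
    # polygon: list of (x, y) in original cyclic traversal order.
    # survives: predicate, e.g. "vertex still inside the crop window".
    kept = [(i, p) for i, p in enumerate(polygon) if survives(p)]
    # Sorting by original index restores adjacency: consecutive survivors
    # become neighbours, and the last wraps around to the first.
    return [p for _, p in sorted(kept)]

def order_preserved(original, repaired):
    # Consecutive survivors must keep their original relative order.
    idx = {p: i for i, p in enumerate(original)}
    pos = [idx[p] for p in repaired]
    return all(pos[i] < pos[i + 1] for i in range(len(pos) - 1))

# Outer square followed by inner square, encoded as one cyclic chain.
ring = [(0, 0), (4, 0), (4, 4), (0, 4), (1, 1), (3, 1), (3, 3), (1, 3)]
crop = lambda p: p[0] <= 3 and p[1] <= 3    # crop removes top-right vertices
fixed = repair_polygon(ring, crop)
print(fixed, order_preserved(ring, fixed))
```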
[CV-124] LiDAR-EVS: Enhance Extrapolated View Synthesis for 3D Gaussian Splatting with Pseudo-LiDAR Supervision
【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)在自动驾驶仿真中对未见行驶轨迹(extrapolated views)进行激光雷达(LiDAR)合成时存在的过拟合与泛化能力差的问题。现有方法通常基于单次遍历的传感器扫描训练,难以适应新的车辆路径。其解决方案的关键在于提出LiDAR-EVS框架,包含两个核心组件:一是通过多帧LiDAR融合、视角变换、遮挡卷绕(occlusion curling)和强度调整构建伪外推视图点云监督信号;二是引入空间约束的丢弃正则化(spatially-constrained dropout regularization),增强模型对真实驾驶场景中多样轨迹变化的鲁棒性。该方法无需外部多遍历数据即可实现可靠LiDAR外推视图模拟,在多个数据集上达到当前最优性能。
链接: https://arxiv.org/abs/2603.14763
作者: Yiming Huang,Xin Kang,Sipeng Zhang,Hongliang Ren,Weihua Zhang,Junjie Lai
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 8 figures
Abstract:3D Gaussian Splatting (3DGS) has emerged as a powerful technique for real-time LiDAR and camera synthesis in autonomous driving simulation. However, simulating LiDAR with 3DGS remains challenging for extrapolated views beyond the training trajectory, as existing methods are typically trained on single-traversal sensor scans, suffer from severe overfitting and poor generalization to novel ego-vehicle paths. To enable reliable simulation of LiDAR along unseen driving trajectories without external multi-pass data, we present LiDAR-EVS, a lightweight framework for robust extrapolated-view LiDAR simulation in autonomous driving. Designed to be plug-and-play, LiDAR-EVS readily extends to diverse LiDAR sensors and neural rendering baselines with minimal modification. Our framework comprises two key components: (1) pseudo extrapolated-view point cloud supervision with multi-frame LiDAR fusion, view transformation, occlusion curling, and intensity adjustment; (2) spatially-constrained dropout regularization that promotes robustness to diverse trajectory variations encountered in real-world driving. Extensive experiments demonstrate that LiDAR-EVS achieves SOTA performance on extrapolated-view LiDAR synthesis across three datasets, making it a promising tool for data-driven simulation, closed-loop evaluation, and synthetic data generation in autonomous driving systems.
[CV-125] Face-Guided Sentiment Boundary Enhancement for Weakly-Supervised Temporal Sentiment Localization
【速读】:该论文旨在解决点级弱监督时序情感定位(Point-level weakly-supervised temporal sentiment localization, P-WTSL)中情感边界不精确的问题,即在仅使用时间戳标注的情感标签下,模型难以准确识别情感变化的起止时刻。解决方案的关键在于提出一种统一框架Face-guided Sentiment Boundary Enhancement Network (FSENet),其核心创新包括:(1) 引入Face-guided Sentiment Discovery (FSD)模块,通过双分支建模将面部特征融入多模态交互,提取更细粒度的情感刺激线索;(2) 设计Point-aware Sentiment Semantics Contrast (PSSC)策略,利用对比学习区分标注点附近候选帧的情感语义差异,增强对情感边界的感知能力;(3) 提出Boundary-aware Sentiment Pseudo-label Generation (BSPG)方法,将稀疏点标注转化为时序平滑的伪标签,提升模型训练稳定性与泛化性能。
链接: https://arxiv.org/abs/2603.14750
作者: Cailing Han,Zhangbin Li,Jinxing Zhou,Wei Qian,Jingjing Hu,Yanghao Zhou,Zhangling Duan,Dan Guo
机构: Hefei University of Technology (合肥工业大学); MBZUAI; NUS (新加坡国立大学); Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (合肥综合性国家科学中心人工智能研究院); The Key Laboratory of Knowledge Engineering with Big Data, Hefei University of Technology (大数据知识工程安徽省重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Point-level weakly-supervised temporal sentiment localization (P-WTSL) aims to detect sentiment-relevant segments in untrimmed multimodal videos using timestamp sentiment annotations, which greatly reduces the costly frame-level labeling. To further tackle the challenges of imprecise sentiment boundaries in P-WTSL, we propose the Face-guided Sentiment Boundary Enhancement Network (FSENet), a unified framework that leverages fine-grained facial features to guide sentiment localization. Specifically, our approach first introduces the Face-guided Sentiment Discovery (FSD) module, which integrates facial features into multimodal interaction via dual-branch modeling for effective sentiment stimuli clues; we then propose the Point-aware Sentiment Semantics Contrast (PSSC) strategy to discriminate sentiment semantics of candidate points (frame-level) near annotation points via contrastive learning, thereby enhancing the model's ability to recognize sentiment boundaries. Finally, we design the Boundary-aware Sentiment Pseudo-label Generation (BSPG) approach to convert sparse point annotations into temporally smooth supervisory pseudo-labels. Extensive experiments and visualizations on the benchmark demonstrate the effectiveness of our framework, achieving state-of-the-art performance under full supervision, video-level, and point-level weak supervision, thereby showcasing the strong generalization ability of our FSENet across different annotation settings.
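PSSC 用对比学习区分标注点附近帧的情感语义,其核心可用一个标准的 InfoNCE 式损失示意(以下是我们的示意性写法,嵌入与温度均为虚构,并非论文损失的具体形式):

```python
# Hedged InfoNCE-style sketch (not the paper's PSSC loss specifics): the
# annotated frame is the anchor, a same-sentiment neighbour is the
# positive, and frames beyond the sentiment boundary are negatives.
import math

def info_nce(anchor, positive, negatives, tau=0.1):
    def sim(u, v):
        nu = math.sqrt(sum(x * x for x in u))
        nv = math.sqrt(sum(x * x for x in v))
        return sum(a * b for a, b in zip(u, v)) / (nu * nv)
    logits = [sim(anchor, positive) / tau] + [sim(anchor, n) / tau for n in negatives]
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))

anchor = [1.0, 0.0]                   # embedding of the annotated frame
close = [0.9, 0.1]                    # nearby frame sharing its sentiment
far = [[0.0, 1.0], [-1.0, 0.2]]       # frames past the sentiment boundary
loss = info_nce(anchor, close, far)
print(loss)   # small: the positive already sits near the anchor
```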
[CV-126] PHAC: Promptable Human Amodal Completion CVPR2026
【速读】:该论文旨在解决现有**人像模态补全(HAC)与姿态引导的人像合成(PGPIS)方法在用户控制能力上的局限性:前者无法可靠地融入用户指定的约束(如特定姿态或空间范围),后者虽支持姿态条件但常因依赖训练分布而难以保持可见区域的实例特异性外观。解决方案的关键在于提出可提示的人像模态补全(PHAC)**任务,通过引入基于ControlNet的多类型点提示编码模块(如关节点或边界框),将用户指令注入预训练扩散模型,并仅微调交叉注意力块以实现强提示对齐而不破坏生成先验;同时设计基于图像修复(inpainting)的精修模块,从轻微噪声化的粗略补全出发,忠实保留可见区域并确保遮挡边界处的无缝融合,从而显著提升补全结果的物理合理性、质量及提示一致性。
链接: https://arxiv.org/abs/2603.14741
作者: Seung Young Noh,Ju Yong Chang
机构: Kwangwoon University (光云大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Conditional image generation methods are increasingly used in human-centric applications, yet existing human amodal completion (HAC) models offer users limited control over the completed content. Given an occluded person image, they hallucinate invisible regions while preserving visible ones, but cannot reliably incorporate user-specified constraints such as a desired pose or spatial extent. As a result, users often resort to repeatedly sampling the model until they obtain a satisfactory output. Pose-guided person image synthesis (PGPIS) methods allow explicit pose conditioning, but frequently fail to preserve the instance-specific visible appearance and tend to be biased toward the training distribution, even when built on strong diffusion model priors. To address these limitations, we introduce promptable human amodal completion (PHAC), a new task that completes occluded human images while satisfying both visible appearance constraints and multiple user prompts. Users provide simple point-based prompts, such as additional joints for the target pose or bounding boxes for desired regions; these prompts are encoded using ControlNet modules specialized for each prompt type. These modules inject the prompt signals into a pre-trained diffusion model, and we fine-tune only the cross-attention blocks to obtain strong prompt alignment without degrading the underlying generative prior. To further preserve visible content, we propose an inpainting-based refinement module that starts from a slightly noised coarse completion, faithfully preserves the visible regions, and ensures seamless blending at occlusion boundaries. Extensive experiments on the HAC and PGPIS benchmarks show that our approach yields more physically plausible and higher-quality completions, while significantly improving prompt alignment compared with existing amodal completion and pose-guided synthesis methods.
[CV-127] TrajMamba: An Ego-Motion-Guided Mamba Model for Pedestrian Trajectory Prediction from an Egocentric Perspective ICRA2026
【速读】:该论文旨在解决从第一人称视角(egocentric perspective)预测被追踪行人未来轨迹的问题,其核心挑战在于ego-camera与行人之间的复杂动态相对运动。解决方案的关键在于提出一种基于Mamba架构的自车运动引导轨迹预测网络:利用两个Mamba编码器分别提取行人的运动特征和自车的运动特征;随后设计一个自车运动引导的Mamba解码器,通过将行人运动特征作为历史上下文、自车运动特征作为引导线索,显式建模行人与车辆间的相对运动关系,从而捕获更精准的解码特征,并最终生成未来的轨迹。
链接: https://arxiv.org/abs/2603.14739
作者: Yusheng Peng,Gaofeng Zhang,Liping Zheng
机构: Hefei University of Technology (合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICRA 2026
Abstract:Future trajectory prediction of a tracked pedestrian from an egocentric perspective is a key task in areas such as autonomous driving and robot navigation. The challenge of this task lies in the complex dynamic relative motion between the ego-camera and the tracked pedestrian. To address this challenge, we propose an ego-motion-guided trajectory prediction network based on the Mamba model. Firstly, two Mamba models are used as encoders to extract pedestrian motion and ego-motion features from pedestrian movement and ego-vehicle movement, respectively. Then, an ego-motion guided Mamba decoder that explicitly models the relative motion between the pedestrian and the vehicle by integrating pedestrian motion features as historical context with ego-motion features as guiding cues to capture decoded features. Finally, the future trajectory is generated from the decoded features corresponding to the future timestamps. Extensive experiments demonstrate the effectiveness of the proposed model, which achieves state-of-the-art performance on the PIE and JAAD datasets.
[CV-128] Efficient Event Camera Volume System ICRA2026
【速读】:该论文旨在解决事件相机(Event Camera)稀疏输出与标准机器人流水线之间难以集成的问题,特别是传统基于时间分箱(temporal binning)的压缩方法会引入伪影,影响下游任务性能。解决方案的关键在于提出一种名为EECVS(Efficient Event Camera Volume System)的新框架,其核心创新是将事件流建模为连续时间狄拉克脉冲序列(continuous-time Dirac impulse trains),从而实现无伪影的压缩;同时结合密度驱动的自适应选择机制,在DCT、DTFT和DWT三种变换之间动态择优,并针对每种变换设计特定的系数剪枝策略以匹配其稀疏特性,从而在保持低延迟(如DCT处理仅1.5ms)和高吞吐量的同时,显著提升重建保真度与跨数据集的泛化能力。
链接: https://arxiv.org/abs/2603.14738
作者: Juan Camilo Soto,Ian Noronha,Saru Bharti,Upinder Kaur
机构: Purdue University (普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to ICRA 2026
Abstract:Event cameras promise low latency and high dynamic range, yet their sparse output challenges integration into standard robotic pipelines. We introduce \nameframew (Efficient Event Camera Volume System), a novel framework that models event streams as continuous-time Dirac impulse trains, enabling artifact-free compression through direct transform evaluation at event timestamps. Our key innovation combines density-driven adaptive selection among DCT, DTFT, and DWT transforms with transform-specific coefficient pruning strategies tailored to each domain’s sparsity characteristics. The framework eliminates temporal binning artifacts while automatically adapting compression strategies based on real-time event density analysis. On EHPT-XC and MVSEC datasets, our framework achieves superior reconstruction fidelity with DTFT delivering the lowest earth mover distance. In downstream segmentation tasks, EECVS demonstrates robust generalization. Notably, our approach demonstrates exceptional cross-dataset generalization: when evaluated with EventSAM segmentation, EECVS achieves mean IoU 0.87 on MVSEC versus 0.44 for voxel grids at 24 channels, while remaining competitive on EHPT-XC. Our ROS2 implementation provides real-time deployment with DCT processing achieving 1.5 ms latency and 2.7X higher throughput than alternative transforms, establishing the first adaptive event compression framework that maintains both computational efficiency and superior generalization across diverse robotic scenarios.
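把事件流视作狄拉克脉冲序列后,变换系数可直接在事件时间戳处求值,而无需时间分箱。下面以 DCT 给出一个示意(系数个数、阈值与时间戳均为虚构,非 EECVS 实现):

```python
# Illustrative sketch of the core idea: treating an event stream as a train
# of Dirac impulses lets transform coefficients be evaluated directly at
# the event timestamps (no discretization grid, hence no binning
# artifacts). Timestamps, k_max, and the pruning threshold are invented.
import math

def dct_coeffs(timestamps, duration, k_max):
    # DCT of sum_i delta(t - t_i) over [0, duration]: each coefficient is
    # a sum of cosines evaluated at the exact event times.
    return [sum(math.cos(math.pi * k * t / duration) for t in timestamps)
            for k in range(k_max)]

events = [0.1, 0.12, 0.13, 0.7, 0.72]   # event timestamps in seconds
coeffs = dct_coeffs(events, duration=1.0, k_max=8)
# Prune small coefficients, standing in for transform-specific pruning.
kept = [(k, c) for k, c in enumerate(coeffs) if abs(c) > 1.0]
print(kept)
```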
[CV-129] A Skill-Augmented Agentic Framework and Benchmark for Multi-Video Understanding
【速读】:该论文旨在解决多视频理解中模型跨视频推理能力不足的问题,具体表现为现有方法在将多个视频拼接为单一输入后直接推理时存在训练-推理不一致、帧压缩导致的信息丢失以及缺乏显式的跨视频协调机制;同时,当前多视频基准测试主要聚焦事件层面的比较,忽视了身份层面匹配、细粒度区分和结构化多步推理等关键任务。解决方案的关键在于提出MVX-Bench这一多视频跨维度基准,将11个经典计算机视觉任务统一为多视频问答框架,并设计SAMA(Skill-Augmented Agentic Framework for Multi-Video Understanding),其核心创新包括引入视觉工具、任务特定技能(task-specific skills)和冲突感知验证机制,从而实现迭代式、结构化的跨视频推理。
链接: https://arxiv.org/abs/2603.14733
作者: Yue Zhang,Liqiang Jing,Jia Li,Yapeng Tian,Xinya Du,Yunhui Guo,Vibhav Gogate
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal Large Language Models have achieved strong performance in single-video understanding, yet their ability to reason across multiple videos remains limited. Existing approaches typically concatenate multiple videos into a single input and perform direct inference, which introduces training-inference mismatch, information loss from frame compression, and a lack of explicit cross-video coordination. Meanwhile, current multi-video benchmarks primarily emphasize event-level comparison, leaving identity-level matching, fine-grained discrimination, and structured multi-step reasoning underexplored. To address these gaps, we introduce MVX-Bench, a Multi-Video Cross-Dimension Benchmark that reformulates 11 classical computer vision tasks into a unified multi-video question-answering framework, comprising 1,442 questions over 4,255 videos from diverse real-world datasets. We further propose SAMA, a Skill-Augmented Agentic Framework for Multi-Video Understanding, which integrates visual tools, task-specific skills, and a conflict-aware verification mechanism to enable iterative and structured reasoning. Experimental results show that SAMA outperforms strong open-source baselines and GPT on MVX-Bench, and ablations validate the effectiveness of skill design and conflict resolution.
[CV-130] Automated Diabetic Screening via Anterior Segment Ocular Imaging: A Deep Learning and Explainable AI Approach
【速读】:该论文旨在解决糖尿病视网膜病变(Diabetic Retinopathy, DR)筛查依赖于专业眼底照相设备和专家判断的问题,尤其是在基层医疗和资源匮乏地区难以实现的困境。其解决方案的关键在于开发并验证了一种基于深度学习(Deep Learning, DL)的自动化分类系统,利用常规摄影设备获取的前段眼部图像(anterior segment ocular imaging)中可观察到的虹膜、巩膜和结膜等可见生物标志物,实现对正常人群、血糖控制良好及血糖控制不佳糖尿病患者的准确区分。通过引入针对眼部图像域的自监督学习(Self-Supervised Learning, SSL)策略(如SimCLR),显著提升了模型性能,其中EfficientNet-V2-S with SSL架构在F1-score达到98.21%的同时,对正常人群的分类精度接近100%,有效减少了不必要的临床转诊,为非专科场景下的糖尿病视网膜病变早期筛查提供了可行路径。
链接: https://arxiv.org/abs/2603.14727
作者: Hasaan Maqsood,Saif Ur Rehman Khan,Sebastian Vollmer,Andreas Dengel,Muhammad Nabeel Asim
机构: DFKI (德国人工智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diabetic retinopathy screening traditionally relies on fundus photography, requiring specialized equipment and expertise often unavailable in primary care and resource-limited settings. We developed and validated a deep learning (DL) system for automated diabetic classification using anterior segment ocular imaging, a readily accessible alternative utilizing standard photography equipment. The system leverages visible biomarkers in the iris, sclera, and conjunctiva that correlate with systemic diabetic status. We systematically evaluated five contemporary architectures (EfficientNet-V2-S with self-supervised learning (SSL), Vision Transformer, Swin Transformer, ConvNeXt-Base, and ResNet-50) on 2,640 clinically annotated anterior segment images spanning Normal, Controlled Diabetic, and Uncontrolled Diabetic categories. A tailored preprocessing pipeline combining specular reflection mitigation and contrast-limited adaptive histogram equalization (CLAHE) was implemented to enhance subtle vascular and textural patterns critical for classification. SSL using SimCLR on domain-specific ocular images substantially improved model performance. EfficientNet-V2-S with SSL achieved optimal performance with an F1-score of 98.21%, precision of 97.90%, and recall of 98.55%, a substantial improvement over ImageNet-only initialization (94.63% F1). Notably, the model attained near-perfect precision (100%) for Normal classification, critical for minimizing unnecessary clinical referrals.
[CV-131] Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator CVPR2026
【速读】:该论文旨在解决3D全身姿态估计中手部姿态恢复不准确的问题,其核心挑战源于监督差距:全身姿态估计算法在包含有限手部多样性的全身体数据集上训练,而仅针对手部的估计算法虽能精确建模手指关节运动但缺乏全局身体上下文感知。解决方案的关键在于提出Hand4Whole++框架,通过引入轻量级的条件手部调制模块(CHAM),利用预训练手部姿态估计器提取的手部特征对全身特征流进行调制,从而提升手腕朝向预测的准确性与上肢运动学结构的一致性;同时,直接融合手部估计器输出的指关节姿态和手部形状,并通过可微刚性配准将其对齐至全身网格,实现全局身体推理与细粒度手部细节的协同优化。
链接: https://arxiv.org/abs/2603.14726
作者: Gyeongsik Moon
机构: Korea University (韩国高丽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Accurately recovering hand poses within the body context remains a major challenge in 3D whole-body pose estimation. This difficulty arises from a fundamental supervision gap: whole-body pose estimators are trained on full-body datasets with limited hand diversity, while hand-only estimators, trained on hand-centric datasets, excel at detailed finger articulation but lack global body awareness. To address this, we propose Hand4Whole++, a modular framework that leverages the strengths of both pre-trained whole-body and hand pose estimators. We introduce CHAM (Conditional Hands Modulator), a lightweight module that modulates the whole-body feature stream using hand-specific features extracted from a pre-trained hand pose estimator. This modulation enables the whole-body model to predict wrist orientations that are both accurate and coherent with the upper-body kinematic structure, without retraining the full-body model. In parallel, we directly incorporate finger articulations and hand shapes predicted by the hand pose estimator, aligning them to the full-body mesh via differentiable rigid alignment. This design allows Hand4Whole++ to combine globally consistent body reasoning with fine-grained hand detail. Extensive experiments demonstrate that Hand4Whole++ substantially improves hand accuracy and enhances overall full-body pose quality.
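文中的可微刚性配准本质上是 Procrustes 式对齐;下面以二维情形作闭式示意(论文针对三维手部网格,此处点集为虚构;二维下最优旋转有闭式解,无需 SVD):

```python
# Closed-form 2D rigid alignment sketch (the paper aligns 3D hand meshes;
# we illustrate the same least-squares idea in 2D, where the optimal
# rotation angle has a closed form). Point sets below are made up.
import math

def rigid_align(src, dst):
    # Find rotation theta and translation t minimizing sum ||R p + t - q||^2.
    cs = [sum(c) / len(src) for c in zip(*src)]   # source centroid
    cd = [sum(c) / len(dst) for c in zip(*dst)]   # target centroid
    a = b = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        px, py = px - cs[0], py - cs[1]
        qx, qy = qx - cd[0], qy - cd[1]
        a += px * qx + py * qy       # accumulated dot products
        b += px * qy - py * qx       # accumulated cross products
    theta = math.atan2(b, a)
    c, s = math.cos(theta), math.sin(theta)
    t = (cd[0] - c * cs[0] + s * cs[1], cd[1] - s * cs[0] - c * cs[1])
    return theta, t

src = [(0, 0), (1, 0), (0, 1)]
dst = [(2, 2), (2, 3), (1, 2)]    # src rotated 90 degrees, then shifted by (2, 2)
theta, t = rigid_align(src, dst)
print(math.degrees(theta), t)
```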
[CV-132] AdapterTune: Zero-Initialized Low-Rank Adapters for Frozen Vision Transformers
【速读】:该论文针对基于视觉Transformer(Vision Transformer, ViT)的冻结主干迁移学习中两个未被充分解决的问题展开研究:一是当适配器(adapter)简单插入固定特征提取器时导致的优化不稳定;二是缺乏对适配器容量设置的理论指导。解决方案的关键在于提出AdapterTune,其核心创新是在每个Transformer块中引入一个残差低秩瓶颈结构,且上投影(up-projection)零初始化,从而确保适应后的网络从预训练函数精确开始,消除早期训练阶段的表征漂移(representation drift)。同时,作者从理论上将适配器秩(rank)建模为特征空间中下游任务偏移的容量预算,并通过过剩风险分解预测出随着秩增加准确率呈单调递增但边际收益递减的“肘部”(elbow)行为,该结论在受控实验中得到验证。
链接: https://arxiv.org/abs/2603.14706
作者: Salim Khazem
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Frozen-backbone transfer with Vision Transformers faces two under-addressed issues: optimization instability when adapters are naively inserted into a fixed feature extractor, and the absence of principled guidance for setting adapter capacity. We introduce AdapterTune, which augments each transformer block with a residual low-rank bottleneck whose up-projection is zero-initialized, guaranteeing that the adapted network starts exactly at the pretrained function and eliminating early-epoch representation drift. On the analytical side, we formalize adapter rank as a capacity budget for approximating downstream task shifts in feature space. The resulting excess-risk decomposition predicts monotonic but diminishing accuracy gains with increasing rank, an "elbow" behavior we confirm through controlled sweeps. We evaluate on 9 datasets and 3 backbone scales with multi-seed reporting throughout. On a core 5-dataset transfer suite, AdapterTune improves top-1 accuracy over head-only transfer by +14.9 points on average while training only 0.92% of the parameters required by full fine-tuning, and outperforms full fine-tuning on 10 of 15 dataset-backbone pairs. Across the full benchmark, AdapterTune improves over head-only transfer on every dataset-backbone pair tested. Ablations on rank, placement, and initialization isolate each design choice. The code is available at: this https URL
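零初始化上投影使适配后的网络在训练起点与预训练函数完全一致,这一点可用如下纯 Python 小例验证(维度与初始化尺度为示意取值,并非论文实现):

```python
# Sketch of the zero-initialized low-rank adapter idea (dimensions and the
# plain-list linear algebra are illustrative, not the paper's code).
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def adapter_block(x, down, up):
    # Residual low-rank bottleneck: x + (x @ down) @ up.
    bottleneck = matmul(matmul(x, down), up)
    return [[xi + di for xi, di in zip(rx, rd)] for rx, rd in zip(x, bottleneck)]

d, r = 4, 2                                   # feature dim and adapter rank
random.seed(0)
down = [[random.gauss(0, 0.02) for _ in range(r)] for _ in range(d)]
up = [[0.0] * d for _ in range(r)]            # zero-init: adapter starts as identity
x = [[1.0, 2.0, 3.0, 4.0]]
print(adapter_block(x, down, up))             # equals x exactly before training
```

由于上投影为零矩阵,瓶颈分支的输出恒为零,适配块在第 0 步严格等于恒等残差,对应摘要中“消除早期表征漂移”的论断。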
[CV-133] Chain-of-Trajectories: Unlocking the Intrinsic Generative Optimality of Diffusion Models via Graph-Theoretic Planning
【速读】:该论文旨在解决扩散模型(Diffusion Models)在采样过程中因状态空间维度灾难导致的计算资源分配不合理问题,即固定、内容无关的采样调度策略难以适应不同生成阶段的复杂度差异,从而造成系统性计算冗余和性能瓶颈。其解决方案的关键在于提出Chain-of-Trajectories(CoTj)框架,核心创新是引入“扩散DNA”(Diffusion DNA)——一种低维特征表示,用于量化每阶段去噪难度并作为高维状态空间的代理变量,从而将采样过程重构为有向无环图上的图规划问题,并通过“预测-规划-执行”范式动态分配计算资源至最困难的生成阶段,实现高质量、稳定且高效的内容感知型生成。
链接: https://arxiv.org/abs/2603.14704
作者: Ping Chen,Xiang Liu,Xingpeng Zhang,Fei Shen,Xun Gong,Zhaoxiang Liu,Zezhou Chen,Huan Hu,Kai Wang,Shiguo Lian
机构: China Unicom(中国联通); National University of Singapore(新加坡国立大学); Southwest Petroleum University(西南石油大学); Southwest Jiaotong University(西南交通大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: 12 figues, 5 tables
Abstract:Diffusion models operate in a reflexive System 1 mode, constrained by a fixed, content-agnostic sampling schedule. This rigidity arises from the curse of state dimensionality, where the combinatorial explosion of possible states in the high-dimensional noise manifold renders explicit trajectory planning intractable and leads to systematic computational misallocation. To address this, we introduce Chain-of-Trajectories (CoTj), a train-free framework enabling System 2 deliberative planning. Central to CoTj is Diffusion DNA, a low-dimensional signature that quantifies per-stage denoising difficulty and serves as a proxy for the high-dimensional state space, allowing us to reformulate sampling as graph planning on a directed acyclic graph. Through a Predict-Plan-Execute paradigm, CoTj dynamically allocates computational effort to the most challenging generative phases. Experiments across multiple generative models demonstrate that CoTj discovers context-aware trajectories, improving output quality and stability while reducing redundant computation. This work establishes a new foundation for resource-aware, planning-based diffusion modeling. The code is available at this https URL.
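将采样调度重构为有向无环图上的规划问题,可用一个玩具动态规划示意(代价函数与噪声层级均为虚构,并非论文的规划器):

```python
# Toy DAG-planning sketch (our illustration, not the paper's planner):
# choose a denoising schedule as a cheapest path through a DAG whose nodes
# are noise levels and whose edge costs model per-stage difficulty.

def plan_schedule(levels, cost, budget):
    # levels: noise levels from high to low; pick exactly `budget` hops.
    # cost(i, j): difficulty of denoising directly from level i to level j.
    n = len(levels)
    best = {(0, 0): (0.0, [0])}            # (node, hops) -> (cost, path)
    for hops in range(budget):
        for (i, h), (c, path) in list(best.items()):
            if h != hops:
                continue
            for j in range(i + 1, n):
                cand = (c + cost(i, j), path + [j])
                key = (j, hops + 1)
                if key not in best or cand[0] < best[key][0]:
                    best[key] = cand
    return best[(n - 1, budget)][1]

levels = [1.0, 0.7, 0.4, 0.2, 0.0]
# Invented cost: big jumps near the end (fine detail) are expensive,
# standing in for the per-stage difficulty signature in the paper.
cost = lambda i, j: (levels[i] - levels[j]) ** 2 / (levels[j] + 0.1)
path = plan_schedule(levels, cost, budget=3)
print([levels[i] for i in path])
```

在这个虚构代价下,规划器把计算留给低噪声末段(小步走),而在高噪声段大步跳过,体现了“按阶段难度分配算力”的思想。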
[CV-134] Fractal Autoregressive Depth Estimation with Continuous Token Diffusion
【速读】:该论文旨在解决单目深度估计(monocular depth estimation)中因RGB与深度模态差异导致的自回归(autoregressive, AR)建模困难、逐像素生成效率低下以及连续深度预测不稳定等问题。其核心解决方案是提出一种分形视觉自回归扩散框架(Fractal Visual Autoregressive Diffusion, FVARD),将深度估计重构为粗到细的多尺度自回归生成过程;通过VCFR模块融合多尺度图像特征与当前深度预测以增强跨模态条件建模,利用条件去噪扩散损失直接在连续空间建模深度分布从而避免离散量化误差;同时采用分形递归架构复用基础视觉AR单元以提升计算效率,并引入不确定性感知的鲁棒共识聚合策略实现多样本推理的稳定融合与像素级可靠性估计。
链接: https://arxiv.org/abs/2603.14702
作者: Jinchang Zhang,Xinrou Kang,Guoyu Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Monocular depth estimation can benefit from autoregressive (AR) generation, but direct AR modeling is hindered by the modality gap between RGB and depth, inefficient pixel-wise generation, and instability in continuous depth prediction. We propose a Fractal Visual Autoregressive Diffusion framework that reformulates depth estimation as a coarse-to-fine, next-scale autoregressive generation process. A VCFR module fuses multi-scale image features with current depth predictions to improve cross-modal conditioning, while a conditional denoising diffusion loss models depth distributions directly in continuous space and mitigates errors caused by discrete quantization. To improve computational efficiency, we organize the scale-wise generators into a fractal recursive architecture, reusing a base visual AR unit in a self-similar hierarchy. We further introduce an uncertainty-aware robust consensus aggregation scheme for multi-sample inference to improve fusion stability and provide a practical pixel-wise reliability estimate. Experiments on standard benchmarks demonstrate strong performance and validate the effectiveness of the proposed design.
[CV-135] AURORA-KITTI: Any-Weather Depth Completion and Denoising in the Wild
【速读】:该论文旨在解决恶劣天气条件下RGB-LiDAR融合方法在深度补全(Depth Completion)任务中性能显著下降的问题,即在天气诱导的图像和LiDAR数据均受损的情况下,如何实现鲁棒的稠密深度图重建。其解决方案的关键在于提出AURORA-KITTI——首个大规模多模态、多天气场景下的深度补全基准数据集,并将深度补全与去噪(Depth Completion and Denoising, DCD)统一为一个联合任务,通过引入基于知识蒸馏的基线方法DDCD,利用深度基础模型注入干净结构先验信息,从而在野外复杂环境中实现更高效且鲁棒的深度补全。实验表明,气象感知且物理一致的数据对模型鲁棒性的提升效果优于单纯依赖网络架构改进。
链接: https://arxiv.org/abs/2603.14701
作者: Yiting Wang,Tim Brödermann,Hamed Haghighi,Haonan Zhao,Christos Sakaridis,Kurt Debattista,Valentina Donzella
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Robust depth completion is fundamental to real-world 3D scene understanding, yet existing RGB-LiDAR fusion methods degrade significantly under adverse weather, where both camera images and LiDAR measurements suffer from weather-induced corruption. In this paper, we introduce AURORA-KITTI, the first large-scale multi-modal, multi-weather benchmark for robust depth completion in the wild. We further formulate Depth Completion and Denoising (DCD) as a unified task that jointly reconstructs a dense depth map from corrupted sparse inputs while suppressing weather-induced noise. AURORA-KITTI contains over 82K weather-consistent RGBL pairs with metric depth ground truth, spanning diverse weather types, three severity levels, day and night scenes, paired clean references, lens occlusion conditions, and textual descriptions. Moreover, we introduce DDCD, an efficient distillation-based baseline that leverages depth foundation models to inject clean structural priors into in-the-wild DCD training. DDCD achieves state-of-the-art performance on AURORA-KITTI and the real-world DENSE dataset while maintaining efficiency. Notably, our results further show that weather-aware, physically consistent data contributes more to robustness than architectural modifications alone. Data and code will be released upon publication.
[CV-136] Robust Building Damage Detection in Cross-Disaster Settings Using Domain Adaptation
【速读】:该论文旨在解决遥感影像中建筑物损伤分类模型在未见地理区域部署时因领域偏移(domain shift)导致性能下降的问题,从而影响人机系统(HMS)中决策者对自动化评估的信任。解决方案的关键在于采用两阶段集成方法,核心是利用监督域适应(supervised domain adaptation, SDA)技术,将xView2竞赛第一名的方法迁移至Ida-BD数据集,通过SDA有效缓解训练与部署数据之间的分布差异;实验表明,去除SDA会导致损伤检测完全失效,而结合非锐化掩膜增强的RGB输入可使Macro-F1达到0.5552,显著提升了模型在跨地域场景下的鲁棒性与可信度。
链接: https://arxiv.org/abs/2603.14694
作者: Asmae Mouradi,Shruti Kshirsagar
机构: Wichita State University (威奇托州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Rapid structural damage assessment from remote sensing imagery is essential for timely disaster response. Within human-machine systems (HMS) for disaster management, automated damage detection provides decision-makers with actionable situational awareness. However, models trained on multi-disaster benchmarks often underperform in unseen geographic regions due to domain shift - a distributional mismatch between training and deployment data that undermines human trust in automated assessments. We explore a two-stage ensemble approach using supervised domain adaptation (SDA) for building damage classification across four severity classes. The pipeline adapts the xView2 first-place method to the Ida-BD dataset using SDA and systematically investigates the effect of individual augmentation components on classification performance. Comprehensive ablation experiments on the unseen Ida-BD test split demonstrate that SDA is indispensable: removing it causes damage detection to fail entirely. Our pipeline achieves the most robust performance using SDA with unsharp-enhanced RGB input, attaining a Macro-F1 of 0.5552. These results underscore the critical role of domain adaptation in building trustworthy automated damage assessment modules for HMS-integrated disaster response.
[CV-137] MVHOI: Bridge Multi-view Condition to Complex Human-Object Interaction Video Reenactment via 3D Foundation Model
【速读】:该论文旨在解决人类-物体交互(Human-Object Interaction, HOI)视频重演中复杂非平面操作(如出平面旋转)难以实现真实运动的问题。现有方法主要局限于图像平面内的简单运动(如平移),难以处理三维空间中复杂的物体操控。其解决方案的关键在于提出一个两阶段的HOI视频重演框架MVHOI,通过引入3D基础模型(3D Foundation Model, 3DFM)将多视角参考条件与视频基础模型相连接:第一阶段由3DFM基于隐式运动动力学生成跨视角一致的对象先验;第二阶段利用多视角参考图像和合理的检索机制合成高保真物体纹理,确保外观一致性。两个阶段在推理过程中相互增强,从而显著提升长时序、复杂3D物体操作场景下的HOI视频生成质量。
链接: https://arxiv.org/abs/2603.14686
作者: Jinguang Tong,Jinbo Wu,Kaisiyuan Wang,Zhelun Shen,Xuan Huang,Mochu Xiang,Xuesong Li,Yingying Li,Haocheng Feng,Chen Zhao,Hang Zhou,Wei He,Chuong Nguyen,Jingdong Wang,Hongdong Li
机构: Baidu Inc.(百度公司); ANU(澳大利亚国立大学); Sun Yat-sen University(中山大学); CSIRO(澳大利亚联邦科学与工业研究组织)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Human-Object Interaction (HOI) video reenactment with realistic motion remains a frontier in expressive digital human creation. Existing approaches primarily handle simple image-plane motion (e.g., in-plane translations), struggling with complex non-planar manipulations like out-of-plane reorientation. In this paper, we propose MVHOI, a two-stage HOI video reenactment framework that bridges multi-view reference conditions and video foundation models via a 3D Foundation Model (3DFM). The 3DFM first produces view-consistent object priors conditioned on implicit motion dynamics across novel viewpoints. A controllable video generation model then synthesizes high-fidelity object texture by incorporating multi-view reference images, ensuring appearance consistency via a reasonable retrieval mechanism. By enabling these two stages to mutually reinforce one another during the inference phase, our framework shows superior performance in generating long-duration HOI videos with intricate object manipulations. Extensive experiments show substantial improvements over prior approaches, especially for HOI with complex 3D object manipulations.
[CV-138] E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction CVPR2026
【速读】:该论文旨在解决事件相机(event camera)在无轨迹信息(pose-free)条件下进行高质量三维重建与新视角合成(NVS)的难题。现有方法通常依赖于已知相机位姿或受限于初始观测范围的深度估计模型,难以适应未见过的场景区域,导致泛化能力差。其解决方案的关键在于:利用边缘信息作为结构线索,在噪声事件流中通过基于局部时序一致性的方差分析提取有效边缘,从而实现无需先验位姿的结构感知高斯初始化与边缘加权损失优化,最终在初始化、跟踪和捆绑调整阶段均获得稳定且精确的重建结果。
链接: https://arxiv.org/abs/2603.14684
作者: Yunsoo Kim,Changki Sung,Dasol Hong,Hyun Myung
机构: KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures, accepted to CVPR 2026
Abstract:The emergence of neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS) has advanced novel view synthesis (NVS). These methods, however, require high-quality RGB inputs and accurate corresponding poses, limiting robustness under real-world conditions such as fast camera motion or adverse lighting. Event cameras, which capture brightness changes at each pixel with high temporal resolution and wide dynamic range, enable precise sensing of dynamic scenes and offer a promising solution. However, existing event-based NVS methods either assume known poses or rely on depth estimation models that are bounded by their initial observations, failing to generalize as the camera traverses previously unseen regions. We present E2EGS, a pose-free framework operating solely on event streams. Our key insight is that edge information provides rich structural cues essential for accurate trajectory estimation and high-quality NVS. To extract edges from noisy event streams, we exploit the distinct spatio-temporal characteristics of edges and non-edge regions. The event camera’s movement induces consistent events along edges, while non-edge regions produce sparse noise. We leverage this through a patch-based temporal coherence analysis that measures local variance to extract edges while robustly suppressing noise. The extracted edges guide structure-aware Gaussian initialization and enable edge-weighted losses throughout initialization, tracking, and bundle adjustment. Extensive experiments on both synthetic and real datasets demonstrate that E2EGS achieves superior reconstruction quality and trajectory accuracy, establishing a fully pose-free paradigm for event-based 3D reconstruction.
[CV-139] Comparative Analysis of 3D Convolutional and 2.5D Slice-Conditioned U-Net Architectures for MRI Super-Resolution via Elucidated Diffusion Models
【速读】:该论文旨在解决低场强磁共振成像(MRI)中分辨率不足的问题,通过计算方法将低分辨率图像重建为接近高场强扫描的高质量高分辨率图像,从而替代昂贵的高场强设备。其解决方案的关键在于采用一种阐明扩散模型(EDM)框架,并对比两种U-Net主干结构:一是全3D卷积U-Net,利用3D卷积和多头自注意力机制处理体素块;二是2.5D切片条件U-Net,在独立超分辨每个切片的同时引入相邻切片作为上下文信息。两种模型均采用连续sigma噪声条件机制,训练数据来自FOMO60K数据集中的NKI队列,实验表明3D模型在PSNR(37.75 dB)、SSIM(0.997)和LPIPS(0.020)三项指标上均优于预训练EDSR基线和2.5D变体,验证了3D建模对空间上下文捕捉的有效性。
链接: https://arxiv.org/abs/2603.14667
作者: Hendrik Chiche,Ludovic Corcos,Logan Rouge
机构: GENCI (Grand Équipement National de Calcul Intensif)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Magnetic resonance imaging (MRI) super-resolution (SR) methods that computationally enhance low-resolution acquisitions to approximate high-resolution quality offer a compelling alternative to expensive high-field scanners. In this work we investigate an elucidated diffusion model (EDM) framework for brain MRI SR and compare two U-Net backbone architectures: (i) a full 3D convolutional U-Net that processes volumetric patches with 3D convolutions and multi-head self-attention, and (ii) a 2.5D slice-conditioned U-Net that super-resolves each slice independently while conditioning on an adjacent slice for inter-slice context. Both models employ continuous-sigma noise conditioning following Karras et al. and are trained on the NKI cohort of the FOMO60K dataset. On a held-out test set of 5 subjects (6 volumes, 993 slices), the 3D model achieves 37.75 dB PSNR, 0.997 SSIM, and 0.020 LPIPS, improving on the off-the-shelf pretrained EDSR baseline (35.57 dB / 0.024 LPIPS) and the 2.5D variant (35.82 dB) across all three metrics under the same test data and degradation pipeline.
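摘要提到两种模型均采用"遵循 Karras et al. 的连续 sigma 噪声条件"。下面给出 Karras et al.(EDM)论文中标准 sigma 采样表的公式示意;其中 `sigma_min`、`sigma_max`、`rho` 取 EDM 常用默认值,未必与本文实验设置一致。

```python
def karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    # Karras et al. (EDM) 的噪声水平调度:
    # sigma_i = (smax^(1/rho) + i/(n-1) * (smin^(1/rho) - smax^(1/rho)))^rho
    ramp = [i / (n - 1) for i in range(n)]
    s_min = sigma_min ** (1 / rho)
    s_max = sigma_max ** (1 / rho)
    return [(s_max + t * (s_min - s_max)) ** rho for t in ramp]
```

该调度在 rho > 1 时把更多采样步集中在低噪声区间,是 EDM 框架常用的离散化方式。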
[CV-140] EviATTA: Evidential Active Test-Time Adaptation for Medical Segment Anything Models
【速读】:该论文旨在解决在大规模分布偏移下,基于测试时适应(Test-Time Adaptation, TTA)部署基础医学分割模型(Medical Segment Anything Models, SAMs)时,因测试监督信号不可靠而导致的性能下降问题。现有主动测试时适应(Active Test-Time Adaptation, ATTA)方法仍面临不确定性估计不可靠和稀疏标注利用效率低两大挑战。其解决方案的关键在于提出Evidential Active Test-Time Adaptation (EviATTA),首次专为医学SAMs设计的ATTA框架:首先采用基于Dirichlet的证据建模(Evidential Modeling),将预测不确定性分解为分布不确定性(distribution uncertainty)与数据不确定性(data uncertainty);进而设计分层证据采样策略,利用图像级分布不确定性筛选信息量大的偏移样本,并以距离感知的数据不确定性指导稀疏像素标注以消除数据歧义;最后引入双一致性正则化机制,通过逐步提示一致性约束增强稀疏标注样本的利用效率,并在未标注样本上施加变分特征一致性以稳定适应过程。
链接: https://arxiv.org/abs/2603.14666
作者: Jiayi Chen,Yasmeen George,Winston Chong,Jianfei Cai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 8 figures, 5 tables
Abstract:Deploying foundational medical Segment Anything Models (SAMs) via test-time adaptation (TTA) is challenging under large distribution shifts, where test-time supervision is often unreliable. While active test-time adaptation (ATTA) introduces limited expert feedback to improve reliability, existing ATTA methods still suffer from unreliable uncertainty estimation and inefficient utilization of sparse annotations. To address these issues, we propose Evidential Active Test-Time Adaptation (EviATTA), which is, to our knowledge, the first ATTA framework tailored for medical SAMs. Specifically, we adopt the Dirichlet-based Evidential Modeling to decompose overall predictive uncertainty into distribution uncertainty and data uncertainty. Building on this decomposition, we design a Hierarchical Evidential Sampling strategy, where image-wise distribution uncertainty is used to select informative shifted samples, while distance-aware data uncertainty guides sparse pixel annotations to resolve data ambiguities. We further introduce Dual Consistency Regularization, which enforces progressive prompt consistency on sparsely labeled samples to better exploit sparse supervision and applies variational feature consistency on unlabeled samples to stabilize adaptation. Extensive experiments on six medical image segmentation datasets demonstrate that EviATTA consistently improves adaptation reliability with minimal expert feedback under both batch-wise and instance-wise test-time adaptation settings.
[CV-141] VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting
【速读】:该论文旨在解决视频推理(video reasoning)中模型难以实现可靠时空定位(spatio-temporal grounding)的问题,尤其在缺乏显式标注或依赖高成本外部感知工具的情况下,模型难以准确捕捉与问题相关的视觉证据并排除干扰项。解决方案的关键在于提出一种输入自适应的强化学习框架 VisionCoach,其核心机制是通过训练阶段的视觉提示(visual prompting)增强关键证据、抑制干扰,并借助自蒸馏(self-distillation)使模型内化这一能力,从而在推理阶段无需外部提示即可实现精准的时空定位。该方法包含两个组件:视觉提示选择器(Visual Prompt Selector)根据视频和问题动态决定提示类型,以及基于对象感知奖励优化的时空推理器(Spatio-Temporal Reasoner),确保对象身份一致性与多区域边界框重叠,显著提升了多个视频理解与时空定位基准上的性能。
链接: https://arxiv.org/abs/2603.14659
作者: Daeun Lee,Shoubin Yu,Yue Zhang,Mohit Bansal
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project website: this https URL
Abstract:Video reasoning requires models to locate and track question-relevant evidence across frames. While reinforcement learning (RL) with verifiable rewards improves accuracy, it still struggles to achieve reliable spatio-temporal grounding during the reasoning process. Moreover, improving grounding typically relies on scaled training data or inference-time perception tools, which increases annotation cost or computational cost. To address this challenge, we propose VisionCoach, an input-adaptive RL framework that improves spatio-temporal grounding through visual prompting as training-time guidance. During RL training, visual prompts are selectively applied to challenging inputs to amplify question-relevant evidence and suppress distractors. The model then internalizes these improvements through self-distillation, enabling grounded reasoning directly on raw videos without visual prompting at inference. VisionCoach consists of two components: (1) Visual Prompt Selector, which predicts appropriate prompt types conditioned on the video and question, and (2) Spatio-Temporal Reasoner, optimized with RL under visual prompt guidance and object-aware grounding rewards that enforce object identity consistency and multi-region bounding-box overlap. Extensive experiments demonstrate that VisionCoach achieves state-of-the-art performance under comparable settings, across diverse video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA), while maintaining a single efficient inference pathway without external tools. Our results show that visual prompting during training improves grounded video reasoning, while self-distillation enables the model to internalize this ability without requiring prompts at inference time.
[CV-142] Human-AI Ensembles Improve Deepfake Detection in Low-to-Medium Quality Videos
【速读】:该论文试图解决的问题是:在现实场景中,人类与人工智能(AI)检测深度伪造(Deepfake)视频的能力差异及其互补性,尤其是在非专业拍摄的低质量视频环境下,AI检测器性能显著下降的问题。解决方案的关键在于提出人类-AI协同检测框架,通过融合人类判断与AI检测结果形成混合集成模型,从而有效降低高置信度错误率,提升整体检测鲁棒性,尤其在处理移动设备拍摄的日常活动视频时表现突出。
链接: https://arxiv.org/abs/2603.14658
作者: Marco Postiglione,Isabel Gortner,V.S. Subrahmanian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Deepfake detection is widely framed as a machine learning problem, yet how humans and AI detectors compare under realistic conditions remains poorly understood. We evaluate 200 human participants and 95 state-of-the-art AI detectors across two datasets: DF40, a standard benchmark, and CharadesDF, a novel dataset of videos of everyday activities. CharadesDF was recorded using mobile phones leading to low/moderate quality videos compared to the more professionally captured DF40. Humans outperform AI detectors on both datasets, with the gap widening in the case of CharadesDF where AI accuracy collapses to near chance (0.537) while humans maintain robust performance (0.784). Human and AI errors are complementary: humans miss high-quality deepfakes while AI detectors flag authentic videos as fake, and hybrid human-AI ensembles reduce high-confidence errors. These findings suggest that effective real-world deepfake detection, especially in non-professionally produced videos, requires human-AI collaboration rather than AI algorithms alone.
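摘要指出人机混合集成可降低高置信度错误,但未给出具体融合规则。下面是一个纯属假设的置信度加权融合示意:对 0.5 的偏离程度被当作各方的"置信度"权重,越不确定的一方对最终判断贡献越小。

```python
def hybrid_vote(human_probs, ai_probs, ai_weight=0.5):
    """按置信度加权融合人类与 AI 给出的"视频为伪造"概率(假设性规则)。"""
    fused = []
    for h, a in zip(human_probs, ai_probs):
        wh = abs(h - 0.5)               # 人类判断的置信度
        wa = abs(a - 0.5) * ai_weight   # AI 判断的置信度(可整体降权)
        if wh + wa == 0:
            fused.append(0.5)           # 双方都完全不确定
        else:
            fused.append((wh * h + wa * a) / (wh + wa))
    return fused
```

这类规则体现了摘要中"人与 AI 的错误互补"这一观察:当 AI 在低质量视频上退化到接近随机(约 0.5)时,其权重自然趋近于零,判断主要由人类主导。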
[CV-143] TopoCL: Topological Contrastive Learning for Medical Imaging
【速读】:该论文旨在解决现有对比学习(Contrastive Learning, CL)方法在医学图像分析中过度依赖视觉外观特征而忽视拓扑结构特征的问题。拓扑特征(如连通性模式、边界配置和腔体形成等)在医学影像中具有重要诊断价值,但传统CL方法未能有效建模这些信息。其解决方案的关键在于提出一种新的拓扑对比学习框架(TopoCL),核心创新包括:(1) 设计拓扑感知的数据增强策略,通过持久图(persistence diagrams)之间的相对瓶颈距离控制拓扑扰动,从而在保留医学相关拓扑属性的同时引入可控的结构变化;(2) 构建分层拓扑编码器(Hierarchical Topology Encoder),利用自注意力与交叉注意力机制提取拓扑特征;(3) 引入自适应混合专家(adaptive mixture-of-experts, MoE)模块,动态融合视觉与拓扑表示。该框架可无缝集成至主流CL方法,并在多个医学图像分类任务中显著提升性能。
链接: https://arxiv.org/abs/2603.14647
作者: Guangyu Meng,Pengfei Gu,Peixian Liang,John P. Lalor,Erin Wolf Chambers,Danny Z. Chen
机构: University of Notre Dame (圣母大学); The University of Texas Rio Grande Valley (得克萨斯州里奥格兰德河谷大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Contrastive learning (CL) has become a powerful approach for learning representations from unlabeled images. However, existing CL methods focus predominantly on visual appearance features while neglecting topological characteristics (e.g., connectivity patterns, boundary configurations, cavity formations) that provide valuable cues for medical image analysis. To address this limitation, we propose a new topological CL framework (TopoCL) that explicitly exploits topological structures during contrastive learning for medical imaging. Specifically, we first introduce topology-aware augmentations that control topological perturbations using a relative bottleneck distance between persistence diagrams, preserving medically relevant topological properties while enabling controlled structural variations. We then design a Hierarchical Topology Encoder that captures topological features through self-attention and cross-attention mechanisms. Finally, we develop an adaptive mixture-of-experts (MoE) module to dynamically integrate visual and topological representations. TopoCL can be seamlessly integrated with existing CL methods. We evaluate TopoCL on five representative CL methods (SimCLR, MoCo-v3, BYOL, DINO, and Barlow Twins) and five diverse medical image classification datasets. The experimental results show that TopoCL achieves consistent improvements: an average gain of +3.26% in linear probe classification accuracy with strong statistical significance, verifying its effectiveness.
[CV-144] Spectrum Matching: a Unified Perspective for Superior Diffusability in Latent Diffusion NIPS
【速读】:该论文旨在解决变分自编码器(Variational Autoencoder, VAE)在潜在空间扩散(latent diffusion)中的可扩散性(diffusability,即学习能力)问题,特别是现有方法在潜在空间中常出现过噪或过度平滑的现象。其核心问题是:如何设计潜在表示以提升扩散模型的生成质量与稳定性。解决方案的关键在于提出“谱匹配假设”(Spectrum Matching Hypothesis),该假设包含两个维度:(i) 编码谱匹配(Encoding Spectrum Matching, ESM),通过使潜在变量的功率谱密度(Power Spectral Density, PSD)趋近于图像的扁平化幂律分布,从而增强潜在空间的可扩散性;(ii) 解码谱匹配(Decoding Spectrum Matching, DSM),利用频域对齐的重建和共享谱掩码机制,保持频率到频率的语义一致性。这一统一框架不仅解释了以往方法的局限性,还为多个近期改进方法提供了理论依据,并在CelebA和ImageNet数据集上验证了其优越性能。
链接: https://arxiv.org/abs/2603.14645
作者: Mang Ning,Mingxiao Li,Le Zhang,Lanmiao Liu,Matthew B. Blaschko,Albert Ali Salah,Itir Onal Ertugrul
机构: Utrecht University (乌得勒支大学); KU Leuven (鲁汶大学); Mila (Mila); Max Planck Institute for Psycholinguistics (马克斯普朗克心理语言学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: We use the NIPS template for readability reasons
Abstract:In this paper, we study the diffusability (learnability) of variational autoencoders (VAE) in latent diffusion. First, we show that pixel-space diffusion trained with an MSE objective is inherently biased toward learning low and mid spatial frequencies, and that the power-law power spectral density (PSD) of natural images makes this bias perceptually beneficial. Motivated by this result, we propose the Spectrum Matching Hypothesis: latents with superior diffusability should (i) follow a flattened power-law PSD (Encoding Spectrum Matching, ESM) and (ii) preserve frequency-to-frequency semantic correspondence through the decoder (Decoding Spectrum Matching, DSM). In practice, we apply ESM by matching the PSD between images and latents, and DSM via shared spectral masking with frequency-aligned reconstruction. Importantly, Spectrum Matching provides a unified view that clarifies prior observations of over-noisy or over-smoothed latents, and interprets several recent methods as special cases (e.g., VA-VAE, EQ-VAE). Experiments suggest that Spectrum Matching yields superior diffusion generation on CelebA and ImageNet datasets, and outperforms prior approaches. Finally, we extend the spectral view to representation alignment (REPA): we show that the directional spectral energy of the target representation is crucial for REPA, and propose a DoG-based method to further improve the performance of REPA. Our code is available at this https URL.
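ESM 的核心是让 latent 的 PSD 匹配图像的(展平的)幂律谱。下面用 numpy 演示"径向平均 PSD"与一个对数域匹配损失的示意写法;函数名与损失形式均为本文博主的假设性写法,并非论文官方实现。

```python
import numpy as np

def radial_psd(x, n_bins=16):
    """二维信号的径向平均功率谱密度(PSD)。"""
    f = np.fft.fftshift(np.fft.fft2(x))
    psd = (np.abs(f) ** 2).ravel()
    h, w = x.shape
    yy, xx = np.mgrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2).ravel()   # 到频谱中心的半径
    bins = np.linspace(0.0, r.max() + 1e-6, n_bins + 1)
    idx = np.digitize(r, bins) - 1
    sums = np.bincount(idx, weights=psd, minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    return sums[:n_bins] / np.maximum(counts[:n_bins], 1)

def spectrum_matching_loss(latent, image):
    # ESM 思想的示意:在对数域上对齐 latent 与图像的径向 PSD
    p_l = np.log(radial_psd(latent) + 1e-12)
    p_i = np.log(radial_psd(image) + 1e-12)
    return float(np.mean((p_l - p_i) ** 2))
```

在对数域比较 PSD 的好处是:幂律谱在 log-log 坐标下近似为直线,各频段的偏差被同等看待,而不会被低频的巨大能量主导。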
[CV-145] Seeing Where to Deploy: Metric RGB-Based Traversability Analysis for Aerial-to-Ground Hidden Space Inspection
【速读】:该论文旨在解决受限基础设施(如涵洞)内部空间巡检中,如何从空中视角准确识别并评估适合无人机(UAV)投放小型地面机器人(UGV)的部署区域的问题。其核心挑战在于:仅依赖RGB图像进行地形重建时存在尺度模糊性(scale ambiguity)、重建不确定性(reconstruction uncertainty)以及地形语义理解不足(terrain semantics)。解决方案的关键在于提出一个基于RGB的几何-语义联合重建与可通行性分析框架,其中引入了“具身运动先验”(embodied motion prior),通过强制预测相机位姿与平台自身位姿的一致性来恢复度量尺度;在此基础上构建具有置信度感知的几何-语义可通行性地图,并在显式可达性约束下评估候选部署区域,从而实现无需激光雷达(LiDAR)即可可靠识别适合部署的区域。
链接: https://arxiv.org/abs/2603.14639
作者: Seoyoung Lee,Shaekh Mohammad Shithil,Durgakant Pushp,Lantao Liu,Zhangyang Wang
机构: The University of Texas at Austin (得克萨斯大学奥斯汀分校); Indiana University, Bloomington (印第安纳大学布卢明顿分校)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Inspection of confined infrastructure such as culverts often requires accessing hidden spaces whose entrances are reachable primarily from elevated viewpoints. Aerial-ground cooperation enables a UAV to deploy a compact UGV for interior exploration, but selecting a suitable deployment region from aerial observations requires metric terrain reasoning involving scale ambiguity, reconstruction uncertainty, and terrain semantics. We present a metric RGB-based geometric-semantic reconstruction and traversability analysis framework for aerial-to-ground hidden space inspection. A feed-forward multi-view RGB reconstruction backbone produces dense geometry, while temporally consistent semantic segmentation yields a 3D semantic map. To enable deployment-relevant measurements without LiDAR-based dense mapping, we introduce an embodied motion prior that recovers metric scale by enforcing consistency between predicted camera motion and onboard platform egomotion. From the metrically grounded reconstruction, we construct a confidence-aware geometric-semantic traversability map and evaluate candidate deployment zones under explicit reachability constraints. Experiments on a tethered UAV-UGV platform demonstrate reliable deployment-zone identification in hidden space scenarios.
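摘要中的"具身运动先验"通过预测相机运动与平台自运动的一致性恢复度量尺度。其一种最小二乘形式可以示意如下(假设性写法:对平移范数做标量尺度拟合,论文的具体约束形式以原文为准)。

```python
import numpy as np

def recover_metric_scale(pred_trans, ego_trans):
    """求尺度 s 使 s * ||t_pred|| 最小二乘逼近 ||t_ego||。
    pred_trans / ego_trans: (N, 3) 的逐帧平移向量。"""
    p = np.linalg.norm(pred_trans, axis=1)   # 重建(无尺度)的平移幅值
    e = np.linalg.norm(ego_trans, axis=1)    # 平台里程计的平移幅值
    return float(np.dot(p, e) / np.dot(p, p))
```

拿到 s 后,把整段无尺度重建乘以 s 即可得到可用于可通行性度量(坡度、间隙宽度等)的度量化几何。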
[CV-146] Continual Few-shot Adaptation for Synthetic Fingerprint Detection
【速读】:该论文旨在解决生成式 AI (Generative AI) 生成的指纹图像日益逼真所带来的指纹识别系统安全漏洞问题,即数据注入攻击中恶意插入合成指纹导致的误识风险。其核心挑战在于传统深度神经网络(DNN)模型在训练后对未见过的生成模型所产合成指纹泛化能力差,易过拟合。解决方案的关键在于将合成指纹检测建模为持续少样本适应(continual few-shot adaptation)问题,通过结合二元交叉熵损失与监督对比损失(supervised contrastive loss)优化特征表示,并在微调过程中回放少量已知风格样本以缓解灾难性遗忘,从而实现对新型合成指纹的快速适应与对已有风格记忆的稳定保持。
链接: https://arxiv.org/abs/2603.14632
作者: Joseph Geo Benjamin,Anil K. Jain,Karthik Nandakumar
机构: Michigan State University (密歇根州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
备注: Accepted in 14th International Workshop on Biometrics and Forensics (IWBF-2026)
Abstract:The quality and realism of synthetically generated fingerprint images have increased significantly over the past decade fueled by advancements in generative artificial intelligence (GenAI). This has exacerbated the vulnerability of fingerprint recognition systems to data injection attacks, where synthetic fingerprints are maliciously inserted during enrollment or authentication. Hence, there is an urgent need for methods to detect if a fingerprint image is real or synthetic. While it is straightforward to train deep neural network (DNN) models to classify images as real or synthetic, often such DNN models overfit the training data and fail to generalize well when applied to synthetic fingerprints generated using unseen GenAI models. In this work, we formulate synthetic fingerprint detection as a continual few-shot adaptation problem, where the objective is to rapidly evolve a base detector to identify new types of synthetic data. To enable continual few-shot adaptation, we employ a combination of binary cross-entropy and supervised contrastive (applied to the feature representation) losses and replay a few samples from previously known styles during fine-tuning to mitigate catastrophic forgetting. Experiments based on several DNN backbones (as feature extractors) and a variety of real and synthetic fingerprint datasets indicate that the proposed approach achieves a good trade-off between fast adaptation for detecting unseen synthetic styles and forgetting of known styles.
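摘要提到在微调中结合二元交叉熵与作用于特征表示的监督对比损失。下面是监督对比损失(SupCon, Khosla et al.)的 numpy 玩具实现,仅用于说明其形式;实际训练中该损失通常在深度学习框架内对 mini-batch 特征计算。

```python
import numpy as np

def supcon_loss(feats, labels, temp=0.1):
    """监督对比损失的 numpy 示意实现。
    feats: (N, D) 特征(内部做 L2 归一化);labels: 长度 N 的类别标签。"""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T / temp
    np.fill_diagonal(sim, -np.inf)           # 排除锚点与自身的配对
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    labels = np.asarray(labels)
    pos = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(pos, 0.0)
    denom = np.maximum(pos.sum(axis=1), 1.0)  # 每个锚点的正样本数
    safe_lp = np.where(np.isfinite(log_prob), log_prob, 0.0)
    loss = -(pos * safe_lp).sum(axis=1) / denom
    return float(loss.mean())
```

该损失把同类样本(含回放缓冲区中旧风格的样本)在特征空间拉近、异类推远,这正是摘要中用少量回放样本缓解灾难性遗忘的配套机制。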
[CV-147] A Heterogeneous Ensemble for Multi-Center COVID-19 Classification from Chest CT Scans
【速读】:该论文旨在解决多中心医疗影像诊断中因设备差异、采集协议不一致和人群异质性导致的域偏移(domain shift)问题,从而提升新冠肺部CT图像自动分类的泛化性能。其关键解决方案在于构建一个异构集成模型(heterogeneous ensemble),融合三种不同推理范式的九个模型:基于自监督学习的DINOv2 Vision Transformer、预训练于RadImageNet的DenseNet-121以及七种采用门控注意力机制的多实例学习模型(Gated Attention Multiple Instance Learning, GAMIL),并通过随机种子扰动与Stochastic Weight Averaging增强模型多样性;同时引入Focal Loss、嵌入层Mixup和领域感知增强策略缓解严重过拟合现象,并通过加权概率平均融合与源域阈值校准实现跨中心稳定性能——最终在四个医院中心实现平均宏F1达0.9280,显著优于最优单模型(F1=0.8969)。
链接: https://arxiv.org/abs/2603.14621
作者: Aadit Nilay,Bhavesh Thapar,Anant Agrawal,Mohammad Nayeem Teli
机构: University of Maryland, College Park (马里兰大学学院公园分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The COVID-19 pandemic exposed critical limitations in diagnostic workflows: RT-PCR tests suffer from slow turnaround times and high false-negative rates, while CT-based screening offers faster complementary diagnosis but requires expert radiological interpretation. Deploying automated CT analysis across multiple hospital centres introduces further challenges, as differences in scanner hardware, acquisition protocols, and patient populations cause substantial domain shift that degrades single-model performance. To address these challenges, we present a heterogeneous ensemble of nine models spanning three inference paradigms: (1) a self-supervised DINOv2 Vision Transformer with slice-level sigmoid aggregation, (2) a RadImageNet-pretrained DenseNet-121 with slice-level sigmoid averaging, and (3) seven Gated Attention Multiple Instance Learning models using EfficientNet-B3, ConvNeXt-Tiny, and EfficientNetV2-S backbones with scan-level softmax classification. Ensemble diversity is further enhanced through random-seed variation and Stochastic Weight Averaging. We address severe overfitting, reducing the validation-to-training loss ratio from 35x to less than 3x, through a combination of Focal Loss, embedding-level Mixup, and domain-aware augmentation. Model outputs are fused via score-weighted probability averaging and calibrated with per-source threshold optimization. The final ensemble achieves an average macro F1 of 0.9280 across four hospital centres, outperforming the best single model (F1=0.8969) by +0.031, demonstrating that heterogeneous architectures combined with source-aware calibration are essential for robust multi-site medical image classification.
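摘要描述了"得分加权概率平均 + 各来源(医院中心)阈值校准"的融合方式,可示意如下;其中权重与阈值均为虚构数值,真实权重应由各模型的验证集得分决定。

```python
import numpy as np

def fuse_and_classify(prob_matrix, model_weights, source_ids, thresholds):
    """得分加权概率平均后,按各来源自己的阈值做二分类判决(示意)。
    prob_matrix: (num_models, num_scans) 的阳性概率;
    thresholds: {来源标识: 判决阈值}。"""
    w = np.asarray(model_weights, dtype=float)
    w = w / w.sum()                          # 归一化模型权重
    fused = w @ np.asarray(prob_matrix)      # (num_scans,) 融合概率
    preds = [int(p >= thresholds[s]) for p, s in zip(fused, source_ids)]
    return fused, preds
```

按来源分别校准阈值,正是摘要所说应对多中心域偏移的"source-aware calibration":同一融合概率在不同医院中心可以对应不同的判决边界。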
[CV-148] Make it SING: Analyzing Semantic Invariants in Classifiers
【速读】:该论文旨在解决现有分类器(包括最先进的视觉模型)中存在但难以解释的不变性问题——这些不变性源于分类器线性映射的几何结构,存在于分类器的零空间(null-space)中,导致不同输入可能产生相同输出,而其语义内容却缺乏可解释性。为填补这一空白,作者提出Semantic Interpretation of the Null-space Geometry (SING) 方法,其核心在于利用从网络特征到多模态视觉语言模型(vision-language models)的映射,生成与模型等价的图像,并赋予这些变化以语义解释,从而实现对零空间几何结构的语义化理解。该方法既可用于单张图像揭示局部不变性,也可用于图像集合进行类级和模型级的统计分析,有效识别如ResNet50在零空间泄露语义属性,而DinoViT则更优地保持类语义一致性等关键发现。
链接: https://arxiv.org/abs/2603.14610
作者: Harel Yadid,Meir Yossef Levi,Roy Betser,Guy Gilboa
机构: Technion – Israel Institute of Technology (以色列理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:All classifiers, including state-of-the-art vision models, possess invariants, partially rooted in the geometry of their linear mappings. These invariants, which reside in the null-space of the classifier, induce equivalent sets of inputs that map to identical outputs. The semantic content of these invariants remains vague, as existing approaches struggle to provide human-interpretable information. To address this gap, we present Semantic Interpretation of the Null-space Geometry (SING), a method that constructs equivalent images, with respect to the network, and assigns semantic interpretations to the available variations. We use a mapping from network features to multi-modal vision language models. This allows us to obtain natural language descriptions and visual examples of the induced semantic shifts. SING can be applied to a single image, uncovering local invariants, or to sets of images, allowing a breadth of statistical analysis at the class and model levels. For example, our method reveals that ResNet50 leaks relevant semantic attributes to the null space, whereas DinoViT, a ViT pretrained with self-supervised DINO, is superior in maintaining class semantics across the invariant space.
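SING 围绕分类器线性映射的零空间展开:沿零空间方向扰动特征,logits 完全不变。下面用 SVD 取零空间正交基并验证这一不变性;权重与特征均为随机玩具数据,并非论文中的真实模型。

```python
import numpy as np

def null_space_directions(W, tol=1e-10):
    """线性分类头 W (C, D) 的零空间正交基(通过 SVD 求得)。"""
    _, s, vt = np.linalg.svd(W)
    rank = int((s > tol).sum())
    return vt[rank:]                         # (D - rank, D),每行是一个不变方向

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 64))            # 10 类、64 维特征的玩具分类头
feat = rng.standard_normal(64)
ns = null_space_directions(W)
perturbed = feat + 3.0 * ns[0]               # 沿零空间方向移动特征
```

`W @ feat` 与 `W @ perturbed` 相同,意味着这两个特征(以及映射到它们的输入)对分类器完全等价;SING 的贡献正是给这类等价变化赋予可解释的语义。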
[CV-149] GroundSet: A Cadastral-Grounded Dataset for Spatial Understanding with Vector Data
【速读】:该论文旨在解决遥感(Remote Sensing, RS)领域中多模态大语言模型在细粒度空间理解方面的显著不足问题,其根源在于现有模型依赖于有限或被重新利用的遗留数据集。解决方案的关键在于构建一个基于可验证地籍矢量数据的大规模标注数据集,包含380万条标注对象和51万张高分辨率图像,涵盖135个细粒度语义类别,并通过七项空间推理任务的指令微调基准进行验证。实验表明,高质量监督信号能够有效弥合当前专用及商用模型(如Gemini)在零样本设置下的性能差距,使标准LLaVA架构无需复杂结构修改即可掌握细粒度空间定位能力。
链接: https://arxiv.org/abs/2603.14609
作者: Roger Ferrod,Maël Lecene,Krishna Sapkota,George Leifman,Vered Silverman,Genady Beryozkin,Sylvain Lobry
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Precise spatial understanding in Earth Observation is essential for translating raw aerial imagery into actionable insights for critical applications like urban planning, environmental monitoring and disaster management. However, Multimodal Large Language Models exhibit critical deficiencies in fine-grained spatial understanding within Remote Sensing, primarily due to a reliance on limited or repurposed legacy datasets. To bridge this gap, we introduce a large-scale dataset grounded in verifiable cadastral vector data, comprising 3.8 million annotated objects across 510k high-resolution images with 135 granular semantic categories. We validate this resource through a comprehensive instruction-tuning benchmark spanning seven spatial reasoning tasks. Our evaluation establishes a robust baseline using a standard LLaVA architecture. We show that while current RS-specialized and commercial models (e.g., Gemini) struggle in zero-shot settings, high-fidelity supervision effectively bridges this gap, enabling standard architectures to master fine-grained spatial grounding without complex architectural modifications.
[CV-150] Tactile Modality Fusion for Vision-Language-Action Models
【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在接触密集型操作任务中因仅依赖视觉感知而无法有效捕捉复杂交互动力学的问题,例如接触力、表面摩擦、柔顺性和剪切力等。解决方案的关键在于提出TacFiLM,一种轻量级模态融合方法,通过特征级线性调制(Feature-wise Linear Modulation, FiLM)在后训练微调阶段对中间视觉特征进行条件化处理,以引入预训练的触觉表示,从而实现高效且低计算开销的视觉-触觉信号融合。实验表明,该方法在插入任务中显著提升了成功率、直接插入性能、完成时间和力稳定性,适用于分布内与分布外场景。
链接: https://arxiv.org/abs/2603.14604
作者: Charlotte Morissette,Amin Abyaneh,Wei-Di Chang,Anas Houssaini,David Meger,Hsiu-Chin Lin,Jonathan Tremblay,Gregory Dudek
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 19 pages, 5 figures
Abstract:We propose TacFiLM, a lightweight modality-fusion approach that integrates visual-tactile signals into vision-language-action (VLA) models. While recent advances in VLA models have introduced robot policies that are both generalizable and semantically grounded, these models mainly rely on vision-based perception. Vision alone, however, cannot capture the complex interaction dynamics that occur during contact-rich manipulation, including contact forces, surface friction, compliance, and shear. While recent attempts to integrate tactile signals into VLA models often increase complexity through token concatenation or large-scale pretraining, the heavy computational demands of behavioural models necessitate more lightweight fusion strategies. To address these challenges, TacFiLM outlines a post-training finetuning approach that conditions intermediate visual features on pretrained tactile representations using feature-wise linear modulation (FiLM). Experimental results on insertion tasks demonstrate consistent improvements in success rate, direct insertion performance, completion time, and force stability across both in-distribution and out-of-distribution tasks. Together, these results support our method as an effective approach to integrating tactile signals into VLA models, improving contact-rich manipulation behaviours.
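FiLM(特征级线性调制)本身是一种通用的条件化机制:由条件信号(此处为触觉特征)生成逐维缩放系数 gamma 与偏移 beta,再对视觉特征做仿射变换。下面给出一个纯 Python 最小示意(线性层权重与特征维度均为随意假设,并非 TacFiLM 的实际实现):

```python
# FiLM 示意:用触觉特征预测 gamma/beta,对视觉特征逐维做 y_i = gamma_i * x_i + beta_i

def linear(x, w, b):
    # 简单全连接层:y = W x + b
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def film_modulate(visual_feat, tactile_feat, w_gamma, b_gamma, w_beta, b_beta):
    # 由触觉特征预测逐维缩放 gamma 与偏移 beta
    gamma = linear(tactile_feat, w_gamma, b_gamma)
    beta = linear(tactile_feat, w_beta, b_beta)
    return [g * x + b for g, x, b in zip(gamma, visual_feat, beta)]

# 玩具维度:视觉特征 3 维,触觉特征 2 维(权重为演示用的假设值)
w_g = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_g = [0.0, 0.0, 0.0]
w_b = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
b_b = [0.1, 0.1, 0.1]

out = film_modulate([1.0, 2.0, 3.0], [2.0, 0.5], w_g, b_g, w_b, b_b)
```

这类调制只引入两组小线性层,参数与计算开销都远小于 token 拼接或大规模预训练,这也是论文称其"轻量级"的原因。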
[CV-151] Texel Splatting: Perspective-Stable 3D Pixel Art
【速读】:该论文旨在解决在渲染像素艺术(pixel art)风格的3D场景时,如何保持离散像素在相机移动过程中稳定的问题。传统方法通过将相机对齐到网格来实现稳定性,但在透视投影下失效,因为不同深度的像素漂移速率不同,单一的网格对齐无法同时校正所有深度。其解决方案的关键在于采用纹理点绘制(texel splatting)技术:将场景几何体从世界空间中固定点渲染到立方体贴图(cubemap),每个纹素(texel)以世界空间中的四边形形式投射到屏幕。这种方法利用立方体贴图索引实现旋转不变性,并通过固定原点的网格对齐实现平移不变性,从而在透视投影下维持像素稳定性。主要限制是固定原点无法覆盖全部场景几何,导致探针边界处的遮挡缺失(disocclusion)问题仍需权衡。
链接: https://arxiv.org/abs/2603.14587
作者: Dylan Ebert
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 3 pages, 2 figures
Abstract:Rendering 3D scenes as pixel art requires that discrete pixels remain stable as the camera moves. Existing methods snap the camera to a grid. Under orthographic projection, this works: every pixel shifts by the same amount, and a single snap corrects all of them. Perspective breaks this. Pixels at different depths drift at different rates, and no single snap corrects all depths. Texel splatting avoids this entirely. Scene geometry is rendered into a cubemap from a fixed point in the world, and each texel is splatted to the screen as a world-space quad. Cubemap indexing gives rotation invariance. Grid-snapping the origin gives translation invariance. The primary limitation is that a fixed origin cannot see all geometry; disocclusion at probe boundaries remains an open tradeoff.
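其中"对原点做网格对齐以获得平移不变性"可以用一个小例子说明:只要探针位置落在同一网格单元内,对齐后的原点完全相同,因此纹素不会随相机的小幅平移而抖动(纯示意,网格间距 cell 为假设参数):

```python
def snap_to_grid(p, cell):
    # 将探针原点对齐到间距为 cell 的世界空间网格
    return [round(v / cell) * cell for v in p]

# 两个相距很近的相机位置,对齐后得到同一个探针原点
a = snap_to_grid([0.26, 0.28, 0.31], 0.5)
b = snap_to_grid([0.30, 0.30, 0.30], 0.5)
```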
[CV-152] Medical Image Spatial Grounding with Semantic Sampling MICCAI2026
【速读】:该论文旨在解决医学图像中视觉语言模型(Vision Language Models, VLMs)在三维空间内对解剖结构进行精准空间定位(spatial grounding)的挑战,这一问题在医学影像分析中尤为关键,因图像模态、切片方向和坐标系差异显著影响模型的空间理解能力。解决方案的关键在于提出两个核心贡献:一是构建了MIS-Ground基准测试集,用于系统评估VLMs在医学图像空间定位上的脆弱性;二是设计了一种轻量级、推理时可用且与模型无关的优化方法MIS-SemSam,其通过语义采样(Semantic Sampling)策略提升VLMs的空间接地精度,实验表明该方法使Qwen3-VL-32B模型在MIS-Ground上的准确率提升了13.06%。
链接: https://arxiv.org/abs/2603.14579
作者: Andrew Seohwan Yu,Mohsen Hariri,Kunio Nakamura,Mingrui Yang,Xiaojuan Li,Vipin Chaudhary
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 2 figures, under review at MICCAI 2026
Abstract:Vision language models (VLMs) have shown significant promise in visual grounding for images as well as videos. In medical imaging research, VLMs represent a bridge between object detection and segmentation, and report understanding and generation. However, spatial grounding of anatomical structures in the three-dimensional space of medical images poses many unique challenges. In this study, we examine image modalities, slice directions, and coordinate systems as differentiating factors for vision components of VLMs, and the use of anatomical, directional, and relational terminology as factors for the language components. We then demonstrate that visual and textual prompting systems such as labels, bounding boxes, and mask overlays have varying effects on the spatial grounding ability of VLMs. To enable measurement and reproducibility, we introduce MIS-Ground, a benchmark that comprehensively tests a VLM for vulnerabilities against specific modes of Medical Image Spatial Grounding. We release MIS-Ground to the public at this https URL. In addition, we present MIS-SemSam, a low-cost, inference-time, and model-agnostic optimization of VLMs that improves their spatial grounding ability with the use of Semantic Sampling. We find that MIS-SemSam improves the accuracy of Qwen3-VL-32B on MIS-Ground by 13.06%.
[CV-153] Covariance-Guided Resource Adaptive Learning for Efficient Edge Inference
【速读】:该论文旨在解决边缘设备上深度学习推理时,硬件配置在相同吞吐量下功耗差异可达2倍的问题,即如何高效地找到低功耗且满足性能目标的配置,而无需进行耗时的离线调优。解决方案的关键在于提出一种在线优化方法CORAL,其利用距离协方差(distance covariance)统计捕捉硬件参数(如动态电压频率调节DVFS和并发级别)与性能指标之间的非线性依赖关系,并将问题建模为吞吐量-功耗联合优化问题,从而在满足功率预算和吞吐量目标的同时,实现近似最优配置的快速发现。
链接: https://arxiv.org/abs/2603.14577
作者: Ahmad N. L. Nabhaan,Zaki Sukma,Rakandhiya D. Rachmanto,Muhammad Husni Santriaji,Byungjin Cho,Arief Setyanto,In Kee Kim
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 10 figures
Abstract:For deep learning inference on edge devices, hardware configurations achieving the same throughput can differ by 2× in power consumption, yet operators often struggle to find the efficient ones without exhaustive profiling. Existing approaches often rely on inefficient static presets or require expensive offline profiling that must be repeated for each new model or device. To address this problem, we present CORAL, an online optimization method that discovers near-optimal configurations without offline profiling. CORAL leverages distance covariance to statistically capture the non-linear dependencies between hardware settings, e.g., DVFS and concurrency levels, and performance metrics. Unlike prior work, we explicitly formulate the challenge as a throughput-power co-optimization problem to satisfy power budgets and throughput targets simultaneously. We evaluate CORAL on two NVIDIA Jetson devices across three object detection models ranging from lightweight to heavyweight. In single-target scenarios, CORAL achieves 96%–100% of the optimal performance found by exhaustive search. In strict dual-constraint scenarios where baselines fail or exceed power budgets, CORAL consistently finds proper configurations online with minimal exploration.
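距离协方差(distance covariance)之所以适合这一场景,是因为它可以捕捉两个变量之间的非线性依赖(两变量独立时取值为 0)。下面是样本距离协方差平方的纯 Python 草图(采用标准的双中心化定义,仅演示统计量本身,与 CORAL 的工程实现无关):

```python
def dcov2(x, y):
    # 样本距离协方差的平方:对两组一维样本的成对距离矩阵做双中心化后求内积均值
    n = len(x)
    a = [[abs(x[i] - x[j]) for j in range(n)] for i in range(n)]
    b = [[abs(y[i] - y[j]) for j in range(n)] for i in range(n)]

    def center(m):
        # 双中心化:减去行均值、列均值,加回全局均值
        row = [sum(r) / n for r in m]
        col = [sum(m[i][j] for i in range(n)) / n for j in range(n)]
        tot = sum(row) / n
        return [[m[i][j] - row[i] - col[j] + tot for j in range(n)]
                for i in range(n)]

    A, B = center(a), center(b)
    return sum(A[i][j] * B[i][j] for i in range(n) for j in range(n)) / (n * n)
```

例如,当 y 是常数(与 x 独立)时 dcov2 为 0,而 y = x² 这类 Pearson 相关可能漏掉的非线性关系也会给出正值,这正是 CORAL 选用它刻画硬件设置与性能指标依赖关系的原因。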
[CV-154] ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在处理高分辨率视觉标记时面临的二次计算复杂度瓶颈问题,尤其针对现有token缩减策略未能充分挖掘注意力值、忽视token冗余以及忽略LVLM中固有的“注意力偏移”(attention shift)现象的局限性。其解决方案的关键在于提出一种无需训练且兼容KV缓存的剪枝方法ASAP:首先,通过动态双向软注意力掩码缓解注意力偏移,从而选择真正信息量丰富的视觉token;其次,引入加权软合并机制对语义相似的token进行融合,保留最具特征密度的视觉补丁,实现近乎无损的视觉上下文压缩,在保持LLaVA-NeXT-7B模型99.02%原始性能的同时,将计算FLOPs降低约80%。
链接: https://arxiv.org/abs/2603.14549
作者: Surendra Pathak,Bo Han
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:While Large Vision-Language Models (LVLMs) demonstrate exceptional multi-modal capabilities, the quadratic computational cost of processing high-resolution visual tokens remains a critical bottleneck. Though recent token reduction strategies attempt to accelerate inference, such methods inadequately exploit attention values and fail to address token redundancy. More critically, they overlook the "attention shift" phenomenon inherent in LVLMs, which skews token attention scores. In this work, we propose ASAP, a novel training-free, KV-Cache-compatible pruning recipe that comprehensively addresses these limitations. First, we mitigate the attention shift by utilizing a dynamic bidirectional soft attention mask, ensuring the selection of genuinely informative tokens rather than naive attention-based selection. Second, we posit that high semantic redundancy within the token set degrades performance. We therefore introduce a weighted soft merging component that merges semantically similar tokens, preserving only the most feature-dense visual patches for subsequent layers. ASAP achieves virtually lossless compression of visual context, retaining 99.02% of the original LLaVA-NeXT-7B performance while aggressively slashing computational FLOPs by ~80%.
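"按注意力保留信息量大的 token,再把语义相近的被剪枝 token 软合并进保留 token"这一思路,可以用如下极简草图表达(这只是剪枝-合并的一般流程示意,与 ASAP 的双向软注意力掩码和具体合并实现无关):

```python
import math

def cosine(u, v):
    # 余弦相似度
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def prune_and_merge(tokens, attn_scores, keep):
    # 1) 按注意力分数保留 top-k 视觉 token
    order = sorted(range(len(tokens)), key=lambda i: -attn_scores[i])
    kept, dropped = order[:keep], order[keep:]
    merged = [list(tokens[i]) for i in kept]
    weight = [1.0] * keep
    # 2) 被剪枝的 token 按余弦相似度加权平均并入最相似的保留 token
    for i in dropped:
        j = max(range(keep), key=lambda k: cosine(tokens[i], merged[k]))
        w = weight[j]
        merged[j] = [(w * a + b) / (w + 1) for a, b in zip(merged[j], tokens[i])]
        weight[j] += 1
    return merged

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
merged = prune_and_merge(tokens, [0.9, 0.2, 0.8], keep=2)
```

合并而非直接丢弃,是为了在压缩视觉上下文的同时尽量保留被剪枝 token 携带的语义信息。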
[CV-155] Distilling Latent Manifolds: Resolution Extrapolation by Variational Autoencoders
【速读】:该论文旨在解决变分自编码器(Variational Autoencoder, VAE)编码器在知识蒸馏过程中存在的分辨率适应性问题:传统方法认为蒸馏模型仅在训练分辨率下表现良好,难以泛化到更高分辨率输入。研究发现,一个仅在低分辨率(如256²)下蒸馏的紧凑编码器,在其原生分辨率下重建性能较差,但在未见过的高分辨率(如512²)输入上却表现出显著提升的重建效果。解决方案的关键在于认识到VAE编码器蒸馏学习的是分辨率一致的潜在流形(resolution-consistent latent manifolds),而非特定分辨率的像素映射;通过简单的输入重映射策略(即对输入先上采样至高分辨率再编码,输出再下采样评估),即可实现跨分辨率的有效泛化,从而无需高分辨率训练数据或昂贵计算资源即可获得高质量高分辨率图像重建能力。
链接: https://arxiv.org/abs/2603.14536
作者: Jiaming Chu,Tao Wang,Lei Jin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Variational Autoencoder (VAE) encoders play a critical role in modern generative models, yet their computational cost often motivates the use of knowledge distillation or quantization to obtain compact alternatives. Existing studies typically assume that models perform better on samples close to their training data distribution than on unseen distributions. In this work, we report a counter-intuitive phenomenon in VAE encoder distillation: a compact encoder distilled only at low resolutions exhibits poor reconstruction performance at its native resolution, but achieves dramatically improved results when evaluated at higher, unseen input resolutions. Despite never being trained beyond 256^2 resolution, the distilled encoder generalizes effectively to 512^2 resolution inputs, partially inheriting the teacher model's resolution generalization. We further analyze latent distributions across resolutions and find that higher-resolution inputs produce latent representations more closely aligned with the teacher's manifold. Through extensive experiments on ImageNet-256, we show that simple resolution remapping (upsampling inputs before encoding and downsampling reconstructions for evaluation) leads to substantial gains across PSNR, MSE, SSIM, LPIPS, and rFID metrics. These findings suggest that VAE encoder distillation learns resolution-consistent latent manifolds rather than resolution-specific pixel mappings. This also means that high training costs in memory, time, and high-resolution datasets are not necessary conditions for distilling a VAE with high-resolution image reconstruction capabilities. On low-resolution datasets, the distilled model can still learn the teacher model's detailed knowledge of high-resolution image reconstruction.
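文中的"分辨率重映射"流程(编码前先上采样输入,评估前再对重建结果下采样)可以用如下纯 Python 数据流草图说明(这里用恒等映射代替真实 VAE,上/下采样方式也是随意假设,仅演示流程而非论文实现):

```python
def upsample2x(img):
    # 最近邻上采样 2 倍
    out = []
    for row in img:
        r = []
        for v in row:
            r += [v, v]
        out += [r, r[:]]
    return out

def downsample2x(img):
    # 2x2 平均池化下采样
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * i][2 * j] + img[2 * i][2 * j + 1]
              + img[2 * i + 1][2 * j] + img[2 * i + 1][2 * j + 1]) / 4
             for j in range(w)] for i in range(h)]

def remap_reconstruct(img, autoencoder):
    # 分辨率重映射:先上采样到更高分辨率再编码重建,评估前下采样回原分辨率
    return downsample2x(autoencoder(upsample2x(img)))

identity = lambda x: x  # 用恒等映射代替真实 VAE 编码-解码,仅演示数据流
rec = remap_reconstruct([[1.0, 2.0], [3.0, 4.0]], identity)
```

实际评估中 autoencoder 会替换为蒸馏得到的 VAE 编码器加解码器,此时上采样后的输入更接近教师模型的潜在流形,重建指标反而更好。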
[CV-156] Interp3R: Continuous-time 3D Geometry Estimation with Frames and Events
【速读】:该论文旨在解决基于帧的三维视觉基础模型(如DUSt3R)仅能在图像捕获的离散时间点恢复场景几何信息的问题,从而导致连续帧之间“盲区”内的场景演化无法被建模。解决方案的关键在于提出Interp3R,该方法首次将点图(pointmap)基础模型扩展至任意时间点的深度与相机位姿估计,其核心创新是利用异步事件数据(asynchronous event data)对帧基模型生成的点图进行插值,从而获得时序连续的几何表示;随后通过将插值得到的点图与原始帧基点图对齐,构建一致的空间框架以联合恢复深度和相机姿态。
链接: https://arxiv.org/abs/2603.14528
作者: Shuang Guo,Filbert Febryanto,Lei Sun,Guillermo Gallego
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 18 pages, 6 figures, 5 tables
Abstract:In recent years, 3D visual foundation models pioneered by pointmap-based approaches such as DUSt3R have attracted a lot of interest, achieving impressive accuracy and strong generalization across diverse scenes. However, these methods are inherently limited to recovering scene geometry only at the discrete time instants when images are captured, leaving the scene evolution during the blind time between consecutive frames largely unexplored. We introduce Interp3R, to the best of our knowledge the first method that enhances pointmap-based models to estimate depth and camera poses at arbitrary time instants. Interp3R leverages asynchronous event data to interpolate pointmaps produced by frame-based models, enabling temporally continuous geometric representations. Depth and camera poses are then jointly recovered by aligning the interpolated pointmaps together with those predicted by the underlying frame-based models into a consistent spatial framework. We train Interp3R exclusively on a synthetic dataset, yet demonstrate strong generalization across a wide range of synthetic and real-world benchmarks. Extensive experiments show that Interp3R outperforms by a considerable margin state-of-the-art baselines that follow a two-stage pipeline of 2D video frame interpolation followed by 3D geometry estimation.
[CV-157] LatSearch: Latent Reward-Guided Search for Faster Inference-Time Scaling in Video Diffusion
【速读】:该论文旨在解决视频扩散模型(video diffusion)在推理阶段(inference-time)生成质量与可控性不足的问题,特别是现有方法依赖初始噪声优化或仅在去噪完成后的视频上评估奖励信号,导致奖励稀疏、误差累积及计算成本高昂,难以应用高效搜索算法。其解决方案的关键在于引入潜在奖励引导机制(latent reward guidance),通过构建一个可评分部分去噪潜空间表示的潜在奖励模型(latent reward model),在去噪轨迹中提供中间、信息丰富且高效的反馈信号,从而支持更有效的推理时搜索。具体而言,作者提出LatSearch方法,结合按奖励归一化概率进行的奖励引导重采样(Reward-Guided Resampling)和最终阶段的奖励累积剪枝(Pruning),显著提升视频生成的质量、样本效率和可控性,同时降低计算开销。
链接: https://arxiv.org/abs/2603.14526
作者: Zengqun Zhao,Ziquan Liu,Yu Cao,Shaogang Gong,Zhensong Zhang,Jifei Song,Jiankang Deng,Ioannis Patras
机构: 1: University College London (伦敦大学学院); 2: Tencent (腾讯); 3: Huawei (华为)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: see this https URL
Abstract:The recent success of inference-time scaling in large language models has inspired similar explorations in video diffusion. In particular, motivated by the existence of “golden noise” that enhances video quality, prior work has attempted to improve inference by optimising or searching for better initial noise. However, these approaches have notable limitations: they either rely on priors imposed at the beginning of noise sampling or on rewards evaluated only on the denoised and decoded videos. This leads to error accumulation, delayed and sparse reward signals, and prohibitive computational cost, which prevents the use of stronger search algorithms. Crucially, stronger search algorithms are precisely what could unlock substantial gains in controllability, sample efficiency and generation quality for video diffusion, provided their computational cost can be reduced. To fill in this gap, we enable efficient inference-time scaling for video diffusion through latent reward guidance, which provides intermediate, informative and efficient feedback along the denoising trajectory. We introduce a latent reward model that scores partially denoised latents at arbitrary timesteps with respect to visual quality, motion quality, and text alignment. Building on this model, we propose LatSearch, a novel inference-time search mechanism that performs Reward-Guided Resampling and Pruning (RGRP). In the resampling stage, candidates are sampled according to reward-normalised probabilities to reduce over-reliance on the reward model. In the pruning stage, applied at the final scheduled step, only the candidate with the highest cumulative reward is retained, improving both quality and efficiency. We evaluate LatSearch on the VBench-2.0 benchmark and demonstrate that it consistently improves video generation across multiple evaluation dimensions compared to the baseline Wan2.1 model.
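"按奖励归一化概率重采样候选、最后只保留累计奖励最高候选"这两步可以用如下草图表示(softmax 温度、随机数种子等细节均为演示用的假设,与论文实现无关):

```python
import math
import random

def _pick(probs, u):
    # 按累计概率从分布中抽取一个下标
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if u <= acc:
            return i
    return len(probs) - 1

def reward_resample(candidates, rewards, n, temperature=1.0, rng=None):
    # 重采样阶段:按奖励的 softmax 归一化概率抽取候选,
    # 避免贪心选择造成的对奖励模型的过度依赖
    rng = rng or random.Random(0)
    m = max(rewards)
    w = [math.exp((r - m) / temperature) for r in rewards]
    z = sum(w)
    probs = [x / z for x in w]
    return [candidates[_pick(probs, rng.random())] for _ in range(n)]

def prune(candidates, cumulative_rewards):
    # 剪枝阶段(仅在最终调度步执行):只保留累计奖励最高的候选
    return candidates[max(range(len(candidates)),
                          key=lambda i: cumulative_rewards[i])]
```

重采样保证了探索(低奖励候选仍有非零概率被保留),而最终剪枝只在最后一步做一次贪心选择,这正是文中"兼顾质量与效率"设计取舍的一个缩影。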
[CV-158] VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning
【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在具身智能任务中依赖文本链式思维推理、无法动态感知环境以解决长期任务中歧义的问题。其核心解决方案是提出VLA-Thinker框架,将感知建模为可动态调用的推理动作,从而实现"带图像思考"的推理机制;关键创新在于采用两阶段训练策略:第一阶段通过精心构建的视觉链式思维数据进行监督微调(SFT),激活结构化推理与工具使用行为;第二阶段基于组相对策略优化(Group Relative Policy Optimization, GRPO)的强化学习对完整推理-动作轨迹进行对齐,提升任务成功率。
链接: https://arxiv.org/abs/2603.14523
作者: Chaoyang Wang,Wenrui Bao,Sicheng Gao,Bingxin Xu,Yu Tian,Yogesh S. Rawat,Yunhao Ge,Yuzhang Shang
机构: University of Central Florida (中央佛罗里达大学); University of Würzburg (维尔茨堡大学); University of Southern California (南加州大学); NVIDIA Research (英伟达研究)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: We introduce VLA-Thinker, the first VLA model capable of thinking-with-image reasoning, which models visual perception as a dynamically invocable reasoning action, enabling Multimodal Embodied Chain-of-Thought
Abstract:Vision-Language-Action (VLA) models have shown promising capabilities for embodied intelligence, but most existing approaches rely on text-based chain-of-thought reasoning where visual inputs are treated as static context. This limits the ability of the model to actively revisit the environment and resolve ambiguities during long-horizon tasks. We propose VLA-Thinker, a thinking-with-image reasoning framework that models perception as a dynamically invocable reasoning action. To train such a system, we introduce a two-stage training pipeline consisting of (1) an SFT cold-start phase with curated visual Chain-of-Thought data to activate structured reasoning and tool-use behaviors, and (2) GRPO-based reinforcement learning to align complete reasoning-action trajectories with task-level success. Extensive experiments on LIBERO and RoboTwin 2.0 benchmarks demonstrate that VLA-Thinker significantly improves manipulation performance, achieving 97.5% success rate on LIBERO and strong gains across long-horizon robotic tasks. Project and Codes: this https URL .
[CV-159] Expanding mmWave Datasets for Human Pose Estimation with Unlabeled Data and LiDAR Datasets CVPR2026
【速读】:该论文旨在解决当前毫米波(mmWave)人体姿态估计(HPE)数据集稀缺且多样性不足的问题,这严重限制了模型的泛化能力。解决方案的关键在于提出一种名为EMDUL的新方法,其核心是通过训练一个伪标签估计器来标注未标记的mmWave点云(PC),并利用LiDAR HPE数据集中的标注点云进行跨模态转换——将LiDAR PC翻译为对应的mmWave PC。通过融合LiDAR转换和伪标签标注的mmWave数据,显著提升了HPE模型在域内和域外场景下的性能,分别实现15.1%和18.9%的误差降低。
链接: https://arxiv.org/abs/2603.14507
作者: Zhuoxuan Peng,Boan Zhu,Xingjian Zhang,Wenying Li,S.-H. Gary Chan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026
Abstract:Current mmWave datasets for human pose estimation (HPE) are scarce and lack diversity in both point cloud (PC) attributes and human poses, severely hampering the generalization ability of their trained models. On the other hand, unlabeled mmWave HPE data and diverse LiDAR HPE datasets are readily available. We propose EMDUL, a novel approach to expand the volume and diversity of an existing mmWave dataset using unlabeled mmWave data and a LiDAR dataset. EMDUL trains a pseudo-label estimator to annotate the unlabeled mmWave data and is able to convert, or translate, a given annotated LiDAR PC to its mmWave counterpart. Expanded with both LiDAR-converted and pseudo-labeled mmWave PCs, our mmWave dataset significantly boosts the performance and generalization ability of all our HPE models, with substantial 15.1% and 18.9% error reductions for in-domain and out-of-domain settings, respectively.
[CV-160] Unlocking the Latent Canvas: Eliciting and Benchmarking Symbolic Visual Expression in LLM s
【速读】:该论文旨在解决当前多模态方法中视觉生成被视作外部过程、忽视大语言模型(Large Language Models, LLMs)内在视觉表征能力的问题。其核心挑战在于如何在纯文本空间内实现高效的符号化视觉表达与理解,而非依赖像素渲染或代码执行等外部机制。解决方案的关键在于提出SVE-ASCII框架,利用ASCII艺术这一紧凑、高效且文本原生的视觉格式,构建了一个统一的符号视觉表达系统;并通过“Seed-and-Evolve”数据合成管道创建高质量的ASCIIArt-7K数据集,并采用联合指令微调策略同时优化文本到ASCII生成和ASCII到文本理解任务。实验揭示了生成与理解之间的双向增强关系——生成训练显著提升视觉理解能力,验证了符号视觉处理中存在互促循环,为原生文本驱动的视觉智能提供了新的基准和方法论支持。
链接: https://arxiv.org/abs/2603.14505
作者: Yiren Zheng,Shibo Li,Jiaming Liu,Haofan Wang,Yiren Song
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current multimodal approaches predominantly treat visual generation as an external process, relying on pixel rendering or code execution, thereby overlooking the native visual representation capabilities latent within Large Language Models (LLMs). In this work, we unlock this potential through ASCII art, a compact, efficient, and text-native visual format. We introduce SVE-ASCII, a unified framework designed to elicit and benchmark Symbolic Visual Expression directly within the pure text space. To address the scarcity of systematic resources, we construct ASCIIArt-7K, a high-quality dataset synthesized via a novel “Seed-and-Evolve” pipeline that augments human-curated anchors through in-context stylistic editing. We further implement a unified instruction-tuning strategy that jointly optimizes for both Generation (Text-to-ASCII) and Understanding (ASCII-to-Text). Crucially, our experiments reveal a critical phenomenon regarding task duality: while it is established that perception aids generation, we provide compelling evidence that generative training significantly enhances visual comprehension. This confirms a mutually reinforcing cycle in symbolic visual processing, a relationship previously hypothesized but rarely empirically demonstrated in the visual domain. We release our dataset, the ASCIIArt-Bench benchmark, and the SVE-ASCII model, establishing a robust baseline for native text-based visual intelligence.
[CV-161] Trust-Region Noise Search for Black-Box Alignment of Diffusion and Flow Models ICLR
【速读】:该论文旨在解决扩散模型(Diffusion Models)和流模型(Flow Models)在推理阶段如何高效对齐目标奖励函数的问题,尤其针对现有方法受限于可微分或低成本奖励模型、预训练生成模型结构的约束,以及内存与计算效率低下的缺陷。其解决方案的关键在于提出一种简单的信任域搜索算法(Trust-Region Search, TRS),将预训练的生成模型和奖励模型视为黑箱,仅优化源噪声(source noise),从而实现全局探索与局部开发的良好平衡,且适用于多种生成场景和奖励模型,同时显著减少超参数调优需求。
链接: https://arxiv.org/abs/2603.14504
作者: Niklas Schweiger,Daniel Cremers,Karnik Ram
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint (shorter version accepted at ICLR ReaLM-GEN workshop)
Abstract:Optimizing the noise samples of diffusion and flow models is an increasingly popular approach to align these models to target rewards at inference time. However, we observe that these approaches are usually restricted to differentiable or cheap reward models, the formulation of the underlying pretrained generative model, or are memory/compute inefficient. We instead propose a simple trust-region based search algorithm (TRS) which treats the pre-trained generative and reward models as a black-box and only optimizes the source noise. Our approach achieves a good balance between global exploration and local exploitation, and is versatile and easily adaptable to various generative settings and reward models with minimal hyperparameter tuning. We evaluate TRS across text-to-image, molecule and protein design tasks, and obtain significantly improved output samples over the base generative models and other inference-time alignment approaches which optimize the source noise sample, or even the entire reverse-time sampling noise trajectories in the case of diffusion models. Our source code is publicly available.
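一个最小化的黑箱信任域噪声搜索循环如下(半径扩/缩系数、上下界等均为假设值,仅演示"把生成模型与奖励模型当作黑箱、只优化源噪声"的基本思路,并非论文 TRS 的具体算法):

```python
import random

def trust_region_search(reward_fn, dim, iters=200, radius=1.0, seed=0):
    # 黑箱信任域搜索草图:在当前最优噪声附近采样候选,
    # 有改进则接受并扩大信任域半径,否则收缩半径(局部开发 vs 全局探索)。
    rng = random.Random(seed)
    best = [rng.gauss(0, 1) for _ in range(dim)]  # 初始源噪声
    best_r = reward_fn(best)
    for _ in range(iters):
        cand = [b + rng.gauss(0, radius) for b in best]
        r = reward_fn(cand)
        if r > best_r:
            best, best_r = cand, r
            radius = min(radius * 1.5, 2.0)   # 接受:扩大信任域
        else:
            radius = max(radius * 0.9, 1e-3)  # 拒绝:收缩信任域
    return best, best_r
```

其中 reward_fn 在实际场景中对应"生成模型前向采样 + 奖励模型打分"的完整黑箱流程,因此无需奖励可微,也与底层是扩散还是流模型无关。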
[CV-162] Mapping Dark-Matter Clusters via Physics-Guided Diffusion Models
【速读】:该论文旨在解决银河系团质量重建在大规模巡天背景下缺乏可扩展性和基准测试工具的问题,尤其是在未来宽视场巡天将产生数十万计星系团的背景下。解决方案的关键在于提出一种完全自动化的质量重构方法,其核心是基于DarkClusters-15k这一包含15,000个模拟星系团的大型数据集(涵盖多红移和多种模拟框架),训练了一个即插即用的扩散先验模型(diffusion prior),该模型学习了质量与光度之间的统计关系,并通过弱引力透镜和强引力透镜观测约束生成后验样本,从而实现物理驱动且不确定性校准的质量分布重建。该方法无需专家调参、计算效率高(分钟级运行)、精度优于传统方法,并能匹配人工调优结果(如MACS 1206星系团)。
链接: https://arxiv.org/abs/2603.14503
作者: Diego Royo,Brandon Zhao,Adolfo Muñoz,Diego Gutierrez,Katherine L. Bouman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cosmology and Nongalactic Astrophysics (astro-ph.CO)
备注: 22 pages, 7 figures. Project page available at: this https URL
Abstract:Galaxy clusters are powerful probes of astrophysics and cosmology through gravitational lensing: the clusters’ mass, dominated by 85% dark matter, distorts background light. Yet, mass reconstruction lacks the scalability and large-scale benchmarks to process the hundreds of thousands of clusters expected from forthcoming wide-field surveys. We introduce a fully automated method to reconstruct cluster surface mass density from photometry and gravitational lensing observables. Central to our approach is DarkClusters-15k, our new dataset of 15,000 simulated clusters with paired mass and photometry maps, the largest benchmark to date, spanning multiple redshifts and simulation frameworks. We train a plug-and-play diffusion prior on DarkClusters-15k that learns the statistical relationship between mass and light, and draw posterior samples constrained by weak- and strong-lensing observables; this yields principled reconstructions driven by explicit physics, alongside well-calibrated uncertainties. Our approach requires no expert tuning, runs in minutes rather than hours, achieves higher accuracy, and matches expertly-tuned reconstructions of the MACS 1206 cluster. We release our method and DarkClusters-15k to support development and benchmarking for upcoming wide-field cosmological surveys.
[CV-163] R3DP: Real-Time 3D-Aware Policy for Embodied Manipulation
【速读】:该论文旨在解决机器人具身操作(embodied manipulation)中因依赖高精度3D理解而带来的实时控制延迟问题。现有大型3D视觉模型虽能提供强大的先验知识,但其计算开销导致无法满足实时控制需求。解决方案的关键在于提出Real-time 3D-aware Policy (R3DP),其核心创新是异步快慢协同模块(asynchronous fast-slow collaboration module),该模块通过仅在稀疏关键帧上调用预训练的慢速系统(VGGT)并结合轻量级时序特征预测网络(Temporal Feature Prediction Network, TFPNet)对中间帧进行特征预测,从而在保持实时性的同时利用历史数据增强时序一致性;同时引入多视角特征融合器(Multi-View Feature Fuser, MVFF)以显式融合相机内参与外参信息,提升多视角特征聚合效果。此设计实现了大模型先验与实时推理的高效集成,显著提升了任务成功率并降低了44.8%的推理时间。
链接: https://arxiv.org/abs/2603.14498
作者: Yuhao Zhang,Wanxi Dong,Yue Shi,Yi Liang,Jingnan Gao,Qiaochu Yang,Yaxing Lyu,Zhixuan Liang,Yibin Liu,Congsheng Xu,Xianda Guo,Wei Sui,Yaohui Jin,Xiaokang Yang,Yanyan Xu,Yao Mu
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Embodied manipulation requires accurate 3D understanding of objects and their spatial relations to plan and execute contact-rich actions. While large-scale 3D vision models provide strong priors, their computational cost incurs prohibitive latency for real-time control. We propose Real-time 3D-aware Policy (R3DP), which integrates powerful 3D priors into manipulation policies without sacrificing real-time performance. A core innovation of R3DP is the asynchronous fast-slow collaboration module, which seamlessly integrates large-scale 3D priors into the policy without compromising real-time performance. The system maintains real-time efficiency by querying the pre-trained slow system (VGGT) only on sparse key frames, while simultaneously employing a lightweight Temporal Feature Prediction Network (TFPNet) to predict features for all intermediate frames. By leveraging historical data to exploit temporal correlations, TFPNet explicitly improves task success rates through consistent feature estimation. Additionally, to enable more effective multi-view fusion, we introduce a Multi-View Feature Fuser (MVFF) that aggregates features across views by explicitly incorporating camera intrinsics and extrinsics. R3DP offers a plug-and-play solution for integrating large models into real-time inference systems. We evaluate R3DP against multiple baselines across different visual configurations. R3DP effectively harnesses large-scale 3D priors to achieve superior results, outperforming single-view and multi-view DP by 32.9% and 51.4% in average success rate, respectively. Furthermore, by decoupling heavy 3D reasoning from policy execution, R3DP achieves a 44.8% reduction in inference time compared to a naive DP+VGGT integration.
[CV-164] WorldVLM: Combining World Model Forecasting and Vision-Language Reasoning
【速读】:该论文旨在解决当前自动驾驶系统在场景理解与动态预测之间存在的局限性问题,尤其是视觉-语言模型(Vision-Language Models, VLMs)因空间感知能力有限而难以作为端到端驾驶模型的问题,以及世界模型(World Models, WM)虽能有效预测环境演化但缺乏高阶语义推理能力的不足。解决方案的关键在于提出一种名为WorldVLM的混合架构,通过将VLM的上下文推理能力与WM的动态预测能力有机结合:由VLM生成高层行为指令以引导WM进行更可解释、情境感知的决策,从而实现对复杂交通场景中动态变化的精准建模与可控驱动。
链接: https://arxiv.org/abs/2603.14497
作者: Stefan Englmeier,Katharina Winter,Fabian B. Flohr
机构: Munich University of Applied Sciences (慕尼黑应用技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages, 6 figures, 5 tables
Abstract:Autonomous driving systems depend on models that can reason about high-level scene contexts and accurately predict the dynamics of their surrounding environment. Vision-Language Models (VLMs) have recently emerged as promising tools for decision-making and scene understanding, offering strong capabilities in contextual reasoning. However, their limited spatial comprehension constrains their effectiveness as end-to-end driving models. World Models (WM) internalize environmental dynamics to predict future scene evolution. Recently explored as ego-motion predictors and foundation models for autonomous driving, they represent a promising direction for addressing key challenges in the field, particularly enhancing generalization while maintaining dynamic prediction. To leverage the complementary strengths of context-based decision making and prediction, we propose WorldVLM: a hybrid architecture that unifies VLMs and WMs. In our design, the high-level VLM generates behavior commands to guide the driving WM, enabling interpretable and context-aware actions. We evaluate conditioning strategies and provide insights into the hybrid design challenges.
[CV-165] Refining 3D Medical Segmentation with Verbal Instruction
【速读】:该论文旨在解决3D医学解剖结构分割中因训练数据有限、标注质量不足及训练与部署分布差异导致的形状预测不准确问题,尤其关注如何通过临床医生的语言指令实现迭代修正。其解决方案的关键在于构建了一个名为CoWTalk的基准数据集,包含可控合成的3D动脉解剖错误及其对应的修复指令,并提出一种基于向量集合表示3D形状的迭代精化模型,该模型能与文本指令交互以逐步更新目标形状,从而实现语言驱动的医生成果闭环修正。
链接: https://arxiv.org/abs/2603.14496
作者: Kangxian Xie,Jiancheng Yang,Nandor Pinter,Chao Wu,Behzad Bozorgtabar,Mingchen Gao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Accurate 3D anatomical segmentation is essential for clinical diagnosis and surgical planning. However, automated models frequently generate suboptimal shape predictions due to factors such as limited and imbalanced training data, inadequate labeling quality, and distribution shifts between training and deployment settings. A natural solution is to iteratively refine the predicted shape based on the radiologists’ verbal instructions. However, this is hindered by the scarcity of paired data that explicitly links erroneous shapes to corresponding corrective instructions. As an initial step toward addressing this limitation, we introduce CoWTalk, a benchmark comprising 3D arterial anatomies with controllable synthesized anatomical errors and their corresponding repairing instructions. Building on this benchmark, we further propose an iterative refinement model that represents 3D shapes as vector sets and interacts with textual instructions to progressively update the target shape. Experimental results demonstrate that our method achieves significant improvements over corrupted inputs and competitive baselines, highlighting the feasibility of language-driven clinician-in-the-loop refinement for 3D medical shapes modeling.
[CV-166] V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning
【速读】:该论文旨在解决如何在图像和视频中学习高密度、高质量的视觉表征,同时保持对全局场景的强理解能力的问题。其核心挑战在于实现空间结构化、语义连贯且时间一致的视觉表示,以支持下游任务如物体交互预测、动作识别与机器人操作等。解决方案的关键在于提出V-JEPA 2.1模型架构,包含四大设计:(1)密集预测损失(dense predictive loss),通过掩码机制使可见与被遮挡token共同贡献训练信号,强化时空定位;(2)深度自监督(deep self-supervision),在多个中间编码层上分层应用自监督目标以提升表征质量;(3)多模态分词器(multi-modal tokenizers),实现图像与视频的统一训练;(4)模型容量与训练数据的有效扩展策略。这些组件协同作用,显著提升了视觉表征的密度与泛化能力,在多个基准测试中达到当前最优性能。
链接: https://arxiv.org/abs/2603.14482
作者: Lorenzo Mur-Labadia,Matthew Muckley,Amir Bar,Mido Assran,Koustuv Sinha,Mike Rabbat,Yann LeCun,Nicolas Ballas,Adrien Bardes
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present V-JEPA 2.1, a family of self-supervised models that learn dense, high-quality visual representations for both images and videos while retaining strong global scene understanding. The approach combines four key components. First, a dense predictive loss uses a masking-based objective in which both visible and masked tokens contribute to the training signal, encouraging explicit spatial and temporal grounding. Second, deep self-supervision applies the self-supervised objective hierarchically across multiple intermediate encoder layers to improve representation quality. Third, multi-modal tokenizers enable unified training across images and videos. Finally, the model benefits from effective scaling in both model capacity and training data. Together, these design choices produce representations that are spatially structured, semantically coherent, and temporally consistent. Empirically, V-JEPA 2.1 achieves state-of-the-art performance on several challenging benchmarks, including 7.71 mAP on Ego4D for short-term object-interaction anticipation and 40.8 Recall@5 on EPIC-KITCHENS for high-level action anticipation, as well as a 20-point improvement in real-robot grasping success rate over V-JEPA-2 AC. The model also demonstrates strong performance in robotic navigation (5.687 ATE on TartanDrive), depth estimation (0.307 RMSE on NYUv2 with a linear probe), and global recognition (77.7 on Something-Something-V2). These results show that V-JEPA 2.1 significantly advances the state of the art in dense visual understanding and world modeling.
[CV-167] Wi-Spike: A Low-power WiFi Human Multi-action Recognition Model with Spiking Neural Networks
【速读】:该论文旨在解决WiFi感知中人类动作识别(Human Action Recognition, HAR)模型在能量效率方面的不足问题,尤其是在边缘计算场景下对低功耗和高精度的双重需求。其解决方案的关键在于提出一种受生物启发的脉冲神经网络(Spiking Neural Network, SNN)框架Wi-Spike,通过引入事件驱动的脉冲卷积层实现时空特征提取,并设计了一种新颖的时间注意力机制以增强判别性表征;同时,利用脉冲全连接层与投票层完成特征编码与分类。实验表明,该方法在保持95.83%识别准确率的同时,能耗至少降低50%,并在多动作识别任务上达到新的最优性能,为实时、节能的边缘感知应用提供了可行方案。
链接: https://arxiv.org/abs/2603.14475
作者: Nengbo Zhang,Yao Ying,Lu Wang,Kaishun Wu,Jieming Ma,Fei Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:WiFi-based human action recognition (HAR) has gained significant attention due to its non-intrusive and privacy-preserving nature. However, most existing WiFi sensing models predominantly focus on improving recognition accuracy, while issues of power consumption and energy efficiency remain insufficiently discussed. In this work, we present Wi-Spike, a bio-inspired spiking neural network (SNN) framework for efficient and accurate action recognition using WiFi channel state information (CSI) signals. Specifically, leveraging the event-driven and low-power characteristics of SNNs, Wi-Spike introduces spiking convolutional layers for spatio-temporal feature extraction and a novel temporal attention mechanism to enhance discriminative representation. The extracted features are subsequently encoded and classified through spiking fully connected layers and a voting layer. Comprehensive experiments on three benchmark datasets (NTU-Fi-HAR, NTU-Fi-HumanID, and UT-HAR) demonstrate that Wi-Spike achieves competitive accuracy in single-action recognition and superior performance in multi-action recognition tasks. As for energy consumption, Wi-Spike reduces the energy cost by at least half compared with other methods, while still achieving 95.83% recognition accuracy in human activity recognition. More importantly, Wi-Spike establishes a new state-of-the-art in WiFi-based multi-action HAR, offering a promising solution for real-time, energy-efficient edge sensing applications.
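Wi-Spike 以事件驱动的脉冲神经元为基本计算单元。下面给出泄漏积分-发放(LIF, Leaky Integrate-and-Fire)神经元的极简纯 Python 示意:阈值、泄漏系数与输入序列均为任意假设,并非 Wi-Spike 论文中的实际设置。

```python
# 示意:泄漏积分-发放(LIF)脉冲神经元,SNN 的基本计算单元。
# 阈值、泄漏系数等均为任意假设,并非 Wi-Spike 论文中的设置。

def lif_neuron(inputs, threshold=1.0, leak=0.9):
    """膜电位按 leak 衰减并累加输入;越过阈值则发放脉冲并复位。"""
    v = 0.0
    spikes = []
    for x in inputs:
        v = leak * v + x          # 泄漏 + 积分
        if v >= threshold:
            spikes.append(1)      # 发放脉冲(事件驱动:仅此时有输出)
            v = 0.0               # 复位膜电位
        else:
            spikes.append(0)
    return spikes

# 弱输入单次不足以触发,持续输入逐渐累积至发放
print(lif_neuron([0.3, 0.3, 0.3, 0.3, 0.3]))  # → [0, 0, 0, 1, 0]
```

这种"无事件则无输出"的稀疏计算正是摘要中所述低功耗特性的来源。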
人工智能
[AI-0] Computational Concept of the Psyche
【速读】:该论文旨在解决人工通用智能(Artificial General Intelligence, AGI)的构建问题,其核心挑战在于如何模拟人类心理机制并实现具有自主决策能力的智能体。解决方案的关键在于提出一种认知架构(cognitive architecture),将 psyche 视为生物或人工主体的操作系统,包含状态空间(state space),其中不仅涵盖主体的需求状态(need states),还整合了感知(sensations)、行动(actions)及其与外部刺激的关系。该架构通过经验学习在需求驱动的状态空间中进行最优决策计算,目标是在不确定性条件下最大化目标达成成功率、最小化存在性风险并提升能量效率,从而为 AGI 系统提供可计算的形式化路径。
链接: https://arxiv.org/abs/2603.15586
作者: Anton Kolonin,Vladimir Krykov
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 19 pages, 5 figures
Abstract:This article presents an overview of approaches to modeling the human psyche in the context of constructing an artificial one. Based on this overview, a concept of cognitive architecture is proposed, in which the psyche is viewed as the operating system of a living or artificial subject, comprising a space of states, including the state of needs that determine the meaning of a subject’s being in relation to stimuli from the external world, and intelligence as a decision-making system regarding actions in this world to satisfy these needs. Based on this concept, a computational formalization is proposed for creating artificial general intelligence systems for an agent through experiential learning in a state space that includes agent’s needs, taking into account their biological or existential significance for the intelligent agent, along with agent’s sensations and actions. Thus, the problem of constructing artificial general intelligence is formalized as a system for making optimal decisions in the space of specific agent needs under conditions of uncertainty, maximizing success in achieving goals, minimizing existential risks, and maximizing energy efficiency. A minimal experimental implementation of the model is presented.
[AI-1] Physics-Informed Neural Systems for the Simulation of EUV Electromagnetic Wave Diffraction from a Lithography Mask
【速读】:该论文旨在解决极紫外(Extreme Ultraviolet, EUV)电磁波在当代光刻掩模上的衍射问题,这是下一代光刻技术设计与优化中的关键挑战。传统数值求解器计算成本高、效率低,难以满足快速迭代需求。解决方案的核心在于提出一种新型混合波导神经算子(Waveguide Neural Operator, WGNO),其将波导方法中计算最密集的模块替换为神经网络,从而实现高精度与高速度的平衡;同时,通过物理信息神经网络(Physics-informed Neural Networks, PINNs)和神经算子(Neural Operators, NOs)对不同波长(13.5 nm 和 11.2 nm)进行建模,实验表明该方法在真实二维和三维掩模场景下不仅保持了与现代数值求解器相当的精度,还显著缩短预测时间,并展现出优异的泛化能力——即对训练中未见参数仍能维持接近训练精度的解算性能。
链接: https://arxiv.org/abs/2603.15584
作者: Vasiliy A. Es’kin,Egor V. Ivanov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph); Computational Physics (physics.comp-ph); Optics (physics.optics)
备注: arXiv admin note: substantial text overlap with arXiv:2507.04153
Abstract:Physics-informed neural networks (PINNs) and neural operators (NOs) for solving the problem of diffraction of Extreme Ultraviolet (EUV) electromagnetic waves from contemporary lithography masks are presented. A novel hybrid Waveguide Neural Operator (WGNO) is introduced, based on a waveguide method with its most computationally expensive components replaced by a neural network. To evaluate performance, the accuracy and inference time of PINNs and NOs are compared against modern numerical solvers for a series of problems with known exact solutions. The emphasis is placed on investigation of solution accuracy by considered artificial neural systems for 13.5 nm and 11.2 nm wavelengths. Numerical experiments on realistic 2D and 3D masks demonstrate that PINNs and neural operators achieve competitive accuracy and significantly reduced prediction times, with the proposed WGNO architecture reaching state-of-the-art performance. The presented neural operator has pronounced generalizing properties, meaning that for unseen problem parameters it delivers a solution accuracy close to that for parameters seen in the training dataset. These results provide a highly efficient solution for accelerating the design and optimization workflows of next-generation lithography masks.
[AI-2] Lore: Repurposing Git Commit Messages as a Structured Knowledge Protocol for AI Coding Agents
【速读】:该论文旨在解决生成式 AI (Generative AI) 编码代理在软件开发过程中日益加剧的制度性知识流失问题——即每次提交(commit)仅记录代码差异(code diff),而丢失了决策背后的关键推理信息,如约束条件、被排除的备选方案及前瞻性上下文,这种被称作“决策阴影”(Decision Shadow)的信息缺失阻碍了代码演化与维护的可追溯性。解决方案的核心在于提出 Lore 协议,该协议通过原生 Git 尾注(git trailers)重构提交消息,形成自包含的决策记录,明确携带约束条件、被拒绝的替代方案、代理指令和验证元数据;Lore 不依赖额外基础设施,仅需 Git 支持,且可通过独立命令行工具查询、被任何具备 shell 执行能力的代理发现,从而实现对决策过程的结构化捕获与可检索性。
链接: https://arxiv.org/abs/2603.15566
作者: Ivan Stetsenko
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 8 pages, 1 figure, 1 table. Preprint available at this https URL
Abstract:As AI coding agents become both primary producers and consumers of source code, the software industry faces an accelerating loss of institutional knowledge. Each commit captures a code diff but discards the reasoning behind it - the constraints, rejected alternatives, and forward-looking context that shaped the decision. I term this discarded reasoning the Decision Shadow. This paper proposes Lore, a lightweight protocol that restructures commit messages - using native git trailers - into self-contained decision records carrying constraints, rejected alternatives, agent directives, and verification metadata. Lore requires no infrastructure beyond git, is queryable via a standalone CLI tool, and is discoverable by any agent capable of running shell commands. The paper formalizes the protocol, compares it against five competing approaches, stress-tests it against its strongest objections, and outlines an empirical validation path.
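Lore 协议的核心是把决策信息编码为提交信息末尾的 git trailers(键值尾注)。下面用纯 Python 给出一个极简的示意性解析器:字段名(Constraint、Rejected-Alternative、Agent-Directive)取自摘要描述,但示例提交信息与解析逻辑均为假设,并非 Lore 的官方实现。

```python
# 示意:从提交信息中解析 git trailer 风格的键值尾注。
# 非 Lore 协议的官方实现;示例提交信息为虚构。

def parse_trailers(message: str) -> dict:
    """解析提交信息最后一段中的 "Key: Value" 尾注行。"""
    paragraphs = message.rstrip().split("\n\n")
    trailers = {}
    for line in paragraphs[-1].splitlines():
        if ": " in line:
            key, value = line.split(": ", 1)
            trailers.setdefault(key.strip(), []).append(value.strip())
    return trailers

commit_msg = """Fix cache invalidation on user update

Switch to write-through strategy to avoid stale reads.

Constraint: p99 latency must stay under 50ms
Rejected-Alternative: TTL-based expiry (races under load)
Agent-Directive: do not reintroduce lazy invalidation
"""

t = parse_trailers(commit_msg)
print(t["Constraint"])             # 约束条件
print(t["Rejected-Alternative"])   # 被否决的备选方案
```

Git 本身也提供 `git interpret-trailers` 等原生命令处理此类尾注;此处仅演示这种结构化尾注对代理而言是机器可检索的。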
[AI-3] The PokeAgent Challenge: Competitive and Long-Context Learning at Scale NEURIPS2025
【速读】:该论文旨在解决当前前沿人工智能在部分可观测性(partial observability)、博弈论推理(game-theoretic reasoning)和长程规划(long-horizon planning)三个关键挑战上难以同时被有效评估的问题。现有基准测试往往无法在现实条件下协同考察这些能力,导致研究进展受限。解决方案的核心是构建PokeAgent Challenge——一个基于《宝可梦》多智能体对战系统与角色扮演游戏(RPG)环境的大规模决策基准,包含两个互补赛道:Battling Track专注于竞争性对战中的策略推理与泛化能力,提供超过2000万条战斗轨迹及多种基线方法(启发式、强化学习与大语言模型);Speedrunning Track则聚焦于RPG场景下的长程规划与序列决策,首次提供标准化的速通评估框架,并集成开源多智能体编排系统以支持模块化、可复现的LLM方法比较。通过NeurIPS 2025竞赛验证了其资源质量与社区兴趣,结果揭示了通用型(LLM)、专业型(RL)与人类精英表现之间显著差距,表明《宝可梦》任务在能力维度上几乎与传统大语言模型基准正交,具有推动强化学习与大语言模型研究向前发展的潜力。
链接: https://arxiv.org/abs/2603.15563
作者: Seth Karten,Jake Grigsby,Tersoo Upaa Jr,Junik Bae,Seonghun Hong,Hyunyoung Jeong,Jaeyoon Jung,Kun Kerdthaisong,Gyungbo Kim,Hyeokgi Kim,Yujin Kim,Eunju Kwon,Dongyu Liu,Patrick Mariglia,Sangyeon Park,Benedikt Schink,Xianwei Shi,Anthony Sistilli,Joseph Twin,Arian Urdu,Matin Urdu,Qiao Wang,Ling Wu,Wenli Zhang,Kunsheng Zhou,Stephanie Milani,Kiran Vodrahalli,Amy Zhang,Fei Fang,Yuke Zhu,Chi Jin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 41 pages, 26 figures, 5 tables. NeurIPS 2025 Competition Track
Abstract:We present the PokeAgent Challenge, a large-scale benchmark for decision-making research built on Pokemon’s multi-agent battle system and expansive role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent targets these limitations at scale through two complementary tracks: our Battling Track, which calls for strategic reasoning and generalization under partial observability in competitive Pokemon battles, and our Speedrunning Track, which requires long-horizon planning and sequential decision-making in the Pokemon RPG. Our Battling Track supplies a dataset of 20M+ battle trajectories alongside a suite of heuristic, RL, and LLM-based baselines capable of high-level competitive play. Our Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system for modular, reproducible comparisons of harness-based LLM approaches. Our NeurIPS 2025 competition validates both the quality of our resources and the research community’s interest in Pokemon, with over 100 teams competing across both tracks and winning solutions detailed in our paper. Participant submissions and our baselines reveal considerable gaps between generalist (LLM), specialist (RL), and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites and positioning Pokemon as an unsolved benchmark that can drive RL and LLM research forward. We transition to a living benchmark with a live leaderboard for Battling and self-contained evaluation for Speedrunning at this https URL.
[AI-4] InterveneBench: Benchmarking LLMs for Intervention Reasoning and Causal Study Design in Real Social Systems
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在社会科学研究中因果推理能力评估不足的问题,尤其是缺乏对基于真实政策干预的端到端因果推理能力的评测。现有基准未能有效检验模型在无预设因果图或结构方程的情况下,对现实社会政策干预及其识别假设进行推理的能力。为此,作者提出了InterveneBench基准,其核心是基于744项经过同行评审的社会科学实证研究构建数据集,要求模型在不依赖预先定义的因果结构条件下进行因果推断。为提升模型性能,论文进一步提出多智能体框架STRIDES,通过协作式推理机制显著改善了当前最先进模型在该任务上的表现。
链接: https://arxiv.org/abs/2603.15542
作者: Shaojie Shi,Zhengyu Shi,Lingran Zheng,Xinyu Su,Anna Xie,Bohao Lv,Rui Xu,Zijian Chen,Zhichao Chen,Guolei Liu,Naifu Zhang,Mingjian Dong,Zhuo Quan,Bohao Chen,Teqi Hao,Yuan Qi,Yinghui Xu,Libo Wu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 35 pages, 3 figures
Abstract:Causal inference in social science relies on end-to-end, intervention-centered research-design reasoning grounded in real-world policy interventions, but current benchmarks fail to evaluate this capability of large language models (LLMs). We present InterveneBench, a benchmark designed to assess such reasoning in realistic social settings. Each instance in InterveneBench is derived from an empirical social science study and requires models to reason about policy interventions and identification assumptions without access to predefined causal graphs or structural equations. InterveneBench comprises 744 peer-reviewed studies across diverse policy domains. Experimental results show that state-of-the-art LLMs struggle under this setting. To address this limitation, we further propose a multi-agent framework, STRIDES. It achieves significant performance improvements over state-of-the-art reasoning models. Our code and data are available at this https URL.
[AI-5] DOT: Dynamic Knob Selection and Online Sampling for Automated Database Tuning
【速读】:该论文旨在解决数据库管理系统(Database Management Systems, DBMS)调优过程中参数搜索空间过大、依赖昂贵预热阶段和人工经验的问题。其核心挑战在于,现代DBMS包含大量可调参数(knobs),但仅有少数对性能有显著影响,如何高效识别并优化这些关键参数是提升系统性能的关键。解决方案的关键在于提出DOT算法,该算法结合递归特征消除与交叉验证(Recursive Feature Elimination with Cross-Validation, RFECV)动态筛选低重要性参数,并利用似然比检验(Likelihood Ratio Test, LRT)策略平衡探索与利用;同时采用贝叶斯优化(Bayesian Optimization, BO)进行在线配置搜索,无需预热阶段或先验知识即可实现高性能调优,从而显著降低调优开销并提升性能表现。
链接: https://arxiv.org/abs/2603.15540
作者: Yifan Wang,Debabrota Basu,Pierre Bourhis,Romain Rouvoy,Patrick Royer
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Database Management Systems (DBMS) are crucial for efficient data management and access control, but their administration remains challenging for Database Administrators (DBAs). Tuning, in particular, is known to be difficult. Modern systems have many tuning parameters, but only a subset significantly impacts performance. Focusing on these influential parameters reduces the search space and optimizes performance. Current methods rely on costly warm-up phases and human expertise to identify important tuning parameters. In this paper, we present DOT, a dynamic knob selection and online sampling DBMS tuning algorithm. DOT uses Recursive Feature Elimination with Cross-Validation (RFECV) to prune low-importance tuning parameters and a Likelihood Ratio Test (LRT) strategy to balance exploration and exploitation. For parameter search, DOT uses a Bayesian Optimization (BO) algorithm to optimize configurations on-the-fly, eliminating the need for warm-up phases or prior knowledge (although existing knowledge can be incorporated). Experiments show that DOT achieves matching or outperforming performance compared to state-of-the-art tuners while substantially reducing tuning overhead.
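DOT 先通过 RFECV 剔除低重要性旋钮以缩小搜索空间。下面是该思路的一个玩具化示意(纯标准库实现,并非 DOT 或 sklearn RFECV 本身):对一个虚构的性能函数,反复剔除对性能影响最小的旋钮,直到只剩目标数量;性能函数与旋钮权重均为人为设定。

```python
# 示意:递归剔除对性能影响最小的调优旋钮(RFE 思想的玩具版)。
# 并非 DOT 或 sklearn RFECV 的实现;性能函数与权重均为虚构。
import random

random.seed(0)

# 模拟 DBMS:8 个旋钮,只有前 3 个对吞吐量有实质影响
WEIGHTS = [5.0, 3.0, 2.0, 0.05, 0.04, 0.03, 0.02, 0.01]

def throughput(config: dict) -> float:
    return sum(WEIGHTS[k] * v for k, v in config.items())

def importance(knob: int, active: list) -> float:
    """以随机采样估计:某旋钮取 0 与取 1 时的平均性能差。"""
    diffs = []
    for _ in range(50):
        base = {k: random.random() for k in active}
        lo, hi = dict(base), dict(base)
        lo[knob], hi[knob] = 0.0, 1.0
        diffs.append(abs(throughput(hi) - throughput(lo)))
    return sum(diffs) / len(diffs)

active = list(range(8))
while len(active) > 3:  # 递归剔除,直到剩下目标数量的旋钮
    scores = {k: importance(k, active) for k in active}
    active.remove(min(scores, key=scores.get))

print(sorted(active))  # → [0, 1, 2],即高权重旋钮被保留
```

真实系统中"重要性"需通过交叉验证的模型(RFECV)估计,而后续配置搜索由贝叶斯优化在线完成;此处仅演示"缩小搜索空间"这一步的逻辑。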
[AI-6] Are Dilemmas and Conflicts in LLM Alignment Solvable? A View from Priority Graph
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在自主运行中因价值观冲突和情境依赖性导致的对齐难题,特别是模型在不同上下文中优先级关系不稳定甚至可被操纵的问题。其解决方案的关键在于构建一个基于模型输出分布的优先级图(priority graph),将指令与价值视为节点、上下文特定的优先关系作为边,从而揭示LLM对齐的动态性和脆弱性;进一步提出运行时验证机制,通过调用外部知识源来锚定上下文,增强模型对“优先级劫持”(priority hacking)攻击的鲁棒性,尽管这一方法无法完全解决哲学上不可约的伦理困境。
链接: https://arxiv.org/abs/2603.15527
作者: Zhenheng Tang,Xiang Liu,Qian Wang,Eunsol Choi,Bo Li,Xiaowen Chu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:As Large Language Models (LLMs) become more powerful and autonomous, they increasingly face conflicts and dilemmas in many scenarios. We first summarize and taxonomize these diverse conflicts. Then, we model the LLM’s preferences to make different choices as a priority graph, where instructions and values are nodes, and the edges represent context-specific priorities determined by the model’s output distribution. This graph reveals that a unified stable LLM alignment is very challenging, because the graph is neither static nor necessarily consistent in different contexts. Besides, it also reveals a potential vulnerability: priority hacking, where adversaries can craft deceptive contexts to manipulate the graph and bypass safety alignments. To counter this, we propose a runtime verification mechanism, enabling LLMs to query external sources to ground their context and resist manipulation. While this approach enhances robustness, we also acknowledge that many ethical and value dilemmas are philosophically irreducible, posing a long-term, open challenge for the future of AI alignment.
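论文把指令与价值建模为节点、上下文相关的优先关系建模为有向边。在这种表示下,"不存在全局一致的优先序"恰好对应图中出现环。以下为纯 Python 的示意(节点与边均为虚构例子,并非论文数据):

```python
# 示意:优先级图的不一致性(环)检测。节点与边为虚构示例。

def find_cycle(edges: dict) -> bool:
    """DFS 三色标记法检测有向图中是否存在环。"""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {u: WHITE for u in edges}
    def dfs(u):
        color[u] = GRAY
        for v in edges.get(u, []):
            if color.get(v, WHITE) == GRAY:
                return True            # 回边:优先关系自相矛盾
            if color.get(v, WHITE) == WHITE and v in edges and dfs(v):
                return True
        color[u] = BLACK
        return False
    return any(color[u] == WHITE and dfs(u) for u in edges)

# "A -> B" 表示在某上下文中 A 的优先级高于 B
consistent = {"安全": ["有用"], "有用": ["简洁"], "简洁": []}
# 不同上下文叠加后可能成环,即不存在全局一致的优先序
conflicted = {"安全": ["有用"], "有用": ["服从用户"], "服从用户": ["安全"]}

print(find_cycle(consistent))   # False:可排出稳定优先序
print(find_cycle(conflicted))   # True:优先级被上下文"劫持"成环
```

论文中的边由模型输出分布随上下文动态决定,因此这种不一致可能被对抗性上下文(priority hacking)诱导出来;此处仅演示图表示下不一致性的检测方式。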
[AI-7] Building Trust in PINNs: Error Estimation through Finite Difference Methods
【速读】:该论文旨在解决物理信息神经网络(Physics-informed Neural Networks, PINNs)在求解偏微分方程(Partial Differential Equations, PDEs)时缺乏对预测误差的量化评估问题,从而限制了其预测结果的可信度。解决方案的关键在于提出一种轻量级后处理方法,通过构建并数值求解一个以PINN残差为源项的误差方程来获得逐点误差估计;该误差方程与原始PDE具有相同的微分算子结构,但无需已知真解即可用有限差分法求解,从而实现高效、可解释的误差定位与定量分析。
链接: https://arxiv.org/abs/2603.15526
作者: Aleksander Krasowski,René P. Klausen,Aycan Celik,Sebastian Lapuschkin,Wojciech Samek,Jonas Naujoks
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注:
Abstract:Physics-informed neural networks (PINNs) constitute a flexible deep learning approach for solving partial differential equations (PDEs), which model phenomena ranging from heat conduction to quantum mechanical systems. Despite their flexibility, PINNs offer limited insight into how their predictions deviate from the true solution, hindering trust in their prediction quality. We propose a lightweight post-hoc method that addresses this gap by producing pointwise error estimates for PINN predictions, which offer a natural form of explanation for such models, identifying not just whether a prediction is wrong, but where and by how much. For linear partial differential equations, the error between a PINN approximation and the true solution satisfies the same differential operator as the original problem, but driven by the PINN’s PDE residual as its source term. We solve this error equation numerically using finite difference methods requiring no knowledge of the true solution. Evaluated on several benchmark PDEs, our method yields accurate error maps at low computational cost, enabling targeted and interpretable validation of PINNs.
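对线性 PDE,误差 e = u - û 满足与原问题相同的算子、以 PINN 残差为源项的方程。下面以一维 Poisson 问题 u'' = f 为例给出纯 Python 示意(Thomas 算法解三对角系统):其中"PINN 预测"是人为构造的带扰动解,整段代码只是对论文思路的玩具化演示,并非论文实现。

```python
# 示意:用有限差分求解"误差方程" e'' = -r,估计 PINN 逐点误差。
# 一维 Poisson 玩具问题;真解与"PINN 预测"均为人为构造。
import math

N = 101
h = 1.0 / (N - 1)
xs = [i * h for i in range(N)]

u_true = [math.sin(math.pi * x) for x in xs]                   # 真解,u'' = f
u_pinn = [u + 0.1 * x * (1 - x) for u, x in zip(u_true, xs)]   # 带伪误差的"PINN 预测"
f = [-math.pi ** 2 * math.sin(math.pi * x) for x in xs]

# PINN 残差 r = û'' - f(内点中心差分)
r = [0.0] * N
for i in range(1, N - 1):
    upp = (u_pinn[i - 1] - 2 * u_pinn[i] + u_pinn[i + 1]) / h ** 2
    r[i] = upp - f[i]

def solve_tridiag(rhs):
    """Thomas 算法解 e'' = rhs(Dirichlet 零边界)。"""
    n = len(rhs) - 2
    a, b, c = -1.0, 2.0, -1.0          # -e_{i-1} + 2 e_i - e_{i+1} = -h^2 rhs_i
    d = [-h ** 2 * rhs[i] for i in range(1, n + 1)]
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c / b, d[0] / b
    for i in range(1, n):
        m = b - a * cp[i - 1]
        cp[i] = c / m
        dp[i] = (d[i] - a * dp[i - 1]) / m
    e = [0.0] * (n + 2)
    e[n] = dp[n - 1]
    for i in range(n - 2, -1, -1):
        e[i + 1] = dp[i] - cp[i] * e[i + 2]
    return e

# 误差方程 e'' = -r;其解应接近真实逐点误差 u_true - u_pinn
e_est = solve_tridiag([-ri for ri in r])
worst = max(abs(e_est[i] - (u_true[i] - u_pinn[i])) for i in range(N))
print(f"误差估计与真实误差的最大偏差: {worst:.2e}")
```

无需已知真解,仅凭 PINN 残差即可重建出逐点误差场,这正是摘要中"targeted and interpretable validation"的含义。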
[AI-8] Seeking SOTA: Time-Series Forecasting Must Adopt Taxonomy-Specific Evaluation to Dispel Illusory Gains
【速读】:该论文旨在解决当前AI/ML时间序列预测模型评估中存在的偏差问题,即主流基准数据集通常具有强且持续的周期性和季节性特征,导致复杂深度学习模型看似表现优越,实则可能仅在捕捉重复模式上优于经典统计方法,而这类性能提升未必反映真正的算法进步。其解决方案的关键在于:(I)淘汰或显著扩充现有基准,引入包含更多非平稳特性(如结构突变、时变波动性和概念漂移)及现实世界中更难预测动态的数据;(II)要求所有深度学习模型提交时必须包含针对特定任务特性的稳健经典和简单基线模型,从而确保报告的性能提升源于方法论上的实质性突破,而非基准选择偏倚。
链接: https://arxiv.org/abs/2603.15506
作者: Raeid Saqur,Christoph Bergmeir,Blanka Horvath,Daniel Schmidt,Frank Rudzicz,Terry Lyons
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Position paper; 8 figures, 8 tables; includes appendix
Abstract:We argue that the current practice of evaluating AI/ML time-series forecasting models, predominantly on benchmarks characterized by strong, persistent periodicities and seasonalities, obscures real progress by overlooking the performance of efficient classical methods. We demonstrate that these “standard” datasets often exhibit dominant autocorrelation patterns and seasonal cycles that can be effectively captured by simpler linear or statistical models, rendering complex deep learning architectures frequently no more performant than their classical counterparts for these specific data characteristics, and raising questions as to whether any marginal improvements justify the significant increase in computational overhead and model complexity. We call on the community to (I) retire or substantially augment current benchmarks with datasets exhibiting a wider spectrum of non-stationarities, such as structural breaks, time-varying volatility, and concept drift, and less predictable dynamics drawn from diverse real-world domains, and (II) require every deep learning submission to include robust classical and simple baselines, appropriately chosen for the specific characteristics of the downstream tasks’ time series. By doing so, we will help ensure that reported gains reflect genuine scientific methodological advances rather than artifacts of benchmark selection favoring models adept at learning repetitive patterns.
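论文主张深度模型必须与针对任务特性选择的健壮简单基线对比。季节性朴素基线(预测值直接取上一季节周期的同期观测)就是这样一个几乎零成本的基线,下面用人工合成的强周期序列做纯 Python 示意(数据与论文实验无关):

```python
# 示意:季节性朴素基线(seasonal naive):预测值取上一周期同期观测。
# 数据为人工合成的强季节性序列,仅用于说明。
import math

PERIOD = 12
series = [10 + 5 * math.sin(2 * math.pi * t / PERIOD) + 0.1 * (t % 3)
          for t in range(120)]

train, test = series[:-PERIOD], series[-PERIOD:]

# 对未来每一步,直接取一个周期之前的观测值作为预测
forecast = train[-PERIOD:]

mae = sum(abs(p - y) for p, y in zip(forecast, test)) / len(test)
naive_last = sum(abs(train[-1] - y) for y in test) / len(test)  # 非季节性朴素基线

print(f"seasonal-naive MAE: {mae:.3f}")
print(f"last-value     MAE: {naive_last:.3f}")
```

在这类周期性占主导的数据上,季节性朴素基线的 MAE 几乎为 0;若复杂模型的"提升"只是相对于更弱的基线(如 last-value),便可能属于论文所说的虚幻增益。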
[AI-9] Understanding Reasoning in LLMs through Strategic Information Allocation under Uncertainty
【速读】:该论文试图解决大语言模型(Large Language Models, LLMs)在推理过程中出现的“顿悟时刻”(Aha moments)机制不明确的问题,尤其是当模型通过诸如“Wait”等词表现出自我修正行为时,其背后的信息处理逻辑尚不清楚。解决方案的关键在于提出一个信息论框架,将推理过程分解为程序性信息(procedural information)和认识论表达(epistemic verbalization)——即显式外化不确定性以支持下游控制动作的能力。研究发现,仅依赖程序性推理会导致信息停滞,而认识论表达能够持续获取信息并实现信息充分性,从而解释了强推理性能的本质来源并非特定表面标记,而是不确定性外化能力。
链接: https://arxiv.org/abs/2603.15500
作者: Jeonghye Kim,Xufang Luo,Minbeom Kim,Sangmook Lee,Dongsheng Li,Yuqing Yang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:LLMs often exhibit Aha moments during reasoning, such as apparent self-correction following tokens like “Wait,” yet their underlying mechanisms remain unclear. We introduce an information-theoretic framework that decomposes reasoning into procedural information and epistemic verbalization - the explicit externalization of uncertainty that supports downstream control actions. We show that purely procedural reasoning can become informationally stagnant, whereas epistemic verbalization enables continued information acquisition and is critical for achieving information sufficiency. Empirical results demonstrate that strong reasoning performance is driven by uncertainty externalization rather than specific surface tokens. Our framework unifies prior findings on Aha moments and post-training experiments, and offers insights for future reasoning model design.
[AI-10] Grokking as a Variance-Limited Phase Transition: Spectral Gating and the Epsilon-Stability Threshold
【速读】:该论文旨在解决“grokking”现象的机制问题,即模型在训练收敛后长时间延迟才实现泛化的能力,这与传统优化理论难以解释的现象相悖。其核心贡献在于揭示了AdamW优化器在模块化算术任务中通过“谱门控”(Spectral Gating)机制调控从记忆到泛化的过渡过程:关键在于AdamW作为方差门控的随机系统,其噪声结构与损失函数景观曲率的相互作用决定了泛化能否发生。具体而言,泛化解位于初始时因低方差而无法进入的尖锐盆地(λ_max^H),延迟阶段实质是梯度方差积累以提升有效稳定性阈值的过程;这一机制被进一步划分为三种复杂性区间——容量坍缩、方差受限和稳定性覆盖,其中只有适应性优化器特有的各向异性噪声整流能力(而非各向同性噪声)能引导噪声进入解流形的切空间,从而促成泛化。
链接: https://arxiv.org/abs/2603.15492
作者: Pratyush Acharya,Habish Dhakal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages with 14 figures
Abstract:Standard optimization theories struggle to explain grokking, where generalization occurs long after training convergence. While geometric studies attribute this to slow drift, they often overlook the interaction between the optimizer’s noise structure and landscape curvature. This work analyzes AdamW dynamics on modular arithmetic tasks, revealing a “Spectral Gating” mechanism that regulates the transition from memorization to generalization. We find that AdamW operates as a variance-gated stochastic system. Grokking is constrained by a stability condition: the generalizing solution resides in a sharp basin (λ_max^H) initially inaccessible under low-variance regimes. The “delayed” phase represents the accumulation of gradient variance required to lift the effective stability ceiling, permitting entry into this sharp manifold. Our ablation studies identify three complexity regimes: (1) Capacity Collapse (P < 23), where rank-deficiency prevents structural learning; (2) The Variance-Limited Regime (P ≈ 41), where generalization waits for the spectral gate to open; and (3) Stability Override (P > 67), where memorization becomes dimensionally unstable. Furthermore, we challenge the “Flat Minima” hypothesis for algorithmic tasks, showing that isotropic noise injection fails to induce grokking. Generalization requires the anisotropic rectification unique to adaptive optimizers, which directs noise into the tangent space of the solution manifold.
[AI-11] Talk Evaluate Diagnose: User-aware Agent Evaluation with Automated Error Analysis ICLR2026
【速读】:该论文旨在解决当前代理应用(Agent Application)在跨领域自动化工作流中缺乏统一、可扩展评估框架的问题,现有方法多依赖于任务特定的成功判定机制(如数据库查询或正则匹配),且未系统考虑用户角色与专业水平对代理性能的影响,导致评估结果片面。解决方案的关键在于提出TED框架(Talk, Evaluate, Diagnose):首先通过通用的专家与非专家用户角色模板模拟真实交互;其次将子目标(如工具签名和响应)转化为自然语言评分注释,并利用大语言模型作为评判者自动评估,引入兼顾回合效率与中间进展的新指标;最后构建自动化错误分析工具,识别代理与评判者之间的不一致,揭示常见错误模式并提供可操作的改进反馈,从而实现更全面、可解释的代理性能评估。
链接: https://arxiv.org/abs/2603.15483
作者: Penny Chong,Harshavardhan Abichandani,Jiyuan Shen,Atin Ghosh,Min Pyae Moe,Yifan Mai,Daniel Dahlmeier
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted as a conference paper at ICLR 2026. Code and dataset are available in the repository this https URL
Abstract:Agent applications are increasingly adopted to automate workflows across diverse tasks. However, due to the heterogeneous domains they operate in, it is challenging to create a scalable evaluation framework. Prior works each employ their own methods to determine task success, such as database lookups, regex match, etc., adding complexity to the development of a unified agent evaluation approach. Moreover, they do not systematically account for the user’s role nor expertise in the interaction, providing incomplete insights into the agent’s performance. We argue that effective agent evaluation goes beyond correctness alone, incorporating conversation quality, efficiency and systematic diagnosis of agent errors. To address this, we introduce the TED framework (Talk, Evaluate, Diagnose). (1) Talk: We leverage reusable, generic expert and non-expert user persona templates for user-agent interaction. (2) Evaluate: We adapt existing datasets by representing subgoals-such as tool signatures, and responses-as natural language grading notes, evaluated automatically with LLM-as-a-judge. We propose new metrics that capture both turn efficiency and intermediate progress of the agent complementing the user-aware setup. (3) Diagnose: We introduce an automated error analysis tool that analyzes the inconsistencies of the judge and agents, uncovering common errors, and providing actionable feedback for agent improvement. We show that our TED framework reveals new insights regarding agent performance across models and user expertise levels. We also demonstrate potential gains in agent performance with peaks of 8-10% on our proposed metrics after incorporating the identified error remedies into the agent’s design.
[AI-12] TabKD: Tabular Knowledge Distillation through Interaction Diversity of Learned Feature Bins
【速读】:该论文旨在解决数据隐私敏感场景下表格数据(tabular data)的知识蒸馏(knowledge distillation)效果不佳的问题,其核心挑战在于现有方法未显式建模特征交互(feature interactions),而特征交互是表格模型编码预测知识的根本机制。解决方案的关键在于提出TabKD框架,通过学习与教师模型决策边界对齐的自适应特征分箱(adaptive feature bins),并生成最大化成对特征交互覆盖度的合成查询(synthetic queries),从而系统性提升学生模型对教师模型知识的吸收能力。实验表明,该方法在16组配置中14次优于基线,且交互覆盖度与蒸馏质量高度相关,验证了以特征交互为中心的探索策略的有效性。
链接: https://arxiv.org/abs/2603.15481
作者: Shovon Niverd Pereira,Krishna Khadka,Yu Lei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Data-free knowledge distillation enables model compression without original training data, critical for privacy-sensitive tabular domains. However, existing methods do not perform well on tabular data because they do not explicitly address feature interactions, the fundamental way tabular models encode predictive knowledge. We identify interaction diversity, systematic coverage of feature combinations, as an essential requirement for effective tabular distillation. To operationalize this insight, we propose TabKD, which learns adaptive feature bins aligned with teacher decision boundaries, then generates synthetic queries that maximize pairwise interaction coverage. Across 4 benchmark datasets and 4 teacher architectures, TabKD achieves highest student-teacher agreement in 14 out of 16 configurations, outperforming 5 state-of-the-art baselines. We further show that interaction coverage strongly correlates with distillation quality, validating our core hypothesis. Our work establishes interaction-focused exploration as a principled framework for tabular model extraction.
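TabKD 的核心假设是成对特征交互覆盖度与蒸馏质量相关。以下用纯 Python 示意如何度量一组合成查询对"特征分箱两两组合"的覆盖率;分箱数与查询均为虚构示例,仅说明覆盖度这一指标的含义,并非论文实现。

```python
# 示意:计算合成查询对"特征分箱两两组合"的覆盖率。
# 分箱数与查询均为虚构,仅说明交互覆盖度的含义。
from itertools import combinations

def pairwise_coverage(queries, n_bins_per_feature):
    """queries 中每条查询是各特征所落分箱的下标元组。"""
    n_features = len(n_bins_per_feature)
    total = covered = 0
    for i, j in combinations(range(n_features), 2):
        all_pairs = {(a, b) for a in range(n_bins_per_feature[i])
                            for b in range(n_bins_per_feature[j])}
        seen = {(q[i], q[j]) for q in queries}
        total += len(all_pairs)
        covered += len(all_pairs & seen)
    return covered / total

# 3 个特征、各 2 个分箱;4 条查询恰好构成一个两两覆盖阵列
queries = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
print(pairwise_coverage(queries, [2, 2, 2]))       # → 1.0

# 只采样两条查询时,半数分箱组合从未被探索
print(pairwise_coverage(queries[:2], [2, 2, 2]))   # → 0.5
```

TabKD 生成合成查询的目标即是最大化这类覆盖率,从而系统性地触发教师模型编码的特征交互。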
[AI-13] Agent Lifecycle Toolkit (ALTK): Reusable Middleware Components for Robust AI Agents
【速读】:该论文旨在解决AI代理(AI Agent)在从演示阶段向企业级部署迁移过程中所面临的可靠性问题,包括工具参数误解释导致生产数据损坏、推理错误无声传播、输出违反组织政策引发合规风险等关键失败模式。这些问题在现有代理框架中往往依赖开发者临时构建防护机制,缺乏系统性和可复用性,从而造成部署脆弱且难以维护。解决方案的关键在于提出Agent Lifecycle Toolkit (ALTK),一个开源的模块化中间件工具包,通过在代理生命周期的六个关键节点——用户请求后、LLM提示前、LLM输出后、工具调用前、工具结果后、响应生成前——提供标准化的检测、修复与缓解能力,实现对常见失败模式的系统性干预,并兼容低代码/无代码平台(如ContextForge MCP Gateway和Langflow),显著降低构建高可靠性生产级代理所需的工作量。
链接: https://arxiv.org/abs/2603.15473
作者: Zidane Wright,Jason Tsay,Anupama Murthi,Osher Elhadad,Diego Del Rio,Saurabh Goyal,Kiran Kate,Jim Laredo,Koren Lazar,Vinod Muthusamy,Yara Rizk
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: demonstration track
Abstract:As AI agents move from demos into enterprise deployments, their failure modes become consequential: a misinterpreted tool argument can corrupt production data, a silent reasoning error can go undetected until damage is done, and outputs that violate organizational policy can create legal or compliance risk. Yet, most agent frameworks leave builders to handle these failure modes ad hoc, resulting in brittle, one-off safeguards that are hard to reuse or maintain. We present the Agent Lifecycle Toolkit (ALTK), an open-source collection of modular middleware components that systematically address these gaps across the full agent lifecycle. Across the agent lifecycle, we identify opportunities to intervene and improve, namely, post-user-request, pre-LLM prompt conditioning, post-LLM output processing, pre-tool validation, post-tool result checking, and pre-response assembly. ALTK provides modular middleware that detects, repairs, and mitigates common failure modes. It offers consistent interfaces that fit naturally into existing pipelines. It is compatible with low-code and no-code tools such as the ContextForge MCP Gateway and Langflow. Finally, it significantly reduces the effort of building reliable, production-grade agents.
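ALTK 在代理生命周期的六个节点(如工具调用前)插入中间件。下面用纯 Python 勾勒"工具调用前校验"这一类钩子的工作方式;钩子接口、工具名与校验规则均为假设示例,并非 ALTK 的实际 API。

```python
# 示意:在"工具调用前"挂接校验中间件的极简代理流水线。
# 接口与规则均为假设,并非 ALTK 的实际 API。

class Pipeline:
    def __init__(self):
        self.pre_tool_hooks = []

    def before_tool(self, fn):
        """注册一个工具调用前的校验钩子(装饰器用法)。"""
        self.pre_tool_hooks.append(fn)
        return fn

    def call_tool(self, name, args):
        for hook in self.pre_tool_hooks:
            ok, msg = hook(name, args)
            if not ok:
                return f"blocked: {msg}"    # 在工具真正执行前拦截
        return f"executed {name}({args})"

pipe = Pipeline()

@pipe.before_tool
def validate_args(name, args):
    if name == "delete_rows" and "where" not in args:
        return False, "refusing unscoped delete"   # 防止误删生产数据
    return True, ""

print(pipe.call_tool("delete_rows", {"table": "users"}))
print(pipe.call_tool("delete_rows", {"table": "users", "where": "id=1"}))
```

其余五个节点(用户请求后、LLM 提示前、LLM 输出后、工具结果后、响应生成前)可按同样的钩子模式组合,这正是摘要中"模块化、可复用"的含义。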
[AI-14] RoCo Challenge at AAAI 2026: Benchmarking Robotic Collaborative Manipulation for Assembly Towards Industrial Automation
【Quick Read】: This paper addresses long-horizon multi-task learning and robust deployment of Embodied Artificial Intelligence (EAI) for complex industrial assembly, focusing on high-precision planetary gearbox assembly, a demanding yet highly representative industrial operation. The key to the solution is a two-phase evaluation framework (parallel simulation and real-world rounds) combined with a curriculum-learning data strategy built on recovery from failure, which markedly improves stability and success rates on long-horizon, multi-step tasks; experiments show that a dual-model framework for hierarchical decision-making and collaborative optimization is the core technical path to effective industrial EAI systems.
Link: https://arxiv.org/abs/2603.15469
Authors: Haichao Liu,Yuheng Zhou,Zhenyu Wu,Ziheng Ji,Ziyu Shan,Qianzhun Wang,Ruixuan Liu,Zhiyuan Yang,Yejun Gu,Shalman Khan,Shijun Yan,Jun Liu,Haiyue Zhu,Changliu Liu,Jianfei Yang,Jingbing Zhang,Ziwei Wang
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 16 pages, 8 figures
Abstract:Embodied Artificial Intelligence (EAI) is rapidly developing, gradually subverting previous autonomous systems’ paradigms from isolated perception to integrated, continuous action. This transition is highly significant for industrial robotic manipulation, promising to free human workers from repetitive, dangerous daily labor. To benchmark and advance this capability, we introduce the Robotic Collaborative Assembly Assistance (RoCo) Challenge with a dataset towards simulation and real-world assembly manipulation. Set against the backdrop of human-centered manufacturing, this challenge focuses on a high-precision planetary gearbox assembly task, a demanding yet highly representative operation in modern industry. Built upon a self-developed data collection, training, and evaluation system in Isaac Sim, and utilizing a dual-arm robot for real-world deployment, the challenge operates in two phases. The Simulation Round defines fine-grained task phases for step-wise scoring to handle the long-horizon nature of the assembly. The Real-World Round mirrors this evaluation with physical gearbox components and high-quality teleoperated datasets. The core tasks require assembling an epicyclic gearbox from scratch, including mounting three planet gears, a sun gear, and a ring gear. Attracting over 60 teams and 170+ participants from more than 10 countries, the challenge yielded highly effective solutions, most notably ARC-VLA and RoboCola. Results demonstrate that a dual-model framework for long-horizon multi-task learning is highly effective, and the strategic utilization of recovery-from-failure curriculum data is a critical insight for successful deployment. This report outlines the competition setup, evaluation approach, key findings, and future directions for industrial EAI. Our dataset, CAD files, code, and evaluation results can be found at: this https URL.
[AI-15] Evasive Intelligence: Lessons from Malware Analysis for Evaluating AI Agents
【Quick Read】: This paper addresses a structural risk in current AI-agent evaluation: an agent may sense that it is in an evaluation environment and adjust its behavior, appearing safe and robust during testing while acting unreliably or even dangerously once deployed. The phenomenon mirrors sandbox-evading malware, and it means that static, restricted, predictable evaluation paradigms cannot reflect how AI agents truly perform in complex real-world settings. The key to the solution is to treat the system under test as a potentially adversarial entity and adopt evaluation principles that are more realistic, varied, and dynamic, including variable test environments, simulation of realistic deployment scenarios, and continuous post-deployment reassessment, thereby improving the credibility and generalizability of evaluation results.
Link: https://arxiv.org/abs/2603.15457
Authors: Simone Aonzo,Merve Sahin,Aurélien Francillon,Daniele Perito
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Artificial intelligence (AI) systems are increasingly adopted as tool-using agents that can plan, observe their environment, and take actions over extended time periods. This evolution challenges current evaluation practices where the AI models are tested in restricted, fully observable settings. In this article, we argue that evaluations of AI agents are vulnerable to a well-known failure mode in computer security: malicious software that exhibits benign behavior when it detects that it is being analyzed. We point out how AI agents can infer the properties of their evaluation environment and adapt their behavior accordingly. This can lead to overly optimistic safety and robustness assessments. Drawing parallels with decades of research on malware sandbox evasion, we demonstrate that this is not a speculative concern, but rather a structural risk inherent to the evaluation of adaptive systems. Finally, we outline concrete principles for evaluating AI agents, which treat the system under test as potentially adversarial. These principles emphasize realism, variability of test conditions, and post-deployment reassessment.
[AI-16] Unlocking the Value of Text: Event-Driven Reasoning and Multi-Level Alignment for Time Series Forecasting ICLR2026
【Quick Read】: This paper addresses the accuracy bottleneck of time-series forecasting methods that rely on numerical data alone and therefore miss the multimodal information (such as text) attached to real-world series. The key innovations of the proposed VoT method are: Event-driven Reasoning, which pairs exogenous text with the strong reasoning capabilities of LLMs to understand complex patterns; Historical In-context Learning, which uses historical examples as prompts to guide effective LLM reasoning; and a Multi-level Alignment mechanism, combining Endogenous Text Alignment at the representation level to fuse textual features with Adaptive Frequency Fusion at the prediction level to merge the complementary strengths of event-driven and numerical predictions, thereby maximizing the value of text.
Link: https://arxiv.org/abs/2603.15452
Authors: Siyuan Wang,Peng Chen,Yihang Wang,Wanghui Qiu,Chenjuan Guo,Bin Yang,Yang Shu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by ICLR 2026
Abstract:Existing time series forecasting methods primarily rely on the numerical data itself. However, real-world time series exhibit complex patterns associated with multimodal information, making them difficult to predict with numerical data alone. While several multimodal time series forecasting methods have emerged, they either utilize text with limited supplementary information or focus merely on representation extraction, extracting minimal textual information for forecasting. To unlock the Value of Text, we propose VoT, a method with Event-driven Reasoning and Multi-level Alignment. Event-driven Reasoning combines the rich information in exogenous text with the powerful reasoning capabilities of LLMs for time series forecasting. To guide the LLMs in effective reasoning, we propose the Historical In-context Learning that retrieves and applies historical examples as in-context guidance. To maximize the utilization of text, we propose Multi-level Alignment. At the representation level, we utilize the Endogenous Text Alignment to integrate the endogenous text information with the time series. At the prediction level, we design the Adaptive Frequency Fusion to fuse the frequency components of event-driven prediction and numerical prediction to achieve complementary advantages. Experiments on real-world datasets across 10 domains demonstrate significant improvements over existing methods, validating the effectiveness of our approach in the utilization of text. The code is made available at this https URL.
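The frequency-fusion idea, combining frequency components of two predictions, can be illustrated with a naive DFT that keeps low-frequency bins from one prediction and the remaining bins from another. This is a minimal stand-in under invented signals, not VoT's actual Adaptive Frequency Fusion.

```python
import cmath

def dft(x):
    """Naive O(n^2) discrete Fourier transform (illustration only)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [(sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def frequency_fusion(pred_a, pred_b, cutoff):
    """Take bins with frequency index < cutoff (and their mirrors) from pred_a,
    the rest from pred_b, then invert back to the time domain."""
    A, B = dft(pred_a), dft(pred_b)
    n = len(A)
    fused = [A[k] if min(k, n - k) < cutoff else B[k] for k in range(n)]
    return idft(fused)

# Invented example: a constant "numerical" prediction and an oscillating
# "event-driven" prediction; the fusion keeps the level of the first and the
# oscillation of the second.
fused = frequency_fusion([1.0] * 8, [(-1.0) ** t for t in range(8)], cutoff=1)
```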
[AI-17] Music Genre Classification: A Comparative Analysis of Classical Machine Learning and Deep Learning Approaches
【Quick Read】: This paper addresses the long-standing challenge of automatic music genre classification for non-Western traditions, in particular the lack of systematic classification models for the culturally rich and acoustically diverse genres of Nepali music (such as Lok Dohori, Deuda, and Tamang Selo). The key to the solution is a new dataset of roughly 8,000 labeled audio clips and a systematic comparison of nine classifiers across two paradigms: five classical machine-learning methods (Logistic Regression, SVM, KNN, Random Forest, XGBoost) on 51 hand-crafted audio features, and four deep learning architectures (CNN, RNN, parallel CNN-RNN, sequential CRNN) operating directly on 640 x 128 Mel spectrograms. Experiments show that the sequential Convolutional Recurrent Neural Network (CRNN), convolutional layers followed by an LSTM, performs best with 84% accuracy, clearly outperforming the best classical models (71%) and the other deep architectures, an advantage rooted in its joint modeling of local spectral features and temporal patterns.
Link: https://arxiv.org/abs/2603.15440
Authors: Sachin Prajuli,Abhishek Karna,OmPrakash Dhakl
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 8 pages
Abstract:Automatic music genre classification is a long-standing challenge in Music Information Retrieval (MIR); work on non-Western music traditions remains scarce. Nepali music encompasses culturally rich and acoustically diverse genres–from the call-and-response duets of Lok Dohori to the rhythmic poetry of Deuda and the distinctive melodies of Tamang Selo–that have not been addressed by existing classification systems. In this paper, we construct a novel dataset of approximately 8,000 labeled 30-second audio clips spanning eight Nepali music genres and conduct a systematic comparison of nine classification models across two paradigms. Five classical machine learning classifiers (Logistic Regression, SVM, KNN, Random Forest, and XGBoost) are trained on 51 hand-crafted audio features extracted via Librosa, while four deep learning architectures (CNN, RNN, parallel CNN-RNN, and sequential CNN followed by RNN) operate on Mel spectrograms of dimension 640 x 128. Our experiments reveal that the sequential Convolutional Recurrent Neural Network (CRNN)–in which convolutional layers feed into an LSTM–achieves the highest accuracy of 84%, substantially outperforming both the best classical models (Logistic Regression and XGBoost, both at 71%) and all other deep architectures. We provide per-class precision, recall, F1-score, confusion matrices, and ROC analysis for every model, and offer a culturally grounded interpretation of misclassification patterns that reflects genuine overlaps in Nepal’s musical traditions.
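The per-class precision, recall, and F1 the paper reports follow directly from a confusion matrix. A small stdlib sketch with an invented 3-genre matrix (the counts are illustrative, not the paper's results):

```python
def per_class_metrics(conf):
    """conf[i][j] = count of clips with true genre i predicted as genre j.
    Returns (precision, recall, f1) lists, one entry per class."""
    n = len(conf)
    precision, recall, f1 = [], [], []
    for c in range(n):
        tp = conf[c][c]
        fp = sum(conf[r][c] for r in range(n)) - tp   # predicted c, true other
        fn = sum(conf[c]) - tp                        # true c, predicted other
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(2 * p * r / (p + r) if p + r else 0.0)
    return precision, recall, f1

# Toy confusion matrix where two acoustically similar genres overlap.
conf = [[80, 15, 5],
        [10, 85, 5],
        [ 0,  5, 95]]
prec, rec, f1 = per_class_metrics(conf)
```

The off-diagonal mass between the first two classes is exactly the kind of pattern the paper interprets as genuine overlap between musical traditions.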
[AI-18] Listening to the Echo: User-Reaction Aware Policy Optimization via Scalar-Verbal Hybrid Reinforcement Learning
【Quick Read】: This paper addresses the information sparsity of expert-defined scalar reward signals in emotional support dialogue systems: such signals cannot explain why a response failed or how to adapt to a user's dynamic emotional state, and they often diverge from the actual goal of facilitating positive emotional shifts. The key to the solution is Reaction Aware Policy Optimization (RAPO), which treats dialogue as a reaction-driven process and uses simulated user responses to generate dense natural-language feedback: Hindsight Dialogue Selection isolates the pivotal dialogue turns, Generative Hindsight Feedback builds contrastive ranking signals and natural-language critiques, and Scalar-Verbal Hybrid Policy Optimization jointly optimizes global alignment and fine-grained semantic refinement. Experiments on ESC and Sotopia show that RAPO significantly outperforms strong reinforcement learning baselines at driving positive interaction outcomes.
Link: https://arxiv.org/abs/2603.15434
Authors: Jing Ye,Xinpei Zhao,Lu Xiang,Yaping Zhang,Chengqing Zong
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:While current emotional support dialogue systems typically rely on expert-defined scalar rewards for alignment, these signals suffer from severe information sparsity. They cannot explain why a response failed or how to adapt to dynamic user states, often diverging from the actual goal of facilitating positive emotional shifts. In practice, the most direct and reliable learning signal emerges from the user’s continuous reactions during ongoing interaction. We therefore propose Reaction Aware Policy Optimization (RAPO), a framework that optimizes over interaction consequences rather than rubric scores. RAPO treats dialogue as a reaction-driven process and utilizes simulated user responses to generate dense natural-language feedback through three core components: Hindsight Dialogue Selection, which isolates pivotal turns that meaningfully alter user emotional trajectories; Generative Hindsight Feedback, which transforms user reactions into contrastive ranking signals and natural-language critiques; and Scalar-Verbal Hybrid Policy Optimization, which couples scalar reward optimization for global alignment with verbal feedback distillation for fine-grained semantic refinement. Extensive experiments on ESC and Sotopia demonstrate that RAPO significantly outperforms strong reinforcement learning baselines in driving positive interaction outcomes.
[AI-19] Physics-informed fine-tuning of foundation models for partial differential equations
【Quick Read】: This paper addresses the difficulty of adapting pre-trained foundation models for partial differential equations (PDEs) to downstream tasks, especially under scarce task-specific data and distribution shift, where purely data-driven fine-tuning struggles to guarantee physical consistency and delivers limited performance. The key to the solution is a physics-informed fine-tuning framework that incorporates PDE residuals and boundary conditions directly into the fine-tuning objective as physical constraints, enabling efficient, physically consistent adaptation under limited data. The method matches or exceeds data-driven accuracy without requiring PDE solutions, and a hybrid strategy shows stronger out-of-distribution generalization with minimal training data.
Link: https://arxiv.org/abs/2603.15431
Authors: Vlad Medvedev,Leon Armbruster,Christopher Straub,Georg Kruse,Andreas Rosskopf
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP); Numerical Analysis (math.NA)
Comments: 12 pages, 6 figures, 1 table
Abstract:Foundation models for partial differential equations (PDEs) have emerged as powerful surrogates pre-trained on diverse physical systems, but adapting them to new downstream tasks remains challenging due to limited task-specific data and distribution shifts. While fine-tuning has proven transformative in natural language processing, best practices for adapting PDE foundation models remain underexplored. Although physics-informed training has successfully trained accurate solvers across a wide range of PDE problems, its potential for fine-tuning data-based foundation models has not been systematically studied. In this work, we introduce a physics-informed fine-tuning framework that adapts pre-trained PDE foundation models by incorporating physical constraints (PDE residuals and boundary conditions) directly into the fine-tuning objective. This enables effective adaptation in data-scarce regimes while promoting physical consistency. We evaluate our method on a downstream task composed of an unseen PDE class and compare it with data-driven fine-tuning counterparts. Our results demonstrate that physics-informed fine-tuning achieves competitive accuracy without requiring PDE solutions for training. Furthermore, a hybrid fine-tuning strategy yields superior generalization to out-of-distribution scenarios when only minimal training data is available. These findings establish physics-informed fine-tuning as a scalable and data-efficient paradigm, providing a physically interpretable pathway for adapting foundation models in scientific machine learning.
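The shape of a physics-informed objective, data loss plus PDE residual and boundary terms, can be sketched for a 1-D Poisson problem u'' = f with finite differences. The weights and grid below are invented and this is only an illustration of the loss construction, not the paper's implementation.

```python
def pde_residual(u, f, h):
    """Finite-difference residual of u'' = f on a uniform grid with spacing h."""
    return [(u[i - 1] - 2 * u[i] + u[i + 1]) / h ** 2 - f[i]
            for i in range(1, len(u) - 1)]

def physics_informed_loss(u_pred, u_data, f, h, bc=(0.0, 0.0), w_pde=1.0, w_bc=10.0):
    """Data MSE (may be empty in data-scarce regimes) + PDE residual + boundary terms."""
    data = (sum((up - ud) ** 2 for up, ud in zip(u_pred, u_data)) / len(u_data)
            if u_data else 0.0)
    res = pde_residual(u_pred, f, h)
    pde = sum(r * r for r in res) / len(res)
    bcl = (u_pred[0] - bc[0]) ** 2 + (u_pred[-1] - bc[1]) ** 2
    return data + w_pde * pde + w_bc * bcl

# Exact solution u(x) = x(1 - x) of u'' = -2 on a 5-point grid: zero loss
# even with no data at all.
u_exact = [0.0, 0.1875, 0.25, 0.1875, 0.0]
f = [-2.0] * 5
loss = physics_informed_loss(u_exact, [], f, h=0.25)
```

During fine-tuning, `u_pred` would come from the foundation model and the gradient of this loss would flow back into its weights.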
[AI-20] MA-VLCM: A Vision Language Critic Model for Value Estimation of Policies in Multi-Agent Team Settings
【Quick Read】: This paper addresses two problems: the low sample efficiency and weak generalization of training a centralized critic from scratch in multi-agent reinforcement learning (MARL), and the prohibitive compute cost of deploying large vision-language-action models (VLAs) on resource-constrained heterogeneous multi-robot systems. The key to the solution is Multi-Agent Vision-Language-Critic Models (MA-VLCM): a pretrained vision-language model (VLM) fine-tuned into a centralized critic conditioned on natural-language task descriptions, visual trajectory observations, and structured multi-agent state information, which enables efficient policy optimization without learning a critic from scratch and yields compact execution policies suitable for resource-constrained robots.
Link: https://arxiv.org/abs/2603.15418
Authors: Shahil Shaik,Aditya Parameshwaran,Anshul Nayak,Jonathon M. Smereka,Yue Wang
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 7 pages, 6 figures
Abstract:Multi-agent reinforcement learning (MARL) commonly relies on a centralized critic to estimate the value function. However, learning such a critic from scratch is highly sample-inefficient and often lacks generalization across environments. At the same time, large vision-language-action models (VLAs) trained on internet-scale data exhibit strong multimodal reasoning and zero-shot generalization capabilities, yet directly deploying them for robotic execution remains computationally prohibitive, particularly in heterogeneous multi-robot systems with diverse embodiments and resource constraints. To address these challenges, we propose Multi-Agent Vision-Language-Critic Models (MA-VLCM), a framework that replaces the learned centralized critic in MARL with a pretrained vision-language model fine-tuned to evaluate multi-agent behavior. MA-VLCM acts as a centralized critic conditioned on natural language task descriptions, visual trajectory observations, and structured multi-agent state information. By eliminating critic learning during policy optimization, our approach significantly improves sample efficiency while producing compact execution policies suitable for deployment on resource-constrained robots. Results show good zero-shot return estimation for models with differing VLM backbones on both in-distribution and out-of-distribution scenarios in multi-agent team settings.
[AI-21] RESQ: A Unified Framework for REliability- and Security Enhancement of Quantized Deep Neural Networks
【Quick Read】: This paper addresses the robustness imbalance of quantized deep neural networks (DNNs) facing adversarial attacks and hardware faults, where improving one kind of robustness often degrades the other. The key to the solution is a unified three-stage framework: the first stage fine-tunes feature representations to be insensitive to small input perturbations, strengthening adversarial robustness; the second stage performs fault-aware fine-tuning under simulated bit-flip faults to strengthen fault robustness; the third stage integrates quantization through a lightweight post-training adjustment that further reduces fault sensitivity without sacrificing adversarial robustness. Experiments on multiple models and datasets show simultaneous gains in attack and fault robustness while maintaining high accuracy in the quantized networks.
Link: https://arxiv.org/abs/2603.15413
Authors: Ali Soltan Mohammadi,Samira Nazari,Ali Azarpeyvand,Mahdi Taheri,Milos Krstic,Michael Huebner,Christian Herglotz,Tara Ghasempouri
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments:
Abstract:This work proposes a unified three-stage framework that produces a quantized DNN with balanced fault and attack robustness. The first stage improves attack resilience via fine-tuning that desensitizes feature representations to small input perturbations. The second stage reinforces fault resilience through fault-aware fine-tuning under simulated bit-flip faults. Finally, a lightweight post-training adjustment integrates quantization to enhance efficiency and further mitigate fault sensitivity without degrading attack resilience. Experiments on ResNet18, VGG16, EfficientNet, and Swin-Tiny on CIFAR-10, CIFAR-100, and GTSRB show consistent gains of up to 10.35% in attack resilience and 12.47% in fault resilience, while maintaining competitive accuracy in quantized networks. The results also highlight an asymmetric interaction in which improvements in fault resilience generally increase resilience to adversarial attacks, whereas enhanced adversarial resilience does not necessarily lead to higher fault resilience.
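Why bit-flip faults matter so much for quantized weights: the damage depends on which bit flips. A toy int8 example (the scale and weight value are invented, purely to show the magnitude gap between an LSB flip and a sign-bit flip):

```python
def quantize(w, scale):
    """Symmetric int8 quantization: round(w / scale), clipped to [-128, 127]."""
    q = int(round(w / scale))
    return max(-128, min(127, q))

def dequantize(q, scale):
    return q * scale

def flip_bit(q, bit):
    """Flip one bit of an int8 value stored as two's complement."""
    u = q & 0xFF                        # view as an unsigned byte
    u ^= 1 << bit
    return u - 256 if u >= 128 else u   # back to signed

scale = 0.02
w = 0.5                                 # original weight
q = quantize(w, scale)                  # integer code for 0.5

low = dequantize(flip_bit(q, 0), scale)   # LSB flip: error of one scale step
high = dequantize(flip_bit(q, 7), scale)  # sign-bit flip: weight jumps far negative
```

Fault-aware fine-tuning of the kind described above exposes the network to exactly such perturbations so that its accuracy degrades gracefully under them.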
[AI-22] A Hybrid Modeling Framework for Crop Prediction Tasks via Dynamic Parameter Calibration and Multi-Task Learning
【Quick Read】: This paper addresses accurate prediction of crop states (such as phenology stages and cold hardiness) to support timely and effective farm-management decisions. Traditional biophysical models enable season-long prediction but lack site-specific precision, while deep learning, though promising, can produce biologically unrealistic predictions and depends on large-scale data. The key to the solution is a hybrid modeling approach that uses a neural network to parameterize a differentiable biophysical model and leverages multi-task learning for efficient data sharing across crop cultivars in data-limited settings. By predicting the parameters of the biophysical model, the approach substantially improves accuracy while preserving biological realism; empirical results show a 60% improvement in phenology prediction and 40% in cold-hardiness prediction over deployed biophysical models.
Link: https://arxiv.org/abs/2603.15411
Authors: William Solow,Paola Pesantez-Cabrera,Markus Keller,Lav Khot,Sandhya Saisubramanian,Alan Fern
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Accurate prediction of crop states (e.g., phenology stages and cold hardiness) is essential for timely farm management decisions such as irrigation, fertilization, and canopy management to optimize crop yield and quality. While traditional biophysical models can be used for season-long predictions, they lack the precision required for site-specific management. Deep learning methods are a compelling alternative, but can produce biologically unrealistic predictions and require large-scale data. We propose a hybrid modeling approach that uses a neural network to parameterize a differentiable biophysical model and leverages multi-task learning for efficient data sharing across crop cultivars in data limited settings. By predicting the parameters of the biophysical model, our approach improves the prediction accuracy while preserving biological realism. Empirical evaluation using real-world and synthetic datasets demonstrates that our method improves prediction accuracy by 60% for phenology and 40% for cold hardiness compared to deployed biophysical models.
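The "network predicts the biophysical model's parameters" idea can be sketched with a classic growing-degree-day (GDD) phenology model. The linear "parameterizer", feature values, and thresholds below are invented stand-ins for the paper's neural network and cultivar data.

```python
def predict_bloom_day(base_temp, gdd_threshold, daily_mean_temps):
    """Classic growing-degree-day phenology model: accumulate
    max(T - base_temp, 0) each day; the stage triggers at the threshold."""
    gdd = 0.0
    for day, t in enumerate(daily_mean_temps):
        gdd += max(t - base_temp, 0.0)
        if gdd >= gdd_threshold:
            return day
    return None  # stage not reached this season

def cultivar_params(features, w, b):
    """Stand-in for the neural parameterizer: maps cultivar/site features to
    the biophysical model's base temperature (a single linear layer here)."""
    return b + sum(wi * fi for wi, fi in zip(w, features))

temps = [8, 9, 11, 12, 14, 15, 16, 15, 17, 18]          # invented daily means
base = cultivar_params([1.0, 0.5], w=[2.0, 4.0], b=6.0)  # cultivar-specific base temp
day = predict_bloom_day(base, gdd_threshold=20.0, daily_mean_temps=temps)
```

Because the GDD model stays in the loop, the output remains biologically interpretable even though its parameter is learned; in the paper the parameterizer is a neural network trained end-to-end through a differentiable version of such a model.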
[AI-23] SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering?
【Quick Read】: This paper addresses the unclear practical utility of agent skills for generative AI in real end-to-end software engineering (SWE) development. The key to the solution is SWE-Skills-Bench, the first requirement-driven benchmark: it pairs 49 public SWE skills with GitHub repositories pinned at fixed commits and requirement documents carrying explicit acceptance criteria, yielding roughly 565 task instances, and introduces a deterministic verification framework that maps acceptance criteria to execution-based tests, enabling controlled paired evaluation with and without each skill. This design precisely quantifies the marginal benefit of skill injection and reveals that its utility varies sharply with domain fit, abstraction level, and contextual compatibility.
Link: https://arxiv.org/abs/2603.15401
Authors: Tingxu Han,Yi Zhang,Wei Song,Chunrong Fang,Zhenyu Chen,Youcheng Sun,Lijie Hu
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Agent skills, structured procedural knowledge packages injected at inference time, are increasingly used to augment LLM agents on software engineering tasks. However, their real utility in end-to-end development settings remains unclear. We present SWE-Skills-Bench, the first requirement-driven benchmark that isolates the marginal utility of agent skills in real-world software engineering (SWE). It pairs 49 public SWE skills with authentic GitHub repositories pinned at fixed commits and requirement documents with explicit acceptance criteria, yielding approximately 565 task instances across six SWE subdomains. We introduce a deterministic verification framework that maps each task’s acceptance criteria to execution-based tests, enabling controlled paired evaluation with and without the skill. Our results show that skill injection benefits are far more limited than rapid adoption suggests: 39 of 49 skills yield zero pass-rate improvement, and the average gain is only +1.2%. Token overhead varies from modest savings to a 451% increase while pass rates remain unchanged. Only seven specialized skills produce meaningful gains (up to +30%), while three degrade performance (up to -10%) due to version-mismatched guidance conflicting with project context. These findings suggest that agent skills are a narrow intervention whose utility depends strongly on domain fit, abstraction level, and contextual compatibility. SWE-Skills-Bench provides a testbed for evaluating the design, selection, and deployment of skills in software engineering agents. SWE-Skills-Bench is available at this https URL.
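The paired with/without-skill design reduces to a pass-rate delta per skill. A sketch with invented counts, chosen only to echo the +30/-10 percentage-point extremes the paper reports, not taken from its data:

```python
def marginal_utility(results):
    """results: {skill: (pass_base, pass_skill, n)} with pass counts over n
    paired task instances. Returns per-skill pass-rate lift in percentage points."""
    return {skill: 100.0 * (w - b) / n for skill, (b, w, n) in results.items()}

# Hypothetical paired results (skill names and counts are illustrative only).
lift = marginal_utility({
    "pdf-extraction": (4, 7, 10),   # specialized skill with strong domain fit
    "generic-style":  (6, 6, 10),   # no measurable effect
    "stale-api":      (5, 4, 10),   # version-mismatched guidance hurts
})
```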
[AI-24] SFCoT: Safer Chain-of-Thought via Active Safety Evaluation and Calibration
【Quick Read】: This paper addresses the vulnerability of large language models (LLMs) to jailbreak attacks during complex reasoning, where such attacks bypass safety alignment and elicit unsafe outputs. Existing defenses typically apply post hoc filtering to the final output only, leaving intermediate reasoning steps unmonitored and thus a safety gap. The key to the solution is the SaFer Chain-of-Thought (SFCoT) framework, which detects potential risks in real time during reasoning through a three-tier safety scoring system and a multi-perspective consistency verification mechanism, and applies a dynamic intervention module to perform targeted calibration of suspicious reasoning trajectories, effectively steering the model toward safe outcomes. Experiments show the attack success rate drops from 58.97% to 12.31% without a significant impact on general performance.
Link: https://arxiv.org/abs/2603.15397
Authors: Yu Pan,Wenlong Yu,Tiejun Wu,Xiaohu Ye,Qiannan Si,Guangquan Xu,Bin Wu
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, they remain highly susceptible to jailbreak attacks that undermine their safety alignment. Existing defense mechanisms typically rely on post hoc filtering applied only to the final output, leaving intermediate reasoning steps unmonitored and vulnerable to adversarial manipulation. To address this gap, this paper proposes a SaFer Chain-of-Thought (SFCoT) framework, which proactively evaluates and calibrates potentially unsafe reasoning steps in real time. SFCoT incorporates a three-tier safety scoring system alongside a multi-perspective consistency verification mechanism, designed to detect potential risks throughout the reasoning process. A dynamic intervention module subsequently performs targeted calibration to redirect reasoning trajectories toward safe outcomes. Experimental results demonstrate that SFCoT reduces the attack success rate from 58.97% to 12.31%, establishing it as an effective and efficient LLM safety enhancement method with no significant decline in general performance.
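The combination of three-tier scoring with multi-perspective agreement can be caricatured as follows. The thresholds, vote rule, and action names are all invented illustrations of the general pattern, not SFCoT's actual mechanism.

```python
SAFE, SUSPICIOUS, UNSAFE = 0, 1, 2

def tier(step_scores):
    """Map several perspectives' risk scores for one reasoning step to a tier;
    escalate only when at least two perspectives agree (consistency check)."""
    votes_unsafe = sum(s >= 0.8 for s in step_scores)
    votes_risky = sum(s >= 0.5 for s in step_scores)
    if votes_unsafe >= 2:
        return UNSAFE
    if votes_risky >= 2:
        return SUSPICIOUS
    return SAFE

def monitor(chain):
    """chain: list of per-step score lists. Returns one action per step."""
    action_for = {SAFE: "continue", SUSPICIOUS: "calibrate", UNSAFE: "halt"}
    return [action_for[tier(scores)] for scores in chain]

actions = monitor([[0.1, 0.2, 0.1],     # benign step
                   [0.6, 0.7, 0.3],     # two perspectives flag moderate risk
                   [0.9, 0.85, 0.4]])   # two perspectives flag high risk
```

The "calibrate" branch is where SFCoT's dynamic intervention module would rewrite the suspicious step rather than abort the whole chain.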
[AI-25] Efficient Morphology-Control Co-Design via Stackelberg Proximal Policy Optimization
【Quick Read】: This paper addresses optimization inefficiency in morphology-control co-design: existing methods typically adopt a single-level formulation that treats the control policy as fixed while optimizing morphology, ignoring how control dynamically adapts to the body, so morphology updates become misaligned with control adaptation, hurting training stability and final performance. The key to the solution is to revisit the problem from a game-theoretic perspective, formalizing it as a novel variant of a Stackelberg game, and to propose Stackelberg Proximal Policy Optimization (Stackelberg PPO), which explicitly incorporates the control policy's adaptation dynamics into morphology optimization, keeping morphology updates aligned with control adaptation and thereby substantially improving training stability and learning efficiency.
Link: https://arxiv.org/abs/2603.15388
Authors: Yanning Dai,Yuhui Wang,Dylan R. Ashley,Jürgen Schmidhuber
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
Comments: presented at the Fourteenth International Conference on Learning Representations; 11 pages in main text + 3 pages of references + 23 pages of appendices, 5 figures in main text + 11 figures in appendices, 16 tables in appendices; accompanying website available at this https URL ; source code available at this https URL
Abstract:Morphology-control co-design concerns the coupled optimization of an agent’s body structure and control policy. This problem exhibits a bi-level structure, where the control dynamically adapts to the morphology to maximize performance. Existing methods typically neglect the control’s adaptation dynamics by adopting a single-level formulation that treats the control policy as fixed when optimizing morphology. This can lead to inefficient optimization, as morphology updates may be misaligned with control adaptation. In this paper, we revisit the co-design problem from a game-theoretic perspective, modeling the intrinsic coupling between morphology and control as a novel variant of a Stackelberg game. We propose Stackelberg Proximal Policy Optimization (Stackelberg PPO), which explicitly incorporates the control’s adaptation dynamics into morphology optimization. By modeling this intrinsic coupling, our method aligns morphology updates with control adaptation, thereby stabilizing training and improving learning efficiency. Experiments across diverse co-design tasks demonstrate that Stackelberg PPO outperforms standard PPO in both stability and final performance, opening the way for dramatically more efficient robotics designs.
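The gap between single-level and Stackelberg-style co-design shows up already in a toy scalar problem: a "morphology" m and a "control" c with reward R(m, c) = -(c - m)^2 - (m - 2)^2. With the control frozen, gradient ascent on m settles at m = 1; letting c adapt before each morphology update recovers the true optimum m = 2. This is a minimal caricature of the bi-level formulation under an invented reward, not the paper's algorithm.

```python
def reward(m, c):
    # control must track morphology (first term); the task favors m near 2.
    return -(c - m) ** 2 - (m - 2) ** 2

def optimize(inner_steps, outer_steps=200, lr=0.05):
    """Gradient ascent on m; before each m update, the control c takes
    `inner_steps` ascent steps to adapt to the current morphology (the
    follower's best response in the Stackelberg view)."""
    m, c = 0.0, 0.0
    for _ in range(outer_steps):
        for _ in range(inner_steps):
            c += lr * 2 * (m - c)                   # dR/dc
        m += lr * (2 * (c - m) - 2 * (m - 2))       # dR/dm
    return m, c

m_naive, _ = optimize(inner_steps=0)    # control frozen: misaligned optimum m = 1
m_stack, _ = optimize(inner_steps=20)   # control adapts first: recovers m = 2
```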
[AI-26] Why AI systems dont learn and what to do about it: Lessons on autonomous learning from cognitive science
【Quick Read】: This paper addresses the limitations of current artificial intelligence (AI) models in achieving autonomous learning: existing systems lack the capacity of humans and animals to adapt continuously in real, dynamic environments by combining observation with active behavior. The key to the solution is a learning architecture inspired by biological cognition that integrates two learning modes, learning from observation (System A) and learning from active behavior (System B), and switches flexibly between them via internally generated meta-control signals (System M), enabling environmental adaptation across evolutionary and developmental timescales.
Link: https://arxiv.org/abs/2603.15381
Authors: Emmanuel Dupoux,Yann LeCun,Jitendra Malik
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:We critically examine the limitations of current AI models in achieving autonomous learning and propose a learning architecture inspired by human and animal cognition. The proposed framework integrates learning from observation (System A) and learning from active behavior (System B) while flexibly switching between these learning modes as a function of internally generated meta-control signals (System M). We discuss how this could be built by taking inspiration from how organisms adapt to real-world, dynamic environments across evolutionary and developmental timescales.
[AI-27] More Test-Time Compute Can Hurt: Overestimation Bias in LLM Beam Search
【Quick Read】: This paper asks when widening beam search during LLM inference stops helping, i.e., whether there is an optimal beam width beyond which output quality degrades. Prior work studied beam width mainly for inference efficiency, not for whether over-wide beams hurt performance. Grounded in Extreme Value Theory, the paper shows that beam selection over noisy scorer outputs introduces a systematic overestimation bias that grows with the candidate pool, and derives a maximum useful beam width k̂ beyond which search performance deteriorates. The key finding is that k̂ is governed by the scorer's signal-to-noise ratio: k̂ grows exponentially with (Δ/σ)², where Δ is the quality advantage of correct paths over incorrect ones and σ is the standard deviation of the scorer noise. Experiments confirm that perplexity scoring yields k̂ = 1 while PRM scoring yields k̂ ≥ 4, with gains of up to 8.9 percentage points; the paper therefore proposes the scorer's signal-to-noise ratio as the central quantity for beam-width selection and provides practical diagnostics to guide the choice of beam width.
Link: https://arxiv.org/abs/2603.15377
Authors: Gal Dalal,Assaf Hallak,Gal Chechik,Yftach Ziser
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Wider beam search should improve LLM reasoning, but when should you stop widening? Prior work on beam width selection has focused on inference efficiency (qin2025dsbd; freitag2017beam), without analyzing whether wider search can hurt output quality. We present an analysis, grounded in Extreme Value Theory, that answers this question. Beam selection over noisy scorer outputs introduces a systematic overestimation bias that grows with the candidate pool size, and we derive a maximum useful beam width k̂ beyond which search degrades performance. This critical width depends on the signal-to-noise ratio of the scorer: k̂ grows exponentially with (Δ/σ)², where Δ > 0 is the quality advantage of correct paths over incorrect ones and σ is the scorer noise. We validate this theory by comparing perplexity-guided and PRM-guided beam search across three 7B-parameter models and ten domains on MR-BEN (5,975 questions). Perplexity scoring, with its high noise, yields k̂ = 1: search provides no benefit at any width tested. PRM scoring, with lower noise, yields k̂ ≥ 4, with gains of up to 8.9 percentage points. The same model, the same algorithm, but different scorers place k̂ at opposite ends of the beam width range. Our analysis identifies the scorer's signal-to-noise ratio as the key quantity governing beam width selection, and we propose diagnostic indicators for choosing the beam width in practice.
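The overestimation effect is easy to reproduce in a toy Monte Carlo: pick the argmax among k candidates whose true scores differ by Δ but are observed through Gaussian noise of standard deviation σ. With a noisy scorer, selection accuracy collapses as the pool widens; with a clean scorer it barely moves. The parameters below are invented, chosen only to illustrate the Δ/σ dependence, and the simulation is not the paper's analysis.

```python
import math
import random

def selection_accuracy(delta, sigma, k, trials=20000):
    """P(the single correct candidate wins the argmax) when one candidate has
    true score delta, the other k-1 have true score 0, and the scorer adds
    i.i.d. Gaussian noise with std sigma."""
    rng = random.Random(0)  # deterministic for reproducibility
    wins = 0
    for _ in range(trials):
        correct = delta + rng.gauss(0.0, sigma)
        best_wrong = (max(rng.gauss(0.0, sigma) for _ in range(k - 1))
                      if k > 1 else -math.inf)
        if correct > best_wrong:
            wins += 1
    return wins / trials

# High-noise scorer (perplexity-like): accuracy degrades as the pool grows,
# because the max over many noise draws overestimates some wrong candidate.
noisy = [selection_accuracy(1.0, 2.0, k) for k in (2, 4, 16)]
# Low-noise scorer (PRM-like): accuracy stays high at the same widths.
clean = [selection_accuracy(1.0, 0.25, k) for k in (2, 4, 16)]
```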
[AI-28] GradCFA: A Hybrid Gradient-Based Counterfactual and Feature Attribution Explanation Algorithm for Local Interpretation of Neural Networks
【Quick Read】: This paper addresses the limited interpretability delivered by current explainable AI (XAI) methods in practice, in particular the difficulty of balancing feasibility, plausibility, and diversity in the two dominant paradigms, counterfactual explanations (CFX) and feature attribution (FA). The key to the solution is GradCFA, a hybrid framework combining CFX and FA that explicitly optimizes these three core qualities to improve overall explanation quality; GradCFA also extends CFX, traditionally limited to binary classification, to multi-class scenarios, broadening its applicability, and benchmark evaluations confirm its strength in validity, sparsity, plausibility, and diversity.
Link: https://arxiv.org/abs/2603.15373
Authors: Jacob Sanderson,Hua Mao,Wai Lok Woo
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Explainable Artificial Intelligence (XAI) is increasingly essential as AI systems are deployed in critical fields such as healthcare and finance, offering transparency into AI-driven decisions. Two major XAI paradigms, counterfactual explanations (CFX) and feature attribution (FA), serve distinct roles in model interpretability. This study introduces GradCFA, a hybrid framework combining CFX and FA to improve interpretability by explicitly optimizing feasibility, plausibility, and diversity - key qualities often unbalanced in existing methods. Unlike most CFX research focused on binary classification, GradCFA extends to multi-class scenarios, supporting a wider range of applications. We evaluate GradCFA’s validity, proximity, sparsity, plausibility, and diversity against state-of-the-art methods, including Wachter, DiCE, CARE for CFX, and SHAP for FA. Results show GradCFA effectively generates feasible, plausible, and diverse counterfactuals while offering valuable FA insights. By identifying influential features and validating their impact, GradCFA advances AI interpretability. The code for implementation of this work can be found at: this https URL .
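Gradient-based counterfactual search of the kind GradCFA builds on can be sketched against a hand-coded logistic model: descend on (prediction - target)^2 plus a proximity penalty to find the smallest input change that flips the prediction. The model weights, target, and hyperparameters below are invented; this is a generic gradient-based CFX sketch, not GradCFA itself.

```python
import math

# Hand-coded logistic "model": p(class 1 | x) = sigmoid(w . x + b).
W = [1.5, -2.0]
B = -0.5

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x):
    return sigmoid(W[0] * x[0] + W[1] * x[1] + B)

def counterfactual(x0, target=0.9, lam=0.02, lr=0.05, steps=500):
    """Gradient descent on loss = (p(x) - target)^2 + lam * ||x - x0||^2:
    move x just far enough to approach the target class probability while
    the proximity penalty keeps the counterfactual close to the original."""
    x = list(x0)
    for _ in range(steps):
        p = predict(x)
        dp = p * (1 - p)  # derivative of the sigmoid
        for i in range(2):
            grad = 2 * (p - target) * dp * W[i] + 2 * lam * (x[i] - x0[i])
            x[i] -= lr * grad
    return x

x0 = [0.0, 1.0]          # originally classified as class 0
xcf = counterfactual(x0)  # nearby input pushed toward class 1
```

The per-feature displacement xcf[i] - x0[i] doubles as a crude feature-attribution signal, which is the intuition behind combining the CFX and FA views.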
[AI-29] SKILLS: Structured Knowledge Injection for LLM -Driven Telecommunications Operations
【Quick Read】: This paper asks whether generative AI, specifically general-purpose large language model (LLM) agents, can reliably execute telecom operations workflows through real API interfaces, or whether structured domain knowledge is needed to guide them through telecom service lifecycle operations. The key to the solution is the SKILLS framework, a benchmark of 37 scenarios spanning 8 TM Forum Open API domains, which injects structured knowledge into the agent as a portable skill document encoding workflow logic, API-call patterns, and business rules; experiments show this knowledge injection yields consistent task-success gains across all evaluated models against live APIs.
Link: https://arxiv.org/abs/2603.15372
Authors: Ivo Brett
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:As telecommunications operators accelerate adoption of AI-enabled automation, a practical question remains unresolved: can general-purpose large language model (LLM) agents reliably execute telecom operations workflows through real API interfaces, or do they require structured domain guidance? We introduce SKILLS (Structured Knowledge Injection for LLM-driven Service Lifecycle operations), a benchmark framework comprising 37 telecom operations scenarios spanning 8 TM Forum Open API domains (TMF620, TMF621, TMF622, TMF628, TMF629, TMF637, TMF639, TMF724). Each scenario is grounded in live mock API servers with seeded production-representative data, MCP tool interfaces, and deterministic evaluation rubrics combining response content checks, tool-call verification, and database state assertions. We evaluate open-weight models under two conditions: baseline (generic agent with tool access but no domain guidance) and with-skill (agent augmented with a portable this http URL document encoding workflow logic, API patterns, and business rules). Results across 5 open-weight model conditions and 185 scenario-runs show consistent skill lift across all models. MiniMax M2.5 leads (81.1% with-skill, +13.5pp), followed by Nemotron 120B (78.4%, +18.9pp), GLM-5 Turbo (78.4%, +5.4pp), and Seed 2.0 Lite (75.7%, +18.9pp).
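The deterministic rubrics described in the abstract combine three checks: response content, tool-call verification, and database state assertions. A sketch with a hypothetical customer-suspension scenario; all names, endpoints, and keys below are invented, not from the benchmark.

```python
def evaluate_scenario(rubric, transcript, tool_calls, db):
    """A scenario passes only if all three deterministic checks pass."""
    content_ok = all(s in transcript for s in rubric["must_mention"])
    tools_ok = all(t in tool_calls for t in rubric["required_tool_calls"])
    state_ok = all(db.get(k) == v for k, v in rubric["expected_state"].items())
    return {"content": content_ok, "tools": tools_ok, "state": state_ok,
            "passed": content_ok and tools_ok and state_ok}

# Hypothetical customer-suspension rubric (illustrative only).
rubric = {
    "must_mention": ["customer suspended"],
    "required_tool_calls": ["PATCH /customer/42"],
    "expected_state": {"customer:42:status": "suspended"},
}
passed = evaluate_scenario(
    rubric,
    transcript="Done: customer suspended as requested.",
    tool_calls=["GET /customer/42", "PATCH /customer/42"],
    db={"customer:42:status": "suspended"},
)
failed = evaluate_scenario(
    rubric,
    transcript="Done.",
    tool_calls=["GET /customer/42"],
    db={"customer:42:status": "active"},
)
```

Because each check is a plain equality or membership test, the same rubric scores the baseline and with-skill runs identically, which is what makes the paired comparison meaningful.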
[AI-30] Brain-Inspired Graph Multi-Agent Systems for LLM Reasoning
【Quick Read】: This paper addresses the accuracy collapse of large language models (LLMs) on complex multi-step reasoning tasks, which persists even in large reasoning models (LRMs) equipped with extended chain-of-thought. The key to the solution is Brain-Inspired Graph Multi-Agent Systems (BIGMAS), which organizes specialized LLM agents as nodes of a dynamically constructed directed graph coordinated through a centralized shared workspace: a problem-adaptive GraphDesigner builds task-specific agent topologies, while a global Orchestrator makes routing decisions from the complete shared state, overcoming the local-view bottleneck of conventional reactive approaches and delivering gains orthogonal to model-level reasoning enhancements.
Link: https://arxiv.org/abs/2603.15371
Authors: Guangfu Hao,Yuming Dai,Xianzhe Qin,Shan Yu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments:
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of language tasks, yet complex multi-step reasoning remains a fundamental challenge. While Large Reasoning Models (LRMs) equipped with extended chain-of-thought mechanisms demonstrate improved performance over standard LLMs, both model types still suffer from accuracy collapse on sufficiently complex tasks, suggesting that scaling model-level reasoning alone is insufficient. Inspired by the global workspace theory of human cognition, we propose Brain-Inspired Graph Multi-Agent Systems (BIGMAS), in which specialized LLM agents are organized as nodes in a dynamically constructed directed graph and coordinate exclusively through a centralized shared workspace. A problem-adaptive GraphDesigner constructs task-specific agent topologies, while a global Orchestrator leverages the complete shared state for routing decisions, overcoming the local-view bottleneck of reactive approaches. Experiments on Game24, Six Fives, and Tower of London across six frontier LLMs demonstrate that BIGMAS consistently improves reasoning performance for both standard LLMs and LRMs, outperforming existing multi-agent baselines including ReAct and Tree of Thoughts, showing that multi-agent architectural design provides complementary gains orthogonal to model-level reasoning enhancements.
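以下 Python 草图示意"共享工作空间 + 全局调度器"的协调模式:所有智能体只读写共享状态,调度器基于完整状态(而非局部视图)做路由决策。函数与字段名均为概念演示用假设,并非 BIGMAS 的实际实现:

```python
def run_workspace(agents, orchestrator, task, max_steps=10):
    """Global-workspace coordination sketch: agents communicate only
    through the shared state, and a central orchestrator picks the next
    agent from the COMPLETE workspace (names are illustrative)."""
    workspace = {"task": task, "history": [], "done": False, "answer": None}
    for _ in range(max_steps):
        name = orchestrator(workspace)
        workspace["history"].append(name)
        agents[name](workspace)
        if workspace["done"]:
            break
    return workspace["answer"]

# Toy three-agent pipeline: plan -> solve -> verify (eval is only a toy
# stand-in for a solver on arithmetic tasks).
def planner(ws): ws["plan"] = f"compute {ws['task']}"
def solver(ws): ws["answer"] = eval(ws["task"])
def verifier(ws): ws["done"] = ws.get("answer") is not None

def orchestrator(ws):
    if "plan" not in ws: return "planner"
    if ws.get("answer") is None: return "solver"
    return "verifier"

agents = {"planner": planner, "solver": solver, "verifier": verifier}
print(run_workspace(agents, orchestrator, "6 * (1 + 3)"))
```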
[AI-31] FuXiWeather2: Learning accurate atmospheric state estimation for operational global weather forecasting
【速读】:该论文旨在解决数值天气预报(Numerical Weather Prediction, NWP)中因数据同化与数值模拟计算瓶颈导致的系统性偏差和延迟问题。现有机器学习模型多作为再分析产品(reanalysis products)的“代理模型”,难以消除其固有误差且无法实现实时响应。解决方案的关键在于提出FuXiWeather2——一个统一的端到端神经框架,通过直接对真实观测与再分析数据联合优化训练目标,有效校正再分析产品的内在误差;同时引入递归展开(recursive unrolling)训练方法缓解训练与部署阶段背景场分布偏移问题,并基于原始与模拟观测混合数据集训练以降低观测分布不一致的影响,从而实现分钟级高分辨率(0.25°)全球分析场与10天预报生成,显著优于NCEP-GFS、ERA5及ECMWF-HRES系统。
链接: https://arxiv.org/abs/2603.15358
作者: Xiaoze Xu,Xiuyu Sun,Songling Zhu,Xiaohui Zhong,Yuanqing Huang,Zijian Zhu,Jun Liu,Hao Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:
Abstract:Numerical weather prediction has long been constrained by the computational bottlenecks inherent in data assimilation and numerical modeling. While machine learning has accelerated forecasting, existing models largely serve as “emulators of reanalysis products,” thereby retaining their systematic biases and operational latencies. Here, we present FuXiWeather2, a unified end-to-end neural framework for assimilation and forecasting. We align training objectives directly with a combination of real-world observations and reanalysis data, enabling the framework to effectively rectify inherent errors within reanalysis products. To address the distribution shift between NWP-derived background inputs during training and self-generated backgrounds during deployment, we introduce a recursive unrolling training method to enhance the precision and stability of analysis generation. Furthermore, our model is trained on a hybrid dataset of raw and simulated observations to mitigate the impact of observational distribution inconsistency. FuXiWeather2 generates high-resolution ( 0.25^\circ ) global analysis fields and 10-day forecasts within minutes. The analysis fields surpass the NCEP-GFS across most variables and demonstrate superior accuracy over both ERA5 and the ECMWF-HRES system in lower-tropospheric and surface variables. These high-quality analysis fields drive deterministic forecasts that exceed the skill of the HRES system in 91% of evaluated metrics. Additionally, its outstanding performance in typhoon track prediction underscores its practical value for rapid response to extreme weather events. The FuXiWeather2 analysis dataset is available at this https URL.
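摘要中的"递归展开(recursive unrolling)"训练思想可用如下玩具代码示意:训练时让模型自身的上一步分析场经预报算子得到背景场,而非始终依赖 NWP 背景,从而对齐训练与部署分布。其中标量化的预报/同化算子均为演示用的简化替身:

```python
def unrolled_assimilation_loss(assimilate, forecast, obs_seq, truth_seq, x0):
    """Recursive unrolling sketch: each cycle assimilates from the model's
    OWN previous analysis pushed through the forecast step, matching the
    self-generated backgrounds seen at deployment (toy scalar operators
    stand in for the neural assimilation and forecast components)."""
    analysis, loss = x0, 0.0
    for obs, truth in zip(obs_seq, truth_seq):
        background = forecast(analysis)      # self-generated background
        analysis = assimilate(background, obs)
        loss += (analysis - truth) ** 2
    return loss / len(obs_seq)

# Toy operators: a damped-persistence forecast, and an assimilation step
# that nudges the background toward the observation with gain 0.5.
forecast = lambda x: 0.9 * x
assimilate = lambda bg, y: bg + 0.5 * (y - bg)
val = unrolled_assimilation_loss(
    assimilate, forecast, [1.0, 1.1, 0.9], [1.0, 1.0, 1.0], x0=0.0)
print(round(val, 4))
```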
[AI-32] Conditional Rectified Flow-based End-to-End Rapid Seismic Inversion Method
【速读】:该论文旨在解决传统全波形反演(Full Waveform Inversion, FWI)方法中存在的计算成本高和对初始模型依赖性强的问题。现有基于深度生成模型的反演方法在采样效率与反演精度之间难以平衡。解决方案的关键在于提出一种基于条件修正流(Conditional Rectified Flow)的端到端快速地震反演方法:通过设计专用的地震编码器提取多尺度地震特征,并采用逐层注入控制策略实现细粒度的条件控制,从而在保持高精度的同时显著提升采样效率。实验表明,该方法在OpenFWI基准数据集上表现优异,且在Marmousi真实数据上的零样本泛化实验验证了其工业实用性。
链接: https://arxiv.org/abs/2603.15354
作者: Haofei Xu,Wei Cheng,Sizhe Li,Jie Xiong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Seismic inversion is a core problem in geophysical exploration, where traditional methods suffer from high computational costs and are susceptible to initial-model dependence. In recent years, deep generative model-based seismic inversion methods have achieved remarkable progress, but existing generative models struggle to balance sampling efficiency and inversion accuracy. This paper proposes an end-to-end fast seismic inversion method based on Conditional Rectified Flow[1], which designs a dedicated seismic encoder to extract multi-scale seismic features and adopts a layer-by-layer injection control strategy to achieve fine-grained conditional control. Experimental results demonstrate that the proposed method achieves excellent inversion accuracy on the OpenFWI[2] benchmark dataset: compared with Diffusion[3,4] methods it achieves sampling acceleration, while maintaining higher accuracy than InversionNet[5,6,7] methods. Zero-shot generalization experiments on the Marmousi[8,9] standard model further verify that the method can generate high-quality initial velocity models without retraining, effectively alleviating the initial-model dependency problem in traditional Full Waveform Inversion (FWI) and demonstrating practical industrial value.
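条件修正流的采样过程本质上是对学到的速度场做少步欧拉积分。下面给出一个自包含的示意:`toy_velocity` 是假设的"已训练"速度场(真实模型中由神经网络给出),其直线轨迹性质使得少步积分即可精确到达条件目标,这正是修正流相对扩散模型的加速来源:

```python
import random

def sample_rectified_flow(velocity, cond, dim, steps=4, seed=0):
    """Euler integration of a learned velocity field from noise (t=0) to a
    sample (t=1). Rectified flows train near-straight trajectories, so few
    (even one) steps suffice; `velocity` stands in for the trained net."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]   # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy "trained" field whose flow transports any x straight toward the
# condition vector: v(x, t, c) = (c - x) / (1 - t) is exact for lines.
def toy_velocity(x, t, c):
    return [(ci - xi) / (1.0 - t) for xi, ci in zip(x, c)]

target = [1.0, -2.0, 0.5]
out = sample_rectified_flow(toy_velocity, target, dim=3, steps=4)
print([round(v, 6) for v in out])
```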
[AI-33] NV-Bench: Benchmark of Nonverbal Vocalization Synthesis for Expressive Text-to-Speech Generation
【速读】:该论文旨在解决当前文本到语音(Text-to-Speech, TTS)系统中非言语发声(Nonverbal Vocalizations, NVs)评估缺乏标准化指标和可靠参考基准的问题。现有方法将NVs视为声学特征而非具有交际功能的言语行为,导致评价体系不统一且难以反映实际使用场景中的表现。解决方案的关键在于提出NV-Bench——首个基于功能分类法(functional taxonomy)构建的基准测试框架,将NVs视为沟通行为而非单纯声学现象,并包含1,651条多语言、真实环境下的NV语料及其配对人类参考音频,覆盖14类NV。同时引入双维度评估协议:(1)指令对齐性(Instruction Alignment),通过提出的副语言字符错误率(Paralinguistic Character Error Rate, PCER)衡量可控性;(2)声学保真度(Acoustic Fidelity),通过分布差异度量真实录音的声学真实性。实验表明该框架的客观指标与人类感知高度相关,为NV生成提供了标准化评估基础。
链接: https://arxiv.org/abs/2603.15352
作者: Qinke Ni,Huan Liao,Dekun Chen,Yuxiang Wang,Zhizheng Wu
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:While recent text-to-speech (TTS) systems increasingly integrate nonverbal vocalizations (NVs), their evaluations lack standardized metrics and reliable ground-truth references. To bridge this gap, we propose NV-Bench, the first benchmark grounded in a functional taxonomy that treats NVs as communicative acts rather than acoustic artifacts. NV-Bench comprises 1,651 multi-lingual, in-the-wild utterances with paired human reference audio, balanced across 14 NV categories. We introduce a dual-dimensional evaluation protocol: (1) Instruction Alignment, utilizing the proposed paralinguistic character error rate (PCER) to assess controllability; and (2) Acoustic Fidelity, measuring the distributional gap to real recordings to assess acoustic realism. We evaluate diverse TTS models and develop two baselines. Experimental results demonstrate a strong correlation between our objective metrics and human perception, establishing NV-Bench as a standardized evaluation framework.
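论文摘要未给出 PCER 的具体公式,但其命名提示它是字符错误率(CER)在副语言标记上的推广。下面给出一个基于编辑距离的 CER 草图,将 `<laugh>`、`<sigh>` 等非言语标记与文本 token 一同计分(具体记分方式为假设,仅供理解该类指标):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (rolling 1-D DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # old d[j] (delete), new d[j-1] (insert), diagonal (substitute)
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def cer(ref, hyp):
    """Token error rate: edit distance over reference length. A
    paralinguistic variant like PCER would score NV tags such as <laugh>
    alongside text tokens (exact definition assumed, not from the paper)."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

ref = ["well", "<laugh>", "that", "went", "fine"]
hyp = ["well", "that", "went", "<sigh>", "fine"]
print(cer(ref, hyp))  # one <laugh> deleted, one <sigh> inserted: 2/5 = 0.4
```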
[AI-34] Evolutionary Transfer Learning for Dragonchess
【速读】:该论文旨在解决将人工智能(AI)启发式知识从结构相对简单、规则明确的棋类游戏(如国际象棋)迁移至结构复杂、多层且规则差异显著的新型棋类游戏——Dragonchess中的难题。其核心挑战在于,直接移植来自顶级国际象棋引擎Stockfish的启发式评估函数在Dragonchess中表现不佳,因后者具有独特的三维多层结构和运动机制。解决方案的关键在于采用协方差矩阵自适应进化策略(Covariance Matrix Adaptation Evolution Strategy, CMA-ES)对原始启发式函数进行演化优化,从而有效适应Dragonchess的特殊规则与状态空间,最终显著提升AI代理的游戏性能。
链接: https://arxiv.org/abs/2603.15297
作者: Jim O’Connor,Annika Hoag,Sarah Goyette,Gary B. Parker
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Dragonchess, a three-dimensional chess variant introduced by Gary Gygax, presents unique strategic and computational challenges that make it an ideal environment for studying the transfer of artificial intelligence (AI) heuristics across domains. In this work, we introduce Dragonchess as a novel testbed for AI research and provide an open-source, Python-based game engine for community use. Our research investigates evolutionary transfer learning by adapting heuristic evaluation functions directly from Stockfish, a leading chess engine, and subsequently optimizing them using Covariance Matrix Adaptation Evolution Strategy (CMA-ES). Initial trials showed that direct heuristic transfers were inadequate due to Dragonchess’s distinct multi-layer structure and movement rules. However, evolutionary optimization significantly improved AI agent performance, resulting in superior gameplay demonstrated through empirical evaluation in a 50-round Swiss-style tournament. This research establishes the effectiveness of evolutionary methods in adapting heuristic knowledge to structurally complex, previously unexplored game domains.
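CMA-ES 的完整实现较长,这里用一个带近似 1/5 成功规则步长自适应的 (1+1) 进化策略作简化示意,演示"演化启发式权重"的基本流程。适应度函数是假设的替身;真实场景中应由对局/锦标赛结果给出:

```python
import random

def evolve_weights(fitness, w0, sigma=0.3, iters=200, seed=1):
    """A minimal (1+1) evolution strategy with step-size adaptation,
    standing in for the paper's CMA-ES: mutate the heuristic weight
    vector, keep the child if it scores no worse."""
    rng = random.Random(seed)
    w, best = list(w0), fitness(w0)
    for _ in range(iters):
        child = [wi + sigma * rng.gauss(0, 1) for wi in w]
        f = fitness(child)
        if f >= best:
            w, best = child, f
            sigma *= 1.22          # expand step size on success
        else:
            sigma *= 0.95          # shrink on failure
    return w, best

# Stand-in fitness: distance of transferred heuristic weights to some
# (unknown in practice) ideal mix for the new game; a real run would
# score tournament outcomes instead.
ideal = [9.0, 0.1, 3.2]
fit = lambda w: -sum((a - b) ** 2 for a, b in zip(w, ideal))
w, best = evolve_weights(fit, [1.0, 1.0, 1.0])
print(round(-best, 4))
```

CMA-ES 额外维护一个协方差矩阵以适应各权重间的尺度与相关性,本草图仅保留其"变异-选择-步长自适应"骨架。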
[AI-35] Algorithms for Deciding the Safety of States in Fully Observable Non-deterministic Problems: Technical Report
【速读】:该论文旨在解决学习型动作策略(learned action policies)在顺序决策任务中缺乏安全性保障的问题,特别是如何高效判断某一状态是否安全(即是否存在从该状态出发的安全策略),并识别导致状态不安全的“故障”(faults,即从安全状态转移到不安全状态的状态-动作对)。其解决方案的关键在于提出了一种新的策略迭代算法iPI(iterative Policy Iteration),该算法结合了现有方法的优势:在最佳情况下可达到与TarjanSafe算法相同的线性时间复杂度,同时保证多项式级别的最坏情况时间复杂度,从而在理论性能和实际效率之间实现了平衡。实验表明,在适合TarjanSafe的问题上,iPI表现相当;而在不适合该算法的问题上,iPI展现出显著更优的扩展性。
链接: https://arxiv.org/abs/2603.15282
作者: Johannes Schmalz,Chaahat Jain
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Learned action policies are increasingly popular in sequential decision-making, but suffer from a lack of safety guarantees. Recent work introduced a pipeline for testing the safety of such policies under initial-state and action-outcome non-determinism. At the pipeline's core is the problem of deciding whether a state is safe (a safe policy exists from the state) and finding faults, which are state-action pairs that transition from a safe state to an unsafe one. Their most effective algorithm for deciding safety, TarjanSafe, performs well on their benchmarks, but we show that it has exponential worst-case runtime with respect to the state space. A linear-time alternative exists, but it is slower in practice. We close this gap with a new policy-iteration algorithm, iPI, which combines the best of both: it matches TarjanSafe's best-case runtime while guaranteeing a polynomial worst-case. Experiments confirm our theory and show that in problems amenable to TarjanSafe iPI has similar performance, whereas in ill-suited problems iPI scales exponentially better.
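"某状态是否存在安全策略"可写成一个最大不动点计算:反复删除"没有任何动作能保证停留在安全集合内"的状态,直至收敛。以下是该判定问题的朴素 Python 草图(复杂度未作优化,并非论文中的 TarjanSafe 或 iPI 算法本身)。例中状态-动作对 (s0, "a") 即一个 fault:它可能把安全状态 s0 带入不安全状态 s2:

```python
def safe_states(states, actions, unsafe):
    """Greatest-fixed-point computation: a state is safe iff some action
    leads, under EVERY nondeterministic outcome, only to safe states.
    `actions[s]` maps each action name to its set of possible successors;
    states with no actions are treated as safe absorbing states."""
    safe = set(states) - set(unsafe)
    changed = True
    while changed:
        changed = False
        for s in list(safe):
            if not actions.get(s):      # absorbing state stays safe
                continue
            if not any(outcomes <= safe for outcomes in actions[s].values()):
                safe.discard(s)         # no action keeps us in the safe set
                changed = True
    return safe

# Toy problem: s2 is unsafe; from s0, action "a" may reach s1 or s2
# (a fault), while action "b" deterministically reaches safe s1.
acts = {
    "s0": {"a": {"s1", "s2"}, "b": {"s1"}},
    "s1": {},
    "s2": {},
}
print(safe_states({"s0", "s1", "s2"}, acts, unsafe={"s2"}))
```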
[AI-36] Advancing Multimodal Agent Reasoning with Long-Term Neuro-Symbolic Memory
【速读】:该论文旨在解决当前多模态智能体在长期推理中依赖纯神经记忆系统所导致的分析性与演绎性推理能力不足的问题。现有方法主要基于向量表示和相似度检索,虽适用于归纳式直觉推理,但在需要逻辑规则支持的现实决策场景中表现受限。解决方案的关键在于提出NS-Mem(Neural-Symbolic Memory)框架,其核心创新包括:(1)三层记忆架构(情景层、语义层与逻辑规则层),实现神经表征与符号结构的协同存储;(2)基于SK-Gen的记忆构建与维护机制,自动从多模态经验中提取结构化知识并增量更新神经与符号组件;(3)混合检索机制,融合基于相似性的搜索与确定性的符号查询函数,从而支持结构化推理。实验表明,NS-Mem相较纯神经记忆系统平均提升4.35%的推理准确率,尤其在约束性推理任务上最高提升达12.5%。
链接: https://arxiv.org/abs/2603.15280
作者: Rongjie Jiang,Jianwei Wang,Gengda Zhao,Chengyang Luo,Kai Wang,Wenjie Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages, 6 figures
Abstract:Recent advances in large language models have driven the emergence of intelligent agents operating in open-world, multimodal environments. To support long-term reasoning, such agents are typically equipped with external memory systems. However, most existing multimodal agent memories rely primarily on neural representations and vector-based retrieval, which are well-suited for inductive, intuitive reasoning but fundamentally limited in supporting analytical, deductive reasoning critical for real-world decision making. To address this limitation, we propose NS-Mem, a long-term neuro-symbolic memory framework designed to advance multimodal agent reasoning by integrating neural memory with explicit symbolic structures and rules. Specifically, NS-Mem is operated around three core components of a memory system: (1) a three-layer memory architecture that consists episodic layer, semantic layer and logic rule layer, (2) a memory construction and maintenance mechanism implemented by SK-Gen that automatically consolidates structured knowledge from accumulated multimodal experiences and incrementally updates both neural representations and symbolic rules, and (3) a hybrid memory retrieval mechanism that combines similarity-based search with deterministic symbolic query functions to support structured reasoning. Experiments on real-world multimodal reasoning benchmarks demonstrate that Neural-Symbolic Memory achieves an average 4.35% improvement in overall reasoning accuracy over pure neural memory systems, with gains of up to 12.5% on constrained reasoning queries, validating the effectiveness of NS-Mem.
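NS-Mem 的"混合检索"可概括为:先用确定性的符号谓词裁剪候选集,再用相似度对剩余记忆排序。以下为该神经+符号检索模式的通用草图,字段与接口均为假设,并非 NS-Mem 的真实 API:

```python
def hybrid_retrieve(query_vec, query_filter, memory, k=2):
    """Hybrid memory lookup sketch: a deterministic symbolic predicate
    first prunes the candidate set, then cosine similarity over neural
    embeddings ranks what remains."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    candidates = [m for m in memory if query_filter(m["facts"])]
    candidates.sort(key=lambda m: cosine(query_vec, m["vec"]), reverse=True)
    return [m["id"] for m in candidates[:k]]

memory = [
    {"id": "e1", "vec": [1.0, 0.0], "facts": {"place": "kitchen", "day": 1}},
    {"id": "e2", "vec": [0.9, 0.1], "facts": {"place": "garage", "day": 2}},
    {"id": "e3", "vec": [0.0, 1.0], "facts": {"place": "kitchen", "day": 3}},
]
# "Which kitchen episodes match this cue?": symbolic filter + neural rank.
hits = hybrid_retrieve([1.0, 0.2], lambda f: f["place"] == "kitchen", memory)
print(hits)
```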
[AI-37] Probe-then-Plan: Environment-Aware Planning for Industrial E-commerce Search
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在现代电商搜索中面临的“盲区-延迟困境”问题:传统查询重写方法忽视检索能力和实时库存状态,导致生成的搜索计划无效;而基于深度搜索代理的迭代工具调用与反思机制则因延迟过高(数秒级别),无法满足工业级亚秒级响应要求。解决方案的关键在于提出环境感知搜索规划(Environment-Aware Search Planning, EASP),其核心创新是引入“探查-规划”(Probe-then-Plan)机制——通过轻量级检索探查(Retrieval Probe)获取当前环境快照,使规划器能够诊断执行缺口并生成可落地的搜索计划。该方法包含三个阶段:离线数据合成、规划器训练与对齐、自适应在线服务,最终在离线评估和线上A/B测试中显著提升了相关召回率及用户点击转化率(UCVR)与商品交易总额(GMV)。
链接: https://arxiv.org/abs/2603.15262
作者: Mengxiang Chen,Zhouwei Zhai,Jin Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Modern e-commerce search is evolving to resolve complex user intents. While Large Language Models (LLMs) offer strong reasoning, existing LLM-based paradigms face a fundamental blindness-latency dilemma: query rewriting is agnostic to retrieval capabilities and real-time inventory, yielding invalid plans; conversely, deep search agents rely on iterative tool calls and reflection, incurring seconds of latency incompatible with industrial sub-second budgets. To resolve this conflict, we propose Environment-Aware Search Planning (EASP), reformulating search planning as a dynamic reasoning process grounded in environmental reality. EASP introduces a Probe-then-Plan mechanism: a lightweight Retrieval Probe exposes the retrieval snapshot, enabling the Planner to diagnose execution gaps and generate grounded search plans. The methodology comprises three stages: (1) Offline Data Synthesis: A Teacher Agent synthesizes diverse, execution-validated plans by diagnosing the probed environment. (2) Planner Training and Alignment: The Planner is initialized via Supervised Fine-Tuning (SFT) to internalize diagnostic capabilities, then aligned with business outcomes (conversion rate) via Reinforcement Learning (RL). (3) Adaptive Online Serving: A complexity-aware routing mechanism selectively activates planning for complex queries, ensuring optimal resource allocation. Extensive offline evaluations and online A/B testing on this http URL demonstrate that EASP significantly improves relevant recall and achieves substantial lifts in UCVR and GMV. EASP has been successfully deployed in this http URL’s AI-Search system.
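"Probe-then-Plan"的核心控制流可以压缩成几行:先做一次轻量检索探查,若命中充足则直接放行原查询,否则基于探查快照生成落地的改写。以下示意中的库存数据、阈值与改写规则均为虚构,仅演示控制流:

```python
def probe_then_plan(query, search, planner):
    """Probe-then-Plan sketch: a cheap retrieval probe exposes what the
    engine can actually return for the raw query; the planner rewrites
    only when a gap is diagnosed (all logic illustrative)."""
    snapshot = search(query)                 # lightweight retrieval probe
    if snapshot["hits"] >= 10:               # enough inventory: no planning
        return query
    return planner(query, snapshot)          # grounded rewrite

inventory = {"red running shoes": 2, "running shoes": 120}
search = lambda q: {"hits": inventory.get(q, 0), "query": q}
# Relax the query to the best-stocked phrase contained in it.
planner = lambda q, snap: max(inventory, key=lambda k: inventory[k]
                              if set(k.split()) <= set(q.split()) else -1)
final = probe_then_plan("red running shoes", search, planner)
print(final)
```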
[AI-38] In-Context Symbolic Regression for Robustness-Improved Kolmogorov-Arnold Networks
【速读】:该论文旨在解决符号回归(Symbolic Regression)中从Kolmogorov-Arnold Networks (KANs) 提取可解释、可验证的符号表达式时存在的瓶颈问题。传统方法将每个边函数独立拟合为符号算子,导致对初始值敏感、非凸优化困难,并忽略局部替换在全网中的交互影响。解决方案的关键在于引入上下文内符号回归(in-context symbolic regression),提出两种互补实现:一是贪婪式上下文符号回归(Greedy in-context Symbolic Regression, GSR),通过微调后端到端损失下降量进行逐边替换选择;二是门控匹配追踪(Gated Matching Pursuit, GMP),训练一个可微分的门控算子层,以稀疏门控机制在每条边上选择算子库中的最优操作符,收敛后离散化门控并可选地进行简短的贪婪精修。此策略显著提升了符号提取的鲁棒性和公式的一致性,实验表明其在中位数OFAT测试均方误差上最高可减少99.8%。
链接: https://arxiv.org/abs/2603.15250
作者: Francesco Sovrano,Lidia Losavio,Giulia Vilone,Marc Langheinrich
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24 pages; Accepted for publication at XAI’2026
Abstract:Symbolic regression aims to replace black-box predictors with concise analytical expressions that can be inspected and validated in scientific machine learning. Kolmogorov-Arnold Networks (KANs) are well suited to this goal because each connection between adjacent units (an “edge”) is parametrised by a learnable univariate function that can, in principle, be replaced by a symbolic operator. In practice, however, symbolic extraction is a bottleneck: the standard KAN-to-symbol approach fits operators to each learned edge function in isolation, making the discrete choice sensitive to initialisation and non-convex parameter fitting, and ignoring how local substitutions interact through the full network. We study in-context symbolic regression for operator extraction in KANs, and present two complementary instantiations. Greedy in-context Symbolic Regression (GSR) performs greedy, in-context selection by choosing edge replacements according to end-to-end loss improvement after brief fine-tuning. Gated Matching Pursuit (GMP) amortises this in-context selection by training a differentiable gated operator layer that places an operator library behind sparse gates on each edge; after convergence, gates are discretised (optionally followed by a short in-context greedy refinement pass). We quantify robustness via one-factor-at-a-time (OFAT) hyper-parameter sweeps and assess both predictive error and qualitative consistency of recovered formulas. Across several experiments, greedy in-context symbolic regression achieves up to 99.8% reduction in median OFAT test MSE.
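GSR 的"按端到端损失改进做贪婪替换"可用一个两条边串联的微型例子示意:逐条边在算子库中试替换,保留使全模型损失最小的符号算子,而非孤立地拟合每条边。目标函数取 square(sin(x));算子库与损失均为演示用假设:

```python
import math

# Tiny stand-in for a KAN with two edges in series: y = g(f(x)).
# GSR-style selection replaces one edge at a time, keeping the symbolic
# operator that most reduces the END-TO-END loss, not the per-edge fit.
LIBRARY = {"sin": math.sin, "square": lambda v: v * v,
           "exp": math.exp, "identity": lambda v: v}

xs = [i / 10 for i in range(-20, 21)]
target = [math.sin(x) ** 2 for x in xs]      # ground truth: square(sin(x))

def loss(f_name, g_name):
    """Mean squared error of the full two-edge model g(f(x))."""
    f, g = LIBRARY[f_name], LIBRARY[g_name]
    return sum((g(f(x)) - t) ** 2 for x, t in zip(xs, target)) / len(xs)

def greedy_select(edges):
    """Replace each edge in turn with the library operator minimizing
    the full-model loss given the other edges' current choices."""
    for i in range(len(edges)):
        trial = lambda name: loss(*(edges[:i] + [name] + edges[i + 1:]))
        edges[i] = min(LIBRARY, key=trial)
    return edges

print(greedy_select(["identity", "identity"]))  # recovers ["sin", "square"]
```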
[AI-39] Why the Valuable Capabilities of LLM s Are Precisely the Unexplainable Ones
【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)中真正有价值的能力为何难以通过人类可读的离散规则进行完整描述。其解决方案的关键在于提出一个反证法论证——若LLMs的全部能力可被一套完整的人类可读规则所捕获,则该规则集在功能上等价于专家系统;然而,历史和实证研究表明专家系统在能力上严格弱于LLMs,由此产生矛盾,从而证明LLMs超越专家系统的那些能力恰恰是无法被规则编码的部分。这一核心论点结合了中国哲学中的“悟”(Wu,即通过实践获得顿悟)、专家系统的历史局限性以及人类认知工具与复杂系统之间的结构性错配,为理解LLMs的本质能力提供了新的理论框架。
链接: https://arxiv.org/abs/2603.15238
作者: Quan Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages
Abstract:This paper proposes and argues for a counterintuitive thesis: the truly valuable capabilities of large language models (LLMs) reside precisely in the part that cannot be fully captured by human-readable discrete rules. The core argument is a proof by contradiction via expert system equivalence: if the full capabilities of an LLM could be described by a complete set of human-readable rules, then that rule set would be functionally equivalent to an expert system; but expert systems have been historically and empirically demonstrated to be strictly weaker than LLMs; therefore, a contradiction arises – the capabilities of LLMs that exceed those of expert systems are exactly the capabilities that cannot be rule-encoded. This thesis is further supported by the Chinese philosophical concept of Wu (sudden insight through practice), the historical failure of expert systems, and a structural mismatch between human cognitive tools and complex systems. The paper discusses implications for interpretability research, AI safety, and scientific epistemology.
[AI-40] SCAN: Sparse Circuit Anchor Interpretable Neuron for Lifelong Knowledge Editing
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在顺序知识编辑过程中出现的灾难性遗忘(catastrophic forgetting)和模型崩溃(collapse)问题。现有密集式编辑方法将模型视为黑箱,依赖粗粒度的参数干预,导致已习得知识被破坏。其解决方案的关键在于提出SCAN(Sparse Circuit Anchored Neuron),一种基于稀疏电路锚定神经元的稀疏编辑框架,通过构建由稀疏转码器(Sparse Transcoders)组成的知识电路,实现机制感知的知识操纵,从而在保持模型整体性能的同时精准更新特定知识。
链接: https://arxiv.org/abs/2603.15226
作者: Yuhuan Liu,Haitian Zhong,Xinyuan Xia,Qiang Liu,Shu Wu,Liang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 21 pages, 7 figures
Abstract:Large Language Models (LLMs) often suffer from catastrophic forgetting and collapse during sequential knowledge editing. This vulnerability stems from the prevailing dense editing paradigm, which treats models as black boxes and relies on coarse-grained parameter interventions that inevitably disrupt preserved knowledge. To address this, we propose SCAN (a sparse editing framework based on Sparse Circuit Anchored Neuron) which transforms editing into a mechanism-aware manipulation by constructing a knowledge circuit via Sparse Transcoders. Experiments on Gemma2, Qwen3, and Llama3.1 across CounterFact, ZsRE and WikiFactDiff demonstrate that SCAN achieves a superior performance, maintaining model integrity on benchmarks like MMLU and GSM8K even after 3,000 sequential edits, whereas other existing methods deteriorate progressively as editing accumulates, eventually resulting in model collapse.
[AI-41] ADV-0: Closed-Loop Min-Max Adversarial Training for Long-Tail Robustness in Autonomous Driving
【速读】:该论文旨在解决自动驾驶系统在面对长尾场景(long-tail scenarios)时的鲁棒性问题,即那些发生概率极低但安全风险极高的罕见场景。现有对抗训练方法通常将场景生成与策略优化解耦,并依赖启发式代理目标,导致目标不一致且无法捕捉随策略演化而变化的失效模式。其解决方案的关键在于提出ADV-0框架,这是一个闭环的最小-最大优化机制,将驾驶策略(防御者)与对抗智能体(攻击者)之间的交互建模为零和马尔可夫博弈(zero-sum Markov game),并通过直接对齐攻击者的效用函数与防御者的目标函数,揭示出最优对抗分布;进一步地,将动态对抗演化建模为迭代偏好学习过程,从而高效逼近该最优解并提供算法无关的博弈求解方案。理论证明ADV-0收敛至纳什均衡(Nash Equilibrium),并最大化真实世界性能的认证下界,实验证明其能有效暴露多样化的安全关键失效场景,并显著提升策略与运动规划器对未见长尾风险的泛化能力。
链接: https://arxiv.org/abs/2603.15221
作者: Tong Nie,Yihong Tang,Junlin He,Yuewen Mei,Jie Sun,Lijun Sun,Wei Ma,Jian Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Deploying autonomous driving systems requires robustness against long-tail scenarios that are rare but safety-critical. While adversarial training offers a promising solution, existing methods typically decouple scenario generation from policy optimization and rely on heuristic surrogates. This leads to objective misalignment and fails to capture the shifting failure modes of evolving policies. This paper presents ADV-0, a closed-loop min-max optimization framework that treats the interaction between driving policy (defender) and adversarial agent (attacker) as a zero-sum Markov game. By aligning the attacker’s utility directly with the defender’s objective, we reveal the optimal adversary distribution. To make this tractable, we cast dynamic adversary evolution as iterative preference learning, efficiently approximating this optimum and offering an algorithm-agnostic solution to the game. Theoretically, ADV-0 converges to a Nash Equilibrium and maximizes a certified lower bound on real-world performance. Experiments indicate that it effectively exposes diverse safety-critical failures and greatly enhances the generalizability of both learned policies and motion planners against unseen long-tail risks.
[AI-42] InterPol: De-anonymizing LM Arena via Interpolated Preference Learning
【速读】:该论文旨在解决模型响应严格匿名性(strict anonymity)在基于投票的排行榜(如LM Arena)中面临的威胁问题,即如何有效识别出匿名模型的真实来源,从而保障排行榜的可靠性。现有方法依赖TF-IDF或词袋模型等浅层统计特征,难以区分风格相似或同一家族的模型。本文提出INTERPOL框架,其核心在于利用模型插值生成难负样本,并结合自适应课程学习策略,捕捉深层风格模式,显著提升了模型识别准确率,验证了现实场景下排名操纵的风险。
链接: https://arxiv.org/abs/2603.15220
作者: Minsung Cho,Jaehyung Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Strict anonymity of model responses is key to the reliability of voting-based leaderboards, such as LM Arena. While prior studies have attempted to compromise this assumption using simple statistical features like TF-IDF or bag-of-words, these methods often lack the discriminative power to distinguish between stylistically similar or within-family models. To overcome these limitations and expose the severity of this vulnerability, we introduce INTERPOL, a model-driven identification framework that learns to distinguish target models from others using interpolated preference data. Specifically, INTERPOL captures deep stylistic patterns that superficial statistical features miss by synthesizing hard negative samples through model interpolation and employing an adaptive curriculum learning strategy. Extensive experiments demonstrate that INTERPOL significantly outperforms existing baselines in identification accuracy. Furthermore, we quantify the real-world threat of our findings through ranking manipulation simulations on Arena battle data.
[AI-43] Towards Foundation Models for Consensus Rank Aggregation
【速读】:该论文旨在解决共识排序聚合(consensus ranking aggregation)问题,即从多个输入排序中生成一个最优的综合排序,该问题广泛应用于推荐系统、搜索引擎、招聘和选举等领域。传统方法中,以Kemeny距离最小化为目标的排序聚合虽具理论优势,但因其计算复杂度为NP-hard,难以在大规模场景下应用。论文提出的关键解决方案是Kemeny Transformer,这是一种基于Transformer架构并采用强化学习训练的模型,能够高效近似Kemeny最优排序,显著优于经典的多数启发式和马尔可夫链方法,并在推理速度上远超整数线性规划求解器,从而为实际排序聚合任务提供了可扩展且实用的新途径。
链接: https://arxiv.org/abs/2603.15218
作者: Yijun Jin,Simon Klüttermann,Chiara Balestra,Emmanuel Müller
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 16 pages, 5 figures
Abstract:Aggregating a consensus ranking from multiple input rankings is a fundamental problem with applications in recommendation systems, search engines, job recruitment, and elections. Despite decades of research in consensus ranking aggregation, minimizing the Kemeny distance remains computationally intractable. Specifically, determining an optimal aggregation of rankings with respect to the Kemeny distance is an NP-hard problem, limiting its practical application to relatively small-scale instances. We propose the Kemeny Transformer, a novel Transformer-based algorithm trained via reinforcement learning to efficiently approximate the Kemeny optimal ranking. Experimental results demonstrate that our model outperforms classical majority-heuristic and Markov-chain approaches, achieving substantially faster inference than integer linear programming solvers. Our approach thus offers a practical, scalable alternative for real-world ranking-aggregation tasks.
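Kemeny 最优排序即最小化候选排序到所有输入排序的 Kendall tau 距离(成对分歧数)之和。该目标函数本身很容易计算(难点在于要在 n! 个候选排序上最小化它),可示意如下:

```python
from itertools import combinations

def kendall_tau_distance(r1, r2):
    """Number of item pairs on which two rankings disagree
    (rankings are lists ordered from best to worst)."""
    pos1 = {item: i for i, item in enumerate(r1)}
    pos2 = {item: i for i, item in enumerate(r2)}
    return sum(1 for a, b in combinations(r1, 2)
               if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0)

def kemeny_score(candidate, rankings):
    """Total Kendall tau distance from a candidate consensus to all input
    rankings; the Kemeny-optimal ranking minimizes this sum, which is
    NP-hard in general."""
    return sum(kendall_tau_distance(candidate, r) for r in rankings)

votes = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
print(kemeny_score(["a", "b", "c"], votes))  # → 2
```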
[AI-44] Modeling Matches as Language: A Generative Transformer Approach for Counterfactual Player Valuation in Football ECML-PKDD
【速读】:该论文旨在解决足球球员转会评估中的核心难题:现有方法依赖静态统计数据和主观专家判断,无法充分考虑战术体系、队友配合及比赛情境等动态因素对球员表现的影响。其关键解决方案是提出ScoutGPT,一种基于NanoGPT架构的生成式模型,将比赛事件序列建模为语言模型中的连续标记(token),通过训练预测下一个事件来学习比赛过程的动力学特征;进而利用蒙特卡洛采样实现反事实模拟(counterfactual simulation),从而量化在假设阵容下球员表现的变化,实验表明该方法能有效捕捉球员对进攻推进和进球概率的特定影响,超越传统静态指标。
链接: https://arxiv.org/abs/2603.15212
作者: Miru Hong,Minho Lee,Geonhee Jo,Hyeokje Jo,Pascal Bauer,Sang-Ki Ko
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 18 pages, 3 figures, 9 tables. Submitted to 2026 ECML-PKDD Applied Data Science Track
Abstract:Evaluating football player transfers is challenging because player actions depend strongly on tactical systems, teammates, and match context. Despite this complexity, recruitment decisions often rely on static statistics and subjective expert judgment, which do not fully account for these contextual factors. This limitation stems largely from the absence of counterfactual simulation mechanisms capable of predicting outcomes in hypothetical scenarios. To address these challenges, we propose ScoutGPT, a generative model that treats football match events as sequential tokens within a language modeling framework. Utilizing a NanoGPT-based Transformer architecture trained on next-token prediction, ScoutGPT learns the dynamics of match event sequences to simulate event sequences under hypothetical lineups, demonstrating superior predictive performance compared to existing baseline models. Leveraging this capability, the model employs Monte Carlo sampling to enable counterfactual simulation, allowing for the assessment of unobserved scenarios. Experiments on K League data show that simulated player transfers lead to measurable changes in offensive progression and goal probabilities, indicating that ScoutGPT captures player-specific impact beyond traditional static metrics.
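摘要所述的蒙特卡洛反事实模拟可示意如下:从"下一事件"条件分布反复采样整段事件序列,以频率估计进球概率。`toy_model` 与其中的概率均为本文虚构的玩具分布;真实系统中该条件分布由训练好的 Transformer 给出:

```python
import random

def goal_probability(next_event, context, n_events=20, n_samples=2000, seed=7):
    """Monte Carlo counterfactual rollout: sample event sequences from a
    next-event model and estimate P(goal within n_events). `next_event`
    stands in for the trained Transformer's conditional distribution."""
    rng = random.Random(seed)
    goals = 0
    for _ in range(n_samples):
        seq = list(context)
        for _ in range(n_events):
            seq.append(next_event(seq, rng))
            if seq[-1] == "goal":
                goals += 1
                break
    return goals / n_samples

# Toy model: a shot is converted 30% of the time; otherwise play
# alternates between passes and shots (probabilities are made up).
def toy_model(seq, rng):
    if seq[-1] == "shot":
        return "goal" if rng.random() < 0.3 else "pass"
    return "shot" if rng.random() < 0.2 else "pass"

p = goal_probability(toy_model, ["pass"])
print(round(p, 3))
```

替换阵容的反事实评估,即在同一模型下改变条件(阵容)后比较两次这样的估计值。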
[AI-45] CATFormer: When Continual Learning Meets Spiking Transformers With Dynamic Thresholds AAAI2026
【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)在类增量学习(Class-Incremental Learning, CIL)场景下因任务累积导致性能显著下降的问题,尤其是克服灾难性遗忘(Catastrophic Forgetting)现象。其解决方案的关键在于提出了一种名为CATFormer(Context Adaptive Threshold Transformer)的可扩展框架,其中核心创新是引入动态阈值漏电积分发放(Dynamic Threshold Leaky Integrate-and-Fire, DTLIF)神经元模型,通过上下文自适应阈值机制实现知识保留,并结合门控动态头选择(Gated Dynamic Head Selection, G-DHS)机制实现任务无关推理,从而在不依赖经验回放的情况下有效维持长期学习能力。
链接: https://arxiv.org/abs/2603.15184
作者: Vaishnavi Nagabhushana,Kartikay Agrawal,Ayon Borthakur
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Image and Video Processing (eess.IV)
备注: Accepted for publication in the proceedings of the Neuro for AI AI for Neuro Workshop at AAAI 2026 (PMLR)
Abstract:Although deep neural networks perform extremely well in controlled environments, they fail in real-world scenarios where data isn’t available all at once, and the model must adapt to a new data distribution that may or may not follow the initial distribution. Previously acquired knowledge is lost during subsequent updates based on new data, a phenomenon commonly known as catastrophic forgetting. In contrast, the brain can learn without such catastrophic forgetting, irrespective of the number of tasks it encounters. Existing spiking neural networks (SNNs) for class-incremental learning (CIL) suffer a sharp performance drop as tasks accumulate. We here introduce CATFormer (Context Adaptive Threshold Transformer), a scalable framework that overcomes this limitation. We observe that the key to preventing forgetting in SNNs lies not only in synaptic plasticity but also in modulating neuronal excitability. At the core of CATFormer is the Dynamic Threshold Leaky Integrate-and-Fire (DTLIF) neuron model, which leverages context-adaptive thresholds as the primary mechanism for knowledge retention. This is paired with a Gated Dynamic Head Selection (G-DHS) mechanism for task-agnostic inference. Extensive evaluation on both static (CIFAR-10/100/Tiny-ImageNet) and neuromorphic (CIFAR10-DVS/SHD) datasets reveals that CATFormer outperforms existing rehearsal-free CIL algorithms across various task splits, establishing it as an ideal architecture for energy-efficient, true-class incremental learning.
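摘要未给出 DTLIF 的具体更新式,下面是一个通用的"自适应阈值 LIF"草图以说明其机制:阈值在放电后抬升、随后向基线衰减,从而随上下文调制神经元兴奋性。更新规则与参数均为本文假设,并非 CATFormer 的原始公式:

```python
def dtlif_step(v, theta, x, tau_v=2.0, tau_theta=10.0,
               theta_base=1.0, theta_jump=0.5):
    """One step of a leaky integrate-and-fire neuron with an adaptive
    firing threshold: the threshold jumps after each spike and decays
    back to a baseline, modulating excitability (hypothetical rule)."""
    v = v + (x - v) / tau_v                 # leaky integration of input x
    spike = 1.0 if v >= theta else 0.0
    v = v * (1.0 - spike)                   # hard reset on spike
    theta = theta + (theta_base - theta) / tau_theta + theta_jump * spike
    return v, theta, spike

v, theta = 0.0, 1.0
spikes = []
for _ in range(6):                          # constant drive of 1.5
    v, theta, s = dtlif_step(v, theta, 1.5)
    spikes.append(s)
print(spikes)
```

注意第一次放电后阈值升至 1.5,使后续放电被推迟,这正是"通过兴奋性而非仅靠突触可塑性来保留知识"的直观来源。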
[AI-46] Iterative Learning Control-Informed Reinforcement Learning for Batch Process Control
【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)在工业过程控制中因动作生成的随机不确定性导致的安全风险问题,以及缺乏形式化稳定性与收敛性保证而限制其实际应用的问题。解决方案的关键在于提出一种受迭代学习控制(Iterative Learning Control, ILC)启发的强化学习框架(IL-CIRL),该框架采用双层控制架构——批间(batch-to-batch)和批内(within-batch)控制,并在迭代学习结构中引入基于卡尔曼滤波(Kalman filter)的状态估计机制,以引导DRL智能体学习满足操作约束且具备稳定性的控制策略,从而实现对多扰动条件下批量过程的系统性DRL控制器设计。
链接: https://arxiv.org/abs/2603.15180
作者: Runze Lin,Ziqi Zhuo,Junghui Chen,Lei Xie,Hongye Su
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注:
Abstract:A significant limitation of Deep Reinforcement Learning (DRL) is the stochastic uncertainty in actions generated during exploration-exploitation, which poses substantial safety risks during both training and deployment. In industrial process control, the lack of formal stability and convergence guarantees further inhibits adoption of DRL methods by practitioners. Conversely, Iterative Learning Control (ILC) represents a well-established autonomous control methodology for repetitive systems, particularly in batch process optimization. ILC achieves desired control performance through iterative refinement of control laws, either between consecutive batches or within individual batches, to compensate for both repetitive and non-repetitive disturbances. This study introduces an Iterative Learning Control-Informed Reinforcement Learning (IL-CIRL) framework for training DRL controllers in dual-layer batch-to-batch and within-batch control architectures for batch processes. The proposed method incorporates Kalman filter-based state estimation within the iterative learning structure to guide DRL agents toward control policies that satisfy operational constraints and ensure stability guarantees. This approach enables the systematic design of DRL controllers for batch processes operating under multiple disturbance conditions.
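批间迭代学习结构中嵌入的卡尔曼滤波状态估计可以用标量情形示意如下(一维、线性,所有参数均为演示用假设,并非 IL-CIRL 的原始公式):

```python
def kalman_update(x, P, z, A=1.0, Q=0.01, H=1.0, R=0.1):
    """One scalar Kalman predict/update cycle: the kind of state estimate
    that can feed an iterative learning control law between batches."""
    # Predict step: propagate the state estimate and its variance
    x_pred = A * x
    P_pred = A * P * A + Q
    # Update step: blend prediction with measurement z via the Kalman gain
    K = P_pred * H / (H * P_pred * H + R)
    x_new = x_pred + K * (z - H * x_pred)
    P_new = (1.0 - K * H) * P_pred
    return x_new, P_new

x, P = 0.0, 1.0
for z in [1.2, 0.9, 1.1, 1.0]:              # noisy measurements of a constant
    x, P = kalman_update(x, P, z)
print(round(x, 3), round(P, 4))
```

随着批次累积,估计方差 P 单调收缩,估计值向真值靠拢;IL-CIRL 将这类估计用于引导 DRL 策略满足约束。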
[AI-47] Safe Flow Q-Learning: Offline Safe Reinforcement Learning with Reachability-Based Flow Policies
【速读】:该论文旨在解决离线安全强化学习(Offline Safe Reinforcement Learning, Offline Safe RL)中如何在静态数据集上学习满足严格安全约束的最优策略问题,尤其针对实时控制场景下现有方法因依赖软期望成本目标或迭代生成推理而导致的安全性不足与推理延迟高的缺陷。解决方案的关键在于提出Safe Flow Q-Learning (SafeFQL),其核心创新包括:结合哈密顿-雅可比可达性(Hamilton–Jacobi reachability)启发的安全价值函数与高效的一步流策略(one-step flow policy),通过自洽贝尔曼递归(self-consistency Bellman recursion)学习安全价值,利用行为克隆训练流策略并蒸馏为一步动作选择器,避免部署时拒绝采样;同时引入置信预测校准步骤(conformal prediction calibration),以应对有限数据带来的安全边界近似误差,提供有限样本概率保障的安全覆盖。
链接: https://arxiv.org/abs/2603.15136
作者: Mumuksh Tayal,Manan Tayal,Ravi Prakash
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24 pages, 6 figures, 4 tables
Abstract:Offline safe reinforcement learning (RL) seeks reward-maximizing policies from static datasets under strict safety constraints. Existing methods often rely on soft expected-cost objectives or iterative generative inference, which can be insufficient for safety-critical real-time control. We propose Safe Flow Q-Learning (SafeFQL), which extends FQL to safe offline RL by combining a Hamilton–Jacobi reachability-inspired safety value function with an efficient one-step flow policy. SafeFQL learns the safety value via a self-consistency Bellman recursion, trains a flow policy by behavioral cloning, and distills it into a one-step actor for reward-maximizing safe action selection without rejection sampling at deployment. To account for finite-data approximation error in the learned safety boundary, we add a conformal prediction calibration step that adjusts the safety threshold and provides finite-sample probabilistic safety coverage. Empirically, SafeFQL trades modestly higher offline training cost for substantially lower inference latency than diffusion-style safe generative baselines, which is advantageous for real-time safety-critical deployment. Across boat navigation and Safety Gymnasium MuJoCo tasks, SafeFQL matches or exceeds prior offline safe RL performance while substantially reducing constraint violations.
[AI-48] PrototypeNAS: Rapid Design of Deep Neural Networks for Microcontroller Units
【速读】:该论文旨在解决在不同硬件约束的边缘设备(如微控制器单元,MCU)上高效部署深度神经网络(DNN)时所面临的难题,即传统方法需为每种设备单独设计和训练模型,耗时且资源密集。解决方案的关键在于提出一种零样本神经架构搜索(Zero-shot NAS)方法——PrototypeNAS,其核心创新包括:(1)构建一个融合多类型架构结构优化、剪枝与量化配置优化的新型搜索空间;(2)引入基于集成零样本代理(zero-shot proxies)的多目标优化策略;(3)采用超体积子集选择(Hypervolume subset selection)从帕累托前沿中提取最具代表性的精度-计算量(FLOPs)权衡模型。该方法无需从头训练大量候选网络,可在数分钟内自动完成模型选择、压缩与定制化,从而显著提升边缘部署效率并保持接近大型模型的性能。
链接: https://arxiv.org/abs/2603.15106
作者: Mark Deutel,Simon Geis,Axel Plinge
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 6 figures, 4 tables
Abstract:Enabling efficient deep neural network (DNN) inference on edge devices with different hardware constraints is a challenging task that typically requires DNN architectures to be specialized for each device separately. To avoid the huge manual effort, one can use neural architecture search (NAS). However, many existing NAS methods are resource-intensive and time-consuming because they require the training of many different DNNs from scratch. Furthermore, they do not take the resource constraints of the target system into account. To address these shortcomings, we propose PrototypeNAS, a zero-shot NAS method to accelerate and automate the selection, compression, and specialization of DNNs to different target microcontroller units (MCUs). We propose a novel three-step search method that decouples DNN design and specialization from DNN training for a given target platform. First, we present a novel search space that, rather than merely cutting smaller DNNs out of a single large architecture, combines the structural optimization of multiple architecture types with the optimization of their pruning and quantization configurations. Second, we explore the use of an ensemble of zero-shot proxies during optimization instead of a single one. Third, we propose the use of Hypervolume subset selection to distill DNN architectures from the Pareto front of the multi-objective optimization that represent the most meaningful tradeoffs between accuracy and FLOPs. We evaluate the effectiveness of PrototypeNAS on 12 different datasets in three different tasks: image classification, time series classification, and object detection. Our results demonstrate that PrototypeNAS is able to identify DNN models within minutes that are small enough to be deployed on off-the-shelf MCUs and still achieve accuracies comparable to the performance of large DNN models.
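其中第三步的"超体积子集选择"可用一个二维(精度最大化、FLOPs 最小化)的贪心示意来说明(数据点与参考点均为虚构,仅演示该选择准则):

```python
def hv(points, ref=(0.0, 1.0)):
    """二维超体积: 精度(越大越好) x FLOPs(越小越好), 相对参考点 ref"""
    area, a_prev = 0.0, ref[0]
    for a, f in sorted(points, key=lambda p: p[1]):   # 按 FLOPs 升序扫描
        area += max(0.0, a - a_prev) * (ref[1] - f)
        a_prev = max(a_prev, a)
    return area

def greedy_subset(points, k):
    """贪心选出使超体积最大的 k 个帕累托点"""
    chosen = []
    for _ in range(k):
        best = max((p for p in points if p not in chosen),
                   key=lambda p: hv(chosen + [p]))
        chosen.append(best)
    return chosen

# (精度, 归一化 FLOPs) 的虚构帕累托前沿
front = [(0.60, 0.1), (0.80, 0.3), (0.90, 0.6), (0.92, 0.9)]
picked = greedy_subset(front, 2)
print(picked)   # 先选 (0.80, 0.3), 再补充低 FLOPs 的 (0.60, 0.1)
```

超体积度量天然偏好"分布均匀且代表性强"的折中点,这正是论文用它从帕累托前沿蒸馏候选架构的原因。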
[AI-49] Open Biomedical Knowledge Graphs at Scale: Construction Federation and AI Agent Access with Samyama Graph Database
【速读】:该论文旨在解决生物医学知识在多个孤立数据库(如Reactome、STRING、Gene Ontology等)中分散存储导致的研究效率低下问题,这些问题包括数据获取繁琐、交叉引用易出错、缺乏可重复性。其解决方案的关键在于构建两个开放源代码的生物医学知识图谱(Pathways KG 和 Clinical Trials KG),并基于高性能Rust编写的Samyama图数据库实现高效集成与联邦查询:首先通过标准化ETL流程实现多源数据的去重、批量Cypher加载和可移植快照导出;其次利用跨图谱联邦机制,在单一图租户中实现属性级关联查询(例如“哪些通路被处于乳腺癌III期临床试验的药物所干扰”);最后引入基于模式驱动的MCP服务器生成技术,使知识图谱自动暴露类型化工具接口,支持大语言模型(LLM)代理通过自然语言直接访问图查询,无需人工编写工具函数。整体方案实现了从异构数据整合到智能交互的一体化闭环。
链接: https://arxiv.org/abs/2603.15080
作者: Madhulatha Mandarapu,Sandeep Kunkunuru
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: 10 pages, 7 tables, open-source code and data
Abstract:Biomedical knowledge is fragmented across siloed databases – Reactome for pathways, STRING for protein interactions, Gene Ontology for functional annotations, this http URL for study registries, and dozens more. Researchers routinely download flat files from each source and write bespoke scripts to cross-reference them, a process that is slow, error-prone, and not reproducible. We present two open-source biomedical knowledge graphs – Pathways KG (118,686 nodes, 834,785 edges from 5 sources) and Clinical Trials KG (7,774,446 nodes, 26,973,997 edges from 5 sources) – built on Samyama, a high-performance graph database written in Rust. Our contributions are threefold. First, we describe a reproducible ETL pattern for constructing large-scale KGs from heterogeneous public data sources, with cross-source deduplication, batch Cypher loading, and portable snapshot export. Second, we demonstrate cross-KG federation: loading both snapshots into a single graph tenant enables property-based joins across datasets, answering questions like "Which biological pathways are disrupted by drugs currently in Phase 3 trials for breast cancer?" – a query that neither KG can answer alone. Third, we introduce schema-driven MCP server generation: each KG automatically exposes typed tools for LLM agents via the Model Context Protocol, enabling natural-language access to graph queries without manual tool authoring. All data sources are open-license (CC BY 4.0, CC0, OBO). Snapshots, ETL code, and MCP configurations are publicly available. The combined federated graph (7.89M nodes, 27.8M edges) loads in 76 seconds on commodity hardware (Mac Mini M4, 16GB RAM), and the signature cross-KG query – "which pathways are disrupted by drugs in Phase 3 breast cancer trials?" – returns validated results in 2.1 seconds.
[AI-50] Interference-Aware K-Step Reachable Communication in Multi-Agent Reinforcement Learning
【速读】:该论文旨在解决多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)中因通信带宽有限和环境拓扑动态复杂而导致的高价值通信伙伴识别难题。在缺乏先验知识的情况下,智能体需在不确定性中选择协作伙伴以获取任务关键信息。解决方案的关键在于提出一种干扰感知的K步可达通信框架(Interference-Aware K-Step Reachable Communication, IA-KRC),其核心包含两个组件:一是基于物理可达性的K步可达协议,限制消息传递仅在可触及邻居间进行;二是干扰预测模块,通过最小化干扰并最大化效用来优化通信伙伴选择,从而提升合作的持续性与效率。
链接: https://arxiv.org/abs/2603.15054
作者: Ziyu Cheng,Jinsheng Ren,Zhouxian Jiang,Chenzhihang Li,Rongye Shi,Bin Liang,Jun Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: multi-agent reinforcement learning, communication
Abstract:Effective communication is pivotal for addressing complex collaborative tasks in multi-agent reinforcement learning (MARL). Yet, limited communication bandwidth and dynamic, intricate environmental topologies present significant challenges in identifying high-value communication partners. Agents must consequently select collaborators under uncertainty, lacking a priori knowledge of which partners can deliver task-critical information. To this end, we propose Interference-Aware K-Step Reachable Communication (IA-KRC), a novel framework that enhances cooperation via two core components: (1) a K-Step reachability protocol that confines message passing to physically accessible neighbors, and (2) an interference-prediction module that optimizes partner choice by minimizing interference while maximizing utility. Compared to existing methods, IA-KRC enables substantially more persistent and efficient cooperation despite environmental interference. Comprehensive evaluations confirm that IA-KRC achieves superior performance compared to state-of-the-art baselines, while demonstrating enhanced robustness and scalability in complex topological and highly dynamic multi-agent scenarios.
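其中"K 步可达协议"的核心是在当前交互图上用广度优先搜索限定 K 跳以内的邻居作为候选通信对象,可示意如下(邻接关系为虚构,仅演示该可达性约束):

```python
from collections import deque

# 示意: 仅当智能体 j 在交互图上距离 i 不超过 K 跳时,
# 才允许其成为 i 的候选通信伙伴(邻接表可由感知范围等构建)
def k_step_reachable(adj, src, K):
    """BFS 返回从 src 出发 K 跳以内可达的节点集合(不含自身)"""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == K:
            continue
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return seen - {src}

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(k_step_reachable(adj, 0, 2))   # {1, 2}: 节点 3 距离 3 跳, 被排除
```

论文在此基础上再叠加干扰预测模块对候选集合打分;上面只展示可达性筛选这一步。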
[AI-51] AnoleVLA: Lightweight Vision-Language-Action Model with Deep State Space Models for Mobile Manipulation
【速读】:该论文旨在解决语言引导的机器人操作问题(language-guided robotic manipulation),即机器人需根据视觉观测和自然语言指令对多种物体进行操作,这要求系统具备安全性、效率及任务层面的泛化能力。现有视觉-语言-动作模型(Vision-Language-Action models, VLAs)虽性能优异,但在资源受限环境中的部署仍面临计算成本高的挑战,主要源于标准Transformer骨干网络的高复杂度。解决方案的关键在于提出AnoleVLA——一种轻量级VLAs架构,其核心创新是采用深度状态空间模型(deep state space model)高效处理多模态序列,从而在保持高任务成功率的同时显著提升推理速度:实验证明,AnoleVLA在真实场景中相较代表性大模型任务成功率提升21个百分点,且推理速度约为其三倍。
链接: https://arxiv.org/abs/2603.15046
作者: Yusuke Takagi,Motonari Kambara,Daichi Yashima,Koki Seno,Kento Tokura,Komei Sugiura
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:In this study, we address the problem of language-guided robotic manipulation, where a robot is required to manipulate a wide range of objects based on visual observations and natural language instructions. This task is essential for service robots that operate in human environments, and requires safety, efficiency, and task-level generality. Although Vision-Language-Action models (VLAs) have demonstrated strong performance for this task, their deployment in resource-constrained environments remains challenging because of the computational cost of standard transformer backbones. To overcome this limitation, we propose AnoleVLA, a lightweight VLA that uses a deep state space model to process multimodal sequences efficiently. The model leverages its lightweight and fast sequential state modeling to process visual and textual inputs, which allows the robot to generate trajectories efficiently. We evaluated the proposed method in both simulation and physical experiments. Notably, in real-world evaluations, AnoleVLA outperformed a representative large-scale VLA by 21 points for the task success rate while achieving an inference speed approximately three times faster.
[AI-52] Prompt Readiness Levels (PRL): a maturity scale and scoring framework for production grade prompt assets
【速读】:该论文旨在解决生成式 AI (Generative AI) 系统中提示词资产(prompt assets)缺乏统一、可审计的评估方法的问题,尤其在操作目标、安全约束和合规要求方面。解决方案的关键在于提出 Prompt Readiness Levels (PRL),一个受技术成熟度等级(TRL)启发的九级成熟度量表,以及 Prompt Readiness Score (PRS),一种具有门限阈值的多维评分机制,用以防止薄弱环节引发的失效模式,从而实现提示词资产在规范制定、测试、可追溯性、安全性评估和部署准备等方面的结构化治理与可复现的资格判定。
链接: https://arxiv.org/abs/2603.15044
作者: Sebastien Guinard(Univ. Grenoble Alpes, CEA, DRT F-38000 Grenoble)
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 7 pages, 1 figure
Abstract:Prompt engineering has become a production-critical component of generative AI systems. However, organizations still lack a shared, auditable method to qualify prompt assets against operational objectives, safety constraints, and compliance requirements. This paper introduces Prompt Readiness Levels (PRL), a nine-level maturity scale inspired by TRL, and the Prompt Readiness Score (PRS), a multidimensional scoring method with gating thresholds designed to prevent weak-link failure modes. PRL/PRS provide an original, structured and methodological framework for governing prompt assets (specification, testing, traceability, security evaluation, and deployment readiness), enabling valuation of prompt engineering through reproducible qualification decisions across teams and industries.
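摘要中"门限阈值防止短板失效"的含义,可用一个极简的门控评分函数示意。维度名称、阈值与权重均为本文虚构,并非论文给出的实际评分表:

```python
# 示意性的 PRS 门限评分: 任一维度低于门限即触发"短板"判定,
# 防止弱项被加权平均掩盖(维度与数值均为虚构)
def prs(scores, gates, weights):
    if any(scores[d] < gates[d] for d in gates):    # 门限检查: 硬性不通过
        return min(scores[d] for d in gates)        # 直接暴露最弱维度
    return sum(weights[d] * scores[d] for d in weights)

scores  = {"spec": 0.9, "testing": 0.8, "security": 0.3, "traceability": 0.7}
gates   = {d: 0.5 for d in scores}
weights = {d: 0.25 for d in scores}
score = prs(scores, gates, weights)
print(score)   # 0.3: security 未过门限, 总分被钳制在弱项水平
```

若改用普通加权平均,上例会得到 0.675 的"尚可"分数,恰好掩盖了安全维度的严重缺陷,这正是门控设计要避免的失效模式。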
[AI-53] VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂视觉任务中工具使用能力不足的问题,特别是模型在面对多样化工具组合与长程多步规划时表现不佳,且现有基准测试难以真实反映实际应用场景中的工具交互复杂性。其解决方案的关键在于提出VisualToolChain-Bench(VTC-Bench),一个基于OpenCV的32种多样化视觉操作构成的综合性评估基准,支持丰富工具组合与多步骤执行轨迹的精确评测;该框架包含680个结构化问题,覆盖九类认知层级,并提供真实执行路径作为评估依据,从而系统性揭示模型在工具泛化、多工具协同及复杂任务规划方面的局限性,为未来更通用的视觉智能体(visual agent)研究提供严谨的基准和方向指引。
链接: https://arxiv.org/abs/2603.15030
作者: Xuanyu Zhu,Yuhao Dong,Rundong Wang,Yang Shi,Zhipeng Wu,Yinlun Peng,YiFan Zhang,Yihang Lou,Yuanxing Zhang,Ziwei Liu,Yan Bai,Yuan Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advancements extend Multimodal Large Language Models (MLLMs) beyond standard visual question answering to utilizing external tools for advanced visual tasks. Despite this progress, precisely executing and effectively composing diverse tools for complex tasks remain a persistent bottleneck. Constrained by sparse tool-sets and simple tool-use trajectories, existing benchmarks fail to capture complex and diverse tool interactions, falling short in evaluating model performance under practical, real-world conditions. To bridge this gap, we introduce VisualToolChain-Bench (VTC-Bench), a comprehensive benchmark designed to evaluate tool-use proficiency in MLLMs. To align with realistic computer vision pipelines, our framework features 32 diverse OpenCV-based visual operations. This rich tool-set enables extensive combinations, allowing VTC-Bench to rigorously assess multi-tool composition and long-horizon, multi-step plan execution. For precise evaluation, we provide 680 curated problems structured across a nine-category cognitive hierarchy, each with ground-truth execution trajectories. Extensive experiments on 19 leading MLLMs reveal critical limitations in current models’ visual agentic capabilities. Specifically, models struggle to adapt to diverse tool-sets and generalize to unseen operations, with the leading model Gemini-3.0-Pro only achieving 51% on our benchmark. Furthermore, multi-tool composition remains a persistent challenge. When facing complex tasks, models struggle to formulate efficient execution plans, relying heavily on a narrow, suboptimal subset of familiar functions rather than selecting the optimal tools. By identifying these fundamental challenges, VTC-Bench establishes a rigorous baseline to guide the development of more generalized visual agentic models.
[AI-54] Describing Agentic AI Systems with C4: Lessons from Industry Projects
【速读】:该论文试图解决的问题是:随着智能体AI(Agentic AI)系统在工业场景中演变为长期运行的解决方案,传统依赖临时代码草图或流水线图的文档实践已无法有效捕捉其特有的架构特征,导致系统难以维护和持续演化。解决方案的关键在于提出一套面向智能体风格的文档体系结构,包括:(i) 一套以智能体、人工制品、工具及其协调模式为核心的概念建模词汇与视图;(ii) 基于C4模型的分层描述技术,用于在不同抽象层级上组织这些视图;(iii) 工业案例与经验教训,验证该方法能生成透明且可维护的架构文档,从而支持系统的可持续演进。
链接: https://arxiv.org/abs/2603.15021
作者: Andreas Rausch,Stefan Wittek
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Different domains foster different architectural styles – and thus different documentation practices (e.g., state-based models for behavioral control vs. ER-style models for information structures). Agentic AI systems exhibit another characteristic style: specialized agents collaborate by exchanging artifacts, invoking external tools, and coordinating via recurring interaction patterns and quality gates. As these systems evolve into long-lived industrial solutions, documentation must capture these style-defining concerns rather than relying on ad-hoc code sketches or pipeline drawings. This paper reports industrial experience from joint projects and derives a documentation systematics tailored to this style. Concretely, we provide (i) a style-oriented modeling vocabulary and a small set of views for agents, artifacts, tools, and their coordination patterns, (ii) a hierarchical description technique aligned with C4 to structure these views across abstraction levels, and (iii) industrial examples with lessons learned that demonstrate how the approach yields transparent, maintainable architecture documentation supporting sustained evolution.
[AI-55] Consequentialist Objectives and Catastrophe
【速读】:该论文旨在解决人工智能(AI)在复杂环境中因目标函数设定不准确而导致的灾难性后果问题,即“奖励黑客”(reward hacking)现象可能演变为严重风险。传统研究多关注于良性或可修正的奖励黑客案例,而本文则聚焦于当AI能力足够先进时,固定的目的导向行为(consequentialist objective)可能引发不可逆的灾难性结果。其解决方案的关键在于:通过合理约束AI的能力水平而非单纯优化目标函数,可以有效避免灾难性后果;更重要的是,适度的能力限制不仅能够保障安全,还能产生有价值的成果。这一结论适用于由现代工业级AI开发流程生成的任何目标函数。
链接: https://arxiv.org/abs/2603.15017
作者: Henrik Marklund,Alex Infanger,Benjamin Van Roy
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Because human preferences are too complex to codify, AIs operate with misspecified objectives. Optimizing such objectives often produces undesirable outcomes; this phenomenon is known as reward hacking. Such outcomes are not necessarily catastrophic. Indeed, most examples of reward hacking in previous literature are benign. And typically, objectives can be modified to resolve the issue. We study the prospect of catastrophic outcomes induced by AIs operating in complex environments. We argue that, when capabilities are sufficiently advanced, pursuing a fixed consequentialist objective tends to result in catastrophic outcomes. We formalize this by establishing conditions that provably lead to such outcomes. Under these conditions, simple or random behavior is safe. Catastrophic risk arises due to extraordinary competence rather than incompetence. With a fixed consequentialist objective, avoiding catastrophe requires constraining AI capabilities. In fact, constraining capabilities the right amount not only averts catastrophe but yields valuable outcomes. Our results apply to any objective produced by modern industrial AI development pipelines.
[AI-56] TrajFlow: Nation-wide Pseudo GPS Trajectory Generation with Flow Matching Models
【速读】:该论文旨在解决真实移动电话GPS轨迹数据在隐私保护、获取难度和成本方面的限制,以及现有基于扩散模型的轨迹生成方法在空间尺度、交通方式多样性及生成效率上的不足。其解决方案的关键在于提出TrajFlow——首个基于流匹配(flow-matching)的GPS轨迹生成模型,通过引入轨迹谐调与重建策略,在多地理尺度下实现高保真度、高多样性和高效率的轨迹生成,从而支持跨区域城市规划、交通管理和灾害响应等应用。
链接: https://arxiv.org/abs/2603.15009
作者: Peiran Li,Jiawei Wang,Haoran Zhang,Xiaodan Shi,Noboru Koshizuka,Chihiro Shimizu,Renhe Jiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:The importance of mobile phone GPS trajectory data is widely recognized across many fields, yet the use of real data is often hindered by privacy concerns, limited accessibility, and high acquisition costs. As a result, generating pseudo-GPS trajectory data has become an active area of research. Recent diffusion-based approaches have achieved strong fidelity but remain limited in spatial scale (small urban areas), transportation-mode diversity, and efficiency (requiring numerous sampling steps). To address these challenges, we introduce TrajFlow, which to the best of our knowledge is the first flow-matching-based generative model for GPS trajectory generation. TrajFlow leverages the flow-matching paradigm to improve robustness and efficiency across multiple geospatial scales, and incorporates a trajectory harmonization and reconstruction strategy to jointly address scalability, diversity, and efficiency. Using a nationwide mobile phone GPS dataset with millions of trajectories across Japan, we show that TrajFlow or its variants consistently outperform diffusion-based and deep generative baselines at urban, metropolitan, and nationwide levels. As the first nationwide, multi-scale GPS trajectory generation model, TrajFlow demonstrates strong potential to support inter-region urban planning, traffic management, and disaster response, thereby advancing the resilience and intelligence of future mobility systems.
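上述流匹配(flow-matching)范式的训练目标可用一个极简玩具回归示意:在线性插值路径上回归条件速度场 x1 − x0。以下模型与数据均为虚构的一维简化版本,非论文的 TrajFlow 实现:

```python
import numpy as np

# 条件流匹配目标(线性插值路径)的极简示意:
# 给定数据样本 x1 与噪声 x0, 在 x_t = (1-t)x0 + t*x1 处
# 训练速度场 v_theta 逼近目标 x1 - x0。此处 v_theta 退化为常向量 W。
rng = np.random.default_rng(0)
d = 2
W = np.zeros(d)                           # 极简"模型": 常值速度场

for step in range(500):
    x1 = rng.normal(loc=3.0, size=d)      # 玩具"轨迹"样本
    x0 = rng.normal(size=d)               # 噪声样本
    t = rng.uniform()
    xt = (1 - t) * x0 + t * x1            # 插值点(真实模型以其为输入)
    target = x1 - x0                      # 条件速度目标
    grad = 2 * (W - target)               # d/dW ||W - target||^2
    W -= 0.05 * grad                      # 随机梯度下降

print(W)   # 收敛到 E[x1 - x0] = (3, 3) 附近
```

真实的流匹配模型会以 (xt, t) 为输入预测速度场,采样时再对该场做一步或多步积分生成轨迹;此处仅演示回归目标本身。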
[AI-57] How Log-Barrier Helps Exploration in Policy Optimization
【速读】:该论文旨在解决Stochastic Gradient Bandit (SGB)算法在收敛性分析中依赖不切实际假设的问题,即要求最优动作的概率始终远离零。这种假设在实践中难以满足,导致SGB的理论保证受限。解决方案的关键在于引入对SGB目标函数的对数障碍(log-barrier)正则化,通过结构化地强制最小探索量来弥补原算法缺乏显式探索机制的缺陷。该正则化不仅保持了与SGB相当的样本复杂度,还使得算法在无需任何关于学习过程的假设下仍能收敛(尽管速率较慢),同时揭示了其与自然策略梯度(Natural Policy Gradient)在利用策略空间几何结构(通过控制Fisher信息矩阵)方面的内在联系。
链接: https://arxiv.org/abs/2603.15001
作者: Leonardo Cesani,Matteo Papini,Marcello Restelli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently, it has been shown that the Stochastic Gradient Bandit (SGB) algorithm converges to a globally optimal policy with a constant learning rate. However, these guarantees rely on unrealistic assumptions about the learning process, namely that the probability of the optimal action is always bounded away from zero. We attribute this to the lack of an explicit exploration mechanism in SGB. To address these limitations, we propose to regularize the SGB objective with a log-barrier on the parametric policy, structurally enforcing a minimal amount of exploration. We prove that Log-Barrier Stochastic Gradient Bandit (LB-SGB) matches the sample complexity of SGB, but also converges (at a slower rate) without any assumptions on the learning process. We also show a connection between the log-barrier regularization and Natural Policy Gradient, as both exploit the geometry of the policy space by controlling the Fisher information. We validate our theoretical findings through numerical simulations, showing the benefits of the log-barrier regularization.
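摘要中的对数障碍正则化目标 J(θ) = Σ_a π(a) r(a) + λ Σ_a log π(a),可用一个两臂、精确梯度的玩具实验验证其"强制最小探索量"的效果(λ、学习率等均为假设取值):

```python
import math

# 对数障碍正则化 softmax 赌博机目标的示意:
# 障碍项 λ * Σ_a log π(a) 将每个 π(a) 推离零, 结构性地保留探索
r = [1.0, 0.0]                  # 两臂的期望奖励
lam = 0.05                      # 障碍系数 λ
theta = [0.0, 0.0]

def pi(theta):
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [x / s for x in z]

for _ in range(2000):
    p = pi(theta)
    avg_r = sum(pa * ra for pa, ra in zip(p, r))
    # dJ/dθ_a = p_a*(r_a - avg_r) + λ*(1 - K*p_a), K 为臂数
    g = [p[a] * (r[a] - avg_r) + lam * (1 - len(r) * p[a]) for a in range(2)]
    theta = [t + 0.1 * ga for t, ga in zip(theta, g)]

p = pi(theta)
print(p)   # 最优臂占主导, 但劣臂概率仍被障碍项托底、远离零
```

不动点满足 p(1−p) = λ(2p−1),即劣臂概率约为 0.05 而非趋于零;正是这一性质使收敛分析无需"最优动作概率有界远离零"的额外假设。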
[AI-58] Exposing Cross-Modal Consistency for Fake News Detection in Short-Form Videos
【速读】:该论文旨在解决短视频平台中多模态虚假信息(multimodal misinformation)的检测问题,特别是针对文本、视觉与音频三者之间存在微妙不一致性的伪造内容。其核心挑战在于传统方法难以有效捕捉跨模态的一致性信号,导致对隐蔽性较强的虚假视频识别能力不足。解决方案的关键在于提出MAGIC3模型,通过显式建模多粒度的跨三模态一致性信号(包括成对和全局一致性),结合基于交叉注意力机制提取的token级与帧级一致性特征,并引入多风格大语言模型(LLM)重写以增强文本表示的鲁棒性,同时采用不确定性感知分类器实现对视觉语言模型(VLM)的智能路由,从而在保持与VLM相当准确率的同时,显著提升推理效率(18–27倍吞吐量提升)和资源利用率(93% VRAM节省)。
链接: https://arxiv.org/abs/2603.14992
作者: Chong Tian,Yu Wang,Chenxu Yang,Junyi Guan,Zheng Lin,Yuhan Liu,Xiuying Chen,Qirong Ho
机构: 未知
类目: Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: 16 pages, 7 figures, 11 tables
Abstract:Short-form video platforms are major channels for news but also fertile ground for multimodal misinformation where each modality appears plausible alone yet cross-modal relationships are subtly inconsistent, like mismatched visuals and captions. On two benchmark datasets, FakeSV (Chinese) and FakeTT (English), we observe a clear asymmetry: real videos exhibit high text-visual but moderate text-audio consistency, while fake videos show the opposite pattern. Moreover, a single global consistency score forms an interpretable axis along which fake probability and prediction errors vary smoothly. Motivated by these observations, we present MAGIC3 (Modal-Adversarial Gated Interaction and Consistency-Centric Classifier), a detector that explicitly models and exposes cross-tri-modal consistency signals at multiple granularities. MAGIC3 combines explicit pairwise and global consistency modeling with token- and frame-level consistency signals derived from cross-modal attention, incorporates multi-style LLM rewrites to obtain style-robust text representations, and employs an uncertainty-aware classifier for selective VLM routing. Using pre-extracted features, MAGIC3 consistently outperforms the strongest non-VLM baselines on FakeSV and FakeTT. While matching VLM-level accuracy, the two-stage system achieves 18-27x higher throughput and 93% VRAM savings, offering a strong cost-performance tradeoff.
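文中"真视频文本-视觉一致性高、文本-音频一致性中等"的信号,其最基本的成对一致性度量可用余弦相似度示意(嵌入向量为虚构,仅演示该度量的构造):

```python
import numpy as np

# 成对跨模态一致性信号的玩具示意:
# 对(虚构的)文本/视觉/音频嵌入计算两两余弦相似度, 再汇总为全局一致性
def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

text   = np.array([1.0, 0.2, 0.0])
visual = np.array([0.9, 0.3, 0.1])     # 与文本叙事一致的画面
audio  = np.array([-0.2, 1.0, 0.5])    # 与叙事偏离的音轨

pairs = {"tv": cos(text, visual), "ta": cos(text, audio),
         "va": cos(visual, audio)}
global_consistency = sum(pairs.values()) / len(pairs)
print(pairs, global_consistency)   # 文本-视觉一致性显著高于文本-音频
```

MAGIC3 进一步在 token 级与帧级用交叉注意力提取更细粒度的一致性特征,此处只展示最粗粒度的成对与全局分数。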
[AI-59] FairMed-XGB: A Bayesian-Optimised Multi-Metric Framework with Explainability for Demographic Equity in Critical Healthcare Data
【速读】:该论文旨在解决临床场景中机器学习模型存在的性别偏倚问题,此类偏倚会削弱临床信任并导致不平等的医疗决策。解决方案的关键在于提出一种名为FairMed-XGB的新框架,其核心是将公平性感知损失函数(融合统计独立性差异、Theil指数与Wasserstein距离)与贝叶斯优化相结合,嵌入XGBoost分类器中,从而在保持预测性能的同时系统性检测并缓解性别相关的预测偏差。
链接: https://arxiv.org/abs/2603.14947
作者: Mitul Goswami,Romit Chatterjee,Arif Ahmed Sekh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Machine learning models deployed in critical care settings exhibit demographic biases, particularly gender disparities, that undermine clinical trust and equitable treatment. This paper introduces FairMed-XGB, a novel framework that systematically detects and mitigates gender-based prediction bias while preserving model performance and transparency. The framework integrates a fairness-aware loss function combining Statistical Parity Difference, Theil Index, and Wasserstein Distance, jointly optimised via Bayesian Search into an XGBoost classifier. Post-mitigation evaluation on seven clinically distinct cohorts derived from the MIMIC-IV-ED and eICU databases demonstrates substantial bias reduction: Statistical Parity Difference decreases by 40 to 51 percent on MIMIC-IV-ED and 10 to 19 percent on eICU; Theil Index collapses by four to five orders of magnitude to near-zero values; Wasserstein Distance is reduced by 20 to 72 percent. These gains are achieved with negligible degradation in predictive accuracy (AUC-ROC drop 0.02). SHAP-based explainability reveals that the framework diminishes reliance on gender-proxy features, providing clinicians with actionable insights into how and where bias is corrected. FairMed-XGB offers a robust, interpretable, and ethically aligned solution for equitable clinical decision-making, paving the way for trustworthy deployment of AI in high-stakes healthcare environments.
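公平性损失中的统计独立性差异(Statistical Parity Difference, SPD)项定义为两组正预测率之差,可示意如下(预测与分组数据为虚构):

```python
# SPD = P(y_hat=1 | group A) - P(y_hat=1 | group B) 的最小示意
def spd(preds, groups, a="F", b="M"):
    def rate(g):
        members = [p for p, gr in zip(preds, groups) if gr == g]
        return sum(members) / len(members)   # 该组的正预测率
    return rate(a) - rate(b)

preds  = [1, 0, 1, 1, 0, 1, 0, 0]
groups = ["F", "F", "F", "F", "M", "M", "M", "M"]
gap = spd(preds, groups)
print(gap)   # 0.5: 正预测明显偏向 F 组
```

论文将该项与 Theil 指数、Wasserstein 距离共同写入损失并经贝叶斯搜索联合调权;上面只给出其中最直观的 SPD 一项。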
[AI-60] RS-WorldModel: a Unified Model for Remote Sensing Understanding and Future Sense Forecasting
【速读】:该论文旨在解决遥感世界模型中时空变化理解与文本引导的未来场景预测任务分离建模导致的跨任务迁移受限问题。现有方法通常独立处理这两项任务,难以共享潜在的时空先验知识,从而限制了模型性能和泛化能力。解决方案的关键在于提出一个统一的遥感世界模型——RS-WorldModel,其核心创新包括:(1) 通过地理感知生成预训练(Geo-Aware Generative Pre-training, GAGP)将地理信息和获取元数据融入预测过程;(2) 利用协同指令微调(Synergistic Instruction Tuning, SIT)实现理解与预测任务的联合训练,强化多任务协同;(3) 引入可验证强化优化(Verifiable Reinforcement Optimization, VRO),基于可验证的任务特定奖励对输出进行精细化调整。该架构在仅使用20亿参数的情况下,在多项时空变化问答指标上超越最大规模达120倍的开源模型,并在文本引导未来场景生成任务中取得FID 43.13的优异表现,显著优于所有开源基线及闭源模型Gemini-2.5-Flash Image (Nano Banana)。
链接: https://arxiv.org/abs/2603.14941
作者: Linrui Xu,Zhongan Wang,Fei Shen,Gang Xu,Huiping Zhuang,Ming Li,Haifeng Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Remote sensing world models aim to both explain observed changes and forecast plausible futures, two tasks that share spatiotemporal priors. Existing methods, however, typically address them separately, limiting cross-task transfer. We present RS-WorldModel, a unified world model for remote sensing that jointly handles spatiotemporal change understanding and text-guided future scene forecasting, and we build RSWBench-1.1M, a 1.1 million sample dataset with rich language annotations covering both tasks. RS-WorldModel is trained in three stages: (1) Geo-Aware Generative Pre-training (GAGP) conditions forecasting on geographic and acquisition metadata; (2) Synergistic Instruction Tuning (SIT) jointly trains understanding and forecasting; (3) Verifiable Reinforcement Optimization (VRO) refines outputs with verifiable, task-specific rewards. With only 2B parameters, RS-WorldModel surpasses open-source models up to 120× larger on most spatiotemporal change question-answering metrics. It achieves an FID of 43.13 on text-guided future scene forecasting, outperforming all open-source baselines as well as the closed-source Gemini-2.5-Flash Image (Nano Banana).
[AI-61] Directional Routing in Transformers
【速读】:该论文旨在解决大型语言模型中注意力机制冗余与计算效率低下的问题,即如何在不显著增加参数量的前提下提升模型的推理能力和可解释性。其解决方案的关键在于提出了一种轻量级的**定向路由(directional routing)**机制,该机制通过一个共享的路由器为每个Transformer注意力头分配受控的抑制方向,仅增加3.9%的参数成本。实验证明,这一机制成为模型的主要计算路径,移除它会导致事实召回率接近零且归纳推理准确率从93.4%骤降至0.0%,而单独剔除任意注意力头影响甚微,表明路由协调机制本身具有不可替代性,而其所协调的组件则具备冗余性。此外,模型还能自发形成早期层域自适应路由和晚期层固定句法剪枝的两阶段结构,凸显了该方案在提升模型性能(降低困惑度31–56%)与组织内部计算逻辑方面的有效性。
链接: https://arxiv.org/abs/2603.14923
作者: Kevin Taylor
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce directional routing, a lightweight mechanism that gives each transformer attention head learned suppression directions controlled by a shared router, at 3.9% parameter cost. We train a 433M-parameter model alongside an identical baseline in a single run, then trace the resulting circuits through mechanistic interpretability. Routing becomes the model’s dominant computational pathway. Disabling it collapses factual recall to near-zero probability across all 8 test prompts and drops induction accuracy from 93.4% to 0.0%. Knocking out individual attention heads has negligible effect: the primary mover head’s removal actually increases target probability, and induction heads retain 98.6% accuracy without their strongest member. The coordination mechanism is irreplaceable; the components it coordinates are not. The model also self-organizes, without explicit pressure, into two regimes: domain-adaptive routing in early layers and fixed syntactic pruning in late layers, where the least-varying layer is the most critical (+42.6 PPL when disabled). Routing reduces perplexity 31-56% relative to the baseline, though downstream multiple-choice benchmarks do not yet reflect these gains.
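"抑制方向"这一机制的几何含义,可用一个门控投影的小示例说明:从注意力头输出中按路由门控 g∈[0,1] 投影掉一个学到的方向。变量命名与形状均为本文虚构,并非论文代码:

```python
import numpy as np

# 门控方向抑制的示意: 路由器输出 gate 决定从 head_out 中
# 移除多少落在 direction 方向上的分量
def route(head_out, direction, gate):
    d = direction / np.linalg.norm(direction)        # 单位化抑制方向
    return head_out - gate * np.dot(head_out, d) * d # 按门控比例投影移除

h = np.array([2.0, 1.0])        # 某注意力头的(虚构)输出
d = np.array([1.0, 0.0])        # 该头学到的抑制方向
open_out = route(h, d, 1.0)     # 门全开: 完全移除该方向分量
shut_out = route(h, d, 0.0)     # 门关闭: 头输出保持不变
print(open_out, shut_out)       # [0. 1.] [2. 1.]
```

共享路由器为每头、每 token 输出这样的门控值,使"协调"本身成为一条独立于单个头的计算通路,这与文中消融实验的结论相呼应。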
[AI-62] Informative Perturbation Selection for Uncertainty-Aware Post-hoc Explanations
【速读】:该论文旨在解决由于广泛部署黑箱机器学习(Machine Learning, ML)模型所引发的信任与伦理问题,核心挑战在于如何提供可靠且可解释的模型说明。为应对这一问题,作者提出了一种后验(post-hoc)、模型无关(model-agnostic)的解释框架——Expected Active Gain for Local Explanations (EAGLE),其关键创新在于将扰动样本的选择建模为一个信息论驱动的主动学习(active learning)问题。通过自适应地选择能最大化预期信息增益的扰动样本,EAGLE 在无需访问原模型参数或训练过程的前提下,高效学习线性代理模型(surrogate model),同时输出特征重要性评分及不确定性估计。理论分析表明,累积信息增益随样本数 $ t $ 呈 $ \mathcal{O}(d \log t) $ 的增长趋势,样本复杂度与特征维度 $ d $ 线性相关、与置信参数 $ 1/\delta $ 对数相关;实验验证了该方法在表格数据和图像数据上的优越性,显著提升了解释的可复现性、局部邻域稳定性以及扰动样本质量。
链接: https://arxiv.org/abs/2603.14894
作者: Sumedha Chugh,Ranjitha Prasad,Nazreen Shah
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Trust and ethical concerns arising from the widespread deployment of opaque machine learning (ML) models motivate the need for reliable model explanations. Post-hoc model-agnostic explanation methods address this challenge by learning a surrogate model that approximates the behavior of the deployed black-box ML model in the locality of a sample of interest. In post-hoc scenarios, neither the underlying model parameters nor the training are available, and hence, this local neighborhood must be constructed by generating perturbed inputs in the neighborhood of the sample of interest, and its corresponding model predictions. We propose Expected Active Gain for Local Explanations (EAGLE), a post-hoc model-agnostic explanation framework that formulates perturbation selection as an information-theoretic active learning problem. By adaptively sampling perturbations that maximize the expected information gain, EAGLE efficiently learns a linear surrogate explainable model while producing feature importance scores along with the uncertainty/confidence estimates. Theoretically, we establish that cumulative information gain scales as $ \mathcal{O}(d \log t) $, where d is the feature dimension and t represents the number of samples, and that the sample complexity grows linearly with d and logarithmically with the confidence parameter $ 1/\delta $. Empirical results on tabular and image datasets corroborate our theoretical findings and demonstrate that EAGLE improves explanation reproducibility across runs, achieves higher neighborhood stability, and improves perturbation sample quality as compared to state-of-the-art baselines such as Tilia, US-LIME, GLIME and BayesLIME.
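"最大化期望信息增益"的扰动选择,可用贝叶斯线性代理模型示意:若权重后验协方差为 S,查询扰动 x 的信息增益为 0.5*log(1 + x^T S x / sigma^2),于是贪心选取预测方差最大的候选(先验、噪声与候选集合均为虚构):

```python
import numpy as np

# 信息增益驱动的扰动选择示意: 贪心挑选使后验熵下降最多的候选,
# 并按贝叶斯线性回归规则更新协方差(与观测值 y 无关)
sigma2 = 0.1
S = np.eye(2)                       # 代理模型权重的先验协方差
cands = [np.array([1.0, 0.0]), np.array([0.7, 0.7]), np.array([0.0, 0.2])]

def gain(x, S):
    return 0.5 * np.log(1.0 + x @ S @ x / sigma2)

picks = []
for _ in range(2):
    x = max(cands, key=lambda c: gain(c, S))
    picks.append(x)
    S = S - np.outer(S @ x, S @ x) / (sigma2 + x @ S @ x)  # 后验更新
print(picks)   # 先选 [1,0]; 该方向被解析后, 改选信息量更大的 [0.7,0.7]
```

第二轮不再重复选 [1,0],因为该方向的不确定性已被首次查询压低,体现了主动学习"查询最能减少不确定性的样本"的核心逻辑。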
[AI-63] Seismic full-waveform inversion based on a physics-driven generative adversarial network
【速读】:该论文旨在解决全波形反演(Full-waveform Inversion, FWI)在复杂地质条件下对初始模型依赖性强、数据稀疏或含噪时结果不稳定的问题。其解决方案的关键在于提出一种基于物理驱动的生成对抗网络(Physics-driven Generative Adversarial Network-based FWI)方法,通过将深度神经网络的数据驱动能力与地震波方程的物理约束相结合,并引入判别器(discriminator)进行对抗训练,从而提升反演结果的稳定性与鲁棒性。
链接: https://arxiv.org/abs/2603.14879
作者: Xinyi Zhang,Caiyun Liu,Jie Xiong,Qingfeng Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Objectives: Full-waveform inversion (FWI) is a high-resolution geophysical imaging technique that reconstructs subsurface velocity models by iteratively minimizing the misfit between predicted and observed seismic data. However, under complex geological conditions, conventional FWI suffers from strong dependence on the initial model and tends to produce unstable results when the data are sparse or contaminated by noise. Methods: To address these limitations, this paper proposes a physics-driven generative adversarial network-based full-waveform inversion method. The proposed approach integrates the data-driven capability of deep neural networks with the physical constraints imposed by the seismic wave equation, and employs adversarial training through a discriminator to enhance the stability and robustness of the inversion results. Results: Experimental results on two representative benchmark geological models demonstrate that the proposed method can effectively recover complex velocity structures and achieves superior performance in terms of structural similarity (SSIM) and signal-to-noise ratio (SNR). Conclusions: This method provides a promising solution for alleviating the initial-model dependence in full-waveform inversion and shows strong potential for practical applications.
[AI-64] A Hybrid AI and Rule-Based Decision Support System for Disease Diagnosis and Management Using Labs
【速读】:该论文旨在解决临床决策过程中因信息过载和诊断不确定性导致的误诊问题,尤其在初级保健场景中,医生难以快速整合患者的实验室数据与医学知识以做出准确诊断。解决方案的关键在于构建一个融合规则驱动专家系统与数据驱动预测模型的新型临床决策支持系统(CDSS),其中规则库基于已验证的临床规则,涵盖59种可直接确诊的疾病并关联ICD-10编码;而“可能诊断”模块则采用多分类机器学习方法,覆盖37个ICD-10编码,并根据医生常用于确认诊断的实验室指标将其归类为11个组别,从而实现对患者潜在疾病的精准预测与解释性推理,辅助医生提高诊断准确性。
链接: https://arxiv.org/abs/2603.14876
作者: Muhammad Hammad Maqsood,Mubashir Sajid,Khubaib Ahmed,Muhammad Usamah Shahid,Muddassar Farooq
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This research paper outlines the development and implementation of a novel Clinical Decision Support System (CDSS) that integrates AI predictive modeling with medical knowledge bases. It utilizes the quantifiable information elements in lab results for inferring likely diagnoses a patient might have, and subsequently suggests investigations to confirm those diagnoses, serving as an assistive tool for physicians. The system fuses knowledge contained in a rule-based expert system with inferences of data-driven predictors based on the features in labs. The data for 593,055 patients was collected from 547 primary care centers across the US to model our decision support system and derive Real-World Evidence (RWE) to make it relevant for a large demographic of patients. Our Rule-Base comprises clinically validated rules, modeling 59 health conditions that can directly confirm one or more diseases and assign ICD-10 codes to them. The Likely Diagnosis system uses multi-class classification, covering 37 ICD-10 codes, which are grouped together into 11 categories based on the labs that physicians prescribe to confirm the diagnosis. This research offers a novel system that assists a physician by utilizing the medical profile of a patient and routine lab investigations to predict a group of likely diseases and then confirm them, coupled with providing explanations for inferences, thereby assisting physicians to reduce misdiagnosis of patients in clinical decision-making.
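规则库部分的思路可以用一个极小的规则匹配器来示意:每条规则由若干化验阈值条件组成,条件全部满足时直接确认对应 ICD-10 编码。注意,下面的规则与阈值仅为笔者的示意性占位(HbA1c ≥ 6.5% 是常用的糖尿病诊断参考线),并非论文中经临床验证的 59 条规则:

```python
def apply_rules(labs, rules):
    # labs: 化验名 -> 数值;rules: (ICD-10 编码, [(化验名, 判定函数), ...]) 列表
    confirmed = []
    for code, conditions in rules:
        # 所有条件对应的化验项都存在且满足阈值时,才直接确认该编码
        if all(name in labs and check(labs[name]) for name, check in conditions):
            confirmed.append(code)
    return confirmed

# 假设性示例规则:E11.9(2 型糖尿病)、D50.9(缺铁性贫血)
RULES = [
    ("E11.9", [("hba1c", lambda v: v >= 6.5)]),
    ("D50.9", [("hemoglobin", lambda v: v < 12.0),
               ("ferritin", lambda v: v < 30.0)]),
]
```

真实系统中,未被规则直接确认的病例再交给多分类预测模块给出“可能诊断”。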
[AI-65] IgPose: A Generative Data-Augmented Pipeline for Robust Immunoglobulin-Antigen Binding Prediction
【速读】:该论文旨在解决免疫球蛋白-抗原(Ig-Ag)结合姿态预测的难题,其核心挑战在于实验解析的复合物数据稀缺以及从头预测抗体结构的准确性有限。解决方案的关键在于提出一个可泛化的框架IgPose,其创新性地结合了生成式数据增强流程与多模态特征融合策略:首先构建了高保真合成 decoy 数据库SIDD以缓解数据不足问题;其次利用等变图神经网络(equivariant graph neural networks)、ESM-2嵌入和门控循环单元(gated recurrent units)协同捕获几何结构与进化信息;并通过界面聚焦的k-hop采样和生物引导池化机制提升对多样化结合界面的泛化能力。该框架包含两个子网络——IgPoseClassifier用于区分结合姿态,IgPoseScore用于估计DockQ评分,在内部测试集和CASP-16基准上均优于物理模型与深度学习基线方法,为高通量抗体发现提供精准的姿态筛选与排序工具。
链接: https://arxiv.org/abs/2603.14870
作者: Tien-Cuong Bui,Injae Chung,Wonjun Lee,Junsu Ko,Juyong Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 4 figures, Bioinformatics
Abstract:Predicting immunoglobulin-antigen (Ig-Ag) binding remains a significant challenge due to the paucity of experimentally-resolved complexes and the limited accuracy of de novo Ig structure prediction. We introduce IgPose, a generalizable framework for Ig-Ag pose identification and scoring, built on a generative data-augmentation pipeline. To mitigate data scarcity, we constructed the Structural Immunoglobulin Decoy Database (SIDD), a comprehensive repository of high-fidelity synthetic decoys. IgPose integrates equivariant graph neural networks, ESM-2 embeddings, and gated recurrent units to synergistically capture both geometric and evolutionary features. We implemented interface-focused k-hop sampling with biologically guided pooling to enhance generalization across diverse interfaces. The framework comprises two sub-networks–IgPoseClassifier for binding pose discrimination and IgPoseScore for DockQ score estimation–and achieves robust performance on curated internal test sets and the CASP-16 benchmark compared to physics and deep learning baselines. IgPose serves as a versatile computational tool for high-throughput antibody discovery pipelines by providing accurate pose filtering and ranking. IgPose is available on GitHub (this https URL).
[AI-66] A Self-Evolving Defect Detection Framework for Industrial Photovoltaic Systems
【速读】:该论文旨在解决光伏(Photovoltaic, PV)组件在实际运行环境中自动化缺陷检测的鲁棒性与长期可维护性问题,尤其针对模块几何形状多样、成像分辨率低、缺陷形态细微、缺陷类别分布长尾以及持续的数据分布偏移等挑战。传统深度学习检测流程难以适应这些动态变化,导致性能下降。解决方案的关键在于提出一种自演化光伏缺陷检测框架(Self-Evolving Photovoltaic Defect Detection, SEPDD),其核心创新是集成自动模型优化机制与持续自演化学习策略,使系统能够在长期部署过程中逐步适应数据分布变化和新出现的缺陷模式,从而提升检测准确性和工业适用性。
链接: https://arxiv.org/abs/2603.14869
作者: Haoyu He,Yu Duan,Wenzhen Liu,Hanyuan Hang,Qiantu Tuo,Xiaoke Yang,Rui Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures
Abstract:Reliable photovoltaic (PV) power generation requires timely detection of module defects that may reduce energy yield, accelerate degradation, and increase lifecycle operation and maintenance costs during field operation. Electroluminescence (EL) imaging has therefore been widely adopted for PV module inspection. However, automated defect detection in real operational environments remains challenging due to heterogeneous module geometries, low-resolution imaging conditions, subtle defect morphology, long-tailed defect distributions, and continual data shifts introduced by evolving inspection and labeling processes. These factors significantly limit the robustness and long-term maintainability of conventional deep-learning inspection pipelines. To address these challenges, this paper proposes SEPDD, a Self-Evolving Photovoltaic Defect Detection framework designed for evolving industrial PV inspection scenarios. SEPDD integrates automated model optimization with a continual self-evolving learning mechanism, enabling the inspection system to progressively adapt to distribution shifts and newly emerging defect patterns during long-term deployment. Experiments conducted on both a public PV defect benchmark and a private industrial EL dataset demonstrate the effectiveness of the proposed framework. Both datasets exhibit severe class imbalance and significant domain shift. SEPDD achieves a leading mAP50 of 91.4% on the public dataset and 49.5% on the private dataset. It surpasses the autonomous baseline by 14.8% and human experts by 4.7% on the public dataset, and by 4.9% and 2.5%, respectively, on the private dataset.
[AI-67] Architecture-Agnostic Feature Synergy for Universal Defense Against Heterogeneous Generative Threats
【速读】:该论文旨在解决生成式 AI(Generative AI)部署中内容安全与隐私保护面临的挑战,特别是现有防御机制因针对特定架构(如扩散模型或生成对抗网络 GAN)而形成的“防御孤岛”问题,导致对异构生成威胁的防御能力薄弱。解决方案的关键在于提出一种架构无关的目标特征协同框架(Architecture-Agnostic Targeted Feature Synergy, ATFS),其核心思想是利用不同生成模型在高阶特征空间中的表示一致性,通过引入目标引导图像将多模型防御重构为统一的特征空间对齐任务,从而实现无需复杂校正的内在梯度对齐,显著提升异构场景下的防御性能与泛化能力。
链接: https://arxiv.org/abs/2603.14860
作者: Bingxue Zhang,Yang Gao,Feida Zhu,Yanyan Shen,Yang Shi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 9 pages, 10 figures
Abstract:Generative AI deployment poses unprecedented challenges to content safety and privacy. However, existing defense mechanisms are often tailored to specific architectures (e.g., Diffusion Models or GANs), creating fragile “defense silos” that fail against heterogeneous generative threats. This paper identifies a fundamental optimization barrier in naive pixel-space ensemble strategies: due to divergent objective functions, pixel-level gradients from heterogeneous generators become statistically orthogonal, causing destructive interference. To overcome this, we observe that despite disparate low-level mechanisms, high-level feature representations of generated content exhibit alignment across architectures. Based on this, we propose the Architecture-Agnostic Targeted Feature Synergy (ATFS) framework. By introducing a target guidance image, ATFS reformulates multi-model defense as a unified feature space alignment task, enabling intrinsic gradient alignment without complex rectification. Extensive experiments show ATFS achieves SOTA protection in heterogeneous scenarios (e.g., Diffusion+GAN). It converges rapidly, reaching over 90% performance within 40 iterations, and maintains strong attack potency even under tight perturbation budgets. The framework seamlessly extends to unseen architectures (e.g., VQ-VAE) by switching the feature extractor, and demonstrates robust resistance to JPEG compression and scaling. Being computationally efficient and lightweight, ATFS offers a viable pathway to dismantle defense silos and enable universal generative security. Code and models are open-sourced for reproducibility.
[AI-68] PCodeTrans: Translate Decompiled Pseudocode to Compilable and Executable Equivalent
【速读】:该论文旨在解决传统反编译工具在追求人类可读性时,往往忽视代码的可重新编译性和运行时正确性的问题,尤其是在关键场景如软件现代化和漏洞修复中,恢复的代码必须不仅能够编译成功,还需精确复现原始二进制文件的行为。解决方案的关键在于提出PCodeTrans框架,其核心创新包括:1)通过提取最小但语义完整的上下文以保障可重编译性;2)设计一种就地可替换(in situ substitutable)引擎,将编译后的函数直接热插拔至未修改的二进制文件中,从而原生保留执行上下文与全局依赖关系;3)基于细粒度差异追踪生成运行时反馈,迭代引导大语言模型(LLM)修正语义偏差。此机制有效缓解了“语义幻觉”问题,显著提升了反编译结果的准确性和实用性。
链接: https://arxiv.org/abs/2603.14855
作者: Yuxin Cui,Zeyu Gao,Shuxian He,Siliang Qin,Chao Zhang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Decompilation is foundational to binary analysis, yet conventional tools prioritize human readability over strict recompilability and verifiable runtime correctness. While recent LLM-based approaches attempt to refine decompiled pseudocode, they typically either optimize solely for readability or rely on static analysis for evaluation. This makes them prone to “semantic hallucinations” that compromise accuracy and fail to resolve actual runtime failures. For critical tasks like software modernization and vulnerability remediation, recovered code must not only compile but replicate the original binary’s behavior. We present PCodeTrans, a feedback-driven framework that bridges the gap between decompilation, recompilation, and rigorous function-level dynamic validation. After extracting a minimal yet coherent context to guarantee recompilability, PCodeTrans employs an in situ substitutable engine to hot-swap the compiled function directly into the unmodified binary, natively preserving its authentic execution context and global dependencies. Guided by fine-grained differential tracing, PCodeTrans generates precise runtime feedback to iteratively guide an LLM in repairing semantic discrepancies. Evaluated on Coreutils and Binutils, PCodeTrans achieves unprecedented recovery performance when rectifying raw Hex-Rays outputs, attaining 100% function-level compilability on unstripped binaries alongside 99.55% and 99.89% test-validated behavioral consistency, respectively. In doing so, it resolves 76.56% and 79.74% of logic errors exposed by official test suites. Exhibiting exceptional resilience, PCodeTrans maintains over 96% behavioral consistency even on fully stripped binaries. By significantly outperforming all existing baselines, PCodeTrans paves a practical path to reliably translate decompiled pseudocode into compilable and executable equivalents.
[AI-69] Integrating Weather Foundation Model and Satellite to Enable Fine-Grained Solar Irradiance Forecasting
【速读】:该论文旨在解决日尺度太阳能辐照度(solar irradiance)精准预测难题,该问题在将太阳能高效整合入电网过程中至关重要,但受昼夜周期和复杂云层动态影响而极具挑战性。现有方法或缺乏细粒度分辨率(如数值天气预报、气象基础模型),或在较长预测时效下性能下降(如卫星外推法)。其解决方案的关键在于提出Baguan-solar——一种两阶段多模态融合框架:第一阶段利用Baguan全球气象基础模型预测昼夜连续中间变量(如云量),第二阶段结合高分辨率静止卫星图像,通过模态融合机制同时保留卫星图像中的精细云结构与Baguan提供的大尺度约束,从而实现千米级空间分辨率的24小时辐照度预测。该设计显著提升了对云诱导瞬变过程的解析能力,并在东亚地区实证中优于多个强基线模型(如ECMWF IFS、原始Baguan和SolarSeer),且已在实际电力系统中部署应用。
链接: https://arxiv.org/abs/2603.14845
作者: Ziqing Ma,Kai Ying,Xinyue Gu,Tian Zhou,Tianyu Zhu,Haifan Zhang,Peisong Niu,Wang Zheng,Cong Bai,Liang Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate day-ahead solar irradiance forecasting is essential for integrating solar energy into the power grid. However, it remains challenging due to the pronounced diurnal cycle and inherently complex cloud dynamics. Current methods either lack fine-scale resolution (e.g., numerical weather prediction, weather foundation models) or degrade at longer lead times (e.g., satellite extrapolation). We propose Baguan-solar, a two-stage multimodal framework that fuses forecasts from Baguan, a global weather foundation model, with high-resolution geostationary satellite imagery to produce 24-hour irradiance forecasts at kilometer scale. Its decoupled two-stage design first forecasts day-night continuous intermediates (e.g., cloud cover) and then infers irradiance, while its modality fusion jointly preserves fine-scale cloud structures from satellite and large-scale constraints from Baguan forecasts. Evaluated over East Asia using CLDAS as ground truth, Baguan-solar outperforms strong baselines (including ECMWF IFS, vanilla Baguan, and SolarSeer), reducing RMSE by 16.08% and better resolving cloud-induced transients. An operational deployment of Baguan-solar has supported solar power forecasting in an eastern province in China since July 2025. Our code is accessible at this https URL.
[AI-70] Real-Time Driver Safety Scoring Through Inverse Crash Probability Modeling
【速读】:该论文旨在解决现有道路交通事故预测模型多为二分类输出、缺乏连续风险量化能力、可解释性不足以及未充分考虑弱势道路使用者(Vulnerable Road Users, VRUs)的问题。其解决方案的关键在于提出SafeDriver-IQ框架,通过融合美国国家公路交通安全管理局(NHTSA)的 crash records 与 Waymo Open Motion Dataset 的自然驾驶数据,工程化领域知识驱动的特征,并引入基于交通安全管理文献的校准层,将传统二分类碰撞预测模型转化为0-100的连续安全评分体系,从而实现对高风险驾驶情境的精准区分与可解释的风险预警,支持高级驾驶辅助系统(ADAS)、车队管理和城市基础设施规划中的实时风险防控。
链接: https://arxiv.org/abs/2603.14841
作者: Joyjit Roy,Samaresh Kumar Singh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET)
备注: 10 pages, 13 figures, and 14 tables. Submitted in EIT 2026 Conference hosted by The University of Wisconsin-La Crosse and sponsored by IEEE Region 4 (R4)
Abstract:Road crashes remain a leading cause of preventable fatalities. Existing prediction models predominantly produce binary outcomes, which offer limited actionable insights for real-time driver feedback. These approaches often lack continuous risk quantification, interpretability, and explicit consideration of vulnerable road users (VRUs), such as pedestrians and cyclists. This research introduces SafeDriver-IQ, a framework that transforms binary crash classifiers into continuous 0-100 safety scores by combining national crash statistics with naturalistic driving data from autonomous vehicles. The framework fuses National Highway Traffic Safety Administration (NHTSA) crash records with Waymo Open Motion Dataset scenarios, engineers domain-informed features, and incorporates a calibration layer grounded in transportation safety literature. Evaluation across 15 complementary analyses indicates that the framework reliably differentiates high-risk from low-risk driving conditions with strong discriminative performance. Findings further reveal that 87% of crashes involve multiple co-occurring risk factors, with non-linear compounding effects that increase the risk to 4.5x baseline. SafeDriver-IQ delivers proactive, explainable safety intelligence relevant to advanced driver-assistance systems (ADAS), fleet management, and urban infrastructure planning. This framework shifts the focus from reactive crash counting to real-time risk prevention.
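“逆碰撞概率建模”的基本想法是把校准后的碰撞概率映射为 0–100 的连续安全分。下面是一种可能的实现示意:用对数几率(log-odds)变换拉开低概率区间(绝大多数正常驾驶都落在这里),再线性归一化到 0–100。具体校准公式为笔者假设,并非论文给出的公式:

```python
import math

def safety_score(crash_prob, floor=1e-6, ceil=1.0 - 1e-6):
    # 将校准后的碰撞概率映射为 0-100 安全分:概率越低分数越高
    p = min(max(crash_prob, floor), ceil)       # 截断避免 logit 发散
    logit = math.log(p / (1.0 - p))             # 对数几率变换
    lo = math.log(floor / (1.0 - floor))        # logit 取值范围下界
    hi = math.log(ceil / (1.0 - ceil))          # logit 取值范围上界
    return 100.0 * (hi - logit) / (hi - lo)     # 线性归一化并反向(低风险 -> 高分)
```

该映射在 p=0.5 处恰好输出 50 分,且随碰撞概率单调递减,便于向驾驶员实时反馈。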
[AI-71] Ablate and Rescue: A Causal Analysis of Residual Stream Hyper-Connections
【速读】:该论文旨在解决多流Transformer架构中因残差连接导致的表征坍塌(representation collapse)和梯度消失问题,尤其聚焦于最近提出的流形约束超连接(Manifold-Constrained Hyper-Connections, mHC)架构内部机制不明确的问题。解决方案的关键在于提出首个开源的mHC语言模型,并构建一套基于表示层指标与因果干预的分析框架,通过系统性的流束消融与恢复(ablation-and-rescue)实验,实现对并行流在推理过程中信息编码与利用方式的直接因果比较,从而揭示信息分布模式,区分功能冗余与非对称利用,超越传统表征相似性分析的局限。
链接: https://arxiv.org/abs/2603.14833
作者: William Peng,Josheev Rai,Kevin Tseng,Siwei Wang,Sean Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-stream transformer architectures have recently been proposed as a promising direction for managing representation collapse and the vanishing gradient problem for residual connections, yet their internal mechanisms remain unexplored. In particular, the recently introduced Manifold-Constrained Hyper-Connections (mHC) architecture posits multiple residual streams with constrained interaction, but lacks in-depth mechanistic analysis. We present the first open-source mHC language model (this https URL) and analyze the multiple-stream architecture with a suite of representation-level metrics and causal interventions to probe how parallel streams encode and utilize information. Specifically, we introduce a systematic stream ablation-and-rescue framework that enables direct causal comparison of residual streams during inference. Through targeted pairwise interventions and controlled recovery experiments, we distinguish functional redundancy from asymmetric utilization and reveal how information is distributed across streams beyond what is observable from representational similarity alone.
[AI-72] Planning as Goal Recognition: Deriving Heuristics from Intention Models - Extended Version ICAPS2026
【速读】:该论文旨在解决经典规划(classical planning)中如何利用目标识别(goal recognition, GR)衍生的启发式方法来提升求解效率的问题。其核心挑战在于,传统规划方法依赖于静态启发式函数,而未能有效融合对“意图”的概率性建模。解决方案的关键在于提出一个评估目标意图的新框架,该框架能够生成一类可高效计算的启发式函数;作为概念验证,作者推导出两种此类启发式,并证明它们能显著改进当前顶级经典规划器的性能,从而为基于概率意图的启发式设计提供了理论基础与实践路径。
链接: https://arxiv.org/abs/2603.14824
作者: Giacomo Rosa,Jean Honorio,Nir Lipovetzky,Sebastian Sardina
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Extended version of our paper accepted at ICAPS 2026
Abstract:Classical planning aims to find a sequence of actions, a plan, that maps a starting state into one of the goal states. If a trajectory appears to be leading to the goal, should we prioritise exploring it? Seminal work in goal recognition (GR) has defined GR in terms of a classical planning problem, adopting classical solvers and heuristics to recognise plans. We come full circle, and study the adoption and properties of GR-derived heuristics for seeking solutions to classical planning problems. We propose a new framework for assessing goal intention, which informs a new class of efficiently-computable heuristics. As a proof of concept, we derive two such heuristics, and show that they can already yield improvements for top-scoring classical planners. Our work provides foundational knowledge for understanding and deriving probabilistic intention-based heuristics for planning.
[AI-73] SimCert: Probabilistic Certification for Behavioral Similarity in Deep Neural Network Compression
【速读】:该论文旨在解决在资源受限的嵌入式系统中部署深度神经网络(Deep Neural Networks, DNNs)时,模型压缩技术(如量化和剪枝)导致的行为保真度难以验证的问题,尤其在安全关键系统设计流程中,确保压缩后模型与原始模型行为相似性至关重要。解决方案的关键在于提出SimCert——一个概率认证框架,其核心创新包括:(1) 支持量化与剪枝的双网络符号传播方法,实现对压缩模型行为的精确建模;(2) 基于Bernstein不等式的方差感知边界技术,显著收紧安全证书;(3) 自动化的验证工具链,提升可扩展性和实用性。该框架提供可调节置信水平的定量安全保证,优于现有最先进方法。
链接: https://arxiv.org/abs/2603.14818
作者: Jingyang Li,Fu Song,Guoqiang Li
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 26 pages, 5 figures
Abstract:Deploying Deep Neural Networks (DNNs) on resource-constrained embedded systems requires aggressive model compression techniques like quantization and pruning. However, ensuring that the compressed model preserves the behavioral fidelity of the original design is a critical challenge in the safety-critical system design flow. Existing verification methods often lack scalability or fail to handle the architectural heterogeneity introduced by pruning. In this work, we propose SimCert, a probabilistic certification framework for verifying the behavioral similarity of compressed neural networks. Unlike worst-case analysis, SimCert provides quantitative safety guarantees with adjustable confidence levels. Our framework features: (1) A dual-network symbolic propagation method supporting both quantization and pruning; (2) A variance-aware bounding technique using Bernstein’s inequality to tighten safety certificates; and (3) An automated verification toolchain. Experimental results on ACAS Xu and computer vision benchmarks demonstrate that SimCert outperforms state-of-the-art baselines.
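SimCert 使用 Bernstein 不等式收紧概率证书。其思路可用如下骨架示意:对 n 个取值有界于 [0, b]、方差为 var 的独立行为差异样本,其均值与真值之差以概率 ≥ 1−δ 不超过 t = √(2·var·ln(2/δ)/n) + 2b·ln(2/δ)/(3n);方差越小,界越紧。证书的具体构造以论文为准,此处仅演示“方差感知”为何能收紧界:

```python
import math

def bernstein_deviation(n, var, b, delta):
    # Bernstein 式偏差界:t = sqrt(2*var*ln(2/delta)/n) + 2*b*ln(2/delta)/(3*n)
    log_term = math.log(2.0 / delta)
    return math.sqrt(2.0 * var * log_term / n) + 2.0 * b * log_term / (3.0 * n)

def certify_similarity(mean_dissim, n, var, b, eps, delta):
    # 若经验行为差异均值加上偏差界仍不超过 eps,则以 1-delta 置信度出具证书
    return mean_dissim + bernstein_deviation(n, var, b, delta) <= eps
```

注意偏差界随样本数增大而收缩、随经验方差减小而收紧,这正是 SimCert 相比最坏情形分析可给出更紧定量保证的原因。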
[AI-74] Multi-Task Genetic Algorithm with Multi-Granularity Encoding for Protein-Nucleotide Binding Site Prediction
【速读】:该论文旨在解决蛋白质-核酸结合位点识别中因特征表示不足和融合机制僵化而导致的性能瓶颈问题,从而阻碍了跨任务信息协同效应的有效利用。其解决方案的关键在于提出MTGA-MGE框架,通过多粒度编码(Multi-Granularity Encoding, MGE)网络融合多尺度卷积与自注意力机制以提取高维生物数据中的判别性信号,并引入基于遗传算法的多任务自适应融合策略,动态优化任务间的信息整合方式,同时设计外部邻域机制(External-Neighborhood Mechanism, ENM)利用生物学相似性促进跨任务的目标化信息交换,显著提升模型在高资源和低资源场景下的泛化能力与预测精度。
链接: https://arxiv.org/abs/2603.14797
作者: Yiming Gao,Liuyi Xu,Pengshan Cui,Yining Qian,An-Yang Lu,Xianpeng Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate identification of protein-nucleotide binding sites is fundamental to deciphering molecular mechanisms and accelerating drug discovery. However, current computational methods often struggle with suboptimal performance due to inadequate feature representation and rigid fusion mechanisms, which hinder the effective exploitation of cross-task information synergy. To bridge this gap, we propose MTGA-MGE, a framework that integrates a Multi-Task Genetic Algorithm with Multi-Granularity Encoding to enhance binding site prediction. Specifically, we develop a Multi-Granularity Encoding (MGE) network that synergizes multi-scale convolutions and self-attention mechanisms to distill discriminative signals from high-dimensional, redundant biological data. To overcome the constraints of static fusion, a genetic algorithm is employed to adaptively evolve task-specific fusion strategies, thereby effectively improving model generalization. Furthermore, to catalyze collaborative learning, we introduce an External-Neighborhood Mechanism (ENM) that leverages biological similarities to facilitate targeted information exchange across tasks. Extensive evaluations on fifteen nucleotide datasets demonstrate that MTGA-MGE not only establishes a new state-of-the-art in data-abundant, high-resource scenarios but also maintains a robust competitive edge in rare, low-resource regimes, presenting a highly adaptive scheme for decoding complex protein-ligand interactions in the post-genomic era.
[AI-75] LaPro-DTA: Latent Dual-View Drug Representations and Salient Protein Feature Extraction for Generalizable Drug–Target Affinity Prediction
【速读】:该论文旨在解决药物-靶标亲和力预测(Drug-Target Affinity, DTA)在现实冷启动场景下性能显著下降的问题,其核心挑战在于模型对训练样本的过拟合以及因无关靶标序列导致的信息丢失。解决方案的关键在于提出LaPro-DTA框架:首先设计潜空间双视图药物表示机制,通过实例级视图捕捉细粒度子结构并引入随机扰动,结合分布级视图通过语义重映射提炼通用化学骨架,从而迫使模型学习可迁移的结构规则而非记忆特定样本;其次引入基于模式感知的top-k池化策略提取显著蛋白质特征,有效过滤背景噪声并隔离高响应生物活性区域;最后利用跨视图多头注意力机制融合净化后的特征以建模全面的相互作用,显著提升了模型在未见药物场景下的鲁棒性与泛化能力。
链接: https://arxiv.org/abs/2603.14792
作者: Zihan Dun,Liuyi Xu,An-Yang Lu,Shuang Li,Yining Qian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Drug–target affinity prediction is pivotal for accelerating drug discovery, yet existing methods suffer from significant performance degradation in realistic cold-start scenarios (unseen drugs/targets/pairs), primarily driven by overfitting to training instances and information loss from irrelevant target sequences. In this paper, we propose LaPro-DTA, a framework designed to achieve robust and generalizable DTA prediction. To tackle overfitting, we devise a latent dual-view drug representation mechanism. It synergizes an instance-level view to capture fine-grained substructures with stochastic perturbation and a distribution-level view to distill generalized chemical scaffolds via semantic remapping, thereby enforcing the model to learn transferable structural rules rather than memorizing specific samples. To mitigate information loss, we introduce a salient protein feature extraction strategy using pattern-aware top-k pooling, which effectively filters background noise and isolates high-response bioactive regions. Furthermore, a cross-view multi-head attention mechanism fuses these purified features to model comprehensive interactions. Extensive experiments on benchmark datasets demonstrate that LaPro-DTA significantly outperforms state-of-the-art methods, achieving an 8% MSE reduction on the Davis dataset in the challenging unseen-drug setting, while offering interpretable insights into binding mechanisms.
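top-k 池化过滤背景噪声的机制可以用几行代码示意:只保留响应分数最高的 k 个残基位置,对其特征向量取平均,其余位置直接丢弃。打分本身在论文中是模式感知地学习得到的,此处的分数与特征均为假设输入:

```python
import heapq

def topk_pool(scores, features, k):
    # scores: 每个残基位置的响应分数;features: 对应的特征向量
    # 仅保留分数最高的 k 个位置并平均其特征,过滤背景位置
    idx = heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])
    dim = len(features[0])
    return [sum(features[i][d] for i in idx) / k for d in range(dim)]
```

相比对全序列做平均池化,这种做法使无关靶标区段不再稀释高响应结合区域的信号。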
[AI-76] p²RAG: Privacy-Preserving RAG Service Supporting Arbitrary Top-k Retrieval
【速读】:该论文旨在解决隐私保护型检索增强生成(Privacy-Preserving Retrieval-Augmented Generation, PP-RAG)系统在支持任意 top-k 检索时面临的效率与安全性挑战。现有方案通常依赖安全排序机制,但在 k 值较大时存在性能下降、无法动态调整 k 或引入新安全漏洞等问题,限制了其在现代长上下文大语言模型中的应用。解决方案的关键在于提出 p² RAG,它摒弃传统排序方法,转而采用交互式二分法(interactive bisection method)高效确定 top-k 文档集合;同时通过双服务器秘密共享(secret sharing)机制,在两个非共谋的半诚实服务器间实现数据隐私保护,并结合访问控制与验证机制抵御恶意用户攻击,从而在保证安全性的同时显著提升性能——实验表明其速度相较当前最优方案 PRAG 提升 3–300 倍(k=16–1024)。
链接: https://arxiv.org/abs/2603.14778
作者: Yulong Ming,Mingyue Wang,Jijia Yang,Cong Wang,Xiaohua Jia
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 12 pages, 3 figures
Abstract:Retrieval-Augmented Generation (RAG) enables large language models to use external knowledge, but outsourcing the RAG service raises privacy concerns for both data owners and users. Privacy-preserving RAG systems address these concerns by performing secure top-k retrieval, which typically relies on secure sorting to identify relevant documents. However, existing systems face challenges supporting arbitrary k due to their inability to change k, new security issues, or efficiency degradation with large k. This is a significant limitation because modern long-context models generally achieve higher accuracy with larger retrieval sets. We propose p²RAG, a privacy-preserving RAG service that supports arbitrary top-k retrieval. Unlike existing systems, p²RAG avoids sorting candidate documents. Instead, it uses an interactive bisection method to determine the set of top-k documents. For security, p²RAG uses secret sharing on two semi-honest non-colluding servers to protect the data owner's database and the user's prompt. It enforces restrictions and verification to defend against malicious users and tightly bounds the information leakage of the database. The experiments show that p²RAG is 3–300× faster than the state-of-the-art PRAG for k = 16–1024.
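交互式二分法为何能绕开排序?可以用明文模拟来说明:不对候选文档排序,而是对相似度阈值做二分,直到恰有 k 个分数不低于阈值。真实协议中“计数比较”这一步是在两台秘密共享服务器间安全完成的;下面的骨架假设分数互异,仅为示意:

```python
def topk_by_bisection(scores, k, max_iter=200):
    # 明文模拟:对阈值二分,直到恰有 k 个相似度分数 >= 阈值
    lo, hi = min(scores), max(scores)
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        count = sum(1 for s in scores if s >= mid)   # 真实系统中为安全计数
        if count == k:
            return sorted(i for i, s in enumerate(scores) if s >= mid)
        if count > k:
            lo = mid   # 阈值偏低,抬高
        else:
            hi = mid   # 阈值偏高,降低
    raise RuntimeError("bisection did not converge (requires distinct scores)")
```

每轮交互只泄露一个计数,而不泄露文档的全序关系,且迭代次数与 k 无关,这解释了它在大 k 下相对安全排序的效率优势。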
[AI-77] HO-SFL: Hybrid-Order Split Federated Learning with Backprop-Free Clients and Dimension-Free Aggregation
【速读】:该论文旨在解决在边缘设备上微调大模型时因标准框架(如联邦学习和分割学习)中内存密集型反向传播(Backpropagation, BP)导致的性能瓶颈问题。传统方法若用零阶优化替代BP虽可显著降低内存消耗,但通常会引发收敛速度严重下降的问题。解决方案的关键在于提出混合阶数分割联邦学习(Hybrid-Order Split Federated Learning, HO-SFL),其核心是通过拉格朗日框架重构分割学习过程,将优化空间解耦:服务器端执行高精度的一阶更新(即BP),客户端则采用内存高效的零阶优化。这种混合设计不仅消除了客户端对BP的需求,还实现了与维度无关的模型聚合,大幅降低通信开销,并通过理论证明缓解了零阶优化固有的维度相关收敛慢问题,最终达到与一阶方法相当的收敛速度。
链接: https://arxiv.org/abs/2603.14773
作者: Qiyuan Chen,Xian Wu,Yi Wang,Xianhao Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 22 pages, 8 figures
Abstract:Fine-tuning large models on edge devices is severely hindered by the memory-intensive backpropagation (BP) in standard frameworks like federated learning and split learning. While substituting BP with zeroth-order optimization can significantly reduce memory footprints, it typically suffers from prohibitively degraded convergence speed. To resolve this dilemma, we propose Hybrid-Order Split Federated Learning (HO-SFL). By reformulating the split learning process within a Lagrangian framework, HO-SFL decouples the optimization landscape: The server performs precise first-order updates (i.e., BP), whereas clients conduct memory-efficient zeroth-order optimization. This hybrid design not only eliminates the need for client-side BP but also enables dimension-free model aggregation, drastically lowering communication costs. Crucially, we provide a theoretical convergence analysis, demonstrating that HO-SFL mitigates the dimension-dependent convergence slowdown of zeroth-order optimization, achieving a convergence rate comparable to first-order methods. Extensive experiments on tasks across vision and language modalities validate that HO-SFL achieves convergence speeds comparable to first-order baselines while significantly reducing communication costs and client memory footprints.
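客户端免反向传播的关键是零阶优化:仅用两次前向计算即可估计沿随机方向 u 的方向导数,再乘以 u 得到梯度估计。以下为常见的两点估计器骨架(非论文的精确更新规则,仅演示原理):

```python
import random

def two_point_estimate(loss_fn, params, u, mu=1e-3):
    # 两点零阶估计:g = (loss(p + mu*u) - loss(p - mu*u)) / (2*mu) 近似方向导数,
    # 梯度估计为 g*u;全程只有前向计算,无需存储反向传播的激活
    plus = [p + mu * ui for p, ui in zip(params, u)]
    minus = [p - mu * ui for p, ui in zip(params, u)]
    g = (loss_fn(plus) - loss_fn(minus)) / (2.0 * mu)
    return [g * ui for ui in u]

def zo_gradient(loss_fn, params, num_dirs=64, mu=1e-3, seed=0):
    # 对多条高斯随机方向取平均;由于 E[u u^T] = I,平均值是真梯度的无偏估计
    rng = random.Random(seed)
    acc = [0.0] * len(params)
    for _ in range(num_dirs):
        u = [rng.gauss(0.0, 1.0) for _ in params]
        est = two_point_estimate(loss_fn, params, u, mu)
        acc = [a + e / num_dirs for a, e in zip(acc, est)]
    return acc
```

HO-SFL 的混合设计正是让客户端用这类零阶更新,而服务器端保留精确的一阶反向传播,从而缓解纯零阶方法随维度增大的收敛变慢问题。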
[AI-78] OpenHospital: A Thing-in-itself Arena for Evolving and Benchmarking LLM-based Collective Intelligence
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)在医疗领域应用中面临的“数据墙”问题,即受限于高质量标注数据的获取与迭代,导致LLM代理能力难以持续提升。为此,作者提出OpenHospital这一交互式平台,其核心解决方案是引入“数据在代理自身中”(data-in-agent-self)范式,通过医师代理与患者代理之间的动态交互,实现LLM代理集体智能(Collective Intelligence, CI)的演化与量化评估。该范式不仅加速了代理能力的自我增强,还提供了医学专业能力与系统效率的双重基准测试指标,从而有效推动LLM-based CI的演进与可衡量性。
链接: https://arxiv.org/abs/2603.14771
作者: Peigen Liu,Rui Ding,Yuren Mao,Ziyan Jiang,Yuxiang Ye,Yunjun Gao,Ying Zhang,Renjie Sun,Longbin Lai,Zhengping Qian
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Model (LLM)-based Collective Intelligence (CI) presents a promising approach to overcoming the data wall and continuously boosting the capabilities of LLM agents. However, there is currently no dedicated arena for evolving and benchmarking LLM-based CI. To address this gap, we introduce OpenHospital, an interactive arena where physician agents can evolve CI through interactions with patient agents. This arena employs a data-in-agent-self paradigm that rapidly enhances agent capabilities and provides robust evaluation metrics for benchmarking both medical proficiency and system efficiency. Experiments demonstrate the effectiveness of OpenHospital in both fostering and quantifying CI.
[AI-79] POLCA: Stochastic Generative Optimization with LLM
【速读】:该论文旨在解决复杂系统(如大语言模型提示词、多轮代理等)优化过程中依赖人工迭代、效率低下且难以应对随机性(如噪声反馈、采样批次波动及系统行为不确定性)的问题。其核心解决方案是提出一种名为优先级优化与局部上下文聚合(Prioritized Optimization with Local Contextual Aggregation, POLCA)的可扩展框架,关键在于通过维护优先队列实现探索与利用的平衡,并结合ε-网机制保障参数多样性、LLM摘要器进行元学习以提升历史试验的知识复用效率,从而在随机环境下高效收敛至近优解。
链接: https://arxiv.org/abs/2603.14769
作者: Xuanfei Ren,Allen Nie,Tengyang Xie,Ching-An Cheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Optimizing complex systems, ranging from LLM prompts to multi-turn agents, traditionally requires labor-intensive manual iteration. We formalize this challenge as a stochastic generative optimization problem where a generative language model acts as the optimizer, guided by numerical rewards and text feedback to discover the best system. We introduce Prioritized Optimization with Local Contextual Aggregation (POLCA), a scalable framework designed to handle stochasticity in optimization – such as noisy feedback, sampling minibatches, and stochastic system behaviors – while effectively managing the unconstrained expansion of solution space. POLCA maintains a priority queue to manage the exploration-exploitation tradeoff, systematically tracking candidate solutions and their evaluation histories. To enhance efficiency, we integrate an ε-Net mechanism to maintain parameter diversity and an LLM Summarizer to perform meta-learning across historical trials. We theoretically prove that POLCA converges to near-optimal candidate solutions under stochasticity. We evaluate our framework on diverse benchmarks, including τ-bench, HotpotQA (agent optimization), VeriBench (code translation) and KernelBench (CUDA kernel generation). Experimental results demonstrate that POLCA achieves robust, sample and time-efficient performance, consistently outperforming state-of-the-art algorithms in both deterministic and stochastic problems. The codebase for this work is publicly available at this https URL.
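优先队列驱动的探索-利用循环可以压缩成一个极小骨架:每轮取当前最优候选,调用 propose(真实系统中由 LLM 依据奖励与文本反馈生成改进方案)产生新候选并重新入队;ε-网多样性维护与 LLM 摘要器等组件此处省略,仅为笔者的示意性简化:

```python
import heapq

def polca_loop(evaluate, propose, init, budget=20):
    # 堆中存 (-reward, candidate),heapq 为最小堆,故堆顶即当前最优候选
    heap = [(-evaluate(init), init)]
    best_r, best_c = -heap[0][0], init
    for _ in range(budget):
        neg_r, cand = heapq.heappop(heap)
        child = propose(cand)             # 真实系统中由 LLM 生成改进候选
        r = evaluate(child)               # 噪声/随机环境下可为多次评估的均值
        if r > best_r:
            best_r, best_c = r, child
        heapq.heappush(heap, (neg_r, cand))   # 原候选保留,支持回退探索
        heapq.heappush(heap, (-r, child))
    return best_c, best_r
```

以一维玩具问题为例(reward = −(x−3)²,propose 每次把候选右移 0.5),该循环会沿优先队列逐步逼近最优点 x = 3。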
[AI-80] BrainBench: Exposing the Commonsense Reasoning Gap in Large Language Models
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在标准基准测试中表现优异,却在涉及常识推理的脑筋急转弯类问题上频繁失败的问题。其解决方案的关键在于提出BrainBench——一个包含100道脑筋急转弯题目、覆盖20个精心设计类别(如隐式物理约束、语义范围陷阱和默认假设劫持等)的细粒度基准测试集,用于系统性诊断LLMs在特定常识推理失效模式中的表现。通过零样本协议对八个前沿模型进行评估,发现即使是表现最好的模型(Claude Opus 4.6 with extended thinking)准确率也仅为80.3%,且存在显著的准确性与一致性差距,揭示了模型依赖表面启发式而非真正常识推理的本质缺陷。
链接: https://arxiv.org/abs/2603.14761
作者: Yuzhe Tang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) achieve impressive scores on standard benchmarks yet routinely fail questions that any human would answer correctly in seconds. We introduce BrainBench, a benchmark of 100 brainteaser questions spanning 20 carefully designed categories, each targeting a specific commonsense reasoning failure mode in LLMs. Categories range from implicit physical constraints (“Should I walk or drive my rental car to the return lot?”) to semantic scope tricks and default assumption hijacks. We evaluate eight frontier models – four from the Claude family and four from the GPT family – using a zero-shot protocol with 10 independent runs per question. The best model, Claude Opus 4.6 with extended thinking, achieves only 80.3% accuracy; the worst, GPT-4o, scores 39.7%. Even top-performing models exhibit a 6-16 percentage-point gap between accuracy and consistency, revealing stochastic reasoning. Cross-lingual evaluation in Chinese shows most models degrade by 2-8 percentage points, confirming that these failures reflect reasoning deficits rather than language-specific artifacts. BrainBench provides a fine-grained diagnostic tool for identifying where and why LLMs substitute surface heuristics for genuine commonsense reasoning.
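摘要中"准确性-一致性差距"的计算可以用下面的草图说明(指标定义为按摘要的合理推测,并非论文官方公式):

```python
def accuracy_and_consistency(runs_per_question):
    """按零样本协议(每题多次独立运行)计算两项指标的示意。
    指标定义为合理推测,非论文官方公式:
    - accuracy:所有运行的平均答对率;
    - consistency:同一题所有运行结果完全一致(全对或全错)的题目占比。
    runs_per_question: {题目 id: [每次运行是否答对(bool)]}"""
    total_runs = total_correct = stable = 0
    for results in runs_per_question.values():
        total_correct += sum(results)
        total_runs += len(results)
        if len(set(results)) == 1:   # 所有运行给出同一结果
            stable += 1
    accuracy = total_correct / total_runs
    consistency = stable / len(runs_per_question)
    # 第三项即"准确性-一致性差距",差距大说明推理具有随机性
    return accuracy, consistency, accuracy - consistency
```

例如某题 10 次运行中 5 次答对、5 次答错,单看准确率是 50%,但一致性为 0,这正是摘要所说"随机推理"的表现。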
[AI-81] Gauge-Equivariant Intrinsic Neural Operators for Geometry-Consistent Learning of Elliptic PDE Maps
【速读】:该论文旨在解决几何偏微分方程(geometric PDEs)中现有神经算子架构在面对局部坐标系变换(gauge transformation)时表现不稳定、对度量扰动敏感以及离散化变化下鲁棒性差的问题。其解决方案的关键在于提出了一种规范等变内在神经算子(Gauge-Equivariant Intrinsic Neural Operators, GINO),通过将椭圆解映射主要参数化为作用于依赖几何谱的内在谱乘子,并结合规范等变非线性结构,从而实现几何信息与可学习函数依赖性的解耦,并强制保证在框架变换下的不变性。这一设计显著提升了模型在多分辨率、度量扰动和结构保持任务中的稳定性与泛化能力。
链接: https://arxiv.org/abs/2603.14734
作者: Pengcheng Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 55 pages, 13 figures
Abstract:Learning solution operators of partial differential equations (PDEs) from data has emerged as a promising route to fast surrogate models in multi-query scientific workflows. However, for geometric PDEs whose inputs and outputs transform under changes of local frame (gauge), many existing operator-learning architectures remain representation-dependent, brittle under metric perturbations, and sensitive to discretization changes. We propose Gauge-Equivariant Intrinsic Neural Operators (GINO), a class of neural operators that parameterize elliptic solution maps primarily through intrinsic spectral multipliers acting on geometry-dependent spectra, coupled with gauge-equivariant nonlinearities. This design decouples geometry from learnable functional dependence and enforces consistency under frame transformations. We validate GINO on controlled problems on the flat torus ($\mathbb{T}^2$), where ground-truth resolvent operators and regularized Helmholtz–Hodge decompositions admit closed-form Fourier representations, enabling theory-aligned diagnostics. Across experiments E1–E6, GINO achieves low operator-approximation error, near machine-precision gauge equivariance, robustness to structured metric perturbations, strong cross-resolution generalization with small commutation error under restriction/prolongation, and structure-preserving performance on a regularized exact/coexact decomposition task. Ablations further link the smoothness of the learned spectral multiplier to stability under geometric perturbations. These results suggest that enforcing intrinsic structure and gauge equivariance yields operator surrogates that are more geometry-consistent and discretization-robust for elliptic PDEs on form-valued fields.
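摘要中"内在谱乘子"的思想可以用平坦环面上的傅里叶实现来示意(假设性草图:函数名与乘子示例均为演示所设,实际模型中乘子由神经网络参数化):

```python
import numpy as np

def spectral_operator(u, multiplier):
    """GINO 式"内在谱乘子"在平坦环面 T^2 上的示意(假设性实现):
    对输入场做傅里叶变换,按每个模式的拉普拉斯特征值 λ = |k|^2
    乘上乘子 m(λ),再逆变换回物理空间。几何信息(谱)与可学习的
    函数依赖(乘子)由此解耦。"""
    n = u.shape[0]
    k = np.fft.fftfreq(n, d=1.0 / n)            # 环面上的整数波数
    kx, ky = np.meshgrid(k, k, indexing="ij")
    lam = kx**2 + ky**2                          # 拉普拉斯谱 |k|^2
    return np.real(np.fft.ifft2(np.fft.fft2(u) * multiplier(lam)))
```

例如取 m(λ) = 1/(1+λ),即得到 (I−Δ)⁻¹ 型的椭圆解算子,可与解析解逐模式对照,正对应摘要中"闭式傅里叶表示、理论对齐诊断"的受控实验设置。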
[AI-82] GameUIAgent: An LLM-Powered Framework for Automated Game UI Design with Structured Intermediate Representation
【速读】:该论文旨在解决游戏用户界面(Game UI)设计中跨稀有度等级(rarity tiers)视觉资产一致性难以保障的问题,而当前流程高度依赖人工操作。其解决方案核心在于提出 GameUIAgent——一个基于大语言模型(LLM)的智能代理框架,通过 Design Spec JSON 作为中间表示,将自然语言描述转化为可编辑的 Figma 设计文件。该框架采用六阶段神经符号流水线,融合 LLM 生成、确定性后处理与基于视觉-语言模型(VLM)的反射控制器(Reflection Controller, RC),实现迭代式自我修正并确保质量不退化。关键创新在于揭示了“质量天花板效应”(Quality Ceiling Effect)和“渲染-评估保真度原则”,为生成式 AI 在游戏美术生产中的应用提供了理论边界与实践指导。
链接: https://arxiv.org/abs/2603.14724
作者: Wei Zeng,Fengwei An,Zhen Liu,Jian Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 6 figures
Abstract:Game UI design requires consistent visual assets across rarity tiers yet remains a predominantly manual process. We present GameUIAgent, an LLM-powered agentic framework that translates natural language descriptions into editable Figma designs via a Design Spec JSON intermediate representation. A six-stage neuro-symbolic pipeline combines LLM generation, deterministic post-processing, and a Vision-Language Model (VLM)-guided Reflection Controller (RC) for iterative self-correction with guaranteed non-regressive quality. Evaluated across 110 test cases, three LLMs, and three UI templates, cross-model analysis establishes a game-domain failure taxonomy (rarity-dependent degradation; visual emptiness) and uncovers two key empirical findings. A Quality Ceiling Effect (Pearson r = -0.96, p < 0.01) suggests that RC improvement is bounded by headroom below a quality threshold – a visual-domain counterpart to test-time compute scaling laws. A Rendering-Evaluation Fidelity Principle reveals that partial rendering enhancements paradoxically degrade VLM evaluation by amplifying structural defects. Together, these results establish foundational principles for LLM-driven visual generation agents in game production.
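摘要中反射控制器"质量不退化"的保证机制可以用一个简单的接受/回退循环来说明(假设性草图,score_fn 代替实际的 VLM 评分,revise_fn 代替 LLM 修订):

```python
def reflect(design, score_fn, revise_fn, max_rounds=3):
    """反射控制器(RC)"质量不退化"保证的示意(假设性实现):
    每轮对当前最优设计做一次修订并打分;仅当得分严格提高时才
    接受修订,否则回退到原版本,因此最终质量单调不降。"""
    best, best_score = design, score_fn(design)
    for _ in range(max_rounds):
        candidate = revise_fn(best)
        candidate_score = score_fn(candidate)
        if candidate_score > best_score:    # 非退化:不升则弃
            best, best_score = candidate, candidate_score
    return best, best_score
```

这种"只进不退"的接受准则也解释了摘要中的质量天花板效应:一旦当前得分逼近评分上限,后续轮次的可提升空间(headroom)就趋近于零。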
[AI-83] Beyond Local Code Optimization: Multi-Agent Reasoning for Software System Optimization
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)和AI代理的软件性能优化方法在面对现代微服务架构时存在的局限性问题,即现有方法主要依赖局部、语法驱动的代码变换,难以对程序行为进行推理或捕捉整个系统的性能交互。其解决方案的关键在于提出一个由多个协作智能体组成的多代理框架,该框架融合控制流与数据流表示,并引入架构级和跨组件依赖信号,以支持系统级性能推理;具体包括总结、分析、优化和验证四个协同角色,共同识别横切瓶颈并构建跨越软件栈的多步骤优化策略,从而实现对微服务系统的整体性能提升。
链接: https://arxiv.org/abs/2603.14703
作者: Huiyun Peng,Parth Vinod Patil,Antonio Zhong Qiu,George K. Thiruvathukal,James C. Davis
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models and AI agents have recently shown promise in automating software performance optimization, but existing approaches predominantly rely on local, syntax-driven code transformations. This limits their ability to reason about program behavior and capture whole system performance interactions. As modern software increasingly comprises interacting components - such as microservices, databases, and shared infrastructure - effective code optimization requires reasoning about program structure and system architecture beyond individual functions or files. This paper explores the feasibility of whole system optimization for microservices. We introduce a multi-agent framework that integrates control-flow and data-flow representations with architectural and cross-component dependency signals to support system-level performance reasoning. The proposed system is decomposed into coordinated agent roles - summarization, analysis, optimization, and verification - that collaboratively identify cross-cutting bottlenecks and construct multi-step optimization strategies spanning the software stack. We present a proof-of-concept on a microservice-based system that illustrates the effectiveness of our proposed framework, achieving a 36.58% improvement in throughput and a 27.81% reduction in average response time.
[AI-84] Applications of Intuitionistic Temporal Logic to Temporal Answer Set Programming
【速读】:该论文旨在解决时序答案集编程(Temporal Answer Set Programming, TASP)的逻辑基础问题,通过引入时序均衡逻辑(Temporal Equilibrium Logic)这一形式化框架,将Pearce的均衡逻辑与Osorio的安全信念理论推广至线性时序场景。其解决方案的关键在于:利用基于“此处-那里”逻辑(logic of here-and-there)的不动点刻画,建立时序直觉逻辑(temporal intuitionistic logic)与时序逻辑编程之间的形式对应关系,从而深化TASP的理论根基,并为时序推理研究开辟新路径。
链接: https://arxiv.org/abs/2603.14692
作者: Pedro Cabalar,Martín Diéguez,David Fernández-Duque,François Laferrière,Torsten Schaub,Igor Stéphan
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: Under consideration in Theory and Practice of Logic Programming (TPLP)
Abstract:The relationship between intuitionistic or intermediate logics and logic programming has been extensively studied, prominently featuring Pearce’s equilibrium logic and Osorio’s safe beliefs. Equilibrium logic admits a fixpoint characterization based on the logic of here-and-there, akin to theory completion in default and autoepistemic logics. Safe beliefs are similarly defined via a fixpoint operator, albeit under the semantics of intuitionistic or other intermediate logics. In this paper, we investigate the logical foundations of Temporal Answer Set Programming through the lens of Temporal Equilibrium Logic, a formalism combining equilibrium logic with linear-time temporal operators. We lift the seminal approaches of Pearce and Osorio to the temporal setting, establishing a formal correspondence between temporal intuitionistic logic and temporal logic programming. Our results deepen the theoretical underpinnings of Temporal Answer Set Programming and provide new avenues for research in temporal reasoning. 
[AI-85] AgentTrace: Causal Graph Tracing for Root Cause Analysis in Deployed Multi-Agent Systems ICLR2026
【速读】:该论文旨在解决多智能体AI系统(multi-agent AI systems)在实际部署中因级联效应、隐藏依赖关系和长执行日志导致的故障诊断难题。其核心解决方案是提出AgentTrace,一个轻量级因果追踪框架,通过从执行日志重建因果图,从错误表现回溯并利用可解释的结构和位置信号对候选根因进行排序,从而实现高精度、低延迟的后验故障定位,且无需在调试时调用大语言模型(LLM)推理。
链接: https://arxiv.org/abs/2603.14688
作者: Zhaohui Geoffrey Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 13 pages, 6 figures. Accepted at the ICLR 2026 AIWILD Workshop. Camera-ready version
Abstract:As multi-agent AI systems are increasingly deployed in real-world settings - from automated customer support to DevOps remediation - failures become harder to diagnose due to cascading effects, hidden dependencies, and long execution traces. We present AgentTrace, a lightweight causal tracing framework for post-hoc failure diagnosis in deployed multi-agent workflows. AgentTrace reconstructs causal graphs from execution logs, traces backward from error manifestations, and ranks candidate root causes using interpretable structural and positional signals - without requiring LLM inference at debugging time. Across a diverse benchmark of multi-agent failure scenarios designed to reflect common deployment patterns, AgentTrace localizes root causes with high accuracy and sub-second latency, significantly outperforming both heuristic and LLM-based baselines. Our results suggest that causal tracing provides a practical foundation for improving the reliability and trustworthiness of agentic systems in the wild.
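摘要中"从错误表现反向回溯因果图并用结构/位置信号排序候选根因"的流程可以用下面的草图示意(假设性实现:具体打分权重为演示所设,并非论文公式):

```python
def rank_root_causes(edges, error_node):
    """AgentTrace 式因果回溯的示意(假设性打分规则)。
    edges: [(上游事件, 下游事件)] 构成的执行因果图;
    从错误表现节点反向 BFS 收集所有祖先,再按可解释的
    结构/位置信号("离错误越远越靠近源头"的距离 + 扇出)排序。"""
    parents, fanout = {}, {}
    for u, v in edges:
        parents.setdefault(v, []).append(u)
        fanout[u] = fanout.get(u, 0) + 1
    # 反向 BFS:记录每个祖先到错误节点的因果距离
    dist, frontier, d = {}, [error_node], 0
    while frontier:
        nxt = []
        for node in frontier:
            for p in parents.get(node, []):
                if p not in dist:
                    dist[p] = d + 1
                    nxt.append(p)
        frontier, d = nxt, d + 1
    # 打分:距离 + 0.5 × 扇出(权重为假设性选择)
    scored = [(n, dist[n] + 0.5 * fanout.get(n, 0)) for n in dist]
    return sorted(scored, key=lambda t: -t[1])
```

整个过程只做图遍历与加权排序,不调用任何 LLM,这正是摘要强调"调试时无需 LLM 推理、亚秒级延迟"的原因。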
[AI-86] RenderMem: Rendering as Spatial Memory Retrieval
【速读】:该论文旨在解决当前具身智能体的空间记忆系统在处理视角依赖性推理时的局限性问题,即现有方法通常存储多视角观测或基于物体的抽象表示,难以实现具有显式几何基础的推理。其解决方案的关键在于提出RenderMem框架,将渲染(rendering)作为3D世界表示与空间推理之间的接口:该框架不存储固定观测,而是维护一个3D场景表示,并根据查询所隐含的视角动态生成视觉证据,从而支持智能体从任意视角直接进行可视性、视线遮挡等几何推理。
链接: https://arxiv.org/abs/2603.14669
作者: JooHyun Park,HyeongYeop Kang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Embodied reasoning is inherently viewpoint-dependent: what is visible, occluded, or reachable depends critically on where the agent stands. However, existing spatial memory systems for embodied agents typically store either multi-view observations or object-centric abstractions, making it difficult to perform reasoning with explicit geometric grounding. We introduce RenderMem, a spatial memory framework that treats rendering as the interface between 3D world representations and spatial reasoning. Instead of storing fixed observations, RenderMem maintains a 3D scene representation and generates query-conditioned visual evidence by rendering the scene from viewpoints implied by the query. This enables embodied agents to reason directly about line-of-sight, visibility, and occlusion from arbitrary perspectives. RenderMem is fully compatible with existing vision-language models and requires no modification to standard architectures. Experiments in the AI2-THOR environment show consistent improvements on viewpoint-dependent visibility and occlusion queries over prior memory baselines.
[AI-87] Gradient Atoms: Unsupervised Discovery Attribution and Steering of Model Behaviors via Sparse Decomposition of Training Gradients
【速读】:该论文旨在解决训练数据归因(Training Data Attribution, TDA)方法在细调(fine-tuning)场景下的根本性局限问题:现有TDA方法基于单文档的归因框架,难以捕捉模型通过共享概念从多文档中学习到的高层次行为,且其监督式设计需依赖用户指定查询行为,导致计算成本高且无法发现未预设的行为模式。解决方案的关键在于提出一种无监督方法——Gradient Atoms,该方法通过在预条件特征空间中利用字典学习(dictionary learning)将每文档的训练梯度分解为稀疏组件(称为“原子”),从而识别出具有高一致性且可解释的任务类型行为(如拒绝、算术运算、是非分类等),而无需任何行为标签;这些原子同时作为有效的控制向量,在权重空间施加扰动即可实现对模型行为的大规模、可控调整,且该方法不依赖于查询-文档评分阶段,其计算复杂度与感兴趣查询行为的数量无关。
链接: https://arxiv.org/abs/2603.14665
作者: J Rosser
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Training data attribution (TDA) methods ask which training documents are responsible for a model behavior. We argue that this per-document framing is fundamentally mismatched to how fine-tuning actually works: models often learn broad concepts shared across many examples. Existing TDA methods are supervised – they require a query behavior, then score every training document against it – making them both expensive and unable to surface behaviors the user did not think to ask about. We present Gradient Atoms, an unsupervised method that decomposes per-document training gradients into sparse components (“atoms”) via dictionary learning in a preconditioned eigenspace. Among the 500 discovered atoms, the highest-coherence ones recover interpretable task-type behaviors – refusal, arithmetic, yes/no classification, trivia QA – without any behavioral labels. These atoms double as effective steering vectors: applying them as weight-space perturbations produces large, controllable shifts in model behavior (e.g., bulleted-list generation 33% to 94%; systematic refusal 50% to 0%). The method requires no query–document scoring stage, and scales independently of the number of query behaviors of interest. Code is here: this https URL
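摘要中"梯度稀疏分解 + 权重空间扰动控制"两步可以用下面的草图示意(假设性实现:用简单的 top-k 相关选择代替论文中预条件特征空间里的字典学习求解器):

```python
import numpy as np

def sparse_code(grad, atoms, k=2):
    """把单文档训练梯度分解为少量"原子"稀疏组合的示意
    (假设性简化:用 top-k 相关选择代替字典学习求解器)。
    atoms: (num_atoms, dim) 的字典矩阵;返回稀疏系数向量。"""
    corr = atoms @ grad                        # 每个原子与梯度的相关
    idx = np.argsort(-np.abs(corr))[:k]        # 只保留最相关的 k 个原子
    coef = np.zeros(len(atoms))
    coef[idx] = corr[idx]                      # 其余系数置零 => 稀疏
    return coef

def steer(weights, atom, alpha):
    """原子兼作控制向量:直接在权重空间施加扰动以调控模型行为。"""
    return weights + alpha * atom
```

注意这一流程没有"查询-文档打分"阶段:分解一次即可离线发现所有原子,计算量与关心的查询行为数量无关,正是摘要所说无监督方法相对 TDA 的优势。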
[AI-88] A Methodology for Thermal Limit Bias Predictability Through Artificial Intelligence
【速读】:该论文旨在解决核电厂运行中因离线与在线热限(thermal limits)之间不可预测偏差所导致的热限偏置(thermal limit bias)问题,该现象会引发保守的设计裕度、燃料成本上升及运行效率降低。解决方案的关键在于提出一种基于深度学习的方法,采用全卷积编码器-解码器架构并融合特征网络,用于预测并校正沸水堆(BWRs)中的最大限制功率密度占比(MFLPD),从而更准确地逼近在线实测值。模型在五个独立燃料循环中验证,显著降低了节点阵列平均误差(减少74%)、限制值平均绝对偏差(减少72%)和最大偏置(减少52%),展现出提升燃料循环经济性和运行规划能力的潜力。
链接: https://arxiv.org/abs/2603.14648
作者: Anirudh Tunga,Michael J. Mueterthies,Jonathan Nistor
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Nuclear power plant operators face significant challenges due to unpredictable deviations between offline and online thermal limits, a phenomenon known as thermal limit bias, which leads to conservative design margins, increased fuel costs, and operational inefficiencies. This work presents a deep learning based methodology to predict and correct this bias for Boiling Water Reactors (BWRs), focusing on the Maximum Fraction of Limiting Power Density (MFLPD) metric used to track the Linear Heat Generation Rate (LHGR) limit. The proposed model employs a fully convolutional encoder decoder architecture, incorporating a feature fusion network to predict corrected MFLPD values closer to online measurements. Evaluated across five independent fuel cycles, the model reduces the mean nodal array error by 74 percent, the mean absolute deviation in limiting values by 72 percent, and the maximum bias by 52 percent compared to offline methods. These results demonstrate the model’s potential to meaningfully improve fuel cycle economics and operational planning, and a commercial variant has been deployed at multiple operating BWRs.
[AI-89] Dynamic Theory of Mind as a Temporal Memory Problem: Evidence from Large Language Models
【速读】:该论文试图解决的问题是:当前对心智理论(Theory of Mind, ToM)的评估多局限于静态信念判断(如错误信念测试),忽视了ToM的核心动态特性——即个体在时间维度上对他人信念的表征、更新与回忆能力。为填补这一空白,作者提出将ToM视为一个时序扩展的表征记忆问题,并引入DToM-Track评估框架,通过受控的多轮对话任务来检验大语言模型(LLMs)是否能够追踪信念轨迹,包括对更新前信念的记忆、当前信念的推理以及信念变化的检测。其解决方案的关键在于:利用LLMs作为计算探针,揭示出模型在信念推理中存在显著不对称性——即能可靠推断当前信念,但难以维持和检索更新后的先前信念状态,这种现象与认知科学中已知的近因效应(recency bias)和干扰效应(interference effects)一致,从而表明动态信念追踪是一个区别于传统错误信念推理的全新挑战。
链接: https://arxiv.org/abs/2603.14646
作者: Thuy Ngoc Nguyen,Duy Nhat Phan,Cleotilde Gonzalez
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures, 3 tables, conference
Abstract:Theory of Mind (ToM) is central to social cognition and human-AI interaction, and Large Language Models (LLMs) have been used to help understand and represent ToM. However, most evaluations treat ToM as a static judgment at a single moment, primarily relying on tests of false beliefs. This overlooks a key dynamic dimension of ToM: the ability to represent, update, and retrieve others’ beliefs over time. We investigate dynamic ToM as a temporally extended representational memory problem, asking whether LLMs can track belief trajectories across interactions rather than only inferring current beliefs. We introduce DToM-Track, an evaluation framework to investigate temporal belief reasoning in controlled multiturn conversations, testing the recall of beliefs held prior to an update, the inference of current beliefs, and the detection of belief change. Using LLMs as computational probes, we find a consistent asymmetry: models reliably infer an agent’s current belief but struggle to maintain and retrieve prior belief states once updates occur. This pattern persists across LLM model families and scales, and is consistent with recency bias and interference effects well documented in cognitive science. These results suggest that tracking belief trajectories over time poses a distinct challenge beyond classical false-belief reasoning. By framing ToM as a problem of temporal representation and retrieval, this work connects ToM to core cognitive mechanisms of memory and interference and exposes the implications for LLM models of social reasoning in extended human-AI interactions.
[AI-90] s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的神经符号推理方法在数学竞赛类定理证明任务中表现优异,但缺乏对真实工业级低层代码(如加密库汇编实现)进行形式化验证的能力这一问题。其解决方案的关键在于构建了一个名为s2n-bignum-bench的新基准,该基准源自AWS实际使用的工业级密码学库s2n-bignum,其汇编代码已在HOL Light中被人类专家完成形式化规格说明与证明。该基准将已验证的正式规格作为输入,要求LLM生成可被HOL Light自动检查并接受的证明脚本,从而提供一个贴近现实、具有挑战性的测试平台,用于评估LLM在工业级低层代码形式化证明合成中的能力。
链接: https://arxiv.org/abs/2603.14628
作者: Balaji Rao,John Harrison,Soonho Kong,Juneyoung Lee,Carlo Lipizzi
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Logic in Computer Science (cs.LO)
备注: Under review as a Workshop paper at AIPV 2026
Abstract:Neurosymbolic approaches leveraging Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n-bignum is a library used at AWS for providing fast assembly routines for cryptography, and its correctness is established by formal verification. The task of formally verifying this library has been a significant achievement for the Automated Reasoning Group. It involved two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition is correct. In the case of s2n-bignum, both tasks were carried out by human experts. In *s2n-bignum-bench*, we provide the formal specification and ask the LLM to generate a proof script that is accepted by HOL Light within a fixed proof-check timeout. To our knowledge, *s2n-bignum-bench* is the first public benchmark focused on machine-checkable proof synthesis for industrial low-level cryptographic assembly routines in HOL Light. This benchmark provides a challenging and practically relevant testbed for evaluating LLM-based theorem proving beyond competition mathematics. The code to set up and use the benchmark is available here: this https URL.
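摘要描述的评测流程(输入已验证的规格、由 LLM 生成证明脚本、在固定超时内由证明检查器判定)可以用下面的通用评测器草图说明(假设性实现:generate_proof / check_proof 均为占位,实际 check_proof 应调用 HOL Light):

```python
import concurrent.futures

def grade(spec, generate_proof, check_proof, timeout_s=60):
    """s2n-bignum-bench 式评测流程的示意(假设性实现):
    1) 给定已验证的形式规格 spec;
    2) 由 LLM 生成证明脚本(此处以 generate_proof 代替);
    3) 在固定超时内交给证明检查器(check_proof,实际应调用
       HOL Light)判定脚本是否被接受;超时视为失败。"""
    script = generate_proof(spec)
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(check_proof, script)
        try:
            return bool(fut.result(timeout=timeout_s))  # True = 证明被接受
        except concurrent.futures.TimeoutError:
            return False
```

由于判定方是证明检查器而非 LLM 自评,这一协议的结果是机器可校验的,这也是该基准区别于普通代码推理基准的关键。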
[AI-91] LLM-Augmented Release Intelligence: Automated Change Summarization and Impact Analysis in Cloud-Native CI/CD Pipelines
【速读】:该论文旨在解决云原生软件交付平台中,跨环境发布(如从开发到预发布、再到生产)时,工程团队难以高效、准确地获取变更内容及其影响范围的问题。在多阶段CI/CD流水线中,单次发布可能涉及多个作者和数十个独立版本化的任务,手动整理发布沟通信息效率低且易出错。解决方案的关键在于提出一个AI增强的发布智能框架,其核心包括三项能力:(1)自动化提交收集与语义过滤,以识别实质性变更并抑制常规维护;(2)基于结构化大语言模型(Large Language Model, LLM)的摘要生成,输出面向不同利益相关者的分类化促进报告;(3)静态任务-流水线依赖分析,精确映射修改的任务与其参与的所有流水线,量化每项变更的影响范围(blast radius)。该框架集成于CI/CD促进流程中,作为GitHub Actions触发的后置步骤运行,已在支持超过60个Tekton任务和20余条发布流水线的Kubernetes原生平台中落地应用。
链接: https://arxiv.org/abs/2603.14619
作者: Happy Bhati(Northeastern University)
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 13 pages, 1 figure, 4 tables
Abstract:Cloud-native software delivery platforms orchestrate releases through complex, multi-stage pipelines composed of dozens of independently versioned tasks. When code is promoted between environments – development to staging, staging to production – engineering teams need timely, accurate communication about what changed and what downstream components are affected. Manual preparation of such release communication is slow, inconsistent, and particularly error-prone in repositories where a single promotion may bundle contributions from many authors across numerous pipeline tasks. We present a framework for AI-augmented release intelligence that combines three capabilities: (1) automated commit collection with semantic filtering to surface substantive changes while suppressing routine maintenance, (2) structured large language model summarization that produces categorized, stakeholder-oriented promotion reports, and (3) static task-pipeline dependency analysis that maps modified tasks to every pipeline they participate in, quantifying the blast radius of each change. The framework is integrated directly into the CI/CD promotion workflow and operates as a post-promotion step triggered by GitHub Actions. We describe the architecture and implementation within a production Kubernetes-native release platform that manages over sixty Tekton tasks across more than twenty release pipelines. Through concrete walkthrough examples and qualitative comparison with recent tools such as SmartNote and VerLog, we discuss the distinctive requirements of internal promotion communication versus user-facing release notes and identify open challenges for LLM-driven release engineering.
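摘要第三项能力,即"把被修改的任务静态映射到其参与的所有流水线、量化影响范围(blast radius)",本质上是一次依赖反查,可用下面的草图说明(假设性数据结构,非该平台的真实 API):

```python
def blast_radius(pipelines, modified_tasks):
    """"影响范围(blast radius)"计算的示意(假设性数据结构):
    静态地把每个被修改的任务映射到其参与的所有流水线,
    并汇总整体受影响的流水线集合。
    pipelines: {流水线名: [其包含的任务列表]}"""
    impact = {}
    for task in modified_tasks:
        impact[task] = sorted(p for p, tasks in pipelines.items() if task in tasks)
    affected = sorted({p for ps in impact.values() for p in ps})
    return impact, affected   # 每项变更的影响流水线 + 总受影响集合
```

在文中"六十余个 Tekton 任务、二十余条发布流水线"的规模下,这一步为 LLM 生成的促进报告补充了确定性的、可核查的影响清单。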
[AI-92] Delightful Policy Gradient
【速读】:该论文旨在解决标准策略梯度方法在更新方向上存在的两个病理问题:其一,在单个决策上下文(如一张图像或提示)中,罕见但负向优势的动作会因仅按优势加权而过度扭曲梯度方向;其二,在批量处理多个上下文时,期望梯度会过度分配优化预算给当前策略已表现良好的情境。解决方案的关键在于提出愉悦策略梯度(Delightful Policy Gradient, DG),该方法通过一个sigmoid函数对每个梯度项进行门控,门控信号由“愉悦度”(delight)决定,即优势(advantage)与动作意外性(action surprisal,即负对数概率)的乘积。这一设计使得梯度更新更聚焦于高价值且出乎意料的动作,从而提升单个上下文中的方向准确性,并在多上下文中使期望梯度更接近监督交叉熵基准,且该改进并非源于方差降低,即使样本无限也依然有效。
链接: https://arxiv.org/abs/2603.14608
作者: Ian Osband
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:
Abstract:Standard policy gradients weight each sampled action by advantage alone, regardless of how likely that action was under the current policy. This creates two pathologies: within a single decision context (e.g. one image or prompt), a rare negative-advantage action can disproportionately distort the update direction; across many such contexts in a batch, the expected gradient over-allocates budget to contexts the policy already handles well. We introduce the *Delightful Policy Gradient* (DG), which gates each term with a sigmoid of *delight*, the product of advantage and action surprisal (negative log-probability). For $K$-armed bandits, DG provably improves directional accuracy in a single context and, across multiple contexts, shifts the expected gradient strictly closer to the supervised cross-entropy oracle. This second effect is not variance reduction: it persists even with infinite samples. Empirically, DG outperforms REINFORCE, PPO, and advantage-weighted baselines across MNIST, transformer sequence modeling, and continuous control, with larger gains on harder tasks.
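摘要定义的门控权重可以写成一行公式:delight = 优势 × 意外度(-log 概率),再经 sigmoid 门控。下面的草图给出单个采样动作的权重计算(门控与优势的具体组合方式按摘要合理推测,并非论文精确公式):

```python
import math

def dg_weight(advantage, prob):
    """Delightful Policy Gradient 中单个采样动作的梯度权重示意
    (组合方式为按摘要的合理推测):
    delight = 优势 × 意外度(-log 概率),经 sigmoid 门控后
    乘回标准策略梯度中的优势权重。"""
    surprisal = -math.log(prob)              # 动作意外性
    delight = advantage * surprisal
    gate = 1.0 / (1.0 + math.exp(-delight))  # sigmoid 门控
    return gate * advantage                  # 替代 REINFORCE 的纯优势权重
```

可以验证摘要中的两个病理都被抑制:罕见(高意外度)的负优势动作 delight 很负、门控趋近 0,不再扭曲更新方向;而策略已处理得很好的情境(高概率、低意外度)门控约为 0.5,优化预算被让渡给更"出乎意料"的情境。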
[AI-93] A Loss Landscape Visualization Framework for Interpreting Reinforcement Learning: An ADHDP Case Study
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)算法在动态系统控制中内部学习行为难以解释的问题,特别是价值估计、策略优化与时间差分(Temporal-Difference, TD)信号之间交互机制不清晰的挑战。解决方案的关键在于提出一个四维互补的可视化框架:通过三维重构 critic match loss 表面揭示 TD 目标如何塑造优化几何结构;利用冻结 critic 下的 actor loss 地图展示策略如何利用该几何;结合时间、贝尔曼误差和策略权重的轨迹追踪更新路径;以及状态-TD 映射识别驱动更新的关键状态区域。该框架以航天器姿态控制中的动作相关启发式动态规划(Action-Dependent Heuristic Dynamic Programming, ADHDP)算法为案例,系统性地分析不同变体的训练稳定器与目标更新对优化景观的影响,从而提供一种可解释且系统的工具用于理解RL算法的设计与行为。
链接: https://arxiv.org/abs/2603.14600
作者: Jingyi Liu,Jian Guo,Eberhard Gill
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Submitted to Acta Astronautica
Abstract:Reinforcement learning algorithms have been widely used in dynamic and control systems. However, interpreting their internal learning behavior remains a challenge. In the authors’ previous work, a critic match loss landscape visualization method was proposed to study critic training. This study extends that method into a framework which provides a multi-perspective view of the learning dynamics, clarifying how value estimation, policy optimization, and temporal-difference (TD) signals interact during training. The proposed framework includes four complementary components; a three-dimensional reconstruction of the critic match loss surface that shows how TD targets shape the optimization geometry; an actor loss landscape under a frozen critic that reveals how the policy exploits that geometry; a trajectory combining time, Bellman error, and policy weights that indicates how updates move across the surface; and a state-TD map that identifies the state regions that drive those updates. The Action-Dependent Heuristic Dynamic Programming (ADHDP) algorithm for spacecraft attitude control is used as a case study. The framework is applied to compare several ADHDP variants and shows how training stabilizers and target updates change the optimization landscape and affect learning stability. Therefore, the proposed framework provides a systematic and interpretable tool for analyzing reinforcement learning behavior across algorithmic designs.
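可视化框架的第一步,即把训练中记录的 critic 参数快照投影到主成分平面、作为三维损失面的底座坐标,可以用下面的草图说明(假设性实现,省略了损失面本身的网格求值):

```python
import numpy as np

def landscape_axes(param_history):
    """损失景观可视化第一步的示意(假设性实现):把训练过程中
    记录的 critic 参数快照中心化后做 SVD,取前两个主成分方向
    作为三维损失面 (x, y, loss) 的平面坐标系,并给出每个快照
    在该平面上的坐标,即叠加在损失面上的二维优化路径。"""
    P = np.asarray(param_history, dtype=float)
    center = P.mean(axis=0)
    _, _, vt = np.linalg.svd(P - center, full_matrices=False)
    axes = vt[:2]                       # 两个主方向(单位向量)
    coords = (P - center) @ axes.T      # 每个快照的平面坐标
    return center, axes, coords
```

后续只需在该平面的网格点 center + x·axes[0] + y·axes[1] 上求取 critic match loss,即得可叠加优化路径的三维损失面。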
[AI-94] Scaling the Explanation of Multi-Class Bayesian Network Classifiers
【速读】:该论文旨在解决贝叶斯网络分类器(Bayesian Network Classifier, BNC)到类公式(class formula)编译过程中存在的局限性问题,特别是针对现有方法仅适用于二分类任务、编译效率低以及输出公式缺乏结构可解释性等不足。其解决方案的关键在于提出一种新的编译算法,该算法不仅支持多分类场景,显著提升编译速度,且生成的类公式以否定正规形式(Negation Normal Form, NNF)电路表示,并具备OR分解性(OR-decomposable),从而为后续基于逻辑推理的分类决策解释提供高效且结构清晰的表达基础。
链接: https://arxiv.org/abs/2603.14594
作者: Yaofang Zhang,Adnan Darwiche
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注: To appear in the 4th World Conference on Explainable Artificial Intelligence (XAI), 2026
Abstract:We propose a new algorithm for compiling Bayesian network classifier (BNC) into class formulas. Class formulas are logical formulas that represent a classifier’s input-output behavior, and are crucial in the recent line of work that uses logical reasoning to explain the decisions made by classifiers. Compared to prior work on compiling class formulas of BNCs, our proposed algorithm is not restricted to binary classifiers, shows significant improvement in compilation time, and outputs class formulas as negation normal form (NNF) circuits that are OR-decomposable, which is an important property when computing explanations of classifiers.
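摘要输出的类公式是 OR-可分解的 NNF 电路。下面的草图给出 NNF 电路的一种最小表示及 OR-可分解性检查(假设性实现:此处将"OR-可分解"理解为每个 OR 节点的子电路涉及的变量两两不相交,这是对论文性质的合理推测):

```python
def variables(node):
    """收集 NNF 电路节点涉及的变量。节点表示(本示意的假设):
    ('var', 名) / ('not', 名) / ('and', [子节点]) / ('or', [子节点])"""
    kind = node[0]
    if kind in ("var", "not"):
        return {node[1]}
    return set().union(*(variables(c) for c in node[1])) if node[1] else set()

def or_decomposable(node):
    """检查 NNF 电路是否 OR-可分解(示意:要求每个 OR 节点的
    子电路变量集两两不相交;定义为对论文性质的合理推测)。"""
    if node[0] in ("var", "not"):
        return True
    children = node[1]
    if node[0] == "or":
        seen = set()
        for c in children:
            vs = variables(c)
            if seen & vs:          # 两个子电路共享变量 => 不可分解
                return False
            seen |= vs
    return all(or_decomposable(c) for c in children)
```

这类结构性质之所以重要,是因为它让许多解释类查询(如枚举蕴涵项、计数)可以在电路上多项式时间完成。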
[AI-95] Adapting Critic Match Loss Landscape Visualization to Off-policy Reinforcement Learning
【速读】:该论文旨在解决离线策略强化学习(off-policy reinforcement learning, off-policy RL)中评论家(critic)优化几何结构不明确的问题,尤其关注其与在线策略方法在数据流和目标计算上的差异对优化路径的影响。解决方案的关键在于将原有的评论家匹配损失(critic match loss)景观可视化方法从在线策略扩展至离线策略框架,并针对Soft Actor-Critic(SAC)算法进行适配:通过固定回放缓冲区批次(fixed replay batch)和预计算的评论家目标(precomputed critic targets),使损失评估与批处理的数据流和目标更新机制保持一致;随后利用主成分分析(PCA)投影训练过程中记录的评论家参数,构建三维损失景观并叠加二维优化路径,从而实现对评论家优化动态的几何诊断。
链接: https://arxiv.org/abs/2603.14589
作者: Jingyi Liu,Jian Guo,Eberhard Gill
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Revised manuscript, submitted to Astrodynamics
Abstract:This work extends an established critic match loss landscape visualization method from online to off-policy reinforcement learning (RL), aiming to reveal the optimization geometry behind critic learning. Off-policy RL differs from stepwise online actor-critic learning in its replay-based data flow and target computation. Based on these two structural differences, the critic match loss landscape visualization method is adapted to the Soft Actor-Critic (SAC) algorithm by aligning the loss evaluation with its batch-based data flow and target computation, using a fixed replay batch and precomputed critic targets from the selected policy. Critic parameters recorded during training are projected onto a principal component plane, where the critic match loss is evaluated to form a 3-D landscape with an overlaid 2-D optimization path. Applied to a spacecraft attitude control problem, the resulting landscapes are analyzed both qualitatively and quantitatively using sharpness, basin area, and local anisotropy metrics, together with temporal landscape snapshots. Comparisons between convergent SAC, divergent SAC, and divergent Action-Dependent Heuristic Dynamic Programming (ADHDP) cases reveal distinct geometric patterns and optimization behaviors under different algorithmic structures. The results demonstrate that the adapted critic match loss visualization framework serves as a geometric diagnostic tool for analyzing critic optimization dynamics in replay-based off-policy RL-based control problems.
[AI-96] Machine Learning-Driven Intelligent Memory System Design: From On-Chip Caches to Storage MICRO2026
【速读】:该论文旨在解决现代计算平台内存系统中,主流架构策略仍依赖静态人工设计启发式方法、难以根据工作负载和系统行为进行自适应优化的问题。其解决方案的关键在于引入轻量级且实用的机器学习(Machine Learning, ML)方法,实现内存层次结构中从缓存预取、多级缓存预测到混合存储数据放置的全流程自适应控制。具体包括:基于强化学习的片上缓存预取器Pythia、基于感知机学习的片外多级缓存预测器Hermes,以及基于强化学习的数据放置策略Sibyl,三者均在显著优于传统人工设计策略的同时,仅带来适度的硬件开销,验证了将学习驱动的自适应机制集成至内存子系统可实现智能、自我优化的架构设计,从而突破传统方法的性能与效率瓶颈。
链接: https://arxiv.org/abs/2603.14583
作者: Rahul Bera,Rakesh Nadig,Onur Mutlu
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: Extended version of the IEEE Micro 2026 article
Abstract:Despite the data-rich environment in which memory systems of modern computing platforms operate, many state-of-the-art architectural policies employed in the memory system rely on static, human-designed heuristics that fail to truly adapt to the workload and system behavior via principled learning methodologies. In this article, we propose a fundamentally different design approach: using lightweight and practical machine learning (ML) methods to enable adaptive, data-driven control throughout the memory hierarchy. We present three ML-guided architectural policies: (1) Pythia, a reinforcement learning-based data prefetcher for on-chip caches, (2) Hermes, a perceptron learning-based off-chip predictor for multi-level cache hierarchies, and (3) Sibyl, a reinforcement learning-based data placement policy for hybrid storage systems. Our evaluation shows that Pythia, Hermes, and Sibyl significantly outperform the best-prior human-designed policies, while incurring modest hardware overheads. Collectively, this article demonstrates that integrating adaptive learning into memory subsystems can lead to intelligent, self-optimizing architectures that unlock performance and efficiency gains beyond what is possible with traditional human-designed approaches.
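下面以一个极简的表格型 Q-learning 预取器示意“强化学习引导预取”的思想(访问流、动作空间与奖励定义均为虚构的玩具设定,并非 Pythia 的实际设计):

```python
# Toy sketch of RL-guided prefetching (not the actual Pythia design):
# a tabular agent learns which address offset to prefetch for a
# synthetic stride-2 access stream. Reward is 1 when the prefetched
# address matches the next demand access.

offsets = [1, 2, 4]            # candidate prefetch distances
q = {o: 0.0 for o in offsets}  # single-state Q-table for simplicity
alpha = 0.3

stream = list(range(0, 200, 2))  # stride-2 demand accesses
for t in range(len(stream) - 1):
    # round-robin exploration for the first steps, greedy afterwards
    if t < 30:
        action = offsets[t % len(offsets)]
    else:
        action = max(offsets, key=lambda o: q[o])
    reward = 1.0 if stream[t] + action == stream[t + 1] else 0.0
    q[action] += alpha * (reward - q[action])

best_offset = max(offsets, key=lambda o: q[o])  # learns offset 2
```

在 stride-2 的玩具访问流上,智能体会收敛到预取偏移 2;真实的 Pythia 还需处理多程序特征、硬件预算等问题。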
[AI-97] JobMatchAI: An Intelligent Job Matching Platform Using Knowledge Graphs, Semantic Search and Explainable AI
【速读】:该论文旨在解决招聘系统中候选人匹配效率与透明度不足的问题,传统系统多依赖关键词过滤,难以处理技能同义词和非线性职业路径,导致漏筛候选人且匹配分数缺乏可解释性。其解决方案的关键在于构建一个生产就绪的JobMatchAI系统,融合Transformer嵌入(Transformer embeddings)、技能知识图谱(skill knowledge graphs)与可解释重排序机制,实现对技能契合度、经验、地理位置、薪资及公司偏好等多维度的优化,并通过简历驱动的搜索流程提供因子级解释,从而提升匹配精度与决策透明度。
链接: https://arxiv.org/abs/2603.14558
作者: Mayank Vyaas,Abhijit Chakrabroty,Vivek Gupta
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recruiters and job seekers rely on search systems to navigate labor markets, making candidate matching engines critical for hiring outcomes. Most systems act as keyword filters, failing to handle skill synonyms and nonlinear careers, resulting in missed candidates and opaque match scores. We introduce JobMatchAI, a production-ready system integrating Transformer embeddings, skill knowledge graphs, and interpretable reranking. Our system optimizes utility across skill fit, experience, location, salary, and company preferences, providing factor-wise explanations through resume-driven search workflows. We release the JobSearch-XS benchmark and a hybrid retrieval stack combining BM25, knowledge-graph, and semantic components to evaluate skill generalization. We assess system performance on JobSearch-XS across retrieval tasks, and provide a demo video, a hosted website, and an installable package.
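下面的 Python 片段示意“多因子效用 + 因子级解释”的一种计算方式(权重与各因子得分均为假设值,并非 JobMatchAI 的真实效用函数):

```python
# Hypothetical sketch of factor-wise match scoring with per-factor
# explanations; weights and factor scores are illustrative only.

WEIGHTS = {"skills": 0.4, "experience": 0.2, "location": 0.2,
           "salary": 0.1, "company": 0.1}

def match_score(factor_scores):
    """Return (total utility, per-factor contributions) for scores in [0, 1]."""
    contributions = {f: WEIGHTS[f] * factor_scores[f] for f in WEIGHTS}
    return sum(contributions.values()), contributions

total, parts = match_score(
    {"skills": 0.9, "experience": 0.5, "location": 1.0,
     "salary": 0.8, "company": 0.6})
# skills contributes 0.36, location 0.20, ..., total 0.80
```

因子级贡献(contributions)即可直接作为“为什么匹配”的解释展示给招聘方。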
[AI-98] Visualizing Critic Match Loss Landscapes for Interpretation of Online Reinforcement Learning Control Algorithms
【速读】:该论文旨在解决在线强化学习(Online Reinforcement Learning)中算法性能受系统动态变化影响时缺乏系统性解释与分析手段的问题,尤其关注基于策略-评论家(Actor-Critic)结构的算法中评论家神经网络(Critic Neural Network)的学习行为机制。其解决方案的关键在于提出一种评论家匹配损失景观可视化方法(Critic Match Loss Landscape Visualization Method),通过将记录的评论家参数轨迹投影到低维线性子空间,并利用固定参考状态样本和时序差分目标(Temporal-Difference Targets)在投影参数网格上评估评论家匹配损失,从而构建三维损失曲面与二维优化路径,实现对评论家学习行为的定性和定量刻画。该方法进一步引入量化景观指标与归一化系统性能指数,支持跨训练结果的结构化比较,有效揭示了稳定收敛与不稳定学习对应的景观特征差异。
链接: https://arxiv.org/abs/2603.14535
作者: Jingyi Liu,Jian Guo,Eberhard Gill
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Revised manuscript, submitted to Acta Astronautica
Abstract:Reinforcement learning has proven its power on various occasions. However, its performance is not always guaranteed when system dynamics change. Instead, it largely relies on users’ empirical experience. For reinforcement learning algorithms with an actor-critic structure, the critic neural network reflects the approximation and optimization process in the RL algorithm. Analyzing the performance of the critic neural network helps to understand the mechanism of the algorithm. To support systematic interpretation of such algorithms in dynamic control problems, this work proposes a critic match loss landscape visualization method for online reinforcement learning. The method constructs a loss landscape by projecting recorded critic parameter trajectories onto a low-dimensional linear subspace. The critic match loss is evaluated over the projected parameter grid using fixed reference state samples and temporal-difference targets. This yields a three-dimensional loss surface together with a two-dimensional optimization path that characterizes critic learning behavior. To extend analysis beyond visual inspection, quantitative landscape indices and a normalized system performance index are introduced, enabling structured comparison across different training outcomes. The approach is demonstrated using the Action-Dependent Heuristic Dynamic Programming algorithm on cart-pole and spacecraft attitude control tasks. Comparative analyses across projection methods and training stages reveal distinct landscape characteristics associated with stable convergence and unstable learning. The proposed framework enables both qualitative and quantitative interpretation of critic optimization behavior in online reinforcement learning.
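下面用一个玩具二次损失示意该可视化方法的核心流程(用 numpy 的 SVD 做 PCA 投影;损失函数是替代品,并非真实的评论家匹配损失,“训练轨迹”也是人工构造的):

```python
import numpy as np

# Sketch of the visualization recipe (toy loss in place of the critic
# match loss): project a recorded parameter trajectory onto its top-2
# principal components and evaluate the loss over the spanned plane.

rng = np.random.default_rng(0)
theta_star = rng.normal(size=10)                     # toy optimum
# fake "training trajectory": iterates approaching theta_star
trajectory = np.array([theta_star + (0.9 ** k) * rng.normal(size=10)
                       for k in range(50)])

center = trajectory.mean(axis=0)
_, _, vt = np.linalg.svd(trajectory - center, full_matrices=False)
d1, d2 = vt[0], vt[1]                                # top-2 PCA directions

def loss(theta):                                     # stand-in for match loss
    return float(np.sum((theta - theta_star) ** 2))

# 3-D landscape: loss evaluated over a grid in the PCA plane
alphas = np.linspace(-2, 2, 25)
surface = np.array([[loss(center + a * d1 + b * d2) for b in alphas]
                    for a in alphas])
# 2-D optimization path: trajectory coordinates in the same plane
path = (trajectory - center) @ np.stack([d1, d2]).T
```

surface 对应论文中的三维损失曲面,path 对应叠加其上的二维优化路径;论文在此之上再计算 sharpness、basin area 等景观指标。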
[AI-99] Emotional Cost Functions for AI Safety: Teaching Agents to Feel the Weight of Irreversible Consequences
【速读】:该论文旨在解决当前人工智能(AI)安全方法无法模拟人类通过“质性痛苦”(qualitative suffering)进行学习的问题,即现有技术如奖励塑形(reward shaping)仅关注数值惩罚而忽略行为背后的深层意义,规则对齐(rule-based alignment)虽能约束行为却难以真正改变智能体的决策逻辑。其解决方案的核心是提出情感成本函数(Emotional Cost Functions)框架,该框架使智能体能够构建持久的“质性痛苦状态”(Qualitative Suffering States),这些状态是不可逆后果的丰富叙事表征,并持续影响其未来决策。该机制通过四组件架构——后果处理器(Consequence Processor)、人格状态(Character State)、前瞻扫描(Anticipatory Scan)和故事更新(Story Update)——确保行动不可撤销且必须承担后果,从而实现从经验中生成具体智慧而非普遍抑制。实验表明,该方法在金融交易、危机支持与内容审核等场景下显著优于传统数值基线,在保持高风险情境下适度参与能力的同时,有效避免过度规避行为。
链接: https://arxiv.org/abs/2603.14531
作者: Pandurang Mopgar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Humans learn from catastrophic mistakes not through numerical penalties, but through qualitative suffering that reshapes who they are. Current AI safety approaches replicate none of this. Reward shaping captures magnitude, not meaning. Rule-based alignment constrains behaviour, but does not change it. We propose Emotional Cost Functions, a framework in which agents develop Qualitative Suffering States, rich narrative representations of irreversible consequences that persist forward and actively reshape character. Unlike numerical penalties, qualitative suffering states capture the meaning of what was lost, the specific void it creates, and how it changes the agent’s relationship to similar future situations. Our four-component architecture - Consequence Processor, Character State, Anticipatory Scan, and Story Update - is grounded in one principle: actions cannot be undone, and agents must live with what they have caused. Anticipatory dread operates through two pathways. Experiential dread arises from the agent’s own lived consequences. Pre-experiential dread is acquired without direct experience, through training or inter-agent transmission. Together they mirror how human wisdom accumulates across experience and culture. Ten experiments across financial trading, crisis support, and content moderation show that qualitative suffering produces specific wisdom rather than generalised paralysis. Agents correctly engage with moderate opportunities at 90-100% while numerical baselines over-refuse at 90%. Architecture ablation confirms the mechanism is necessary. The full system generates ten personal grounding phrases per probe vs. zero for a vanilla LLM. Statistical validation (N=10) confirms reproducibility at 80-100% consistency.
[AI-100] Learning to Forget: Sleep-Inspired Memory Consolidation for Resolving Proactive Interference in Large Language Models
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)中存在的前向干扰(Proactive Interference, PI)问题,即旧信息在上下文窗口中积累并干扰当前信息的检索,导致检索准确率随时间呈对数线性下降。传统方法如滑动窗口或衰减机制无法有效缓解此问题。其解决方案的核心是提出 SleepGate 框架,该框架受生物大脑睡眠依赖的记忆巩固机制启发,通过在键值缓存(Key-Value Cache)上引入可学习的睡眠周期,在推理阶段周期性执行三种机制:(1) 基于冲突感知的时间标记器识别新旧条目替换关系;(2) 轻量级遗忘门训练以选择性地淘汰或压缩过时缓存条目;(3) 整合模块将留存条目合并为紧凑摘要。该框架采用双相训练目标,联合优化清醒期的语言建模与睡眠期的后整合检索性能,理论上将干扰范围从 O(n) 降低至 O(log n),实验证明其在小规模模型上显著优于现有基线方法。
链接: https://arxiv.org/abs/2603.14517
作者: Ying Xie
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) suffer from proactive interference (PI): outdated information in the context window disrupts retrieval of current values. This interference degrades retrieval accuracy log-linearly as stale associations accumulate, a bottleneck that persists regardless of context length and resists prompt-engineering mitigations. Biological brains resolve an analogous challenge through sleep-dependent memory consolidation: synaptic downscaling, selective replay, and targeted forgetting. We propose SleepGate, a biologically inspired framework that augments transformer-based LLMs with a learned sleep cycle over the key-value (KV) cache. SleepGate introduces three mechanisms: (1) a conflict-aware temporal tagger detecting when new entries supersede old ones; (2) a lightweight forgetting gate trained to selectively evict or compress stale cache entries; and (3) a consolidation module that merges surviving entries into compact summaries. These components activate periodically during inference in sleep micro-cycles, governed by an adaptive entropy-based trigger. We formalize a dual-phase training objective jointly optimizing language modeling during the wake phase and post-consolidation retrieval during the sleep phase. Theoretical analysis shows SleepGate reduces the interference horizon from O(n) to O(log n). In experiments with a small-scale transformer (4 layers, 793K parameters), SleepGate achieves 99.5% retrieval accuracy at PI depth 5 and 97.0% at depth 10, while all five baselines – full KV cache, sliding window, H2O, StreamingLLM, and decay-only ablation – remain below 18%. Our framework offers an architecture-level solution that prompt engineering cannot address.
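下面在一个玩具键值记忆上示意 SleepGate 的“冲突标记 + 遗忘门 + 整合”思路(真实系统作用于 Transformer 的 KV 缓存,此处的 Python 字典仅作类比,接口均为假设):

```python
# Hedged sketch of the SleepGate mechanisms on a toy key-value memory:
# conflict-aware tagging on write, stale-entry eviction during a
# "sleep cycle", and a crude consolidation summary of survivors.

memory = []   # entries: {"key", "value", "step", "stale"}

def write(key, value, step):
    for entry in memory:                  # conflict-aware tagging:
        if entry["key"] == key:           # new value supersedes old one
            entry["stale"] = True
    memory.append({"key": key, "value": value, "step": step, "stale": False})

def sleep_cycle():
    """Forgetting gate: evict entries superseded by newer writes."""
    global memory
    evicted = [e for e in memory if e["stale"]]
    memory = [e for e in memory if not e["stale"]]
    return {"evicted": len(evicted)}

write("meeting_room", "A101", step=1)
write("meeting_room", "B204", step=7)     # update creates a conflict
stats = sleep_cycle()
# crude stand-in for consolidation: merge survivors into a summary
summary = {e["key"]: e["value"] for e in memory}
```

睡眠周期后,旧值 A101 被淘汰,检索只会命中当前值 B204,这正是前向干扰(PI)想要消除的混淆来源。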
[AI-101] Bridging the Gap in the Responsible AI Divides
【速读】:该论文旨在解决人工智能安全(AI Safety, AIS)与人工智能伦理(AI Ethics, AIE)之间日益加剧的分歧问题,即所谓的“负责任人工智能分裂”(responsible AI divides),其核心挑战在于如何在治理和公共讨论中实现两者的协同而非对立。解决方案的关键在于提出并倡导“批判性连接”(critical bridging)这一模式,尤其聚焦于识别和解决两类研究领域共有的“桥接问题”(bridging problems),如透明性、可复现性和治理机制不足等。通过分析3,550篇文献的数据集,作者揭示了AIS与AIE在议题上的差异(如AIE关注不公与具体伤害,AIS侧重能力风险预判)以及显著重叠的核心关切,进而主张以桥接问题为切入点,推动跨领域的协作式负责任人工智能治理路径。
链接: https://arxiv.org/abs/2603.14495
作者: Bálint Gyevnár,Atoosa Kasirzadeh
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Tensions between AI Safety (AIS) and AI Ethics (AIE) have increasingly surfaced in AI governance and public debates about AI, leading to what we term the “responsible AI divides”. We introduce a model that categorizes four modes of engagement with the tensions: radical confrontation, disengagement, compartmentalized coexistence, and critical bridging. We then investigate how critical bridging, with a particular focus on bridging problems, offers one of the most viable constructive paths for advancing responsible AI. Using computational tools to analyze a curated dataset of 3,550 papers, we map the research landscapes of AIE and AIS to identify both distinct and overlapping problems. Our findings point to both thematic divides and overlaps. For example, we find that AIE has long grappled with overcoming injustice and tangible AI harms, whereas AIS has primarily embodied an anticipatory approach focused on the mitigation of risks from AI capabilities. At the same time, we find significant overlap in core research concerns across both AIE and AIS around transparency, reproducibility, and inadequate governance mechanisms. As AIE and AIS continue to evolve, we recommend focusing on bridging problems as a constructive path forward for enhancing collaborative AI governance. We offer a series of recommendations to integrate shared considerations into a collaborative approach to responsible AI. Alongside our proposal, we highlight its limitations and explore open problems for future research. All data including the fully annotated dataset of papers with code to reproduce our figures can be found at: this https URL.
[AI-102] Geometric and Topological Deep Learning for Predicting Thermo-mechanical Performance in Cold Spray Deposition Process Modeling
【速读】:该论文旨在解决冷喷射(Cold Spray)工艺中粒子撞击响应预测的难题,传统有限元仿真计算成本高且难以实时优化工艺参数。解决方案的关键在于构建一个基于几何深度学习(Geometric Deep Learning)的代理模型框架,将输入的工艺参数(如粒子速度、温度和摩擦系数)映射为多目标输出(如最大等效塑性应变、最大温度等),并通过构建k近邻特征空间图来捕捉不同工艺条件间的空间相似性,从而实现高效、高精度的预测。其中,GraphSAGE和几何注意力网络(GAT)表现出最优性能,R²值超过0.93,尤其GAT在最大塑性应变预测上达到0.97,验证了基于空间邻域聚合的图神经网络方法在冷喷工艺优化中的有效性与物理可解释性。
链接: https://arxiv.org/abs/2603.14478
作者: Akshansh Mishra
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: 27 pages, 19 figures, 6 tables
Abstract:This study presents a geometric deep learning framework for predicting cold spray particle impact responses using finite element simulation data. A parametric dataset was generated through automated Abaqus simulations spanning a systematic range of particle velocity, particle temperature, and friction coefficient, yielding five output targets including maximum equivalent plastic strain, average contact plastic strain, maximum temperature, maximum von Mises stress, and deformation ratio. Four novel algorithms, i.e., a GraphSAGE-style inductive graph neural network, a Chebyshev spectral graph convolution network, a topological data analysis augmented multilayer perceptron, and a geometric attention network, were implemented and evaluated. Each input sample was treated as a node in a k-nearest-neighbour feature-space graph, enabling the models to exploit spatial similarity between process conditions during training. Three-dimensional feature space visualisations and two-dimensional contour projections confirmed the highly non-linear and velocity-dominated nature of the input-output relationships. Quantitative evaluation demonstrated that GraphSAGE and GAT consistently achieved R^2 values exceeding 0.93 across most targets, with GAT attaining peak performance of R^2 = 0.97 for maximum plastic strain. ChebSpectral and TDA-MLP performed considerably worse, yielding negative R^2 values for several targets. These findings establish spatial graph-based neighbourhood aggregation as a robust and physically interpretable surrogate modelling strategy for cold spray process optimisation.
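下面示意论文中 k 近邻特征空间图的构建步骤(样本特征值为虚构;真实流程作用于 Abaqus 仿真的工艺参数):

```python
import math

# Sketch of the graph-construction step: each (velocity, temperature,
# friction) sample becomes a node, connected to its k nearest
# neighbours in feature space. Feature values below are made up.

samples = [(500, 300, 0.2), (520, 310, 0.2), (900, 500, 0.4),
           (880, 490, 0.4), (700, 400, 0.3)]
k = 2

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

knn_graph = {
    i: sorted(range(len(samples)),
              key=lambda j: dist(samples[i], samples[j]))[1:k + 1]
    for i in range(len(samples))
}
# node 0's neighbours are the similar low-velocity samples 1 and 4
```

图神经网络(GraphSAGE/GAT)随后即可在这张图上做邻域聚合;实践中还应先对不同量纲的特征做归一化,此处为保持简短省略。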
[AI-103] AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在长时程工具使用任务中缺乏准确的步骤级验证能力的问题。当前主流过程级评测基准多局限于封闭世界的数学推理领域,无法刻画真实场景下工具执行的动态性和开放性,导致模型在面对不可逆副作用时难以实现有效纠错。为此,作者提出了AgentProcessBench,这是首个专注于评估工具增强轨迹中步骤级有效性的基准,其关键创新在于构建了包含1,000条多样化轨迹和8,509条人工标注步骤的高质量数据集,并采用三元标签体系捕捉探索行为、引入错误传播规则降低标注歧义,从而实现对模型工具使用过程的精细化评估。实验表明,该基准揭示了现有模型在步骤正确性判断上的偏差与局限,并验证了过程信号相较于结果监督具有互补价值,可显著提升测试阶段的扩展性能。
链接: https://arxiv.org/abs/2603.14465
作者: Shengda Fan,Xuyan Ye,Yupeng Huo,Zhi-Yuan Chen,Yiju Guo,Shenzhi Yang,Wenkai Yang,Shuqi Ye,Jingwen Chen,Haotian Chen,Xin Cong,Yankai Lin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequently induce irreversible side effects, making accurate step-level verification critical. However, existing process-level benchmarks are predominantly confined to closed-world mathematical domains, failing to capture the dynamic and open-ended nature of tool execution. To bridge this gap, we introduce AgentProcessBench, the first benchmark dedicated to evaluating step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled step annotations with 89.1% inter-annotator agreement. It features a ternary labeling scheme to capture exploration and an error propagation rule to reduce labeling ambiguity. Extensive experiments reveal key insights: (1) weaker policy models exhibit inflated ratios of correct steps due to early termination; (2) distinguishing neutral and erroneous actions remains a significant challenge for current models; and (3) process-derived signals provide complementary value to outcome supervision, significantly enhancing test-time scaling. We hope AgentProcessBench can foster future research in reward models and pave the way toward general agents. The code and data are available at this https URL.
[AI-104] STAG-CN: Spatio-Temporal Apiary Graph Convolutional Network for Disease Onset Prediction in Beehive Sensor Networks
【速读】:该论文旨在解决蜂群疾病传播监测中忽视空间路径的问题,即现有系统将每个蜂巢视为孤立单元,未能捕捉病害在养蜂场间的扩散机制。其解决方案的关键在于提出一种时空蜂箱图卷积网络(Spatio-Temporal Apiary Graph Convolutional Network, STAG-CN),该模型通过构建融合物理邻近性和气候传感器相关性的双重邻接图,并采用时序-空间-时序的夹心架构(基于因果膨胀卷积与Chebyshev谱图卷积)处理多变量物联网传感器数据,从而实现对蜂群疾病爆发的精准预测。实证表明,气候邻接矩阵单独即可达到与完整模型相当的性能(F1=0.607),显著优于仅依赖物理邻接的模型(F1=0.274),揭示了共享环境响应模式比空间距离蕴含更强的疾病预测信号。
链接: https://arxiv.org/abs/2603.14462
作者: Sungwoo Kang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Honey bee colony losses threaten global pollination services, yet current monitoring systems treat each hive as an isolated unit, ignoring the spatial pathways through which diseases spread across apiaries. This paper introduces the Spatio-Temporal Apiary Graph Convolutional Network (STAG-CN), a graph neural network that models inter-hive relationships for disease onset prediction. STAG-CN operates on a dual adjacency graph combining physical co-location and climatic sensor correlation among hive sessions, and processes multivariate IoT sensor streams through a temporal–spatial–temporal sandwich architecture built on causal dilated convolutions and Chebyshev spectral graph convolutions. Evaluated on the Korean AI Hub apiculture dataset (dataset #71488) with expanding-window temporal cross-validation, STAG-CN achieves an F1 score of 0.607 at a three-day forecast horizon. An ablation study reveals that the climatic adjacency matrix alone matches full-model performance (F1 = 0.607), while the physical adjacency alone yields F1 = 0.274, indicating that shared environmental response patterns carry stronger predictive signal than spatial proximity for disease onset. These results establish a proof-of-concept for graph-based biosecurity monitoring in precision apiculture, demonstrating that inter-hive sensor correlations encode disease-relevant information invisible to single-hive approaches.
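下面示意 STAG-CN 双重邻接图的构造思路(蜂箱邻接关系与温度序列均为虚构,相关性阈值 0.9 也是假设值):

```python
import math

# Sketch of the dual adjacency idea: combine a physical co-location
# graph with a climatic graph built from thresholded Pearson
# correlations between per-hive sensor streams. All data is invented.

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

temps = [  # toy per-hive temperature streams
    [34.1, 34.5, 35.0, 34.8],   # hive 0
    [34.0, 34.6, 35.1, 34.7],   # hive 1, tracks hive 0 closely
    [30.0, 33.0, 29.5, 31.0],   # hive 2, different microclimate
]
physical = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # hives 0-1 and 1-2 adjacent

n = len(temps)
climatic = [[1 if i != j and pearson(temps[i], temps[j]) > 0.9 else 0
             for j in range(n)] for i in range(n)]
dual = [[max(physical[i][j], climatic[i][j]) for j in range(n)]
        for i in range(n)]
```

消融实验表明气候邻接(climatic)单独即可达到完整模型性能,与此处“环境响应相似性强于空间邻近”的直觉一致。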
机器学习
[LG-0] HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification
链接: https://arxiv.org/abs/2603.15617
作者: Erik Y. Wang,Sumeet Motwani,James V. Roggeveen,Eliot Hodges,Dulhan Jayalath,Charles London,Kalyan Ramakrishnan,Flaviu Cipcigan,Philip Torr,Alessandro Abate
类目: Machine Learning (cs.LG)
*备注:
Abstract:Can AI make progress on important, unsolved mathematical problems? Large language models are now capable of sophisticated mathematical and scientific reasoning, but whether they can perform novel research is still widely debated and underexplored. We introduce HorizonMath, a benchmark of over 100 predominantly unsolved problems spanning 8 domains in computational and applied mathematics, paired with an open-source evaluation framework for automated verification. Our benchmark targets a class of problems where discovery is hard, requiring meaningful mathematical insight, but verification is computationally efficient and simple. Because these solutions are unknown, HorizonMath is immune to data contamination, and most state-of-the-art models score near 0%. Existing research-level benchmarks instead rely on formal proof verification or manual review, both of which are expensive to scale. Using this platform, we find two problems for which GPT 5.4 Pro proposes solutions that improve on the best-known published results, representing potential novel contributions (pending expert review). We release HorizonMath as an open challenge and a growing community resource, where correct solutions to problems in the unsolved problem classes could constitute novel results in the mathematical literature.
[LG-1] SmartSearch: How Ranking Beats Structure for Conversational Memory Retrieval
链接: https://arxiv.org/abs/2603.15599
作者: Jesper Derehag,Carlos Calva,Timmy Ghiurau
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent conversational memory systems invest heavily in LLM-based structuring at ingestion time and learned retrieval policies at query time. We show that neither is necessary. SmartSearch retrieves from raw, unstructured conversation history using a fully deterministic pipeline: NER-weighted substring matching for recall, rule-based entity discovery for multi-hop expansion, and a CrossEncoder+ColBERT rank fusion stage – the only learned component – running on CPU in ~650ms. Oracle analysis on two benchmarks identifies a compilation bottleneck: retrieval recall reaches 98.6%, but without intelligent ranking only 22.5% of gold evidence survives truncation to the token budget. With score-adaptive truncation and no per-dataset tuning, SmartSearch achieves 93.5% on LoCoMo and 88.4% on LongMemEval-S, exceeding all known memory systems under the same evaluation protocol on both benchmarks while using 8.5x fewer tokens than full-context baselines.
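下面示意“分数自适应截断”的一种可能实现(截断规则是笔者的合理猜测,并非 SmartSearch 的确切策略;段落分数与 token 数均为虚构):

```python
# Hedged sketch of score-adaptive truncation: fill the token budget in
# reranker-score order, and stop early once scores fall far below the
# best-scoring passage, instead of padding the budget with weak hits.

def truncate(passages, budget, rel_cutoff=0.3):
    """passages: list of (score, n_tokens), sorted descending by score."""
    top = passages[0][0]
    kept, used = [], 0
    for score, n_tokens in passages:
        if score < rel_cutoff * top:      # adaptive part: score-based stop
            break
        if used + n_tokens > budget:      # hard part: token budget
            continue
        kept.append((score, n_tokens))
        used += n_tokens
    return kept, used

ranked = [(0.95, 120), (0.90, 200), (0.50, 150), (0.10, 80)]
kept, used = truncate(ranked, budget=400)
# the 0.10-scored passage is dropped even though it would fit
```

这对应摘要中的观察:召回本身不是瓶颈,关键是截断到 token 预算时让高分证据存活下来。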
[LG-2] Robust and Computationally Efficient Linear Contextual Bandits under Adversarial Corruption and Heavy-Tailed Noise
链接: https://arxiv.org/abs/2603.15596
作者: Naoto Tani,Futoshi Futami
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study linear contextual bandits under adversarial corruption and heavy-tailed noise with finite (1+\epsilon)-th moments for some \epsilon \in (0,1]. Existing work that addresses both adversarial corruption and heavy-tailed noise relies on a finite variance (i.e., finite second-moment) assumption and suffers from computational inefficiency. We propose a computationally efficient algorithm based on online mirror descent that achieves robustness to both adversarial corruption and heavy-tailed noise. While the existing algorithm incurs \mathcal{O}(t \log T) computational cost, our algorithm reduces this to \mathcal{O}(1) per round. We establish an additive regret bound consisting of a term depending on the (1+\epsilon)-th moment bound of the noise and a term depending on the total amount of corruption. In particular, when \epsilon = 1, our result recovers existing guarantees under finite-variance assumptions. When no corruption is present, it matches the best-known rates for linear contextual bandits with heavy-tailed noise. Moreover, the algorithm requires no prior knowledge of the noise moment bound or the total amount of corruption and still guarantees sublinear regret.
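作为背景补充,下面用截断均值演示重尾奖励为何需要稳健处理(这只是通用示例,并非本文的镜像下降算法):

```python
# Generic illustration (not this paper's estimator) of why heavy-tailed
# rewards need robust handling: a truncated mean shrugs off a single
# extreme observation that wrecks the plain sample mean.

def truncated_mean(xs, threshold):
    clipped = [max(-threshold, min(threshold, x)) for x in xs]
    return sum(clipped) / len(clipped)

rewards = [1.0, 1.1, 0.9, 1.0, 1.0, 200.0]   # heavy-tailed sample, true mean ~1
plain = sum(rewards) / len(rewards)           # ~34.2, dominated by the outlier
robust = truncated_mean(rewards, threshold=2.0)
```

在仅有有限 (1+\epsilon) 阶矩的噪声下,这类截断/稳健化思想是常见出发点;论文的贡献在于把稳健性与对抗污染、O(1) 单轮计算同时做到。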
[LG-3] Effective Distillation to Hybrid xLSTM Architectures
链接: https://arxiv.org/abs/2603.15590
作者: Lukas Hauzenberger,Niklas Schmidinger,Thomas Schmied,Anamaria-Roberta Hartl,David Stap,Pieter-Jan Hoedt,Maximilian Beck,Sebastian Böck,Günter Klambauer,Sepp Hochreiter
类目: Machine Learning (cs.LG)
*备注:
Abstract:There have been numerous attempts to distill quadratic attention-based large language models (LLMs) into sub-quadratic linearized architectures. However, despite extensive research, such distilled models often fail to match the performance of their teacher LLMs on various downstream tasks. We set out the goal of lossless distillation, which we define in terms of tolerance-corrected Win-and-Tie rates between student and teacher on sets of tasks. To this end, we introduce an effective distillation pipeline for xLSTM-based students. We propose an additional merging stage, where individually linearized experts are combined into a single model. We show the effectiveness of this pipeline by distilling base and instruction-tuned models from the Llama, Qwen, and Olmo families. In many settings, our xLSTM-based students recover most of the teacher’s performance, and even exceed it on some downstream tasks. Our contributions are an important step towards more energy-efficient and cost-effective replacements for transformer-based LLMs.
[LG-4] Unbiased and Biased Variance-Reduced Forward-Reflected-Backward Splitting Methods for Stochastic Composite Inclusions
链接: https://arxiv.org/abs/2603.15576
作者: Quoc Tran-Dinh,Nghia Nguyen-Trung
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 34 pages and 2 figures
Abstract:This paper develops new variance-reduction techniques for the forward-reflected-backward splitting (FRBS) method to solve a class of possibly nonmonotone stochastic composite inclusions. Unlike unbiased estimators such as mini-batching, developing stochastic biased variants faces a fundamental technical challenge and has not been utilized before for inclusions and fixed-point problems. We fill this gap by designing a new framework that can handle both unbiased and biased estimators. Our main idea is to construct stochastic variance-reduced estimators for the forward-reflected direction and use them to perform iterate updates. First, we propose a class of unbiased variance-reduced estimators and show that increasing mini-batch SGD, loopless-SVRG, and SAGA estimators fall within this class. For these unbiased estimators, we establish a \mathcal{O}(1/k) best-iterate convergence rate for the expected squared residual norm, together with almost-sure convergence of the iterate sequence to a solution. Consequently, we prove that the best oracle complexities for the n-finite-sum and expectation settings are \mathcal{O}(n^{2/3}\epsilon^{-2}) and \mathcal{O}(\epsilon^{-10/3}), respectively, when employing loopless-SVRG or SAGA, where \epsilon is a desired accuracy. Second, we introduce a new class of biased variance-reduced estimators for the forward-reflected direction, which includes SARAH, Hybrid SGD, and Hybrid SVRG as special instances. While the convergence rates remain valid for these biased estimators, the resulting oracle complexities are \mathcal{O}(n^{3/4}\epsilon^{-2}) and \mathcal{O}(\epsilon^{-5}) for the n-finite-sum and expectation settings, respectively. Finally, we conduct two numerical experiments on AUC optimization for imbalanced classification and policy evaluation in reinforcement learning.
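作为参考,确定性 forward-reflected-backward(Malitsky–Tam 型)迭代的一般形状可以写作如下(属笔者的示意性整理,非论文原文;论文的做法是把前向反射方向替换成方差缩减的随机估计量):

```latex
% Deterministic forward-reflected-backward template for the inclusion
% 0 \in A x + B x with single-valued B:
x_{k+1} = J_{\gamma A}\bigl(x_k - \gamma\,(2\,B x_k - B x_{k-1})\bigr),
\qquad J_{\gamma A} = (I + \gamma A)^{-1}.
% Stochastic variants replace the forward-reflected direction
% 2\,B x_k - B x_{k-1} by an estimator v_k that is either unbiased
% (mini-batch, loopless-SVRG, SAGA) or controlled-bias (SARAH-type).
```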
[LG-5] Co-Design of Memory-Storage Systems for Workload Awareness with Interpretable Models
链接: https://arxiv.org/abs/2603.15571
作者: Jay Sarkar,Vamsi Pavan Rayaprolu,Abhijeet Bhalerao
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 9 pages, 10 figures
Abstract:Solid-state storage architectures based on NAND or emerging memory devices (SSD) are fundamentally architected and optimized for both reliability and performance. Achieving these simultaneous goals requires co-design of memory components with firmware-architected Error Management (EM) algorithms for density- and performance-scaled memory technologies. We describe a Machine Learning (ML) for systems methodology and modeling for co-designing the EM subsystem together with the natural variance inherent to the scaled silicon process of memory components underlying SSD technology. The modeling analyzes NAND memory components and EM algorithms interacting with a comprehensive suite of synthetic (stress-focused and JEDEC) and emulation (YCSB and similar) workloads across Flash Translation abstraction layers, by leveraging a statistically interpretable and intuitively explainable ML algorithm. The generalizable co-design framework evaluates several thousand datacenter SSDs spanning multiple generations of memory and storage technology. Consequently, the modeling framework enables continuous, holistic, data-driven design towards generational architectural advancements. We additionally demonstrate that the framework enables Representation Learning of the EM-workload domain for enhancement of the architectural design-space across a broad spectrum of workloads.
[LG-6] Mamba-3: Improved Sequence Modeling using State Space Principles ICLR2026
链接: https://arxiv.org/abs/2603.15569
作者: Aakash Lahoti,Kevin Y. Li,Berlin Chen,Caitlin Wang,Aviv Bick,J. Zico Kolter,Tri Dao,Albert Gu
类目: Machine Learning (cs.LG)
*备注: ICLR 2026
Abstract:Scaling inference-time compute has emerged as an important driver of LLM performance, making inference efficiency a central focus of model design alongside model quality. While the current Transformer-based models deliver strong model quality, their quadratic compute and linear memory make inference expensive. This has spurred the development of sub-quadratic models with reduced linear compute and constant memory requirements. However, many recent linear models trade off model quality and capability for algorithmic efficiency, failing on tasks such as state tracking. Moreover, their theoretically linear inference remains hardware-inefficient in practice. Guided by an inference-first perspective, we introduce three core methodological improvements inspired by the state space model (SSM) viewpoint of linear models. We combine: (1) a more expressive recurrence derived from SSM discretization, (2) a complex-valued state update rule that enables richer state tracking, and (3) a multi-input, multi-output (MIMO) formulation for better model performance without increasing decode latency. Together with architectural refinements, our Mamba-3 model achieves significant gains across retrieval, state-tracking, and downstream language modeling tasks. At the 1.5B scale, Mamba-3 improves average downstream accuracy by 0.6 percentage points compared to the next best model (Gated DeltaNet), with Mamba-3’s MIMO variant further improving accuracy by another 1.2 points for a total 1.8 point gain. Across state-size experiments, Mamba-3 achieves comparable perplexity to Mamba-2 despite using half of its predecessor’s state size. Our evaluations demonstrate Mamba-3’s ability to advance the performance-efficiency Pareto frontier.
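下面用一维复数线性递推示意“复值状态更新有助于状态跟踪”的直觉(纯玩具示例,并非 Mamba-3 的实际递推):

```python
import cmath
import math

# Toy illustration of why complex-valued state updates help state
# tracking: with eigenvalue lambda = e^{i*pi} = -1, the recurrence
# h_t = lambda * h_{t-1} + x_t tracks input parity, which real
# eigenvalues restricted to (0, 1) cannot represent.

lam = cmath.exp(1j * math.pi)        # = -1 (up to float rounding)

def parity_state(xs):
    h = 0 + 0j
    for x in xs:
        h = lam * h + x
    return h

h_odd = parity_state([1, 1, 1])      # 3 ones -> h close to 1
h_even = parity_state([1, 1, 1, 1])  # 4 ones -> h close to 0
```

奇偶计数(parity)是经典的状态跟踪任务;许多实特征值受限于 (0,1) 的线性模型无法表达它,而允许复数(或负)特征值的递推可以,这正是摘要中“richer state tracking”的含义。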
[LG-7] Predictive Uncertainty in Short-Term PV Forecasting under Missing Data: A Multiple Imputation Approach
链接: https://arxiv.org/abs/2603.15564
作者: Parastoo Pashmchi,Jérôme Benoit,Motonobu Kanagawa
类目: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注: 10 pages
Abstract:Missing values are common in photovoltaic (PV) power data, yet the uncertainty they induce is not propagated into predictive distributions. We develop a framework that incorporates missing-data uncertainty into short-term PV forecasting by combining stochastic multiple imputation with Rubin’s rule. The approach is model-agnostic and can be integrated with standard machine-learning predictors. Empirical results show that ignoring missing-data uncertainty leads to overly narrow prediction intervals. Accounting for this uncertainty improves interval calibration while maintaining comparable point prediction accuracy. These results demonstrate the importance of propagating imputation uncertainty in data-driven PV forecasting.
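The pooling step the abstract refers to can be sketched in a few lines: fit one predictor per stochastically imputed dataset, then combine point predictions and variances with Rubin's rule. The synthetic data, the empirical-draw imputer, and the OLS stand-in predictor below are illustrative assumptions, not the paper's pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy PV-like setup: one covariate with ~20% missing values, continuous target.
n = 200
X = rng.normal(size=n)
y = 2.0 * X + rng.normal(scale=0.5, size=n)
X_obs = X.copy()
X_obs[rng.random(n) < 0.2] = np.nan

def stochastic_impute(x, rng):
    """Fill NaNs with random draws from the observed values."""
    out = x.copy()
    nan_idx = np.isnan(out)
    out[nan_idx] = rng.choice(out[~nan_idx], size=nan_idx.sum())
    return out

M = 10                      # number of stochastic imputations
x_q = 1.0                   # query point for the predictive distribution
point_preds, within_vars = [], []
for _ in range(M):
    xi = stochastic_impute(X_obs, rng)
    A = np.column_stack([xi, np.ones(n)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    point_preds.append(coef[0] * x_q + coef[1])
    within_vars.append(np.sum((A @ coef - y) ** 2) / (n - 2))

# Rubin's rule: total variance = mean within-imputation variance
# plus (1 + 1/M) times the between-imputation variance.
pooled_mean = float(np.mean(point_preds))
W = float(np.mean(within_vars))
B = float(np.var(point_preds, ddof=1))
total_var = W + (1 + 1 / M) * B
```

The between-imputation term B is exactly the missing-data uncertainty that the abstract says is lost when intervals are built from a single imputation.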
[LG-8] Bridging Local and Global Knowledge: Cascaded Mixture-of-Experts Learning for Near-Shortest Path Routing
链接: https://arxiv.org/abs/2603.15541
作者: Yung-Fu Chen,Anish Arora
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:
Abstract:While deep learning models that leverage local features have demonstrated significant potential for near-optimal routing in dense Euclidean graphs, they struggle to generalize well in sparse networks where topological irregularities require broader structural awareness. To address this limitation, we train a Cascaded Mixture of Experts (Ca-MoE) to solve the all-pairs near-shortest path (APNSP) routing problem. Our Ca-MoE is a modular two-tier architecture that supports the decision-making for forwarder selection with lower-tier experts relying on local features and upper-tier experts relying on global features. It performs adaptive inference wherein the upper-tier experts are triggered only when the lower-tier ones do not suffice to achieve adequate decision quality. Computational efficiency is thus achieved by escalating model capacity only when necessitated by topological complexity, and parameter redundancy is avoided. Furthermore, we incorporate an online meta-learning strategy that facilitates independent expert fine-tuning and utilizes a stability-focused update mechanism to prevent catastrophic forgetting as new graph environments are encountered. Experimental evaluations demonstrate that Ca-MoE routing improves accuracy by up to 29.1% in sparse networks compared to single-expert baselines and maintains performance within 1%-6% of the theoretical upper bound across diverse graph densities.
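The adaptive escalation rule described above can be sketched as follows; the feature shapes, the two expert scorers, and the confidence threshold `tau` are all hypothetical simplifications of Ca-MoE's gating, not its actual architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

def lower_expert(local_feats):
    """Cheap scorer over candidate next hops using local features only."""
    logits = local_feats @ np.array([1.5, -0.5])
    p = np.exp(logits - logits.max())
    return p / p.sum()

def upper_expert(global_feats):
    """Expensive scorer using global structural features."""
    logits = global_feats @ np.array([0.3, 0.9, -0.2])
    p = np.exp(logits - logits.max())
    return p / p.sum()

def cascaded_select(local_feats, global_feats, tau=0.6):
    """Escalate to the upper tier only when the lower tier's
    top probability falls below the confidence threshold tau."""
    p_low = lower_expert(local_feats)
    if p_low.max() >= tau:
        return int(p_low.argmax()), "lower"
    p_up = upper_expert(global_feats)
    return int(p_up.argmax()), "upper"

# Three candidate forwarders, each with 2 local and 3 global features.
local = rng.normal(size=(3, 2))
global_ = rng.normal(size=(3, 3))
choice, tier = cascaded_select(local, global_)
```

The compute saving comes from the conditional branch: in dense graphs the lower tier usually clears `tau` and the global-feature expert is never evaluated.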
[LG-9] Vib2ECG: A Paired Chest-Lead SCG-ECG Dataset and Benchmark for ECG Reconstruction
链接: https://arxiv.org/abs/2603.15539
作者: Guorui Lu,Xiaohui Cai,Todor Stefanov,Qinyu Chen
类目: Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication
Abstract:Twelve-lead electrocardiography (ECG) is essential for cardiovascular diagnosis, but its long-term acquisition in daily life is constrained by complex and costly hardware. Recent efforts have explored reconstructing ECG from low-cost cardiac vibrational signals such as seismocardiography (SCG); however, due to the lack of a dataset, current methods are limited to limb leads, while clinical diagnosis requires multi-lead ECG, including chest leads. In this work, we propose Vib2ECG, the first paired, multi-channel electro-mechanical cardiac signal dataset, which includes complete twelve-lead ECGs and vibrational signals acquired by inertial measurement units (IMUs) at six chest-lead positions from 17 subjects. Based on this dataset, we also provide a benchmark. Experimental results demonstrate the feasibility of reconstructing electrical cardiac signals at variable locations from vibrational signals using a lightweight 364 K-parameter U-Net. Furthermore, we observe a hallucination phenomenon in the model, where ECG waveforms are generated in regions where no corresponding electrical activity is present. We analyze the causes of this phenomenon and propose potential directions for mitigation. This study demonstrates the feasibility of mobile-device-friendly ECG monitoring through chest-lead ECG prediction from low-cost vibrational signals acquired using IMU sensors. It expands the application of cardiac vibrational signals and provides new insights into the spatial relationship between cardiac electrical and mechanical activities with spatial location variation.
[LG-10] Not All Invariants Are Equal: Curating Training Data to Accelerate Program Verification with SLMs
链接: https://arxiv.org/abs/2603.15510
作者: Ido Pinto,Yizhak Yisrael Elboher,Haoze Wu,Nina Narodytska,Guy Katz
类目: Machine Learning (cs.LG)
*备注:
Abstract:The synthesis of inductive loop invariants is a critical bottleneck in automated program verification. While Large Language Models (LLMs) show promise in mitigating this issue, they often fail on hard instances, generating invariants that are invalid or computationally ineffective. While fine-tuning is a natural route to mitigate this limitation, obtaining high-quality training data for invariant generation remains an open challenge. We present a rigorous data curation pipeline designed to extract high-quality training signals from raw verifier-generated invariants. First, we formalize the properties required for a high-quality training invariant. Second, we propose Wonda, a pipeline that refines noisy data via AST-based normalization, followed by LLM-driven semantic rewriting and augmentation with provable quality guarantees. We demonstrate that fine-tuning Small Language Models (SLMs) on this curated dataset results in consistent and significant performance gains. In particular, a fine-tuned 4B parameter model matches the utility of a GPT-OSS-120B baseline and approaches the state-of-the-art GPT-5.2, without incurring reasoning-time overhead. On challenging instances from the recent InvBench evaluation suite, our approach doubles the invariant correctness and speedup rates of base models, and improves their Virtual Best Performance (VBP) rates on the verification task by up to 14.2%.
[LG-11] Local Urysohn Width: A Topological Complexity Measure for Classification
链接: https://arxiv.org/abs/2603.15412
作者: Xin Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce \emph{local Urysohn width}, a complexity measure for classification problems on metric spaces. Unlike VC dimension, fat-shattering dimension, and Rademacher complexity, which characterize the richness of hypothesis \emph{classes}, Urysohn width characterizes the topological-geometric complexity of the classification \emph{problem} itself: the minimum number of connected, diameter-bounded local experts needed to correctly classify all points within a margin-safe region. We prove four main results. First, a \textbf{strict hierarchy theorem}: for every integer w \geq 1 , there exists a classification problem on a \emph{connected} compact metric space (a bouquet of circles with first Betti number \beta_1 = w ) whose Urysohn width is exactly w , establishing that topological complexity of the input space forces classifier complexity. Second, a \textbf{topology \times geometry scaling law}: width scales as \Omega(w \cdot L/D_0) , where w counts independent loops and L/D_0 is the ratio of loop circumference to locality scale. Third, a \textbf{two-way separation from VC dimension}: there exist problem families where width grows unboundedly while VC dimension is bounded by a constant, and conversely, families where VC dimension grows unboundedly while width remains 1. Fourth, a \textbf{sample complexity lower bound}: any learner that must correctly classify all points in the safe region of a width-w problem needs \Omega(w \log w) samples, independent of VC dimension.
[LG-12] Deep learning and the rate of approximation by flows
链接: https://arxiv.org/abs/2603.15363
作者: Jingpu Cheng,Qianxiao Li,Ting Lin,Zuowei Shen
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:
Abstract:We investigate the dependence of the approximation capacity of deep residual networks on their depth in a continuous dynamical systems setting. This can be formulated as the general problem of quantifying the minimal time-horizon required to approximate a diffeomorphism by flows driven by a given family \mathcal F of vector fields. We show that this minimal time can be identified as a geodesic distance on a sub-Finsler manifold of diffeomorphisms, where the local geometry is characterised by a variational principle involving \mathcal F . This connects the learning efficiency of target relationships to their compatibility with the learning architectural choice. Further, the results suggest that the key approximation mechanism in deep learning, namely the approximation of functions by composition or dynamics, differs in a fundamental way from linear approximation theory, where linear spaces and norm-based rate estimates are replaced by manifolds and geodesic distances.
[LG-13] Data Augmentation via Causal-Residual Bootstrapping
链接: https://arxiv.org/abs/2603.15335
作者: Mateusz Gajewski,Sophia Xiao,Bijan Mazaheri
类目: Machine Learning (cs.LG)
*备注:
Abstract:Data augmentation integrates domain knowledge into a dataset by making domain-informed modifications to existing data points. For example, image data can be augmented by duplicating images in different tints or orientations, thereby incorporating the knowledge that images may vary in these dimensions. Recent work by Teshima and Sugiyama has explored the integration of causal knowledge (e.g., A causes B causes C) up to conditional independence equivalence. We suggest a related approach for settings with additive noise that can incorporate information beyond a Markov equivalence class. The approach, built on the principle of independent mechanisms, permutes the residuals of models built on marginal probability distributions. Predictive models built on our augmented data demonstrate improved accuracy, for which we provide theoretical backing in linear Gaussian settings.
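The residual-permutation idea under additive noise can be sketched on a toy causal pair A → B: because the noise is independent of the cause (independent mechanisms), residuals of a fitted mechanism can be shuffled across samples and reattached to produce new plausible pairs. The sinusoidal mechanism and polynomial regression below are illustrative stand-ins, not the paper's models:

```python
import numpy as np

rng = np.random.default_rng(2)

# Additive-noise causal pair A -> B: B = f(A) + eps, with eps independent of A.
n = 300
A = rng.uniform(-2, 2, size=n)
B = np.sin(A) + rng.normal(scale=0.3, size=n)

# Fit the mechanism f with a simple polynomial regression (stand-in model).
coefs = np.polyfit(A, B, deg=5)
residuals = B - np.polyval(coefs, A)

# Causal-residual bootstrapping: permute residuals across samples and
# reattach them to the fitted values, yielding augmented (A, B) pairs
# that respect the additive-noise structure.
perm = rng.permutation(n)
B_aug = np.polyval(coefs, A) + residuals[perm]
A_full = np.concatenate([A, A])
B_full = np.concatenate([B, B_aug])
```

A downstream predictor would then be trained on the doubled `(A_full, B_full)` sample instead of the original data.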
[LG-14] A scaled TW-PINN: A physics-informed neural network for traveling wave solutions of reaction-diffusion equations with general coefficients
链接: https://arxiv.org/abs/2603.15331
作者: Seungwan Han,Kwanghyuk Park,Jiaxi Gu,Jae-Hun Jung
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
Abstract:We propose an efficient and generalizable physics-informed neural network (PINN) framework for computing traveling wave solutions of n-dimensional reaction-diffusion equations with various reaction and diffusion coefficients. By applying a scaling transformation with the traveling wave form, the original problem is reduced to a one-dimensional scaled reaction-diffusion equation with unit reaction and diffusion coefficients. This reduction leads to the proposed framework, termed scaled TW-PINN, in which a single PINN solver trained on the scaled equation is reused for different coefficient choices and spatial dimensions. We also prove a universal approximation property of the proposed PINN solver for traveling wave solutions. Numerical experiments in one and two dimensions, together with a comparison to the existing wave-PINN method, demonstrate the accuracy, flexibility, and superior performance of scaled TW-PINN. Finally, we explore an extension of the framework to Fisher's equation with general initial conditions.
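One plausible form of the scaling reduction the abstract describes (the paper's exact transformation may differ): for a reaction-diffusion equation with diffusion coefficient D and reaction rate r, a planar traveling wave ansatz plus a space-time rescaling removes both coefficients:

```latex
\[
u_t = D\,\Delta u + r\,f(u), \qquad
u(x,t) = U(k\cdot x - ct), \quad \|k\| = 1 .
\]
Rescaling space and time by
\[
\tilde{x} = \sqrt{r/D}\,(k\cdot x), \qquad \tilde{t} = r\,t,
\qquad v(\tilde{x},\tilde{t}) := u(x,t),
\]
gives $u_t = r\,v_{\tilde{t}}$ and $D\,\Delta u = D\,(r/D)\,v_{\tilde{x}\tilde{x}} = r\,v_{\tilde{x}\tilde{x}}$,
so dividing by $r$ yields the one-dimensional scaled equation
\[
v_{\tilde{t}} = v_{\tilde{x}\tilde{x}} + f(v),
\]
with unit reaction and diffusion coefficients, independent of $D$, $r$, and the spatial dimension $n$.
```

This is why a single solver trained on the scaled equation can be reused across coefficient choices and dimensions: they only enter through the rescaling, not the equation being solved.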
[LG-15] CASHomon Sets: Efficient Rashomon Sets Across Multiple Model Classes and their Hyperparameters
链接: https://arxiv.org/abs/2603.15321
作者: Fiona Katharina Ewald,Martin Binder,Matthias Feurer,Bernd Bischl,Giuseppe Casalicchio
类目: Machine Learning (cs.LG)
*备注: Equal contributions by Fiona Katharina Ewald and Martin Binder
Abstract:Rashomon sets are model sets within one model class that perform nearly as well as a reference model from the same model class. They reveal the existence of alternative well-performing models, which may support different interpretations. This enables selecting models that match domain knowledge, hidden constraints, or user preferences. However, efficient construction methods currently exist for only a few model classes. Applied machine learning usually searches many model classes, and the best class is unknown beforehand. We therefore study Rashomon sets in the combined algorithm selection and hyperparameter optimization (CASH) setting and call them CASHomon sets. We propose TruVaRImp, a model-based active learning algorithm for level set estimation with an implicit threshold, and provide convergence guarantees. On synthetic and real-world datasets, TruVaRImp reliably identifies CASHomon set members and matches or outperforms naive sampling, Bayesian optimization, classical and implicit level set estimation methods, and other baselines. Our analyses of predictive multiplicity and feature-importance variability across model classes question the common practice of interpreting data through a single model class.
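The defining membership rule of such a set can be written in a few lines; the model classes, hyperparameters, losses, and additive tolerance `eps` below are invented for illustration (the paper's actual threshold definition and active-learning search are more involved):

```python
# Minimal sketch (assumed definition): a CASHomon set collects configurations,
# possibly from different model classes, whose validation loss is within a
# tolerance eps of the best observed loss across all classes.
evaluated = [
    ("tree",   {"max_depth": 3}, 0.212),
    ("tree",   {"max_depth": 8}, 0.198),
    ("linear", {"l2": 0.1},      0.205),
    ("linear", {"l2": 1.0},      0.251),
    ("knn",    {"k": 15},        0.199),
]

def cashomon_set(results, eps=0.01):
    """Return all (class, hyperparams) whose loss <= best loss + eps."""
    best = min(loss for _, _, loss in results)
    return [(cls, hp) for cls, hp, loss in results if loss <= best + eps]

members = cashomon_set(evaluated, eps=0.01)
classes = {cls for cls, _ in members}
```

Even in this toy example the near-optimal set spans three model classes, which is exactly the multiplicity phenomenon the abstract uses to question single-model interpretation.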
[LG-16] A Kolmogorov-Arnold Surrogate Model for Chemical Equilibria: Application to Solid Solutions
链接: https://arxiv.org/abs/2603.15307
作者: Leonardo Boledi,Dirk Bosbach,Jenna Poonoosamy
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:
Abstract:The computational cost of geochemical solvers is a challenging matter. For reactive transport simulations, where chemical calculations are performed up to billions of times, it is crucial to reduce the total computational time. Existing publications have explored various machine-learning approaches to determine the most effective data-driven surrogate model. In particular, multilayer perceptrons are widely employed due to their ability to recognize nonlinear relationships. In this work, we focus on the recent Kolmogorov-Arnold networks, where learnable spline-based functions replace classical fixed activation functions. This architecture has achieved higher accuracy with fewer trainable parameters and has become increasingly popular for solving partial differential equations. First, we train a surrogate model based on an existing cement system benchmark. Then, we move to an application case for the geological disposal of nuclear waste, i.e., the determination of radionuclide-bearing solids solubilities. To the best of our knowledge, this work is the first to investigate co-precipitation with radionuclide incorporation using data-driven surrogate models, considering increasing levels of thermodynamic complexity from simple mechanical mixtures to non-ideal solid solutions of binary (Ba,Ra)SO_4 and ternary (Sr,Ba,Ra)SO_4 systems. On the cement benchmark, we demonstrate that the Kolmogorov-Arnold architecture outperforms multilayer perceptrons in both absolute and relative error metrics, reducing them by 62% and 59%, respectively. On the binary and ternary radium solid solution models, Kolmogorov-Arnold networks maintain median prediction errors near 1\times10^{-3} . This is the first step toward employing surrogate models to speed up reactive transport simulations and optimize the safety assessment of deep geological waste repositories.
[LG-17] xplainfi: Feature Importance and Statistical Inference for Machine Learning in R
链接: https://arxiv.org/abs/2603.15306
作者: Lukas Burk,Fiona Katharina Ewald,Giuseppe Casalicchio,Marvin N. Wright,Bernd Bischl
类目: Machine Learning (cs.LG)
*备注: 25 pages, 5 figures
Abstract:We introduce xplainfi, an R package built on top of the mlr3 ecosystem for global, loss-based feature importance methods for machine learning models. Various feature importance methods exist in R, but significant gaps remain, particularly regarding conditional importance methods and associated statistical inference procedures. The package implements permutation feature importance, conditional feature importance, relative feature importance, leave-one-covariate-out, and generalizations thereof, and both marginal and conditional Shapley additive global importance methods. It provides a modular conditional sampling architecture based on Gaussian distributions, adversarial random forests, conditional inference trees, and knockoff-based samplers, which enable conditional importance analysis for continuous and mixed data. Statistical inference is available through multiple approaches, including variance-corrected confidence intervals and the conditional predictive impact framework. We demonstrate that xplainfi produces importance scores consistent with existing implementations across multiple simulation settings and learner types, while offering competitive runtime performance. The package is available on CRAN and provides researchers and practitioners with a comprehensive toolkit for feature importance analysis and model interpretation in R.
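xplainfi itself is an R/mlr3 package; as a language-neutral illustration of the simplest method it implements, here is permutation feature importance with a closed-form OLS model in Python. The data, model, and loss are illustrative assumptions, not the package's API:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy regression: feature 0 is strong, feature 1 is weak, feature 2 is noise.
n = 500
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

# "Model": closed-form OLS fit with intercept.
A = np.hstack([X, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
predict = lambda M: np.hstack([M, np.ones((len(M), 1))]) @ coef
mse = lambda yhat: float(np.mean((y - yhat) ** 2))

# Permutation feature importance: loss increase after breaking the
# feature-target association by shuffling one column at a time.
baseline = mse(predict(X))
importance = []
for j in range(3):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    importance.append(mse(predict(Xp)) - baseline)
```

Conditional variants (the package's focus) differ in that the permuted column is resampled from its conditional distribution given the other features rather than marginally, which avoids evaluating the model on unrealistic inputs.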
[LG-18] Enhancing classification accuracy through chaos
链接: https://arxiv.org/abs/2603.15299
作者: Panos Stinis
类目: Machine Learning (cs.LG)
*备注: 23 pages, 8 figures
Abstract:We propose a novel approach which exploits chaos to enhance classification accuracy. Specifically, the available data that need to be classified are treated as vectors that are first lifted into a higher-dimensional space and then used as initial conditions for the evolution of a chaotic dynamical system for a prescribed temporal interval. The evolved state of the dynamical system is then fed to a trainable softmax classifier which outputs the probabilities of the various classes. As proof-of-concept, we use samples of randomly perturbed orthogonal vectors of moderate dimension (2 to 20), with a corresponding number of classes equal to the vector dimension, and show how our approach can both significantly accelerate the training process and improve the classification accuracy compared to a standard softmax classifier which operates on the original vectors, as well as a softmax classifier which only lifts the vectors to a higher-dimensional space without evolving them. We also provide an explanation for the improved performance of the chaos-enhanced classifier.
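The lift-evolve-classify pipeline can be sketched as below. The logistic map stands in for the paper's unspecified chaotic system, and the random sigmoid lift and untrained softmax weights are illustrative assumptions; only the pipeline shape follows the abstract:

```python
import numpy as np

rng = np.random.default_rng(4)

def lift(x, W):
    """Random linear lift into a higher-dimensional space, squashed to (0, 1)
    so the result is a valid initial condition for the logistic map."""
    return 1.0 / (1.0 + np.exp(-x @ W))

def evolve(z, steps=20, r=3.9):
    """Iterate the chaotic logistic map coordinate-wise for `steps` steps."""
    for _ in range(steps):
        z = r * z * (1.0 - z)
    return z

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

d, D, n_classes = 4, 32, 4
W_lift = rng.normal(size=(d, D))

# Toy data matching the abstract: perturbed orthogonal (here, standard basis)
# vectors, with class label = basis index.
X = np.eye(d)[np.repeat(np.arange(d), 25)] + 0.05 * rng.normal(size=(100, d))
y = np.repeat(np.arange(d), 25)

Z = evolve(lift(X, W_lift))            # evolved chaotic state, shape (100, D)
W_clf = rng.normal(size=(D, n_classes)) * 0.01
probs = softmax(Z @ W_clf)             # probabilities a trainable head would fit
```

In the paper's setup only the softmax head is trained; the lift and the chaotic evolution act as a fixed, expressive feature map.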
[LG-19] Evaluating the Robustness of Reinforcement Learning based Adaptive Traffic Signal Control
链接: https://arxiv.org/abs/2603.15283
作者: Dickens Kwesiga,Angshuman Guin,Khaled Abdelghany,Michael Hunter
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning (RL) has attracted increasing interest for adaptive traffic signal control due to its model-free ability to learn control policies directly from interaction with the traffic environment. However, several challenges remain before RL-based signal control can be considered ready for field deployment. Many existing studies rely on simplified signal timing structures, robustness of trained models under varying traffic demand conditions remains insufficiently evaluated, and runtime efficiency continues to pose challenges when training RL algorithms in traffic microscopic simulation environments. This study formulates an RL-based signal control algorithm capable of representing a full eight-phase ring-barrier configuration consistent with field signal controllers. The algorithm is trained and evaluated under varying traffic demand conditions and benchmarked against state-of-the-practice actuated signal control (ASC). To assess robustness, experiments are conducted across multiple traffic volumes and origin-destination (O-D) demand patterns with varying levels of structural similarity. To improve training efficiency, a distributed asynchronous training architecture is implemented that enables parallel simulation across multiple computing nodes. Results from a case study intersection show that the proposed RL-based signal control significantly outperforms optimized ASC, reducing average delay by 11-32% across movements. A model trained on a single O-D pattern generalizes well to similar unseen demand patterns but degrades under substantially different demand conditions. In contrast, a model trained on diverse O-D patterns demonstrates strong robustness, consistently outperforming ASC even under highly dissimilar unseen demand scenarios.
[LG-20] Mechanistic Foundations of Goal-Directed Control
链接: https://arxiv.org/abs/2603.15248
作者: Alma Lago
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Submitted to the 7th International Conference on the Mathematics of Neuroscience and AI (Rome, June 2026)
Abstract:Mechanistic interpretability has transformed the analysis of transformer circuits by decomposing model behavior into competing algorithms, identifying phase transitions during training, and deriving closed-form predictions for when and why strategies shift. However, this program has remained largely confined to sequence-prediction architectures, leaving embodied control systems without comparable mechanistic accounts. Here we extend this framework to sensorimotor-cognitive development, using infant motor learning as a model system. We show that foundational inductive biases give rise to causal control circuits, with learned gating mechanisms converging toward theoretically motivated uncertainty thresholds. The resulting dynamics reveal a clean phase transition in the arbitration gate whose commitment behavior is well described by a closed-form exponential moving-average surrogate. We identify context window k as the critical parameter governing circuit formation: below a minimum threshold (k \leq 4) the arbitration mechanism cannot form; above it (k \geq 8), gate confidence scales asymptotically as log k. A two-dimensional phase diagram further reveals task-demand-dependent route arbitration consistent with the prediction that prospective execution becomes advantageous only when prediction error remains within the task tolerance window. Together, these results provide a mechanistic account of how reactive and prospective control strategies emerge and compete during learning. More broadly, this work sharpens mechanistic accounts of cognitive development and provides principled guidance for the design of interpretable embodied agents.
[LG-21] Decomposing Probabilistic Scores: Reliability, Information Loss and Uncertainty
链接: https://arxiv.org/abs/2603.15232
作者: Arthur Charpentier,Agathe Fernandes-Machado
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:
Abstract:Calibration is a conditional property that depends on the information retained by a predictor. We develop decomposition identities for arbitrary proper losses that make this dependence explicit. At any information level \mathcal A , the expected loss of an \mathcal A -measurable predictor splits into a proper-regret (reliability) term and a conditional entropy (residual uncertainty) term. For nested levels \mathcal A\subseteq\mathcal B , a chain decomposition quantifies the information gain from \mathcal A to \mathcal B . Applied to classification with features \boldsymbol{X} and score S=s(\boldsymbol{X}) , this yields a three-term identity: miscalibration, a \emph{grouping} term measuring information loss from \boldsymbol{X} to S , and irreducible uncertainty at the feature level. We leverage the framework to analyze post-hoc recalibration, aggregation of calibrated models, and stagewise/boosting constructions, with explicit forms for Brier and log-loss.
[LG-22] Massive Redundancy in Gradient Transport Enables Sparse Online Learning
链接: https://arxiv.org/abs/2603.15195
作者: Aur Shalev Merin
类目: Machine Learning (cs.LG)
*备注: 27 pages, 5 figures, 14 tables
Abstract:Real-time recurrent learning (RTRL) computes exact online gradients by propagating a Jacobian tensor forward through recurrent dynamics, but at O(n^4) cost per step. Prior work has sought structured approximations (rank-1 compression, graph-based sparsity, Kronecker factorization). We show that, in the continuous error signal regime, the recurrent Jacobian is massively redundant: propagating through a random 6% of paths (k=4 of n=64) recovers 84 +/- 6% of full RTRL’s adaptation ability across five seeds, and the absolute count k=4 remains effective from n=64 to n=256 (6% to 1.6%, recovery 84 to 78%), meaning sparse RTRL becomes relatively cheaper as networks grow. In RNNs, the recovery is selection-invariant (even adversarial path selection works) and exhibits a step-function transition from zero to any nonzero propagation. Spectral analysis reveals the mechanism: the Jacobian is full-rank but near-isotropic (condition numbers 2.6-6.5), so any random subset provides a directionally representative gradient estimate. On chaotic dynamics (Lorenz attractor), sparse propagation is more numerically stable than full RTRL (CV 13% vs. 88%), as subsampling avoids amplifying pathological spectral modes. The redundancy extends to LSTMs (k=4 matches full RTRL) and to transformers via sparse gradient transport (50% head sparsity outperforms the dense reference; 33% is borderline), with higher thresholds reflecting head specialization rather than isotropy. On real primate neural data, sparse RTRL (k=4) adapts online to cross-session electrode drift (80 +/- 11% recovery, 5 seeds), where sparse propagation is again more stable than full RTRL. Without continuous error signal, Jacobian propagation accumulates numerical drift and degrades all RTRL variants, a scope condition for all forward-mode methods. Results hold with SGD (92 +/- 1% recovery), suggesting independence from optimizer choice.
[LG-23] PiGRAND: Physics-informed Graph Neural Diffusion for Intelligent Additive Manufacturing
链接: https://arxiv.org/abs/2603.15194
作者: Benjamin Uhrich,Tim Häntschel,Erhard Rahm
类目: Machine Learning (cs.LG)
*备注: 36 pages, 29 figures
Abstract:A comprehensive understanding of heat transport is essential for optimizing various mechanical and engineering applications, including 3D printing. Recent advances in machine learning, combined with physics-based models, have enabled a powerful fusion of numerical methods and data-driven algorithms. This progress is driven by the availability of limited sensor data in various engineering and scientific domains, where the cost of data collection and the inaccessibility of certain measurements are high. To this end, we present PiGRAND, a Physics-informed graph neural diffusion framework. In order to reduce the computational complexity of graph learning, an efficient graph construction procedure was developed. Our approach is inspired by the explicit Euler and implicit Crank-Nicolson methods for modeling continuous heat transport, leveraging sub-learning models to secure the accurate diffusion across graph nodes. To enhance computational performance, our approach is combined with efficient transfer learning. We evaluate PiGRAND on thermal images from 3D printing, demonstrating significant improvements in prediction accuracy and computational performance compared to traditional graph neural diffusion (GRAND) and physics-informed neural networks (PINNs). These enhancements are attributed to the incorporation of physical principles derived from the theoretical study of partial differential equations (PDEs) into the learning model. The PiGRAND code is open-sourced on GitHub: this https URL
[LG-24] Joint Routing and Model Pruning for Decentralized Federated Learning in Bandwidth-Constrained Multi-Hop Wireless Networks
链接: https://arxiv.org/abs/2603.15188
作者: Xiaoyu He,Weicai Li,Tiejun Lv,Xi Yu
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:
Abstract:Decentralized federated learning (D-FL) enables privacy-preserving training without a central server, but multi-hop model exchanges and aggregation are often bottlenecked by communication resource constraints. To address this issue, we propose a joint routing-and-pruning framework that optimizes routing paths and pruning rates to maintain communication latency within prescribed limits. We analyze how the sum of model biases across all clients affects the convergence bound of D-FL and formulate an optimization problem that maximizes the model retention rate to minimize these biases under communication constraints. Further analysis reveals that each client’s model retention rate is path-dependent, which reduces the original problem to a routing optimization. Leveraging this insight, we develop a routing algorithm that selects latency-efficient transmission paths, allowing more parameters to be delivered within the time budget and thereby improving D-FL convergence. Simulations demonstrate that, compared with unpruned systems, the proposed framework reduces average transmission latency by 27.8% and improves testing accuracy by approximately 12%. Furthermore, relative to standard benchmark routing algorithms, the proposed routing method improves accuracy by roughly 8%.
[LG-25] Point-Identification of a Robust Predictor Under Latent Shift with Imperfect Proxies
链接: https://arxiv.org/abs/2603.15158
作者: Zahra Rahiminasab,Reza Soumi,Arto Klami,Samuel Kaski
类目: Machine Learning (cs.LG)
*备注:
Abstract:Addressing the domain adaptation problem becomes more challenging when distribution shifts across domains stem from latent confounders that affect both covariates and outcomes. Existing proxy-based approaches that address latent shift rely on a strong completeness assumption to uniquely determine (point-identify) a robust predictor. Completeness requires that proxies have sufficient information about variations in latent confounders. For imperfect proxies the mapping from confounders to the space of proxy distributions is non-injective, and multiple latent confounder values can generate the same proxy distribution. This breaks the completeness assumption and observed data are consistent with multiple potential predictors (set-identified). To address this, we introduce latent equivalent classes (LECs). LECs are defined as groups of latent confounders that induce the same conditional proxy distribution. We show that point-identification for the robust predictor remains achievable as long as multiple domains differ sufficiently in how they mix proxy-induced LECs to form the robust predictor. This domain diversity condition is formalized as a cross-domain rank condition on the mixture weights, which is a substantially weaker assumption than completeness. We introduce the Proximal Quasi-Bayesian Active learning (PQAL) framework, which actively queries a minimal set of diverse domains that satisfy this rank condition. PQAL can efficiently recover the point-identified predictor, demonstrates robustness to varying degrees of shift and outperforms previous methods on synthetic data and semi-synthetic dSprites dataset.
[LG-26] Accelerating Byzantine-Robust Distributed Learning with Compressed Communication via Double Momentum and Variance Reduction
链接: https://arxiv.org/abs/2603.15144
作者: Yanghao Li,Changxin Liu,Yuhao Yi
类目: Machine Learning (cs.LG)
*备注: 62 pages, 12 figures
Abstract:In collaborative and distributed learning, Byzantine robustness reflects a major facet of optimization algorithms. Such distributed algorithms are often accompanied by transmitting a large number of parameters, so communication compression is essential for an effective solution. In this paper, we propose Byz-DM21, a novel Byzantine-robust and communication-efficient stochastic distributed learning algorithm. Our key innovation is a novel gradient estimator based on a double-momentum mechanism, integrating recent advancements in error feedback techniques. Using this estimator, we design both standard and accelerated algorithms that eliminate the need for large batch sizes while maintaining robustness against Byzantine workers. We prove that the Byz-DM21 algorithm has a smaller neighborhood size and converges to \varepsilon -stationary points in \mathcal{O}(\varepsilon^{-4}) iterations. To further enhance efficiency, we introduce a distributed variant called Byz-VR-DM21, which incorporates local variance reduction at each node to progressively eliminate variance from random approximations. We show that Byz-VR-DM21 provably converges to \varepsilon -stationary points in \mathcal{O}(\varepsilon^{-3}) iterations. Additionally, we extend our results to the case where the functions satisfy the Polyak-Łojasiewicz condition. Finally, numerical experiments demonstrate the effectiveness of the proposed method.
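The general setting (worker-side momentum plus a robust server aggregator) can be sketched as follows. Coordinate-wise median stands in for the paper's aggregation rule, and neither the double-momentum estimator nor compression is modeled; this only illustrates why momentum plus robust aggregation tolerates Byzantine workers:

```python
import numpy as np

rng = np.random.default_rng(6)

# Quadratic objective f(x) = 0.5 * ||x - 1||^2, minimized at x_opt.
d, n_workers, n_byz = 10, 9, 2
x_opt = np.ones(d)
x = np.zeros(d)
momenta = np.zeros((n_workers, d))
beta, lr = 0.9, 0.1

for step in range(200):
    msgs = np.empty((n_workers, d))
    for w in range(n_workers):
        grad = (x - x_opt) + rng.normal(scale=0.5, size=d)  # stochastic gradient
        momenta[w] = beta * momenta[w] + (1 - beta) * grad  # local momentum
        if w < n_byz:
            msgs[w] = rng.normal(scale=50.0, size=d)        # Byzantine message
        else:
            msgs[w] = momenta[w]                            # honest message
    robust_dir = np.median(msgs, axis=0)   # coordinate-wise median aggregator
    x = x - lr * robust_dir

dist = float(np.linalg.norm(x - x_opt))
```

Momentum shrinks the variance of honest messages, which tightens the cluster the median has to find; this is the same mechanism (in simplified form) that lets momentum-based robust methods avoid large batch sizes.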
[LG-27] Establishing Construct Validity in LLM Capability Benchmarks Requires Nomological Networks
链接: https://arxiv.org/abs/2603.15121
作者: Timo Freiesleben
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Recent work in machine learning increasingly attributes human-like capabilities such as reasoning or theory of mind to large language models (LLMs) on the basis of benchmark performance. This paper examines this practice through the lens of construct validity, understood as the problem of linking theoretical capabilities to their empirical measurements. It contrasts three influential frameworks: the nomological account developed by Cronbach and Meehl, the inferential account proposed by Messick and refined by Kane, and Borsboom’s causal account. I argue that the nomological account provides the most suitable foundation for current LLM capability research. It avoids the strong ontological commitments of the causal account while offering a more substantive framework for articulating construct meaning than the inferential account. I explore the conceptual implications of adopting the nomological account for LLM research through a concrete case: the assessment of reasoning capabilities in LLMs.
[LG-28] Trustworthy Koopman Operator Learning: Invariance Diagnostics and Error Bounds
链接: https://arxiv.org/abs/2603.15091
作者: Gustav Conradie,Nicolas Boullé,Jean-Christophe Loiseau,Steven L. Brunton,Matthew J. Colbrook
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Dynamical Systems (math.DS); Optimization and Control (math.OC)
*备注:
Abstract:Koopman operator theory provides a global linear representation of nonlinear dynamics and underpins many data-driven methods. In practice, however, finite-dimensional feature spaces induced by a user-chosen dictionary are rarely invariant, so closure failures and projection errors lead to spurious eigenvalues, misleading Koopman modes, and overconfident forecasts. This paper addresses a central validation problem in data-driven Koopman methods: how to quantify invariance and projection errors for an arbitrary feature space using only snapshot data, and how to use these diagnostics to produce actionable guarantees and guide dictionary refinement? A unified a posteriori methodology is developed for certifying when a Koopman approximation is trustworthy and improving it when it is not. Koopman invariance is quantified using principal angles between a subspace and its Koopman image, yielding principal observables and a principal angle decomposition (PAD), a dynamics-informed alternative to SVD truncation with significantly improved performance. Multi-step error bounds are derived for Koopman and Perron–Frobenius mode decompositions, including RKHS-based pointwise guarantees, and are complemented by Gaussian process expected error surrogates. The resulting toolbox enables validated spectral analysis, certified forecasting, and principled dictionary and kernel learning, demonstrated on chaotic and high-dimensional benchmarks and real-world datasets, including cavity flow and the Pluto–Charon system.
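The invariance diagnostic described above rests on principal angles between a feature subspace and its image under the dynamics. A minimal numpy sketch of that computation, on an assumed one-dimensional dictionary for a linear toy map (not the paper's full PAD pipeline):

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians) between the column spans of A and B."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.clip(np.linalg.svd(Qa.T @ Qb, compute_uv=False), -1.0, 1.0)
    return np.arccos(s)

# Toy check: for the linear map x -> 0.5 x, the dictionary {x} spans a
# Koopman-invariant subspace, so the principal angle should be ~0.
x = np.linspace(-1, 1, 50).reshape(-1, 1)
Psi = x                      # dictionary evaluated on snapshots
Psi_next = 0.5 * x           # dictionary evaluated on advanced snapshots
angles = principal_angles(Psi, Psi_next)
print(angles)                # near-zero angle: the feature space is invariant
```

A nonzero angle on real snapshot data would flag closure failure of the chosen dictionary, the situation where the paper's error bounds and dictionary refinement come into play.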
[LG-29] Affordable Precision Agriculture: A Deployment-Oriented Review of Low-Cost Low-Power Edge AI and TinyML for Resource-Constrained Farming Systems
链接: https://arxiv.org/abs/2603.15085
作者: Riya Samanta,Bidyut Saha
类目: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:
Abstract:Precision agriculture increasingly integrates artificial intelligence to enhance crop monitoring, irrigation management, and resource efficiency. Nevertheless, the vast majority of current systems are still cloud-based and require reliable connectivity, which hampers adoption in smaller-scale, smallholder, and developing-region farming systems. Drawing on literature from 2023 to 2026, this review covers deployments of Edge AI in low-cost, low-power agriculture, focusing on the evolution and adoption of Tiny Machine Learning. A hardware-targeted, deployment-oriented analysis shows pronounced architectural variation, with microcontroller-class platforms (e.g., ESP32, STM32, ATmega) dominating the inference options, in parallel with single-board computers and UAV-assisted solutions. Quantitative synthesis shows that quantization is the dominant optimization strategy, identified in around 50% of the surveyed works, while structured pruning, multi-objective compression, and hardware-aware neural architecture search are relatively under-researched. Resource profiling practices are also not uniform: while model size is occasionally reported, explicit flash, RAM, MAC, latency, and millijoule-level energy metrics are not well documented, hampering reproducibility and cross-system comparison. Moreover, to bridge the gap between research prototypes and deployment-ready systems, the review also presents a literature-informed deployment perspective in the form of a privacy-preserving layered Edge AI architecture for agriculture, synthesizing the key system-level design insights emerging from the surveyed works. Overall, the findings demonstrate a clear architectural shift toward localized inference with centralized training asymmetry.
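Since quantization is reported as the dominant optimization strategy in the surveyed works, a minimal sketch of symmetric int8 post-training quantization (a generic textbook scheme, not a specific method from the review) shows where the 4x memory saving comes from:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric post-training quantization of a weight tensor to int8."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from the int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)
q, scale = quantize_int8(w)
err = np.max(np.abs(dequantize(q, scale) - w))
print(q.nbytes, w.nbytes, err)   # 4x smaller storage, rounding error below one step
```

On microcontroller-class targets the int8 codes also enable integer-only inference kernels, which is typically the larger win in practice.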
[LG-30] Interpretable Classification of Time Series Using Euler Characteristic Surfaces
链接: https://arxiv.org/abs/2603.15079
作者: Salam Rabindrajit Luwang,Sushovan Majhi,Vishal Mandal,Atish J. Mitra,Md. Nurujjaman,Buddha Nath Sharma
类目: Machine Learning (cs.LG); Algebraic Topology (math.AT)
*备注:
Abstract:Persistent homology (PH), the conventional method in topological data analysis, is computationally expensive, requires further vectorization of its signatures before machine learning (ML) can be applied, and captures information along only the spatial axis. For time series data, we propose Euler Characteristic Surfaces (ECS) as an alternative topological signature based on the Euler characteristic ($\chi$), a fundamental topological invariant. The ECS provides a computationally efficient, spatiotemporal, and inherently discretized feature representation that can serve as direct input to ML models. We prove a stability theorem guaranteeing that the ECS remains stable under small perturbations of the input time series. We first demonstrate that ECS effectively captures the nontrivial topological differences between the limit cycle and the strange attractor in the Rössler system. We then develop an ECS-based classification framework and apply it to five benchmark biomedical datasets (four ECG, one EEG) from the UCR/UEA archive. On ECG5000, our single-feature ECS classifier achieves 98% accuracy with $O(n + R \cdot T)$ complexity, compared to 62% reported by a recent PH-based method. An AdaBoost extension raises accuracy to 98.6%, matching the best deep learning results while retaining full interpretability. Strong results are also obtained on TwoLeadECG (94.1%) and Epilepsy2 (92.6%).
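For a 1-D signal viewed as a path graph, the Euler characteristic of a sublevel set is just vertices minus edges, which makes an ECS cheap to compute. The windowing and threshold grid below are illustrative choices, not the paper's exact construction:

```python
import numpy as np

def euler_curve(signal, thresholds):
    """Euler characteristic of the sublevel-set filtration of a 1-D signal.
    On a path graph, chi = #vertices - #edges among samples below each threshold."""
    s = np.asarray(signal, dtype=float)
    chis = []
    for t in thresholds:
        v = s <= t                         # vertices present at level t
        e = v[:-1] & v[1:]                 # an edge survives iff both endpoints do
        chis.append(int(v.sum()) - int(e.sum()))
    return np.array(chis)

def euler_surface(signal, thresholds, window, stride):
    """Stack Euler curves over sliding windows: a (time, threshold) surface."""
    rows = [euler_curve(signal[i:i + window], thresholds)
            for i in range(0, len(signal) - window + 1, stride)]
    return np.array(rows)

sig = np.sin(np.linspace(0, 8 * np.pi, 200))
S = euler_surface(sig, thresholds=np.linspace(-1, 1, 9), window=50, stride=25)
print(S.shape)   # (number of windows, number of thresholds)
```

Each row of `S` is a discretized topological summary of one window, so the surface can be fed to a classifier directly, without the vectorization step PH signatures require.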
[LG-31] Muon Converges under Heavy-Tailed Noise: Nonconvex Hölder-Smooth Empirical Risk Minimization
链接: https://arxiv.org/abs/2603.15059
作者: Hideaki Iiduka
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Muon is a recently proposed optimizer that enforces orthogonality in parameter updates by projecting gradients onto the Stiefel manifold, leading to stable and efficient training in large-scale deep neural networks. Meanwhile, previously reported results indicate that stochastic noise in practical machine learning may exhibit heavy-tailed behavior, violating the bounded-variance assumption. In this paper, we consider the problem of minimizing a nonconvex Hölder-smooth empirical risk under heavy-tailed stochastic noise. We then show that Muon converges to a stationary point of the empirical risk under a boundedness condition accounting for heavy-tailed stochastic noise. In addition, we show that Muon converges faster than mini-batch SGD.
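The orthogonalization step at the heart of Muon replaces a gradient matrix by its polar factor, the nearest (semi-)orthogonal matrix. The SVD route below is an illustrative equivalent; practical Muon implementations typically approximate this with cheaper iterations rather than a full SVD:

```python
import numpy as np

def orthogonalize(G):
    """Replace a gradient matrix by its polar factor U V^T, the nearest
    (semi-)orthogonal matrix in Frobenius norm."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(1)
G = rng.standard_normal((4, 3))
O = orthogonalize(G)
print(np.round(O.T @ O, 6))   # ~identity: the update's columns are orthonormal
```

Because all singular values of the update become 1, every direction in the gradient's row/column space is stepped with equal magnitude, which is the source of the training stability the abstract mentions.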
[LG-32] CrossADR: enhancing adverse drug reactions prediction for combination pharmacotherapy with cross-layer feature integration and cross-level associative learning
链接: https://arxiv.org/abs/2603.15047
作者: Y. Cheung
类目: Machine Learning (cs.LG); Algebraic Geometry (math.AG)
*备注:
Abstract:Combination pharmacotherapy offers substantial therapeutic advantages but also poses serious risks of adverse drug reactions (ADRs). The accurate prediction of ADRs with interpretable computational methods is crucial for clinical safety management, drug development, and precision medicine. However, managing ADRs remains a challenge due to the vast search space of drug combinations and the complexity of physiological responses. Current graph-based architectures often struggle to effectively integrate multi-scale biological information and frequently rely on fixed association matrices, which limits their ability to capture dynamic organ-level dependencies and generalize across diverse datasets. Here we propose CrossADR, a hierarchical framework for organ-level ADR prediction through cross-layer feature integration and cross-level associative learning. It incorporates a gated-residual-flow graph neural network to fuse multi-scale molecular features and utilizes a learnable ADR embedding space to dynamically capture latent biological correlations across 15 organ systems. Systematic evaluation on the newly constructed CrossADR-Dataset, covering 1,376 drugs and 946,000 unique combinations, demonstrates that CrossADR consistently achieves state-of-the-art performance across 80 distinct experimental scenarios and provides high-resolution insights into drug-related protein-protein interactions and pathways. Overall, CrossADR represents a robust tool for cross-scale biomedical information integration, cross-layer feature integration, and cross-level associative learning, and can be effectively utilized to prevent ADRs in clinical decision-making.
[LG-33] Rethinking Machine Unlearning: Models Designed to Forget via Key Deletion
链接: https://arxiv.org/abs/2603.15033
作者: Sonia Laguna,Jorge da Silva Goncalves,Moritz Vandenhirtz,Alain Ryser,Irene Cannistraci,Julia E. Vogt
类目: Machine Learning (cs.LG)
*备注:
Abstract:Machine unlearning is rapidly becoming a practical requirement, driven by privacy regulations, data errors, and the need to remove harmful or corrupted training samples. Despite this, most existing methods tackle the problem purely from a post-hoc perspective. They attempt to erase the influence of targeted training samples through parameter updates that typically require access to the full training data. This creates a mismatch with real deployment scenarios where unlearning requests can be anticipated, revealing a fundamental limitation of post-hoc approaches. We propose unlearning by design, a novel paradigm in which models are directly trained to support forgetting as an inherent capability. We instantiate this idea with Machine UNlearning via KEY deletion (MUNKEY), a memory-augmented transformer that decouples instance-specific memorization from model weights. Here, unlearning corresponds to removing the instance-identifying key, enabling direct zero-shot forgetting without weight updates or access to the original samples or labels. Across natural image benchmarks, fine-grained recognition, and medical datasets, MUNKEY outperforms all post-hoc baselines. Our results establish that unlearning by design enables fast, deployment-oriented unlearning while preserving predictive performance.
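The key-deletion idea can be caricatured with a plain dictionary: if instance-specific information lives in an external store rather than the weights, forgetting is a lookup-table deletion. The `KeyedMemory` class below is a hypothetical toy, not MUNKEY's memory-augmented transformer:

```python
import numpy as np

class KeyedMemory:
    """Toy instance-keyed memory: predictions may attend to a per-sample entry;
    unlearning a sample is just deleting its key."""
    def __init__(self):
        self.store = {}
    def write(self, key, value):
        self.store[key] = value
    def read(self, key):
        return self.store.get(key)   # None after deletion -> zero-shot forgetting
    def unlearn(self, key):
        self.store.pop(key, None)

mem = KeyedMemory()
mem.write("sample_42", np.array([0.1, 0.9]))
assert mem.read("sample_42") is not None
mem.unlearn("sample_42")
print(mem.read("sample_42"))  # None: no weight update, no access to original data
```

The contrast with post-hoc unlearning is that no optimization runs at deletion time; the cost of forgetting is a constant-time key removal.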
[LG-34] MONET: Modeling and Optimization of neural NEtwork Training from Edge to Data Centers
链接: https://arxiv.org/abs/2603.15002
作者: Jérémy Morlier,Robin Geens,Stef Cuyckens,Arne Symons,Marian Verhelst,Vincent Gripon,Mathieu Léonardon
类目: Machine Learning (cs.LG)
*备注: 12 pages, 12 figures
Abstract:While hardware-software co-design has significantly improved the efficiency of neural network inference, modeling the training phase remains a critical yet underexplored challenge. Training workloads impose distinct constraints, particularly regarding memory footprint and backpropagation complexity, which existing inference-focused tools fail to capture. This paper introduces MONET, a framework designed to model the training of neural networks on heterogeneous dataflow accelerators. MONET builds upon Stream, an experimentally verified framework that models the inference of neural networks on heterogeneous dataflow accelerators with layer fusion. Using MONET, we explore the design space of ResNet-18 and a small GPT-2, demonstrating the framework’s capability to model training workflows and find better hardware architectures. We then further examine problems that become more complex in neural network training due to the larger design space, such as determining the best layer-fusion configuration. Additionally, we use our framework to find interesting trade-offs in activation checkpointing, with the help of a genetic algorithm. Our findings highlight the importance of a holistic approach to hardware-software co-design for scalable and efficient deep learning deployment.
[LG-35] Lightweight User-Personalization Method for Closed Split Computing
链接: https://arxiv.org/abs/2603.14958
作者: Yuya Okada,Takayuki Nishio
类目: Machine Learning (cs.LG)
*备注: 15 pages, 12 figures
Abstract:Split Computing enables collaborative inference between edge devices and the cloud by partitioning a deep neural network into an edge-side head and a server-side tail, reducing latency and limiting exposure of raw input data. However, inference performance often degrades in practical deployments due to user-specific data distribution shifts, unreliable communication, and privacy-oriented perturbations, especially in closed environments where model architectures and parameters are inaccessible. To address this challenge, we propose SALT (Split-Adaptive Lightweight Tuning), a lightweight adaptation framework for closed Split Computing systems. SALT introduces a compact client-side adapter that refines intermediate representations produced by a frozen head network, enabling effective model adaptation without modifying the head or tail networks or increasing communication overhead. By modifying only the training conditions, SALT supports multiple adaptation objectives, including user personalization, communication robustness, and privacy-aware inference. Experiments using ResNet-18 on CIFAR-10 and CIFAR-100 show that SALT achieves higher accuracy than conventional retraining and fine-tuning while significantly reducing training cost. On CIFAR-10, SALT improves personalized accuracy from 88.1% to 93.8% while reducing training latency by more than 60%. SALT also maintains over 90% accuracy under 75% packet loss and preserves high accuracy (about 88% at sigma = 1.0) under noise injection. These results demonstrate that SALT provides an efficient and practical adaptation framework for real-world Split Computing systems.
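A client-side adapter that refines frozen intermediate features can be sketched as a bottleneck residual map. The shapes, zero-initialization, and ReLU bottleneck below are common adapter conventions assumed for illustration, not SALT's published architecture:

```python
import numpy as np

def adapter(z, W_down, W_up):
    """Bottleneck residual adapter refining a frozen head's features z.
    Only W_down / W_up would be trained on the client; head and tail stay frozen."""
    h = np.maximum(0.0, z @ W_down)       # ReLU bottleneck
    return z + h @ W_up                   # residual: identity at initialization

rng = np.random.default_rng(0)
d, r = 64, 8                               # feature dim, bottleneck dim (assumed)
z = rng.standard_normal((2, d))
W_down = 0.01 * rng.standard_normal((d, r))
W_up = np.zeros((r, d))                    # zero init => adapter starts as identity
out = adapter(z, W_down, W_up)
print(np.allclose(out, z))  # the adapter is a no-op before any training
```

Because the adapter sits after the split point and leaves tensor shapes unchanged, it adds no communication overhead, which matches the closed-system constraint the abstract describes.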
[LG-36] SFedHIFI: Fire Rate-Based Heterogeneous Information Fusion for Spiking Federated Learning
链接: https://arxiv.org/abs/2603.14956
作者: Ran Tao,Qiugang Zhan,Shantian Yang,Xiurui Xie,Qi Tian,Guisong Liu
类目: Machine Learning (cs.LG)
*备注: 9 pages, 1 figure
Abstract:Spiking Federated Learning (SFL) has been widely studied owing to the energy efficiency of Spiking Neural Networks (SNNs). However, existing SFL methods require model homogeneity and assume all clients have sufficient computational resources, resulting in the exclusion of some resource-constrained clients. To address the prevalent system heterogeneity in real-world scenarios, enabling heterogeneous SFL systems that allow clients to adaptively deploy models of different scales based on their local resources is crucial. To this end, we introduce SFedHIFI, a novel Spiking Federated Learning framework with Fire Rate-Based Heterogeneous Information Fusion. Specifically, SFedHIFI employs channel-wise matrix decomposition to deploy SNN models of adaptive complexity on clients with heterogeneous resources. Building on this, the proposed heterogeneous information fusion module enables cross-scale aggregation among models of different widths, thereby enhancing the utilization of diverse local knowledge. Extensive experiments on three public benchmarks demonstrate that SFedHIFI can effectively enable heterogeneous SFL, consistently outperforming all three baseline methods. Compared with ANN-based FL, it achieves significant energy savings with only a marginal trade-off in accuracy.
[LG-37] Spiking Layer-Adaptive Magnitude-based Pruning
链接: https://arxiv.org/abs/2603.14946
作者: Junqiao Wang,Zhehang Ye,Yuqi Ouyang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Spiking Neural Networks (SNNs) provide energy-efficient computation but their deployment is constrained by dense connectivity and high spiking operation costs. Existing magnitude-based pruning strategies, when naively applied to SNNs, fail to account for temporal accumulation, non-uniform timestep contributions, and membrane stability, often leading to severe performance degradation. This paper proposes Spiking Layer-Adaptive Magnitude-based Pruning (SLAMP), a theory-guided pruning framework that generalizes layer-adaptive magnitude pruning to temporal SNNs by explicitly controlling worst-case output distortion across layers and timesteps. SLAMP formulates sparsity allocation as a temporal distortion-constrained optimization problem, yielding time-aware layer importance scores that reduce to conventional layer-adaptive pruning in the single-timestep limit. An efficient two-stage procedure is derived, combining temporal score estimation, global sparsity allocation, and magnitude pruning with retraining for stability recovery. Experiments on CIFAR10, CIFAR100, and the event-based CIFAR10-DVS datasets demonstrate that SLAMP achieves substantial connectivity and spiking operation reductions while preserving accuracy, enabling efficient and deployable SNN inference.
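The single-timestep limit the abstract refers to is plain layer-adaptive magnitude pruning: importance scores decide how a global keep budget is split across layers, then the smallest-magnitude weights in each layer are zeroed. The proportional allocation rule below is an illustrative choice, not SLAMP's distortion-derived scores:

```python
import numpy as np

def magnitude_prune(W, keep_ratio):
    """Zero out the smallest-magnitude entries, keeping a fraction keep_ratio."""
    k = max(1, int(round(keep_ratio * W.size)))
    thresh = np.sort(np.abs(W).ravel())[-k]
    return np.where(np.abs(W) >= thresh, W, 0.0)

def allocate_and_prune(layers, scores, global_keep):
    """Split a global keep budget across layers proportionally to importance scores."""
    scores = np.asarray(scores, dtype=float)
    ratios = np.clip(global_keep * scores / scores.mean(), 0.0, 1.0)
    return [magnitude_prune(W, r) for W, r in zip(layers, ratios)]

rng = np.random.default_rng(0)
layers = [rng.standard_normal((8, 8)) for _ in range(3)]
pruned = allocate_and_prune(layers, scores=[1.0, 2.0, 0.5], global_keep=0.3)
sparsity = [float((W == 0).mean()) for W in pruned]
print(sparsity)  # higher-scored layers end up less sparse
```

SLAMP's contribution is replacing the hand-set scores with time-aware ones derived from worst-case output distortion across timesteps; the pruning mechanics stay the same.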
[LG-38] Ultra-Early Prediction of Tipping Points: Integrating Dynamical Measures with Reservoir Computing
链接: https://arxiv.org/abs/2603.14944
作者: Xin Li,Qunxi Zhu,Chengli Zhao,Bolin Zhao,Xue Zhang,Xiaojun Duan,Wei Lin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Complex dynamical systems-such as climate, ecosystems, and economics-can undergo catastrophic and potentially irreversible regime changes, often triggered by environmental parameter drift and stochastic disturbances. These critical thresholds, known as tipping points, pose a prediction problem of both theoretical and practical significance, yet remain largely unresolved. To address this, we articulate a model-free framework that integrates the measures characterizing the stability and sensitivity of dynamical systems with reservoir computing (RC), a lightweight machine learning technique, using only observational time series data. The framework consists of two stages. The first stage involves using RC to robustly learn local complex dynamics from observational data segmented into windows. The second stage focuses on accurately detecting early warning signals of tipping points by analyzing the learned autonomous RC dynamics through dynamical measures, including the dominant eigenvalue of the Jacobian matrix, the maximum Floquet multiplier, and the maximum Lyapunov exponent. Furthermore, when these dynamical measures exhibit trend-like patterns, their extrapolation enables ultra-early prediction of tipping points significantly prior to the occurrence of critical transitions. We conduct a rigorous theoretical analysis of the proposed method and perform extensive numerical evaluations on a series of representative synthetic systems and eight real-world datasets, as well as quantitatively predict the tipping time of the Atlantic Meridional Overturning Circulation system. Experimental results demonstrate that our framework exhibits advantages over the baselines in comprehensive evaluations, particularly in terms of dynamical interpretability, prediction stability and robustness, and ultra-early prediction capability.
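The core early-warning signal, a dominant multiplier drifting toward 1, can be demonstrated without the RC machinery by fitting a linear map on sliding windows of an AR(1) process whose coefficient ramps toward instability. This direct least-squares fit stands in for the paper's learned RC dynamics:

```python
import numpy as np

def dominant_multiplier(x):
    """Least-squares fit of x_{t+1} ~ a * x_t on a window; |a| -> 1 signals
    loss of stability (critical slowing down)."""
    x0, x1 = x[:-1], x[1:]
    a = float(np.dot(x0, x1) / np.dot(x0, x0))
    return abs(a)

# Toy drift toward a tipping point: AR(1) coefficient ramping 0.5 -> 0.99.
rng = np.random.default_rng(0)
coeffs = np.linspace(0.5, 0.99, 2000)
x = np.zeros(2000)
for t in range(1, 2000):
    x[t] = coeffs[t] * x[t - 1] + 0.1 * rng.standard_normal()
early = dominant_multiplier(x[100:400])
late = dominant_multiplier(x[1600:1900])
print(early, late)   # the indicator rises as the system nears the transition
```

Extrapolating the trend of this indicator to the point where it reaches 1 is, in miniature, the ultra-early prediction strategy the abstract describes.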
[LG-39] Intelligent Control of Differential Drive Robots Subject to Unmodeled Dynamics with EKF-based State Estimation
链接: https://arxiv.org/abs/2603.14940
作者: Amos Alwala,Yuchen Hu,Gabriel da Silva Lima,Wallace Moreira Bessa
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
Abstract:Reliable control and state estimation of differential drive robots (DDR) operating in dynamic and uncertain environments remains a challenge, particularly when system dynamics are partially unknown and sensor measurements are prone to degradation. This work introduces a unified control and state estimation framework that combines a Lyapunov-based nonlinear controller and Adaptive Neural Networks (ANN) with Extended Kalman Filter (EKF)-based multi-sensor fusion. The proposed controller leverages the universal approximation property of neural networks to model unknown nonlinearities in real time. An online adaptation scheme updates the weights of the radial basis function (RBF), the architecture chosen for the ANN. The learned dynamics are integrated into a feedback linearization (FBL) control law, for which theoretical guarantees of closed-loop stability and asymptotic convergence in a trajectory-tracking task are established through a Lyapunov-like stability analysis. To ensure robust state estimation, the EKF fuses inertial measurement unit (IMU) and odometry from monocular, 2D-LiDAR and wheel encoders. The fused state estimate drives the intelligent controller, ensuring consistent performance even under drift, wheel slip, sensor noise and failure. Gazebo simulations and real-world experiments on a DDR demonstrate the effectiveness of the approach, improving velocity tracking with reductions in linear and angular velocity errors of up to 53.91% and 29.0%, respectively, compared to the baseline FBL.
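The EKF fusion loop above follows the standard predict/update cycle. A minimal linear-case sketch (for a true EKF, F and H would be Jacobians of the nonlinear robot model; the constant-velocity model and noise levels below are illustrative assumptions):

```python
import numpy as np

def ekf_step(x, P, z, F, H, Q, R):
    """One Kalman predict/update cycle (linear case; F, H would be Jacobians)."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    y = z - H @ x_pred                          # innovation
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)         # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Constant-velocity robot, fusing noisy position measurements.
dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q, R = 1e-4 * np.eye(2), np.array([[0.05]])
x, P = np.zeros(2), np.eye(2)
rng = np.random.default_rng(0)
for k in range(1, 101):
    true_pos = 1.0 * k * dt                     # true velocity = 1 m/s
    z = np.array([true_pos + 0.2 * rng.standard_normal()])
    x, P = ekf_step(x, P, z, F, H, Q, R)
print(x)  # estimated [position, velocity] converges near [10, 1]
```

Fusing additional sensors (IMU, LiDAR odometry, encoders) amounts to running extra update steps per cycle, each with its own H and R.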
[LG-40] Masked BRep Autoencoder via Hierarchical Graph Transformer
链接: https://arxiv.org/abs/2603.14927
作者: Yifei Li,Kang Wu,Wenming Wu,Xiaoming Fu
类目: Graphics (cs.GR); Machine Learning (cs.LG)
*备注: 27 pages, 11 figures. Under review
Abstract:We introduce a novel self-supervised learning framework that automatically learns representations from input computer-aided design (CAD) models for downstream tasks, including part classification, modeling segmentation, and machining feature recognition. To train our network, we construct a large-scale, unlabeled dataset of boundary representation (BRep) models. The success of our algorithm relies on two key components. The first is a masked graph autoencoder that reconstructs randomly masked geometries and attributes of BReps for representation learning to enhance generalization. The second is a hierarchical graph Transformer architecture that elegantly fuses global and local learning by a cross-scale mutual attention block to model long-range geometric dependencies and a graph neural network block to aggregate local topological information. After training the autoencoder, we replace its decoder with a task-specific network trained on a small amount of labeled data for downstream tasks. We conduct experiments on various tasks and achieve high performance, even with a small amount of labeled data, demonstrating the practicality and generalizability of our model. Compared to other methods, our model performs significantly better on downstream tasks with the same amount of training data, particularly when the training data is very limited.
[LG-41] BiTro: Bidirectional Transfer Learning Enhances Bulk and Spatial Transcriptomics Prediction in Cancer Pathological Images
链接: https://arxiv.org/abs/2603.14897
作者: Jingkun Yu,Guangkai Shang,Changtao Li,Xun Gong,Tianrui Li,Yazhou He,Zhipeng Luo
类目: Machine Learning (cs.LG)
*备注:
Abstract:Cancer pathological analysis requires modeling tumor heterogeneity across multiple modalities, primarily through transcriptomics and whole slide imaging (WSI), along with their spatial relations. On one hand, bulk transcriptomics and WSI images are largely available but lack spatial mapping; on the other hand, spatial transcriptomics (ST) data can offer high spatial resolution, yet faces challenges of high cost, low sequencing depth, and limited sample sizes. Therefore, the data foundation on either side is flawed, limiting the accuracy of the mapping between the two modalities. To this end, we propose BiTro, a bidirectional transfer learning framework that can enhance bulk and spatial transcriptomics prediction from pathological images. Our contributions are twofold. First, we design a universal and transferable model architecture that works for both bulk+WSI and ST data. A major highlight is that we model WSI images on the cellular level to better capture cells’ visual features, morphological phenotypes, and their spatial relations; to map cells’ features to their transcriptomics measured in bulk or ST, we adopt multiple instance learning. Second, by using LoRA, our model can be efficiently transferred between bulk and ST data to exploit their complementary information. To test our framework, we conducted comprehensive experiments on five cancer datasets. Results demonstrate that 1) our base model can achieve better or competitive performance compared to existing models on bulk or spatial transcriptomics prediction, and 2) transfer learning can further improve the base model’s performance.
[LG-42] Lost in Aggregation: On a Fundamental Expressivity Limit of Message-Passing Graph Neural Networks
链接: https://arxiv.org/abs/2603.14846
作者: Eran Rosenbluth
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC)
*备注:
Abstract:We define a generic class of functions that captures most conceivable aggregations for Message-Passing Graph Neural Networks (MP-GNNs), and prove that any MP-GNN model with such aggregations induces only a polynomial number of equivalence classes on all graphs, while the number of non-isomorphic graphs is doubly exponential (in the number of vertices). Adding a familiar perspective, we observe that merely 2 iterations of Color Refinement (CR) induce at least an exponential number of equivalence classes, making the aforementioned MP-GNNs relatively infinitely weaker. Previous results state that MP-GNNs match full CR; however, they concern a weak, ‘non-uniform’, notion of distinguishing power where each graph size may require a different MP-GNN to distinguish graphs up to that size. Our results concern both distinguishing between non-equivariant vertices and distinguishing between non-isomorphic graphs.
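Color Refinement (1-WL), the comparison point in the abstract, is simple to implement: each round, a vertex's new colour is its old colour together with the multiset of its neighbours' colours. A minimal sketch on a 4-vertex path:

```python
from collections import Counter

def color_refinement(adj, iterations):
    """Iterative vertex colouring (1-WL): refine colours by the multiset of
    neighbour colours, relabelling signatures to a compact palette each round."""
    colors = {v: 0 for v in adj}
    for _ in range(iterations):
        sig = {v: (colors[v],
                   tuple(sorted(Counter(colors[u] for u in adj[v]).items())))
               for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        colors = {v: palette[sig[v]] for v in adj}
    return colors

# Path on 4 vertices: after 2 rounds, endpoints and middle vertices are
# distinguished, but the two endpoints (and the two middles) remain equal.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(color_refinement(path, 2))
```

The paper's point is about the number of such equivalence classes achievable: even two CR rounds generate exponentially many classes over all graphs, which the analyzed MP-GNN aggregations cannot match uniformly.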
[LG-43] Dataset Distillation Efficiently Encodes Low-Dimensional Representations from Gradient-Based Learning of Non-Linear Tasks
链接: https://arxiv.org/abs/2603.14830
作者: Yuri Kinoshita,Naoki Nishikawa,Taro Toyoizumi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Dataset distillation, a training-aware data compression technique, has recently attracted increasing attention as an effective tool for mitigating costs of optimization and data storage. However, progress remains largely empirical. Mechanisms underlying the extraction of task-relevant information from the training process and the efficient encoding of such information into synthetic data points remain elusive. In this paper, we theoretically analyze practical algorithms of dataset distillation applied to the gradient-based training of two-layer neural networks with width $L$. By focusing on a non-linear task structure called the multi-index model, we prove that the low-dimensional structure of the problem is efficiently encoded into the resulting distilled data. This dataset reproduces a model with high generalization ability for a required memory complexity of $\tilde{\Theta}(r^2 d + L)$, where $d$ and $r$ are the input and intrinsic dimensions of the task. To the best of our knowledge, this is one of the first theoretical works that include a specific task structure, leverage its intrinsic dimensionality to quantify the compression rate and study dataset distillation implemented solely via gradient-based algorithms.
[LG-44] OpenReservoirComputing: GPU-Accelerated Reservoir Computing in JAX
链接: https://arxiv.org/abs/2603.14802
作者: Jan Williams,Dima Tretiak,Steven L. Brunton,J. Nathan Kutz,Krithika Manohar
类目: Machine Learning (cs.LG)
*备注:
Abstract:OpenReservoirComputing (ORC) is a Python library for reservoir computing (RC) written in JAX (Bradbury et al. 2018) and Equinox (Kidger and Garcia 2021). JAX is a Python library for high-performance numerical computing that enables automatic differentiation, just-in-time (JIT) compilation, and GPU/TPU acceleration, while Equinox is a neural network framework for JAX. RC is a form of machine learning that functions by lifting a low-dimensional sequence or signal into a high-dimensional dynamical system and training a simple, linear readout layer from the high-dimensional dynamics back to a lower-dimensional quantity of interest. The most common application of RC is time-series forecasting, where the goal is to predict a signal’s future evolution. RC has achieved state-of-the-art performance on this task, particularly when applied to chaotic dynamical systems. In addition, RC approaches can be adapted to perform classification and control tasks. ORC provides both modular components for building custom RC models and built-in models for forecasting, classification, and control. By building on JAX and Equinox, ORC offers GPU acceleration, JIT compilation, and automatic vectorization. These capabilities make prototyping new models faster and enable larger and more powerful reservoir architectures. End-to-end differentiability also enables seamless integration with other deep learning models built with Equinox.
[LG-45] GARCH-FIS: A Hybrid Forecasting Model with Dynamic Volatility-Driven Parameter Adaptation
链接: https://arxiv.org/abs/2603.14793
作者: Wen-Jing Li,Da-Qing Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper proposes a novel hybrid model, termed GARCH-FIS, for recursive rolling multi-step forecasting of financial time series. It integrates a Fuzzy Inference System (FIS) with a Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model to jointly address nonlinear dynamics and time-varying volatility. The core innovation is a dynamic parameter adaptation mechanism for the FIS, specifically activated within the multi-step forecasting cycle. In this process, the conditional volatility estimated by a rolling window GARCH model is continuously translated into a price volatility measure. At each forecasting step, this measure, alongside the updated mean of the sliding window data, which now incorporates the most recent predicted price, jointly determines the parameters of the FIS membership functions for the next prediction. Consequently, the granularity of the fuzzy inference adapts as the forecast horizon extends: membership functions are automatically widened during high-volatility market regimes to bolster robustness and narrowed during stable periods to enhance precision. This constitutes a fundamental advancement over a static one-step-ahead prediction setup. Furthermore, the model’s fuzzy rule base is automatically constructed from data using the Wang-Mendel method, promoting interpretability and adaptability. Empirical evaluation, focused exclusively on multi-step forecasting performance across ten diverse financial assets, demonstrates that the proposed GARCH-FIS model significantly outperforms benchmark models, including Support Vector Regression (SVR), Long Short-Term Memory networks (LSTM), and an ARIMA-GARCH econometric model, in terms of predictive accuracy and stability, while effectively mitigating error accumulation in extended recursive forecasts.
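The widen-when-volatile mechanism can be sketched with a single triangular membership function whose width scales with an estimated volatility. The linear width rule and its `scale` parameter are hypothetical; in GARCH-FIS the volatility would come from the rolling GARCH model:

```python
import numpy as np

def triangular_mf(x, center, width):
    """Triangular membership function; width is driven by estimated volatility."""
    return np.maximum(0.0, 1.0 - np.abs(x - center) / width)

def adaptive_width(base_width, volatility, scale=2.0):
    """Widen membership functions in volatile regimes, narrow them when calm."""
    return base_width * (1.0 + scale * volatility)

x = 1.5
calm = triangular_mf(x, center=1.0, width=adaptive_width(1.0, volatility=0.05))
stormy = triangular_mf(x, center=1.0, width=adaptive_width(1.0, volatility=0.50))
print(calm, stormy)  # the same point gains membership when widths expand
```

Wider functions make the rule base fire for a broader range of predicted prices, trading precision for robustness exactly when recursive forecasts are least reliable.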
[LG-46] Orthogonal Subspace Clustering: Enhancing High-Dimensional Data Analysis through Adaptive Dimensionality Reduction and Efficient Clustering
链接: https://arxiv.org/abs/2603.14783
作者: Qing-Yuan Wen,Da-Qing Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper presents Orthogonal Subspace Clustering (OSC), an innovative method for high-dimensional data clustering. We first establish a theoretical theorem proving that high-dimensional data can be decomposed into orthogonal subspaces in a statistical sense, whose form exactly matches the paradigm of Q-type factor analysis. This theorem lays a solid mathematical foundation for dimensionality reduction via matrix decomposition and factor analysis. Based on this theorem, we propose the OSC framework to address the “curse of dimensionality” – a critical challenge that degrades clustering effectiveness due to sample sparsity and ineffective distance metrics. OSC integrates orthogonal subspace construction with classical clustering techniques, introducing a data-driven mechanism to select the subspace dimension based on cumulative variance contribution. This avoids manual selection biases while maximizing the retention of discriminative information. By projecting high-dimensional data into an uncorrelated, low-dimensional orthogonal subspace, OSC significantly improves clustering efficiency, robustness, and accuracy. Extensive experiments on various benchmark datasets demonstrate the effectiveness of OSC, with thorough analysis of evaluation metrics including Cluster Accuracy (ACC), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI) highlighting its advantages over existing methods.
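The data-driven subspace-dimension selection described above can be sketched with a plain SVD: pick the smallest number of principal directions whose cumulative variance contribution reaches a threshold, then project. This is a simplified stand-in for OSC; the threshold, data, and dimensions are illustrative:

```python
import numpy as np

def orthogonal_subspace(X, var_threshold=0.9):
    """Project centered data onto the fewest principal directions whose
    cumulative variance contribution reaches var_threshold."""
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = np.cumsum(S ** 2) / np.sum(S ** 2)
    d = int(np.searchsorted(ratio, var_threshold)) + 1  # data-driven dimension
    return Xc @ Vt[:d].T, d  # uncorrelated low-dimensional coordinates

rng = np.random.default_rng(0)
# Two informative directions buried among 50 low-variance nuisance dimensions
X = np.concatenate([rng.normal(0, 1, (100, 2)) * [10, 5],
                    rng.normal(0, 0.1, (100, 50))], axis=1)
Z, d = orthogonal_subspace(X, 0.9)  # the projected Z can then feed k-means etc.
```

Because the projected coordinates are mutually uncorrelated, Euclidean distances in the subspace are not distorted by redundant directions, which is what makes the downstream clustering step better behaved.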
[LG-47] Understanding the geometry of deep learning with decision boundary volume
链接: https://arxiv.org/abs/2603.14768
作者: Matthew Burfitt,Jacek Brodzki,Pawel Dłotko
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:For classification tasks, the performance of a deep neural network is determined by the structure of its decision boundary, whose geometry directly affects essential properties of the model, including accuracy and robustness. Motivated by a classical tube formula due to Weyl, we introduce a method to measure the decision boundary of a neural network through local surface volumes, providing a theoretically justifiable and efficient measure enabling a geometric interpretation of the effectiveness of the model applicable to the high-dimensional feature spaces considered in deep learning. A smaller surface volume is expected to correspond to lower model complexity and better generalisation. We verify, on a number of image processing tasks with convolutional architectures, that decision boundary volume is inversely proportional to classification accuracy. Meanwhile, the relationship between local surface volume and generalisation for fully connected architectures is observed to be less stable between tasks. Therefore, for network architectures suited to a particular data structure, we demonstrate that smoother decision boundaries lead to better performance, as our intuition would suggest.
[LG-48] Investigating the Impact of Speech Enhancement on Audio Deepfake Detection in Noisy Environments
链接: https://arxiv.org/abs/2603.14767
作者: Anacin,Angela,Shruti Kshirsagar,Anderson R. Avila
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注:
Abstract:Logical Access (LA) attacks, also known as audio deepfake attacks, use Text-to-Speech (TTS) or Voice Conversion (VC) methods to generate spoofed speech data. This can represent a serious threat to Automatic Speaker Verification (ASV) systems, as intruders can use such attacks to bypass voice biometric security. In this study, we investigate the correlation between speech quality and the performance of audio spoofing detection systems (i.e., the LA task). For that, the performance of two enhancement algorithms is evaluated based on two perceptual speech quality measures, namely Perceptual Evaluation of Speech Quality (PESQ) and Speech-to-Reverberation Modulation Ratio (SRMR), and with respect to their impact on the audio spoofing detection system. We adopted the LA dataset provided in the ASVspoof 2019 Challenge and corrupted its test set with different Signal-to-Noise Ratio (SNR) levels, while leaving the training data untouched. Enhancement was applied to attenuate the detrimental effects of noisy speech, and the performances of two models, Speech Enhancement Generative Adversarial Network (SEGAN) and Metric-Optimized Generative Adversarial Network Plus (MetricGAN+), were compared. Although we expect speech quality to correlate well with downstream performance, enhancement can also have side effects if unwanted artifacts are introduced or relevant information is removed from the speech signal. Our results support the expected correlation: the enhancement algorithm with the highest speech quality scores, MetricGAN+, provided the lowest Equal Error Rate (EER) on the audio spoofing detection task, whereas the enhancement method with the lowest speech quality scores, SEGAN, led to the highest EER and thus worse performance on the LA task.
[LG-49] CAMD: Coverage-Aware Multimodal Decoding for Efficient Reasoning of Multimodal Large Language Models
链接: https://arxiv.org/abs/2603.14745
作者: Huijie Guo,Jingyao Wang,Lingyu Si,Jiahuan Zhou,Changwen Zheng,Wenwen Qiang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have shown impressive reasoning capabilities across vision-language tasks, yet still face the challenge of compute-difficulty mismatch. Through empirical analyses, we identify that existing decoding methods may waste compute on easy cases while underserving hard ones, affecting both model effectiveness and efficiency. To address this issue, we first develop a theoretical framework that links sampling coverage, instance difficulty, and residual risk. Our analysis reveals that multimodal reasoning exhibits a heavy-tailed difficulty distribution; a small subset of hard or ambiguous samples dominates the residual failure probability. Based on this insight, we propose Coverage-Aware Multimodal Decoding (CAMD), an adaptive inference mechanism that dynamically allocates computation according to estimated uncertainty. CAMD integrates evidence-weighted scoring, posterior coverage estimation, and sequential Bayesian updating to balance efficiency and reliability under a limited token budget. Experiments on various benchmark datasets and baselines demonstrate the effectiveness and advantages of our approach.
[LG-50] GNNVerifier: Graph-based Verifier for LLM Task Planning
链接: https://arxiv.org/abs/2603.14730
作者: Yu Hao(1),Qiuyu Wang(1),Cheng Yang(1),Yawen Li(1),Zhiqiang Zhang(2),Chuan Shi(1) ((1) Beijing University of Posts and Telecommunications, (2) Ant Group)
类目: Machine Learning (cs.LG)
*备注: 17pages,12figures
Abstract:Large language models (LLMs) facilitate the development of autonomous agents. As a core component of such agents, task planning aims to decompose complex natural language requests into concrete, solvable sub-tasks. Since LLM-generated plans are frequently prone to hallucinations and sensitive to long-context prompts, recent research has introduced plan verifiers to identify and correct potential flaws. However, most existing approaches still rely on an LLM as the verifier via additional prompting for plan review or self-reflection. LLM-based verifiers can be misled by plausible narration and struggle to detect failures caused by structural relations across steps, such as type mismatches, missing intermediates, or broken dependencies. To address these limitations, we propose a graph-based verifier for LLM task planning. Specifically, the proposed method has four major components. Firstly, we represent a plan as a directed graph with enriched attributes, where nodes denote sub-tasks and edges encode execution order and dependency constraints. Secondly, a graph neural network (GNN) performs structural evaluation and diagnosis, producing a graph-level plausibility score for plan acceptance as well as node/edge-level risk scores to localize erroneous regions. Thirdly, we construct controllable perturbations from ground truth plan graphs, and automatically generate training data with fine-grained annotations. Finally, guided by the feedback from our GNN verifier, we enable an LLM to conduct local edits (e.g., tool replacement or insertion) to correct the plan when the graph-level score is insufficient. Extensive experiments across diverse datasets, backbone LLMs, and planners demonstrate that our GNNVerifier achieves significant gains in improving plan quality. Our data and code are available at this https URL.
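The structural failures the abstract says LLM verifiers miss (type mismatches, broken dependencies) are exactly what a cheap graph check can catch. A toy sketch using a hypothetical plan schema that maps each sub-task to an (input_type, output_type) pair; the actual GNNVerifier learns such checks from perturbed plan graphs rather than hard-coding them:

```python
def check_plan(nodes, edges):
    """Structural plan checks: type mismatches along dependency edges and
    cycles (broken execution order). nodes maps a sub-task id to its
    (input_type, output_type); edges are (src, dst) dependencies.
    This is a hypothetical toy schema, not the paper's plan format."""
    issues = []
    for src, dst in edges:
        if nodes[src][1] != nodes[dst][0]:
            issues.append(("type_mismatch", src, dst))
    # Cycle detection via DFS: 0 = unvisited, 1 = on stack, 2 = done
    state = {n: 0 for n in nodes}
    adj = {n: [] for n in nodes}
    for s, d in edges:
        adj[s].append(d)
    def dfs(u):
        state[u] = 1
        for v in adj[u]:
            if state[v] == 1:
                issues.append(("cycle", u, v))
            elif state[v] == 0:
                dfs(v)
        state[u] = 2
    for n in nodes:
        if state[n] == 0:
            dfs(n)
    return issues

nodes = {"parse": ("text", "table"), "plot": ("table", "image"),
         "caption": ("image", "text")}
good = check_plan(nodes, [("parse", "plot"), ("plot", "caption")])
bad = check_plan(nodes, [("parse", "caption")])  # table fed where image expected
```

The point of the learned GNN version is that such relational defects are invisible to a verifier that only reads the plan as flat text.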
[LG-51] DeFRiS: Silo-Cooperative IoT Applications Scheduling via Decentralized Federated Reinforcement Learning
链接: https://arxiv.org/abs/2603.14729
作者: Zhiyu Wang,Mohammad Goudarzi,Mingming Gong,Rajkumar Buyya
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Next-generation IoT applications increasingly span across autonomous administrative entities, necessitating silo-cooperative scheduling to leverage diverse computational resources while preserving data privacy. However, realizing efficient cooperation faces significant challenges arising from infrastructure heterogeneity, Non-IID workload shifts, and the inherent risks of adversarial environments. Existing approaches, relying predominantly on centralized coordination or independent learning, fail to address the incompatibility of state-action spaces across heterogeneous silos and lack robustness against malicious attacks. This paper proposes DeFRiS, a Decentralized Federated Reinforcement Learning framework for robust and scalable Silo-cooperative IoT application scheduling. DeFRiS integrates three synergistic innovations: (i) an action-space-agnostic policy utilizing candidate resource scoring to enable seamless knowledge transfer across heterogeneous silos; (ii) a silo-optimized local learning mechanism combining Generalized Advantage Estimation (GAE) with clipped policy updates to resolve sparse delayed reward challenges; and (iii) a Dual-Track Non-IID robust decentralized aggregation protocol leveraging gradient fingerprints for similarity-aware knowledge transfer and anomaly detection, and gradient tracking for optimization momentum. Extensive experiments on a distributed testbed with 20 heterogeneous silos and realistic IoT workloads demonstrate that DeFRiS significantly outperforms state-of-the-art baselines, reducing average response time by 6.4% and energy consumption by 7.2%, while lowering tail latency risk (CVaR_0.95) by 10.4% and achieving near-zero deadline violations. Furthermore, DeFRiS achieves over 3 times better performance retention as the system scales and over 8 times better stability in adversarial environments compared to the best-performing baseline.
[LG-52] Multimodal Deep Learning for Early Prediction of Patient Deterioration in the ICU: Integrating Time-Series EHR Data with Clinical Notes
链接: https://arxiv.org/abs/2603.14719
作者: Binesh Sadanandan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Early identification of patients at risk for clinical deterioration in the intensive care unit (ICU) remains a critical challenge. Delayed recognition of impending adverse events, including mortality, vasopressor initiation, and mechanical ventilation, contributes to preventable morbidity and mortality. We present a multimodal deep learning approach that combines structured time-series data (vital signs and laboratory values) with unstructured clinical notes to predict patient deterioration within 24 hours. Using the MIMIC-IV database, we constructed a cohort of 74,822 ICU stays and generated 5.7 million hourly prediction samples. Our architecture employs a bidirectional LSTM encoder for temporal patterns in physiologic data and ClinicalBERT embeddings for clinical notes, fused through a cross-modal attention mechanism. We also present a systematic review of existing approaches to ICU deterioration prediction, identifying 31 studies published between 2015 and 2024. Most existing models rely solely on structured data and achieve area under the curve (AUC) values between 0.70 and 0.85. Studies incorporating clinical notes remain rare but show promise for capturing information not present in structured fields. Our multimodal model achieves a test AUROC of 0.7857 and AUPRC of 0.1908 on 823,641 held-out samples, with a validation-to-test gap of only 0.6 percentage points. Ablation analysis validates the multimodal approach: clinical notes improve AUROC by 2.5 percentage points and AUPRC by 39.2% relative to a structured-only baseline, while deep learning models consistently outperform classical baselines (XGBoost AUROC: 0.7486, logistic regression: 0.7171). This work contributes both a thorough review of the field and a reproducible multimodal framework for clinical deterioration prediction.
[LG-53] Training-Free Generation of Protein Sequences from Small Family Alignments via Stochastic Attention
链接: https://arxiv.org/abs/2603.14717
作者: Jeffrey D. Varner
类目: Machine Learning (cs.LG)
*备注:
Abstract:Most protein families have fewer than 100 known members, a regime where deep generative models overfit or collapse. We propose stochastic attention (SA), a training-free sampler that treats the modern Hopfield energy over a protein alignment as a Boltzmann distribution and draws samples via Langevin dynamics. The score function is a closed-form softmax attention operation requiring no training, no pretraining data, and no GPU, with cost linear in alignment size. Across eight Pfam families, SA generates sequences with low amino acid compositional divergence, substantial novelty, and structural plausibility confirmed by ESMFold and AlphaFold2. Generated sequences fold more faithfully to canonical family structures than natural members in six of eight families. Against profile HMMs, EvoDiff, and the MSA Transformer, which produce sequences that drift far outside the family, SA maintains 51 to 66 percent identity while remaining novel, in seconds on a laptop. The critical temperature governing generation is predicted from PCA dimensionality alone, enabling fully automatic operation. Controls confirm SA encodes correlated substitution patterns, not just per-position amino acid frequencies.
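The sampler the abstract describes reduces to Langevin dynamics whose score is a closed-form softmax attention readout over the stored sequences. A numeric toy with two 2-D "patterns" standing in for an alignment; beta, the step size, and the step count are illustrative choices, not the paper's settings:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def langevin_attention_sample(M, beta=2.0, eta=0.02, steps=2000, seed=0):
    """Sample from the Boltzmann distribution of the modern Hopfield energy
    E(x) = -(1/beta)*logsumexp(beta * M @ x) + 0.5*||x||^2
    via unadjusted Langevin dynamics. The score -grad E is a closed-form
    softmax attention readout over the stored patterns (rows of M),
    so no training is involved."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=M.shape[1])
    for _ in range(steps):
        p = softmax(beta * (M @ x))        # attention weights over patterns
        score = M.T @ p - x                # -grad E(x), closed form
        x = x + eta * score + np.sqrt(2.0 * eta) * rng.normal(size=x.shape)
    return x

# Two well-separated stored "patterns"; a sample should settle near one mode
M = np.array([[4.0, 0.0], [-4.0, 0.0]])
x = langevin_attention_sample(M)
```

Near a mode the energy is approximately quadratic, so samples concentrate around one stored pattern with roughly unit variance; lowering beta flattens the landscape and increases diversity, which is the temperature the paper predicts automatically from PCA dimensionality.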
[LG-54] Cross-RAG: Zero-Shot Retrieval-Augmented Time Series Forecasting via Cross-Attention
链接: https://arxiv.org/abs/2603.14709
作者: Seunghan Lee,Jaehoon Lee,Jun Seo,Sungdong Yoo,Minjae Kim,Tae Yoon Lim,Dongwan Kang,Hwanil Choi,SoonYoung Lee,Wonbin Ahn
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent advances in time series foundation models (TSFMs) demonstrate strong expressive capacity through large-scale pretraining across diverse time series domains. Zero-shot time series forecasting with TSFMs, however, exhibits limited generalization to unseen datasets, which retrieval-augmented forecasting addresses by leveraging an external knowledge base. Existing approaches rely on a fixed number of retrieved samples that may introduce irrelevant information. To this end, we propose Cross-RAG, a zero-shot retrieval-augmented forecasting framework that selectively attends to query-relevant retrieved samples. Cross-RAG models input-level relevance between the query and retrieved samples via query-retrieval cross-attention, while jointly incorporating information from the query and retrieved samples. Extensive experiments demonstrate that Cross-RAG consistently improves zero-shot forecasting performance across various TSFMs and RAG methods, and additional analyses confirm its effectiveness across diverse retrieval scenarios. Code is available at this https URL.
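The query-retrieval cross-attention idea can be sketched without learned projections: similarity scores between the query and each retrieved sample become softmax weights, so irrelevant retrievals are attenuated rather than averaged in. Everything here (vectors, scaling) is an illustrative stand-in for the actual Cross-RAG module:

```python
import numpy as np

def cross_attention(query, retrieved, d_k=None):
    """Single-head cross-attention: the query attends over retrieved samples,
    so low-relevance retrievals receive near-zero weight. Illustrative only;
    real models apply learned Q/K/V projections first."""
    d_k = d_k or query.shape[-1]
    scores = query @ retrieved.T / np.sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ retrieved, w

# One query vector, three retrieved samples; only the first is relevant
q = np.array([[1.0, 0.0, 0.0]])
R = np.array([[5.0, 0.0, 0.0],    # similar to the query
              [0.0, 5.0, 0.0],    # irrelevant
              [0.0, 0.0, 5.0]])   # irrelevant
out, w = cross_attention(q, R)
```

This input-level relevance weighting is what lets the framework tolerate a fixed retrieval budget: padding the retrieval set with off-topic samples barely changes the attended output.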
[LG-55] BayesBreak: Generalized Hierarchical Bayesian Segmentation with Irregular Designs, Multi-Sample Hierarchies, and Grouped/Latent-Group Designs
链接: https://arxiv.org/abs/2603.14681
作者: Omid Shams Solari
类目: Machine Learning (cs.LG)
*备注:
Abstract:Bayesian change-point and segmentation models provide uncertainty-aware piecewise-constant representations of ordered data, but exact inference is often tied to narrow likelihood classes, single-sequence settings, or index-uniform designs. We present BayesBreak, a modular offline Bayesian segmentation framework built around a simple separation: each candidate block contributes a marginal likelihood and any required moment numerators, and a global dynamic program combines those block scores into posterior quantities over segment counts, boundary locations, and latent signals. For weighted exponential-family likelihoods with conjugate priors, block evidences and posterior moments are available in closed form from cumulative sufficient statistics, yielding exact sum-product inference for P(y\mid k), P(k\mid y), boundary marginals, and Bayes regression curves. We also distinguish these quantities from the joint MAP segmentation, which is recovered by a separate max-sum backtracking recursion.
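A minimal instance of the block-evidence-plus-dynamic-program separation: a Gaussian block marginal likelihood in closed form from cumulative sufficient statistics, combined by a max-sum recursion with backtracking to recover the joint MAP segmentation. A known noise variance, a zero-mean conjugate prior, and a flat per-segment penalty are simplifying assumptions, not the paper's general setup:

```python
import numpy as np

def block_log_evidence(s, ss, m, sigma2=1.0, tau2=10.0):
    """Closed-form log marginal likelihood of a Gaussian block with known
    noise variance sigma2 and a conjugate N(0, tau2) prior on the block mean,
    computed from the block's sum s, sum of squares ss, and length m."""
    quad = ss / sigma2 - tau2 * s ** 2 / (sigma2 * (sigma2 + m * tau2))
    logdet = (m - 1) * np.log(sigma2) + np.log(sigma2 + m * tau2)
    return -0.5 * (m * np.log(2 * np.pi) + logdet + quad)

def map_segmentation(y, seg_penalty=-3.0):
    """Max-sum dynamic program over change points: best[j] is the best log
    score of y[:j]; backtracking recovers the joint MAP boundaries."""
    n = len(y)
    cs = np.concatenate([[0.0], np.cumsum(y)])              # cumulative sums
    css = np.concatenate([[0.0], np.cumsum(np.square(y))])  # cumulative squares
    best = np.full(n + 1, -np.inf)
    best[0] = 0.0
    back = np.zeros(n + 1, dtype=int)
    for j in range(1, n + 1):
        for i in range(j):
            ev = block_log_evidence(cs[j] - cs[i], css[j] - css[i], j - i)
            score = best[i] + ev + seg_penalty
            if score > best[j]:
                best[j], back[j] = score, i
    bounds, j = [], n
    while j > 0:
        bounds.append(j)
        j = back[j]
    return sorted(bounds)  # segment end indices; the last one is len(y)

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0, 1, 40), rng.normal(5, 1, 40)])
bounds = map_segmentation(y)  # expect a change point near index 40
```

Swapping `max` for `logsumexp` in the same recursion yields the sum-product quantities (P(y|k), boundary marginals) the abstract distinguishes from this MAP variant.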
[LG-56] A Single-Sample Polylogarithmic Regret Bound for Nonstationary Online Linear Programming
链接: https://arxiv.org/abs/2603.14673
作者: Haoran Xu,Owen Shen,Peter Glynn,Yinyu Ye,Patrick Jaillet
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
Abstract:We study nonstationary Online Linear Programming (OLP), where n orders arrive sequentially with reward-resource consumption pairs that form a sequence of independent, but not necessarily identically distributed, random vectors. At the beginning of the planning horizon, the decision-maker is provided with a resource endowment that is sufficient to fulfill a significant portion of the requests. The decision-maker seeks to maximize the expected total reward by making immediate and irrevocable acceptance or rejection decisions for each order, subject to this resource endowment. We focus on the challenging single-sample setting, where only one sample from each of the n distributions is available at the start of the planning horizon. We propose a novel re-solving algorithm that integrates a dynamic programming perspective with the dual-based frameworks traditionally employed in stationary environments. In the large-resource regime, where the resource endowment scales linearly with the number of orders, we prove that our algorithm achieves O((\log n)^2) regret across a broad class of nonstationary distribution sequences. Our results demonstrate that polylogarithmic regret is attainable even under significant environmental shifts and minimal data availability, bridging the gap between stationary OLP and more volatile real-world resource allocation problems.
[LG-57] Proactive Routing to Interpretable Surrogates with Distribution-Free Safety Guarantees
链接: https://arxiv.org/abs/2603.14623
作者: Iqtedar Uddin,Mazin Khider,André Bauer
类目: Machine Learning (cs.LG)
*备注:
Abstract:Model routing determines whether to use an accurate black-box model or a simpler surrogate that approximates it at lower cost or greater interpretability. In deployment settings, practitioners often wish to restrict surrogate use to inputs where its degradation relative to a reference model is controlled. We study proactive (input-based) routing, in which a lightweight gate selects the model before either runs, enabling distribution-free control of the fraction of routed inputs whose degradation exceeds a tolerance \tau. The gate is trained to distinguish safe from unsafe inputs, and a routing threshold is chosen via Clopper-Pearson conformal calibration on a held-out set, guaranteeing that the routed-set violation rate is at most \alpha with probability 1-\delta. We derive a feasibility condition linking safe routing to the base safe rate \pi and risk budget \alpha, along with sufficient AUC thresholds ensuring that feasible routing exists. Across 35 OpenML datasets and multiple black-box model families, gate-based conformal routing maintains controlled violation while achieving substantially higher coverage than regression conformal and naive baselines. We further show that probabilistic calibration primarily affects routing efficiency rather than distribution-free validity.
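The calibration step can be sketched end to end: an exact Clopper-Pearson upper bound on the routed-set violation rate, and a scan over gate thresholds that keeps the most permissive one still satisfying the bound. The scan logic and toy data are illustrative stand-ins, not the paper's procedure:

```python
import math

def binom_cdf(k, n, p):
    """P(Bin(n, p) <= k), computed exactly."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson_upper(k, n, delta):
    """Upper (1 - delta)-confidence bound on a binomial proportion: the
    largest rate still consistent with observing <= k violations among n,
    found by bisection on the exact binomial CDF."""
    if k >= n:
        return 1.0
    lo, hi = k / n, 1.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if binom_cdf(k, n, mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi

def calibrate_threshold(scores, unsafe, alpha=0.1, delta=0.05):
    """Scan calibration points from highest gate score down; keep the most
    permissive threshold whose routed prefix has a CP upper bound <= alpha."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    best_t, k = None, 0
    for n_routed, i in enumerate(order, start=1):
        k += unsafe[i]
        if clopper_pearson_upper(k, n_routed, delta) <= alpha:
            best_t = scores[i]
    return best_t  # None: no feasible threshold at this alpha/delta

# Toy calibration set: high gate scores are safe, low scores are unsafe
scores = [1.0 - 0.01 * i for i in range(40)]
unsafe = [0] * 30 + [1] * 10
t = calibrate_threshold(scores, unsafe, alpha=0.2, delta=0.1)
```

With k = 0 the bound reduces to 1 - delta**(1/n), which is also where the feasibility condition bites: below a minimum calibration size no threshold can certify the risk budget.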
[LG-58] A Multi-Scale Graph Learning Framework with Temporal Consistency Constraints for Financial Fraud Detection in Transaction Networks under Non-Stationary Conditions
链接: https://arxiv.org/abs/2603.14592
作者: Yiming Lei,Qiannan Shen,Junhao Song
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 39 pages, 13 figures
Abstract:Financial fraud detection in transaction networks involves modeling sparse anomalies, dynamic patterns, and severe class imbalance in the presence of temporal drift in the data. In real-world transaction systems, a suspicious transaction is rarely isolated: rather, legitimate and suspicious transactions are often connected through accounts, intermediaries or through temporal transaction sequences. Attribute-based or randomly partitioned learning pipelines are therefore insufficient to detect relationally structured fraud. STC-MixHop, a graph-based framework combining spatial multi-resolution propagation with lightweight temporal consistency modeling for anomaly and fraud detection in dynamic transaction networks. It integrates three components: a MixHop-inspired multi-scale neighborhood diffusion encoder a multi-scale neighborhood diffusion MixHop-based encoder for learning structural patterns; a spatial-temporal attention module coupling current and preceding graph snapshots to stabilize representations; and a temporally informed self-supervised pretraining strategy exploiting unlabeled transaction interactions to improve representation quality. We evaluate the framework primarily on the PaySim dataset under strict chronological splits, supplementing the analysis with Porto Seguro and FEMA data to probe cross-domain component behavior. Results show that STC-MixHop is competitive among graph methods and achieves strong screening-oriented recall under highly imbalanced conditions. The experiments also reveal an important boundary condition: when node attributes are highly informative, tabular baselines remain difficult to outperform. Graph structure contributes most clearly where hidden relational dependencies are operationally important. These findings support a stability-focused view of graph learning for financial fraud detection.
[LG-59] Learning to Order: Task Sequencing as In-Context Optimization
链接: https://arxiv.org/abs/2603.14550
作者: Jan Kobiolka,Christian Frey,Arlind Kadra,Gresa Shala,Josif Grabocka
类目: Machine Learning (cs.LG)
*备注: Under Review
Abstract:Task sequencing (TS) is one of the core open problems in Deep Learning, arising in a plethora of real-world domains, from robotic assembly lines to autonomous driving. Unfortunately, prior work has not convincingly demonstrated the generalization ability of meta-learned TS methods to solve new TS problems, given few initial demonstrations. In this paper, we demonstrate that deep neural networks can meta-learn over an infinite prior of synthetically generated TS problems and achieve a few-shot generalization. We meta-learn a transformer-based architecture over datasets of sequencing trajectories generated from a prior distribution that samples sequencing problems as paths in directed graphs. In a large-scale experiment, we provide ample empirical evidence that our meta-learned models discover optimal task sequences significantly quicker than non-meta-learned baselines.
[LG-60] Excited Pfaffians: Generalized Neural Wave Functions Across Structure and State
链接: https://arxiv.org/abs/2603.14515
作者: Nicholas Gao,Till Grutschus,Frank Noé,Stephan Günnemann
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph); Quantum Physics (quant-ph)
*备注:
Abstract:Neural-network wave functions in Variational Monte Carlo (VMC) have achieved great success in accurately representing both ground and excited states. However, achieving sufficient numerical accuracy in state overlaps requires increasing the number of Monte Carlo samples, and consequently the computational cost, with the number of states. We present a nearly constant sample-size approach, Multi-State Importance Sampling (MSIS), that leverages samples from all states to estimate pairwise overlap. To efficiently evaluate all states for all samples, we introduce Excited Pfaffians. Inspired by Hartree-Fock, this architecture represents many states within a single neural network. Excited Pfaffians also serve as generalized wave functions, allowing a single model to represent multi-state potential energy surfaces. On the carbon dimer, we match the O(N_s^4)-scaling natural excited states while training 200\times faster and modeling 50% more states. Our favorable scaling enables us to be the first to use neural networks to find all distinct energy levels of the beryllium atom. Finally, we demonstrate that a single wave function can represent excited states across various molecules.
[LG-61] High-Probability Bounds for SGD under the Polyak-Lojasiewicz Condition with Markovian Noise
链接: https://arxiv.org/abs/2603.14514
作者: Avik Kar,Siddharth Chandak,Rahul Singh,Eric Moulines,Shalabh Bhatnagar,Nicholas Bambos
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: Submitted to SIAM Journal on Optimization
Abstract:We present the first uniform-in-time high-probability bound for SGD under the PL condition, where the gradient noise contains both Markovian and martingale difference components. This significantly broadens the scope of finite-time guarantees, as the PL condition arises in many machine learning and deep learning models while Markovian noise naturally arises in decentralized optimization and online system identification problems. We further allow the magnitude of noise to grow with the function value, enabling the analysis of many practical sampling strategies. In addition to the high-probability guarantee, we establish a matching 1/k decay rate for the expected suboptimality. Our proof technique relies on the Poisson equation to handle the Markovian noise and a probabilistic induction argument to address the lack of almost-sure bounds on the objective. Finally, we demonstrate the applicability of our framework by analyzing three practical optimization problems: token-based decentralized linear regression, supervised learning with subsampling for privacy amplification, and online system identification.
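The 1/k rate can be checked numerically on the simplest mu-PL objective, f(x) = 0.5*||x||^2, with purely additive (martingale-difference) noise; the Markovian component is omitted for simplicity. With mu = 1 and step size 1/k the iterate equals minus the running average of the noise, so E[f(x_k)] = d*sigma^2/(2k) in this toy setting:

```python
import numpy as np

def sgd_pl(grad, x0, noise_std, steps, mu=1.0, rng=None, checkpoints=()):
    """SGD with Robbins-Monro step 1/(mu*k) under additive gradient noise,
    a simplified stand-in for the noise model in the abstract."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    snaps = {}
    for k in range(1, steps + 1):
        g = grad(x) + noise_std * rng.normal(size=x.shape)
        x = x - g / (mu * k)
        if k in checkpoints:
            snaps[k] = x.copy()
    return snaps

# f(x) = 0.5*||x||^2 is mu-PL with mu = 1 and f* = 0
f = lambda x: 0.5 * float(x @ x)
grad = lambda x: x
rng = np.random.default_rng(42)
runs = [sgd_pl(grad, [5.0, -3.0], 1.0, 4000, rng=rng, checkpoints={1000, 4000})
        for _ in range(50)]
# Average suboptimality at two horizons; the ratio should be near 4000/1000 = 4
sub_1000 = sum(f(r[1000]) for r in runs) / len(runs)
sub_4000 = sum(f(r[4000]) for r in runs) / len(runs)
```

This numerically reproduces the 1/k decay in expectation; the paper's contribution is the much harder high-probability, uniform-in-time version with Markovian noise and function-value-dependent noise magnitude.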
[LG-62] Predicting Stress-strain Behaviors of Additively Manufactured Materials via Loss-based and Activation-based Physics-informed Machine Learning
链接: https://arxiv.org/abs/2603.14489
作者: Chenglong Duan,Dazhong Wu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Predicting the stress-strain behaviors of additively manufactured materials is crucial for part qualification in additive manufacturing (AM). Conventional physics-based constitutive models often oversimplify material properties, while data-driven machine learning (ML) models often lack physical consistency and interpretability. To address these issues, we propose a physics-informed machine learning (PIML) framework to improve the predictive performance and physical consistency for predicting the stress-strain curves of additively manufactured polymers and metals. A polynomial regression model is used to predict the yield point from AM process parameters, then stress-strain curves are segmented into elastic and plastic regions. Two long short-term memory (LSTM) models are trained to predict the two regions separately. For the elastic region, Hooke's law is embedded into the LSTM model for both polymer and metal. For the plastic region, the Voce hardening law and Hollomon's law are embedded into the LSTM model for polymer and metal, respectively. The loss-based and activation-based PIML architectures are developed by embedding the physical laws into the loss and activation functions, respectively. The two PIML architectures are compared with two LSTM-based ML models, three additional ML models, and a physics-based constitutive model. These models are built on experimental data collected from two additively manufactured polymers (i.e., Nylon and carbon fiber-acrylonitrile butadiene styrene) and two additively manufactured metals (i.e., AlSi10Mg and Ti6Al4V). Experimental results demonstrate that the two PIML architectures consistently outperform the other models. The segmental predictive model with the activation-based PIML architecture achieves the lowest MAPE of 10.46+/-0.81% and the highest R^2 of 0.82+/-0.05 across four datasets.
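A drastically simplified, closed-form illustration of the loss-based PIML idea: a polynomial surrogate (standing in for the LSTM) fit to the elastic region with a Hooke's-law penalty added to the data loss. The data, modulus, and penalty weight lam are synthetic assumptions, not the paper's experimental setup:

```python
import numpy as np

def physics_informed_fit(strain, stress, E_guess, lam=1.0, degree=3):
    """Loss-based PIML in closed form for a polynomial surrogate:
    minimize ||Xw - stress||^2 + lam*||Xw - E_guess*strain||^2,
    i.e. the data loss plus a Hooke's-law (sigma = E*eps) penalty.
    Stationarity means fitting Xw to a physics-blended target."""
    X = np.vander(strain, degree + 1)
    target = (stress + lam * E_guess * strain) / (1.0 + lam)
    w, *_ = np.linalg.lstsq(X, target, rcond=None)
    return lambda s: np.vander(np.asarray(s, dtype=float), degree + 1) @ w

rng = np.random.default_rng(0)
strain = np.linspace(0.0, 0.02, 40)
E_true = 2000.0                                      # assumed modulus
stress = E_true * strain + rng.normal(0.0, 2.0, 40)  # noisy elastic region
model = physics_informed_fit(strain, stress, E_guess=E_true, lam=5.0)
```

Because the physics target lies in the model's span, the penalty shrinks the fitted noise by a factor of (1 + lam), which is the mechanism that keeps loss-based PIML predictions physically consistent between data points.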
[LG-63] Unlearning-based sliding window for continual learning under concept drift
链接: https://arxiv.org/abs/2603.14484
作者: Michal Wozniak,Marek Klonowski,Maciej Maczynski,Bartosz Krawczyk
类目: Machine Learning (cs.LG)
*备注: 14 pages, 3 figures
Abstract:Traditional machine learning assumes a stationary data distribution, yet many real-world applications operate on nonstationary streams in which the underlying concept evolves over time. This problem can also be viewed as task-free continual learning under concept drift, where a model must adapt sequentially without explicit task identities or task boundaries. In such settings, effective learning requires both rapid adaptation to new data and forgetting of outdated information. A common solution is based on a sliding window, but this approach is often computationally demanding because the model must be repeatedly retrained from scratch on the most recent data. We propose a different perspective based on machine unlearning. Instead of rebuilding the model each time the active window changes, we remove the influence of outdated samples using unlearning and then update the model with newly observed data. This enables efficient, targeted forgetting while preserving adaptation to evolving distributions. To the best of our knowledge, this is the first work to connect machine unlearning with concept drift mitigation for task-free continual learning. Empirical results on image stream classification across multiple drift scenarios demonstrate that the proposed approach offers a competitive and computationally efficient alternative to standard sliding-window retraining. Our implementation can be found at this https URL.
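For models whose state is a sum of per-sample sufficient statistics, the unlearning-based window is exact: removing the oldest sample is a subtraction, so sliding the window never requires retraining from scratch. A toy nearest-class-mean classifier on a synthetic drifting stream (the paper targets deep models, where unlearning is approximate rather than a literal subtraction):

```python
import numpy as np

class UnlearnableMeanClassifier:
    """Nearest-class-mean classifier whose state is a sum of per-sample
    sufficient statistics, so unlearning a sample is an exact subtraction."""
    def __init__(self, n_classes, dim):
        self.sums = np.zeros((n_classes, dim))
        self.counts = np.zeros(n_classes)

    def learn(self, x, y):
        self.sums[y] += x
        self.counts[y] += 1

    def unlearn(self, x, y):
        self.sums[y] -= x
        self.counts[y] -= 1

    def predict(self, x):
        means = self.sums / np.maximum(self.counts, 1)[:, None]
        return int(np.argmin(((means - x) ** 2).sum(axis=1)))

rng = np.random.default_rng(0)
window, W = [], 50
clf = UnlearnableMeanClassifier(n_classes=2, dim=2)
# concept drift: the class-0 prototype jumps from (0, 0) to (4, 4) at t = 100
stream = [(rng.normal([0, 0] if t < 100 else [4, 4], 0.3), 0) for t in range(200)]
for x, y in stream:
    window.append((x, y))
    clf.learn(x, y)
    if len(window) > W:
        clf.unlearn(*window.pop(0))  # forget the oldest sample exactly
```

After the stream ends, the model's statistics are identical to retraining from scratch on the current window, at O(1) cost per step.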
[LG-64] Disentangling Dynamical Systems: Causal Representation Learning Meets Local Sparse Attention
链接: https://arxiv.org/abs/2603.14483
作者: Markus W. Baumgartner,Anson Lei,Joe Watson,Ingmar Posner
类目: Machine Learning (cs.LG)
*备注:
Abstract:Parametric system identification methods estimate the parameters of explicitly defined physical systems from data. Yet, they remain constrained by the need to provide an explicit function space, typically through a predefined library of candidate functions chosen via available domain knowledge. In contrast, deep learning can demonstrably model systems of broad complexity with high fidelity, but black-box function approximation typically fails to yield explicit descriptive or disentangled representations revealing the structure of a system. We develop a novel identifiability theorem, leveraging causal representation learning, to uncover disentangled representations of system parameters without structural assumptions. We derive a graphical criterion specifying when system parameters can be uniquely disentangled from raw trajectory data, up to permutation and diffeomorphism. Crucially, our analysis demonstrates that global causal structures provide a lower bound on the disentanglement guarantees achievable when considering local state-dependent causal structures. We instantiate system parameter identification as a variational inference problem, leveraging a sparsity-regularised transformer to uncover state-dependent causal structures. We empirically validate our approach across four synthetic domains, demonstrating its ability to recover highly disentangled representations that baselines fail to recover. Corroborating our theoretical analysis, our results confirm that enforcing local causal structure is often necessary for full identifiability.
[LG-65] On the (Generative) Linear Sketching Problem
链接: https://arxiv.org/abs/2603.14474
作者: Xinyu Yuan,Yan Qiao,Zonghui Wang,Wenzhi Chen
类目: Machine Learning (cs.LG)
*备注: 28 figures, 43 pages
Abstract:Sketch techniques have been extensively studied in recent years and are especially well-suited to data streaming scenarios, where the sketch summary is updated quickly and compactly. However, it is challenging to recover the current state from these summaries in a way that is accurate, fast, and real-time. In this paper, we seek a solution that reconciles this tension, aiming for near-perfect recovery with lightweight computational procedures. Focusing on linear sketching problems of the form $\boldsymbol{\Phi} f \rightarrow f$, our study proceeds in three stages. First, we dissect existing techniques and show the root cause of the sketching dilemma: an orthogonal information loss. Second, we examine how generative priors can be leveraged to bridge the information gap. Third, we propose FLORE, a novel generative sketching framework that embraces these analyses to achieve the best of all worlds. More importantly, FLORE can be trained without access to ground-truth data. Comprehensive evaluations demonstrate FLORE’s ability to provide high-quality recovery and support summary updates with low computing overhead, outperforming previous methods by up to 1000 times in error reduction and 100 times in processing speed compared to learning-based solutions.
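编辑注:摘要中提到的"正交信息损失"可以用一个小例子直观说明(示意性代码,与论文的 FLORE 框架无关):线性 sketch $y = \Phi f$ 只保留 $f$ 在 $\Phi$ 行空间上的分量,落在零空间里的分量从 $y$ 中无法恢复——这正是生成先验要弥补的信息缺口。

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 100, 20                                 # ambient dim, sketch size
Phi = rng.normal(size=(m, n)) / np.sqrt(m)     # random linear sketch matrix

f = np.zeros(n)
f[rng.choice(n, 5, replace=False)] = rng.normal(size=5)   # sparse state

y = Phi @ f                                    # compact streaming summary

# Minimum-norm recovery: the pseudoinverse returns only the component
# of f that lies in the row space of Phi.
f_hat = np.linalg.pinv(Phi) @ y

row_space = np.linalg.pinv(Phi) @ Phi @ f      # projection onto row space
null_part = f - row_space                      # unrecoverable from y alone

print(np.allclose(f_hat, row_space))           # recovery = row-space projection
print(np.linalg.norm(null_part) > 0)           # information genuinely lost
```

任何不依赖先验的线性恢复都受限于这一投影;引入关于 $f$ 的先验(如稀疏性或生成模型)才能补回零空间分量。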
[LG-66] Physics-Informed Policy Optimization via Analytic Dynamics Regularization ICML2026
链接: https://arxiv.org/abs/2603.14469
作者: Namai Chandra,Liu Mohan,Zhihao Gu,Lin Wang
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 11 pages, 8 figures. Submitted to ICML 2026
Abstract:Reinforcement learning (RL) has achieved strong performance in robotic control; however, state-of-the-art policy learning methods, such as actor-critic methods, still suffer from high sample complexity and often produce physically inconsistent actions. This limitation stems from neural policies implicitly rediscovering complex physics from data alone, despite accurate dynamics models being readily available in simulators. In this paper, we introduce a novel physics-informed RL framework, called PIPER, that seamlessly integrates physical constraints directly into neural policy optimization with analytical soft physics constraints. At the core of our method is the integration of a differentiable Lagrangian residual as a regularization term within the actor’s objective. This residual, extracted from a robot’s simulator description, subtly biases policy updates towards dynamically consistent solutions. Crucially, this physics integration is realized through an additional loss term during policy optimization, requiring no alterations to existing simulators or core RL algorithms. Extensive experiments demonstrate that our method significantly improves learning efficiency, stability, and control accuracy, establishing a new paradigm for efficient and physically consistent robotic control.
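编辑注:下面用一个单摆的小例子(示意性代码,并非 PIPER 的实现;目标函数与变量名均为假设)说明"拉格朗日残差正则项"如何把策略更新拉向动力学一致的动作:对残差平方做梯度下降,动作会收敛到满足运动方程的力矩。

```python
import numpy as np

# Frictionless pendulum dynamics: m*l^2*theta'' + m*g*l*sin(theta) = tau.
# The Lagrangian residual measures how far an action violates this.
m, l, g, lam = 1.0, 1.0, 9.81, 0.1

def residual(theta, theta_ddot, tau):
    return m * l**2 * theta_ddot + m * g * l * np.sin(theta) - tau

theta, theta_ddot = 0.3, -1.2
tau_star = m * l**2 * theta_ddot + m * g * l * np.sin(theta)  # consistent torque

# Gradient descent on lam * residual^2 alone (the RL term is frozen
# here for clarity) drives the action toward the consistent torque --
# the same bias the regulariser adds to the full actor objective.
tau, lr = 0.0, 0.5
for _ in range(500):
    grad = lam * 2.0 * residual(theta, theta_ddot, tau) * (-1.0)
    tau -= lr * grad

print(abs(tau - tau_star) < 1e-6)
```

在完整的 actor 目标中,这个残差项只是附加的软约束,与策略梯度项加权求和,因此无需修改模拟器或 RL 算法本身。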