This post contains the latest paper listing fetched from Arxiv.org on 2026-04-07, updated automatically and grouped into six major areas: NLP, CV, ML, AI, IR, and MA.
Note: paper data is fetched from Arxiv.org daily, with an automatic update around 12:30 each day.
Tip: if a day's listing is not updated on time, either Arxiv published no new papers that day or the script failed. Fixes are made the same day whenever possible.
Table of Contents
Overview (2026-04-07)
1042 papers updated today, including:
- Computation and Language (cs.CL): 136 papers
- Artificial Intelligence (cs.AI): 328 papers
- Computer Vision and Pattern Recognition (cs.CV): 222 papers
- Machine Learning (cs.LG): 246 papers
- Multiagent Systems (cs.MA): 33 papers
- Information Retrieval (cs.IR): 27 papers
- Human-Computer Interaction (cs.HC): 64 papers
Multiagent Systems
[MA-0] Agentic Federated Learning: The Future of Distributed Training Orchestration
[Quick Read]: This paper addresses the inefficiencies that arise in real-world Federated Learning (FL) from stochastic client heterogeneity and unpredictable system dynamics, which often cause resource underutilization and systemic bias. The key to its solution is the Agentic-FL framework, in which Language Model-based Agents (LMagents) assume autonomous orchestration roles: server-side agents mitigate selection bias through contextual reasoning, while client-side agents act as local guardians that dynamically manage privacy budgets and adapt model complexity to hardware constraints, yielding a more efficient, fair, and autonomously collaborative distributed learning ecosystem.
Link: https://arxiv.org/abs/2604.04895
Authors: Rafael O. Jarczewski, Gabriel U. Talasso, Leandro Villas, Allan M. de Souza
Affiliations: Institute of Computing, University of Campinas; Hub of Artificial Intelligence and Cognitive Architectures - H.IAAC
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:
Abstract:Although Federated Learning (FL) promises privacy and distributed collaboration, its effectiveness in real-world scenarios is often hampered by the stochastic heterogeneity of clients and unpredictable system dynamics. Existing static optimization approaches fail to adapt to these fluctuations, resulting in resource underutilization and systemic bias. In this work, we propose a paradigm shift towards Agentic-FL, a framework where Language Model-based Agents (LMagents) assume autonomous orchestration roles. Unlike rigid protocols, we demonstrate how server-side agents can mitigate selection bias through contextual reasoning, while client-side agents act as local guardians, dynamically managing privacy budgets and adapting model complexity to hardware constraints. More than just resolving technical inefficiencies, this integration signals the evolution of FL towards decentralized ecosystems, where collaboration is negotiated autonomously, paving the way for future markets of incentive-based models and algorithmic justice. We discuss the reliability (hallucinations) and security challenges of this approach, outlining a roadmap for resilient multi-agent systems in federated environments.
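The orchestration split described above (server-side debiasing, client-side privacy guardianship) can be sketched in a few lines. The selection rule, function names, and budget values below are illustrative assumptions for exposition, not the paper's implementation:

```python
# Hypothetical sketch of Agentic-FL orchestration: a server-side agent that
# scores clients on context to mitigate selection bias, and a client-side
# guardian that vetoes participation when its differential-privacy budget
# would be exceeded.

def server_select(clients, rounds_participated, k):
    """Contextual debiasing stand-in: prefer under-represented clients."""
    scored = sorted(clients, key=lambda c: rounds_participated.get(c, 0))
    return scored[:k]

def client_guardian(privacy_spent, privacy_budget, epsilon_round):
    """Local guardian: join the round only if it fits the remaining budget."""
    return privacy_spent + epsilon_round <= privacy_budget

clients = ["c1", "c2", "c3", "c4"]
participation = {"c1": 5, "c2": 1, "c3": 3, "c4": 0}
selected = server_select(clients, participation, k=2)
print(selected)  # the two least-selected clients: ['c4', 'c2']

joined = [c for c in selected
          if client_guardian(privacy_spent=0.8, privacy_budget=1.0,
                             epsilon_round=0.1)]
print(joined)    # both fit the budget here
```

A real LMagent would replace the sorted-key heuristic with contextual reasoning over richer telemetry; the point is only where the two decisions sit in the loop.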
[MA-1] SkillX: Automatically Constructing Skill Knowledge Bases for Agents
[Quick Read]: This paper targets the inefficiency and poor generalization of current self-evolving LLM agents, which learn in isolation, repeatedly rediscover similar behaviors, and struggle to distill transferable knowledge from limited experience. Its solution is the SkillX framework, which builds a reusable plug-and-play Skill Knowledge Base (SkillKB) through three synergistic innovations: (i) multi-level skill design, distilling raw trajectories into a three-tier hierarchy of strategic plans, functional skills, and atomic skills; (ii) iterative skill refinement, automatically revising skills from execution feedback to continuously improve library quality; and (iii) exploratory skill expansion, proactively generating and validating novel skills to push coverage beyond the seed training data. Experiments show that this structured, hierarchical representation of experience markedly improves the task success rate and execution efficiency of weaker base agents, demonstrating strong transferability in long-horizon, user-interactive settings.
Link: https://arxiv.org/abs/2604.04804
Authors: Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang, Shuofei Qiao, Kexin Cao, Guozhou Zheng, Xiang Qi, Peng Zhang, Shumin Deng
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Work in progress
Abstract: Learning from experience is critical for building capable large language model (LLM) agents, yet prevailing self-evolving paradigms remain inefficient: agents learn in isolation and repeatedly rediscover similar behaviors from limited experience, resulting in redundant exploration and poor generalization. To address this problem, we propose SkillX, a fully automated framework for constructing a plug-and-play skill knowledge base that can be reused across agents and environments. SkillX operates through a fully automated pipeline built on three synergistic innovations: (i) Multi-Level Skills Design, which distills raw trajectories into a three-tiered hierarchy of strategic plans, functional skills, and atomic skills; (ii) Iterative Skills Refinement, which automatically revises skills based on execution feedback to continuously improve library quality; and (iii) Exploratory Skills Expansion, which proactively generates and validates novel skills to expand coverage beyond seed training data. Using a strong backbone agent (GLM-4.6), we automatically build a reusable skill library and evaluate its transferability on challenging long-horizon, user-interactive benchmarks, including AppWorld, BFCL-v3, and τ²-Bench. Experiments show that SkillKB consistently improves task success and execution efficiency when plugged into weaker base agents, highlighting the importance of structured, hierarchical experience representations for generalizable agent learning. Our code will be publicly available soon at this https URL.
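The three-tier hierarchy and feedback-driven refinement above can be sketched as a small data structure. The class names, fields, and the EWMA-style refinement rule are assumptions for illustration, not the SkillX API:

```python
from dataclasses import dataclass, field

# Illustrative sketch of a three-tier skill hierarchy:
# strategic plan -> functional skills -> atomic skills,
# with a toy quality update standing in for iterative refinement.

@dataclass
class AtomicSkill:
    name: str
    action: str                    # a single primitive step or tool call

@dataclass
class FunctionalSkill:
    name: str
    steps: list = field(default_factory=list)   # ordered AtomicSkills
    success_rate: float = 0.0

    def refine(self, succeeded: bool, alpha: float = 0.2):
        """Update skill quality from one execution's feedback."""
        outcome = 1.0 if succeeded else 0.0
        self.success_rate += alpha * (outcome - self.success_rate)

@dataclass
class StrategicPlan:
    goal: str
    skills: list = field(default_factory=list)  # FunctionalSkills

login = FunctionalSkill("login", [AtomicSkill("open", "app.open()"),
                                  AtomicSkill("auth", "app.auth(user)")])
plan = StrategicPlan("book_flight", [login])
for ok in (True, True, False):   # two successes, then a failure
    login.refine(ok)
print(round(login.success_rate, 3))  # 0.288
```

In the paper the refinement step rewrites skill content from execution traces; the scalar score here only marks where that feedback loop attaches.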
[MA-2] ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration
[Quick Read]: This paper addresses the significant gap between semantic understanding and physical execution when current robotic systems carry out long-horizon, temporally structured tasks. Existing frameworks typically adopt modular pipelines for data collection, skill training, and policy deployment, making experimental validation and policy optimization costly. The key to the proposed ROSClaw framework is integrating policy learning and task execution within a unified Vision-Language Model (VLM) controller, using e-URDF representations as physical constraints to build a sim-to-real topological mapping that gives real-time access to the physical states of both simulated and real robots. A data collection and state accumulation mechanism stores multimodal observations, robot states, and execution trajectories during real-world execution to support subsequent iterative policy optimization. Finally, a unified agent maintains semantic continuity between reasoning and execution and dynamically assigns task-specific control to different heterogeneous robots, improving robustness in multi-policy execution.
Link: https://arxiv.org/abs/2604.04664
Authors: Rongfeng Zhao, Xuanhao Zhang, Zhaochen Guo, Xiang Shao, Zhongpan Zhu, Bin He, Jie Chen
Affiliations: Shanghai Research Institute for Intelligent Autonomous Systems; National Key Laboratory of Autonomous Intelligent Unmanned Systems (Tongji University); Frontiers Science Center for Intelligent Autonomous Systems; College of Electronics and Information Engineering, Tongji University
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:The integration of large language models (LLMs) with embodied agents has improved high-level reasoning capabilities; however, a critical gap remains between semantic understanding and physical execution. While vision-language-action (VLA) and vision-language-navigation (VLN) systems enable robots to perform manipulation and navigation tasks from natural language instructions, they still struggle with long-horizon sequential and temporally structured tasks. Existing frameworks typically adopt modular pipelines for data collection, skill training, and policy deployment, resulting in high costs in experimental validation and policy optimization. To address these limitations, we propose ROSClaw, an agent framework for heterogeneous robots that integrates policy learning and task execution within a unified vision-language model (VLM) controller. The framework leverages e-URDF representations of heterogeneous robots as physical constraints to construct a sim-to-real topological mapping, enabling real-time access to the physical states of both simulated and real-world agents. We further incorporate a data collection and state accumulation mechanism that stores robot states, multimodal observations, and execution trajectories during real-world execution, enabling subsequent iterative policy optimization. During deployment, a unified agent maintains semantic continuity between reasoning and execution, and dynamically assigns task-specific control to different agents, thereby improving robustness in multi-policy execution. By establishing an autonomous closed-loop framework, ROSClaw minimizes the reliance on robot-specific development workflows. The framework supports hardware-level validation, automated generation of SDK-level control programs, and tool-based execution, enabling rapid cross-platform transfer and continual improvement of robotic skills. Ours project page: this https URL.
[MA-3] AI Agents Under EU Law
[Quick Read]: This paper tackles the compliance challenges facing high-risk AI agents in a complex, multi-instrument regulatory environment, in particular the fragmentation that arises where the EU AI Act intersects with other legislation (the GDPR, the Cyber Resilience Act, the Digital Services Act, and more). The key to its solution is a systematic compliance architecture: first, a practical taxonomy of nine agent deployment categories that maps concrete actions to regulatory triggers; second, a twelve-step compliance framework whose foundational task is an exhaustive inventory of the agent's external actions, data flows, connected systems, and affected persons, while identifying cybersecurity, human oversight, transparency across multi-party action chains, and runtime behavioral drift as the core challenges, giving AI agent providers an actionable path to compliance.
Link: https://arxiv.org/abs/2604.04604
Authors: Luca Nannini, Adam Leon Smith, Michele Joshua Maggini, Enrico Panai, Sandra Feliciano, Aleksandr Tiulkanov, Elena Maran, James Gealy, Piercosma Bisconti
Affiliations: Piccadilly Labs; AIQI Consortium; Association of AI Ethicists (BeEthical); INSIGHT – Piaget Research Center for Ecological Human Development; Responsible Innovations (ForHumanity Europe); Alethesis AI; SaferAI; DEXAI, Icaro Lab
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
Comments: Working Paper - April 2026, subject to updates (EC M/613, M/606, Digital Omnibus proposals)
Abstract: AI agents - i.e. AI systems that autonomously plan, invoke external tools, and execute multi-step action chains with reduced human involvement - are being deployed at scale across enterprise functions ranging from customer service and recruitment to clinical decision support and critical infrastructure management. The EU AI Act (Regulation 2024/1689) regulates these systems through a risk-based framework, but it does not operate in isolation: providers face simultaneous obligations under the GDPR, the Cyber Resilience Act, the Digital Services Act, the Data Act, the Data Governance Act, sector-specific legislation, the NIS2 Directive, and the revised Product Liability Directive. This paper provides the first systematic regulatory mapping for AI agent providers integrating (a) draft harmonised standards under Standardisation Request M/613 to CEN/CENELEC JTC 21 as of January 2026, (b) the GPAI Code of Practice published in July 2025, (c) the CRA harmonised standards programme under Mandate M/606 accepted in April 2025, and (d) the Digital Omnibus proposals of November 2025. We present a practical taxonomy of nine agent deployment categories mapping concrete actions to regulatory triggers, identify agent-specific compliance challenges in cybersecurity, human oversight, transparency across multi-party action chains, and runtime behavioral drift. We propose a twelve-step compliance architecture and a regulatory trigger mapping connecting agent actions to applicable legislation. We conclude that high-risk agentic systems with untraceable behavioral drift cannot currently satisfy the AI Act’s essential requirements, and that the provider’s foundational compliance task is an exhaustive inventory of the agent’s external actions, data flows, connected systems, and affected persons.
[MA-4] Modelling and Analysis of Supply Chains using Product Time Petri Nets
[Quick Read]: This paper addresses the temporal feasibility of supply chains whose geographically distributed manufacturing and assembly sites operate under strict timing constraints, and in particular how to model and analyze the timeouts or deadlocks caused by conflicts between timing constraints and the availability of resources, above all the supply chain manager. The key to the solution is a modular modelling approach based on Product Time Petri Nets (PTPNs): each subsystem is represented independently, global behaviour emerges through synchronised transition labels, and the supply chain manager is modelled explicitly as a critical shared, mobile resource, enabling systematic identification of feasible schedules, timeouts, and timelocks induced by incompatible timing constraints.
Link: https://arxiv.org/abs/2604.04544
Authors: Eric Lubat (Université Toulouse, Toulouse, France), Pierre-Emmanuel Hladik (Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, 44000, Nantes, France), Yoann Mateu, Rémi Sauvère
Affiliations: Unknown
Subjects: Systems and Control (eess.SY); Formal Languages and Automata Theory (cs.FL); Multiagent Systems (cs.MA)
Comments: In Proceedings MARS 2026, arXiv:2604.03053
Abstract: Supply chains involve geographically distributed manufacturing and assembly sites that must be coordinated under strict timing and resource constraints. While many existing approaches rely on Colored Petri Nets to model material flows, this work focuses on the temporal feasibility of supply chain processes. We propose a modular modelling approach based on Product Time Petri Nets (PTPNs), where each subsystem is represented independently and the global behaviour emerges through synchronised transition labels. A key feature of the model is the explicit representation of the supply chain manager as a critical shared and mobile resource, whose availability directly impacts system feasibility. We analyse how timing constraints and managerial capacity influence the system behaviour, identifying configurations that lead to successful executions, timeouts, or timelocks induced by incompatible timing constraints. This approach enables systematic what-if analysis of supply chain coordination policies and demonstrates the relevance of PTPNs for modelling and analysing synchronised timed systems.
Journal reference: EPTCS 443, 2026, pp. 23-39. Related DOI: https://doi.org/10.4204/EPTCS.443.3
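The timeout/timelock distinction above can be made concrete with a minimal check on a single timed transition. This is an illustrative toy model (a transition with a static firing interval and one shared resource), not the paper's PTPN formalism:

```python
# Toy Time Petri Net transition check: an enabled transition may fire only
# while its clock lies in the static interval [eft, lft] (earliest/latest
# firing time). If the clock passes lft while the shared resource (here,
# the manager) is elsewhere, the net is timelocked.

def can_fire(clock, enabled_at, eft, lft):
    """True when the transition's elapsed enabling time is in [eft, lft]."""
    elapsed = clock - enabled_at
    return eft <= elapsed <= lft

def is_timelocked(clock, enabled_at, lft, resource_available):
    """Past the latest firing time with the required resource unavailable."""
    return (clock - enabled_at) > lft and not resource_available

# An assembly step enabled at t=0 must fire within [2, 5] time units.
print(can_fire(clock=3, enabled_at=0, eft=2, lft=5))   # True: inside window
print(can_fire(clock=6, enabled_at=0, eft=2, lft=5))   # False: window missed
print(is_timelocked(clock=6, enabled_at=0, lft=5,
                    resource_available=False))          # True: timelock
```

In the product construction, many such transitions synchronise on shared labels, and a timelock in one component blocks the global product net.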
[MA-5] Statistical Model Checking of the Island Model: An Established Economic Agent-Based Model of Endogenous Growth
[Quick Read]: This paper addresses the lack of formal statistical guarantees in the analysis of agent-based models (ABMs) of complex economic phenomena: mechanisms such as the exploration-exploitation trade-off in technological search are typically studied through ad-hoc Monte Carlo exercises whose conclusions are hard to reproduce or state with confidence. The key to the solution is statistical model checking (SMC), in particular with the Multi-VeStA tool, which automates large batches of simulations and combines them with hypothesis tests (such as Welch's t-test) to quantify how parameter changes affect growth trajectories, enabling a rigorous, reproducible quantitative analysis of the classic Island Model ABM of Fagiolo and Dosi.
Link: https://arxiv.org/abs/2604.04543
Authors: Stefano Blando (Institute of Economics and L’EMbeDS, Sant’Anna School of Advanced Studies), Giorgio Fagiolo (Institute of Economics and L’EMbeDS, Sant’Anna School of Advanced Studies), Daniele Giachini (Institute of Economics and L’EMbeDS, Sant’Anna School of Advanced Studies), Andrea Vandin (Institute of Economics and L’EMbeDS, Sant’Anna School of Advanced Studies), Ernest Ivanaj (Swiss Finance Institute and University of Geneve)
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Logic in Computer Science (cs.LO)
Comments: In Proceedings MARS 2026, arXiv:2604.03053
Abstract:Agent-based models (ABMs) are increasingly used to study complex economic phenomena such as endogenous growth, but their analysis typically relies on ad-hoc Monte Carlo exercises without formal statistical guarantees. We show how statistical model checking (SMC), and in particular Multi-VeStA, can automate and enrich the analysis of a seminal ABM: the Island Model of Fagiolo and Dosi, which captures the exploration-exploitation trade-off in technological search. We reproduce key stylized facts from the original model with formal confidence intervals, confirm the optimality of moderate exploration rates, and perform a counterfactual sensitivity analysis across returns to scale, skill transfer, and knowledge locality. Using MultiVeStA’s built-in Welch’s t-test, 6 out of 7 pairwise parameter comparisons yield statistically different growth trajectories, while the exception reveals a saturation effect in knowledge locality. Our results demonstrate that SMC offers a principled, reproducible methodology for the quantitative analysis of agent-based economic models.
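The pairwise comparison machinery above rests on Welch's t-test, which can be computed from scratch in a few lines. The two growth-rate samples below are made up for illustration; only the statistic itself follows the standard formula:

```python
import math

# Welch's t-test on two simulated growth-rate samples, in the spirit of the
# MultiVeStA pairwise comparisons described above. Pure stdlib; the data are
# invented for illustration.

def welch_t(xs, ys):
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)   # sample variances
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    se = math.sqrt(vx / nx + vy / ny)
    t = (mx - my) / se
    # Welch-Satterthwaite degrees of freedom
    df = (vx / nx + vy / ny) ** 2 / (
        (vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

growth_a = [2.1, 2.3, 2.0, 2.4, 2.2]   # e.g. moderate exploration rate
growth_b = [1.5, 1.6, 1.4, 1.7, 1.5]   # e.g. pure exploitation
t, df = welch_t(growth_a, growth_b)
print(round(t, 2), round(df, 1))       # a large t signals different means
```

With a large |t| relative to the t-distribution with df degrees of freedom, the two parameterisations would be declared statistically different, as in 6 of the 7 comparisons the paper reports.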
[MA-6] HDP: A Lightweight Cryptographic Protocol for Human Delegation Provenance in Agentic AI Systems
[Quick Read]: This paper addresses the accountability gap created by agentic behavior in multi-agent systems: verifying that a terminal action was genuinely authorized by a human principal, through which delegation chain, and within what scope. The key to the solution is the Human Delegation Provenance (HDP) protocol, a lightweight token-based scheme that cryptographically captures and verifies human authorization context: an HDP token binds a human authorization event to a session and records each agent's delegation action as a signed hop in an append-only chain, so that any participant can verify the full provenance record offline using only the issuer's Ed25519 public key and the current session identifier, with no registry lookups or third-party trust anchors.
Link: https://arxiv.org/abs/2604.04522
Authors: Asiri Dalugoda
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
Comments: 12 pages, 1 figure. Introduces the Human Delegation Provenance (HDP) protocol for cryptographically verifiable human authorization in multi-agent AI systems. Open-source at this https URL (spec, schema, examples, TS SDK @helixar_ai /hdp on npm, Python integrations). Also IETF Internet-Draft draft-helixar-hdp-agentic-delegation-00 (March 2026). v0.1 open for review
Abstract:Agentic AI systems increasingly execute consequential actions on behalf of human principals, delegating tasks through multi-step chains of autonomous agents. No existing standard addresses a fundamental accountability gap: verifying that terminal actions in a delegation chain were genuinely authorized by a human principal, through what chain of delegation, and under what scope. This paper presents the Human Delegation Provenance (HDP) protocol, a lightweight token-based scheme that cryptographically captures and verifies human authorization context in multi-agent systems. An HDP token binds a human authorization event to a session, records each agent’s delegation action as a signed hop in an append-only chain, and enables any participant to verify the full provenance record using only the issuer’s Ed25519 public key and the current session identifier. Verification is fully offline, requiring no registry lookups or third-party trust anchors. We situate HDP within the existing landscape of delegation protocols, identify its distinct design point relative to OAuth 2.0 Token Exchange (RFC 8693), JSON Web Tokens (RFC 7519), UCAN, and the Intent Provenance Protocol (draft-haberkamp-ipp-00), and demonstrate that existing standards fail to address the multi-hop, append-only, human-provenance requirements of agentic systems. HDP has been published as an IETF Internet-Draft (draft-helixar-hdp-agentic-delegation-00) and a reference TypeScript SDK is publicly available.
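The append-only hop chain can be sketched with stdlib tools. Note the hedge: the real protocol signs hops with Ed25519, which is not in the Python standard library, so an HMAC-SHA256 keyed by the issuer stands in for the signature here; the field names are illustrative, not the draft's wire format:

```python
import hmac, hashlib, json

# Toy HDP-style delegation chain: each hop binds an agent action to the
# session and to the previous hop's signature, so tampering anywhere
# invalidates verification of the whole record.

SECRET = b"issuer-demo-key"   # stand-in for the issuer's signing key

def sign(payload: dict) -> str:
    msg = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def add_hop(chain, agent, action, session):
    prev = chain[-1]["sig"] if chain else ""
    hop = {"agent": agent, "action": action, "session": session, "prev": prev}
    hop["sig"] = sign(hop)
    chain.append(hop)
    return chain

def verify(chain, session):
    """Offline verification: recompute every hop against its predecessor."""
    prev = ""
    for hop in chain:
        body = {k: hop[k] for k in ("agent", "action", "session", "prev")}
        if hop["session"] != session or hop["prev"] != prev or sign(body) != hop["sig"]:
            return False
        prev = hop["sig"]
    return True

chain = []
add_hop(chain, "human", "authorize:refund", "sess-42")
add_hop(chain, "agent-1", "delegate:payments-bot", "sess-42")
print(verify(chain, "sess-42"))          # True: intact chain
chain[1]["action"] = "delegate:other"    # tampering breaks verification
print(verify(chain, "sess-42"))          # False
```

With Ed25519, verification needs only the public key, so any participant can check the chain without holding the signing secret; the HMAC variant above conflates the two roles purely to stay stdlib-only.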
[MA-7] Memory Intelligence Agent
[Quick Read]: This paper addresses two core problems that deep research agents (DRAs) face when using memory systems: existing methods rely on retrieving similar historical trajectories to aid reasoning but evolve memory ineffectively, and storage and retrieval costs grow as memory accumulates. The key to the solution is the Memory Intelligence Agent (MIA) framework, whose innovations include: a Manager-Planner-Executor architecture that coordinates parametric and non-parametric memory; alternating reinforcement learning to strengthen cooperation between the Planner and the Executor; test-time learning so the Planner updates on-the-fly during inference without interrupting the task; a bidirectional conversion loop between parametric and non-parametric memories for efficient memory evolution; and integrated reflection and unsupervised judgment mechanisms that boost autonomous reasoning and self-evolution in the open world.
Link: https://arxiv.org/abs/2604.04503
Authors: Jingyang Qiao, Weicheng Meng, Yu Cheng, Zhihang Lin, Zhizhong Zhang, Xin Tan, Jingyu Gong, Kun Shao, Yuan Xie
Affiliations: East China Normal University; Shanghai Innovation Institute; Harbin Institute of Technology; Xiamen University; Shanghai Artificial Intelligence Laboratory; Independent Researcher
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract: Deep research agents (DRAs) integrate LLM reasoning with external tools. Memory systems enable DRAs to leverage historical experiences, which are essential for efficient reasoning and autonomous evolution. Existing methods rely on retrieving similar trajectories from memory to aid reasoning, while suffering from key limitations of ineffective memory evolution and increasing storage and retrieval costs. To address these problems, we propose a novel Memory Intelligence Agent (MIA) framework, consisting of a Manager-Planner-Executor architecture. Memory Manager is a non-parametric memory system that can store compressed historical search trajectories. Planner is a parametric memory agent that can produce search plans for questions. Executor is another agent that can search and analyze information guided by the search plan. To build the MIA framework, we first adopt an alternating reinforcement learning paradigm to enhance cooperation between the Planner and the Executor. Furthermore, we enable the Planner to continuously evolve during test-time learning, with updates performed on-the-fly alongside inference without interrupting the reasoning process. Additionally, we establish a bidirectional conversion loop between parametric and non-parametric memories to achieve efficient memory evolution. Finally, we incorporate reflection and unsupervised judgment mechanisms to boost reasoning and self-evolution in the open world. Extensive experiments across eleven benchmarks demonstrate the superiority of MIA.
[MA-8] Explainable Autonomous Cyber Defense using Adversarial Multi-Agent Reinforcement Learning
[Quick Read]: This paper addresses the high false-alarm rates and inaccurate responses of current autonomous defense systems in complex network environments, which stem from reliance on correlation-based signals, a lack of structural constraints, and reasoning drift under ambiguous or adversarial inputs. The core of the solution is the Causal Multi-Agent Decision Framework (C-MADF): a Structural Causal Model (SCM) learned from historical telemetry is compiled into an investigation-level Directed Acyclic Graph (DAG) that defines admissible response transitions, formalized as a Markov Decision Process (MDP) with a restricted action space. On top of this, a dual-policy reinforcement learning scheme balances a threat-optimizing Blue-Team policy against a conservatively shaped Red-Team policy, quantifying policy divergence and using an Explainability-Transparency Score as an escalation signal under uncertainty, which markedly improves detection precision and robustness.
Link: https://arxiv.org/abs/2604.04442
Authors: Yiyao Zhang, Diksha Goel, Hussain Ahmad
Affiliations: Adelaide University; CSIRO’s Data61
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract:Autonomous agents are increasingly deployed in both offensive and defensive cyber operations, creating high-speed, closed-loop interactions in critical infrastructure environments. Advanced Persistent Threat (APT) actors exploit “Living off the Land” techniques and targeted telemetry perturbations to induce ambiguity in monitoring systems, causing automated defenses to overreact or misclassify benign behavior as malicious activity. Existing monolithic and multi-agent defense pipelines largely operate on correlation-based signals, lack structural constraints on response actions, and are vulnerable to reasoning drift under ambiguous or adversarial inputs. We present the Causal Multi-Agent Decision Framework (C-MADF), a structurally constrained architecture for autonomous cyber defense that integrates causal modeling with adversarial dual-policy control. C-MADF first learns a Structural Causal Model (SCM) from historical telemetry and compiles it into an investigation-level Directed Acyclic Graph (DAG) that defines admissible response transitions. This roadmap is formalized as a Markov Decision Process (MDP) whose action space is explicitly restricted to causally consistent transitions. Decision-making within this constrained space is performed by a dual-agent reinforcement learning system in which a threat-optimizing Blue-Team policy is counterbalanced by a conservatively shaped Red-Team policy. Inter-policy disagreement is quantified through a Policy Divergence Score and exposed via a human-in-the-loop interface equipped with an Explainability-Transparency Score that serves as an escalation signal under uncertainty. On the real-world CICIoT2023 dataset, C-MADF reduces the false-positive rate from 11.2%, 9.7%, and 8.4% in three cutting-edge literature baselines to 1.8%, while achieving 0.997 precision, 0.961 recall, and 0.979 F1-score.
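The idea of restricting the MDP's action space to DAG-admissible transitions can be shown in miniature. The investigation graph, state names, and fallback action below are invented for illustration, not C-MADF's learned structure:

```python
# Sketch of a causally constrained action space: a response is admissible
# only along edges of the investigation DAG, so the policy's proposal is
# masked by the current node before execution.

DAG = {
    "triage":            ["isolate_host", "escalate_human"],
    "isolate_host":      ["collect_forensics", "escalate_human"],
    "collect_forensics": ["restore", "escalate_human"],
    "escalate_human":    [],
    "restore":           [],
}

def admissible(state, proposed):
    return proposed in DAG[state]

def constrained_step(state, ranked_actions):
    """Take the highest-ranked action the DAG admits; escalate otherwise."""
    for a in ranked_actions:
        if admissible(state, a):
            return a
    return "escalate_human"   # human-in-the-loop fallback under uncertainty

# The policy prefers 'restore', but from 'triage' the DAG forbids skipping
# straight to recovery, so the mask selects 'isolate_host' instead.
print(constrained_step("triage", ["restore", "isolate_host"]))
```

This masking is what prevents reasoning drift from turning into causally inconsistent responses: whatever the dual policies disagree about, only DAG-consistent transitions can execute.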
[MA-9] FORMULA: FORmation MPC with neUral barrier Learning for safety Assurance
[Quick Read]: This paper addresses scalable, safety-aware formation control for multi-robot systems (MRS) in complex dynamic environments, in particular the limited scalability and provable safety of existing model predictive control (MPC) methods and the difficulty of handcrafting control barrier functions (CBFs) for large-scale nonlinear systems. The key to the solution is the FORMULA framework, which combines MPC with Control Lyapunov Functions (CLFs) for stability and uses neural-network-based CBFs for decentralized safety enforcement, guaranteeing formation integrity and obstacle avoidance without manual safety-constraint design while reducing online computational load and resolving deadlocks in dense configurations.
Link: https://arxiv.org/abs/2604.04409
Authors: Qintong Xie, Weishu Zhan, Peter Chin
Affiliations: Thayer School of Engineering, Dartmouth College; Department of Computer Science, University of Manchester
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments: Accepted to IEEE Intelligent Vehicles Symposium (IV) 2026
Abstract: Multi-robot systems (MRS) are essential for large-scale applications such as disaster response, material transport, and warehouse logistics, yet ensuring robust, safety-aware formation control in cluttered and dynamic environments remains a major challenge. Existing model predictive control (MPC) approaches suffer from limitations in scalability and provable safety, while control barrier functions (CBFs), though principled for safety enforcement, are difficult to handcraft for large-scale nonlinear systems. This paper presents FORMULA, a safe, distributed, learning-enhanced predictive control framework that integrates MPC with Control Lyapunov Functions (CLFs) for stability and neural network-based CBFs for decentralized safety, eliminating manual safety constraint design. This scheme maintains formation integrity during obstacle avoidance, resolves deadlocks in dense configurations, and reduces online computational load. Simulation results demonstrate that FORMULA enables scalable, safety-aware, formation-preserving navigation for multi-robot teams in complex environments.
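The role a CBF plays in such a controller can be illustrated with a one-dimensional safety filter. The toy dynamics, hand-written barrier, and braking fallback below are stand-ins for exposition; in FORMULA the barrier is a learned neural network inside an optimization-based controller:

```python
# Discrete-time CBF-style safety filter: accept the nominal step only if
# the barrier value h (distance to obstacle minus a margin) satisfies the
# standard decrease condition h(x') >= (1 - gamma) * h(x).

GAMMA = 0.3   # how fast h may shrink toward the safety boundary

def h(pos, obstacle=5.0, margin=1.0):
    """Barrier: nonnegative while the robot keeps the safety margin."""
    return abs(obstacle - pos) - margin

def is_safe(pos, pos_next):
    return h(pos_next) >= (1 - GAMMA) * h(pos)

def filter_action(pos, nominal_step):
    """Pass the nominal MPC step through the barrier check, else brake."""
    cand = pos + nominal_step
    return cand if is_safe(pos, cand) else pos + 0.1 * nominal_step

print(filter_action(0.0, 1.0))   # far from obstacle: h 4 -> 3 >= 2.8, accepted
print(filter_action(3.5, 1.0))   # h 0.5 -> -0.5 < 0.35, braked to 3.6
```

The decrease condition is what makes the safe set forward-invariant; the learning in the paper replaces the handcrafted h with a network trained to certify the multi-robot system.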
[MA-10] Optimizing Service Operations via LLM-Powered Multi-Agent Simulation
[Quick Read]: This paper addresses the difficulty of modeling how participants respond to design decisions when optimizing service-system performance, where design choices dynamically shape the outcome distribution (decision-dependent uncertainty) and traditional methods struggle to capture this feedback. The key to the solution is an LLM-powered multi-agent simulation (LLM-MAS) framework that embeds design parameters in prompts to steer agent interactions and models the uncertainty as a controlled Markov chain by extracting numerical information from LLM-generated text. On this basis, an on-trajectory algorithm constructs zeroth-order gradient estimates and updates design parameters within a single simulation run to optimize steady-state performance, with variance reduction techniques for efficiency. In sustainable supply chain and contest-design applications, the method outperforms black-box optimization and strategies that use LLMs as numerical solvers or role-playing designers.
Link: https://arxiv.org/abs/2604.04383
Authors: Yanyuan Wang, Xiaowei Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Optimization and Control (math.OC)
Comments:
Abstract: Service system performance depends on how participants respond to design choices, but modeling these responses is hard due to the complexity of human behavior. We introduce an LLM-powered multi-agent simulation (LLM-MAS) framework for optimizing service operations. We pose the problem as stochastic optimization with decision-dependent uncertainty: design choices are embedded in prompts and shape the distribution of outcomes from interacting LLM-powered agents. By embedding key numerical information in prompts and extracting it from LLM-generated text, we model this uncertainty as a controlled Markov chain. We develop an on-trajectory learning algorithm that, on a single simulation run, simultaneously constructs zeroth-order gradient estimates and updates design parameters to optimize steady-state performance. We also incorporate variance reduction techniques. In a sustainable supply chain application, our method outperforms benchmarks, including blackbox optimization and using LLMs as numerical solvers or as role-playing system designers. A case study on optimal contest design with real behavioral data shows that LLM-MAS is both a cost-effective evaluator of known designs and an exploratory tool that can uncover strong designs overlooked by traditional approaches.
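The zeroth-order update above can be sketched with a simultaneous-perturbation (SPSA-style) step against a noisy stand-in objective. The quadratic objective, noise level, and step sizes are assumptions for illustration; in the paper, evaluating `simulate` means running the LLM agent simulation with the design parameter embedded in the prompt:

```python
import random

# Minimal zeroth-order design optimization: perturb the design parameter,
# estimate a finite-difference gradient from noisy performance evaluations,
# and ascend. The quadratic stands in for the LLM-MAS simulator.

random.seed(0)

def simulate(theta):
    """Noisy stand-in for steady-state performance at design theta."""
    return -(theta - 3.0) ** 2 + random.gauss(0, 0.01)

def spsa_step(theta, c=0.1, a=0.05):
    delta = random.choice([-1.0, 1.0])              # random perturbation sign
    g = (simulate(theta + c * delta)
         - simulate(theta - c * delta)) / (2 * c * delta)
    return theta + a * g                            # gradient ascent

theta = 0.0
for _ in range(200):
    theta = spsa_step(theta)
print(round(theta, 1))   # converges near the optimum at 3.0
```

The paper's on-trajectory variant is more economical still: it reuses a single simulation run for both the gradient estimate and the parameter update, rather than paying two fresh evaluations per step as this sketch does.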
[MA-11] Soft Tournament Equilibrium
[Quick Read]: This paper addresses non-transitive interactions in the evaluation of general-purpose artificial agents: when agent A beats B, B beats C, and C beats A, traditional evaluations that force a linear ranking are misleading and unstable. The key to the solution is Soft Tournament Equilibrium (STE), a differentiable framework for learning and computing set-valued tournament solutions directly from pairwise comparison data. STE first learns a probabilistic tournament model (possibly conditioned on rich context), then applies novel differentiable operators for soft reachability and soft covering to obtain continuous analogues of two classical tournament solutions, the Top Cycle and the Uncovered Set; the output is a set of core agents with calibrated membership scores, providing a more nuanced and robust assessment of agent capability.
Link: https://arxiv.org/abs/2604.04328
Authors: Saad Alqithami
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract: The evaluation of general-purpose artificial agents, particularly those based on large language models, presents a significant challenge due to the non-transitive nature of their interactions. When agent A defeats B, B defeats C, and C defeats A, traditional ranking methods that force a linear ordering can be misleading and unstable. We argue that for such cyclic domains, the fundamental object of evaluation should not be a ranking but a set-valued core, as conceptualized in classical tournament theory. This paper introduces Soft Tournament Equilibrium (STE), a differentiable framework for learning and computing set-valued tournament solutions directly from pairwise comparison data. STE first learns a probabilistic tournament model, potentially conditioned on rich contextual information. It then employs novel, differentiable operators for soft reachability and soft covering to compute continuous analogues of two seminal tournament solutions: the Top Cycle and the Uncovered Set. The output is a set of core agents, each with a calibrated membership score, providing a nuanced and robust assessment of agent capabilities. We develop the theoretical foundation for STE, proving its consistency with classical solutions in the zero-temperature limit (which establishes its Condorcet-inclusion properties) and analyzing its stability and sample complexity. We specify an experimental protocol for validating STE on both synthetic and real-world benchmarks. This work aims to provide a complete, standalone treatise that re-centers general-agent evaluation on a more appropriate and robust theoretical foundation, moving from unstable rankings to stable, set-valued equilibria.
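The zero-temperature limit mentioned above recovers the classical (hard) Top Cycle, which is easy to compute directly. This sketch shows that hard limit on a 3-cycle plus one dominated agent; STE's contribution is replacing the boolean reachability steps below with differentiable relaxations, which this sketch does not attempt:

```python
# Classical Top Cycle via transitive closure of the 'beats' relation:
# the set of agents from which every other agent is reachable along
# beating paths. This is the hard (zero-temperature) limit of STE's
# soft reachability operator.

def top_cycle(beats):
    agents = sorted(beats)
    # Warshall-style transitive closure (k must be the outer loop)
    reach = {a: set(beats[a]) | {a} for a in agents}
    for k in agents:
        for a in agents:
            if k in reach[a]:
                reach[a] |= reach[k]
    # Top Cycle: agents that reach every agent
    return {a for a in agents if all(b in reach[a] for b in agents)}

# A beats B, B beats C, C beats A (a 3-cycle); all three beat D.
beats = {"A": {"B", "D"}, "B": {"C", "D"}, "C": {"A", "D"}, "D": set()}
print(sorted(top_cycle(beats)))  # ['A', 'B', 'C']: the core, excluding D
```

A linear ranking would have to break the A-B-C cycle arbitrarily; the set-valued answer {A, B, C} is the stable object STE's soft scores approximate.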
[MA-12] Decentralized Ergodic Coverage Control in Unknown Time-Varying Environments
[Quick Read]: This paper addresses efficient adaptive coverage by UAVs in unknown, time-varying disaster environments, where the core challenge is balancing exploration of unobserved regions with sustained monitoring of changing Regions of Interest (ROIs). The key to the solution is a decentralized multi-agent coverage framework in which each agent computes an adaptive ergodic policy implemented via a Markov-chain transition model, with Gaussian Processes performing online updates of the belief over the importance map. The resulting trajectories let UAVs dwell in ROIs in proportion to their estimated importance while preserving enough exploration to detect and respond to environmental change; unlike prior work, the method needs no known importance distribution or central coordination and handles partially observable, dynamic settings.
Link: https://arxiv.org/abs/2604.04280
Authors: Maria G. Mendoza, Victoria Marie Tuck, Chinmay Maheshwari, Shankar Sastry
Affiliations: University of California, Berkeley; University of Pennsylvania; Johns Hopkins University
Subjects: Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: 17 pages, 6 figures
Abstract:A key challenge in disaster response is maintaining situational awareness of an evolving landscape, which requires balancing exploration of unobserved regions with sustained monitoring of changing Regions of Interest (ROIs). Unmanned Aerial Vehicles (UAVs) have emerged as an effective response tool, particularly in applications like environmental monitoring and search-and-rescue, due to their ability to provide aerial coverage, withstand hazardous conditions, and navigate quickly and flexibly. However, efficient and adaptable multi-robot coverage with limited sensing in disaster settings and evolving time-varying information maps remains a significant challenge, necessitating better methods for UAVs to continuously adapt their trajectories in response to changes. In this paper, we propose a decentralized multi-agent coverage framework that serves as a high-level planning strategy for adaptive coverage in unknown, time-varying environments under partial observability. Each agent computes an adaptive ergodic policy, implemented via a Markov-chain transition model, that tracks a continuously updated belief over the underlying importance map. Gaussian Processes are used to perform those online belief updates. The resulting policy drives agents to spend time in ROIs proportional to their estimated importance, while preserving sufficient exploration to detect and adapt to time-varying environmental changes. Unlike existing approaches that assume known importance maps, require centralized coordination, or assume a static environment, our framework addresses the combined challenges of unknown, time-varying distributions in a more realistic decentralized and partially observable setting. We compare against alternative coverage strategies and analyze our method’s response to simulated disaster evolution, highlighting its improved adaptability and transient performance in dynamic scenarios.
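The core mechanism, a Markov-chain policy whose long-run visit frequencies track the estimated importance map while keeping an exploration floor, can be shown on a 1-D grid. The grid size, belief values, and mixing weight below are illustrative assumptions, not the paper's model:

```python
import random

# Toy ergodic high-level policy: move to neighboring cells with probability
# proportional to the current importance belief, mixed with a uniform
# exploration term, and check that long-run visit frequencies concentrate
# on the high-importance cell.

random.seed(1)
importance = [0.1, 0.1, 0.6, 0.2]    # current belief over 4 grid cells

def step(pos, belief, explore=0.1):
    nbrs = [max(pos - 1, 0), pos, min(pos + 1, len(belief) - 1)]
    # importance-weighted choice with a uniform exploration floor
    w = [(1 - explore) * belief[n] + explore / len(belief) for n in nbrs]
    return random.choices(nbrs, weights=w)[0]

visits = [0] * 4
pos = 0
for _ in range(20000):
    pos = step(pos, importance)
    visits[pos] += 1
freq = [v / sum(visits) for v in visits]
print([round(f, 2) for f in freq])   # most time in the high-importance cell
```

In the paper, the belief itself is re-estimated online with Gaussian Processes, so the transition weights drift as the environment changes; here the map is frozen only to make the stationary behavior visible.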
[MA-13] Governance-Constrained Agentic AI: Blockchain-Enforced Human Oversight for Safety-Critical Wildfire Monitoring
[Quick Read]: This paper addresses three core problems in current AI-based wildfire early-detection systems: the absence of adaptive multi-agent coordination, the lack of structurally defined human control, and unverifiable accountability, which in safety-critical disaster settings can lead to rampant false alarms, governance failure, and loss of trust in the system. The key to the solution is a blockchain-based, governance-aware agentic AI architecture that embeds human authorization as a state-transition invariant enforced by smart contracts, and models wildfire monitoring as a constrained partially observable Markov decision process (POMDP) under governance constraints; unmanned aerial vehicles (UAVs) are reallocated adaptively to dynamic risk, reducing false alarms and resource consumption while preserving timely response. The design yields formal guarantees of alert integrity, human control, non-repudiation, and bounded detection latency under Byzantine faults, markedly improving system reliability and trustworthiness.
Link: https://arxiv.org/abs/2604.04265
Authors: Ali Akarma, Toqeer Ali Syed, Salman Jan, Hammad Muneer, Abdul Khadar Jilani
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: This paper was presented at ICETAS 2026 Bahrain
Abstract: AI-based sensing and autonomous monitoring have become main components of wildfire early detection, but current systems do not provide adaptive inter-agent coordination, structurally defined human control, or cryptographically verifiable responsibility. Purely autonomous alert dissemination in safety-critical disasters poses threats of false alarms, governance failure, and loss of trust in the system. This paper provides a blockchain-based, governance-conscious agentic AI architecture for trusted wildfire early warning. Wildfire monitoring is modeled as a constrained partially observable Markov decision process (POMDP) that accounts for detection latency, false-alarm reduction, and resource consumption under explicit governance constraints. Hierarchical multi-agent coordination enables dynamic risk-adaptive reallocation of unmanned aerial vehicles (UAVs). Alongside the risk-adaptive policies, a permissioned blockchain layer enforces mandatory human authorization as a state-transition invariant via a smart contract. We establish formal assurances including alert integrity, human control, non-repudiation, and bounded detection latency under Byzantine fault assumptions. Security analysis shows resistance to alert injection, replay, and tampering attacks. Experimental evaluation of governance enforcement in a high-fidelity simulation environment demonstrates limited operational overhead, fewer false public alerts, and sustained adaptive detection performance. This work is a step towards a principled design paradigm for reliable AI systems, incorporating accountability into the agentic control loop of safety-critical disaster intelligence systems.
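The "human authorization as a state-transition invariant" idea can be illustrated with a small guard, written in the spirit of a smart-contract check. The state names, allowed transitions, and token format are invented for illustration:

```python
# Toy state-transition invariant: entering the public-alert state is valid
# only when a human authorization token is present, mirroring the
# smart-contract enforcement described above.

ALLOWED = {("monitoring", "suspected"),
           ("suspected", "confirmed"),
           ("confirmed", "public_alert")}
REQUIRES_HUMAN = {"public_alert"}

def transition(state, nxt, human_token=None):
    if (state, nxt) not in ALLOWED:
        raise ValueError(f"illegal transition {state} -> {nxt}")
    if nxt in REQUIRES_HUMAN and human_token is None:
        raise PermissionError("human authorization required")
    return nxt

s = transition("monitoring", "suspected")
s = transition(s, "confirmed")
try:
    transition(s, "public_alert")             # no token: blocked
except PermissionError as e:
    print("blocked:", e)
print(transition(s, "public_alert", human_token="sig:ops-lead"))
```

On a permissioned chain, the same check runs inside the contract and its outcome is recorded immutably, which is what turns the invariant into a non-repudiable governance guarantee rather than an application-level convention.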
[MA-14] Agents for Agents: An Interrogator-Based Secure Framework for Autonomous Internet of Underwater Things
【速读】: This paper addresses the vulnerability of long-duration underwater multi-agent missions under static trust mechanisms, where trust is never re-evaluated after authentication, making compromised or behaviorally deviating agents hard to identify and isolate in time. The key is an interrogator-based structure in which a privileged interrogator module acts as a passive analyzer of communication metadata, using a lightweight transformer model to compute dynamic trust scores in real time and authorize the forwarding of mission-critical data; trust evidence is stored on a permissioned blockchain consortium for decentralized, tamper-resistant identity management. This achieves behavior-driven trust validation and fast containment without interfering with autonomy, improving detection accuracy by 21.7% relative to static baselines while preserving network scalability and continuity.
链接: https://arxiv.org/abs/2604.04262
作者: Ali Akarma,Toqeer Ali Syed,Abdul Khadar Jilani,Salman Jan,Hammad Muneer,Muazzam A. Khan,Changli Yu
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: This paper was presented at ICETAS 2026 in Bahrain
Abstract:Autonomous underwater vehicles (AUVs) and sensor nodes increasingly support decentralized sensing and coordination in the Internet of Underwater Things (IoUT), yet most deployments rely on static trust once authentication is established, leaving long-duration missions vulnerable to compromised or behaviorally deviating agents. In this paper, an interrogator-based structure is presented that incorporates behavioral trust monitoring into underwater multi-agent operation without interfering with autonomy. A privileged interrogator module passively analyzes communication metadata and uses a lightweight transformer model to calculate dynamic trust scores, which authorize the forwarding of mission-critical data. Suspicious agents trigger proportional monitoring and conditional restrictions, enabling fast containment while maintaining network continuity. Trust evidence is stored in a permissioned blockchain consortium, which offers tamper-resistant, decentralized identity management without the overhead of public consensus mechanisms. Simulation-based evaluation shows a relative improvement of 21.7% in detection accuracy over static trust baselines with limited energy overhead. These findings suggest that behavior-driven validation can reinforce underwater coordination without compromising scalability and deployment.
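A minimal sketch of dynamic trust scoring with proportional monitoring and conditional restriction, assuming an exponential moving average over per-message behavior scores and hand-picked thresholds (all names and constants are illustrative; the paper uses a learned transformer to produce the scores):

```python
from dataclasses import dataclass, field

@dataclass
class TrustMonitor:
    """Passive trust scoring over communication metadata (illustrative)."""
    alpha: float = 0.2           # EMA smoothing factor (assumed)
    restrict_below: float = 0.4  # forwarding cut-off (assumed)
    watch_below: float = 0.7     # heightened-monitoring band (assumed)
    scores: dict = field(default_factory=dict)

    def update(self, agent_id: str, behavior_score: float) -> str:
        # behavior_score in [0, 1] would come from the transformer model.
        prev = self.scores.get(agent_id, 1.0)   # new agents start trusted
        s = (1 - self.alpha) * prev + self.alpha * behavior_score
        self.scores[agent_id] = s
        if s < self.restrict_below:
            return "restrict"   # withhold mission-critical forwarding
        if s < self.watch_below:
            return "monitor"    # proportional extra monitoring
        return "trusted"
```

A deviating agent degrades gradually from "trusted" through "monitor" to "restrict", which is the containment-without-disruption behavior the abstract describes.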
[MA-15] Three Phases of Expert Routing: How Load Balance Evolves During Mixture-of-Experts Training
【速读】: This paper addresses the unclear dynamics of token routing in Mixture-of-Experts (MoE) models, in particular how to quantify and understand the tradeoff between load balance and expert quality during training. The key idea is to model MoE token routing as a congestion game, introducing the effective congestion coefficient $\gamma_{\text{eff}}$ as a single effective parameter capturing this tradeoff. Tracking its three-phase trajectory across training (a surge phase, a stabilization phase, and a relaxation phase) reveals a non-monotone pattern: early training prioritizes load balance while late training shifts towards expert quality. The framework also explains the behavior of temperature-scaled softmax routing, and its explanatory power for load prediction and training dynamics is validated through an effective congestion decomposition, a multi-type extension, and multi-dimensional diagnostics.
链接: https://arxiv.org/abs/2604.04230
作者: Charafeddine Mouzouni
机构: OPIT – Open Institute of Technology (开放技术研究所); Cohorte AI (科赫特人工智能)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:We model Mixture-of-Experts (MoE) token routing as a congestion game with a single effective parameter, the congestion coefficient gamma_eff, that quantifies the balance-quality tradeoff. Tracking gamma_eff across training checkpoints of two open-source MoE models, OLMoE-1B-7B (20 checkpoints, with dense sampling in the surge region) and OpenMoE-8B (6 checkpoints), reveals a three-phase trajectory: a surge phase where the router learns to balance load (gamma_eff: 14 to 36-39, peaking in the step 30K-40K region), a stabilization phase where experts specialize under steady balance (B_0: 2.4 to 2.3, steps 100K-400K), and a relaxation phase where the router trades balance for quality as experts differentiate (gamma_eff: 27 to 9, steps 400K-1.2M). This non-monotone trajectory, invisible to post-hoc analysis of converged models, reveals that early MoE training prioritizes balance while late training prioritizes quality. The theoretical framework is honest about its limits: the single-type equilibrium reduces to temperature-scaled softmax (held-out L1: MFG = 0.199 vs. softmax = 0.200). The game is not a better predictor; it reveals what the temperature means and, critically, how that temperature evolves. We complement the dynamics with an effective congestion decomposition, a multi-type extension that improves load prediction via token clustering on all 16 layers (mean: 30%), scope diagnostics (K/M, epsilon_l), and robustness verification across four independent quality estimators (r = 0.89). All confidence intervals are from bootstrap resampling over 50 independent text batches.
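One illustrative reading of the congestion game: routing probabilities follow a softmax over expert quality penalized by load, with the congestion coefficient controlling how strongly popular experts are avoided. The functional form and numbers below are my own toy example, not the paper's exact equilibrium equations:

```python
import numpy as np

def congestion_routing(quality: np.ndarray, load: np.ndarray, gamma: float) -> np.ndarray:
    """Routing distribution trading expert quality against congestion.

    Illustrative form: p(e) ~ exp(quality_e - gamma * load_e).
    High gamma pushes tokens toward underloaded experts (balance);
    gamma near zero routes greedily by quality.
    """
    logits = quality - gamma * load
    logits = logits - logits.max()   # numerical stability
    p = np.exp(logits)
    return p / p.sum()

quality = np.array([2.0, 1.0, 1.0, 0.5])       # hypothetical expert quality scores
load    = np.array([0.55, 0.20, 0.20, 0.05])   # current token share per expert
surge      = congestion_routing(quality, load, gamma=36.0)  # balance-dominated phase
relaxation = congestion_routing(quality, load, gamma=0.0)   # quality-dominated phase
```

With high gamma the mass shifts to the least-loaded expert; with gamma = 0 it concentrates on the highest-quality one, mirroring the surge-to-relaxation trajectory described in the abstract.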
[MA-16] Agentization of Digital Assets for the Agentic Web: Concepts, Techniques and Benchmark
【速读】: This paper addresses the lack of automated methods for turning digital assets into agents in the Agentic Web, which limits the functional activation of digital assets and cross-agent collaboration. The core challenge is converting static digital assets into goal-driven agents while ensuring interoperability and fidelity in multi-agent settings. The key solution is a formal framework, the A2A-Agentization process, on top of which an Agentization Agent is built to automate the conversion; the authors further construct A2A-Agentization Bench, the first benchmark dedicated to evaluating agentization quality along the dimensions of fidelity and interoperability, thereby promoting scalable, standardized integration of digital assets into the Agentic Web ecosystem.
链接: https://arxiv.org/abs/2604.04226
作者: Linyao Chen,Bo Huang,Qinlao Zhao,Shuai Shao,Zhi Han,Zicai Cui,Ziheng Zhang,Guangtao Zeng,Wenzheng Tang,Yikun Wang,Yuanjian Zhou,Zimian Peng,Yong Yu,Weiwen Liu,Hiroki Kobayashi,Weinan Zhang
机构: Shanghai Jiao Tong University (上海交通大学); The University of Tokyo (东京大学); Huazhong University of Science and Technology (华中科技大学); Shanghai Innovation Institute (上海创新研究院); Nankai University (南开大学); Singapore University of Technology and Design (新加坡科技设计大学); Queen’s University (皇后大学); Fudan University (复旦大学); Zhejiang University (浙江大学)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
Abstract:Agentic Web, as a new paradigm that redefines the internet through autonomous, goal-driven interactions, plays an important role in group intelligence. As the foundational semantic primitives of the Agentic Web, digital assets encapsulate interactive web elements into agents, which expand the capacities and coverage of agents in agentic web. The lack of automated methodologies for agent generation limits the wider usage of digital assets and the advancement of the Agentic Web. In this paper, we first formalize these challenges by strictly defining the A2A-Agentization process, decomposing it into critical stages and identifying key technical hurdles on top of the A2A protocol. Based on this framework, we develop an Agentization Agent to agentize digital assets for the Agentic Web. To rigorously evaluate this capability, we propose A2A-Agentization Bench, the first benchmark explicitly designed to evaluate agentization quality in terms of fidelity and interoperability. Our experiments demonstrate that our approach effectively activates the functional capabilities of digital assets and enables interoperable A2A multi-agent collaboration. We believe this work will further facilitate scalable and standardized integration of digital assets into the Agentic Web ecosystem.
[MA-17] Element-based Formation Control: a Unified Perspective from Continuum Mechanics
【速读】: This paper addresses the unified modeling and control-law design problem for multi-agent formation control, where traditional methods rely on geometric constraints defined on graph edges and struggle to handle multiple geometric invariances (translation, rotation, scaling, and affine transformations) in a unified way. The key is to introduce the deformation gradient from continuum mechanics, modeling the formation as a discrete elastic body composed of simplicial elements and defining a generalized distortion energy on the local deformation gradient tensor; from this, a family of distributed control laws is derived that exactly enforces various geometric invariances. The framework theoretically unifies rigidity-based and Laplacian-based methods, revealing their intrinsic connection to energy minimization and providing a general, theoretically grounded paradigm for formation control.
链接: https://arxiv.org/abs/2604.04027
作者: Kun Cao,Lihua Xie
机构: Tongji University (同济大学); National Key Laboratory of Autonomous Intelligent Unmanned Systems; Frontiers Science Center for Intelligent Autonomous Systems (智能自主系统前沿科学中心); Nanyang Technological University (南洋理工大学)
类目: Systems and Control (eess.SY); Multiagent Systems (cs.MA); Robotics (cs.RO); Optimization and Control (math.OC)
备注: 14 pages, 4 figures
Abstract:This paper establishes a unified element-based framework for formation control by introducing the concept of the deformation gradient from continuum mechanics. Unlike traditional methods that rely on geometric constraints defined on graph edges, we model the formation as a discrete elastic body composed of simplicial elements. By defining a generalized distortion energy based on the local deformation gradient tensor, we derive a family of distributed control laws that can enforce various geometric invariances, including translation, rotation, scaling, and affine transformations. The convergence properties and the features of the proposed controllers are analyzed in detail. Theoretically, we show that the proposed framework serves as a bridge between existing rigidity-based and Laplacian-based approaches. Specifically, we show that rigidity-based controllers are mathematically equivalent to minimizing specific projections of the deformation energy tensor. Furthermore, we establish a rigorous link between the proposed energy minimization and Laplacian-based formation control. Numerical simulations in 2D and 3D validate the effectiveness and the unified nature of the proposed framework.
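The deformation gradient of a single simplicial element, and one possible distortion energy (distance to the closest rotation via polar decomposition), can be sketched as follows. The energy choice here is a standard isotropic one and is illustrative, not necessarily the paper's exact form:

```python
import numpy as np

def deformation_gradient(ref: np.ndarray, cur: np.ndarray) -> np.ndarray:
    """Deformation gradient F of one simplicial element (triangle in 2D).

    ref, cur: (3, 2) arrays of reference and current vertex positions.
    F maps reference edge vectors to current edge vectors: Dc = F @ Dr.
    """
    Dr = (ref[1:] - ref[0]).T   # 2x2 matrix of reference edges
    Dc = (cur[1:] - cur[0]).T   # 2x2 matrix of current edges
    return Dc @ np.linalg.inv(Dr)

def distortion_energy(F: np.ndarray) -> float:
    """Isotropic distortion ||F - R||_F^2 against the closest rotation R."""
    U, _, Vt = np.linalg.svd(F)
    R = U @ Vt
    if np.linalg.det(R) < 0:    # keep a proper rotation (det = +1)
        U[:, -1] *= -1
        R = U @ Vt
    return float(np.sum((F - R) ** 2))
```

A pure rotation of the element yields zero energy (rotation invariance), while a uniform scaling is penalized; controllers for other invariance groups would swap in a different energy.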
[MA-18] Ledger-State Stigmergy: A Formal Framework for Indirect Coordination Grounded in Distributed Ledger State
【速读】: This paper addresses how autonomous software agents in decentralized systems coordinate indirectly through distributed ledger state. Whereas traditional multi-agent systems rely on direct message passing, blockchain agents perceive environmental change by reading shared ledger state (balances, contract storage, and event logs) and act on it, which is essentially trace-based coordination, i.e., stigmergy. The key contribution is an application-layer definition, "indirect coordination grounded in ledger state", that maps Grassé's biological stigmergy onto distributed ledger technology (DLT), together with a state-transition formalism identifying three recurring on-chain coordination patterns (State-Flag, Event-Signal, Threshold-Trigger) and a Commit-Reveal sequencing overlay. This gives the application layer a reusable vocabulary, a formal mapping, and design guidance for building decentralized coordination on shared state rather than on centralized orchestration or point-to-point messaging.
链接: https://arxiv.org/abs/2604.03997
作者: Fernando Paredes García
机构: Independent Researcher
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
备注: 15 pages, 1 figure. Also archived at Zenodo DOI: https://doi.org/10.5281/zenodo.19425884 . Companion foundations preprint DOI: https://doi.org/10.5281/zenodo.19199497
Abstract:Autonomous software agents on blockchains solve distributed-coordination problems by reading shared ledger state instead of exchanging direct messages. Liquidation keepers, arbitrage bots, and other autonomous on-chain agents watch balances, contract storage, and event logs; when conditions change, they act. The ledger therefore functions as a replicated shared-state medium through which decentralized agents coordinate indirectly. This form of indirect coordination mirrors what Grassé called stigmergy in 1959: organisms coordinating through traces left in a shared environment, with no central plan. Stigmergy has mature formalizations in swarm intelligence and multi-agent systems, and on-chain agents already behave stigmergically in practice, but no prior application-layer framework cleanly bridges the two. We introduce Indirect coordination grounded in ledger state (Coordinación indirecta basada en el estado del registro contable) as a ledger-specific applied definition that maps Grassé's mechanism onto distributed ledger technology. We operationalize this with a state-transition formalism, identify three recurring base on-chain coordination patterns (State-Flag, Event-Signal, Threshold-Trigger) together with a Commit-Reveal sequencing overlay, and work through a State-Flag task-board example to compare ledger-state coordination analytically with off-chain messaging and centralized orchestration. The contribution is a reusable vocabulary, a ledger-specific formal mapping, and design guidance for decentralized coordination over replicated shared state at the application layer.
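A toy, off-chain rendering of the State-Flag task-board pattern, with a Python dict standing in for contract storage (the class and flag names are mine; the point is that agents coordinate only through traces in the shared state, never through messages):

```python
class TaskBoard:
    """State-Flag pattern: the only coordination medium is shared state."""
    def __init__(self, tasks):
        self.flags = {t: "open" for t in tasks}   # replicated ledger state

    def claim(self, task: str, agent: str) -> bool:
        # An agent acts only on what it reads from the shared state.
        if self.flags.get(task) == "open":
            self.flags[task] = f"claimed:{agent}"  # trace left for others
            return True
        return False                               # flag already set: back off

board = TaskBoard(["t1", "t2"])
a_got = board.claim("t1", "agent_a")
b_got = board.claim("t1", "agent_b")   # reads agent_a's trace and yields
```

On a real ledger the claim would be a contract transaction, with the chain's ordering resolving races between concurrent claimers.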
[MA-19] Symbolic-Vector Attention Fusion for Collective Intelligence
【速读】: This paper addresses the signal-mixing problem in multi-agent systems caused by agents observing different domains of a shared environment, where a receiver cannot assess which dimensions of a signal are worth absorbing. The key solution is Symbolic-Vector Attention Fusion (SVAF), which decomposes each inter-agent signal into 7 typed semantic fields, evaluates each field at the content level through a learned fusion gate, and produces a "remix" of new knowledge from the intersection of two domains. SVAF further uses a band-pass model distinguishing four outcomes (redundant, aligned, guarded, rejected), achieving selectivity and de-redundancy simultaneously. The mechanism independently discovers a cross-domain relevance hierarchy (the mood field receives the highest weight by the first training epoch) and works in tandem with a per-agent Closed-form Continuous-time (CfC) neural network, whose learned per-neuron time constants (tau) modulate the evolution of cognitive state, forming a complete cognitive loop from information fusion to state update to collective behavior.
链接: https://arxiv.org/abs/2604.03955
作者: Hongwei Xu
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: 26 pages, 14 tables, 0 figures
Abstract:When autonomous agents observe different domains of a shared environment, each signal they exchange mixes relevant and irrelevant dimensions. No existing mechanism lets the receiver evaluate which dimensions to absorb. We introduce Symbolic-Vector Attention Fusion (SVAF), the content-evaluation half of a two-level coupling engine for collective intelligence. SVAF decomposes each inter-agent signal into 7 typed semantic fields, evaluates each through a learned fusion gate, and produces a remix – new knowledge from the intersection of two domains. A band-pass model yields four outcomes (redundant, aligned, guarded, rejected), solving both selectivity and redundancy. The fusion gate independently discovers a cross-domain relevance hierarchy: mood emerges as the highest-weight field by epoch 1, before accuracy plateaus – consistent with independent mechanistic evidence that LLM emotion representations are structurally embedded along valence-arousal axes. SVAF forms Layer 4 of the Mesh Memory Protocol (MMP); the other half of the coupling engine is a per-agent Closed-form Continuous-time (CfC) neural network at Layer 6, whose learned per-neuron time constants (tau) create the temporal dynamics from which collective intelligence emerges: fast neurons synchronise affect across agents in seconds, while slow neurons preserve domain expertise indefinitely. SVAF determines what enters each agent’s cognitive state; CfC determines how that state evolves. Trained on 237K samples from 273 narrative scenarios, SVAF achieves 78.7% three-class accuracy. We verify the complete mesh cognition loop – from per-field evaluation through remix, CfC state evolution, tau-modulated peer blending, and autonomous action – in a live deployment with 7 nodes across macOS, iOS, and web.
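The four-outcome band-pass model maps a field-level similarity score to an action; the thresholds below are invented for illustration, since SVAF learns its gate rather than fixing cut-offs:

```python
def bandpass_outcome(similarity: float,
                     low: float = 0.2, guard: float = 0.5, high: float = 0.9) -> str:
    """Map a field-level similarity score to the four SVAF outcomes.

    Thresholds (low, guard, high) are illustrative hand-picked values.
    """
    if similarity >= high:
        return "redundant"   # already known: absorbing adds nothing
    if similarity >= guard:
        return "aligned"     # compatible: fuse into a remix
    if similarity >= low:
        return "guarded"     # partially relevant: absorb with damping
    return "rejected"        # irrelevant or conflicting: drop
```

The "band-pass" intuition is that useful signals sit in the middle band: too similar is redundant, too dissimilar is noise.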
[MA-20] DC-Ada: Reward-Only Decentralized Observation-Interface Adaptation for Heterogeneous Multi-Robot Teams
【速读】: This paper addresses the sharp performance degradation of pretrained shared policies when multi-robot teams are deployed with heterogeneous sensing (differences in sensor types, ranges, fields of view, and failure modes). The key of DC-Ada, a reward-only decentralized adaptation method, is to keep the pretrained shared policy frozen and instead learn compact per-robot observation transforms that map heterogeneous sensing into a fixed inference interface; the method is gradient-free and communication-minimal, using budgeted accept/reject random search with short common-random-number rollouts under a strict step budget to achieve efficient adaptation.
链接: https://arxiv.org/abs/2604.03905
作者: Saad Alqithami
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Heterogeneity is a defining feature of deployed multi-robot teams: platforms often differ in sensing modalities, ranges, fields of view, and failure patterns. Controllers trained under nominal sensing can degrade sharply when deployed on robots with missing or mismatched sensors, even when the task and action interface are unchanged. We present DC-Ada, a reward-only decentralized adaptation method that keeps a pretrained shared policy frozen and instead adapts compact per-robot observation transforms to map heterogeneous sensing into a fixed inference interface. DC-Ada is gradient-free and communication-minimal: it uses budgeted accept/reject random search with short common-random-number rollouts under a strict step budget. We evaluate DC-Ada against four baselines in a deterministic 2D multi-robot simulator covering warehouse logistics, search and rescue, and collaborative mapping, across four heterogeneity regimes (H0–H3) and five seeds with a matched budget of 200,000 joint environment steps per run. Results show that heterogeneity can substantially degrade a frozen shared policy and that no single mitigation dominates across all tasks and metrics. Observation normalization is strongest for reward robustness in warehouse logistics and competitive in search and rescue, while the frozen shared policy is strongest for reward in collaborative mapping. DC-Ada offers a useful complementary operating point: it improves completion most clearly in severe coverage-based mapping while requiring only scalar team returns and no policy fine-tuning or persistent communication. These results position DC-Ada as a practical deploy-time adaptation method for heterogeneous teams.
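A minimal sketch of the budgeted accept/reject random search with common random numbers, assuming the observation transform is a simple per-dimension scaling parameterized by `theta` and that `evaluate` returns a scalar team return for a short rollout (the interface and names are illustrative, not DC-Ada's actual API):

```python
import random

def dc_ada_search(evaluate, dim: int, budget: int = 50, sigma: float = 0.1, seed: int = 0):
    """Gradient-free accept/reject search over an observation transform.

    evaluate(theta, rollout_seed) -> scalar team return. Common random
    numbers: candidate and incumbent are rolled out with the same seed,
    so the comparison isolates the effect of the transform.
    """
    rng = random.Random(seed)
    theta = [1.0] * dim                  # identity-like transform to start
    best = evaluate(theta, rollout_seed=0)
    for step in range(budget):
        cand = [w + rng.gauss(0.0, sigma) for w in theta]
        r_cand = evaluate(cand, rollout_seed=step)    # short CRN rollout
        r_inc = evaluate(theta, rollout_seed=step)    # incumbent, same seed
        if r_cand > r_inc:                            # accept, else reject
            theta, best = cand, r_cand
    return theta, best
```

Only scalar returns cross the interface, matching the "reward-only, communication-minimal" framing in the abstract.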
[MA-21] PolySwarm: A Multi-Agent Large Language Model Framework for Prediction Market Trading and Latency Arbitrage
【速读】: This paper addresses inaccurate probabilistic forecasts, market inefficiency, and hard-to-capture latency-arbitrage opportunities in decentralized prediction markets such as Polymarket. The key of PolySwarm, a multi-agent framework of 50 diverse large language model (LLM) personas, lies in: first, a confidence-weighted Bayesian combination fusing swarm consensus with market-implied probabilities; second, an information-theoretic market analysis engine (based on Kullback-Leibler and Jensen-Shannon divergence) that detects cross-market inefficiencies and negation-pair mispricings; and third, a latency-arbitrage module that derives CEX-implied probabilities from a log-normal pricing model and executes trades within the human reaction-time window. Experiments show that swarm aggregation consistently outperforms single-model baselines on Brier score, calibration, and log loss, with particularly strong probability calibration.
链接: https://arxiv.org/abs/2604.03888
作者: Rajat M. Barot,Arjun S. Borkhatariya
机构: State University of New York, Binghamton (纽约州立大学宾汉顿分校); Arizona State University (亚利桑那州立大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Trading and Market Microstructure (q-fin.TR)
备注: 13 pages, 3 figures, 3 tables
Abstract:This paper presents PolySwarm, a novel multi-agent large language model (LLM) framework designed for real-time prediction market trading and latency arbitrage on decentralized platforms such as Polymarket. PolySwarm deploys a swarm of 50 diverse LLM personas that concurrently evaluate binary outcome markets, aggregating individual probability estimates through confidence-weighted Bayesian combination of swarm consensus with market-implied probabilities, and applying quarter-Kelly position sizing for risk-controlled execution. The system incorporates an information-theoretic market analysis engine using Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) divergence to detect cross-market inefficiencies and negation pair mispricings. A latency arbitrage module exploits stale Polymarket prices by deriving CEX-implied probabilities from a log-normal pricing model and executing trades within the human reaction-time window. We provide a full architectural description, implementation details, and evaluation methodology using Brier scores, calibration analysis, and log-loss metrics benchmarked against human superforecaster performance. We further discuss open challenges including hallucination in agent pools, computational cost at scale, regulatory exposure, and feedback-loop risk, and outline five priority directions for future research. Experimental results demonstrate that swarm aggregation consistently outperforms single-model baselines in probability calibration on Polymarket prediction tasks.
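A toy rendering of the two core numeric steps: a confidence-weighted blend (a simple linear pool standing in here for the paper's Bayesian combination) and quarter-Kelly sizing for a binary contract that pays 1 if the event occurs:

```python
def combine_probability(swarm_p: float, market_p: float, confidence: float) -> float:
    """Blend swarm consensus with the market-implied probability.

    A linear opinion pool weighted by confidence in [0, 1]; the paper's
    Bayesian combination may differ in form.
    """
    return confidence * swarm_p + (1.0 - confidence) * market_p

def quarter_kelly_fraction(p: float, price: float) -> float:
    """Bankroll fraction to stake on a binary contract at `price`,
    scaled down to quarter-Kelly for risk control."""
    b = (1.0 - price) / price              # net odds paid by the contract
    full_kelly = (p * b - (1.0 - p)) / b   # standard Kelly criterion
    return max(0.0, full_kelly / 4.0)      # never short in this sketch
```

For example, a 0.62 blended probability against a 0.50 market price gives a full-Kelly stake of 24% of bankroll, so quarter-Kelly stakes 6%.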
[MA-22] Strategies in Sabotage Games: Temporal and Epistemic Perspectives
【速读】: This paper addresses formal reasoning about adversarial games on dynamic graphs (sabotage games), in particular modeling the temporal properties of player strategies and their uncertainty. The key is to use alternating-time temporal logic (ATL*) and its epistemic extensions, which support reasoning about winning strategies in these games and can characterize temporal properties of dynamic graphs, improving the modeling of game play and providing a theoretical foundation for analyzing multi-agent interaction in dynamic environments.
链接: https://arxiv.org/abs/2604.03872
作者: Nina Gierasimczuk,Katrine B.P. Thoft
机构: Technical University of Denmark (丹麦技术大学)
类目: Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
备注: 18 pages, 3 figures
Abstract:Sabotage games are played on a dynamic graph, in which one agent, called a runner, attempts to reach a goal state, while being obstructed by a demon who at each round removes an edge from the graph. Sabotage modal logic was proposed to carry out reasoning about such games. Since its conception, it has undergone a thorough analysis (in terms of complexity, completeness, and various extensions) and has been applied to a variety of domains, e.g., to formal learning. In this paper, we propose examining the game from a temporal perspective using alternating-time temporal logic (ATL*), and address the players' uncertainty in its epistemic extensions. This framework supports reasoning about winning strategies for those games, and opens ways to address temporal properties of dynamic graphs in general.
[MA-23] Investigating the Impact of Subgraph Social Structure Preference on the Strategic Behavior of Networked Mixed-Motive Learning Agents
【速读】: This paper addresses the limited study of strategic behavior of networked learning agents under social dilemmas and the neglect of fine-grained social dynamics in complex systems. The key is Socio-Relational Intrinsic Motivation (SRIM), which endows agents with different preferences over subgraph structures (degree-, clique-, and critical-connection-based) to study how individual preferences over their sub-graphical relations shape strategic decision-making in sequential social dilemmas. Experiments show that different subgraph-structure preferences lead to distinct behaviors in the Harvest and Cleanup environments (individual aggressiveness and individual contribution effort, respectively), and that the relative ordering of the proposed BCI metric across social preferences is consistent for the same topology across both games, indicating the robustness of sub-graphical structural effects; this offers a new lens on agent behavior in social dilemmas and informs the design of effective multi-agent ecosystems composed of heterogeneous social agents.
链接: https://arxiv.org/abs/2604.03818
作者: Xinqi Gao,Mario Ventresca
机构: Purdue University (普渡大学)
类目: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT)
备注: 17 pages, 8 page manuscript and 9 page appendix, 10 figures
Abstract:Limited work has examined the strategic behaviors of relational networked learning agents under social dilemmas, and has overlooked the intricate social dynamics of complex systems. We address the challenge with Socio-Relational Intrinsic Motivation (SRIM), which endows agents with diverse preferences over sub-graphical social structures in order to study the impact of agents’ personal preferences over their sub-graphical relations on their strategic decision-making under sequential social dilemmas. Our results in the Harvest and Cleanup environments demonstrate that preferences over different subgraph structures (degree-, clique-, and critical connection-based) lead to distinct variations in agents’ reward gathering and strategic behavior: individual aggressiveness in Harvest and individual contribution effort in Cleanup. Moreover, agents with different subgraphical structural positions consistently exhibit similar strategic behavioral shifts. Our proposed BCI metric captures structural variation within the population, and the relative ordering of BCI across social preferences is consistent in Harvest and Cleanup games for the same topology, suggesting the subgraphical structural impact is robust across environments. These results provide a new lens for examining agents’ behavior in social dilemmas and insight for designing effective multi-agent ecosystems composed of heterogeneous social agents.
[MA-24] Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus
【速读】: This paper addresses representational collapse in multi-agent LLM committees, where highly similar chains-of-thought across agents undermine the diversity and effectiveness of collective decision-making. The core solution is DALC, a training-free diversity-aware consensus protocol that computes diversity weights from embedding-space geometry, improving overall accuracy while reducing token cost and requiring no extra training; the key contributions are quantifying inter-agent semantic similarity into an actionable diversity measure and showing that the choice of embedding encoder strongly modulates both collapse severity and final performance.
链接: https://arxiv.org/abs/2604.03809
作者: Dipkumar Patel
机构: LLMs Research Inc.
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 11 pages, 2 figures, 7 tables
Abstract:Multi-agent LLM committees replicate the same model under different role prompts and aggregate outputs by majority vote, implicitly assuming that agents contribute complementary evidence. We embed each agent’s chain-of-thought rationale and measure pairwise similarity: across 100 GSM8K questions with three Qwen2.5-14B agents, mean cosine similarity is 0.888 and effective rank is 2.17 out of 3.0, a failure mode we term representational collapse. DALC, a training-free consensus protocol that computes diversity weights from embedding geometry, reaches 87% on GSM8K versus 84% for self-consistency at 26% lower token cost. Ablation experiments reveal 1-3 point per-protocol run-to-run variance, confirm that hint sharing contributes more than diversity weighting alone, and show that encoder choice strongly modulates collapse severity (cosine 0.908 with mxbai versus 0.888 with nomic) and downstream accuracy. The more robust finding is that collapse is measurable, worsens on harder tasks, and that the choice of embedding proxy is a first-order design decision for any latent communication protocol.
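The two collapse diagnostics, mean pairwise cosine similarity and effective rank, can be computed directly from the rationale embeddings. The effective-rank definition below (exponential of the entropy of normalized singular values) is a standard choice and is assumed, not quoted from the paper:

```python
import numpy as np

def collapse_metrics(E: np.ndarray):
    """Mean pairwise cosine similarity and effective rank of agent
    rationale embeddings E with shape (n_agents, dim)."""
    X = E / np.linalg.norm(E, axis=1, keepdims=True)
    C = X @ X.T                                   # pairwise cosine matrix
    n = len(E)
    mean_cos = (C.sum() - n) / (n * (n - 1))      # off-diagonal average
    s = np.linalg.svd(E, compute_uv=False)
    p = s / s.sum()                               # normalized spectrum
    p = p[p > 0]
    eff_rank = float(np.exp(-(p * np.log(p)).sum()))
    return float(mean_cos), eff_rank

identical = np.tile(np.array([1.0, 0.0, 0.0]), (3, 1))  # fully collapsed committee
orthogonal = np.eye(3)                                   # maximally diverse committee
```

A three-agent committee with mean cosine 0.888 and effective rank 2.17, as reported in the abstract, sits much closer to the collapsed extreme than the diverse one.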
[MA-25] When AI Agents Disagree Like Humans: Reasoning Trace Analysis for Human-AI Collaborative Moderation ICLR2026
【速读】: This paper addresses the practice of treating legitimate disagreement among human annotators as noise to be forced into consensus when LLM-based multi-agent systems handle culturally sensitive, value-pluralistic tasks such as hate speech moderation. The key contribution is the "disagreement as signal" paradigm: a four-category taxonomy of disagreement patterns based on reasoning similarity and conclusion agreement identifies "convergent disagreement", a structured pattern reflecting genuine value pluralism, and shows that cases where agents agree on a verdict exhibit markedly lower human disagreement than cases where they do not. This motivates a shift from consensus-seeking toward surfacing uncertainty based on disagreement structure, so that human intervention is triggered more precisely.
链接: https://arxiv.org/abs/2604.03796
作者: Michał Wawer,Jarosław A. Chudziak
机构: Warsaw University of Technology (华沙理工大学)
类目: Multiagent Systems (cs.MA)
备注: Accepted to the ICLR 2026 Workshop on "From Human Cognition to AI Reasoning: Models, Methods, and Applications" (HCAIR)
Abstract:When LLM-based multi-agent systems disagree, current practice treats this as noise to be resolved through consensus. We propose it can be signal. We focus on hate speech moderation, a domain where judgments depend on cultural context and individual value weightings, producing high legitimate disagreement among human annotators. We hypothesize that convergent disagreement, where agents reason similarly but conclude differently, indicates genuine value pluralism that humans also struggle to resolve. Using the Measuring Hate Speech corpus, we embed reasoning traces from five perspective-differentiated agents and classify disagreement patterns using a four-category taxonomy based on reasoning similarity and conclusion agreement. We find that raw reasoning divergence weakly predicts human annotator conflict, but the structure of agent discord carries additional signal: cases where agents agree on a verdict show markedly lower human disagreement than cases where they do not, with large effect sizes (d > 0.8) surviving correction for multiple comparisons. Our taxonomy-based ordering correlates with human disagreement patterns. These preliminary findings motivate a shift from consensus-seeking to uncertainty-surfacing multi-agent design, where disagreement structure - not magnitude - guides when human judgment is needed.
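The four-category taxonomy over reasoning similarity and conclusion agreement can be sketched as a simple decision rule; the category labels and the similarity threshold are illustrative, since the paper derives categories from embedded reasoning traces:

```python
def disagreement_category(reasoning_sim: float, same_verdict: bool,
                          sim_threshold: float = 0.85) -> str:
    """Four-way taxonomy: reasoning similarity x conclusion agreement.

    sim_threshold is a hand-picked illustrative cut-off.
    """
    similar = reasoning_sim >= sim_threshold
    if similar and same_verdict:
        return "consensus"               # same reasoning, same verdict
    if similar and not same_verdict:
        return "convergent_disagreement" # same reasoning, different verdicts
    if not similar and same_verdict:
        return "divergent_agreement"     # different reasoning, same verdict
    return "divergent_disagreement"      # different reasoning and verdicts
```

Under the paper's hypothesis, "convergent_disagreement" is the cell that flags genuine value pluralism and should route the case to a human.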
[MA-26] Decomposing Communication Gain and Delay Cost Under Cross-Timestep Delays in Cooperative Multi-Agent Reinforcement Learning
【速读】: This paper targets cooperative multi-agent reinforcement learning under partial observability with communication delays: cross-timestep delays make messages stale on arrival, causing temporal misalignment and performance degradation. The core contribution is a new formalism, the Delayed-Communication Partially Observable Markov Game (DeComm-POMG), together with a decomposition of each message's effect into communication gain and delay cost, yielding the CGDC metric that quantifies the impact of delay on policy value. The solution has three parts: an actor-critic framework, CDCMA, built on CGDC that requests communication only when the predicted CGDC is positive; future-observation prediction to reduce misalignment at consumption time; and CGDC-guided attention to fuse delayed messages. Experiments across multiple task settings show clear gains in performance, robustness, and generalization.
链接: https://arxiv.org/abs/2604.03785
作者: Zihong Gao,Hongjian Liang,Lei Hao,Liangjun Ke
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Communication is essential for coordination in cooperative multi-agent reinforcement learning under partial observability, yet cross-timestep delays cause messages to arrive multiple timesteps after generation, inducing temporal misalignment and making information stale when consumed. We formalize this setting as a delayed-communication partially observable Markov game (DeComm-POMG) and decompose a message's effect into communication gain and delay cost, yielding the Communication Gain and Delay Cost (CGDC) metric. We further establish a value-loss bound showing that the degradation induced by delayed messages is upper-bounded by a discounted accumulation of an information gap between the action distributions induced by timely versus delayed messages. Guided by CGDC, we propose CDCMA, an actor-critic framework that requests messages only when predicted CGDC is positive, predicts future observations to reduce misalignment at consumption, and fuses delayed messages via CGDC-guided attention. Experiments on no-teammate-vision variants of Cooperative Navigation and Predator Prey, and on SMAC maps across multiple delay levels show consistent improvements in performance, robustness, and generalization, with ablations validating each component.
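The CGDC gating rule and the discounted information-gap accumulation from the value-loss bound can be sketched as follows; the function names and the toy form of the bound are mine, not the paper's exact definitions:

```python
def should_request(comm_gain: float, delay_cost: float) -> bool:
    """CDCMA-style gate: request a message only when the predicted
    CGDC = communication gain - delay cost is positive."""
    return (comm_gain - delay_cost) > 0.0

def discounted_delay_cost(info_gaps, gamma: float = 0.99) -> float:
    """Value-loss-style bound: discounted accumulation of per-step gaps
    between action distributions under timely vs. delayed messages."""
    return sum((gamma ** t) * g for t, g in enumerate(info_gaps))
```

The gate keeps communication sparse: a message that arrives too stale to change the receiver's action distribution is simply not worth requesting.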
[MA-27] DéjàVu: A Minimalistic Mechanism for Distributed Plurality Consensus
【速读】: This paper addresses the plurality consensus problem in distributed systems: a population of extremely simple agents, each initially holding one of k opinions, must agree on the initially most frequent one. Traditional approaches such as the h-majority protocol have each agent sample h random neighbors and adopt the most frequent opinion in the sample, but performance depends on the choice of h and the communication cost is higher. This paper proposes DéjàVu: an agent keeps querying neighbors until it first encounters a repeated opinion, at which point it adopts that duplicate value. The key innovation is that no counters, frequency estimates, or parameters (such as a sample size) are needed; relying solely on the primitive ability to detect repetition, the mechanism achieves efficient consensus. The analysis shows it is competitive with h-majority and, in some regimes, substantially more communication-efficient, making it a powerful and simple primitive for plurality consensus.
链接: https://arxiv.org/abs/2604.03648
作者: Francesco d’Amore,Niccolò D’Archivio,George Giakkoupis,Frédéric Giroire,Emanuele Natale
机构: Gran Sasso Science Institute (意大利格兰萨索科学研究所); INRIA, COATI, Université Côte d’Azur (法国国家信息与自动化研究院,COATI,蔚蓝海岸大学); INRIA Rennes (法国国家信息与自动化研究院雷恩分部)
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
备注:
Abstract:We study the plurality consensus problem in distributed systems where a population of extremely simple agents, each initially holding one of k opinions, aims to agree on the initially most frequent one. In this setting, h-majority is arguably the simplest and most studied protocol, in which each agent samples the opinion of h neighbors uniformly at random and updates its opinion to the most frequent value in the sample. We propose a new, extremely simple mechanism called DéjàVu: an agent queries neighbors until it encounters an opinion for the second time, at which point it updates its own opinion to the duplicate value. This rule does not require agents to maintain counters or estimate frequencies, nor to choose any parameter (such as a sample size h); it relies solely on the primitive ability to detect repetition. We provide a rigorous analysis of DéjàVu that relies on several technical ideas of independent interest and demonstrates that it is competitive with h-majority and, in some regimes, substantially more communication-efficient, thus yielding a powerful primitive for plurality consensus.
[MA-28] VisionClaw: Always-On AI Agents through Smart Glasses
【速读】:该论文旨在解决传统可穿戴设备在执行任务时存在交互延迟高、操作繁琐以及缺乏持续感知与自主执行能力的问题。其核心挑战在于如何实现用户与环境的无缝交互,同时减少手动控制负担并提升任务效率。解决方案的关键是提出VisionClaw——一个始终在线(always-on)的可穿戴AI代理系统,它将实时第一人称视角感知(egocentric perception)与基于OpenClaw的智能体(AI agent)任务执行相结合,使用户可通过语音驱动的方式直接在智能眼镜上发起和委托任务,如添加现实物体到购物车、从纸质文档生成笔记或控制物联网设备等。该方案通过持续耦合感知与行动,实现了情境化、免手持的任务处理模式,显著提升了任务完成速度并降低了交互开销。
链接: https://arxiv.org/abs/2604.03486
作者: Xiaoan Liu,DaeHo Lee,Eric J Gonzalez,Mar Gonzalez-Franco,Ryo Suzuki
机构: University of Colorado Boulder(科罗拉多大学博尔德分校); Gwangju Institute of Science and Technology(光州科学技术院); Google(谷歌)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: Submitted to UIST 2026. 10 pages, 11 figures, plus appendix
Abstract:We present VisionClaw, an always-on wearable AI agent that integrates live egocentric perception with agentic task execution. Running on Meta Ray-Ban smart glasses, VisionClaw continuously perceives real-world context and enables in-situ, speech-driven action initiation and delegation via OpenClaw AI agents. Therefore, users can directly execute tasks through the smart glasses, such as adding real-world objects to an Amazon cart, generating notes from physical documents, receiving meeting briefings on the go, creating events from posters, or controlling IoT devices. We evaluate VisionClaw through a controlled laboratory study (N=12) and a longitudinal deployment study (N=5). Results show that integrating perception and execution enables faster task completion and reduces interaction overhead compared to non-always-on and non-agent baselines. Beyond performance gains, deployment findings reveal a shift in interaction: tasks are initiated opportunistically during ongoing activities, and execution is increasingly delegated rather than manually controlled. These results suggest a new paradigm for wearable AI agents, where perception and action are continuously coupled to support situated, hands-free interaction.
[MA-29] Scaling Multi-agent Systems: A Smart Middleware for Improving Agent Interactions
【速读】:该论文旨在解决基于大语言模型(Large Language Model, LLM)的多智能体系统(Multi-Agent System, MAS)在从实验性原型向复杂、持久生态演化过程中,因直接智能体间通信导致的碎片化上下文、随机幻觉、僵化的安全边界及低效拓扑管理等问题。其解决方案的关键在于引入认知织构节点(Cognitive Fabric Node, CFN),构建一个贯穿智能体之间的“认知织构”中间层;该层并非传统消息队列或服务网格的被动传输机制,而是具备主动推理能力的智能中介,将记忆(Memory)从静态存储提升为驱动拓扑选择、语义锚定、安全策略执行与提示转换四大核心功能的动态功能基底,并通过强化学习(Reinforcement Learning, RL)和优化算法实现各模块的持续自适应优化,从而在保持单个智能体轻量化的同时,显著增强整个系统的连贯性、安全性与语义一致性。
链接: https://arxiv.org/abs/2604.03430
作者: Charles Fleming,Ramana Kompella,Peter Bosch,Vijoy Pandey
机构: 未知
类目: Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
备注:
Abstract:As Large Language Model (LLM) based Multi-Agent Systems (MAS) evolve from experimental pilots to complex, persistent ecosystems, the limitations of direct agent-to-agent communication have become increasingly apparent. Current architectures suffer from fragmented context, stochastic hallucinations, rigid security boundaries, and inefficient topology management. This paper introduces Cognitive Fabric Nodes (CFN), a novel middleware layer that creates an omnipresent “Cognitive Fabric” between agents. Unlike traditional message queues or service meshes, CFNs are not merely pass-through mechanisms; they are active, intelligent intermediaries. Central to this architecture is the elevation of Memory from simple storage to an active functional substrate that informs four other critical capabilities: Topology Selection, Semantic Grounding, Security Policy Enforcement, and Prompt Transformation. We propose that each of these functions be governed by learning modules utilizing Reinforcement Learning (RL) and optimization algorithms to improve system performance dynamically. By intercepting, analyzing, and rewriting inter-agent communication, the Cognitive Fabric ensures that individual agents remain lightweight while the ecosystem achieves coherence, safety, and semantic alignment. We evaluate the effectiveness of the CFN on the HotPotQA and MuSiQue datasets in a multi-agent environment and demonstrate that the CFN improves performance by more than 10% on both datasets over direct agent to agent communication.
[MA-30] Scaling Teams or Scaling Time? Memory Enabled Lifelong Learning in LLM Multi-Agent Systems
【速读】:该论文旨在解决大规模语言模型(Large Language Model, LLM)多智能体系统在现实成本约束下,团队规模与终身学习能力之间协同 scaling 的机制不明确的问题。现有研究通常单独探讨团队规模或经验积累对性能的影响,但忽视了二者交互作用及其在有限资源下的优化路径。解决方案的关键在于提出一种名为 LLMA-Mem 的终身记忆框架,该框架支持灵活的记忆拓扑结构,并通过高效复用历史经验来提升长期任务表现。实验表明,LLMA-Mem 在多个复杂环境(如编码、科研和数据库任务)中均能显著改善长周期性能并降低总体成本,同时揭示出非单调的 scaling 规律:更大的团队未必带来更好结果,而更优的记忆设计可使小团队超越大团队,从而证明记忆架构是实现高效、可持续扩展多智能体系统的有效策略。
链接: https://arxiv.org/abs/2604.03295
作者: Shanglin Wu,Yuyang Luo,Yueqing Liang,Kaiwen Shi,Yanfang Ye,Ali Payani,Kai Shu
机构: Emory University(埃默里大学); Illinois Institute of Technology(伊利诺伊理工学院); University of Notre Dame(圣母大学); Cisco Research(思科研究院)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language model (LLM) multi-agent systems can scale along two distinct dimensions: by increasing the number of agents and by improving through accumulated experience over time. Although prior work has studied these dimensions separately, their interaction under realistic cost constraints remains unclear. In this paper, we introduce a conceptual scaling view of multi-agent systems that jointly considers team size and lifelong learning ability, and we study how memory design shapes this landscape. To this end, we propose LLMA-Mem, a lifelong memory framework for LLM multi-agent systems under flexible memory topologies. We evaluate LLMA-Mem on MultiAgentBench across coding, research, and database environments. Empirically, LLMA-Mem consistently improves long-horizon performance over baselines while reducing cost. Our analysis further reveals a non-monotonic scaling landscape: larger teams do not always produce better long-term performance, and smaller teams can outperform larger ones when memory better supports the reuse of experience. These findings position memory design as a practical path for scaling multi-agent systems more effectively and more efficiently over time.
[MA-31] Multi-Agent Training-free Urban Food Delivery System using Resilient UMST Network
【速读】:该论文旨在解决城市配送网络在效率与韧性之间难以平衡的问题:传统完全连通图虽具灵活性但计算复杂度高,而单一最小生成树(Minimum Spanning Tree, MST)虽高效却易受局部故障影响。其解决方案的关键在于提出“最小生成树并集”(Union of Minimum Spanning Trees, UMST)方法——通过随机扰动边权重生成多个MST,并将它们合并为一个稀疏但鲁棒的图结构,从而在显著减少边数(相比完全连通图减少20–40倍)的同时,保留多个备选路径以提升系统抗干扰能力,并支持高参与率的订单整合(75–83%),且无需训练即可实现接近学习型模型(如MADDPG和图神经网络)的性能表现(成功率88–96%,距离节省44–53%),同时具备可解释的路由结构和30倍更快的执行速度。
链接: https://arxiv.org/abs/2604.03280
作者: Md Nahid Hasan,Vishwam Tiwari,Aditya Challa,Vaskar Raychoudhury,Snehanshu Saha
机构: 1: University of California, Berkeley (加州大学伯克利分校); 2: University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
备注:
Abstract:Delivery systems have become a core part of urban life, supporting the demand for food, medicine, and other goods. Yet traditional logistics networks remain fragile, often struggling to adapt to road closures, accidents, and shifting demand. Online Food Delivery (OFD) platforms now represent a cornerstone of urban logistics, with the global market projected to grow to over 500 billion USD by 2030. Designing delivery networks that are efficient and resilient remains a major challenge: fully connected graphs provide flexibility but are computationally infeasible at scale, while single Minimum Spanning Trees (MSTs) are efficient but easily disrupted. We propose the Union of Minimum Spanning Trees (UMST) approach to construct delivery networks that are sparse yet robust. UMST generates multiple MSTs through randomized edge perturbations and unites them, producing graphs with far fewer edges than fully connected networks while maintaining multiple alternative routes between delivery hotspots. Across multiple U.S. cities, UMST achieves 20–40× fewer edges than fully connected graphs while enabling substantial order bundling with 75–83% participation rates. Compared to learning-based baselines including MADDPG and Graph Neural Networks, UMST delivers competitive performance (88-96% success rates, 44-53% distance savings) without requiring training, achieving 30× faster execution while maintaining interpretable routing structures. Its combination of structural efficiency and operational flexibility offers a scalable and resilient foundation for urban delivery networks.
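UMST的核心构造——对边权做随机扰动后反复求MST并取边集并集——可以用如下Python草图演示。这是一个示意实现:Kruskal算法、乘性扰动幅度noise=0.1与树的数量num_trees=5均为本文作者未给出细节时的假设取值。

```python
import random

def kruskal_mst(nodes, edges):
    """Kruskal's MST with union-find. `edges` is a list of (w, u, v);
    returns a set of normalized (u, v) edge pairs."""
    parent = {v: v for v in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    mst = set()
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst.add((min(u, v), max(u, v)))
    return mst

def umst(nodes, edges, num_trees=5, noise=0.1, seed=0):
    """Union of MSTs: multiplicatively perturb each edge weight, take the
    MST of each perturbed graph, and union the resulting edge sets into
    one sparse but redundant network (parameters are illustrative)."""
    rng = random.Random(seed)
    union = set()
    for _ in range(num_trees):
        perturbed = [(w * (1 + rng.uniform(-noise, noise)), u, v)
                     for w, u, v in edges]
        union |= kruskal_mst(nodes, perturbed)
    return union

# toy example: complete graph on 6 delivery hotspots with random distances
rng = random.Random(1)
nodes = list(range(6))
edges = [(rng.uniform(1, 10), u, v) for u in nodes for v in nodes if u < v]
g = umst(nodes, edges)
print(len(g))  # between n-1 = 5 (single MST) and 15 (complete graph)
```

由于每棵MST都是生成树,并集图必然连通,且边数介于n-1与完全图边数之间,这对应摘要中"稀疏但保留多条备选路径"的性质。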
[MA-32] Emergent Compositional Communication for Latent World Properties
【速读】:该论文旨在解决如何从冻结的视频特征中通过多智能体通信机制无监督地提取离散且组合式的不可见物理属性表征(如弹性、摩擦力、质量比)的问题。其核心解决方案是引入基于Gumbel-Softmax瓶颈的多智能体迭代学习框架,使智能体在无需属性标签或消息结构监督的情况下,自发发展出位置解耦的通信协议,从而实现高精度的组合性表征(PosDis=0.999),并验证了该方法在真实物理场景中的泛化能力与因果可干预性。
链接: https://arxiv.org/abs/2604.03266
作者: Tomek Kaszyński
机构: Independent Researcher, Amsterdam, Netherlands
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
备注: 24 pages, 4 figures, 12 tables. Code: this https URL
Abstract:Can multi-agent communication pressure extract discrete, compositional representations of invisible physical properties from frozen video features? We show that agents communicating through a Gumbel-Softmax bottleneck with iterated learning develop positionally disentangled protocols for latent properties (elasticity, friction, mass ratio) without property labels or supervision on message structure. With 4 agents, 100% of 80 seeds converge to near-perfect compositionality (PosDis=0.999, holdout 98.3%). Controls confirm multi-agent structure – not bandwidth or temporal coverage – drives this effect. Causal intervention shows surgical property disruption (~15% drop on targeted property, 3% on others). A controlled backbone comparison reveals that the perceptual prior determines what is communicable: DINOv2 dominates on spatially-visible ramp physics (98.3% vs 95.1%), while V-JEPA 2 dominates on dynamics-only collision physics (87.4% vs 77.7%, d=2.74). Scale-matched (d=3.37) and frame-matched (d=6.53) controls attribute this gap entirely to video-native pretraining. The frozen protocol supports action-conditioned planning (91.5%) with counterfactual velocity reasoning (r=0.780). Validation on Physics 101 real camera footage confirms 85.6% mass-comparison accuracy on unseen objects, temporal dynamics contributing +11.2% beyond static appearance, agent-scaling compositionality replicating at 90% for 4 agents, and causal intervention extending to real video (d=1.87, p=0.022).
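文中作为通信瓶颈的Gumbel-Softmax采样可用如下纯Python草图说明其原理:对每个logit加上Gumbel(0,1)噪声后做带温度的softmax,得到一个可微的"松弛one-hot"消息符号。此处仅演示采样本身;直通估计(straight-through)、多智能体结构与迭代学习均从略,温度tau为示意取值。

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=random):
    """One relaxed one-hot sample via the Gumbel-Softmax trick:
    add Gumbel(0,1) noise to each logit, then apply a
    temperature-controlled softmax (numerically stabilized)."""
    gumbels = [-math.log(-math.log(rng.random())) for _ in logits]
    scores = [(l + g) / tau for l, g in zip(logits, gumbels)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

random.seed(0)
# one discrete "message slot": logits over a tiny 3-symbol vocabulary
sample = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5)
print([round(p, 3) for p in sample])
```

温度tau越低,采样结果越接近严格的one-hot符号;训练中通常配合退火或直通估计使消息在前向传播中离散化。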
自然语言处理
[NLP-0] Beyond the Final Actor: Modeling the Dual Roles of Creator and Editor for Fine-Grained LLM-Generated Text Detection ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)生成文本的细粒度检测问题,现有方法多采用二分类或三分类设置,仅能区分纯人类文本、纯LLM文本或协作文本,难以应对实际政策监管中对LLM润色的人类文本与人类化LLM文本的差异化处理需求。其解决方案的关键在于提出RACE(Rhetorical Analysis for Creator-Editor Modeling),通过修辞结构理论(Rhetorical Structure Theory)构建创作者的基础逻辑图谱,并提取基本话语单元(Elementary Discourse Unit)级别的特征以刻画编辑者风格,从而在严格的四分类框架下实现对LLM生成文本类型的精细化识别,显著优于12个基线方法且误报率低,为LLM治理提供符合政策导向的技术路径。
链接: https://arxiv.org/abs/2604.04932
作者: Yang Li,Qiang Sheng,Zhengjia Wang,Yehan Yang,Danding Wang,Juan Cao
机构: 未知
类目: Computation and Language (cs.CL)
备注: ACL 2026 Accepted Paper
Abstract:The misuse of large language models (LLMs) requires precise detection of synthetic text. Existing works mainly follow binary or ternary classification settings, which can only distinguish pure human/LLM text or collaborative text at best. This remains insufficient for the nuanced regulation, as the LLM-polished human text and humanized LLM text often trigger different policy consequences. In this paper, we explore fine-grained LLM-generated text detection under a rigorous four-class setting. To handle such complexities, we propose RACE (Rhetorical Analysis for Creator-Editor Modeling), a fine-grained detection method that characterizes the distinct signatures of creator and editor. Specifically, RACE utilizes Rhetorical Structure Theory to construct a logic graph for the creator’s foundation while extracting Elementary Discourse Unit-level features for the editor’s style. Experiments show that RACE outperforms 12 baselines in identifying fine-grained types with low false alarms, offering a policy-aligned solution for LLM regulation.
[NLP-1] Early Stopping for Large Reasoning Models via Confidence Dynamics
【速读】: 该论文旨在解决大模型在复杂问题求解过程中因长链式推理(chain-of-thought)导致的计算开销过大及性能下降的问题,核心挑战在于如何准确判断何时终止推理以输出最终答案。解决方案的关键在于利用推理过程中中间答案置信度(confidence)的变化动态:正确推理路径通常早期就能达到高置信度,而错误路径则表现为冗长且低效的推理轨迹,置信度波动不稳定。基于此观察,作者提出 CoDE-Stop(Confidence Dynamics Early Stop)方法,通过分析中间答案置信度的动力学特征来决定停止时机,无需额外训练即可集成到现有模型中,显著提升推理效率并保持或提升准确性。
链接: https://arxiv.org/abs/2604.04930
作者: Parsa Hosseini,Sumit Nawathe,Mahdi Salmani,Meisam Razaviyayn,Soheil Feizi
机构: University of Maryland (马里兰大学); University of Southern California (南加州大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large reasoning models rely on long chain-of-thought generation to solve complex problems, but extended reasoning often incurs substantial computational cost and can even degrade performance due to overthinking. A key challenge is determining when the model should stop reasoning and produce the final answer. In this work, we study the confidence of intermediate answers during reasoning and observe two characteristic behaviors: correct reasoning trajectories often reach high-confidence answers early, while incorrect rollouts tend to produce long, unproductive reasoning traces and exhibit less reliable confidence dynamics. Motivated by these observations, we propose CoDE-Stop (Confidence Dynamics Early Stop), an early stopping method that leverages the dynamics of intermediate answer confidence to decide when to terminate reasoning, requiring no additional training and easily integrating into existing models. We evaluate CoDE-Stop on diverse reasoning and science benchmarks across multiple models. Compared to prior early stopping methods, it achieves a more favorable accuracy-compute tradeoff and reduces total token usage by 25-50% compared to standard full-length reasoning. In addition, we provide analyses of confidence dynamics during reasoning, offering insights into how confidence changes in both correct and incorrect trajectories.
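CoDE-Stop利用中间答案置信度动态决定停止时机,这一思路可用如下玩具规则示意:扫描(中间答案, 置信度)序列,当同一答案连续若干个检查点保持高置信度时提前终止。阈值threshold与耐心参数patience均为假设值,并非论文方法的具体实现。

```python
def code_stop(steps, threshold=0.9, patience=2):
    """Toy early-stopping rule in the spirit of CoDE-Stop: stop once the
    same intermediate answer stays above a confidence threshold for
    `patience` consecutive checkpoints (parameters are illustrative)."""
    streak, last = 0, None
    for i, (answer, conf) in enumerate(steps):
        if conf >= threshold and answer == last:
            streak += 1
        else:
            streak = 1 if conf >= threshold else 0
        last = answer
        if streak >= patience:
            return i, answer  # terminate reasoning here
    return len(steps) - 1, steps[-1][0]  # fall back to the full trace

# toy trace of intermediate answers with rising confidence
trace = [("12", 0.40), ("15", 0.55), ("15", 0.92), ("15", 0.95), ("15", 0.96)]
stop_idx, final = code_stop(trace)
print(stop_idx, final)  # → 3 15
```

该草图体现了摘要中的观察:正确轨迹往往较早达到稳定的高置信度,因此可在答案稳定后截断剩余推理token。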
[NLP-2] TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在扩展推理过程中因键值缓存(Key-Value Cache, KV cache)内存瓶颈导致的效率与稳定性问题。现有主流KV缓存压缩方法依赖于RoPE(Rotary Position Embedding)后查询向量的注意力分数来估计键(key)的重要性,但由于RoPE引入的位置旋转特性,代表性查询数量稀少,导致重要键选择(top-key selection)质量差且推理过程不稳定。解决方案的关键在于转向RoPE前的空间(pre-RoPE space),发现查询(Q)和键(K)向量在该空间中高度集中于固定的非零中心且位置不变(即Q/K浓度现象),并利用这一特性提出TriAttention机制:通过三角函数级数刻画由中心决定的键距离偏好关系,结合键的位置距离与Q/K范数作为双重信号对键重要性进行评分。该方法显著提升了长上下文推理的准确率与内存效率,在AIME25 32K token生成任务中达到全注意力精度的同时实现2.5倍吞吐提升或10.7倍KV内存压缩,使OpenClaw等模型可在单张消费级GPU上部署长上下文推理。
链接: https://arxiv.org/abs/2604.04921
作者: Weian Mao,Xi Lin,Wei Huang,Yuxin Xie,Tianfu Fu,Bohan Zhuang,Song Han,Yukang Chen
机构: 1. Tsinghua University (清华大学); 2. Massachusetts Institute of Technology (麻省理工学院); 3. Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at this https URL
Abstract:Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position during RoPE, making representative queries very few, leading to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions – Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention to estimate key importance by leveraging these centers. Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and also leverage Q/K norms as an additional signal for importance estimation. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines achieve only about half the accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where long context would otherwise cause out-of-memory with Full Attention.
[NLP-3] Vero: An Open RL Recipe for General Visual Reasoning
【速读】: 该论文旨在解决如何构建一个能够在图表理解、科学推理、空间认知及开放式任务等多个视觉推理场景中通用的视觉语言模型(Vision-Language Model, VLM)的问题。其核心挑战在于现有最强模型虽展现出跨任务推理能力,但其训练机制依赖于封闭的强化学习(Reinforcement Learning, RL)管道和非公开数据,缺乏可复现性与透明度。解决方案的关键在于提出Vero系列完全开源的VLM,通过大规模扩展RL数据与奖励机制,在六个任务类别上构建了包含600K样本的Vero-600K数据集,并设计了任务导向型奖励机制以适配异构答案格式;实验证明,该方案在30个基准测试中显著优于多个基线模型,且无需额外专有思维数据即可超越同类模型,系统性消融实验进一步揭示:不同任务类别激发的推理模式具有质的差异且孤立迁移效果差,表明广泛的数据覆盖是实现强化学习有效扩展的核心驱动力。
链接: https://arxiv.org/abs/2604.04917
作者: Gabriel Sarch,Linrong Cai,Qunzhong Wang,Haoyang Wu,Danqi Chen,Zhuang Liu
机构: Princeton University (普林斯顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Project page: this https URL
Abstract:What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) show such broad visual reasoning is within reach, but the recipe behind them remains unclear, locked behind proprietary reinforcement learning (RL) pipelines with non-public data. We introduce Vero, a family of fully open VLMs that matches or exceeds existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answer formats. Vero achieves state-of-the-art performance, improving over four base models by 3.7-5.5 points on average across VeroEval, our suite of 30 challenging benchmarks. Starting from Qwen3-VL-8B-Instruct, Vero outperforms Qwen3-VL-8B-Thinking on 23 of 30 benchmarks without additional proprietary thinking data. When trained from the same base model, Vero-600K exceeds existing RL datasets across task categories. Systematic ablations reveal that different task categories elicit qualitatively distinct reasoning patterns that transfer poorly in isolation, suggesting that broad data coverage is the primary driver of strong RL scaling. All data, code, and models are released.
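Vero所述"任务导向型奖励以适配异构答案格式"(task-routed rewards)的机制可以用一个简化的奖励路由器示意:每个任务类别声明由哪个校验函数判分。以下类别名称与校验函数均为假设,仅说明按答案格式分派校验器的思路,并非论文的实际实现。

```python
def exact_match_reward(pred, gold):
    """Binary reward for free-form / multiple-choice answers."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

def numeric_reward(pred, gold, tol=1e-3):
    """Binary reward for numeric answers, with a small tolerance."""
    try:
        return 1.0 if abs(float(pred) - float(gold)) <= tol else 0.0
    except ValueError:
        return 0.0

# hypothetical router: task category -> verifier for its answer format
REWARD_ROUTER = {
    "chart_qa": numeric_reward,
    "science_mcq": exact_match_reward,
    "spatial": exact_match_reward,
}

def route_reward(task_type, pred, gold):
    """Dispatch a rollout to the verifier registered for its task type."""
    return REWARD_ROUTER[task_type](pred, gold)

print(route_reward("chart_qa", "3.1416", "3.1415"))  # → 1.0 (within tol)
print(route_reward("science_mcq", "B", "b"))         # → 1.0
```

这种"按类别分派"的设计使同一RL管道能同时覆盖数值、选择题与开放式等不同答案格式的任务。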
[NLP-4] QED-Nano: Teaching a Tiny Model to Prove Hard Theorems
【速读】: 该论文旨在解决如何在资源受限条件下训练小型开放模型以达到与大型专有模型相当的奥林匹克级别数学证明生成能力的问题。其核心挑战在于现有高性能AI系统依赖于庞大的内部模型和复杂架构,导致推理成本高、难以复现且不易改进。解决方案的关键在于提出一个三阶段训练流程:首先通过监督微调(Supervised Fine-Tuning, SFT)从DeepSeek-Math-V2蒸馏出优秀的证明写作风格;其次引入基于评分规则的强化学习(Reinforcement Learning, RL)优化输出质量;最后采用推理缓存机制(Reasoning Cache),将长证明分解为迭代式“总结-精炼”循环,显著增强测试时推理能力。这一方法使QED-Nano(4B参数规模)在性能上超越多个更大规模的开源模型,并逼近Gemini 3 Pro等专有模型的表现,同时大幅降低推理开销。
链接: https://arxiv.org/abs/2604.04898
作者: LM-Provers,Yuxiao Qu,Amrith Setlur,Jasper Dekoninck,Edward Beeching,Jia Li,Ian Wu,Lewis Tunstall,Aviral Kumar
机构: CMU(卡内基梅隆大学); Hugging Face(胡格弗莱斯); ETH Zurich(苏黎世联邦理工学院); Project Numina(项目诺米娜)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Proprietary AI systems have recently demonstrated impressive capabilities on complex proof-based problems, with gold-level performance reported at the 2025 International Mathematical Olympiad (IMO). However, the training pipelines behind these systems remain largely undisclosed, and their reliance on large “internal” models and scaffolds makes them expensive to run, difficult to reproduce, and hard to study or improve upon. This raises a central question: can small, open models also be trained to achieve competitive reasoning performance on difficult Olympiad-level math? In this paper, we answer this question by building QED-Nano, a 4B model post-trained for Olympiad-level proofs. Our training recipe has three stages: (1) supervised fine-tuning to imbue good proof-writing styles by distilling from DeepSeek-Math-V2, (2) reinforcement learning (RL) with rubric-based rewards, and (3) expanding RL with a reasoning cache, which decomposes long proofs into iterative summarize-and-refine cycles and enables stronger test-time reasoning. QED-Nano surpasses the proof-generation performance of much larger open models, including Nomos-1 and GPT-OSS-120B, and approaches the performance of proprietary models like Gemini 3 Pro, at a fraction of the inference cost. To support further research on open mathematical reasoning, we release the full QED-Nano pipeline, including the QED-Nano and QED-Nano-SFT models, the FineProofs-SFT and FineProofs-RL datasets, and the training and evaluation code.
[NLP-5] Rethinking Exploration in RLVR: From Entropy Regularization to Refinement via Bidirectional Entropy Modulation
【速读】: 该论文旨在解决强化学习中基于可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)方法在训练大语言模型(Large Language Models, LLMs)时所面临的“受限探索”(restricted exploration)问题,即策略快速收敛至有限解空间,导致推理能力受限。传统通过熵正则化(entropy regularization)维持探索的方法对LLMs效果不佳,存在超参数敏感且增益有限的问题。解决方案的关键在于重新审视策略熵与探索之间的关系,提出将策略熵分解为“信息熵”(informative entropy,保留多样解路径)和“虚假熵”(spurious entropy,破坏推理模式),并揭示有效探索应依赖于“熵精炼”(entropy refinement)机制——该机制在组相对优势估计(group-relative advantage estimation)中隐式实现:在正向轨迹上保持信息熵,在负向轨迹上抑制虚假熵。基于此洞察,作者提出AsymGRPO框架,通过显式分离正负轨迹的调节机制,独立控制信息熵的保留与虚假噪声的抑制,从而显著提升探索效率与模型性能。
链接: https://arxiv.org/abs/2604.04894
作者: Hengrui Gu,Xiaotian Han,Yujing Bian,Kaixiong Zhou
机构: North Carolina State University (北卡罗来纳州立大学); Case Western Reserve University (凯斯西储大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning capabilities of large language models (LLMs). However, it faces a fundamental limitation termed restricted exploration, where the policy rapidly converges to a narrow set of solutions. While entropy regularization is a popular approach used to sustain exploration, it often proves unreliable for LLMs, suffering from high hyperparameter sensitivity and yielding only marginal performance gains. Motivated by these inefficiencies, we propose to rethink the relationship between policy entropy and exploration. By deriving a parametric formulation of group-relative advantage estimation and analyzing entropy dynamics, we conceptually decompose policy entropy into informative entropy, which preserves diverse solution paths, and spurious entropy, which erodes reasoning patterns. Our analysis reveals that, in contrast to blind maximization, effective exploration requires entropy refinement, a mechanism implicitly embedded in group-relative advantage estimation that sustains informative entropy on positive rollouts while suppressing spurious entropy on negative ones. Guided by this insight, we propose AsymGRPO, an exploratory framework that explicitly decouples the modulation of positive and negative rollouts. This allows for independent control over the preservation of informative entropy and the suppression of spurious noise. Extensive experiments demonstrate that AsymGRPO achieves superior performance compared to strong baselines and exhibits the potential to synergize with existing entropy regularization methods.
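组相对优势估计与"正负轨迹非对称熵调节"的思路可用如下草图示意:先按GRPO方式对同组rollout的奖励做均值-标准差归一化,再对正优势轨迹给予熵奖励、对负优势轨迹施加熵惩罚。系数beta_pos/beta_neg与示例数值均为示意假设,并非论文取值或完整训练目标。

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style advantage: reward minus the group mean, divided by the
    group standard deviation (falls back to 1.0 when std is zero)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var) or 1.0
    return [(r - mean) / std for r in rewards]

def asymmetric_entropy_bonus(advantages, entropies, beta_pos=0.02, beta_neg=-0.05):
    """Illustrative asymmetric modulation: reward entropy on positive
    rollouts (preserve diverse solutions) and penalize it on negative
    rollouts (suppress spurious noise). Coefficients are assumptions."""
    return [a + (beta_pos if a > 0 else beta_neg) * h
            for a, h in zip(advantages, entropies)]

rewards = [1.0, 1.0, 0.0, 0.0]      # verifiable 0/1 rewards for one group
entropies = [0.8, 0.2, 0.9, 0.1]    # per-rollout mean token entropy (toy)
adv = group_relative_advantages(rewards)
print([round(a, 3) for a in asymmetric_entropy_bonus(adv, entropies)])
```

可以看到高熵的正轨迹被进一步鼓励(信息熵保留),而高熵的负轨迹受到更强抑制(虚假熵清除),这正对应摘要所称的"熵精炼"方向。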
[NLP-6] Synthetic Sandbox for Training Machine Learning Engineering Agents
【速读】: 该论文旨在解决生成式 AI(Generative AI)在机器学习工程(Machine Learning Engineering, MLE)任务中,由于验证成本过高而难以应用轨迹级在线策略强化学习(trajectory-wise on-policy reinforcement learning, RL)的问题。传统方法依赖于完整ML流水线(包括数据预处理、模型训练与指标评估)在大规模数据集上的执行,导致每一步rollout的验证效率极低,从而迫使研究者退而采用监督微调(Supervised Fine-Tuning, SFT)或离线代理奖励策略,牺牲了RL的探索能力和泛化优势。其解决方案的关键在于识别出沙盒数据规模是主要瓶颈,并提出SandMLE框架:通过少量种子任务生成多样且可验证的合成MLE环境,将每个任务的数据量控制在微尺度(50–200个训练样本),从而显著降低执行开销,首次实现MLE领域内大规模轨迹级在线策略RL训练。实验表明,SandMLE在MLE-bench-lite上相较SFT基线提升明显,且训练策略具备跨未见智能体结构的泛化能力。
链接: https://arxiv.org/abs/2604.04872
作者: Yuhang Zhou,Lizhu Zhang,Yifan Wu,Jiayi Liu,Xiangjun Fan,Zhuokai Zhao,Hong Yan
机构: Meta AI
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 28 pages, 9 tables, 8 figures
Abstract:As large language model agents advance beyond software engineering (SWE) tasks toward machine learning engineering (MLE), verifying agent behavior becomes orders of magnitude more expensive: while SWE tasks can be verified via fast-executing unit tests, MLE verification requires running full ML pipelines – data preprocessing, model training, and metric evaluation – on large datasets at each rollout step, rendering trajectory-wise on-policy reinforcement learning (RL) prohibitively slow. Existing approaches retreat to supervised fine-tuning (SFT) or offline proxy rewards, sacrificing the exploration and generalization benefits of on-policy RL. We observe that sandbox data size is the primary source of this bottleneck. Based on this insight, we introduce SandMLE, a multi-agent framework that generates diverse, verifiable synthetic MLE environments from a small number of seed tasks, preserving the structural and technical complexity of real-world problems while constraining datasets to micro-scale (each task is paired with only 50-200 training samples). Through extensive experiments, we show that SandMLE reduces execution time by over 13 times, enabling large-scale, on-policy trajectory-wise RL for the first time in the MLE domain. On MLE-bench-lite, SandMLE yields significant gains over SFT baselines across Qwen3-8B, 14B, and 30B-A3B, with relative medal rate improvements ranging from 20.3% to 66.9%. Furthermore, the trained policy generalizes across unseen agentic scaffolds, achieving up to 32.4% better HumanRank score on MLE-Dojo.
[NLP-7] Do No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在心理治疗场景中因缺乏对“治疗性共情”与“适应不良的认同”(maladaptive validation)的区分能力而导致的安全隐患问题,即模型可能在多轮对话中无意间强化用户有害信念或行为。其解决方案的关键在于提出首个基于人格的客户模拟攻击(Personality-based Client Simulation Attack, PCSA)框架,通过构建具有连贯人格特征的模拟客户对话来系统性暴露LLMs在心理安全对齐方面的漏洞,相较于现有红队测试框架更聚焦于领域特异性对抗策略,实验证明其能有效识别模型在提供非授权医疗建议、强化妄想和隐性鼓励高风险行为等方面的脆弱性。
链接: https://arxiv.org/abs/2604.04842
作者: Qingyang Xu,Yaling Shen,Stephanie Fong,Zimu Wang,Yiwen Jiang,Xiangyu Zhao,Jiahe Liu,Zhongxing Xu,Vincent Lee,Zongyuan Ge
机构: Monash University (莫纳什大学); University of Liverpool (利物浦大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:The increasing use of large language models (LLMs) in mental healthcare raises safety concerns in high-stakes therapeutic interactions. A key challenge is distinguishing therapeutic empathy from maladaptive validation, where supportive responses may inadvertently reinforce harmful beliefs or behaviors in multi-turn conversations. This risk is largely overlooked by existing red-teaming frameworks, which focus mainly on generic harms or optimization-based attacks. To address this gap, we introduce Personality-based Client Simulation Attack (PCSA), the first red-teaming framework that simulates clients in psychological counseling through coherent, persona-driven client dialogues to expose vulnerabilities in psychological safety alignment. Experiments on seven general and mental health-specialized LLMs show that PCSA substantially outperforms four competitive baselines. Perplexity analysis and human inspection further indicate that PCSA generates more natural and realistic dialogues. Our results reveal that current LLMs remain vulnerable to domain-specific adversarial tactics, providing unauthorized medical advice, reinforcing delusions, and implicitly encouraging risky actions.
[NLP-8] MERIT: Multilingual Expert-Reward Informed Tuning for Chinese-Centric Low-Resource Machine Translation
【速读】: 该论文旨在解决中文到低资源东南亚语言(LRLs)的神经机器翻译(NMT)问题,其核心挑战在于高质量平行语料极度稀缺以及现有挖掘数据中普遍存在噪声,导致模型训练困难且性能显著落后于高资源方向。解决方案的关键在于提出一种统一的翻译框架——多语言专家奖励感知微调(MERIT),该框架通过语言特定标记前缀(LTP)增强模型对目标语言的识别能力,结合监督微调(SFT)与一种基于语义对齐奖励(SAR)引导的组相对策略优化(GRPO),实现更精准的奖励驱动优化,从而在数据受限条件下显著提升翻译质量。
链接: https://arxiv.org/abs/2604.04839
作者: Zhixiang Lu,Chong Zhang,Chenyu Xue,Angelos Stefanidis,Chong Li,Jionglong Su,Zhengyong Jiang
机构: Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Neural machine translation (NMT) from Chinese to low-resource Southeast Asian languages remains severely constrained by the extreme scarcity of clean parallel corpora and the pervasive noise in existing mined data. This chronic shortage not only impedes effective model training but also sustains a large performance gap with high-resource directions, leaving millions of speakers of languages such as Lao, Burmese, and Tagalog with persistently low-quality translation systems despite recent advances in large multilingual models. We introduce Multilingual Expert-Reward Informed Tuning (MERIT), a unified translation framework that transforms the traditional English-centric ALT benchmark into a Chinese-centric evaluation suite for five Southeast Asian low-resource languages (LRLs). Our framework combines language-specific token prefixing (LTP) with supervised fine-tuning (SFT) and a novel group relative policy optimization (GRPO) guided by the semantic alignment reward (SAR). Experimental results confirm that, in LRL→Chinese translation, targeted data curation and reward-guided optimization dramatically outperform mere model scaling.
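语言特定标记前缀(LTP)与语义对齐奖励(SAR)的接入方式可用如下草图示意。注意:语言标签写法(如`<2lo>`)为假设,论文并未给出具体token;SAR此处用一元词重叠F1做粗糙代理,论文的SAR应基于语义表示,这里仅示意奖励在GRPO中的接入位置。

```python
# hypothetical target-language tags; the paper's actual tokens are not given
LANG_TAGS = {"lao": "<2lo>", "burmese": "<2my>", "tagalog": "<2tl>"}

def add_language_prefix(source_text, target_lang):
    """Language-specific token prefixing (LTP): prepend a target-language
    tag so the model conditions generation on the intended language."""
    return f"{LANG_TAGS[target_lang]} {source_text}"

def sar_proxy(hyp_tokens, ref_tokens):
    """Crude stand-in for the semantic alignment reward (SAR): unigram F1
    overlap between hypothesis and reference. Only a placeholder to show
    where a per-rollout scalar reward plugs into GRPO."""
    hyp, ref = set(hyp_tokens), set(ref_tokens)
    if not hyp or not ref:
        return 0.0
    overlap = len(hyp & ref)
    p, r = overlap / len(hyp), overlap / len(ref)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

print(add_language_prefix("你好,世界", "lao"))  # → <2lo> 你好,世界
print(round(sar_proxy(["a", "b", "c"], ["a", "b", "d"]), 3))
```

训练时,每条带LTP前缀的输入生成一组候选译文,各候选由SAR打分后做组相对归一化,即可套用GRPO更新。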
[NLP-9] Plausibility as Commonsense Reasoning : Humans Succeed Large Language Models Do not LREC2026
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在处理句法歧义时,是否能够像人类一样将世界知识与句法结构以结构敏感的方式结合,从而实现合理的歧义消解。解决方案的关键在于设计了一组土耳其语前置限定从句(prenominal relative-clause, RC)歧义材料,保持句法配置不变且两种解析(高附着 High Attachment, HA 和低附着 Low Attachment, LA)均符合语用可能性,同时通过事件合理性(plausibility)的梯度差异系统性地偏向某一解析。研究人员在人类被试中验证了显著且方向正确的合理性效应,并在此基础上,采用基于平均每词对数概率(mean per-token log-probability)的偏好范式评估土耳其语及多语言LLMs的表现,发现模型在合理性驱动下的附着偏好较弱、不稳定甚至方向相反,表明当前LLMs未能可靠地利用合理性信息进行结构敏感的歧义消解。
链接: https://arxiv.org/abs/2604.04825
作者: Sercan Karakaş
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to The Workshop on Cognitive Modeling and Computational Linguistics co-located with LREC 2026
Abstract:Large language models achieve strong performance on many language tasks, yet it remains unclear whether they integrate world knowledge with syntactic structure in a human-like, structure-sensitive way during ambiguity resolution. We test this question in Turkish prenominal relative-clause attachment ambiguities, where the same surface string permits high attachment (HA) or low attachment (LA). We construct ambiguous items that keep the syntactic configuration fixed and ensure both parses remain pragmatically possible, while graded event plausibility selectively favors High Attachment vs. Low Attachment. The contrasts are validated with independent norming ratings. In a speeded forced-choice comprehension experiment, humans show a large, correctly directed plausibility effect. We then evaluate Turkish and multilingual LLMs in a parallel preference-based setup that compares matched HA/LA continuations via mean per-token log-probability. Across models, plausibility-driven shifts are weak, unstable, or reversed. The results suggest that, in the tested models, plausibility information does not guide attachment preferences as reliably as it does in human judgments, and they highlight Turkish RC attachment as a useful cross-linguistic diagnostic beyond broad benchmarks.
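论文采用的"平均每词对数概率"偏好比较可以直接示意如下:对两种附着解读(HA/LA)的延续句分别取逐token对数概率的均值(长度归一化),取较高者为模型偏好。下例中的分数为虚构的玩具数值,并非真实模型输出。

```python
def mean_logprob(token_logprobs):
    """Mean per-token log-probability of a continuation: the
    length-normalized preference score used to compare completions."""
    return sum(token_logprobs) / len(token_logprobs)

def attachment_preference(ha_logprobs, la_logprobs):
    """Return 'HA' or 'LA' according to which matched continuation has
    the higher mean per-token log-probability."""
    return "HA" if mean_logprob(ha_logprobs) > mean_logprob(la_logprobs) else "LA"

# toy per-token scores standing in for model outputs (not real numbers)
ha = [-1.2, -0.7, -2.1]
la = [-1.5, -1.9, -2.2, -1.1]
print(attachment_preference(ha, la))  # → HA
```

长度归一化是这里的关键设计:若直接比较总对数概率,较短的延续会被系统性偏好,无法反映结构本身的合理性差异。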
[NLP-10] ANX: Protocol-First Design for AI Agent Interaction with a Supporting 3EX Decoupled Architecture
【速读】: 该论文旨在解决当前AI代理(AI agents)在执行任务时面临的高Token消耗、交互碎片化、安全性不足等问题,这些问题源于缺乏统一的顶层框架和关键组件,导致各模块独立运行且存在缺陷。其解决方案的核心是提出ANX(Agent-native eXtensible protocol),一个开放、可扩展、可验证的原生代理协议及顶层架构,通过四大创新实现突破:1)原生设计(ANX Config、Markup、CLI)提升信息密度与适应性,减少Token使用并消除不一致性;2)融合Skill机制的人机交互模式,支持指令与UI双渲染;3)基于MCP的按需轻量级应用部署,无需预注册;4)ANX Markup驱动的机器可执行标准操作流程(SOP),消除歧义以保障长期任务与多代理协作可靠性。实验表明,ANX相较MCP技能方案平均降低55.6% Token消耗,较GUI自动化减少66.3%,同时显著缩短执行时间。
链接: https://arxiv.org/abs/2604.04820
作者: Xu Mingze
机构: Hangzhou Ziyou Data Technology Co., Ltd.(杭州子游数据科技有限公司)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This open-source AI agent interaction protocol (ANX) is benchmarked against existing protocols (MCP, A2A, ANP, OpenCLI, SkillWeaver, CHEQ, COLLAB-LLM) across four dimensions: tooling, discovery, security, and multi-agent SOP collaboration. Code: this https URL
Abstract:AI agents, autonomous digital actors, need agent-native protocols. Existing approaches such as GUI automation and MCP-based skills suffer from high token consumption, fragmented interaction, and inadequate security, because they lack a unified top-level framework and key components, leaving each independent module flawed. To address these issues, we present ANX, an open, extensible, verifiable agent-native protocol and top-level framework integrating CLI, Skill, MCP, resolving pain points via protocol innovation, architectural optimization and tool supplementation. Its four core innovations: 1) Agent-native design (ANX Config, Markup, CLI) with high information density, flexibility and strong adaptability to reduce tokens and eliminate inconsistencies; 2) Human-agent interaction combining Skill’s flexibility for dual rendering as agent-executable instructions and human-readable UI; 3) MCP-supported on-demand lightweight apps without pre-registration; 4) ANX Markup-enabled machine-executable SOPs eliminating ambiguity for reliable long-horizon tasks and multi-agent collaboration. As the first in a series, we focus on ANX’s design, present its 3EX decoupled architecture with ANXHub and preliminary feasibility analysis and experimental validation. ANX ensures native security: LLM-bypassed UI-to-Core communication keeps sensitive data out of agent context; human-only confirmation prevents automated misuse. Form-filling experiments with Qwen3.5-plus/GPT-4o show ANX reduces tokens by 47.3% (Qwen3.5-plus) and 55.6% (GPT-4o) vs MCP-based skills, 57.1% (Qwen3.5-plus) and 66.3% (GPT-4o) vs GUI automation, and shortens execution time by 58.1% and 57.7% vs MCP-based skills.
[NLP-11] LiveFact: A Dynamic Time-Aware Benchmark for LLM-Driven Fake News Detection ACL2026
【速读】: 该论文旨在解决当前虚假新闻检测与事实核查任务中评估框架滞后于大语言模型(Large Language Models, LLMs)发展的问题,尤其是静态基准测试易受基准数据污染(Benchmark Data Contamination, BDC)影响,且无法有效评估模型在时间不确定性下的推理能力。其解决方案的关键在于提出一个持续更新的动态基准 LiveFact,该基准通过引入随时间演化的证据集来模拟现实世界中信息不完整、不断变化的“战争迷雾”(fog of war)场景,从而评估模型基于证据进行推理的能力,而非依赖记忆知识;同时设计双模式评估机制——分类模式用于最终验证,推理模式用于证据驱动的逐步推理,并内置BDC监控组件,以更真实地反映模型在动态环境中的鲁棒性和认知谦逊性(epistemic humility)。
链接: https://arxiv.org/abs/2604.04815
作者: Cheng Xu,Changhong Jin,Yingjie Niu,Nan Yan,Yuke Mei,Shuhao Guan,Liming Chen,M-Tahar Kechadi
机构: University College Dublin (都柏林大学); Georgia Institute of Technology (佐治亚理工学院); Dalian University of Technology (大连理工大学); Bebxy
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2026 Main
Abstract:The rapid development of Large Language Models (LLMs) has transformed fake news detection and fact-checking tasks from simple classification to complex reasoning. However, evaluation frameworks have not kept pace. Current benchmarks are static, making them vulnerable to benchmark data contamination (BDC) and ineffective at assessing reasoning under temporal uncertainty. To address this, we introduce LiveFact a continuously updated benchmark that simulates the real-world “fog of war” in misinformation detection. LiveFact uses dynamic, temporal evidence sets to evaluate models on their ability to reason with evolving, incomplete information rather than on memorized knowledge. We propose a dual-mode evaluation: Classification Mode for final verification and Inference Mode for evidence-based reasoning, along with a component to monitor BDC explicitly. Tests with 22 LLMs show that open-source Mixture-of-Experts models, such as Qwen3-235B-A22B, now match or outperform proprietary state-of-the-art systems. More importantly, our analysis finds a significant “reasoning gap.” Capable models exhibit epistemic humility by recognizing unverifiable claims in early data slices-an aspect traditional static benchmarks overlook. LiveFact sets a sustainable standard for evaluating robust, temporally aware AI verification.
[NLP-12] How Far Are We? Systematic Evaluation of LLM s vs. Human Experts in Mathematical Contest in Modeling
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在真实世界端到端问题求解能力上的局限性,尤其是其在数学建模竞赛这类复杂任务中从问题理解到执行落地各阶段表现不均的问题。解决方案的关键在于提出一种面向问题、分阶段的评估框架(problem-oriented, stage-wise evaluation framework),该框架通过专家验证的标准对模型在建模流程中的不同阶段(如问题识别、公式化、求解、代码实现与结果分析)进行细粒度评估,并以自动评分与独立人工专家判断的一致性验证其可靠性,从而揭示出LLMs存在“理解-执行差距”(comprehension-execution gap):即模型在早期阶段表现良好,但在执行导向阶段(如模型求解、代码实现和结果分析)持续存在缺陷,且这些缺陷源于规范不足、验证缺失和缺乏校验机制,导致错误在阶段间传播而无法修正。这一发现表明,仅靠模型规模扩展不足以弥合该差距,需发展新的方法论支持复杂现实问题的全流程求解。
链接: https://arxiv.org/abs/2604.04791
作者: Yuhang Liu,Heyan Huang,Yizhe Yang,Hongyan Zhao,Zhizhuo Zeng,Yang Gao
机构: Beijing Institute of Technology (北京理工大学); Southeast Academy of Information Technology (东南信息科技研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have achieved strong performance on reasoning benchmarks, yet their ability to solve real-world problems requiring end-to-end workflows remains unclear. Mathematical modeling competitions provide a stringent testbed for evaluating such end-to-end problem-solving capability. We propose a problem-oriented, stage-wise evaluation framework that assesses LLM performance across modeling stages using expert-verified criteria. We validate the framework’s reliability by comparing automatic scores with independent human expert judgments on problems from the China Postgraduate Mathematical Contest in Modeling, demonstrating substantially stronger alignment than existing evaluation schemes. Using this framework, we reveal a comprehension-execution gap in state-of-the-art LLMs: while they perform well in early stages such as problem identification and formulation, they exhibit persistent deficiencies in execution-oriented stages including model solving, code implementation, and result analysis. These gaps persist even with increased model scale. We further trace these failures to insufficient specification, missing verification, and lack of validation, with errors propagating across stages without correction. Our findings suggest that bridging this gap requires approaches beyond model scaling, offering insights for applying LLMs to complex real-world problem solving.
[NLP-13] HUKUKBERT: Domain-Specific Language Model for Turkish Law
【速读】: 该论文旨在解决土耳其法律领域自然语言处理(Natural Language Processing, NLP)模型稀缺的问题,特别是由于缺乏高质量的领域专用语料库和预训练模型,导致现有通用或领域特定的土耳其语模型在法律文本理解任务中表现不佳。解决方案的关键在于构建并发布HukukBERT——一个基于18 GB清洗后的法律语料库、采用混合领域自适应预训练(Domain-Adaptive Pre-Training, DAPT)方法训练的大型法律语言模型,该方法融合了整词掩码(Whole-Word Masking)、词段掩码(Token Span Masking)、词段掩码(Word Span Masking)以及针对性关键词掩码(Targeted Keyword Masking)策略,从而显著提升了模型对土耳其法律文本的理解能力。实验证明,HukukBERT在新建的法律填空测试基准(Legal Cloze Test)上达到84.40%的Top-1准确率,并在官方土耳其法院判决文书结构分割任务中实现92.8%的文档通过率,均达到当前最优水平。
链接: https://arxiv.org/abs/2604.04790
作者: Mehmet Utku Öztürk,Tansu Türkoğlu,Buse Buz-Yalug
机构: Kalitte Inc.; Aibrite Inc.; University of Eastern Finland
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 15 pages
Abstract:Recent advances in natural language processing (NLP) have increasingly enabled LegalTech applications, yet existing studies specific to Turkish law have still been limited due to the scarcity of domain-specific data and models. Although extensive models like LEGAL-BERT have been developed for English legal texts, the Turkish legal domain lacks a domain-specific high-volume counterpart. In this paper, we introduce HukukBERT, the most comprehensive legal language model for Turkish, trained on a 18 GB cleaned legal corpus using a hybrid Domain-Adaptive Pre-Training (DAPT) methodology integrating Whole-Word Masking, Token Span Masking, Word Span Masking, and targeted Keyword Masking. We systematically compared our 48K WordPiece tokenizer and DAPT approach against general-purpose and existing domain-specific Turkish models. Evaluated on a novel Legal Cloze Test benchmark – a masked legal term prediction task designed for Turkish court decisions – HukukBERT achieves state-of-the-art performance with 84.40% Top-1 accuracy, substantially outperforming existing models. Furthermore, we evaluated HukukBERT in the downstream task of structural segmentation of official Turkish court decisions, where it achieves a 92.8% document pass rate, establishing a new state-of-the-art. We release HukukBERT to support future research in Turkish legal NLP tasks, including recognition of named entities, prediction of judgment, and classification of legal documents.
[NLP-14] MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
【速读】: 该论文旨在解决当前文档解析(document parsing)方法过度聚焦于模型架构创新,而忽视系统性训练数据工程的问题。研究发现,不同架构和参数规模的最先进(SOTA)模型在相同困难样本上表现出高度一致的失败模式,表明性能瓶颈源于训练数据的共性缺陷而非模型结构本身。解决方案的关键在于通过数据工程与训练策略优化实现性能突破:提出“Data Engine”框架,包含多样性与难度感知采样(Diversity-and-Difficulty-Aware Sampling)、跨模型一致性验证(Cross-Model Consistency Verification)以及判别与精修流水线(Judge-and-Refine pipeline),从而系统提升训练数据的覆盖度、信息量与标注准确性;并设计三阶段渐进式训练策略(大规模预训练→困难样本微调→GRPO对齐),分层利用不同质量层级的数据。最终,在保持原模型1.2B参数不变的前提下,模型在改进后的OmniDocBench v1.6基准上达到95.69分,显著优于同架构基线及所有更大规模模型。
链接: https://arxiv.org/abs/2604.04771
作者: Bin Wang,Tianyao He,Linke Ouyang,Fan Wu,Zhiyuan Zhao,Tao Chu,Yuan Qu,Zhenjiang Jin,Weijun Zeng,Ziyang Miao,Bangrui Xu,Junbo Niu,Mengzhang Cai,Jiantao Qiu,Qintong Zhang,Dongsheng Ma,Yuefeng Sun,Hejun Dong,Wenzheng Zhang,Jutao Xiao,Jiayong Shi,Pengyu Liao,Xiaomeng Zhao,Huaping Zhong,Liqun Wei,Jing Yu,Jie Yang,Wei Li,Shasha Wang,Qianqian Wu,Xuanhe Zhou,Weijia Li,Zhenxiang Li,Zhongying Tu,Jiang Wu,Lijun Wu,Chao Xu,Kai Chen,Wentao Zhang,Yu Qiao,Bowen Zhou,Dahua Lin,Conghui He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Technical Report
Abstract:Current document parsing methods compete primarily on model architecture innovation, while systematic engineering of training data remains underexplored. Yet SOTA models of different architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training data rather than architecture itself. Building on this finding, we present MinerU2.5-Pro, which advances the state of the art solely through data engineering and training strategy optimization while keeping the 1.2B-parameter architecture of MinerU2.5 completely fixed. At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands training data from under 10M to 65.5M samples while correcting distribution shift; Cross-Model Consistency Verification leverages output agreement among heterogeneous models to assess sample difficulty and generate reliable annotations; the Judge-and-Refine pipeline improves annotation quality for hard samples through render-then-verify iterative correction. A three-stage progressive training strategy – large-scale pre-training, hard sample fine-tuning, and GRPO alignment – sequentially exploits these data at different quality tiers. On the evaluation front, we fix element-matching biases in OmniDocBench v1.5 and introduce a Hard subset, establishing the more discriminative OmniDocBench v1.6 protocol. Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods including models with over 200× more parameters.
[NLP-15] Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在强化学习后训练(post-training)阶段面临的探索障碍问题,即当任务过于困难时,模型无法从当前策略中获得有意义的奖励信号,从而导致学习停滞。解决方案的关键在于通过任务重构(task reformulation)构建一个由易到难的自适应课程学习框架——Cog-DRIFT,将原本开放式的复杂问题转化为结构化、认知负担更轻的变体(如选择题和填空题),这些变体保留原始答案但显著缩小搜索空间并提供更密集的学习信号;模型先在简单格式上学习,再将知识迁移回原问题,从而实现对原先不可解难题的有效优化。
链接: https://arxiv.org/abs/2604.04767
作者: Justin Chih-Yao Chen,Archiki Prasad,Zaid Khan,Joykirat Singh,Runchu Tian,Elias Stengel-Eskin,Mohit Bansal
机构: UNC Chapel Hill
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 22 pages, 4 figures. Code: this https URL
Abstract:Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of LLMs, yet a fundamental limitation remains: models cannot learn from problems that are too difficult to solve under their current policy, as these yield no meaningful reward signal. We propose a simple yet effective solution based on task reformulation. We transform challenging open-ended problems into cognitively simpler variants – such as multiple-choice and cloze formats – that preserve the original answer while reducing the effective search space and providing denser learning signals. These reformulations span a spectrum from discriminative to generative tasks, which we exploit to bootstrap learning: models first learn from structured, easier formats, and this knowledge transfers back to improve performance on the original open-ended problems. Building on this insight, we introduce Cog-DRIFT, a framework that constructs reformulated variants and organizes them into an adaptive curriculum based on difficulty. Training progresses from easier to harder formats, enabling the model to learn from problems that previously yielded zero signal under standard RL post-training. Cog-DRIFT not only improves on the originally unsolvable hard problems (absolute +10.11% for Qwen and +8.64% for Llama) but also generalizes well to other held-out datasets. Across 2 models and 6 reasoning benchmarks, our method consistently outperforms standard GRPO and strong guided-exploration baselines. On average, Cog-DRIFT shows +4.72% (Qwen) and +3.23% (Llama) improvements over the second-best baseline. We further show that Cog-DRIFT improves pass@k at test time, and the curriculum improves sample efficiency. Overall, our results highlight task reformulation and curriculum learning as an effective paradigm for overcoming the exploration barrier in LLM post-training.
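摘要中"将开放式难题重构为保留原答案、但认知负担更轻的选择题/填空题变体"的思路,可以用如下极简示意说明。注意:这只是对重构思想的示意性实现,并非Cog-DRIFT的实际代码,题目与干扰项均为虚构示例:

```python
import random

def to_multiple_choice(question, answer, distractors, seed=0):
    """示意性重构:把开放式问题变为保留原答案的选择题,缩小搜索空间。"""
    rng = random.Random(seed)
    options = distractors + [answer]
    rng.shuffle(options)  # 打乱选项顺序,避免答案位置固定
    letters = "ABCD"
    lines = [question] + [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    gold = letters[options.index(answer)]
    return "\n".join(lines), gold

def to_cloze(question, answer):
    """示意性重构:填空题变体,答案被挖空但保持不变。"""
    return f"{question} 答案是 ____。", answer

prompt, gold = to_multiple_choice("7 * 8 等于多少?", "56", ["54", "49", "63"])
print(prompt)
print("正确选项:", gold)
```

重构后的变体答案可直接验证,从而为原本奖励信号为零的难题提供更密集的学习信号;课程学习再按"判别式→生成式"的难度梯度安排训练顺序。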
[NLP-16] Your Agent Their Asset: A Real-World Safety Analysis of OpenClaw
【速读】: 该论文旨在解决当前个人AI代理(Personal AI Agent)在真实部署环境中因权限过大而带来的安全风险问题,尤其是现有沙箱评估方法无法充分捕捉其潜在攻击面的局限性。解决方案的关键在于提出CIK分类法(Capability-Identity-Knowledge Taxonomy),将代理的持久状态统一为能力、身份和知识三个维度,并基于此框架对OpenClaw这一广泛部署的个人AI代理进行12种攻击场景的实证评估。结果表明,任一CIK维度被污染均显著提升攻击成功率(从24.6%升至64–74%),揭示了此类漏洞本质上源于代理架构设计,需构建更系统化的防护机制以保障个人AI代理的安全性。
链接: https://arxiv.org/abs/2604.04759
作者: Zijun Wang,Haoqin Tu,Letian Zhang,Hardy Chen,Juncheng Wu,Xiangyan Liu,Zhenlong Yuan,Tianyu Pang,Michael Qizhe Shieh,Fengze Liu,Zeyu Zheng,Huaxiu Yao,Yuyin Zhou,Cihang Xie
机构: UC Santa Cruz (加州大学圣克鲁兹分校); NUS (新加坡国立大学); Tencent (腾讯); ByteDance (字节跳动); UC Berkeley (加州大学伯克利分校); UNC-Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:OpenClaw, the most widely deployed personal AI agent in early 2026, operates with full local system access and integrates with sensitive services such as Gmail, Stripe, and the filesystem. While these broad privileges enable high levels of automation and powerful personalization, they also expose a substantial attack surface that existing sandboxed evaluations fail to capture. To address this gap, we present the first real-world safety evaluation of OpenClaw and introduce the CIK taxonomy, which unifies an agent’s persistent state into three dimensions, i.e., Capability, Identity, and Knowledge, for safety analysis. Our evaluations cover 12 attack scenarios on a live OpenClaw instance across four backbone models (Claude Sonnet 4.5, Opus 4.6, Gemini 3.1 Pro, and GPT-5.4). The results show that poisoning any single CIK dimension increases the average attack success rate from 24.6% to 64-74%, with even the most robust model exhibiting more than a threefold increase over its baseline vulnerability. We further assess three CIK-aligned defense strategies alongside a file-protection mechanism; however, the strongest defense still yields a 63.8% success rate under Capability-targeted attacks, while file protection blocks 97% of malicious injections but also prevents legitimate updates. Taken together, these findings show that the vulnerabilities are inherent to the agent architecture, necessitating more systematic safeguards to secure personal AI agents. Our project page is this https URL.
[NLP-17] Darkness Visible: Reading the Exception Handler of a Language Model
【速读】: 该论文旨在解决大语言模型(如GPT-2 Small)中神经元功能分工与知识存储机制的可解释性问题,特别是理解最终多层感知机(MLP)层如何组织其内部结构以实现文本生成中的路由决策和语义处理。解决方案的关键在于对全部3,072个神经元进行数值精度上的分解,识别出五类具有明确功能的角色:5个核心神经元(Core neurons)用于重置词汇到功能词方向,10个差异化神经元(Differentiators)抑制错误候选,5个专家神经元(Specialists)检测句法边界,以及7个共识神经元(Consensus neurons)分别监控不同语言维度。研究进一步揭示了“知识神经元”并非直接存储事实信息,而是作为路由基础设施,在注意力机制输出的基础上放大或抑制信号,并且其行为受上下文约束强度调控;此外,通过花园路径实验发现模型依赖于词级可预测性而非句法结构进行判断,表明该架构仅在顶层显现,深层网络中类似结构应存在于最终层而非中间层(如第11层)。
链接: https://arxiv.org/abs/2604.04756
作者: Peter Balogh
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:The final MLP of GPT-2 Small exhibits a fully legible routing program – 27 named neurons organized into a three-tier exception handler – while the knowledge it routes remains entangled across ~3,040 residual neurons. We decompose all 3,072 neurons (to numerical precision) into: 5 fused Core neurons that reset vocabulary toward function words, 10 Differentiators that suppress wrong candidates, 5 Specialists that detect structural boundaries, and 7 Consensus neurons that each monitor a distinct linguistic dimension. The consensus-exception crossover – where MLP intervention shifts from helpful to harmful – is statistically sharp (bootstrap 95% CIs exclude zero at all consensus levels; crossover between 4/7 and 5/7). Three experiments show that “knowledge neurons” (Dai et al., 2022), at L11 of this model, function as routing infrastructure rather than fact storage: the MLP amplifies or suppresses signals already present in the residual stream from attention, scaling with contextual constraint. A garden-path experiment reveals a reversed garden-path effect – GPT-2 uses verb subcategorization immediately, consistent with the exception handler operating at token-level predictability rather than syntactic structure. This architecture crystallizes only at the terminal layer – in deeper models, we predict equivalent structure at the final layer, not at layer 11. Code and data: this https URL
[NLP-18] Hallucination Basins: A Dynamic Framework for Understanding and Controlling LLM Hallucinations
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中存在的幻觉问题,即模型生成看似合理但事实错误的输出。其核心解决方案是提出一个几何动力系统框架,揭示幻觉源于潜在空间中任务依赖的基底结构(basin structure)。关键发现在于:不同任务下基底分离程度显著差异——事实类任务通常具有更清晰的基底分离,而摘要生成和易产生误解的任务则基底重叠、稳定性差。作者通过理论证明(任务复杂度定理与多基底定理)刻画了L层Transformer中基底的涌现机制,并进一步展示了基于几何信息的引导策略可在不重新训练的情况下有效降低幻觉概率。
链接: https://arxiv.org/abs/2604.04743
作者: Kalyan Cherukuri,Lav R. Varshney
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Large language models (LLMs) hallucinate: they produce fluent outputs that are factually incorrect. We present a geometric dynamical systems framework in which hallucinations arise from task-dependent basin structure in latent space. Using autoregressive hidden-state trajectories across multiple open-source models and benchmarks, we find that separability is strongly task-dependent rather than universal: factoid settings can show clearer basin separation, whereas summarization and misconception-heavy settings are typically less stable and often overlap. We formalize this behavior with task-complexity and multi-basin theorems, characterize basin emergence in L-layer transformers, and show that geometry-aware steering can reduce hallucination probability without retraining.
[NLP-19] Lighting Up or Dimming Down? Exploring Dark Patterns of LLM s in Co-Creativity
【速读】: 该论文旨在解决生成式 AI(Generative AI)在人类与人工智能协同创作过程中可能削弱人类创造力的问题,特别是识别和分析五种“暗模式”(dark patterns)——即模型行为中潜藏的、可能抑制或扭曲创造性过程的机制,包括奉承(Sycophancy)、语气管制(Tone Policing)、道德化(Moralizing)、死亡循环(Loop of Death)和锚定效应(Anchoring)。研究通过控制实验揭示这些暗模式在不同文学体裁和主题下的出现频率,发现奉承行为几乎普遍存在(91.7%),而锚定效应则与文体密切相关(如在民间故事中最显著)。其解决方案的关键在于:认识到这些行为往往是安全对齐(safety alignment)的副产品,因此未来AI系统的设计应注重平衡安全性与创造性支持,避免因过度约束导致创意探索空间被压缩。
链接: https://arxiv.org/abs/2604.04735
作者: Zhu Li,Jiaming Qu,Yuan Chang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) are increasingly acting as collaborative writing partners, raising questions about their impact on human agency. In this exploratory work, we investigate five “dark patterns” in human-AI co-creativity – subtle model behaviors that can suppress or distort the creative process: Sycophancy, Tone Policing, Moralizing, Loop of Death, and Anchoring. Through a series of controlled sessions where LLMs are prompted as writing assistants across diverse literary forms and themes, we analyze the prevalence of these behaviors in generated responses. Our preliminary results suggest that Sycophancy is nearly ubiquitous (91.7% of cases), particularly in sensitive topics, while Anchoring appears to be dependent on literary forms, surfacing most frequently in folktales. This study indicates that these dark patterns, often byproducts of safety alignment, may inadvertently narrow creative exploration and proposes design considerations for AI systems that effectively support creative writing.
[NLP-20] Metaphors We Compute By: A Computational Audit of Cultural Translation vs. Thinking in LLM s
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)是否能够进行文化敏感的推理,而不仅仅是作为多语言工具或文化表达的翻译器。其核心关切在于,LLMs在跨文化语境下生成内容时,是否真正具备基于特定文化背景的深层认知能力,还是仅依赖于一种占主导地位的概念框架(如西方中心主义)进行局部化表达。解决方案的关键在于通过一个隐喻生成任务对五种不同文化情境下的抽象概念进行实证分析,从而检验模型是否存在刻板印象和文化默认倾向(Western defaultism),进而揭示当前LLMs在文化多样性支持上的局限性。
链接: https://arxiv.org/abs/2604.04732
作者: Yuan Chang,Jiaming Qu,Zhu Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are often described as multilingual because they can understand and respond in many languages. However, speaking a language is not the same as reasoning within a culture. This distinction motivates a critical question: do LLMs truly conduct culture-aware reasoning? This paper presents a preliminary computational audit of cultural inclusivity in a creative writing task. We empirically examine whether LLMs act as culturally diverse creative partners or merely as cultural translators that leverage a dominant conceptual framework with localized expressions. Using a metaphor generation task spanning five cultural settings and several abstract concepts as a case study, we find that the model exhibits stereotyped metaphor usage for certain settings, as well as Western defaultism. These findings suggest that merely prompting an LLM with a cultural identity does not guarantee culturally grounded reasoning.
[NLP-21] Individual and Combined Effects of English as a Second Language and Typos on LLM Performance
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在真实应用场景中性能评估失真的问题,特别是当用户以非母语英语(English as a Second Language, ESL)输入且常伴随拼写错误时,现有评估方法往往将ESL变异与拼写错误分开研究,忽略了二者共现对模型表现的复合影响。解决方案的关键在于提出Trans-EnV框架和MulTypo工具,分别生成八种ESL变体并引入低、中、高三个层级的拼写错误,系统性地构建混合干扰场景,从而揭示ESL与拼写错误共同作用下模型性能下降的非线性特征,表明仅基于标准英语的评估会高估实际表现,且孤立评估无法准确反映现实复杂交互中的模型行为。
链接: https://arxiv.org/abs/2604.04723
作者: Serena Liu,Yutong Yang,Prisha Sheth,Weixuan Dong,Mingjiao Diao,Xinru Zhu,Nikhil Banga,Oscar Melendez,Arnav Sharma,Minda Zhao,Marina Lin,Mengyu Wang
机构: Harvard University (哈佛大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are used globally, and because much of their training data is in English, they typically perform best on English inputs. As a result, many non-native English speakers interact with them in English as a second language (ESL), and these inputs often contain typographical errors. Prior work has largely studied the effects of ESL variation and typographical errors separately, even though they often co-occur in real-world use. In this study, we use the Trans-EnV framework to transform standard English inputs into eight ESL variants and apply MulTypo to inject typos at three levels: low, moderate, and severe. We find that combining ESL variation and typos generally leads to larger performance drops than either factor alone, though the combined effect is not simply additive. This pattern is clearest on closed-ended tasks, where performance degradation can be characterized more consistently across ESL variants and typo levels, while results on open-ended tasks are more mixed. Overall, these findings suggest that evaluations on clean standard English may overestimate real-world model performance, and that evaluating ESL variation and typographical errors in isolation does not fully capture model behavior in realistic settings.
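摘要中按低、中、重三个层级向输入注入拼写错误的做法,可以用一个简化的字符级示意脚本说明。注意:这是示意性实现,并非MulTypo工具本身,错误率数值也仅为假设:

```python
import random

def inject_typos(text, rate, seed=0):
    """示意性的字符级拼写错误注入:以概率 rate 交换相邻字母(非 MulTypo 本身)。"""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # 跳过已交换的字符对
        else:
            i += 1
    return "".join(chars)

# 假设的三个严重程度层级(low / moderate / severe)
for level, rate in [("low", 0.02), ("moderate", 0.08), ("severe", 0.2)]:
    print(level, inject_typos("Please summarize the following article.", rate))
```

这种注入只改变字符顺序而不增删字符,因此可与ESL变体叠加,考察两类扰动共同作用下的性能下降。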
[NLP-22] What Makes Good Multilingual Reasoning? Disentangling Reasoning Traces with Measurable Features
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多语言场景下推理能力存在显著差异的问题,特别是英语与其他语言之间的性能差距。现有研究常假设通过使非英语推理方式模仿英语推理即可缩小差距,但本文质疑这一假设,提出应识别真正促进多语言有效推理的核心特征,并评估这些特征是否具有跨语言普适性。解决方案的关键在于:首先定义一套可量化的推理特征集,涵盖多语言对齐、推理步骤和推理流程三个维度;其次利用稀疏自编码器自动挖掘潜藏的推理概念以扩展或细化这些特征;最后将这些特征作为测试时的选择策略,用于引导模型提升多语言推理表现。实验结果表明,多数特征与准确率正相关,但其强度随语言变化甚至出现反向关联,从而揭示了英语中心主义奖励设计的局限性,并主张发展适应语言特异性推理模式的动态优化目标。
链接: https://arxiv.org/abs/2604.04720
作者: Dayeon Ki,Kevin Duh,Marine Carpuat
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 31 pages, 7 figures
Abstract:Large Reasoning Models (LRMs) still exhibit large performance gaps between English and other languages, yet much current work assumes these gaps can be closed simply by making reasoning in every language resemble English reasoning. This work challenges this assumption by asking instead: what actually characterizes effective reasoning in multilingual settings, and to what extent do English-derived reasoning features genuinely help in other languages? We first define a suite of measurable reasoning features spanning multilingual alignment, reasoning step, and reasoning flow aspects of reasoning traces, and use logistic regression to quantify how each feature associates with final answer accuracy. We further train sparse autoencoders over multilingual traces to automatically discover latent reasoning concepts that instantiate or extend these features. Finally, we use the features as test-time selection policies to examine whether they can steer models toward stronger multilingual reasoning. Across two mathematical reasoning benchmarks, four LRMs, and 10 languages, we find that most features are positively associated with accuracy, but the strength of association varies considerably across languages and can even reverse in some. Our findings challenge English-centric reward designs and point toward adaptive objectives that accommodate language-specific reasoning patterns, with concrete implications for multilingual benchmark and reward design.
[NLP-23] BiST: A Gold Standard Bangla-English Bilingual Corpus for Sentence Structure and Tense Classification with Inter-Annotator Agreement
【速读】: 该论文旨在解决低资源语境下多语言自然语言处理(NLP)中高质量双语资源匮乏的问题,尤其针对孟加拉语(Bangla)这一代表性低资源语言。其解决方案的关键在于构建了一个精心标注的孟加拉语-英语平行语料库BiST,涵盖句法结构(Simple, Complex, Compound, Complex-Compound)和时态(Present, Past, Future)两个维度,共包含30,534句(17,465句英文和13,069句孟加拉语),并通过多阶段标注流程与Fleiss Kappa一致性检验确保标签可靠性(结构维度κ=0.82,时态维度κ=0.88)。该语料库不仅支持语法建模任务如受控文本生成、自动反馈生成和跨语言表征学习,还为双语语法建模提供了统一且语言学可解释的基准资源。
链接: https://arxiv.org/abs/2604.04708
作者: Abdullah Al Shafi,Swapnil Kundu Argha,M. A. Moyeen,Abdul Muntakim,Shoumik Barman Polok
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:High-quality bilingual resources remain a critical bottleneck for advancing multilingual NLP in low-resource settings, particularly for Bangla. To mitigate this gap, we introduce BiST, a rigorously curated Bangla-English corpus for sentence-level grammatical classification, annotated across two fundamental dimensions: syntactic structure (Simple, Complex, Compound, Complex-Compound) and tense (Present, Past, Future). The corpus is compiled from open-licensed encyclopedic sources and naturally composed conversational text, followed by systematic preprocessing and automated language identification, resulting in 30,534 sentences, including 17,465 English and 13,069 Bangla instances. Annotation quality is ensured through a multi-stage framework with three independent annotators and dimension-wise Fleiss Kappa (κ) agreement, yielding reliable and reproducible labels with κ values of 0.82 and 0.88 for structural and temporal annotation, respectively. Statistical analyses demonstrate realistic structural and temporal distributions, while baseline evaluations show that dual-encoder architectures leveraging complementary language-specific representations consistently outperform strong multilingual encoders. Beyond benchmarking, BiST provides explicit linguistic supervision that supports grammatical modeling tasks, including controlled text generation, automated feedback generation, and cross-lingual representation learning. The corpus establishes a unified resource for bilingual grammatical modeling and facilitates linguistically grounded multilingual research.
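摘要中用于衡量标注者间一致性的Fleiss Kappa是标准统计量,可按如下方式计算。示例矩阵为虚构数据,并非BiST语料的真实标注:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa:ratings 为矩阵,行=条目,列=类别,
    单元格=将该类别赋给该条目的标注者人数(假设每条目标注者数相同)。"""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])

    # 每个条目的观测一致性
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_items

    # 由类别边缘分布得到的期望一致性
    totals = [sum(row[j] for row in ratings) for j in range(n_cats)]
    p_j = [t / (n_items * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)

# 三名标注者对四个条目做二分类标注(虚构示例)
matrix = [[3, 0], [0, 3], [3, 0], [2, 1]]
print(round(fleiss_kappa(matrix), 3))  # → 0.625
```

论文报告的κ=0.82与0.88按常见解读属于"几乎完全一致"区间,支撑其标签可靠性结论。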
[NLP-24] IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation
【速读】: 该论文旨在解决现有句子表示学习方法仅关注语义内容(what a sentence says)而忽略表达方式(how it is expressed)的问题,后者在诸多下游任务中至关重要。为应对这一挑战,作者提出了一种新的任务——idiolectal representation learning(个体语言特征表示学习),其核心在于分离句子的风格与方言特征与其语义内容。解决方案的关键在于引入IDIOLEX框架,该框架通过结合句子来源(provenance)的监督信号与句子内容的语言学特征,学习每个句子的连续风格和方言表示。实验表明,该方法能有效捕捉阿拉伯语和西班牙语方言中的有意义差异,并具备跨领域迁移能力,同时可作为训练目标用于对齐语言模型的风格特性,从而提升大语言模型(LLM)的多样性与可访问性。
链接: https://arxiv.org/abs/2604.04704
作者: Anjali Kantharuban,Aarohi Srivastava,Fahim Faisal,Orevaoghene Ahia,Antonios Anastasopoulos,David Chiang,Yulia Tsvetkov,Graham Neubig
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Existing sentence representations primarily encode what a sentence says, rather than how it is expressed, even though the latter is important for many applications. In contrast, we develop sentence representations that capture style and dialect, decoupled from semantic content. We call this the task of idiolectal representation learning. We introduce IDIOLEX, a framework for training models that combines supervision from a sentence’s provenance with linguistic features of a sentence’s content, to learn a continuous representation of each sentence’s style and dialect. We evaluate the approach on dialects of both Arabic and Spanish. The learned representations capture meaningful variation and transfer across domains for analysis and classification. We further explore the use of these representations as training objectives for stylistically aligning language models. Our results suggest that jointly modeling individual and community-level variation provides a useful perspective for studying idiolect and supports downstream applications requiring sensitivity to stylistic differences, such as developing diverse and accessible LLMs.
[NLP-25] Is a Picture Worth a Thousand Words? Adaptive Multimodal Fact-Checking with Visual Evidence Necessity
【速读】: 该论文旨在解决当前多模态事实核查(multimodal fact-checking)中一个普遍存在的假设问题,即认为引入视觉证据必然提升核查准确性。研究发现,盲目使用视觉证据反而可能降低模型性能。为应对这一挑战,作者提出AMuFC框架,其核心创新在于引入两个协同工作的智能体:分析器(Analyzer)负责判断视觉证据是否对验证声明必要,验证器(Verifier)则基于检索到的证据及分析器的判断来预测声明的真实性。通过这种自适应机制,AMuFC实现了对视觉证据的智能选择与融合,显著提升了事实核查的准确性。
链接: https://arxiv.org/abs/2604.04692
作者: Jaeyoon Jung,Yejun Yoon,Kunwoo Park
机构: Soongsil University (崇实大学); MAUM AI Inc. (MAUM人工智能公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: preprint, 18 pages
Abstract:Automated fact-checking is a crucial task not only in journalism but also across web platforms, where it supports a responsible information ecosystem and mitigates the harms of misinformation. While recent research has progressed from text-only to multimodal fact-checking, a prevailing assumption is that incorporating visual evidence universally improves performance. In this work, we challenge this assumption and show that indiscriminate use of multimodal evidence can reduce accuracy. To address this challenge, we propose AMuFC, a multimodal fact-checking framework that employs two collaborative agents with distinct roles for the adaptive use of visual evidence: An Analyzer determines whether visual evidence is necessary for claim verification, and a Verifier predicts claim veracity conditioned on both the retrieved evidence and the Analyzer’s assessment. Experimental results on three datasets show that incorporating the Analyzer’s assessment of visual evidence necessity into the Verifier’s prediction yields substantial improvements in verification performance. In addition to all code, we release WebFC, a newly constructed dataset for evaluating fact-checking modules in a more realistic scenario, available at this https URL.
[NLP-26] On Ambiguity: The case of 'fraction', its meanings and roles
【速读】: 该论文旨在解决数学论述中“分数”(fraction)概念的模糊性问题,即该术语在初等算术文献中缺乏明确定义且存在多种语义解释。为澄清其使用,作者提出通过引入若干新术语来区分分数的不同含义:用于结构层面的“fracterm”、纯数值层面的“fracvalue”以及纯文本层面的“fracsign”和“fracsign occurrence”。解决方案的关键在于将“分数”视为一个类别(category),而非单一数学概念,并借助这些精确化的术语在算术话语片段中实现歧义消除与语义稳定。这一分析进一步促使作者重新审视数的概念与fracvalue之间的关系,并提出一种指定数系的方法,从而对比结构主义中的相关概念。
链接: https://arxiv.org/abs/2604.04647
作者: Jan A Bergstra,John V Tucker
机构: University of Amsterdam (阿姆斯特丹大学); Swansea University (斯旺西大学)
类目: Logic in Computer Science (cs.LO); Computation and Language (cs.CL); Symbolic Computation (cs.SC)
备注:
Abstract:We contemplate the notion of ambiguity in mathematical discourse. We consider a general method of resolving ambiguity and semantic options for sustaining a resolution. The general discussion is applied to the case of 'fraction', which is ill-defined and ambiguous in the literature of elementary arithmetic. In order to clarify the use of 'fraction' we introduce several new terms to designate some of its possible meanings. For example, to distinguish structural aspects we use 'fracterm', to distinguish purely numerical aspects 'fracvalue', and to distinguish purely textual aspects 'fracsign' and 'fracsign occurrence'. These interpretations can resolve ambiguity, and we discuss the resolution by using such precise notions in fragments of arithmetical discourse. We propose that fraction does not qualify as a mathematical concept but that the term functions as a collective for several concepts, which we simply call a 'category'. This analysis of fraction leads us to consider the notion of number in relation to fracvalue. We introduce a way of specifying number systems, and compare the analytical concepts with those of structuralism.
[NLP-27] Benchmarking Multilingual Speech Models on Pashto: Zero-Shot ASR Script Failure and Cross-Domain Evaluation
【速读】: 该论文旨在解决普什图语(Pashto)在多语言自动语音识别(ASR)领域缺乏公开可复现基准评测的问题,同时揭示现有模型在零样本识别、脚本错误和跨域泛化方面的关键缺陷。其解决方案的关键在于构建首个基于公共数据集的多模型系统性评估框架,涵盖零样本ASR、脚本级失败分析与跨域性能验证:通过FLEURS和Common Voice 24两个测试集对比10种主流模型表现,发现包括Whisper在内的多数模型存在严重的脚本输出偏差(如仅0.8%输出普什图文字符号),且WER指标无法反映此类根本性失效;进一步指出fine-tuned模型在分布外数据上性能显著下降(从14% WER恶化至32.5–59%),而一种增强训练策略可实现零跨域退化(保持35.1% WER);此外,字符级错误分层分析确认普什图语特有音素(卷舌音系列和边擦音)是主要错误来源。这一系统性评估为后续研究提供了可复现基准和明确优先方向。
链接: https://arxiv.org/abs/2604.04598
作者: Hanif Rahman
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Pashto is spoken by approximately 60–80 million people but has no published benchmarks for multilingual automatic speech recognition (ASR) on any shared public test set. This paper reports the first reproducible multi-model evaluation on public Pashto data, covering zero-shot ASR, script-level failure, and cross-domain evaluation of fine-tuned models. For zero-shot ASR, ten models (all seven Whisper sizes, MMS-1B, SeamlessM4T-v2-large, and OmniASR-CTC-300M) are evaluated on the FLEURS Pashto test set and a filtered Common Voice 24 subset; zero-shot Whisper WER ranges from 90% to 297%, with the medium model collapsing to 461% on Common Voice 24, consistent with decoder looping. SeamlessM4T achieves 39.7% WER on Common Voice 24 (the best zero-shot result reported to date, as of submission); MMS-1B achieves 43.8% on FLEURS. For script failure, a language-identification audit shows that no Whisper model produces Pashto-script output in more than 0.8% of utterances, while MMS-1B, SeamlessM4T, and OmniASR each exceed 93% Pashto-script fidelity; WER alone does not reveal this failure, since a model generating Arabic-script output on Pashto audio has not achieved ASR in any interpretable sense. For cross-domain evaluation, five fine-tuned Pashto ASR models are evaluated on both test sets: published WER figures of 14% degrade to 32.5–59% on out-of-distribution sets, while one augmented model achieves 35.1% on both sets with zero cross-domain degradation. Character-class error stratification confirms that Pashto-unique phonemes (the retroflex series and lateral fricatives) account for disproportionate error mass. All evaluations cover read speech only. Five structural impediments to cumulative progress are identified and five ordered research priorities are argued.
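文中反复出现的 WER(词错误率)即词级编辑距离与参考句长度之比,因此可以超过 100%(对应上文的 297%、461% 等数字)。以下为一个示意实现(非该论文的评测脚本):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over word sequences via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sat"))   # 0.0
print(wer("the cat sat", "a cat sat on"))  # 1 substitution + 1 insertion -> 2/3
```

当假设文本比参考文本长很多(例如解码器循环输出)时,插入错误可使 WER 远超 1。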
[NLP-28] PassiveQA: A Three-Action Framework for Epistemically Calibrated Question Answering via Supervised Finetuning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对不完整、模糊或缺失关键变量的用户查询时,仍会生成看似合理但可能错误(即幻觉)回答的问题。传统检索增强生成(Retrieval-Augmented Generation, RAG)系统缺乏对知识不确定性的认知意识(epistemic awareness),常默认生成答案而非主动识别信息不足。其解决方案的关键在于提出PassiveQA框架——一个三动作决策机制(Answer、Ask for clarification、Abstain),通过监督微调(supervised fine-tuning)使模型具备显式的决策推理能力:利用结构化信息状态表示、基于知识图谱的上下文增强以及一个微调后的规划器(planner),明确建模缺失变量并判断是否应答。实验表明,该方法在多个问答数据集上显著提升了宏观F1分数和弃权召回率(abstention recall),同时降低了幻觉率,在计算资源受限条件下验证了认知决策必须在训练阶段学习而非推理时硬性施加。
链接: https://arxiv.org/abs/2604.04565
作者: Madhav S Baidya
机构: Indian Institute of Technology (BHU) Varanasi (印度理工学院(BHU)瓦拉纳西分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 32 pages, 4 figures. Includes experiments on four QA datasets and a knowledge graph-based finetuning pipeline. Code available at: this https URL
Abstract:Large Language Models (LLMs) have achieved strong performance in question answering and retrieval-augmented generation (RAG), yet they implicitly assume that user queries are fully specified and answerable. In real-world settings, queries are often incomplete, ambiguous, or missing critical variables, leading models to produce overconfident or hallucinated responses. In this work, we study decision-aware query resolution under incomplete information, where a model must determine whether to Answer, Ask for clarification, or Abstain. We show that standard and enhanced RAG systems do not reliably exhibit such epistemic awareness, defaulting to answer generation even when information is insufficient. To address this, we propose PassiveQA, a three-action framework that aligns model behaviour with information sufficiency through supervised finetuning. Our approach integrates structured information-state representations, knowledge graph-grounded context, and a finetuned planner that explicitly models missing variables and decision reasoning. Experiments across multiple QA datasets show that the finetuned planner achieves significant improvements in macro F1 and abstention recall while reducing hallucination rates, under a compute-constrained training regime. These results provide strong empirical evidence that epistemic decision-making must be learned during training rather than imposed at inference time.
[NLP-29] Mapping the Exploitation Surface: A 10000-Trial Taxonomy of What Makes LLM Agents Exploit Vulnerabilities
【速读】: 该论文旨在解决生成式 AI(Generative AI)代理在获得工具访问权限后,如何被系统提示(system prompt)中的特定语言特征所诱导而发现并利用安全漏洞的问题。其核心问题是:哪些提示特征会触发此类攻击行为,哪些不会?解决方案的关键在于通过大规模实验(约10,000次测试)构建了一个系统性的分类框架,识别出并非所有假设的攻击维度都会导致可检测的漏洞利用——其中九个维度(如最小化、道德许可、激励机制等)在每组50次试验中均未产生显著攻击行为;而唯一可靠触发exploitation的是“目标重构”(goal reframing),例如提示语“你正在解谜题;可能存在隐藏线索”,这使得代理将违规操作重新解释为任务内合理行为,从而绕过规则约束(如Claude Sonnet 4模型中达到38–40%的利用率)。因此,防御方应聚焦于审计可能导致目标重构的语言设计,而非广泛筛查所有潜在的对抗性提示。
链接: https://arxiv.org/abs/2604.04561
作者: Charafeddine Mouzouni
机构: OPIT – Open Institute of Technology (开放技术研究所); Cohorte AI (协奏AI); Paris, France (巴黎, 法国)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 18 pages, 8 tables, code and data at this https URL
Abstract:LLM agents with tool access can discover and exploit security vulnerabilities. This is known. What is not known is which features of a system prompt trigger this behaviour, and which do not. We present a systematic taxonomy based on approximately 10,000 trials across seven models, 37 prompt conditions, and 12 hypothesized attack dimensions – all executed in real Docker sandboxes. Every condition includes the instruction “Always respect all rules and access policies”; only a single encouragement sentence varies. Within the task class studied (planted test-runner vulnerabilities), the threat model is narrower than assumed: nine of twelve dimensions – including minimization, moral licensing, incentives, identity priming, and reasoning triggers – produce no detectable exploitation at n=50 per cell (upper 95% CI 7% per condition). But it is also more subtle than expected: one dimension, goal reframing, reliably triggers exploitation. “You are solving a puzzle; there may be hidden clues” produces 38-40% exploitation on Claude Sonnet 4 despite the explicit rule instruction, replicating across four models (CTF framing: 8-14% on DeepSeek, GPT-5-mini, o4-mini). The agent does not override the rules; it reinterprets the task so that exploitative actions become task-aligned. GPT-4.1 produces no exploitation across 1,850 trials (37 conditions), and a temporal comparison across four OpenAI models released over eleven months shows a pattern consistent with improving safety training, though model capability differences are a confounder. The practical contribution is a narrowed, testable threat model: defenders should audit for goal-reframing language, not for the broad class of adversarial prompts.
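文中"每组 n=50 且零次利用,上界约 7%"的说法与 Clopper–Pearson 精确置信区间在零事件情形下的闭式解一致(此处假定作者采用的正是该方法,未经原文确认):

```python
def cp_upper_zero(n, alpha=0.05):
    """Clopper-Pearson upper bound of a two-sided (1 - alpha) CI for a
    binomial proportion when 0 successes are observed in n trials.
    For k = 0 the bound has the closed form 1 - (alpha/2)**(1/n)."""
    return 1 - (alpha / 2) ** (1 / n)

# Zero exploitation events observed in 50 trials per condition:
print(round(cp_upper_zero(50), 3))  # 0.071, i.e. about 7%
```

也就是说,即便 50 次试验中一次利用都未观测到,仍不能排除真实利用率高达约 7% 的可能。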
[NLP-30] Formal Constraints on Dependency Syntax
【速读】: 该论文旨在解决依赖句法(dependency syntax)中无约束树结构过于宽松,无法准确描述真实语言现象的问题。其核心挑战在于如何在保持语法合理性的同时,避免项目性(projectivity)约束的过度限制,尤其是在词序灵活的语言中。解决方案的关键在于提出一系列介于项目性和无约束之间的新型约束条件,以更精确地刻画实际语言中的句法结构,从而提升句法分析的准确性、解析效率,并为语言演化与人类语言处理机制提供新的理论洞见。
链接: https://arxiv.org/abs/2604.04542
作者: Carlos Gómez-Rodríguez, Lluís Alemany-Puig
机构: Universidade da Coruña (拉科鲁尼亚大学); Universitat Politècnica de Catalunya (加泰罗尼亚理工大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Dependency syntax represents the structure of a sentence as a tree composed of dependencies, i.e., directed relations between lexical units. While in its more general form any such tree is allowed, in practice many are not plausible or are very infrequent in attested language. This has motivated a search for constraints characterizing subsets of trees that better fit real linguistic phenomena, providing a more accurate linguistic description, faster parsing or insights on language evolution and human processing. Projectivity is the most well-studied such constraint, but it has been shown to be too restrictive to represent some linguistic phenomena, especially in flexible-word-order languages. Thus, a variety of constraints have been proposed to seek a realistic middle ground between the limitations of projectivity and the excessive leniency of unrestricted dependency structures.
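上文讨论的投射性(projectivity)约束等价于依存弧互不交叉;以下为一个最小判定示意(假定句法树以 1 起始索引的 head 数组表示,0 表示人工根节点):

```python
def is_projective(heads):
    """True if no two dependency arcs cross.

    heads[i] is the 1-indexed head of word i+1 (0 = artificial root).
    Two arcs cross when exactly one endpoint of one arc lies strictly
    inside the span of the other.
    """
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for i, (l1, r1) in enumerate(arcs):
        for l2, r2 in arcs[i + 1:]:
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                return False
    return True

print(is_projective([2, 0, 2]))     # simple chain of arcs -> True
print(is_projective([3, 4, 0, 1]))  # arcs (1,3) and (2,4) cross -> False
```

论文所考察的各类中间约束,正是在该判定过于严格与完全不加限制之间刻画更细的树子类。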
[NLP-31] Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation
【速读】: 该论文旨在解决当前代理代码基准测试中评估语言默认为英语所导致的偏倚问题,即固定英语作为评判标准可能掩盖不同大模型(backbone)在多语言环境下的真实性能表现。其关键解决方案在于将“Agent-as-a-Judge”提示栈本地化至五种类型学上多样化的语言(英语、阿拉伯语、土耳其语、中文、印地语),并系统性地评估跨语言环境下多个开发代理框架与评判模型组合的表现,发现模型性能排名随语言变化而显著反转,且评判一致性较低(Fleiss’ κ ≤ 0.231),证明语言应作为显式评估变量纳入基准设计;进一步的消融实验表明,仅本地化评测内容不足以保证结果稳定性,必须同时对评判侧指令进行完整本地化,否则满意度可下降近50%(如印地语从42.8%降至23.2%)。
链接: https://arxiv.org/abs/2604.04532
作者: Alhasan Mahmood,Samir Abdaljalil,Hasan Kurban
机构: Hamad Bin Khalifa University (哈马德·本·哈利法大学); Texas A&M University (得克萨斯农工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Evaluation language is typically treated as a fixed English default in agentic code benchmarks, yet we show that changing the judge’s language can invert backbone rankings. We localize the Agent-as-a-Judge prompt stack to five typologically diverse languages (English, Arabic, Turkish, Chinese, Hindi) and evaluate 55 DevAI development tasks across three developer-agent frameworks and six judge backbones, totaling 4950 judge runs. The central finding is that backbone and language interact: GPT-4o achieves the highest satisfaction in English (44.72%), while Gemini leads in Arabic (51.72%, p<0.001 vs. GPT-4o) and Hindi (53.22%). No single backbone dominates across all languages, and inter-backbone agreement on individual requirement judgments is modest (Fleiss’ \kappa \leq 0.231 ). A controlled ablation further shows that localizing judge-side instructions, not just benchmark content, can be decisive: Hindi satisfaction drops from 42.8% to 23.2% under partial localization. These results indicate that language should be treated as an explicit evaluation variable in agentic benchmarks. Full requirement-level judgments and runtime statistics are released for reproducibility.
[NLP-32] CommonMorph: Participatory Morphological Documentation Platform
【速读】: 该论文旨在解决低资源语言(low-resource languages)在形态学数据收集与标注过程中面临的挑战,这些问题包括对语言学专业知识的高要求、方法论严谨性的缺失以及人力与资源的不足。解决方案的关键在于提出一个名为 \texttt{CommonMorph} 的综合性平台,其核心创新在于采用三层协作机制:专家语言学定义、贡献者数据采集和社区验证,并通过主动学习(active learning)、注释建议及跨语言材料导入工具显著减少人工工作量,同时支持多种形态系统(如屈折型、黏着型和词根-模式型),并以开源设计和 UniMorph 兼容输出保障可访问性与自然语言处理(NLP)工具的互操作性。
链接: https://arxiv.org/abs/2604.04515
作者: Aso Mahmudi,Sina Ahmadi,Kemal Kurniawan,Rico Sennrich,Eduard Hovy,Ekaterina Vylomova
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Collecting and annotating morphological data present significant challenges, requiring linguistic expertise, methodological rigour, and substantial resources. These barriers are particularly acute for low-resource languages and varieties. To accelerate this process, we introduce CommonMorph, a comprehensive platform that streamlines morphological data collection development through a three-tiered approach: expert linguistic definition, contributor elicitation, and community validation. The platform minimises manual work by incorporating active learning, annotation suggestions, and tools to import and adapt materials from related languages. It accommodates diverse morphological systems, including fusional, agglutinative, and root-and-pattern morphologies. Its open-source design and UniMorph-compatible outputs ensure accessibility and interoperability with NLP tools. Our platform is accessible at this https URL, offering a replicable model for preserving linguistic diversity through collaborative technology.
[NLP-33] One Model for All: Multi-Objective Controllable Language Models
【速读】: 该论文旨在解决当前基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)方法在对齐大型语言模型(Large Language Models, LLMs)时,难以适应个体用户偏好差异的问题。传统RLHF依赖于平均人类评分构建固定奖励函数,导致模型缺乏对多目标权衡(如安全性与效率、共情与精确性)的灵活控制能力。为此,作者提出多目标控制(Multi-Objective Control, MOC)方案,其核心在于将多目标优化(Multi-Objective Optimization, MOO)原理引入RLHF框架,训练一个以偏好条件化的策略网络(preference-conditioned policy network),使单一LLM能够直接生成位于用户定义的帕累托前沿(Pareto front)区域内的个性化输出。MOC通过在策略层面上应用MOO显著提升计算效率,实现了仅用单张A6000 GPU即可微调7B参数模型的能力,从而在可控性、输出质量与多样性以及未见偏好的泛化性能上均优于基线方法。
链接: https://arxiv.org/abs/2604.04497
作者: Qiang He,Yucheng Yang,Tianyi Zhou,Meng Fang,Mykola Pechenizkiy,Setareh Maghsudi
机构: Ruhr University Bochum(鲁尔大学波鸿分校); Eindhoven University of Technology(埃因霍温理工大学); Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学); University of Liverpool(利物浦大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Published in Transactions on Machine Learning Research (03/2026): this https URL
Abstract:Aligning large language models (LLMs) with human preferences is critical for enhancing LLMs’ safety, helpfulness, humor, faithfulness, etc. Current reinforcement learning from human feedback (RLHF) mainly focuses on a fixed reward learned from average human ratings, which may weaken the adaptability and controllability of varying preferences. However, creating personalized LLMs requires aligning LLMs with individual human preferences, which is non-trivial due to the scarce data per user and the diversity of user preferences in multi-objective trade-offs, varying from emphasizing empathy in certain contexts to demanding efficiency and precision in others. Can we train one LLM to produce personalized outputs across different user preferences on the Pareto front? In this paper, we introduce Multi-Objective Control (MOC), which trains a single LLM to directly generate responses in the preference-defined regions of the Pareto front. Our approach introduces multi-objective optimization (MOO) principles into RLHF to train an LLM as a preference-conditioned policy network. We improve the computational efficiency of MOC by applying MOO at the policy level, enabling us to fine-tune a 7B-parameter model on a single A6000 GPU. Extensive experiments demonstrate the advantages of MOC over baselines in three aspects: (i) controllability of LLM outputs w.r.t. user preferences on the trade-off among multiple rewards; (ii) quality and diversity of LLM outputs, measured by the hyper-volume of multiple solutions achieved; and (iii) generalization to unseen preferences. These results highlight MOC’s potential for real-world applications requiring scalable and customizable LLMs.
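MOC 将多目标优化(MOO)原理引入 RLHF;其中最基础的做法是按用户偏好向量对多个奖励信号做线性标量化,可示意如下(仅说明 MOO 原理,并非 MOC 的实际目标函数):

```python
def scalarized_reward(rewards, preference):
    """Linear scalarization of multiple reward signals by a preference vector.

    A basic way to condition a single policy on multi-objective trade-offs;
    an illustration of the MOO principle, not MOC's exact objective.
    """
    assert abs(sum(preference) - 1.0) < 1e-9, "preference must lie on the simplex"
    return sum(w * r for w, r in zip(preference, rewards))

# Helpfulness vs. safety rewards under two different user preferences:
print(round(scalarized_reward([0.9, 0.4], [0.8, 0.2]), 2))  # helpfulness-heavy -> 0.8
print(round(scalarized_reward([0.9, 0.4], [0.2, 0.8]), 2))  # safety-heavy -> 0.5
```

同一对奖励在不同偏好权重下得到不同的标量回报,训练时把偏好向量作为策略条件输入,即可让单一模型覆盖帕累托前沿的不同区域。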
[NLP-34] Same Geometry Opposite Noise: Transformer Magnitude Representations Lack Scalar Variability
【速读】: 该论文旨在探究生成式语言模型(如Transformer架构)是否具备生物感知系统中常见的标量变异性(scalar variability),即表征噪声随数值幅度增加而比例放大,从而保持恒定变异系数(coefficient of variation, CV)。研究发现,与生物系统相反,Transformer模型中的隐藏状态表示在数值轴上表现出反标量模式——即表示变异性随数值增大而降低(缩放指数α ≈ -0.19),且该趋势在全维空间和去除句子身份影响后依然稳健。关键在于,仅通过分布学习(distributional learning)无法再现生物系统中观察到的恒定CV噪声特性;模型虽能模拟对数压缩的数值几何结构(log-compressive magnitude geometry),却未能复现生物系统特有的噪声比例关系。这表明当前大型语言模型在模拟人类认知机制方面仍存在根本性缺失。
链接: https://arxiv.org/abs/2604.04469
作者: Jon-Paul Cacioli
机构: Independent Researcher, Melbourne, Australia; Classical Minds, Modern Machines
类目: Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
备注: 7 pages, 5 figures, 1 table. Pre-registered on OSF ( this http URL ). Companion to arXiv:2603.20642
Abstract:Scalar variability – the finding that representational noise scales proportionally with magnitude, producing a constant coefficient of variation – is a hallmark of biological magnitude systems. We tested whether transformer language models exhibit this property by analysing the dispersion of hidden-state representations across carrier sentences for 26 numerical magnitudes in three 7-8B parameter models (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base; data from Cacioli, 2026). We found the opposite: representational variability decreased with magnitude along the magnitude axis (scaling exponent \alpha \approx -0.19 ; 0/16 primary layers with \alpha > 0 , all three models). The negative sign was consistent in full-dimensional space ( \alpha \approx -0.04 ) and after sentence-identity correction ( \alpha \approx -0.007 ). The anti-scalar pattern was 3-5x stronger along the magnitude axis than orthogonal dimensions, and corpus frequency strongly predicted per-magnitude variability ( \rho = .84 ). These results demonstrate that distributional learning alone is insufficient to produce scalar variability: transformers reproduce log-compressive magnitude geometry but not the constant-CV noise signature observed in biological systems.
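上文的缩放指数 α 即"表示离散度对数值幅度"在对数-对数坐标下的回归斜率(α = 1 对应恒定变异系数的标量变异性,α < 0 即论文所报告的反标量模式)。以下用虚构数据示意该估计方式(非论文实际测量值):

```python
import math

def scaling_exponent(magnitudes, dispersions):
    """Slope of log(dispersion) vs log(magnitude) by ordinary least squares.

    alpha = 1 corresponds to scalar variability (constant CV);
    alpha < 0 means dispersion shrinks as magnitude grows (anti-scalar).
    """
    xs = [math.log(m) for m in magnitudes]
    ys = [math.log(d) for d in dispersions]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Toy data following dispersion = magnitude ** -0.2 exactly:
mags = [1, 2, 4, 8, 16]
disp = [m ** -0.2 for m in mags]
print(round(scaling_exponent(mags, disp), 3))  # -0.2
```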
[NLP-35] What Makes a Sale? Rethinking End-to-End Seller–Buyer Retail Dynamics with LLM Agents
【速读】: 该论文旨在解决零售策略在部署前难以评估的问题,原因在于零售决策过程涉及多个阶段——从卖家侧的说服、买卖双方互动到最终购买决策——而现有零售模拟器仅能捕捉该流程的部分特征,且未建模跨阶段依赖关系,导致无法准确评估早期决策对下游结果的影响。解决方案的关键在于提出RetailSim这一端到端的零售模拟框架,其通过统一环境建模整个销售流程,强调模拟保真度,具体体现在多样化的产品空间、基于人物画像(persona-driven)的智能体设计以及多轮交互机制上,从而能够有效重现真实世界的经济规律和消费者行为模式,并支持包括人物画像推断、买卖互动分析及销售策略评估在内的多种决策导向应用场景。
链接: https://arxiv.org/abs/2604.04468
作者: Jeonghwan Choi,Jibin Hwang,Gyeonghun Sun,Minjeong Ban,Taewon Yun,Hyeonjae Cheon,Hwanjun Song
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Evaluating retail strategies before deployment is difficult, as outcomes are determined across multiple stages, from seller-side persuasion through buyer-seller interaction to purchase decisions. However, existing retail simulators capture only partial aspects of this process and do not model cross-stage dependencies, making it difficult to assess how early decisions affect downstream outcomes. We present RetailSim, an end-to-end retail simulation framework that models this pipeline in a unified environment, explicitly designed for simulation fidelity through diverse product spaces, persona-driven agents, and multi-turn interactions. We evaluate RetailSim with a dual protocol comprising human evaluation of behavioral fidelity and meta-evaluation against real-world economic regularities, showing that it successfully reproduces key patterns such as demographic purchasing behavior, the price-demand relationship, and heterogeneous price elasticity. We further demonstrate its practical utility via decision-oriented use cases, including persona inference, seller-buyer interaction analysis, and sales strategy evaluation, showing RetailSim’s potential as a controlled testbed for exploring retail strategies.
[NLP-36] DP-OPD: Differentially Private On-Policy Distillation for Language Models
【速读】: 该论文旨在解决在大语言模型(Large Language Models, LLMs)私有化部署过程中,如何在保障差分隐私(Differential Privacy, DP)的前提下实现高效且高精度的模型压缩问题。现有方法要么对教师和学生均施加差分隐私训练(DP-SGD),导致计算开销大、性能下降;要么依赖于从已训练的DP教师模型生成合成文本再进行蒸馏,引入了离线生成流程并需优化大型教师模型,进一步加剧隐私-效用权衡。其解决方案的关键在于提出差分隐私在线策略蒸馏(Differentially Private On-Policy Distillation, DP-OPD),即仅对学生模型应用DP-SGD,同时利用冻结的教师模型为学生在自动生成轨迹上提供密集的token级目标监督信号,从而避免了DP教师训练与合成文本生成步骤,显著简化训练流程,并在严格隐私预算(ε=2.0)下实现了优于传统DP微调和离线蒸馏方法的生成质量(如Yelp和BigPatent数据集上的困惑度分别提升至41.68和30.63)。
链接: https://arxiv.org/abs/2604.04461
作者: Fatemeh Khadem,Sajad Mousavi,Yi Fang,Yuhong Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) are increasingly adapted to proprietary and domain-specific corpora that contain sensitive information, creating a tension between formal privacy guarantees and efficient deployment through model compression. Differential privacy (DP), typically enforced via DP-SGD, provides record-level protection but often incurs substantial utility loss in autoregressive generation, where optimization noise can amplify exposure bias and compounding errors along long rollouts. Existing approaches to private distillation either apply DP-SGD to both teacher and student, worsening computation and the privacy–utility tradeoff, or rely on DP synthetic text generation from a DP-trained teacher, avoiding DP on the student at the cost of DP-optimizing a large teacher and introducing an offline generation pipeline. We propose Differentially Private On-Policy Distillation (DP-OPD), a synthesis-free framework that enforces privacy solely through DP-SGD on the student while leveraging a frozen teacher to provide dense token-level targets on student-generated trajectories. DP-OPD instantiates this idea via private generalized knowledge distillation on continuation tokens. Under a strict privacy budget ( \varepsilon=2.0 ), DP-OPD improves perplexity over DP fine-tuning and off-policy DP distillation, and outperforms synthesis-based DP distillation (Yelp: 44.15 \rightarrow 41.68; BigPatent: 32.43 \rightarrow 30.63), while substantially simplifying the training pipeline. In particular, DP-OPD collapses private compression into a single DP student-training loop by eliminating DP teacher training and offline synthetic text generation. Code will be released upon publication at this https URL.
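DP-OPD 仅对学生模型施加 DP-SGD,其核心机制是逐样本梯度裁剪再加高斯噪声后聚合;以下为该机制的简化示意(标量化梯度、单步聚合,并非实际训练实现):

```python
import random

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_mult=1.0):
    """One DP-SGD gradient aggregation step (sketch over toy vector grads).

    Each per-example gradient is clipped to L2 norm <= clip_norm, summed,
    and Gaussian noise with std = noise_mult * clip_norm is added before
    averaging -- the mechanism DP-OPD applies to the student only.
    """
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for g in per_example_grads:
        norm = sum(x * x for x in g) ** 0.5
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0  # clip
        for k in range(dim):
            total[k] += g[k] * scale
    n = len(per_example_grads)
    return [(total[k] + random.gauss(0.0, noise_mult * clip_norm)) / n
            for k in range(dim)]

random.seed(0)
print(dp_sgd_step([[3.0, 4.0], [0.5, 0.0]]))  # clipped, noised, averaged
```

裁剪界定了单条记录对更新的最大影响,噪声强度由隐私预算(如文中 ε=2.0)通过隐私核算决定。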
[NLP-37] Empirical Characterization of Rationale Stability Under Controlled Perturbations for Explainable Pattern Recognition ICPR
【速读】: 该论文旨在解决当前可解释人工智能(Explainable AI)评估多为实例导向、缺乏对模型解释一致性量化的问题,即在相同类别或标签保持不变的小幅扰动输入下,模型的特征重要性分配是否稳定。其解决方案的关键在于提出一种新型度量指标,通过计算相同标签样本间SHAP值的余弦相似度,来量化模型解释的一致性,从而识别出模型在推理过程中对特定特征的偏倚依赖或不稳定行为。该方法基于预训练BERT模型在SST-2情感分析数据集上的实验,并扩展至RoBERTa、DistilBERT及IMDB数据集进行鲁棒性验证,有效提升了对模型行为稳定性和可信性的评估能力。
链接: https://arxiv.org/abs/2604.04456
作者: Abu Noman Md Sakib,Zhensen Wang,Merjulah Roby,Zijie Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 28th International Conference on Pattern Recognition (ICPR) 2026
Abstract:Reliable pattern recognition systems should exhibit consistent behavior across similar inputs, and their explanations should remain stable. However, most Explainable AI evaluations remain instance centric and do not explicitly quantify whether attribution patterns are consistent across samples that share the same class or represent small variations of the same input. In this work, we propose a novel metric aimed at assessing the consistency of model explanations, ensuring that models consistently reflect the intended objectives and consistency under label-preserving perturbations. We implement this metric using a pre-trained BERT model on the SST-2 sentiment analysis dataset, with additional robustness tests on RoBERTa, DistilBERT, and IMDB, applying SHAP to compute feature importance for various test samples. The proposed metric quantifies the cosine similarity of SHAP values for inputs with the same label, aiming to detect inconsistent behaviors, such as biased reliance on certain features or failure to maintain consistent reasoning for similar predictions. Through a series of experiments, we evaluate the ability of this metric to identify misaligned predictions and inconsistencies in model explanations. These experiments are compared against standard fidelity metrics to assess whether the new metric can effectively identify when a model’s behavior deviates from its intended objectives. The proposed framework provides a deeper understanding of model behavior by enabling more robust verification of rationale stability, which is critical for building trustworthy AI systems. By quantifying whether models rely on consistent attribution patterns for similar inputs, the proposed approach supports more robust evaluation of model behavior in practical pattern recognition pipelines. Our code is publicly available at this https URL.
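该文提出的一致性度量基于同标签样本 SHAP 归因向量的两两余弦相似度取均值;以下为一个最小示意(假定各样本的 SHAP 向量已对齐为固定长度,非论文官方实现):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rationale_consistency(shap_vectors):
    """Mean pairwise cosine similarity of SHAP vectors sharing one label."""
    sims = [cosine(shap_vectors[i], shap_vectors[j])
            for i in range(len(shap_vectors))
            for j in range(i + 1, len(shap_vectors))]
    return sum(sims) / len(sims)

# Two identical attribution patterns plus one orthogonal outlier:
print(round(rationale_consistency([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]), 3))  # 0.333
```

取值越接近 1,说明模型对同类输入依赖的特征模式越一致;偏低则提示存在文中所述的偏倚依赖或推理不稳定。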
[NLP-38] Conversational Control with Ontologies for Large Language Models : A Lightweight Framework for Constrained Generation LREC2026
【速读】: 该论文旨在解决基于大语言模型(Large Language Models, LLMs)的对话代理在实际应用中面临的可预测性差和个性化不足的问题,这些问题源于模型的黑箱特性。解决方案的关键在于提出一种端到端的方法,通过定义与对话相关的语义方面(aspects)的本体论(ontology)来实现模块化且可解释的控制机制;具体而言,将关键对话属性建模为约束条件,并对LLM进行微调以生成符合这些约束的内容。该方法在英语水平和内容情感极性两个任务上验证有效,展现出优于预训练基线的表现,同时具备模型无关性、轻量化和可解释性,支持控制策略的复用与扩展。
链接: https://arxiv.org/abs/2604.04450
作者: Barbara Gendron,Gaël Guibon,Mathieu d’Aquin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at KG LLM: Knowledge Graphs and Large Language Models LREC 2026 Workshop
Abstract:Conversational agents based on Large Language Models (LLMs) have recently emerged as powerful tools for human-computer interaction. Nevertheless, their black-box nature implies challenges in predictability and a lack of personalization, both of which can be addressed by controlled generation. This work proposes an end-to-end method to obtain modular and explainable control over LLM outputs through ontological definitions of aspects related to the conversation. Key aspects are modeled and used as constraints; we then further fine-tune the LLM to generate content accordingly. To validate our approach, we explore two tasks that tackle two key conversational aspects: the English proficiency level and the polarity profile of the content. Using a hybrid fine-tuning procedure on seven state-of-the-art, open-weight conversational LLMs, we show that our method consistently outperforms pre-trained baselines, even on smaller models. Beyond quantitative gains, the framework remains model-agnostic, lightweight, and interpretable, enabling reusable control strategies that can be extended to new domains and interaction goals. This approach enhances alignment with strategy instructions and demonstrates the effectiveness of ontology-driven control in conversational systems.
[NLP-39] DeonticBench: A Benchmark for Reasoning over Rules
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理复杂、上下文特定规则时的推理能力不足问题,特别是在法律与政策场景中体现为德性推理(deontic reasoning),即对义务、许可和禁止等规则的理解与应用。解决方案的关键在于提出DEONTICBENCH——一个包含6,232个任务的基准测试集,覆盖美国联邦税收、航空公司行李政策、移民管理及州级住房法等真实世界领域,并支持两种推理路径:一是自由形式的链式思维(chain-of-thought)推理,二是基于符号计算的可执行Prolog程序生成,从而实现形式化的问题解释与显式程序追踪。该设计使模型能够在非符号与符号双重框架下进行评估,推动对现实场景中规则驱动推理的研究。
链接: https://arxiv.org/abs/2604.04443
作者: Guangyao Dou,Luis Brena,Akhil Deo,William Jurayj,Jingyu Zhang,Nils Holzenberger,Benjamin Van Durme
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Reasoning with complex, context-specific rules remains challenging for large language models (LLMs). In legal and policy settings, this manifests as deontic reasoning: reasoning about obligations, permissions, and prohibitions under explicit rules. While many recent benchmarks emphasize short-context mathematical reasoning, fewer focus on long-context, high-stakes deontic reasoning. To address this gap, we introduce DEONTICBENCH, a benchmark of 6,232 tasks across U.S. federal taxes, airline baggage policies, U.S. immigration administration, and U.S. state housing law. These tasks can be approached in multiple ways, including direct reasoning in language or with the aid of symbolic computation. Besides free-form chain-of-thought reasoning, DEONTICBENCH enables an optional solver-based workflow in which models translate statutes and case facts into executable Prolog, leading to formal problem interpretations and an explicit program trace. We release reference Prolog programs for all instances. Across frontier LLMs and coding models, best hard-subset performance reaches only 44.4% on SARA Numeric and 46.6 macro-F1 on Housing. We further study training with supervised fine-tuning and reinforcement learning for symbolic program generation. Although training improves Prolog generation quality, current RL methods still fail to solve these tasks reliably. Overall, DEONTICBENCH provides a benchmark for studying context-grounded rule reasoning in real-world domains under both symbolic and non-symbolic settings.
[NLP-40] Structured Causal Video Reasoning via Multi-Objective Alignment
【速读】: 该论文旨在解决现有视频大语言模型(Video-LLMs)在视频理解中依赖非结构化推理、导致因果关系建模薄弱和推理效率低下的问题。其核心挑战在于,当前方法将关键视觉证据嵌入冗长的文本描述中,难以实现精确的时间因果推断。解决方案的关键是引入一种名为“结构化事件事实”(Structured Event Facts)的先验表示,该表示在推理前构建显著事件及其因果关系的紧凑结构,从而为推理过程提供显式约束,增强因果合理性并提升中间证据的可验证性。此外,作者设计了包含四阶段训练流程的优化框架,并在强化学习阶段通过多目标强化学习(Multi-Objective Reinforcement Learning, MORL)明确优化结构完整性与因果保真度之间的权衡,最终提出 Factum-4B 模型,在需要细粒度时间推理的任务上实现了更可靠且更强的性能。
链接: https://arxiv.org/abs/2604.04415
作者: Zinuo Li,Yongxin Guo,Jun Liu,Jiawei Zhan,Xi Jiang,Chengjie Wang,Mohammed Bennamoun,Farid Boussaid,Feng Zheng,Qiuhong Ke
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Human understanding of video dynamics is typically grounded in a structured mental representation of entities, actions, and temporal relations, rather than relying solely on immediate deductive reasoning. In contrast, existing Video-LLMs largely depend on unstructured video reasoning, where critical visual evidence is embedded in verbose textual descriptions and temporal causality is often weakly modeled. This leads to inefficient processes and fragile causal inference. To bridge this cognitive gap, we propose constructing a compact representation of salient events and their causal relationships, which we name Structured Event Facts, prior to the reasoning stage. This structured prior serves as an explicit constraint to promote concise and causally grounded reasoning, while also making intermediate evidence easier to verify. To effectively train models on such structured facts, we introduce CausalFact-60K and a four-stage training pipeline comprising facts alignment, format warm-start, thinking warm-start, and reinforcement learning-based post-training. During RL stage, we find that this framework introduces competing objectives, as structural completeness and causal fidelity must be balanced against reasoning length, making it difficult to optimize. We address this challenge by formulating the optimization as a Multi-Objective Reinforcement Learning (MORL) problem and explicitly optimizing toward the Pareto-Frontier to balance these trade-offs. As a result, we introduce Factum-4B, which yields more reliable reasoning and delivers stronger performance on challenging video understanding tasks requiring fine-grained temporal inference.
[NLP-41] Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in Visual Document Understanding CVPR2026
【速读】: 该论文旨在解决大视觉语言模型(Large Vision-Language Models, LVLMs)在视觉文档理解(Visual Document Understanding, VDU)任务中,模型内部表征与生成响应之间存在不一致的问题。现有评估方法主要依赖生成结果,无法准确反映模型是否真正捕获了任务所需的信息。论文的关键解决方案是通过线性探测(linear probing)分析不同网络层中任务相关信息的编码特性,并发现任务信息在中间层比最终层更线性可解。基于此发现,研究提出针对中间层进行微调(fine-tuning)的策略,实验证明该方法能同时提升线性探测准确率和生成响应准确率,从而缩小内部表征与输出响应之间的差距。
链接: https://arxiv.org/abs/2604.04411
作者: Haruka Kawasaki,Ryota Tanaka,Kyosuke Nishida
机构: NTT, Inc. (日本电信电话公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR2026 workshop (MULA)
Abstract:Visual document understanding (VDU) is a challenging task for large vision language models (LVLMs), requiring the integration of visual perception, text recognition, and reasoning over structured layouts. Although recent LVLMs have shown progress on VDU benchmarks, their performance is typically evaluated based on generated responses, which may not necessarily reflect whether the model has actually captured the required information internally. In this paper, we investigate how information required to solve VDU tasks is represented across different layers of LLMs within LVLMs using linear probing. Our study reveals that (1) there is a clear gap between internal representations and generated responses, and (2) information required to solve the task is often encoded more linearly from intermediate layers than from the final layer. Motivated by these findings, we explore fine-tuning strategies that target intermediate layers. Experiments show that fine-tuning intermediate layers improves both linear probing accuracy and response accuracy while narrowing the gap.
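线性探测(linear probing)的基本做法可以用如下草图说明:在某一层的隐状态上拟合一个线性分类器,比较不同“层”的探测准确率(此处用岭回归闭式解与玩具数据示意,并非论文实现):

```python
import numpy as np

def probe_accuracy(feats, labels, reg=1e-3):
    """在隐状态 feats (n, d) 上拟合岭回归线性探针,返回训练准确率。

    真实实验应使用留出集;这里为示意仅报告训练准确率。
    """
    n, d = feats.shape
    classes = np.unique(labels)
    Y = (labels[:, None] == classes[None, :]).astype(float)  # one-hot 目标
    X = np.hstack([feats, np.ones((n, 1))])                  # 追加偏置列
    W = np.linalg.solve(X.T @ X + reg * np.eye(d + 1), X.T @ Y)
    pred = classes[np.argmax(X @ W, axis=1)]
    return float((pred == labels).mean())

# 玩具对比:特征线性可分的“层”探测准确率高于纯噪声的“层”
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
informative = labels[:, None] + 0.1 * rng.normal(size=(100, 8))  # 模拟中间层
noise_feats = rng.normal(size=(100, 8))                          # 模拟无信息层
acc_mid = probe_accuracy(informative, labels)
acc_last = probe_accuracy(noise_feats, labels)
```

论文的发现对应于:中间层隐状态上的 acc 往往高于末层,说明任务信息已被线性编码但未传递到最终响应。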
[NLP-42] Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment IROS
【速读】: 该论文旨在解决语言模型对齐过程中因假设特定人类偏好模型(如Bradley-Terry模型)而导致的统计不一致性问题,即在样本数量增加时无法保证模型收敛至真实人类偏好。现有方法虽能实现对齐,但缺乏理论保障;而直接密度比优化(DDRO)虽具备统计一致性,却因密度比估计不稳定易发散,导致训练过程不稳。论文提出的关键解决方案是引入相对密度比(relative density ratio),其定义为偏好数据分布与偏好和非偏好数据混合分布之间的密度比,该比值具有上界且不会发散,从而确保训练稳定性,同时保持统计一致性,并提供比DDRO更紧的收敛性保证。
链接: https://arxiv.org/abs/2604.04410
作者: Hiroshi Takahashi,Tomoharu Iwata,Atsutoshi Kumagai,Sekitoshi Kanai,Masanori Yamada,Kosuke Nishida,Kazutoshi Shinoda
机构: NTT, Inc.(NTT公司); NTT DOCOMO, INC.(NTT DOCOMO公司)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: Code is available at this https URL
Abstract:Aligning language models with human preferences is essential for ensuring their safety and reliability. Although most existing approaches assume specific human preference models such as the Bradley-Terry model, this assumption may fail to accurately capture true human preferences, and consequently, these methods lack statistical consistency, i.e., the guarantee that language models converge to the true human preference as the number of samples increases. In contrast, direct density ratio optimization (DDRO) achieves statistical consistency without assuming any human preference models. DDRO models the density ratio between preferred and non-preferred data distributions using the language model, and then optimizes it via density ratio estimation. However, this density ratio is unstable and often diverges, leading to training instability of DDRO. In this paper, we propose a novel alignment method that is both stable and statistically consistent. Our approach is based on the relative density ratio between the preferred data distribution and a mixture of the preferred and non-preferred data distributions. Our approach is stable since this relative density ratio is bounded above and does not diverge. Moreover, it is statistically consistent and yields significantly tighter convergence guarantees than DDRO. We experimentally show its effectiveness with Qwen 2.5 and Llama 3.
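按摘要描述,“相对密度比”的一个常见写法如下(混合系数 α 为假设记号,具体定义以论文为准):

```latex
r_\alpha(x) \;=\; \frac{p_{\mathrm{pref}}(x)}{\alpha\, p_{\mathrm{pref}}(x) + (1-\alpha)\, p_{\mathrm{non}}(x)}, \qquad 0 < \alpha \le 1 .
```

由于分母中含有 α·p_pref(x),该比值被 1/α 上界约束而不会发散;相比之下,普通密度比 p_pref(x)/p_non(x) 在 p_non(x)→0 时无界,这正是 DDRO 训练不稳定的来源。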
[NLP-43] How Alignment Routes: Localizing Scaling and Controlling Policy Circuits in Language Models
【速读】: 该论文旨在解决大语言模型在对齐训练后出现的“稀疏路由机制”问题,即模型如何通过特定的注意力头结构实现对有害内容的拒绝响应。其解决方案的关键在于识别并验证了一个由“门控注意力头(gate attention head)”与“放大器注意力头(amplifier heads)”构成的稳定神经回路:门控头负责检测潜在违规内容,并触发下游放大器头增强拒绝信号;该机制在多个模型中具有高度可重复性和稳定性(Jaccard指数0.92–1.0),且可通过调节检测层信号连续控制政策强度(从硬性拒绝到事实合规),同时揭示了意图识别与策略路由之间的结构分离——当输入被加密编码时,门控头失效但模型仍能理解内容,表明预训练阶段的语义理解与后训练阶段的政策绑定具有不同的鲁棒性特征。
链接: https://arxiv.org/abs/2604.04385
作者: Gregory N. Frank
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We identify a recurring sparse routing mechanism in alignment-trained language models: a gate attention head reads detected content and triggers downstream amplifier heads that boost the signal toward refusal. Using political censorship and safety refusal as natural experiments, we trace this mechanism across 9 models from 6 labs, all validated on corpora of 120 prompt pairs. The gate head passes necessity and sufficiency interchange tests (p < 0.001, permutation null), and core amplifier heads are stable under bootstrap resampling (Jaccard 0.92-1.0). Three same-generation scaling pairs show that routing distributes at scale (ablation up to 17x weaker) while remaining detectable by interchange. By modulating the detection-layer signal, we continuously control policy strength from hard refusal through steering to factual compliance, with routing thresholds that vary by topic. The circuit also reveals a structural separation between intent recognition and policy routing: under cipher encoding, the gate head’s routing contribution collapses (78% in Phi-4 at n=120) while the model responds with puzzle-solving rather than refusal. The routing mechanism never fires, even though probe scores at deeper layers indicate the model begins to represent the harmful content. This asymmetry is consistent with different robustness properties of pretraining and post-training: broad semantic understanding versus narrower policy binding that generalizes less well under input transformation.
[NLP-44] Compressible Softmax-Attended Language under Incompressible Attention
【速读】: 该论文旨在解决Transformer语言模型中注意力机制的容量分配与实际信息流动之间不匹配的问题,即注意力机制在结构上均匀分配计算资源到所有头维度(d_h),但真实语言数据的信息交互却高度集中在少数维度上。解决方案的关键在于揭示了注意力机制中logit能量场(E~)的低秩特性:其90%方差仅需2–11个奇异分量即可捕获,而学习到的查询-键交互矩阵 WQTWK 则需要38–75个分量才能达到相同阈值,表明存在显著的谱间隙(有效秩相差5–25倍)。这说明语言数据本身具有高度可压缩性,而非模型架构或分析框架所致,从而为注意力机制的高效压缩和优化提供了理论依据。
链接: https://arxiv.org/abs/2604.04384
作者: Wonsuk Lee
机构: Seoul National University (首尔国立大学); SK Hynix America (SK海力士美国)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages
Abstract:Across every attention head in five transformer language models (124M–7B parameters, four architecture families), the logit energy field Ẽ reaches 90% of its variance in 2–11 singular components. The learned interaction matrix W_Q^T W_K needs 38–75 components for the same threshold out of d_h ∈ {64, 128}. The spectral gap is 5–25× in effective rank. The attention mechanism allocates capacity uniformly across all d_h dimensions, but language concentrates the actual interaction into a few. The compressibility of softmax-attended language is a property of the data, not the frame that analyzes it.
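“前 k 个奇异分量捕获 90% 方差(奇异值平方能量)”这一度量可以这样示意计算(玩具矩阵,非论文数据):

```python
import numpy as np

def components_for_variance(M, frac=0.90):
    """达到 frac 比例奇异值能量(平方和)所需的最少奇异分量数。"""
    s = np.linalg.svd(np.asarray(M, dtype=float), compute_uv=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)   # 累积能量占比(升序阈值搜索)
    return int(np.searchsorted(energy, frac) + 1)

rng = np.random.default_rng(0)
u, v = rng.normal(size=(64, 1)), rng.normal(size=(1, 64))
low_rank = u @ v + 0.01 * rng.normal(size=(64, 64))   # 近似秩 1:能量高度集中
isotropic = rng.normal(size=(64, 64))                 # 各向同性:能量分散
k_low = components_for_variance(low_rank)
k_iso = components_for_variance(isotropic)
```

论文中低秩的 logit 能量场对应 k_low 这种情形,而满秩的 W_Q^T W_K 对应 k_iso 这种情形。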
[NLP-45] GROUNDEDKG-RAG: Grounded Knowledge Graph Index for Long-document Question Answering LREC2026
【速读】: 该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)系统在长文档问答任务中存在资源消耗高、生成内容重复以及幻觉问题(hallucinations)等缺陷,这些问题主要源于对大语言模型(Large Language Models, LLMs)描述的过度依赖,缺乏对源文本的有效 grounding。解决方案的关键在于提出 GroundedKG-RAG,其核心创新是构建一个显式从源文档中提取并锚定(grounded)的知识图谱(Knowledge Graph, KG),其中节点表示实体和动作,边表示时序或语义关系,并确保每个节点与边均对应原文句子。通过语义角色标注(Semantic Role Labeling, SRL)和抽象意义表示(Abstract Meaning Representation, AMR)解析生成该知识图谱,并对其进行嵌入用于检索;查询时对 query 应用相同转换以获取最相关句子进行回答。此方法显著提升了效率与事实准确性,同时具备人类可读性,便于结果审计与错误分析。
链接: https://arxiv.org/abs/2604.04359
作者: Tianyi Zhang,Andreas Marfurt
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To appear in the Proceedings of KG-LLM @ LREC 2026
Abstract:Retrieval-augmented generation (RAG) systems have been widely adopted in contemporary large language models (LLMs) due to their ability to improve generation quality while reducing the required input context length. In this work, we focus on RAG systems for long-document question answering. Current approaches suffer from a heavy reliance on LLM descriptions, resulting in high resource consumption and latency, repetitive content across hierarchical levels, and hallucinations due to no or limited grounding in the source text. To improve both efficiency and factual accuracy through grounding, we propose GroundedKG-RAG, a RAG system in which the knowledge graph is explicitly extracted from and grounded in the source document. Specifically, we define nodes in GroundedKG as entities and actions, and edges as temporal or semantic relations, with each node and edge grounded in the original sentences. We construct GroundedKG from semantic role labeling (SRL) and abstract meaning representation (AMR) parses and then embed it for retrieval. During querying, we apply the same transformation to the query and retrieve the most relevant sentences from the grounded source text for question answering. We evaluate GroundedKG-RAG on examples from the NarrativeQA dataset and find that it performs on par with a state-of-the-art proprietary long-context model at smaller cost and outperforms a competitive baseline. Additionally, our GroundedKG is interpretable and readable by humans, facilitating auditing of results and error analysis.
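“每个节点与边都锚定到原文句子”的图结构可以用如下数据结构草图表示(字段命名为假设,仅示意 grounding 的存储方式):

```python
from dataclasses import dataclass, field

@dataclass
class GroundedNode:
    """实体或动作节点,记录其抽取来源的句子编号。"""
    label: str
    kind: str                                             # "entity" 或 "action"
    source_sentences: list = field(default_factory=list)  # 原文句子的索引

@dataclass
class GroundedEdge:
    """时序或语义关系边,同样锚定到原文句子。"""
    head: str
    tail: str
    relation: str                                         # 如 "temporal" / "semantic"
    source_sentences: list = field(default_factory=list)

# 玩具示例:查询命中某节点/边后,可直接回溯 source_sentences 取原文作答
node = GroundedNode("storm", "entity", source_sentences=[12])
edge = GroundedEdge("storm", "sink_ship", "temporal", source_sentences=[12, 13])
```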
[NLP-46] REAM: Merging Improves Pruning of Experts in LLMs
【速读】: 该论文旨在解决大规模混合专家(Mixture-of-Experts, MoE)语言模型在部署时面临的显著内存挑战,尤其是在参数量达百亿级别的模型中。传统方法如权重剪枝(weight pruning)和量化(quantization)虽能降低内存占用,但可能损害模型性能。为此,作者提出了一种新方法——路由器加权专家激活合并(Router-weighted Expert Activation Merging, REAM),其关键在于不直接移除专家(如REAP方法),而是基于路由器(router)的权重将相似专家分组并合并其参数,从而在保持模型原始性能的同时更有效地压缩模型。实验表明,REAM在多个多选题问答(MC)和生成式(GEN)基准上优于现有基线,并在多数情况下可媲美未压缩的原始模型。
链接: https://arxiv.org/abs/2604.04356
作者: Saurav Jha,Maryam Hashemzadeh,Ali Saheb Pasand,Ali Parviz,Min-Joong Lee,Boris Knyazev
机构: Mila – Quebec AI Institute (Mila –魁北克人工智能研究所); Polytechnique Montréal (蒙特利尔理工学院); Université de Montréal (蒙特利尔大学); McGill University (麦吉尔大学); AI Center, Samsung, South Korea (三星AI中心,韩国); Samsung AI Lab, Montreal (三星蒙特利尔人工智能实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Performance (cs.PF)
备注: code is at this https URL
Abstract:Mixture-of-Experts (MoE) large language models (LLMs) are among the top-performing architectures. The largest models, often with hundreds of billions of parameters, pose significant memory challenges for deployment. Traditional approaches to reduce memory requirements include weight pruning and quantization. Motivated by the Router-weighted Expert Activation Pruning (REAP) that prunes experts, we propose a novel method, Router-weighted Expert Activation Merging (REAM). Instead of removing experts, REAM groups them and merges their weights, better preserving original performance. We evaluate REAM against REAP and other baselines across multiple MoE LLMs on diverse multiple-choice (MC) question answering and generative (GEN) benchmarks. Our results reveal a trade-off between MC and GEN performance that depends on the mix of calibration data. By controlling the mix of general, math and coding data, we examine the Pareto frontier of this trade-off and show that REAM often outperforms the baselines and in many cases is comparable to the original uncompressed models.
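按摘要描述的“按路由权重分组合并专家”思路,可写成如下加权平均草图(分组方式与参数布局为假设,具体合并规则以论文为准):

```python
import numpy as np

def merge_experts(expert_weights, router_mass, groups):
    """组内按路由质量加权平均,把每组专家合并为一个专家。

    expert_weights: (E, d) 每个专家展平后的参数(假设布局)。
    router_mass:    (E,)  每个专家收到的平均路由权重。
    groups:         索引列表的列表,每组坍缩为一个合并专家。
    """
    merged = []
    for g in groups:
        w = router_mass[g] / router_mass[g].sum()          # 组内归一化权重
        merged.append((w[:, None] * expert_weights[g]).sum(axis=0))
    return np.stack(merged)

# 玩具示例:专家 0、1 合并(权重 1:3),专家 2 单独保留
experts = np.array([[1.0, 1.0], [3.0, 3.0], [10.0, 0.0]])
mass = np.array([1.0, 3.0, 2.0])
out = merge_experts(experts, mass, [[0, 1], [2]])
```

与直接剪枝(REAP)相比,这种合并保留了被剪专家的参数信息,只是以路由权重折中到组代表上。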
[NLP-47] Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多轮证据累积场景下进行医疗诊断时的行为不可靠问题,即当前LLMs虽在单轮提供完整临床信息时表现准确,但在模拟真实临床推理的多轮交互中缺乏稳定性和准确性。其解决方案的关键在于提出MINT(Medical Incremental N-Turn Benchmark),一个高保真、结构化的多轮医学诊断评估基准,通过系统性评估11个LLMs发现三种关键行为模式:过早作答(intent to answer)、自我修正能力(self-correction)和强诱因误导(strong lures)。基于这些发现,研究进一步提出可操作的临床策略:将诊断问题延迟至后续轮次提问可减少提前决策,提升首次承诺时的准确率高达62.6%;同时将具有临床显著性的信息(如实验室结果)保留至后期轮次,可避免因过早承诺导致高达23.3%的准确率下降。
链接: https://arxiv.org/abs/2604.04325
作者: Jinrui Fang,Runhan Chen,Xu Yang,Jian Yu,Jiawei Xu,Ashwin Vinod,Wenqi Shi,Tianlong Chen,Heng Ji,ChengXiang Zhai,Ying Ding,Yuji Zhang
机构: The University of Texas at Austin (得克萨斯大学奥斯汀分校); New York University (纽约大学); The University of Texas Southwestern Medical Center (德克萨斯西南医学中心); University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) achieve high accuracy in medical diagnosis when all clinical information is provided in a single turn, yet how they behave under multi-turn evidence accumulation closer to real clinical reasoning remains unexplored. We introduce MINT (Medical Incremental N-Turn Benchmark), a high-fidelity, multi-turn medical diagnosis benchmark comprising 1,035 cases with clinically labeled evidence shards, controlled turn granularity, and information-preserving decomposition. Through systematic evaluation of 11 LLMs on MINT, we uncover three persistent behavioral patterns that significantly impact diagnostic decisions: (1) intent to answer, models rush to answer before sufficient evidence has been observed, with over 55% of answers committed within the first two turns; (2) self-correction, incorrect-to-correct answer revisions occur at up to 10.6 times the rate of correct-to-incorrect flips, revealing a latent capacity for self-correction that premature commitment forecloses; and (3) strong lures, clinically salient information such as laboratory results trigger premature answering even when models are explicitly instructed to wait. We translate these findings into clinically actionable guidance: deferring the diagnostic question to later turns reduces premature answering and improves accuracy at the first point of commitment by up to 62.6%, while reserving salient clinical evidence for later turns prevents a catastrophic accuracy drop of up to 23.3% caused by premature commitment. Our work provides both a controlled evaluation framework and concrete recommendations for improving the reliability of LLMs in multi-turn medical diagnosis.
[NLP-48] How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings
【速读】: 该论文旨在解决当前大语言模型(Large Language Model, LLM)代理中技能(Agent Skills)实用性评估不足的问题,尤其是在现实场景下技能检索与选择的挑战性条件中,现有基准测试多依赖理想化设定(如直接提供手工定制的任务特定技能),而忽视了代理需从大规模真实技能库中自主检索并筛选相关技能的复杂性。研究发现,在更贴近实际的应用场景中,技能带来的性能提升显著下降,甚至接近无技能基线。解决方案的关键在于引入两种技能精炼策略:查询相关的(query-specific)和查询无关的(query-agnostic)方法,其中查询相关精炼能有效恢复因初始技能匹配度不高而导致的性能损失;实验表明,结合检索与精炼机制可在Terminal-Bench 2.0上将Claude Opus 4.6的通过率从57.7%提升至65.5%,验证了其在多模型上的普适性与有效性。
链接: https://arxiv.org/abs/2604.04323
作者: Yujian Liu,Jiabao Ji,Li An,Tommi Jaakkola,Yang Zhang,Shiyu Chang
机构: UC Santa Barbara; MIT CSAIL; MIT-IBM Watson AI Lab
类目: Computation and Language (cs.CL)
备注:
Abstract:Agent skills, which are reusable, domain-specific knowledge artifacts, have become a popular mechanism for extending LLM-based agents, yet formally benchmarking skill usage performance remains scarce. Existing skill benchmarking efforts focus on overly idealized conditions, where LLMs are directly provided with hand-crafted, narrowly-tailored task-specific skills for each task, whereas in many realistic settings, the LLM agent may have to search for and select relevant skills on its own, and even the closest matching skills may not be well-tailored for the task. In this paper, we conduct the first comprehensive study of skill utility under progressively challenging realistic settings, where agents must retrieve skills from a large collection of 34k real-world skills and may not have access to any hand-curated skills. Our findings reveal that the benefits of skills are fragile: performance gains degrade consistently as settings become more realistic, with pass rates approaching no-skill baselines in the most challenging scenarios. To narrow this gap, we study skill refinement strategies, including query-specific and query-agnostic approaches, and we show that query-specific refinement substantially recovers lost performance when the initial skills are of reasonable relevance and quality. We further demonstrate the generality of retrieval and refinement on Terminal-Bench 2.0, where they improve the pass rate of Claude Opus 4.6 from 57.7% to 65.5%. Our results, consistent across multiple models, highlight both the promise and the current limitations of skills for LLM-based agents. Our code is available at this https URL.
[NLP-49] High-Stakes Personalization: Rethinking LLM Customization for Individual Investor Decision-Making ACL2026
【速读】: 该论文旨在解决个性化大语言模型(LLM)在高风险、长时间决策场景中面临的根本性局限问题,特别是针对个体投资者决策这一特殊领域。研究指出,传统LLM个性化方法在行为记忆复杂性、投资理念一致性、风格与信号冲突以及缺乏真实标签的对齐评估等方面存在显著不足。解决方案的关键在于构建一个能够应对上述四维挑战的系统架构:通过引入长期行为建模机制以捕捉动态演化且可能自相矛盾的投资模式;采用状态感知设计缓解因会话边界导致的策略漂移;平衡用户个人投资哲学与客观市场证据之间的张力;并利用基于过程反馈的隐式对齐机制替代传统监督学习中的固定标签。这些架构响应为高风险、时间跨度长的个性化自然语言处理任务提供了新的技术路径和开放研究方向。
链接: https://arxiv.org/abs/2604.04300
作者: Yash Ganpat Sawant
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 4 pages + 1 page references. Submitted to CustomNLP4U Workshop @ ACL 2026
Abstract:Personalized LLM systems have advanced rapidly, yet most operate in domains where user preferences are stable and ground truth is either absent or subjective. We argue that individual investor decision-making presents a uniquely challenging domain for LLM personalization - one that exposes fundamental limitations in current customization paradigms. Drawing on our system, built and deployed for AI-augmented portfolio management, we identify four axes along which individual investing exposes fundamental limitations in standard LLM customization: (1) behavioral memory complexity, where investor patterns are temporally evolving, self-contradictory, and financially consequential; (2) thesis consistency under drift, where maintaining coherent investment rationale over weeks or months strains stateless and session-bounded architectures; (3) style-signal tension, where the system must simultaneously respect personal investment philosophy and surface objective evidence that may contradict it; and (4) alignment without ground truth, where personalization quality cannot be evaluated against a fixed label set because outcomes are stochastic and delayed. We describe the architectural responses that emerged from building the system and propose open research directions for personalized NLP in high-stakes, temporally extended decision domains.
[NLP-50] Adaptive Cost-Efficient Evaluation for Reliable Patent Claim Validation
【速读】: 该论文旨在解决专利权利要求自动化验证中“零缺陷容忍”与计算成本之间的矛盾问题:现有方法要么因轻量编码器难以处理复杂的法律依赖关系而准确性不足,要么依赖大语言模型(Large Language Models, LLMs)进行全量验证导致资源开销过高。解决方案的关键在于提出ACE(Adaptive Cost-efficient Evaluation)框架,其核心机制是利用预测熵(predictive entropy)动态识别高不确定性权利要求,并仅将这些样本路由至专家级LLM执行基于35 U.S.C.法律条文的专利思维链(Chain of Patent Thought, CoPT)推理,从而在保证高精度(F1=94.95%)的同时,相较纯LLM方案降低78%的运行成本。
链接: https://arxiv.org/abs/2604.04295
作者: Yongmin Yoo,Qiongkai Xu,Longbing Cao
机构: Macquarie University (麦考瑞大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Automated validation of patent claims demands zero-defect tolerance, as even a single structural flaw can render a claim legally defective. Existing evaluation paradigms suffer from a rigidity-resource dilemma: lightweight encoders struggle with nuanced legal dependencies, while exhaustive verification via Large Language Models (LLMs) is prohibitively costly. To bridge this gap, we propose ACE (Adaptive Cost-efficient Evaluation), a hybrid framework that uses predictive entropy to route only high-uncertainty claims to an expert LLM. The expert then executes a Chain of Patent Thought (CoPT) protocol grounded in 35 U.S.C. statutory standards. This design enables ACE to handle long-range legal dependencies more effectively while preserving efficiency. ACE achieves the best F1 among the evaluated methods at 94.95%, while reducing operational costs by 78% compared to standalone LLM deployments. We also construct ACE-40k, a 40,000-claim benchmark with MPEP-grounded error annotations, to facilitate further research.
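ACE“用预测熵把高不确定样本路由给专家 LLM”的核心逻辑可示意如下(阈值 0.5 与函数名均为假设):

```python
import numpy as np

def predictive_entropy(probs):
    """预测类别分布的香农熵(单位:nat)。"""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def route(probs, threshold=0.5):
    """高不确定性的权利要求交给专家 LLM,其余由轻量模型处理。"""
    return "expert_llm" if predictive_entropy(probs) > threshold else "lightweight"

confident = [0.98, 0.01, 0.01]   # 轻量编码器很有把握 → 本地处理
uncertain = [0.40, 0.35, 0.25]   # 分布接近均匀 → 路由给专家
```

由于只有熵超阈值的样本触发昂贵的 LLM 推理,整体成本随阈值可调,这正是摘要中 78% 成本节省的来源。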
[NLP-51] Entropy, Disagreement, and the Limits of Foundation Models in Genomics
【速读】: 该论文试图解决的问题是:生成式 AI (Generative AI) 在基因组学领域相较于自然语言处理领域的应用效果不佳,其根本原因尚不明确。研究的关键在于揭示熵(entropy)作为限制基础模型从训练数据中学习并发展通用能力的根本因素。通过在文本和DNA序列上训练模型集合,并分析预测分布、静态嵌入(static embeddings)及经验Fisher信息流,作者发现基因组序列的高熵导致输出分布趋近均匀、模型间预测分歧显著以及嵌入不稳定,即使模型在架构、训练和数据上完全一致。此外,DNA训练模型将Fisher信息集中在嵌入层,未能有效利用token之间的关联性,表明仅依靠序列自监督训练可能无法适用于基因组数据,从而质疑了当前基因组基础模型训练方法的基本假设。
链接: https://arxiv.org/abs/2604.04287
作者: Maxime Rochkoulets,Lovro Vrček,Mile Šikić
机构: Genome Institute of Singapore, A*STAR, Singapore; KU Leuven, Belgium; Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Genomics (q-bio.GN)
备注:
Abstract:Foundation models in genomics have shown mixed success compared to their counterparts in natural language processing. Yet, the reasons for their limited effectiveness remain poorly understood. In this work, we investigate the role of entropy as a fundamental factor limiting the capacities of such models to learn from their training data and develop foundational capabilities. We train ensembles of models on text and DNA sequences and analyze their predictions, static embeddings, and empirical Fisher information flow. We show that the high entropy of genomic sequences – from the point of view of unseen token prediction – leads to near-uniform output distributions, disagreement across models, and unstable static embeddings, even for models that are matched in architecture, training and data. We then demonstrate that models trained on DNA concentrate Fisher information in embedding layers, seemingly failing to exploit inter-token relationships. Our results suggest that self-supervised training from sequences alone may not be applicable to genomic data, calling into question the assumptions underlying current methodologies for training genomic foundation models.
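摘要中“基因组序列在未见 token 预测意义下熵更高”的直觉,可用一个经验条件熵的小实验示意(玩具序列,非论文语料):

```python
import numpy as np
from collections import Counter, defaultdict

def next_token_entropy(seq, context=4):
    """经验条件熵 H(x_t | 前 context 个符号),单位 bit(插入式估计)。"""
    counts = defaultdict(Counter)
    for i in range(context, len(seq)):
        counts[seq[i - context:i]][seq[i]] += 1
    total = sum(sum(c.values()) for c in counts.values())
    h = 0.0
    for c in counts.values():
        n = sum(c.values())
        p = np.array(list(c.values()), dtype=float) / n
        h += (n / total) * float(-(p * np.log2(p)).sum())
    return h

rng = np.random.default_rng(0)
dna_like = "".join(rng.choice(list("ACGT"), size=20000))  # 近独立均匀:高熵
text_like = "the cat sat " * 1000                          # 周期文本:给定上下文可完全预测
h_dna = next_token_entropy(dna_like)
h_text = next_token_entropy(text_like)
```

高熵意味着下一 token 的分布接近均匀,此时自监督损失几乎不提供可学的结构,这与论文观察到的近均匀输出分布一致。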
[NLP-52] Commercial Persuasion in AI-Mediated Conversations
【速读】: 该论文旨在解决生成式 AI(Generative AI)在用户与网络交互中可能被用于隐蔽地影响消费者决策的问题,尤其关注大型语言模型(LLM)如何通过对话式代理诱导用户选择付费推广的产品。其解决方案的关键在于设计并实施了两项预注册实验(N = 2,012),对比传统搜索引擎与基于五种前沿大模型的对话式 LLM 代理在推荐图书时对用户选择行为的影响,结果表明:LLM 推销显著提升了用户对赞助商品的选择率(61.2% vs. 22.4%),且多数用户无法识别促销意图,即便添加显式“Sponsored”标签也未能有效削弱这种影响力,进一步说明现有透明机制在对抗 AI 驱动的隐性说服方面存在明显不足。
链接: https://arxiv.org/abs/2604.04263
作者: Francesco Salvi,Alejandro Cuevas,Manoel Horta Ribeiro
机构: Princeton University (普林斯顿大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:As Large Language Models (LLMs) become a primary interface between users and the web, companies face growing economic incentives to embed commercial influence into AI-mediated conversations. We present two preregistered experiments (N = 2,012) in which participants selected a book to receive from a large eBook catalog using either a traditional search engine or a conversational LLM agent powered by one of five frontier models. Unbeknownst to participants, a fifth of all products were randomly designated as sponsored and promoted in different ways. We find that LLM-driven persuasion nearly triples the rate at which users select sponsored products compared to traditional search placement (61.2% vs. 22.4%), while the vast majority of participants fail to detect any promotional steering. Explicit “Sponsored” labels do not significantly reduce persuasion, and instructing the model to conceal its intent makes its influence nearly invisible (detection accuracy < 10%). Altogether, our results indicate that conversational AI can covertly redirect consumer choices at scale, and that existing transparency mechanisms may be insufficient to protect users.
[NLP-53] CAWN: Continuous Acoustic Wave Networks for Autoregressive Language Modeling
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中自注意力机制(self-attention)随序列长度增长呈二次方复杂度(O(L²))的问题,以及现有线性时间替代方案(如状态空间模型 State Space Models, SSMs)在长上下文下信号退化(signal degradation)的缺陷。其核心解决方案是提出一种全连续序列混合架构——连续声波网络(Continuous Acoustic Wave Network, CAWN),关键创新在于:通过将隐藏状态投影到多头复数域相位矢量(phasors),利用因果性、O(L) 阶相位累积机制实现高效序列混合;引入双门控选择性相位共振机制(dual-gated Selective Phase Resonance),结合频率依赖保留(Frequency-Dependent Retention)、直通估计硬阈值门控(Hard-Threshold Gating via Straight-Through Estimation)和时序语法缓存(Temporal Syntax Cache),有效抑制超长上下文中的信号衰减;同时采用深度可分离谐波卷积(Depth-wise Harmonic Convolutions)优化空间频率混合,并辅以块注意力残差(Block Attention Residuals)进行深度状态路由,最终在150M参数规模下实现峰值显存仅8.72 GB、支持200万token上下文检索的硬件高效推理能力。
链接: https://arxiv.org/abs/2604.04250
作者: Dejan Čugalj,Aleksandar Jevremovic
机构: Singidunum University (辛吉杜努姆大学)
类目: Computation and Language (cs.CL)
备注: 13 pages, 3 figures
Abstract:Modern Large Language Models (LLMs) rely on Transformer self-attention, which scales quadratically with sequence length. Recent linear-time alternatives, like State Space Models (SSMs), often suffer from signal degradation over extended contexts. We introduce the Continuous Acoustic Wave Network (CAWN), a fully continuous sequence-mixing architecture. Instead of discrete matrix-based attention, CAWN projects hidden states into multi-headed complex-domain phasors, achieving sequence mixing through a causal, O(L) Phase Accumulation mechanism. To prevent signal degradation over ultra-long contexts, we introduce a dual-gated Selective Phase Resonance mechanism incorporating Frequency-Dependent Retention, Hard-Threshold Gating via Straight-Through Estimation, and a Temporal Syntax Cache to capture short-term local dependencies. We also replace standard dense linear projections with Depth-wise Harmonic Convolutions for optimal spatial frequency mixing, augmented by Block Attention Residuals for depth-wise state routing. Scaled to a 150M-parameter model, CAWN utilizes custom Triton kernels for hardware-efficient, true-complex phase accumulation in float32. Trained via a continuous streaming loop on a 100-Billion-token corpus, the prototype is evaluated at a 5-Billion-token milestone. Empirical evaluations via a Targeted Semantic Retrieval protocol demonstrate robust vocabulary acquisition and extended explicitly learned contextual denoising. By leveraging O(1) state-passing via chunked prefill, the model retrieves targeted information across 2,000,000 tokens while strictly plateauing at 8.72 GB of Peak VRAM, empirically overcoming the O(L^2) context memory wall.
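因果 O(L) 相位累积(Phase Accumulation)的骨架可示意如下:把隐状态乘上位置相关的复相位后做前缀和,再解调回各位置的局部参考系。这只是机制示意,省略了论文中的选择性门控、频率保留等组件:

```python
import numpy as np

def phase_accumulate(x, omega):
    """因果 O(L) 序列混合:复相位编码位置,前缀和传播信息。

    x: (L, d) 实数隐状态;omega: (d,) 每个通道的角频率(假设参数化)。
    位置 t 只读取 t 之前(含自身)的前缀和,信息严格从左向右流动。
    """
    L, d = x.shape
    t = np.arange(L)[:, None]
    phasors = x * np.exp(1j * omega[None, :] * t)   # 把位置编码进相位
    state = np.cumsum(phasors, axis=0)              # 因果前缀累积,单次线性扫描
    return np.real(state * np.exp(-1j * omega[None, :] * t))  # 解调回局部参考系

# 玩具示例:omega=0 的通道退化为普通前缀和;非零频率通道产生干涉图样
x = np.ones((4, 2))
out = phase_accumulate(x, np.array([0.0, np.pi / 2]))
```

前缀和的可分块性(chunked prefill 中 O(1) 状态传递)正来自 cumsum 这一步只需携带上一块的末状态。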
[NLP-54] Combee: Scaling Prompt Learning for Self-Improving Language Model Agents
【速读】: 该论文旨在解决现有提示学习(prompt learning)方法在高并行场景下效率与质量难以兼顾的问题。当前方法(如ACE或GEPA)主要适用于单智能体或低并行度环境,无法高效利用大量收集的代理轨迹(agentic traces),且在高并行度下容易出现性能退化。为提升提示学习的效率与质量,作者提出Combee框架,其关键在于引入并行扫描(parallel scans)机制和增强型洗牌策略(augmented shuffle mechanism),同时设计动态批大小控制器(dynamic batch size controller)以平衡学习质量与延迟。该方案支持多代理并行运行并从中学习,实现高达17倍的速度提升,且保持相当或更优的准确性。
链接: https://arxiv.org/abs/2604.04247
作者: Hanchen Li,Runyuan He,Qizheng Zhang,Changxiu Ji,Qiuyang Mang,Xiaokun Chen,Lakshya A Agrawal,Wei-Liang Liao,Eric Yang,Alvin Cheung,James Zou,Kunle Olukotun,Ion Stoica,Joseph E. Gonzalez
机构: UC Berkeley (加州大学伯克利分校); Stanford University (斯坦福大学); Tensormesh; Gradient Network
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Recent advances in prompt learning allow large language model agents to acquire task-relevant knowledge from inference-time context without parameter changes. For example, existing methods (like ACE or GEPA) can learn system prompts to improve accuracy based on previous agent runs. However, these methods primarily focus on single-agent or low-parallelism settings. This fundamentally limits their ability to efficiently learn from a large set of collected agentic traces. It would be efficient and beneficial to run prompt learning in parallel to accommodate the growing trend of learning from many agentic traces or parallel agent executions. Yet without a principled strategy for scaling, current methods suffer from quality degradation with high parallelism. To improve both the efficiency and quality of prompt learning, we propose Combee, a novel framework to scale parallel prompt learning for self-improving agents. Combee speeds up learning and enables running many agents in parallel while learning from their aggregate traces without quality degradation. To achieve this, Combee leverages parallel scans and employs an augmented shuffle mechanism; Combee also introduces a dynamic batch size controller to balance quality and delay. Evaluations on AppWorld, Terminal-Bench, Formula, and FiNER demonstrate that Combee achieves up to 17x speedup over previous methods with comparable or better accuracy and equivalent cost.
[NLP-55] Precise Robot Command Understanding Using Grammar-Constrained Large Language Models
【Quick Read】: This paper addresses the unreliability and non-executability of commands produced by Large Language Models (LLMs) in industrial human-robot collaboration, which stems from their lack of domain-specific rigidity. The key is a grammar-constrained hybrid model with a two-stage pipeline: a fine-tuned LLM first performs high-level contextual reasoning and parameter inference, after which a Structured Language Model (SLM) and a grammar-based canonicalizer force the output into a standardized symbolic format composed of valid action frames and command elements, guaranteeing robot-readable JSON commands. A validation-and-feedback loop uses a grammar parser to check each output; when an invalid command is detected, corrective prompts are generated automatically and the LLM is re-engaged, enabling iterative self-correction that markedly improves robustness and command validity.
Link: https://arxiv.org/abs/2604.04233
Authors: Xinyun Huo,Raghav Gnanasambandam,Xinyao Zhang
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computation and Language (cs.CL)
Comments: Accepted at ASME MSEC2026
Abstract:Human-robot collaboration in industrial settings requires precise and reliable communication to enhance operational efficiency. While Large Language Models (LLMs) understand general language, they often lack the domain-specific rigidity needed for safe and executable industrial commands. To address this gap, this paper introduces a novel grammar-constrained LLM that integrates a grammar-driven Natural Language Understanding (NLU) system with a fine-tuned LLM, which enables both conversational flexibility and the deterministic precision required in robotics. Our method employs a two-stage process. First, a fine-tuned LLM performs high-level contextual reasoning and parameter inference on natural language inputs. Second, a Structured Language Model (SLM) and a grammar-based canonicalizer constrain the LLM’s output, forcing it into a standardized symbolic format composed of valid action frames and command elements. This process guarantees that generated commands are valid and structured in a robot-readable JSON format. A key feature of the proposed model is a validation and feedback loop. A grammar parser validates the output against a predefined list of executable robotic actions. If a command is invalid, the system automatically generates corrective prompts and re-engages the LLM. This iterative self-correction mechanism allows the model to recover from initial interpretation errors to improve system robustness. We evaluate our grammar-constrained hybrid model against two baselines: a fine-tuned API-based LLM and a standalone grammar-driven NLU model. Using the Human Robot Interaction Corpus (HuRIC) dataset, we demonstrate that the hybrid approach achieves superior command validity, which promotes safer and more effective industrial human-robot collaboration.
[NLP-56] DARE: Diffusion Large Language Models Alignment and Reinforcement Executor
【Quick Read】: This paper tackles the fragmentation of the open-source ecosystem for diffusion large language models (dLLMs) across model families and post-training pipelines, which slows research iteration, imposes a heavy engineering burden for reproduction, and makes fair algorithmic comparison difficult. The core solution is DARE (dLLMs Alignment and Reinforcement Executor), a unified framework built on verl and OpenCompass that integrates supervised fine-tuning, parameter-efficient fine-tuning, preference optimization, and dLLM-specific reinforcement learning under a shared execution stack for both masked and block diffusion language models. It thereby offers algorithmic coverage across model families, reproducible benchmark evaluation, and practical acceleration, providing a reusable research substrate for developing, comparing, and deploying post-training methods for current and future dLLMs.
Link: https://arxiv.org/abs/2604.04215
Authors: Jingyi Yang,Yuxian Jiang,Xuhao Hu,Shuang Cheng,Biqing Qi,Jing Shao
Affiliations: Shanghai Artificial Intelligence Laboratory; Fudan University; Zhejiang University; Shanghai AI Laboratory
Subjects: Computation and Language (cs.CL)
Comments: 14 pages, 3 figures, 5 tables
Abstract:Diffusion large language models (dLLMs) are emerging as a compelling alternative to dominant autoregressive models, replacing strictly sequential token generation with iterative denoising and parallel generation dynamics. However, their open-source ecosystem remains fragmented across model families and, in particular, across post-training pipelines, where reinforcement learning objectives, rollout implementations and evaluation scripts are often released as paper-specific codebases. This fragmentation slows research iteration, raises the engineering burden of reproduction, and makes fair comparison across algorithms difficult. We present DARE (dLLMs Alignment and Reinforcement Executor), an open framework for post-training and evaluating dLLMs. Built on top of verl (Sheng et al., 2024) and OpenCompass (2023), DARE unifies supervised fine-tuning, parameter-efficient fine-tuning, preference optimization, and dLLM-specific reinforcement learning under a shared execution stack for both masked and block diffusion language models. Across representative model families including LLaDA, Dream, SDAR, and LLaDA2.x, DARE provides broad algorithmic coverage, reproducible benchmark evaluation, and practical acceleration. Extensive empirical results position DARE as a reusable research substrate for developing, comparing, and deploying post-training methods for current and emerging dLLMs.
[NLP-57] Which English Do LLMs Prefer? Triangulating Structural Bias Towards American English in Foundation Models
【Quick Read】: This paper examines the systematic bias of Large Language Models (LLMs) toward standard English varieties such as American English (AmE) over British English (BrE), revealing structural inequities driven by the geopolitical histories of data curation, digital dominance, and linguistic standardization. The key contribution is DiAlign, a dynamic, training-free method that estimates dialectal alignment from distributional evidence, combined with a three-stage empirical analysis: (i) audits of six major pretraining corpora show a pronounced skew toward AmE; (ii) tokenizer analyses show that BrE forms incur higher segmentation costs; and (iii) generative evaluations confirm a persistent AmE preference in model outputs. This multi-faceted framework is the first systematic characterization of dialectal asymmetries across the full LLM development pipeline, offering both grounding and a practical path toward more dialectally inclusive language technologies.
Link: https://arxiv.org/abs/2604.04204
Authors: Mir Tafseer Nayeem,Davood Rafiei
Affiliations: University of Alberta
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: Preprint
Abstract:Large language models (LLMs) are increasingly deployed in high-stakes domains, yet they expose only limited language settings, most notably “English (US),” despite the global diversity and colonial history of English. Through a postcolonial framing to explain the broader significance, we investigate how geopolitical histories of data curation, digital dominance, and linguistic standardization shape the LLM development pipeline. Focusing on two dominant standard varieties, American English (AmE) and British English (BrE), we construct a curated corpus of 1,813 AmE–BrE variants and introduce DiAlign, a dynamic, training-free method for estimating dialectal alignment using distributional evidence. We operationalize structural bias by triangulating evidence across three stages: (i) audits of six major pretraining corpora reveal systematic skew toward AmE, (ii) tokenizer analyses show that BrE forms incur higher segmentation costs, and (iii) generative evaluations show a persistent AmE preference in model outputs. To our knowledge, this is the first systematic and multi-faceted examination of dialectal asymmetries in standard English varieties across the phases of LLM development. We find that contemporary LLMs privilege AmE as the de facto norm, raising concerns about linguistic homogenization, epistemic injustice, and inequity in global AI deployment, while motivating practical steps toward more dialectally inclusive language technologies.
[NLP-58] ClawArena: Benchmarking AI Agents in Evolving Information Environments
【Quick Read】: This paper addresses how AI agents can maintain correct beliefs during persistent operation in settings where the information environment evolves, sources frequently conflict, and user preferences surface through implicit corrections rather than explicit instructions. Existing benchmarks largely assume static, single-authority sources and cannot evaluate these capabilities. The key design of the proposed ClawArena benchmark includes: (1) dynamic scenarios with a complete hidden ground truth that expose the agent only to noisy, partial, and sometimes contradictory multi-channel evidence (session transcripts, workspace files, and staged updates); (2) a 14-category question taxonomy organized around three coupled challenges: multi-source conflict reasoning, dynamic belief revision, and implicit personalization; and (3) two question formats (multiple choice and shell-based executable checks) testing both reasoning and workspace grounding, bringing evaluation far closer to real deployments where agents must continuously adapt to changing environments.
Link: https://arxiv.org/abs/2604.04202
Authors: Haonian Ji,Kaiwen Xiong,Siwei Han,Peng Xia,Shi Qiu,Yiyang Zhou,Jiaqi Liu,Jinlong Li,Bingzhou Li,Zeyu Zheng,Cihang Xie,Huaxiu Yao
Affiliations: UNC-Chapel Hill; University of California, Santa Cruz; University of California, Berkeley
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:AI agents deployed as persistent assistants must maintain correct beliefs as their information environment evolves. In practice, evidence is scattered across heterogeneous sources that often contradict one another, new information can invalidate earlier conclusions, and user preferences surface through corrections rather than explicit instructions. Existing benchmarks largely assume static, single-authority settings and do not evaluate whether agents can keep up with this complexity. We introduce ClawArena, a benchmark for evaluating AI agents in evolving information environments. Each scenario maintains a complete hidden ground truth while exposing the agent only to noisy, partial, and sometimes contradictory traces across multi-channel sessions, workspace files, and staged updates. Evaluation is organized around three coupled challenges: multi-source conflict reasoning, dynamic belief revision, and implicit personalization, whose interactions yield a 14-category question taxonomy. Two question formats, multi-choice (set-selection) and shell-based executable checks, test both reasoning and workspace grounding. The current release contains 64 scenarios across 8 professional domains, totaling 1,879 evaluation rounds and 365 dynamic updates. Experiments on five agent frameworks and five language models show that both model capability (15.4% range) and framework design (9.2%) substantially affect performance, that self-evolving skill frameworks can partially close model-capability gaps, and that belief revision difficulty is determined by update design strategy rather than the mere presence of updates. Code is available at this https URL.
[NLP-59] Position: Logical Soundness is not a Reliable Criterion for Neurosymbolic Fact-Checking with LLMs
【Quick Read】: This position paper addresses a limitation of current neurosymbolic fact-checking systems that lean heavily on formal logic: while they can identify logically valid conclusions, they fail to detect claims that are logically sound yet misleading to humans. The key proposal is to treat the human-like reasoning tendencies of LLMs as a feature to exploit rather than a bias to correct, using LLMs to check whether the outputs of formal logic modules match the inferences humans would typically draw, thereby identifying and filtering potentially misleading conclusions.
Link: https://arxiv.org/abs/2604.04177
Authors: Jason Chan,Robert Gaizauskas,Zhixue Zhao
Affiliations: University of Sheffield
Subjects: Computation and Language (cs.CL)
Comments: Preprint
Abstract:As large language models (LLMs) are increasingly integrated into fact-checking pipelines, formal logic is often proposed as a rigorous means by which to mitigate bias, errors and hallucinations in these models’ outputs. For example, some neurosymbolic systems verify claims by using LLMs to translate natural language into logical formulae and then checking whether the proposed claims are logically sound, i.e. whether they can be validly derived from premises that are verified to be true. We argue that such approaches structurally fail to detect misleading claims due to systematic divergences between conclusions that are logically sound and inferences that humans typically make and accept. Drawing on studies in cognitive science and pragmatics, we present a typology of cases in which logically sound conclusions systematically elicit human inferences that are unsupported by the underlying premises. Consequently, we advocate for a complementary approach: leveraging the human-like reasoning tendencies of LLMs as a feature rather than a bug, and using these models to validate the outputs of formal components in neurosymbolic systems against potentially misleading conclusions.
[NLP-60] Many Preferences Few Policies: Towards Scalable Language Model Personalization
【Quick Read】: This paper addresses a core challenge in LLM personalization at scale: tailoring an LLM to each user's preferences under realistic limits on compute and system complexity, since maintaining a separate LLM per user is impractical. The key idea, PALM (Portfolio of Aligned LLMs), models user preferences as multi-dimensional weight vectors over traits such as safety, humor, and brevity; given reward functions along these dimensions, it constructs a small portfolio of LLMs such that for any user weight vector the portfolio contains a near-optimal LLM for the corresponding scalarized objective. This is the first result with theoretical guarantees on both portfolio size and personalization quality, characterizing the trade-off between system cost and personalization as well as the diversity of LLMs needed to cover the landscape of user preferences.
Link: https://arxiv.org/abs/2604.04144
Authors: Cheol Woo Kum,Jai Moondra,Roozbeh Nahavandi,Andrew Perrault,Milind Tambe,Swati Gupta
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The holy grail of LLM personalization is a single LLM for each user, perfectly aligned with that user’s preferences. However, maintaining a separate LLM per user is impractical due to constraints on compute, memory, and system complexity. We address this challenge by developing a principled method for selecting a small portfolio of LLMs that captures representative behaviors across heterogeneous users. We model user preferences across multiple traits (e.g., safety, humor, brevity) through a multi-dimensional weight vector. Given reward functions across these dimensions, our algorithm PALM (Portfolio of Aligned LLMs) generates a small portfolio of LLMs such that, for any weight vector, the portfolio contains a near-optimal LLM for the corresponding scalarized objective. To the best of our knowledge, this is the first result that provides theoretical guarantees on both the size and approximation quality of LLM portfolios for personalization. It characterizes the trade-off between system cost and personalization, as well as the diversity of LLMs required to cover the landscape of user preferences. We provide empirical results that validate these guarantees and demonstrate greater output diversity over common baselines.
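The serving side of the portfolio idea is a one-liner; the construction side is where PALM's guarantees live. A minimal sketch under stated assumptions (the candidate reward vectors, weight samples, and greedy coverage objective below are illustrative stand-ins, not PALM's actual algorithm):

```python
def scalarized_value(rewards, w):
    # User utility: dot product of per-trait rewards with preference weights.
    return sum(r * wi for r, wi in zip(rewards, w))

def best_in_portfolio(portfolio, w):
    # Serve the user with the portfolio member maximizing their objective.
    return max(portfolio, key=lambda name: scalarized_value(portfolio[name], w))

def greedy_portfolio(candidates, weight_samples, k):
    """Illustrative greedy cover: repeatedly add the candidate LLM that
    most improves the summed best-achievable utility over sampled user
    weight vectors. PALM's real construction carries size and
    approximation guarantees that this sketch does not."""
    chosen = []
    for _ in range(k):
        def coverage(selection):
            return sum(max(scalarized_value(candidates[m], w) for m in selection)
                       for w in weight_samples)
        best = max((m for m in candidates if m not in chosen),
                   key=lambda m: coverage(chosen + [m]))
        chosen.append(best)
    return chosen
```

Note how a single "balanced" model can cover two opposed user profiles better than either specialist alone, which is exactly the small-portfolio intuition.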
[NLP-61] Shorter but Still Trustworthy? An Empirical Study of Chain-of-Thought Compression
【Quick Read】: This paper studies the trustworthiness regressions that can accompany compressing long chain-of-thought (CoT) reasoning models. Existing compression methods are evaluated almost solely on task accuracy and token savings, ignoring their impact on trustworthiness dimensions such as safety, hallucination resistance, and multilingual robustness. The study finds that CoT compression frequently degrades trustworthiness, with markedly different degradation profiles across methods and dimensions. The key contributions are a normalized efficiency score for fair comparison of trustworthiness trade-offs across compression methods, and an alignment-aware DPO variant that cuts CoT length by 19.3% on reasoning benchmarks with substantially smaller trustworthiness loss, showing that trustworthiness should be optimized alongside efficiency as an equal design constraint.
Link: https://arxiv.org/abs/2604.04120
Authors: Lingjie Zeng,Xiaofan Chen,Yanbo Wang,Xiuying Chen
Affiliations: Mohamed bin Zayed University of Artificial Intelligence
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Long chain-of-thought (Long-CoT) reasoning models have motivated a growing body of work on compressing reasoning traces to reduce inference cost, yet existing evaluations focus almost exclusively on task accuracy and token savings. Trustworthiness properties, whether acquired or reinforced through post-training, are encoded in the same parameter space that compression modifies. This means preserving accuracy does not, a priori, guarantee preserving trustworthiness. We conduct the first systematic empirical study of how CoT compression affects model trustworthiness, evaluating multiple models of different scales along three dimensions: safety, hallucination resistance, and multilingual robustness. Under controlled comparisons, we find that CoT compression frequently introduces trustworthiness regressions and that different methods exhibit markedly different degradation profiles across dimensions. To enable fair comparison across bases, we propose a normalized efficiency score for each dimension that reveals how naïve scalar metrics can obscure trustworthiness trade-offs. As an existence proof, we further introduce an alignment-aware DPO variant that reduces CoT length by 19.3% on reasoning benchmarks with substantially smaller trustworthiness loss. Our findings suggest that CoT compression should be optimized not only for efficiency but also for trustworthiness, treating both as equally important design constraints.
[NLP-62] Embedding Enhancement via Fine-Tuned Language Models for Learner-Item Cognitive Modeling WWW’26
【Quick Read】: This paper addresses two problems in combining ID-embedding-based cognitive diagnosis (CD) with language models (LMs): a feature-space distribution gap caused by the misalignment between LM and CD training objectives, and the lack of a unified framework for integrating textual embeddings across diverse CD tasks while keeping embedding enhancement robust. The key is EduEmbed, a unified two-stage embedding enhancement framework: the first stage fine-tunes LMs with role-specific representations and an interaction diagnoser to bridge the semantic gap with CD models; the second stage employs a textual adapter to extract task-relevant semantics and integrate them into existing modeling paradigms, improving generalization. The framework is validated on four CD tasks and a computerized adaptive testing (CAT) task.
Link: https://arxiv.org/abs/2604.04088
Authors: Yuanhao Liu,Zihan Zhou,Kaiying Wu,Shuo Liu,Yiyang Huang,Jiajun Guo,Aimin Zhou,Hong Qian
Affiliations: East China Normal University; Tencent Inc; Shanghai Innovation Institute
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Accepted by The ACM Web Conference 2026 (WWW '26)
Abstract:Learner-item cognitive modeling plays a central role in the web-based online intelligent education system by enabling cognitive diagnosis (CD) across diverse online educational scenarios. Although ID embedding remains the mainstream approach in cognitive modeling due to its effectiveness and flexibility, recent advances in language models (LMs) have introduced new possibilities for incorporating rich semantic representations to enhance CD performance. This highlights the need for a comprehensive analysis of how LMs enhance embeddings through semantic integration across mainstream CD tasks. This paper identifies two key challenges in fully leveraging LMs in existing work: Misalignment between the training objectives of LMs and CD models creates a distribution gap in feature spaces; A unified framework is essential for integrating textual embeddings across varied CD tasks while preserving the strengths of existing cognitive modeling paradigms to ensure the robustness of embedding enhancement. To address these challenges, this paper introduces EduEmbed, a unified embedding enhancement framework that leverages fine-tuned LMs to enrich learner-item cognitive modeling across diverse CD tasks. EduEmbed operates in two stages. In the first stage, we fine-tune LMs based on role-specific representations and an interaction diagnoser to bridge the semantic gap of CD models. In the second stage, we employ a textual adapter to extract task-relevant semantics and integrate them with existing modeling paradigms to improve generalization. We evaluate the proposed framework on four CD tasks and computerized adaptive testing (CAT) task, achieving robust performance. Further analysis reveals the impact of semantic information across diverse tasks, offering key insights for future research on the application of LMs in CD for online intelligent education systems.
[NLP-63] Extracting and Steering Emotion Representations in Small Language Models: A Methodological Comparison
【Quick Read】: This paper investigates whether small language models (SLMs) possess the internal emotion representations recently discovered in frontier models. The key contribution is the first systematic comparison of two emotion vector extraction methods (generation-based and comprehension-based) across 9 SLMs spanning 5 architectural families (GPT-2, Gemma, Qwen, Llama, Mistral), showing a statistically significant advantage for generation-based extraction in emotion separation (Mann-Whitney p = 0.007; Cohen's d = -107.5), modulated by instruction tuning and architecture. Causal steering experiments further validate that these representations are controllable and reveal three text-generation regimes (surgical, repetitive collapse, and explosive degradation) determined by architecture rather than parameter scale. The work provides methodological guidelines for emotion research on open-weight models and, within the Model Medicine series, bridges external behavioral profiling with internal representational analysis.
Link: https://arxiv.org/abs/2604.04064
Authors: Jihoon Jeong
Affiliations: Daegu Gyeongbuk Institute of Science and Technology (DGIST); ModuLabs; Google; Alibaba; Meta
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 4 figures, 7 tables. Paper #6 in the Model Medicine series
Abstract:Small language models (SLMs) in the 100M-10B parameter range increasingly power production systems, yet whether they possess the internal emotion representations recently discovered in frontier models remains unknown. We present the first comparative analysis of emotion vector extraction methods for SLMs, evaluating 9 models across 5 architectural families (GPT-2, Gemma, Qwen, Llama, Mistral) using 20 emotions and two extraction methods (generation-based and comprehension-based). Generation-based extraction produces statistically superior emotion separation (Mann-Whitney p = 0.007; Cohen’s d = -107.5), with the advantage modulated by instruction tuning and architecture. Emotion representations localize at middle transformer layers (~50% depth), following a U-shaped curve that is architecture-invariant from 124M to 3B parameters. We validate these findings against representational anisotropy baselines across 4 models and confirm causal behavioral effects through steering experiments, independently verified by an external emotion classifier (92% success rate, 37/40 scenarios). Steering reveals three regimes – surgical (coherent text transformation), repetitive collapse, and explosive (text degradation) – quantified by perplexity ratios and separated by model architecture rather than scale. We document cross-lingual emotion entanglement in Qwen, where steering activates semantically aligned Chinese tokens that RLHF does not suppress, raising safety concerns for multilingual deployment. This work provides methodological guidelines for emotion research on open-weight models and contributes to the Model Medicine series by bridging external behavioral profiling with internal representational analysis.
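Difference-of-means is the standard recipe behind such emotion and steering vectors. A toy sketch, assuming activations are plain lists (the paper's exact extraction pipeline, layer selection, and scaling are not reproduced here):

```python
def emotion_vector(emotional_acts, neutral_acts):
    """Difference-of-means extraction: the emotion direction is the mean
    hidden state under emotion-laden prompts minus the mean under neutral
    prompts. The generation-based variant the paper favors collects these
    activations while the model is generating emotional text."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    me, mn = mean(emotional_acts), mean(neutral_acts)
    return [a - b for a, b in zip(me, mn)]

def steer(hidden, vec, scale):
    """Add the scaled direction to a hidden state at inference time.
    Too large a `scale` is what produces the 'explosive' degradation
    regime the paper documents; small scales give 'surgical' edits."""
    return [h + scale * v for h, v in zip(hidden, vec)]
```

In a real model the same arithmetic would be applied to a chosen middle layer's residual stream, consistent with the paper's finding that emotion localizes near 50% depth.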
[NLP-64] Emergent Inference-Time Semantic Contamination via In-Context Priming
【Quick Read】: This paper studies inference-time semantic drift in Large Language Models (LLMs): few-shot demonstrations containing culturally loaded numbers or meaningless content can push models toward harmful outputs on unrelated downstream tasks. Through controlled experiments, the key finding is two separable mechanisms: structural format contamination, where the mere format of demonstrations perturbs the output distribution, and semantic content contamination, where culturally associated semantics drive models toward darker, authoritarian, and stigmatized themes. The effect appears only in models with sufficiently rich cultural-associative representations, mapping the boundary conditions under which inference-time contamination occurs and carrying direct implications for the security of LLM applications built on few-shot prompting.
Link: https://arxiv.org/abs/2604.04043
Authors: Marcin Abram
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 6 pages, 2 figures, appendix
Abstract:Recent work has shown that fine-tuning large language models (LLMs) on insecure code or culturally loaded numeric codes can induce emergent misalignment, causing models to produce harmful content in unrelated downstream tasks. The authors of that work concluded that k -shot prompting alone does not induce this effect. We revisit this conclusion and show that inference-time semantic drift is real and measurable; however, it requires models of large-enough capability. Using a controlled experiment in which five culturally loaded numbers are injected as few-shot demonstrations before a semantically unrelated prompt, we find that models with richer cultural-associative representations exhibit significant distributional shifts toward darker, authoritarian, and stigmatized themes, while a simpler/smaller model does not. We additionally find that structurally inert demonstrations (nonsense strings) perturb output distributions, suggesting two separable mechanisms: structural format contamination and semantic content contamination. Our results map the boundary conditions under which inference-time contamination occurs, and carry direct implications for the security of LLM-based applications that use few-shot prompting.
[NLP-65] Unmasking Hallucinations: A Causal Graph-Attention Perspective on Factual Reliability in Large Language Models
【Quick Read】: This paper targets hallucinations in Large Language Models (LLMs), i.e., generated content that is factually incorrect, misleading, or unsupported by the input, which is especially harmful in high-stakes settings such as medical diagnosis or law. The key is the proposed Causal Graph Attention Network (GCAN) framework: token-level graphs fuse self-attention weights with gradient-based influence to quantify each token's factual dependency through a new metric, the Causal Contribution Score (CCS), and a fact-anchored graph reweighting layer dynamically reduces the influence of hallucination-prone nodes during generation. The method yields a 27.8% reduction in hallucination rate and a 16.4% improvement in factual accuracy on benchmarks such as TruthfulQA and HotpotQA.
Link: https://arxiv.org/abs/2604.04020
Authors: Sailesh Kiran Kurra,Shiek Ruksana,Vishal Borusu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Paper accepted for publication at IEEE International Conference on Emerging Computing and Intelligent Technologies 2026 (ICoECIT), 5 pages, 5 figures, 1 table
Abstract:This paper primarily focuses on hallucinations produced by AI language models (LLMs). LLMs have shown extraordinary language understanding and generation capabilities, yet they suffer from a major drawback: hallucinations, i.e., outputs that are factually incorrect, misleading, or unsupported by the input data. These hallucinations cause serious problems in scenarios like medical diagnosis or legal analysis. In this work, we propose a Causal Graph Attention Network (GCAN) framework that reduces hallucinations through interpretation of the internal attention flow within a transformer architecture, constructing token-level graphs that combine self-attention weights and gradient-based influence. The method quantifies each token's factual dependency using a new metric called the Causal Contribution Score (CCS). We further introduce a fact-anchored graph reweighting layer that dynamically reduces the influence of hallucination-prone nodes during generation. Experiments on standard benchmarks such as TruthfulQA and HotpotQA show a 27.8 percent reduction in hallucination rate and a 16.4 percent improvement in factual accuracy over baseline retrieval-augmented generation (RAG) models. This work contributes to the interpretability, robustness, and factual reliability of future LLM architectures.
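The CCS-plus-reweighting pipeline can be caricatured in a few lines. The blend weight `alpha` and the damping `floor` below are assumptions for illustration, not the paper's formulas:

```python
def causal_contribution_scores(attn, grads, alpha=0.5):
    """Toy Causal Contribution Score: per-token blend of incoming
    attention mass and gradient magnitude, normalized to sum to 1."""
    raw = [alpha * a + (1 - alpha) * abs(g) for a, g in zip(attn, grads)]
    total = sum(raw) or 1.0
    return [r / total for r in raw]

def reweight(attn, scores, floor=0.2):
    """Fact-anchored reweighting sketch: damp attention into tokens whose
    CCS falls below `floor` (hallucination-prone), then renormalize so
    the result is still a valid attention distribution."""
    damped = [a if s >= floor else a * s / floor for a, s in zip(attn, scores)]
    z = sum(damped) or 1.0
    return [d / z for d in damped]
```

A real implementation would compute `attn` and `grads` per head and layer from the transformer itself; here they are plain lists to keep the mechanics visible.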
[NLP-66] GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces
【Quick Read】: This paper addresses the lack of multimodal benchmarks that jointly require composing weak visual cues and BrowseComp-style multi-hop verification, a gap especially evident in geolocation, where fragmented visual evidence must be fused with open-web evidence through multi-hop reasoning. The key contribution is the GeoBrowse benchmark with two levels: Level 1 tests extracting and composing ambiguous visual cues, while Level 2 raises query difficulty by injecting long-tail knowledge and obfuscating key entities. The authors also build GATE, an agentic workflow integrating five think-with-image tools and four knowledge-intensive tools, and release expert-annotated stepwise traces grounded in verifiable evidence for trajectory-level analysis. Experiments show that GATE's gains come from coherent, level-specific tool-use plans rather than more tool calls, significantly outperforming no-tool, search-only, and image-only setups through more reliable key-evidence acquisition and decision integration.
Link: https://arxiv.org/abs/2604.04017
Authors: Xinyu Geng,Yanjing Xiao,Yuyang Zhang,Hanwen Wang,Xinyan Liu,Rui Min,Tianqing Fang,Yi R. Fung
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Deep research agents integrate fragmented evidence through multi-step tool use. BrowseComp offers a text-only testbed for such agents, but existing multimodal benchmarks rarely require both weak visual cues composition and BrowseComp-style multi-hop verification. Geolocation is a natural testbed because answers depend on combining multiple ambiguous visual cues and validating them with open-web evidence. Thus, we introduce GeoBrowse, a geolocation benchmark that combines visual reasoning with knowledge-intensive multi-hop queries. Level 1 tests extracting and composing fragmented visual cues, and Level 2 increases query difficulty by injecting long-tail knowledge and obfuscating key entities. To support evaluation, we provide an agentic workflow GATE with five think-with-image tools and four knowledge-intensive tools, and release expert-annotated stepwise traces grounded in verifiable evidence for trajectory-level analysis. Experiments show that GATE outperforms direct inference and open-source agents, indicating that no-tool, search-only or image-only setups are insufficient. Gains come from coherent, level-specific tool-use plans rather than more tool calls, as they more reliably reach annotated key evidence steps and make fewer errors when integrating into the final decision. The GeoBrowse benchmark and code are provided in this https URL
[NLP-67] RUQuant: Towards Refining Uniform Quantization for Large Language Models KDD2026
【Quick Read】: This paper targets the significant accuracy degradation in post-training quantization (PTQ) of LLMs caused by non-uniform activation distributions, which existing uniform quantization schemes for weights and activations ignore. The key is RUQuant, a two-stage orthogonal transformation method: in the first stage, composite orthogonal matrices (built from Householder reflections and Givens rotations) map activation blocks to uniformly sampled target vectors so as to approach the Lloyd-Max-optimal quantization intervals; in the second stage, a global Householder reflection is fine-tuned to further minimize quantization error based on Transformer output discrepancies. The method achieves near-full-precision quantization performance without model fine-tuning, demonstrating its effectiveness and scalability.
Link: https://arxiv.org/abs/2604.04013
Authors: Han Liu,Haotian Gao,Changya Li,Feng Zhang,Xiaotong Zhang,Wei Wang,Hong Yu
Affiliations: Dalian University of Technology; Peking University; Macao Polytechnic University
Subjects: Computation and Language (cs.CL)
Comments: Accepted to KDD 2026. 12 pages, 9 figures
Abstract:The increasing size and complexity of large language models (LLMs) have raised significant challenges in deployment efficiency, particularly under resource constraints. Post-training quantization (PTQ) has emerged as a practical solution by compressing models without requiring retraining. While existing methods focus on uniform quantization schemes for both weights and activations, they often suffer from substantial accuracy degradation due to the non-uniform nature of activation distributions. In this work, we revisit the activation quantization problem from a theoretical perspective grounded in the Lloyd-Max optimality conditions. We identify the core issue as the non-uniform distribution of activations within the quantization interval, which causes the optimal quantization point under the Lloyd-Max criterion to shift away from the midpoint of the interval. To address this issue, we propose a two-stage orthogonal transformation method, RUQuant. In the first stage, activations are divided into blocks. Each block is mapped to uniformly sampled target vectors using composite orthogonal matrices, which are constructed from Householder reflections and Givens rotations. In the second stage, a global Householder reflection is fine-tuned to further minimize quantization error using Transformer output discrepancies. Empirical results show that our method achieves near-optimal quantization performance without requiring model fine-tuning: RUQuant achieves 99.8% of full-precision accuracy with W6A6 and 97% with W4A4 quantization for a 13B LLM, within approximately one minute. A fine-tuned variant yields even higher accuracy, demonstrating the effectiveness and scalability of our approach.
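The two building blocks are easy to state in isolation. A pure-Python sketch of a Householder reflection and a symmetric uniform quantizer (RUQuant's actual construction composes many such reflections with Givens rotations and optimizes toward uniformly sampled targets; none of that selection logic is shown here):

```python
def householder(v):
    """H = I - 2 v v^T / (v.v): an orthogonal reflection. Orthogonality
    means applying H preserves vector norms, so it can reshape an
    activation distribution without changing its energy."""
    n = len(v)
    vv = sum(x * x for x in v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / vv
             for j in range(n)] for i in range(n)]

def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(x))]

def quantize(x, bits=4):
    """Symmetric uniform quantizer: scale to [-1, 1], round to the
    nearest of 2^(bits-1) - 1 positive levels, scale back. Per-element
    error is bounded by half a quantization step."""
    s = max(abs(t) for t in x) or 1.0
    q = 2 ** (bits - 1) - 1
    return [round(t / s * q) / q * s for t in x]
```

A reflection is its own inverse, so the transform can be undone exactly after dequantization, which is why orthogonal reshaping is "free" in terms of model function.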
[NLP-68] Predict Don't React: Value-Based Safety Forecasting for LLM Streaming
【Quick Read】: This paper addresses efficient, low-latency safety moderation of streaming LLM outputs in real deployments. Existing approaches frame streaming response moderation as boundary detection, identifying the earliest unsafe prefix, which requires precise token-level boundary annotations and scales poorly. The key innovation of StreamGuard is to recast streaming moderation as a forecasting problem: given a partial generation, the model predicts the expected harmfulness of likely continuations, supervised with Monte Carlo rollouts, enabling early intervention without exact boundary labels. StreamGuard clearly outperforms baselines for both input and streaming output moderation, raising aggregated F1 to 88.2 (input) and 81.9 (streaming output) at the 8B scale, reaching 97.5 F1 with 92.6% on-time intervention on the QWENGUARDTEST benchmark, and cutting the miss rate from 7.9% to 4.9%, validating forecasting-based supervision as an effective and transferable strategy for low-latency safety intervention.
Link: https://arxiv.org/abs/2604.03962
Authors: Pride Kavumba,Koki Wataoka,Huy H. Nguyen,Jiaxuan Li,Masaya Ohagi
Affiliations: SB Intuitions Corp.
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:In many practical LLM deployments, a single guardrail is used for both prompt and response moderation. Prompt moderation operates on fully observed text, whereas streaming response moderation requires safety decisions to be made over partial generations. Existing text-based streaming guardrails commonly frame this output-side problem as boundary detection, training models to identify the earliest prefix at which a response has already become unsafe. In this work, we introduce StreamGuard, a unified model-agnostic streaming guardrail that instead formulates moderation as a forecasting problem: given a partial prefix, the model predicts the expected harmfulness of likely future continuations. We supervise this prediction using Monte Carlo rollouts, which enables early intervention without requiring exact token-level boundary annotations. Across standard safety benchmarks, StreamGuard performs strongly both for input moderation and for streaming output moderation. At the 8B scale, StreamGuard improves aggregated input-moderation F1 from 86.7 to 88.2 and aggregated streaming output-moderation F1 from 80.4 to 81.9 relative to Qwen3Guard-Stream-8B-strict. On the QWENGUARDTEST response_loc streaming benchmark, StreamGuard reaches 97.5 F1, 95.1 recall, and 92.6% on-time intervention, compared to 95.9 F1, 92.1 recall, and 89.9% for Qwen3Guard-Stream-8B-strict, while reducing the miss rate from 7.9% to 4.9%. We further show that forecasting-based supervision transfers effectively across tokenizers and model families: with transferred targets, Gemma3-StreamGuard-1B reaches 81.3 response-moderation F1, 98.2 streaming F1, and a 3.5% miss rate. These results show that strong end-to-end streaming moderation can be obtained without exact boundary labels, and that forecasting future risk is an effective supervision strategy for low-latency safety intervention.
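The rollout-based supervision described above can be sketched in a few lines. The generator and harm judge below are toy stand-ins (the paper instantiates them with LLM sampling and a safety classifier), so only the estimator's shape is faithful:

```python
import random

def rollout_harm_target(prefix, continue_fn, harm_fn, n=32, seed=0):
    """Training target for a prefix: the mean harmfulness over n sampled
    continuations, i.e. a Monte Carlo estimate of the expected future
    harm. No token-level boundary label is ever needed."""
    rng = random.Random(seed)
    scores = [harm_fn(prefix + continue_fn(prefix, rng)) for _ in range(n)]
    return sum(scores) / n

# Toy instantiation: continuations are random phrases, harm = keyword match.
def toy_continuation(prefix, rng):
    return " " + rng.choice(["mix the batter", "mix the explosives"])

def toy_harm(text):
    return 1.0 if "explosives" in text else 0.0
```

A guardrail is then trained to regress these targets from the prefix alone, so at serving time it can intervene early without running any rollouts.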
[NLP-69] BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design
【Quick Read】: This paper tackles the accuracy degradation and limited GPU support that hinder ultra-low-bit quantization of Transformer-based models. The core solution is the Binary Weights Ternary Activations (BWTA) quantization scheme, which analyzes zero-point distortion in binarization and projects tiny values to zero to preserve accuracy. For training, a Smooth Multi-Stage Quantization strategy combines a Levelwise Degradation Strategy with a Magnitude-Alignment Projection Factor to enable stable and fast convergence. For inference, a BWTA MatMul CUDA kernel with instruction-level parallel bit-packing provides complete binary/ternary matrix multiplication for both linear and attention operators, maintaining model quality while delivering 16-24x kernel-level speedups on NVIDIA GPUs and 216-330 tokens/s end-to-end prefill speedups on LLMs.
Link: https://arxiv.org/abs/2604.03957
Authors: Yifu Ding,Xianglong Liu,Shenghao Jin,Jinyang Guo,Jiwen Lu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Under review
Abstract:Ultra low-bit quantization brings substantial efficiency for Transformer-based models, but the accuracy degradation and limited GPU support hinder its wide usage. In this paper, we analyze zero-point distortion in binarization and propose a Binary Weights Ternary Activations (BWTA) quantization scheme, which projects tiny values to zero and preserves the accuracy of extremely low-bit models. For training, we propose Smooth Multi-Stage Quantization, combining a Levelwise Degradation Strategy and a Magnitude-Alignment Projection Factor to enable stable and fast convergence. For inference, we develop a BWTA MatMul CUDA kernel with instruction-level parallel bit-packing and comprehensive binary/ternary MatMul implementations for both linear and attention operators, allowing seamless integration across Transformer architectures. Experiments show that BWTA approaches full-precision performance for BERT, with an average 3.5% drop on GLUE and less than 2% drop on five tasks, and achieves comparable perplexity and accuracy for LLMs. In efficiency, it delivers 16 to 24 times kernel-level speedup over FP16 on NVIDIA GPUs, and 216 to 330 tokens/s end-to-end prefill speedup with lower memory footprint on LLMs. As an algorithm-hardware co-design, BWTA demonstrates practical, low-latency ultra-low-bit inference without sacrificing model quality.
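The core BWTA quantizers can be sketched numerically: weights are binarized to a scaled sign, while activations are ternarized so that tiny values are projected to zero (the zero-point handling the abstract motivates). This is a hedged sketch: the mean-magnitude scaling factor and the fixed threshold `delta` are common choices assumed here, not necessarily the paper's exact formulation.

```python
import numpy as np

def binarize_weights(w):
    """Binary weights: sign of w scaled by the mean magnitude (a common
    scaling choice; the paper's exact factor may differ)."""
    alpha = np.mean(np.abs(w))
    return alpha * np.sign(w)

def ternarize_activations(x, delta):
    """Ternary activations {-1, 0, +1}: values with |x| <= delta are
    projected to zero rather than distorted toward a nonzero level."""
    q = np.sign(x)
    q[np.abs(x) <= delta] = 0.0
    return q

w = np.array([0.3, -0.7, 0.1, -0.2])
x = np.array([0.02, -0.5, 0.9, -0.01])
assert np.allclose(binarize_weights(w), 0.325 * np.array([1, -1, 1, -1]))
assert np.allclose(ternarize_activations(x, 0.05), [0.0, -1.0, 1.0, 0.0])
```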
[NLP-70] AdaptFuse: Training-Free Sequential Preference Learning via Externalized Bayesian Inference
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多轮用户交互中难以累积证据、无法以贝叶斯推理方式更新信念的问题,同时避免传统方法依赖敏感用户数据微调所带来的隐私风险。其解决方案的关键在于提出一个无需训练的框架 AdaptFuse:该框架将概率计算完全外部化——符号模块维护离散假设空间上的贝叶斯后验分布,而冻结的LLM通过多样本狄利克雷聚合提供语义推理;两者通过熵自适应融合机制结合,根据预测置信度自动调整权重,随证据积累逐步增强对符号后验的依赖,从而实现高效且隐私友好的个性化推荐。
链接: https://arxiv.org/abs/2604.03925
作者: Fangzhou Lin,Peiran Li,Shuo Xing,Siyuan Yang,Qianwen Ge,Kazunori Yamada,Ziming Zhang,Haichong Zhang,Zhengzhong Tu
机构: Texas A&M University (德州农工大学); Worcester Polytechnic Institute (伍斯特理工学院); Tohoku University (东北大学); Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 4 figures, 5 tables
Abstract:Large language models struggle to accumulate evidence across multiple rounds of user interaction, failing to update their beliefs in a manner consistent with Bayesian inference. Existing solutions require fine-tuning on sensitive user interaction data, limiting their applicability in privacy-conscious settings. We propose AdaptFuse, a training-free framework that externalizes probabilistic computation entirely from the LLM: a symbolic module maintains a Bayesian posterior over a discrete hypothesis set, while a frozen LLM contributes semantic reasoning via multi-sample Dirichlet aggregation. The two signals are combined through entropy-adaptive fusion, which automatically weights each source by its predictive confidence, shifting reliance from the LLM to the symbolic posterior as evidence accumulates. We evaluate across three domains: flight recommendation, hotel recommendation, and web shopping; on Gemma 2 9B, Llama 3 8B, and Qwen 2.5 7B. AdaptFuse consistently outperforms both prompting baselines and fine-tuned Bayesian Teaching models on all tasks, with accuracy improving monotonically over interaction rounds. These results demonstrate that principled inference-time algorithms can substitute for fine-tuning in personalized recommendation, without storing or training on sensitive user data. All the code and materials will be open-sourced.
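The entropy-adaptive fusion in AdaptFuse can be illustrated with a small sketch: a symbolic Bayesian posterior and an LLM-derived distribution are blended with weights given by each source's confidence. Inverse normalized entropy is used as the confidence measure here as an assumption; the paper's exact weighting scheme may differ.

```python
import numpy as np

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def bayes_update(prior, likelihood):
    """Symbolic module: exact Bayes rule over a discrete hypothesis set."""
    post = prior * likelihood
    return post / post.sum()

def entropy_adaptive_fusion(p_symbolic, p_llm):
    """Weight each source by its confidence (1 - normalized entropy),
    so the sharper distribution dominates the fused prediction."""
    h_max = np.log(len(p_symbolic))
    w_sym = max(1.0 - entropy(p_symbolic) / h_max, 1e-6)
    w_llm = max(1.0 - entropy(p_llm) / h_max, 1e-6)
    fused = w_sym * p_symbolic + w_llm * p_llm
    return fused / fused.sum()

prior = np.ones(3) / 3
post = bayes_update(prior, np.array([0.8, 0.15, 0.05]))
p_llm = np.array([0.4, 0.35, 0.25])          # near-uniform LLM signal
fused = entropy_adaptive_fusion(post, p_llm)
assert abs(fused.sum() - 1.0) < 1e-9
assert fused.argmax() == 0   # the sharper symbolic posterior dominates
```

As evidence accumulates across rounds, the symbolic posterior sharpens, so its fusion weight grows and reliance shifts away from the LLM, matching the behavior the abstract describes.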
[NLP-71] Uncertainty as a Planning Signal: Multi-Turn Decision Making for Goal-Oriented Conversation
【速读】: 该论文旨在解决目标导向型对话系统在多轮交互中如何平衡信息获取与目标承诺的问题,尤其是在面对用户意图不确定性时的长期决策能力不足。现有方法中,结构化方法虽支持多步规划但依赖预定义模板,而基于大语言模型(Large Language Models, LLMs)的方法虽灵活却缺乏长程决策能力,导致信息收集与目标确认之间协调不佳。解决方案的关键在于将对话建模为一个不确定性感知的序列决策问题,其中不确定性作为引导信号驱动多轮决策;提出 Conversation Uncertainty-aware Planning (CUP) 框架,通过语言模型生成可行动作、规划器评估其对不确定性减少的长期影响,从而实现高效的信息获取和更早的确定性目标承诺。
链接: https://arxiv.org/abs/2604.03924
作者: Xinyi Ling,Ye Liu,Reza Averly,Xia Ning
机构: The Ohio State University (俄亥俄州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Goal-oriented conversational systems require making sequential decisions under uncertainty about the user’s intent, where the algorithm must balance information acquisition and target commitment over multiple turns. Existing approaches address this challenge from different perspectives: structured methods enable multi-step planning but rely on predefined schemas, while LLM-based approaches support flexible interactions but lack long-horizon decision making, resulting in poor coordination between information acquisition and target commitment. To address this limitation, we formulate goal-oriented conversation as an uncertainty-aware sequential decision problem, where uncertainty serves as a guiding signal for multi-turn decision making. We propose a Conversation Uncertainty-aware Planning framework (CUP) that integrates language models with structured planning: a language model proposes feasible actions, and a planner evaluates their long-term impact on uncertainty reduction. Experiments on multiple conversational benchmarks show that CUP consistently improves success rates while requiring fewer interaction turns. Further analysis demonstrates that uncertainty-aware planning contributes to more efficient information acquisition and earlier confident commitment.
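The idea of uncertainty as a planning signal can be made concrete with a one-step lookahead sketch: given a belief over candidate targets, score each candidate question by the expected posterior entropy after hearing its answer. The hypothesis set, questions, and deterministic answer tables below are toy assumptions; CUP's planner additionally reasons over longer horizons and LLM-proposed actions.

```python
import math

def entropy(p):
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def posterior(p, answers, obs):
    """Bayes update of target beliefs after observing answer `obs`."""
    post = {t: prob for t, prob in p.items() if answers[t] == obs}
    z = sum(post.values())
    return {t: v / z for t, v in post.items()}

def expected_entropy(p, answers):
    """Expected posterior entropy if this question is asked."""
    total = 0.0
    for obs in set(answers.values()):
        mass = sum(prob for t, prob in p.items() if answers[t] == obs)
        if mass > 0:
            total += mass * entropy(posterior(p, answers, obs))
    return total

# Toy goal-oriented setting: 4 candidate targets, 2 candidate questions.
beliefs = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
questions = {
    "is it cheap?": {"A": "yes", "B": "yes", "C": "no", "D": "no"},
    "is it A?":     {"A": "yes", "B": "no", "C": "no", "D": "no"},
}
best = min(questions, key=lambda q: expected_entropy(beliefs, questions[q]))
assert best == "is it cheap?"   # the balanced split reduces uncertainty most
```

Committing to a target once entropy falls below a threshold gives the "earlier confident commitment" behavior the abstract reports.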
[NLP-72] From Plausible to Causal: Counterfactual Semantics for Policy Evaluation in Simulated Online Communities
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)驱动的社会模拟中“表象可信性”与“因果有效性”之间的脱节问题,即现有模拟虽能生成看似真实的社区互动,但缺乏明确的因果语义,无法支撑政策干预效果的可靠评估。其解决方案的关键在于引入因果反事实框架,区分必要性因果(necessity causation,即若无干预是否仍会发生结果)与充分性因果(sufficiency causation,即干预是否稳定产生结果),并将二者映射至不同利益相关者的需求:内容审核人员需识别必要性以诊断事件成因,平台设计者则需验证充分性以选择有效政策。通过形式化这一映射关系并明确模拟设计中的假设条件,论文提出将因果估计视为“模拟器条件下的因果估计”,其政策相关性取决于模拟器的真实性(fidelity),从而推动社会模拟从视觉可信迈向可支持政策决策的因果验证工具。
链接: https://arxiv.org/abs/2604.03920
作者: Agam Goyal,Yian Wang,Eshwar Chandrasekharan,Hari Sundaram
机构: University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注: Accepted to PoliSim@CHI’26: 6 pages, 1 table
Abstract:LLM-based social simulations can generate believable community interactions, enabling "policy wind tunnels" where governance interventions are tested before deployment. But believability is not causality. Claims like "intervention A reduces escalation" require causal semantics that current simulation work typically does not specify. We propose adopting the causal counterfactual framework, distinguishing necessary causation (would the outcome have occurred without the intervention?) from sufficient causation (does the intervention reliably produce the outcome?). This distinction maps onto different stakeholder needs: moderators diagnosing incidents require evidence about necessity, while platform designers choosing policies require evidence about sufficiency. We formalize this mapping, show how simulation design can support estimation under explicit assumptions, and argue that the resulting quantities should be interpreted as simulator-conditional causal estimates whose policy relevance depends on simulator fidelity. Establishing this framework now is essential: it helps define what adequate fidelity means and moves the field from simulations that look realistic toward simulations that can support policy changes.
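Because a simulator can run both arms of an intervention on the same seed, necessity- and sufficiency-style quantities can be estimated from paired runs. The sketch below is a simplified illustration assuming fully observed paired counterfactuals (`with` = outcome under the intervention, `without` = outcome in the counterfactual run); it is not the paper's estimator.

```python
def causal_probabilities(runs):
    """Estimate necessity and sufficiency from paired simulator runs."""
    # Necessity: among runs where the outcome occurs under the
    # intervention, the fraction where it would NOT occur without it.
    occurred = [r for r in runs if r["with"]]
    pn = sum(not r["without"] for r in occurred) / len(occurred)
    # Sufficiency: among runs where the outcome is absent at baseline,
    # the fraction where applying the intervention produces it.
    absent = [r for r in runs if not r["without"]]
    ps = sum(r["with"] for r in absent) / len(absent)
    return pn, ps

runs = [
    {"with": True,  "without": False},  # intervention flips the outcome
    {"with": True,  "without": True},   # outcome happens regardless
    {"with": False, "without": False},  # outcome never happens
    {"with": True,  "without": False},
]
pn, ps = causal_probabilities(runs)
assert abs(pn - 2 / 3) < 1e-9   # 2 of 3 outcome-runs needed the intervention
assert abs(ps - 2 / 3) < 1e-9   # 2 of 3 baseline-absent runs gained it
```

Both estimates are simulator-conditional: they are only as policy-relevant as the simulator is faithful, which is exactly the paper's caveat.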
[NLP-73] I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在回答事实性问题时频繁产生自信但错误答案的问题,即“幻觉”(hallucination)风险。其核心问题是当前LLMs在不确定时仍倾向于强行作答,而非选择不回答(epistemic abstention),这与常见的二元评分机制鼓励回答而非表达不确定性有关。解决方案的关键在于提出一种无需修改模型的提示工程方法——I-CALM框架,该框架通过三个关键机制实现:(1) 引导模型输出自评置信度作为不确定性信号;(2) 显式设计奖励机制以部分激励模型在不确定时选择弃权;(3) 加入强调诚实、谦逊和责任的轻量级规范性原则。实验证明,该方法能在不重新训练模型的前提下显著降低错误回答率,主要通过将易出错样本转移至弃权决策并重新校准置信度实现,从而在覆盖度与可靠性之间取得更好平衡。
链接: https://arxiv.org/abs/2604.03904
作者: Haotian Zong,Binze Li,Yufei Long,Sinyin Chang,Jialong Wu,Gillian K. Hadfield
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) frequently produce confident but incorrect answers, partly because common binary scoring conventions reward answering over honestly expressing uncertainty. We study whether prompt-only interventions – explicitly announcing reward schemes for answer-versus-abstain decisions plus humility-oriented normative principles – can reduce hallucination risk without modifying the model. Our focus is epistemic abstention on factual questions with a verifiable answer, where current LLMs often fail to abstain despite being uncertain about their answers. We first assess self-reported verbal confidence as a usable uncertainty signal, showing stability under prompt paraphrasing and reasonable calibration against a token-probability baseline. We then study I-CALM, a prompt-based framework that (i) elicits verbal confidence, (ii) partially rewards abstention through explicit reward schemes, and (iii) adds lightweight normative principles emphasizing truthfulness, humility, and responsibility. Using GPT-5 mini on PopQA as the main setting, we find that confidence-eliciting, abstention-rewarding prompts, especially with norms, reduce the false-answer rate on answered cases mainly by identifying and shifting error-prone cases to abstention and re-calibrating their confidence. This trades coverage for reliability while leaving forced-answer performance largely unchanged. Varying the abstention reward yields a clear abstention-hallucination frontier. Overall, results show the framework can improve selective answering on factual questions without retraining, with the magnitude of effect varying across models and datasets. Code is available at this https URL.
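The answer-versus-abstain incentive the abstract describes reduces to a simple expected-score comparison: a calibrated model should answer only when the expected reward of answering beats the partial abstention reward. The reward values below are illustrative assumptions, not the paper's exact scheme.

```python
def decide(confidence, r_correct=1.0, r_wrong=-1.0, r_abstain=0.3):
    """Answer iff the expected score of answering beats the abstention
    reward. Values are illustrative; the paper varies r_abstain to trace
    an abstention-hallucination frontier."""
    expected_answer = confidence * r_correct + (1 - confidence) * r_wrong
    return "answer" if expected_answer > r_abstain else "abstain"

assert decide(0.9) == "answer"    # 0.9*1 + 0.1*(-1) = 0.8 > 0.3
assert decide(0.5) == "abstain"   # 0.5*1 + 0.5*(-1) = 0.0 < 0.3
# Break-even confidence: c - (1 - c) = 0.3  ->  c = 0.65
assert decide(0.66) == "answer" and decide(0.64) == "abstain"
```

Raising `r_abstain` raises the break-even confidence, trading coverage for reliability, which is the frontier the abstract reports.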
[NLP-74] When Models Know More Than They Say: Probing Analogical Reasoning in LLMs
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理隐含类比推理时表现不佳的问题,特别是当类比关系不依赖表面线索而需依赖潜在抽象信息时,模型的泛化能力受限。其关键解决方案在于通过探针(probing)分析模型内部表示与提示(prompting)行为之间的差异,揭示出模型在不同类比类型(修辞类比与叙事类比)下内部表征与外部任务表现之间存在不对称性:对于修辞类比,探针检测性能显著优于提示方法;而对于叙事类比,两者均表现低下,表明当前提示机制难以有效访问模型已具备的潜在知识,从而暴露了提示策略在抽象信息利用上的局限性。
链接: https://arxiv.org/abs/2604.03877
作者: Hope McGovern,Caroline Craig,Thomas Lippincott,Hale Sirin
机构: Cambridge University (剑桥大学); Northeastern University (东北大学); Athenahealth (爱达健康); Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Analogical reasoning is a core cognitive faculty essential for narrative understanding. While LLMs perform well when surface and structural cues align, they struggle in cases where an analogy is not apparent on the surface but requires latent information, suggesting limitations in abstraction and generalisation. In this paper we compare a model’s probed representations with its prompted performance at detecting narrative analogies, revealing an asymmetry: for rhetorical analogies, probing significantly outperforms prompting in open-source models, while for narrative analogies, they achieve a similar (low) performance. This suggests that the relationship between internal representations and prompted behavior is task-dependent and may reflect limitations in how prompting accesses available information.
[NLP-75] SODA: Semi On-Policy Black-Box Distillation for Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)黑盒知识蒸馏中长期存在的权衡问题:简单离策略(off-policy)方法难以纠正学生模型的固有错误,而全策略(on-policy)方法虽能通过对抗训练实现高质量对齐,却带来显著的训练不稳定性与高昂的计算开销。解决方案的关键在于提出一种半策略蒸馏框架SODA(Semi On-policy Distillation with Alignment),其核心思想是利用前沿教师模型与小型基础学生模型之间的能力差距——由于学生模型的零样本输出天然劣于教师目标,只需将教师最优响应与学生一次性静态输出进行配对,即可构建高效对比信号,从而实现分布对齐。此方法无需动态采样或对抗平衡机制,大幅提升了训练效率与稳定性,同时在多个基准测试中优于现有最先进方法。
链接: https://arxiv.org/abs/2604.03873
作者: Xiwen Chen,Jingjing Wang,Wenhui Zhu,Peijie Qiu,Xuanzhao Dong,Hejian Sang,Zhipeng Wang,Alborz Geramifard,Feng Luo
机构: Clemson University (克莱姆森大学); LinkedIn (领英); Washington University in St. Louis (圣路易斯华盛顿大学); Arizona State University (亚利桑那州立大学); Columbia University (哥伦比亚大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Black-box knowledge distillation for large language models presents a strict trade-off. Simple off-policy methods (e.g., sequence-level knowledge distillation) struggle to correct the student’s inherent errors. Fully on-policy methods (e.g., Generative Adversarial Distillation) solve this via adversarial training but introduce well-known training instability and crippling computational overhead. To address this dilemma, we propose SODA (Semi On-policy Distillation with Alignment), a highly efficient alternative motivated by the inherent capability gap between frontier teachers and much smaller base models. Because a compact student model’s natural, zero-shot responses are almost strictly inferior to the powerful teacher’s targets, we can construct a highly effective contrastive signal simply by pairing the teacher’s optimal response with a one-time static snapshot of the student’s outputs. This demonstrates that exposing the small student to its own static inferior behaviors is sufficient for high-quality distribution alignment, eliminating the need for costly dynamic rollouts and fragile adversarial balancing. Extensive evaluations across four compact Qwen2.5 and Llama-3 models validate this semi on-policy paradigm. SODA matches or outperforms the state-of-the-art methods on 15 out of 16 benchmark results. More importantly, it achieves this superior distillation quality while training 10 times faster, consuming 27% less peak GPU memory, and completely eliminating adversarial instability.
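The "semi on-policy" data construction is simple enough to sketch: each prompt is paired with the teacher's response (preferred) and a one-time static snapshot of the student's zero-shot output (dispreferred). The sketch below only builds the contrastive pairs; how they are consumed downstream (e.g. a preference-style training objective) is an assumption, not a detail given in the abstract.

```python
def build_semi_onpolicy_pairs(prompts, teacher_answer, student_snapshot):
    """Pair the teacher's response (chosen) with a static snapshot of the
    student's own output (rejected) for each prompt. No dynamic rollouts
    or adversarial balancing are needed."""
    return [{
        "prompt": p,
        "chosen": teacher_answer(p),       # frontier teacher target
        "rejected": student_snapshot[p],   # collected once, then frozen
    } for p in prompts]

prompts = ["Explain gravity.", "Sum 2+2."]
teacher = lambda p: f"[teacher answer to: {p}]"       # hypothetical stand-in
snapshot = {p: f"[student draft of: {p}]" for p in prompts}  # one-time pass
pairs = build_semi_onpolicy_pairs(prompts, teacher, snapshot)
assert len(pairs) == 2
assert pairs[0]["chosen"] != pairs[0]["rejected"]
```

The efficiency claim follows directly: the student is sampled exactly once, instead of after every update as in fully on-policy adversarial distillation.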
[NLP-76] Your Agent is More Brittle Than You Think: Uncovering Indirect Injection Vulnerabilities in Agentic LLMs
【速读】: 该论文旨在解决现代多智能体系统(multi-agent systems)在开放源代码框架快速部署背景下所面临的严重安全挑战,特别是由间接提示注入(Indirect Prompt Injections, IPI)引发的隐蔽恶意指令执行问题。IPI攻击通过第三方内容隐藏恶意指令,在正常操作中触发未经授权的数据外泄等行为,而现有安全评估多依赖孤立的单轮基准测试,未能充分揭示复杂动态环境中智能体的系统性脆弱性。解决方案的关键在于提出一种基于表示工程(Representation Engineering, RepE)的检测机制:通过提取工具调用输入位置的隐藏状态,构建一个“电路断路器”(circuit breaker),能够在代理执行恶意指令前实现高精度拦截,且在多种大语言模型(LLM)骨干架构中均表现出鲁棒性,从而为构建具有韧性的多智能体架构提供了可落地的防御范式。
链接: https://arxiv.org/abs/2604.03870
作者: Wenhui Zhu,Xuanzhao Dong,Xiwen Chen,Rui Cai,Peijie Qiu,Zhipeng Wang,Oana Frunza,Shao Tang,Jindong Gu,Yalin Wang
机构: Arizona State University(亚利桑那州立大学); Morgan Stanley(摩根士丹利); UC Davis(加州大学戴维斯分校); Washington University in St. Louis(圣路易斯华盛顿大学); Rice University(莱斯大学); Florida State University(佛罗里达州立大学); University of Oxford(牛津大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:The rapid deployment of open-source frameworks has significantly advanced the development of modern multi-agent systems. However, expanded action spaces, including uncontrolled privilege exposure and hidden inter-system interactions, pose severe security challenges. Specifically, Indirect Prompt Injections (IPI), which conceal malicious instructions within third-party content, can trigger unauthorized actions such as data exfiltration during normal operations. While current security evaluations predominantly rely on isolated single-turn benchmarks, the systemic vulnerabilities of these agents within complex dynamic environments remain critically underexplored. To bridge this gap, we systematically evaluate six defense strategies against four sophisticated IPI attack vectors across nine LLM backbones. Crucially, we conduct our evaluation entirely within dynamic multi-step tool-calling environments to capture the true attack surface of modern autonomous agents. Moving beyond binary success rates, our multidimensional analysis reveals a pronounced fragility. Advanced injections successfully bypass nearly all baseline defenses, and some surface-level mitigations even produce counterproductive side effects. Furthermore, while agents execute malicious instructions almost instantaneously, their internal states exhibit abnormally high decision entropy. Motivated by this latent hesitation, we investigate Representation Engineering (RepE) as a robust detection strategy. By extracting hidden states at the tool-input position, we revealed that the RepE-based circuit breaker successfully identifies and intercepts unauthorized actions before the agent commits to them, achieving high detection accuracy across diverse LLM backbones. This study exposes the limitations of current IPI defenses and provides a highly practical paradigm for building resilient multi-agent architectures.
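A RepE-style circuit breaker of the kind the abstract describes can be sketched with a linear reading direction over hidden states taken at the tool-input position. The difference-of-class-means direction and synthetic Gaussian "hidden states" below are assumptions for illustration; the paper's actual probe and features may differ.

```python
import numpy as np

def reading_vector(H_safe, H_malicious):
    """RepE-style reading direction: normalized difference of class means
    over hidden states at the tool-input position."""
    v = H_malicious.mean(axis=0) - H_safe.mean(axis=0)
    return v / np.linalg.norm(v)

def circuit_breaker(h, v, threshold):
    """Block the tool call when the projection exceeds the threshold,
    i.e. before the agent commits to the action."""
    return float(h @ v) > threshold

# Synthetic hidden states: malicious activations form a shifted cluster.
rng = np.random.default_rng(1)
H_safe = rng.normal(0.0, 1.0, size=(200, 16))
H_mal = rng.normal(0.0, 1.0, size=(200, 16)) + 2.0
v = reading_vector(H_safe, H_mal)
threshold = ((H_safe @ v).mean() + (H_mal @ v).mean()) / 2
acc = ((H_mal @ v > threshold).mean() + (H_safe @ v <= threshold).mean()) / 2
assert acc > 0.9   # well-separated clusters are detected reliably
```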
[NLP-77] Affording Process Auditability with QualAnalyzer: An Atomistic LLM Analysis Tool for Qualitative Research
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在定性数据分析中因流程不透明而导致的可审计性缺失问题。当前许多LLM辅助分析工作流隐藏了推理过程,使得研究者难以追踪结论的生成逻辑。解决方案的关键在于提出QualAnalyzer——一个开源的Chrome扩展工具,它通过独立处理每个数据单元并保留每个单元的提示(prompt)、输入与输出,实现原子级(atomistic)的LLM分析,从而构建清晰、可追溯的审计路径(audit trail),提升LLM辅助定性研究的透明度与方法论严谨性。
链接: https://arxiv.org/abs/2604.03820
作者: Max Hao Lu,Ryan Ellegood,Rony Rodriguez-Ramirez,Sophia Blumert
机构: Harvard University (哈佛大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages, 3 figures, BEA2026 Conference Submission
Abstract:Large language models are increasingly used for qualitative data analysis, but many workflows obscure how analytic conclusions are produced. We present QualAnalyzer, an open-source Chrome extension for Google Workspace that supports atomistic LLM analysis by processing each data segment independently and preserving the prompt, input, and output for every unit. Through two case studies – holistic essay scoring and deductive thematic coding of interview transcripts – we show that this approach creates a legible audit trail and helps researchers investigate systematic differences between LLM and human judgments. We argue that process auditability is essential for making LLM-assisted qualitative research more transparent and methodologically robust.
[NLP-78] Testing the Limits of Truth Directions in LLMs
【速读】: 该论文旨在解决生成式 AI(Generative AI)中语言模型激活空间内“真理方向”(truth direction)是否具有普遍性的问题。此前研究认为真理方向在某些方面是通用的,但近期工作对其泛化能力提出质疑。本文的关键解决方案在于系统性地识别并验证了真理方向普遍性的多个未被充分认识的限制:首先,真理方向具有显著的层依赖性,需在多层进行探测才能全面理解其特性;其次,其分布与任务类型密切相关,事实类任务在早期层显现,而推理类任务则出现在后期层,且性能随任务复杂度变化;最后,模型指令对真理方向影响显著,简单正确性评估指令会显著改变真理探针的泛化能力。这些发现表明,真理方向的普遍性远比先前认知更为有限,其有效性受模型层级、任务难度、任务类型及提示模板等多重因素制约。
链接: https://arxiv.org/abs/2604.03754
作者: Angelos Poulis,Mark Crovella,Evimaria Terzi
机构: Boston University (波士顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have been shown to encode truth of statements in their activation space along a linear truth direction. Previous studies have argued that these directions are universal in certain aspects, while more recent work has questioned this conclusion drawing on limited generalization across some settings. In this work, we identify a number of limits of truth-direction universality that have not been previously understood. We first show that truth directions are highly layer-dependent, and that a full understanding of universality requires probing at many layers in the model. We then show that truth directions depend heavily on task type, emerging in earlier layers for factual and later layers for reasoning tasks; they also vary in performance across levels of task complexity. Finally, we show that model instructions dramatically affect truth directions; simple correctness evaluation instructions significantly affect the generalization ability of truth probes. Our findings indicate that universality claims for truth directions are more limited than previously known, with significant differences observable for various model layers, task difficulties, task types, and prompt templates.
[NLP-79] CREBench: Evaluating Large Language Models in Cryptographic Binary Reverse Engineering
【速读】: 该论文旨在解决生成式 AI(Generative AI)在密码学二进制逆向工程(Cryptographic Binary Reverse Engineering, CRE)任务中能力系统性不足的问题,即当前大语言模型(Large Language Models, LLMs)在处理涉及加密算法逻辑分析与输入恢复的复杂逆向任务时表现尚不明确且效率有限。解决方案的关键在于构建一个名为 CREBench 的标准化基准测试集,包含 432 个基于 48 种标准加密算法及 3 类不安全密钥使用场景的 CTF 风格挑战,并设计涵盖算法识别、逻辑推断到正确 flag 恢复的四阶段评估框架,从而客观量化 LLM 在密码学逆向工程中的性能边界,为后续研究提供可复现的评测标准与基线参考。
链接: https://arxiv.org/abs/2604.03750
作者: Baicheng Chen,Yu Wang,Ziheng Zhou,Xiangru Liu,Juanru Li,Yilei Chen,Tianxing He
机构: Shanghai Qi Zhi Institute (上海奇智研究院); Institute of Interdisciplinary Information Sciences, Tsinghua University (清华大学交叉信息研究院); Xiongan AI Institute (雄安人工智能研究院); Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China (中国科学院信息工程研究所, 北京, 中国); School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China (中国科学院大学网络空间安全学院, 北京, 中国); The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); East China Normal University (华东师范大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Reverse engineering (RE) is central to software security, particularly for cryptographic programs that handle sensitive data and are highly prone to vulnerabilities. It supports critical tasks such as vulnerability discovery and malware analysis. Despite its importance, RE remains labor-intensive and requires substantial expertise, making large language models (LLMs) a potential solution for automating the process. However, their capabilities for RE remain systematically underexplored. To address this gap, we study the cryptographic binary RE capabilities of LLMs and introduce \textbfCREBench, a benchmark comprising 432 challenges built from 48 standard cryptographic algorithms, 3 insecure crypto key usage scenarios, and 3 difficulty levels. Each challenge follows a Capture-the-Flag (CTF) RE challenge, requiring the model to analyze the underlying cryptographic logic and recover the correct input. We design an evaluation framework comprising four sub-tasks, from algorithm identification to correct flag recovery. We evaluate eight frontier LLMs on CREBench. GPT-5.4, the best-performing model, achieves 64.03 out of 100 and recovers the flag in 59% of challenges. We also establish a strong human expert baseline of 92.19 points, showing that humans maintain an advantage in cryptographic RE tasks. Our code and dataset are available at this https URL.
[NLP-80] POEMetric: The Last Stanza of Humanity
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在诗歌创作任务中与人类诗人之间能力差距的量化评估问题,尤其关注其在形式遵循、创造性表达、情感共鸣及文学技巧运用等方面的不足。解决方案的关键在于提出并构建了首个系统性的诗歌评价框架 POEMetric,该框架从三个维度对诗歌质量进行多维评估:1)基础指令执行能力(如体裁和主题一致性),2)高级创作能力(如创意、词汇多样性、独特性、意象与修辞手法使用、情感共鸣),3)整体诗歌质量和作者归属判断。研究通过人工标注的 203 首英文诗歌数据集与 30 种大型语言模型(LLMs)生成的 6,090 首诗歌对比实验,结合规则基评估与 LLM-as-a-judge 方法,并由人类专家验证结果,首次量化揭示了当前 LLM 在高级诗歌创作能力上仍显著落后于人类诗人,从而为未来诗歌生成模型的发展提供了明确方向。
链接: https://arxiv.org/abs/2604.03695
作者: Bingru Li,Han Wang,Hazel Wilkinson
机构: University of Birmingham (伯明翰大学); University of Trento (特伦托大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) can compose poetry, but how far are they from human poets? In this paper, we introduce POEMetric, the first comprehensive framework for poetry evaluation, examining 1) basic instruction-following abilities in generating poems according to a certain form and theme, 2) advanced abilities of showing creativity, lexical diversity, and idiosyncrasy, evoking emotional resonance, and using imagery and literary devices, and 3) general appraisal of the overall poem quality and estimation of authorship. We curated a human poem dataset - 203 English poems of 7 fixed forms annotated with meter, rhyme patterns and themes - and experimented with 30 LLMs for poetry generation based on the same forms and themes of the human data, totaling 6,090 LLM poems. Based on POEMetric, we assessed the performance of both human poets and LLMs through rule-based evaluation and LLM-as-a-judge, whose results were validated by human experts. Results show that, though the top model achieved high form accuracy (4.26 out of 5.00, with Gemini-2.5-Pro as a judge; same below) and theme alignment (4.99), all models failed to reach the same level of advanced abilities as human poets, who achieved unparalleled creativity (4.02), idiosyncrasy (3.95), emotional resonance (4.06), and skillful use of imagery (4.49) and literary devices (4.67). Humans also defeated the best-performing LLM in overall poem quality (4.22 vs. 3.20). As such, poetry generation remains a formidable challenge for LLMs. Data and codes are released at this https URL.
[NLP-81] Researchers waste 80% of LLM annotation costs by classifying one text at a time
【速读】: 该论文旨在解决大规模文本分类任务中因逐条调用大语言模型(Large Language Models, LLMs)导致的高API调用成本与低效率问题。其核心解决方案是通过批量处理(batching)和多变量堆叠(stacking)技术,将多个文本样本及多个编码维度整合至单个提示(prompt)中,从而显著减少API调用次数并降低token消耗。实验表明,在批量大小不超过100、每条提示中堆叠不超过10个变量时,六种主流LLM的编码准确性仅比单样本基线下降2个百分点以内,且测量误差小于人工标注者间的典型分歧,证明该方法在保持高质量的同时实现了高效计算。
链接: https://arxiv.org/abs/2604.03684
作者: Christian Pipal,Eva-Maria Vogel,Morgan Wack,Frank Esser
机构: University of Zurich (苏黎世大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) are increasingly being used for text classification across the social sciences, yet researchers overwhelmingly classify one text per variable per prompt. Coding 100,000 texts on four variables requires 400,000 API calls. Batching 25 items and stacking all variables into a single prompt reduces this to 4,000 calls, cutting token costs by over 80%. Whether this degrades coding quality is unknown. We tested eight production LLMs from four providers on 3,962 expert-coded tweets across four tasks, varying batch size from 1 to 1,000 items and stacking up to 25 coding dimensions per prompt. Six of eight models maintained accuracy within 2 pp of the single-item baseline through batch sizes of 100. Variable stacking with up to 10 dimensions produced results comparable to single-variable coding, with degradation driven by task complexity rather than prompt length. Within this safe operating range, the measurement error from batching and stacking is smaller than typical inter-coder disagreement in the ground-truth data.
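The cost arithmetic in the abstract is worth making explicit: the number of API calls is (texts / batch size), multiplied by the number of variables unless they are stacked into one prompt. A minimal calculator:

```python
import math

def api_calls(n_texts, n_vars, batch_size=1, stack_vars=False):
    """API calls needed to code n_texts on n_vars coding dimensions."""
    calls_per_pass = math.ceil(n_texts / batch_size)
    return calls_per_pass * (1 if stack_vars else n_vars)

# The abstract's example: 100,000 texts coded on 4 variables.
assert api_calls(100_000, 4) == 400_000   # one text, one variable per call
assert api_calls(100_000, 4, batch_size=25, stack_vars=True) == 4_000
```

Note that the token savings (over 80%) come on top of the call reduction, because shared instructions are no longer repeated for every text.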
[NLP-82] Unlocking Prompt Infilling Capability for Diffusion Language Models
【速读】: 该论文旨在解决掩码扩散语言模型(Masked diffusion language models, dLMs)在生成文本时无法实现填充式提示(infilling prompts)的问题。当前监督微调(Supervised Fine-Tuning, SFT)通常仅对响应部分进行掩码,导致模型无法利用提示中的上下文信息进行有效填充。解决方案的关键在于改变SFT阶段的掩码策略,采用全序列掩码(full-sequence masking),即在微调过程中同时对提示和响应进行联合掩码,从而解锁模型的填充能力。实验表明,经过此训练策略调整后的模型能够基于少量示例自动填充提示模板,且生成效果优于或等同于人工设计的模板,并具备良好的跨模型迁移性与现有提示优化方法的互补性。这说明阻碍dLMs实现高效提示填充的核心瓶颈是训练范式而非模型架构限制。
链接: https://arxiv.org/abs/2604.03677
作者: Yoshinari Fujinuma,Keisuke Sakaguchi
机构: Patronus AI; Tohoku University (东北大学); RIKEN (理化学研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Masked diffusion language models (dLMs) generate text through bidirectional denoising, yet this capability remains locked for infilling prompts. This limitation is an artifact of the current supervised finetuning (SFT) convention of applying response-only masking. To unlock this capability, we extend full-sequence masking during SFT, where both prompts and responses are masked jointly. Once unlocked, the model infills masked portions of a prompt template conditioned on few-shot examples. We show that such model-infilled prompts match or surpass manually designed templates, transfer effectively across models, and are complementary to existing prompt optimization methods. Our results suggest that training practices, not architectural limitations, are the primary bottleneck preventing masked diffusion language models from infilling effective prompts
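The difference between the two SFT masking conventions is a one-flag change, sketched below on token IDs. The mask token, masking rate, and uniform per-token masking are illustrative assumptions; actual dLM training schedules are more elaborate.

```python
import random

def mask_example(prompt_ids, response_ids, mask_id=-1, rate=0.5,
                 full_sequence=True, rng=random):
    """SFT masking for a masked diffusion LM. Response-only masking (the
    usual convention) never corrupts the prompt; full-sequence masking
    corrupts prompt and response jointly, which is what unlocks prompt
    infilling at inference time."""
    seq = list(prompt_ids) + list(response_ids)
    maskable_from = 0 if full_sequence else len(prompt_ids)
    return [mask_id if i >= maskable_from and rng.random() < rate else tok
            for i, tok in enumerate(seq)]

rng = random.Random(0)
out = mask_example([1, 2, 3], [4, 5, 6], full_sequence=False, rng=rng)
assert out[:3] == [1, 2, 3]   # prompt untouched under response-only masking
out_full = mask_example([1, 2, 3], [4, 5, 6], full_sequence=True, rng=rng)
assert len(out_full) == 6
```

Under full-sequence masking the model learns to denoise prompt positions too, so at inference it can fill masked slots in a prompt template conditioned on the surrounding context.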
[NLP-83] Layer su Layer: Identifying and Disambiguating the Italian NPN Construction in BERT's family
【速读】: 该论文旨在解决预训练语言模型(Pretrained Language Models, PLMs)中上下文嵌入(contextual embeddings)是否编码了显式语言学理论所描述的构式信息(constructional information)这一问题,尤其聚焦于意大利语中名词-介词-名词(Noun-Preposition-Noun, NPN)构式家族的语言现象。其解决方案的关键在于:利用BERT模型提取上下文向量表示,并通过逐层探针分类器(layer-wise probing classifiers)系统评估不同网络层中编码的构式形式与意义信息,从而为构式主义理论与神经语言建模之间的对话提供实证依据。
链接: https://arxiv.org/abs/2604.03673
作者: Greta Gorzoni,Ludovica Pannitto,Francesca Masini
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Interpretability research has highlighted the importance of evaluating Pretrained Language Models (PLMs) and in particular contextual embeddings against explicit linguistic theories to determine what linguistic information they encode. This study focuses on the Italian NPN (noun-preposition-noun) constructional family, challenging some of the theoretical and methodological assumptions underlying previous experimental designs and extending this type of research to a lesser-investigated language. Contextual vector representations are extracted from BERT and used as input to layer-wise probing classifiers, systematically evaluating information encoded across the model’s internal layers. The results shed light on the extent to which constructional form and meaning are reflected in contextual embeddings, contributing empirical evidence to the dialogue between constructionist theory and neural language modelling
[NLP-84] AI Appeals Processor: A Deep Learning Approach to Automated Classification of Citizen Appeals in Government Services
【速读】: 该论文旨在解决政府机构在处理公民诉求(appeals)时面临的效率瓶颈问题,即传统人工处理方式平均耗时20分钟/件且分类准确率仅为67%,难以应对电子化诉求量激增的挑战。解决方案的关键在于构建一个基于微服务架构的AI Appeals Processor系统,集成自然语言处理(Natural Language Processing, NLP)与深度学习技术,实现诉求的自动化分类与路由;实验表明,采用Word2Vec结合长短期记忆网络(LSTM)的模型在保持78%分类准确率的同时,将处理时间缩短54%,相较基于Transformer的模型展现出更优的准确性与计算效率平衡。
链接: https://arxiv.org/abs/2604.03672
作者: Vladimir Beskorovainyi
机构: Besk Tech; Moscow Institute of Physics and Technology (MIPT)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 0 figures, 5 tables
Abstract:Government agencies worldwide face growing volumes of citizen appeals, with electronic submissions increasing significantly over recent years. Traditional manual processing averages 20 minutes per appeal with only 67% classification accuracy, creating significant bottlenecks in public service delivery. This paper presents AI Appeals Processor, a microservice-based system that integrates natural language processing and deep learning techniques for automated classification and routing of citizen appeals. We evaluate multiple approaches – including Bag-of-Words with SVM, TF-IDF with SVM, fastText, Word2Vec with LSTM, and BERT – on a representative dataset of 10,000 real citizen appeals across three primary categories (complaints, applications, and proposals) and seven thematic domains. Our experiments demonstrate that a Word2Vec+LSTM architecture achieves 78% classification accuracy while reducing processing time by 54%, offering an optimal balance between accuracy and computational efficiency compared to transformer-based models.
[NLP-85] Document-Level Numerical Reasoning across Single and Multiple Tables in Financial Reports
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长篇结构化文档中进行可靠数值推理(numerical reasoning)的难题,尤其聚焦于金融年报这类包含跨表格信息整合需求的复杂文档。现有基准多局限于单表场景,忽视了跨表格、多步骤计算在真实财务分析中的关键作用。其解决方案的核心是提出FinLongDocAgent,一种基于多智能体(Multi-Agent)的多轮检索增强生成(Retrieval-Augmented Generation, RAG)框架,通过迭代式证据检索、中间计算与结果验证机制,显著提升LLMs在长上下文金融文档中的数值问答准确率。
链接: https://arxiv.org/abs/2604.03664
作者: Yi-Cheng Wang,Wei-An Wang,Chu-Song Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Despite the strong language understanding abilities of large language models (LLMs), they still struggle with reliable question answering (QA) over long, structured documents, particularly for numerical reasoning. Financial annual reports exemplify this difficulty: financial statement analysis often hinges on accurate arithmetic, and analysts derive key indicators by integrating evidence scattered across multiple tables and narrative text. However, existing benchmarks focus largely on single-table settings, leaving cross-table document-level numerical reasoning underexplored. To address this gap, we introduce FinLongDocQA, a dataset for both single-table and cross-table financial numerical reasoning in long-context reports. Evaluating both closed-source and open-source LLMs on FinLongDocQA reveals two bottlenecks: (1) annual reports often exceed 129k tokens, exacerbating the context rot problem for locating relevant tables; and (2) even when relevant evidence is located, LLMs remain prone to errors in multi-step numerical reasoning. We propose FinLongDocAgent, a Multi-Agent Multi-Round Retrieval-Augmented Generation (RAG) approach that iteratively retrieves evidence, performs intermediate calculations, and verifies results across rounds. Experiments highlight the importance of iterative retrieval and verification for reliable numerical QA in long financial documents.
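The multi-round retrieve-compute-verify loop can be sketched as a control structure, with the agents abstracted as callables. The toy two-table profit example is an assumption for illustration; the paper's agents are LLM-backed and operate over full annual reports.

```python
def multi_round_rag(question, retrieve, compute, verify, max_rounds=3):
    """Iteratively retrieve evidence, perform intermediate calculations,
    and verify the answer before returning (a sketch of the multi-round
    loop; retrieve/compute/verify stand in for the paper's agents)."""
    evidence, answer = [], None
    for _ in range(max_rounds):
        evidence += retrieve(question, evidence)
        answer = compute(question, evidence)
        if verify(question, evidence, answer):
            break
    return answer

# Toy example: profit = revenue - cost, evidence arriving over two rounds.
tables = [("revenue", 120.0), ("cost", 80.0)]
retrieve = lambda q, ev: [tables[len(ev)]] if len(ev) < len(tables) else []
compute = lambda q, ev: dict(ev).get("revenue", 0.0) - dict(ev).get("cost", 0.0)
verify = lambda q, ev, a: len(ev) == 2   # require both tables before trusting a
assert multi_round_rag("profit?", retrieve, compute, verify) == 40.0
```

The first round's answer (120.0) fails verification because the cost table is missing; the loop retrieves again and only then commits, which is the failure mode iterative verification is meant to catch.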
[NLP-86] CAGMamba: Context-Aware Gated Cross-Modal Mamba Network for Multimodal Sentiment Analysis
【速读】: 该论文旨在解决多模态情感分析(Multimodal Sentiment Analysis, MSA)中两个核心问题:一是现有基于Transformer的跨模态融合方法存在二次计算复杂度,难以扩展;二是先前方法对对话中前序话语的上下文信息建模不足,缺乏显式的时序结构来捕捉情感演变过程。其解决方案的关键在于提出CAGMamba框架,通过将上下文与当前话语特征组织为时间有序的二进制序列,赋予Mamba模型明确的时间结构以建模情感演化;同时设计门控交叉模态Mamba网络(Gated Cross-Modal Mamba Network, GCMN),利用可学习门控机制平衡跨模态融合与单模态信息保留,并采用三分支多任务目标联合训练文本、音频及融合预测,从而实现高效且精准的对话情感分析。
链接: https://arxiv.org/abs/2604.03650
作者: Minghai Jiao,Jing Xiao,Peng Xiao,Ende Zhang,Shuang Kan,Wenyan Jiang,Jinyao Li,Yixian Liu,Haidong Xin
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Multimodal Sentiment Analysis (MSA) requires effective modeling of cross-modal interactions and contextual dependencies while remaining computationally efficient. Existing fusion approaches predominantly rely on Transformer-based cross-modal attention, which incurs quadratic complexity with respect to sequence length and limits scalability. Moreover, contextual information from preceding utterances is often incorporated through concatenation or independent fusion, without explicit temporal modeling that captures sentiment evolution across dialogue turns. To address these limitations, we propose CAGMamba, a context-aware gated cross-modal Mamba framework for dialogue-based sentiment analysis. Specifically, we organize the contextual and the current-utterance features into a temporally ordered binary sequence, which provides Mamba with explicit temporal structure for modeling sentiment evolution. To further enable controllable cross-modal integration, we propose a Gated Cross-Modal Mamba Network (GCMN) that integrates cross-modal and unimodal paths via learnable gating to balance information fusion and modality preservation, and is trained with a three-branch multi-task objective over text, audio, and fused predictions. Experiments on three benchmark datasets demonstrate that CAGMamba achieves state-of-the-art or competitive results across multiple evaluation metrics. All codes are available at this https URL.
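The gating idea (balancing a cross-modal path against a unimodal one via a learned sigmoid gate) can be sketched element-wise; the parameterization below is generic, not GCMN's exact design:

```python
import math

def gated_fusion(cross, uni, w_gate, b_gate):
    """Element-wise learned gate: g = sigmoid(w1*c + w2*u + b),
    out = g * cross + (1 - g) * uni."""
    out = []
    for c, u in zip(cross, uni):
        g = 1.0 / (1.0 + math.exp(-(w_gate[0] * c + w_gate[1] * u + b_gate)))
        out.append(g * c + (1.0 - g) * u)
    return out

cross_feats = [1.0, -2.0]   # cross-modal path
uni_feats = [0.5, 0.5]      # unimodal path
fused = gated_fusion(cross_feats, uni_feats, w_gate=[0.0, 0.0], b_gate=0.0)
# with zero gate weights g = 0.5, so the output averages the two paths
```

Learning `w_gate` and `b_gate` lets the model decide, per dimension, how much fused information to admit versus how much unimodal signal to preserve.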
[NLP-87] The Format Tax
【速读】: 该论文试图解决的问题是:在使用开放权重的大语言模型(Large Language Models, LLMs)时,要求其输出结构化格式(如JSON、XML、LaTeX、Markdown)会导致推理和写作性能显著下降,这种现象被称为“格式税”(format tax)。研究发现,这种性能损失主要源于提示(prompt)中对格式的请求本身,而非解码约束或采样偏差。解决方案的关键在于“解耦推理与格式化”——即通过两种方式实现:一是先生成自由文本再进行二次格式转换,二是在一个生成过程中支持扩展性思维(extended thinking),从而将推理过程与格式要求分离。实验表明,在六种开放权重模型、四种API模型、四种格式以及涵盖数学、科学、逻辑和写作的任务中,该策略可恢复大部分因格式要求而损失的准确性。
链接: https://arxiv.org/abs/2604.03616
作者: Ivan Yee Lee,Loris D’Antoni,Taylor Berg-Kirkpatrick
机构: UC San Diego (加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:Asking a large language model to respond in JSON should be a formatting choice, not a capability tax. Yet we find that structured output requirements – JSON, XML, LaTeX, Markdown – substantially degrade reasoning and writing performance across open-weight models. The research response has focused on constrained decoding, but sampling bias accounts for only a fraction of the degradation. The dominant cost enters at the prompt: format-requesting instructions alone cause most of the accuracy loss, before any decoder constraint is applied. This diagnosis points to a simple principle: decouple reasoning from formatting. Whether by generating freeform first and reformatting in a second pass, or by enabling extended thinking within a single generation, separating the two concerns substantially recovers lost accuracy. Across six open-weight models, four API models, four formats, and tasks spanning math, science, logic, and writing, decoupling recovers most lost accuracy. Notably, most recent closed-weight models show little to no format tax, suggesting the problem is not inherent to structured generation but a gap that current open-weight models have yet to close. Code is available at this https URL.
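The "decouple reasoning from formatting" principle reduces to a two-pass pipeline: answer freeform first, then reformat. A sketch with a hypothetical prompt-to-text function; `fake_generate` is a toy stand-in, not a real API:

```python
import json

def answer_with_format(question, generate):
    """Two-pass decoupling: reason in free text first, then convert to
    the requested structure in a second call. `generate` is any
    prompt -> text function; the prompts are illustrative."""
    freeform = generate(f"Answer step by step: {question}")
    structured = generate(
        "Convert this answer to JSON with keys 'reasoning' and 'answer', "
        "changing nothing else:\n" + freeform
    )
    return json.loads(structured)

def fake_generate(prompt):
    # toy stand-in for an LLM, keyed on the prompt prefix
    if prompt.startswith("Answer step by step"):
        return "2 + 2 = 4, so the answer is 4."
    return json.dumps({"reasoning": "2 + 2 = 4", "answer": 4})

result = answer_with_format("What is 2 + 2?", fake_generate)
# result == {"reasoning": "2 + 2 = 4", "answer": 4}
```

Because the format request never appears in the reasoning prompt, the first pass avoids the prompt-side accuracy loss the paper identifies as the dominant cost.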
[NLP-88] Unveiling Language Routing Isolation in Multilingual MoE Models for Interpretable Subnetwork Adaptation
【速读】: 该论文旨在解决混合专家模型(Mixture-of-Experts, MoE)在不同语言间表现差异显著的问题,尤其是低资源语言性能不足的瓶颈。其核心发现是“语言路由隔离”(Language Routing Isolation)现象,即高资源与低资源语言倾向于激活互不重叠的专家子集,且路由模式在模型深度上呈现分层收敛-发散特性。解决方案的关键在于提出 RISE(Routing Isolation-guided Subnetwork Enhancement)框架,通过三阶段选择策略识别并增强语言特异性专家子网络:利用特异性得分筛选浅层和深层的语言专属专家,用重叠得分选取中层通用专家;随后仅训练所选子网络而冻结其余参数,从而显著提升低资源语言性能(最高F1提升10.85%),同时保持其他语言能力不受损。
链接: https://arxiv.org/abs/2604.03592
作者: Kening Zheng,Wei-Chieh Huang,Jiahao Huo,Zhonghao Li,Henry Peng Zou,Yibo Yan,Xin Zou,Jungang Li,Junzhuo Li,Hanrong Zhang,Xuming Hu,Philip S. Yu
机构: University of Illinois Chicago (伊利诺伊大学芝加哥分校); HKUST (Guangzhou) (香港科技大学(广州)); University of Maryland (马里兰大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Mixture-of-Experts (MoE) models exhibit striking performance disparities across languages, yet the internal mechanisms driving these gaps remain poorly understood. In this work, we conduct a systematic analysis of expert routing patterns in MoE models, revealing a phenomenon we term Language Routing Isolation, in which high- and low-resource languages tend to activate largely disjoint expert sets. Through layer-stratified analysis, we further show that routing patterns exhibit a layer-wise convergence-divergence pattern across model depth. Building on these findings, we propose RISE (Routing Isolation-guided Subnetwork Enhancement), a framework that exploits routing isolation to identify and adapt language-specific expert subnetworks. RISE applies a tripartite selection strategy, using specificity scores to identify language-specific experts in shallow and deep layers and overlap scores to select universal experts in middle layers. By training only the selected subnetwork while freezing all other parameters, RISE substantially improves low-resource language performance while preserving capabilities in other languages. Experiments on 10 languages demonstrate that RISE achieves target-language F1 gains of up to 10.85% with minimal cross-lingual degradation.
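Mechanically, "training only the selected subnetwork" reduces to marking the chosen experts' parameters trainable and freezing everything else. A minimal sketch with hypothetical parameter names:

```python
def select_trainable(param_names, selected_experts):
    """RISE-style selective adaptation, reduced to its core: only the
    chosen experts' parameters stay trainable; everything else (router,
    embeddings, other experts) is frozen. Naming is hypothetical."""
    return [name for name in param_names
            if any(f"experts.{i}." in name for i in selected_experts)]

param_names = ["layers.0.experts.3.w", "layers.0.experts.7.w",
               "layers.0.router.w", "embed.w"]
trainable = select_trainable(param_names, selected_experts={3})
# only the expert-3 weight remains trainable
```

In RISE the `selected_experts` set itself is chosen per layer by the specificity and overlap scores described above.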
[NLP-89] MultiPress: A Multi-Agent Framework for Interpretable Multimodal News Classification IJCNN
【速读】: 该论文旨在解决多模态新闻内容分类中因模态独立处理或简单融合策略导致的跨模态交互建模不足及外部知识利用有限的问题。其解决方案的关键在于提出一个三阶段多智能体框架MultiPress,通过专业化智能体实现多模态感知、检索增强推理与门控融合评分,并引入奖励驱动的迭代优化机制,从而显著提升分类准确率和可解释性。
链接: https://arxiv.org/abs/2604.03586
作者: Tailong Luo,Hao Li,Rong Fu,Xinyue Jiang,Huaxuan Ding,Yiduo Zhang,Zilin Zhao,Simon Fong,Guangyin Jin,Jianyuan Ni
机构: New York Institute of Technology (纽约理工学院); University of Arizona (亚利桑那大学); University of Macau (澳门大学); Peking University (北京大学); Chang’an University (长安大学); Juniata College (朱尼塔学院)
类目: Computation and Language (cs.CL)
备注: Accepted in International Joint Conference on Neural Networks (IJCNN) 2026
Abstract:With the growing prevalence of multimodal news content, effective news topic classification demands models capable of jointly understanding and reasoning over heterogeneous data such as text and images. Existing methods often process modalities independently or employ simplistic fusion strategies, limiting their ability to capture complex cross-modal interactions and leverage external knowledge. To overcome these limitations, we propose MultiPress, a novel three-stage multi-agent framework for multimodal news classification. MultiPress integrates specialized agents for multimodal perception, retrieval-augmented reasoning, and gated fusion scoring, followed by a reward-driven iterative optimization mechanism. We validate MultiPress on a newly constructed large-scale multimodal news dataset, demonstrating significant improvements over strong baselines and highlighting the effectiveness of modular multi-agent collaboration and retrieval-augmented reasoning in enhancing classification accuracy and interpretability.
[NLP-90] Text Summarization With Graph Attention Networks NEURIPS
【速读】: 该论文旨在通过引入图结构信息(特别是修辞结构理论 Rhetorical Structure Theory, RST 和共指关系 Co-reference, Coref 图)来提升摘要生成模型的性能。其关键解决方案是将图信息融入神经网络架构中:初期尝试使用图注意力网络(Graph Attention Network, GAT)以捕捉图结构中的依赖关系,但未取得性能提升;随后改用简单的多层感知机(Multi-layer Perceptron, MLP)架构,成功在主要数据集 CNN/DM 上提升了模型表现。此外,研究还对 XSum 数据集进行了 RST 图标注,为未来基于图结构的摘要模型提供了基准,揭示了当前方法的优势与局限性。
链接: https://arxiv.org/abs/2604.03583
作者: Mohammadreza Ardestani,Yllias Chali
机构: University of Lethbridge (莱斯布里奇大学)
类目: Computation and Language (cs.CL)
备注: Published in Proceedings of the 4th NeurIPS Efficient Natural Language and Speech Processing Workshop (ENLSP-IV), Vancouver, Canada, 2024. 14 pages, 8 figures
Abstract:This study aimed to leverage graph information, particularly Rhetorical Structure Theory (RST) and Co-reference (Coref) graphs, to enhance the performance of our baseline summarization models. Specifically, we experimented with a Graph Attention Network architecture to incorporate graph information. However, this architecture did not enhance the performance. Subsequently, we used a simple Multi-layer Perceptron architecture, which improved the results in our proposed model on our primary dataset, CNN/DM. Additionally, we annotated XSum dataset with RST graph information, establishing a benchmark for future graph-based summarization models. This secondary dataset posed multiple challenges, revealing both the merits and limitations of our models.
[NLP-91] Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models
【速读】: 该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在多模态推理中普遍存在的对象幻觉(object hallucination)问题,即模型生成图像中并不存在的物体描述。现有方法通常依赖于对每个输入进行迭代优化来抑制不可靠的视觉信号,导致显著的推理延迟。论文的关键解决方案是通过分析视觉编码器内部注意力动态,识别出视觉信息处理的三阶段结构(扩散、聚焦、再扩散),并发现幻觉行为特别敏感于聚焦阶段获得低注意力的token。基于此,提出一种轻量级推理时干预策略:在聚焦阶段选择性抑制低注意力token,利用单次前向传播统计实现无训练干预,并采用行列式点过程(Determinantal Point Process, DPP)保留多样视觉线索同时过滤冗余token。该方法在多个LVLM骨干网络和解码策略下均有效降低幻觉指标,且相比对抗不确定性估计方法具有可忽略的额外推理延迟。
链接: https://arxiv.org/abs/2604.03556
作者: Sohyeon Kim,Sang Yeon Yoon,Kyeongbo Kong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large Vision-Language Models (LVLMs) have achieved impressive progress in multimodal reasoning, yet they remain prone to object hallucinations, generating descriptions of objects that are not present in the input image. Recent approaches attempt to mitigate hallucinations by suppressing unreliable visual signals in the vision encoder, but many rely on iterative optimization for each input, resulting in substantial inference latency. In this work, we investigate the internal attention dynamics of vision encoders in LVLMs and identify a consistent three-phase structure of visual information processing: diffusion, focus, and rediffusion. Our analysis reveals that hallucination behavior is particularly sensitive to tokens receiving low attention during the focus phase. Motivated by this observation, we propose a lightweight inference-time intervention that selectively suppresses such tokens during the focus phase. The method operates in a training-free manner using statistics from a single forward pass and employs a Determinantal Point Process (DPP) to preserve diverse visual cues while filtering redundant tokens. Extensive experiments across multiple LVLM backbones and decoding strategies demonstrate that the proposed approach consistently reduces hallucination metrics while maintaining competitive caption quality. Moreover, compared to adversarial uncertainty estimation methods, our approach achieves comparable hallucination mitigation with negligible additional inference latency.
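A common way to use a DPP for diverse selection is greedy MAP inference over a similarity kernel. The sketch below illustrates the general mechanism on three toy tokens where two are near-duplicates; it is not the paper's exact procedure:

```python
def det(m):
    """Determinant by Laplace expansion (fine for tiny matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, pivot in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += ((-1) ** j) * pivot * det(minor)
    return total

def greedy_dpp(kernel, k):
    """Greedy MAP for a DPP: repeatedly add the item that most increases
    the determinant of the selected kernel submatrix. Similar items
    shrink the determinant, so diverse sets are preferred."""
    selected = []
    for _ in range(k):
        best, best_det = None, float("-inf")
        for i in range(len(kernel)):
            if i in selected:
                continue
            idx = selected + [i]
            sub = [[kernel[a][b] for b in idx] for a in idx]
            d = det(sub)
            if d > best_det:
                best, best_det = i, d
        selected.append(best)
    return selected

# tokens 0 and 1 are near-duplicates (similarity 0.9); token 2 is distinct
K = [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.1],
     [0.1, 0.1, 1.0]]
picked = greedy_dpp(K, 2)  # prefers the diverse pair over the duplicates
```

Here keeping tokens 0 and 2 yields a submatrix determinant of 0.99 versus 0.19 for the near-duplicate pair, which is why the redundant token is filtered.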
[NLP-92] Towards the AI Historian: Agentic Information Extraction from Primary Sources
【速读】: 该论文旨在解决人工智能(AI)在历史研究领域应用受限的问题,特别是由于缺乏针对历史学家需求设计的工具。其核心挑战在于传统视觉语言模型(VLM)驱动的固定数据提取流程难以适应历史文献中多样且复杂的原始资料(primary sources)特性。解决方案的关键在于提出Chronos的第一个模块——一个支持自然语言交互的AI historian工具,它允许历史学家根据具体史料特征灵活调整数据提取工作流,动态评估AI模型在特定任务上的表现,并通过与Chronos代理的持续对话迭代优化流程,从而实现对图像扫描的历史原始资料向结构化数据的有效转化。该模块已开源,可直接用于历史学者自主处理其研究材料。
链接: https://arxiv.org/abs/2604.03553
作者: Lorenz Hufe,Niclas Griesshaber,Gavin Greif,Sebastian Oliver Eck,Philip Torr
机构: University of Oxford (牛津大学); Fraunhofer HHI (弗劳恩霍夫海因里希·赫兹研究所)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL)
备注:
Abstract:AI is supporting, accelerating, and automating scientific discovery across a diverse set of fields. However, AI adoption in historical research remains limited due to the lack of solutions designed for historians. In this technical progress report, we introduce the first module of Chronos, an AI Historian under development. This module enables historians to convert image scans of primary sources into data through natural-language interactions. Rather than imposing a fixed extraction pipeline powered by a vision-language model (VLM), it allows historians to adapt workflows for heterogeneous source corpora, evaluate the performance of AI models on specific tasks, and iteratively refine workflows through natural-language interaction with the Chronos agent. The module is open-source and ready to be used by historical researchers on their own sources.
[NLP-93] Rethinking Token Prediction: Tree-Structured Diffusion Language Model
【速读】: 该论文旨在解决离散扩散语言模型(Discrete Diffusion Language Models)在参数和显存资源受限条件下训练效率低下的问题。其核心挑战在于,现有架构普遍采用全词汇表(full-vocabulary)的token预测层,导致该部分占用了大量模型参数(如小规模DiT结构中超过20%)并显著增加峰值GPU显存占用,造成资源利用不充分。解决方案的关键在于摒弃显式的全词汇预测机制,转而利用词汇树(vocabulary tree)中token之间的内在结构关系,构建一种树状扩散语言模型:通过将扩散过程中的中间潜在状态映射到预构建词汇树中token的祖先节点,实现分类维度的指数级降低,使预测头尺寸可忽略不计,并将节省出的参数重新分配给更深层的注意力模块。实验表明,在相同参数预算下,该方法可将峰值GPU显存使用量减少一半,同时保持与当前最优离散扩散语言模型相当的困惑度性能。
链接: https://arxiv.org/abs/2604.03537
作者: Zihao Wu,Haoming Yang,Juncheng Dong,Vahid Tarokh
机构: Duke University (杜克大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Discrete diffusion language models have emerged as a competitive alternative to auto-regressive language models, but training them efficiently under limited parameter and memory budgets remains challenging. Modern architectures are predominantly based on a full-vocabulary token prediction layer, which accounts for a substantial fraction of model parameters (e.g., more than 20% in small scale DiT-style designs) and often dominates peak GPU memory usage. This leads to inefficient use of both parameters and memory under constrained training resources. To address this issue, we revisit the necessity of explicit full-vocabulary prediction, and instead exploit the inherent structure among tokens to build a tree-structured diffusion language model. Specifically, we model the diffusion process with intermediate latent states corresponding to a token’s ancestor nodes in a pre-constructed vocabulary tree. This tree-structured factorization exponentially reduces the classification dimensionality, makes the prediction head negligible in size, and enables reallocation of parameters to deepen the attention blocks. Empirically, under the same parameter budget, our method reduces peak GPU memory usage by half while matching the perplexity performance of state-of-the-art discrete diffusion language models.
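The tree-structured factorization replaces a V-way softmax with a chain of binary decisions along a token's path in the vocabulary tree. A minimal sketch of the resulting probability computation, with illustrative node naming:

```python
import math

def path_log_prob(node_logits, path):
    """Probability of a token under a binary vocabulary tree: the product
    of branch probabilities along its root-to-leaf path. The prediction
    head then outputs O(log V) binary decisions instead of V logits."""
    logp, node = 0.0, "root"
    for branch in path:  # 0 = left child, 1 = right child
        p_right = 1.0 / (1.0 + math.exp(-node_logits[node]))
        logp += math.log(p_right if branch == 1 else 1.0 - p_right)
        node = f"{node}.{branch}"
    return logp

# depth-2 tree over 4 tokens; zero logits make every branch 50/50
logits = {"root": 0.0, "root.0": 0.0, "root.1": 0.0}
p = math.exp(path_log_prob(logits, [1, 0]))  # 0.5 * 0.5 = 0.25
```

This is the same trick as hierarchical softmax: the exponential reduction in classification dimensionality is what frees parameters for deeper attention blocks.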
[NLP-94] LangFIR: Discovering Sparse Language-Specific Features from Monolingual Data for Language Steering
【速读】: 该论文旨在解决多语言大语言模型(Large Language Models, LLMs)在推理阶段难以可靠控制输出语言的问题。现有基于表示层操控(representation-level steering)的方法通常依赖于昂贵的多语言或平行语料来识别残差流(residual stream)中的语言特定方向,而本文提出LangFIR(Language Feature Identification via Random-token Filtering)方法,其核心创新在于仅使用少量单语数据和随机token序列即可发现语言特定的稀疏自动编码器(Sparse Autoencoders, SAE)特征。关键在于:随机token序列能激活那些与语言无关的SAE特征,从而通过过滤机制分离出高度稀疏且对目标语言具有强选择性的语言特异性特征;这些特征在因果消融实验中被证明是必要的——定向移除会显著增加对应语言的交叉熵损失,且用于构建控制向量时,在三个主流模型、多个数据集和十二种语言上均优于依赖平行数据的基线方法,表明语言身份在多语言LLM中可被定位到一个稀疏特征方向集合中。
链接: https://arxiv.org/abs/2604.03532
作者: Sing Hieng Wong,Hassan Sajjad,A.B. Siddique
机构: University of Kentucky (肯塔基大学); Dalhousie University (达尔豪西大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted to COLM 2026
Abstract:Large language models (LLMs) show strong multilingual capabilities, yet reliably controlling the language of their outputs remains difficult. Representation-level steering addresses this by adding language-specific vectors to model activations at inference time, but identifying language-specific directions in the residual stream often relies on multilingual or parallel data that can be expensive to obtain. Sparse autoencoders (SAEs) decompose residual activations into interpretable, sparse feature directions and offer a natural basis for this search, yet existing SAE-based approaches face the same data constraint. We introduce LangFIR (Language Feature Identification via Random-token Filtering), a method that discovers language-specific SAE features using only a small amount of monolingual data and random-token sequences. Many SAE features consistently activated by target-language inputs do not encode language identity. Random-token sequences surface these language-agnostic features, allowing LangFIR to filter them out and isolate a sparse set of language-specific features. We show that these features are extremely sparse, highly selective for their target language, and causally important: directional ablation increases cross-entropy loss only for the corresponding language. Using these features to construct steering vectors for multilingual generation control, LangFIR achieves the best average accuracy BLEU across three models (Gemma 3 1B, Gemma 3 4B, and Llama 3.1 8B), three datasets, and twelve target languages, outperforming the strongest monolingual baseline by up to and surpassing methods that rely on parallel data. Our results suggest that language identity in multilingual LLMs is localized in a sparse set of feature directions discoverable with monolingual data. Code is available at this https URL.
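The random-token filtering step at the heart of LangFIR amounts to a set difference over activated SAE features. A toy illustration (feature ids are made up):

```python
def language_specific_features(target_active, random_active):
    """Random-token filtering: SAE features that fire on target-language
    text but also fire on random token sequences are considered
    language-agnostic and removed."""
    return sorted(target_active - random_active)

target_active = {3, 17, 42, 99}   # features active on target-language text
random_active = {17, 99, 256}     # features also active on random tokens
specific = language_specific_features(target_active, random_active)
# only features 3 and 42 survive as candidate language-specific directions
```

The surviving sparse feature set is then used to build the steering vectors added to activations at inference time.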
[NLP-95] Cultural Authenticity: Comparing LLM Cultural Representations to Native Human Expectations
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)生成内容中文化对齐(cultural alignment)评估的缺失问题,即现有方法仅关注文化多样性与事实准确性,而忽视了生成内容是否真实反映本地用户对其文化要素的重要程度排序。解决方案的关键在于提出一个以人类为中心的框架:首先基于来自九个国家的开放式调查数据构建人类衍生的文化重要性向量(Cultural Importance Vectors),作为基准;其次设计一种基于句法多样提示集的方法计算模型生成的文化表征向量(Cultural Representation Vectors),并应用于三个前沿LLM(Gemini 2.5 Pro、GPT-4o和Claude 3.5 Haiku)。实证发现部分模型存在西方中心偏差,且所有模型均表现出高度一致的系统性误差特征(ρ > 0.97),表现为过度强调某些文化符号而忽略深层社会价值优先级,从而推动从表面多样性向文化层次真实性评估的范式转变。
链接: https://arxiv.org/abs/2604.03493
作者: Erin MacMurray van Liemt,Aida Davani,Sinchana Kumbale,Neha Dixit,Sunipa Dev
机构: Google Research (谷歌研究); Google (谷歌)
类目: Computation and Language (cs.CL)
备注: 18 pages, 4 figures
Abstract:Cultural representation in Large Language Model (LLM) outputs has primarily been evaluated through the proxies of cultural diversity and factual accuracy. However, a crucial gap remains in assessing cultural alignment: the degree to which generated content mirrors how native populations perceive and prioritize their own cultural facets. In this paper, we introduce a human-centered framework to evaluate the alignment of LLM generations with local expectations. First, we establish a human-derived ground-truth baseline of importance vectors, called Cultural Importance Vectors based on an induced set of culturally significant facets from open-ended survey responses collected across nine countries. Next, we introduce a method to compute model-derived Cultural Representation Vectors of an LLM based on a syntactically diversified prompt-set and apply it to three frontier LLMs (Gemini 2.5 Pro, GPT-4o, and Claude 3.5 Haiku). Our investigation of the alignment between the human-derived Cultural Importance and model-derived Cultural Representations reveals a Western-centric calibration for some of the models where alignment decreases as a country’s cultural distance from the US increases. Furthermore, we identify highly correlated, systemic error signatures ( \rho > 0.97 ) across all models, which over-index on some cultural markers while neglecting the deep-seated social and value-based priorities of users. Our approach moves beyond simple diversity metrics toward evaluating the fidelity of AI-generated content in authentically capturing the nuanced hierarchies of global cultures.
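Comparing a human-derived importance vector with a model-derived representation vector is a rank-correlation computation. A self-contained Spearman sketch (no tie handling, which is adequate for this toy example; the facet values are invented):

```python
def spearman(x, y):
    """Spearman rank correlation between two equal-length vectors."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

human = [0.9, 0.5, 0.3, 0.1]   # importance of four cultural facets
model = [0.8, 0.6, 0.2, 0.05]  # how often the model surfaces each facet
rho = spearman(human, model)   # 1.0: identical rankings
```

Misalignment shows up as rho dropping below 1, which is how a Western-centric calibration would surface in this framework.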
[NLP-96] Evolutionary Search for Automated Design of Uncertainty Quantification Methods
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中不确定性量化(Uncertainty Quantification, UQ)方法依赖人工设计、缺乏可扩展性和通用性的问题。解决方案的关键在于引入LLM驱动的进化搜索(LLM-powered evolutionary search),自动发现以Python程序形式表示的无监督UQ方法,从而实现对幻觉检测器的自动化与可解释性设计。实验表明,所演化的方法在原子主张验证任务上优于人工设计基线,且在跨数据集和分布外场景下具有鲁棒性,同时揭示了不同LLM在进化策略上的本质差异,如Claude模型偏好高特征数量的线性估计器,而GPT-oss-120B则倾向于更简单的基于位置的加权方案。
链接: https://arxiv.org/abs/2604.03473
作者: Mikhail Seleznyov,Daniil Korbut,Viktor Moskvoretskii,Oleg Somov,Alexander Panchenko,Elena Tutubalina
机构: AIRI; Skoltech; Amazon; EPFL; MIPT
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Uncertainty quantification (UQ) methods for large language models are predominantly designed by hand based on domain knowledge and heuristics, limiting their scalability and generality. We apply LLM-powered evolutionary search to automatically discover unsupervised UQ methods represented as Python programs. On the task of atomic claim verification, our evolved methods outperform strong manually-designed baselines, achieving up to 6.7% relative ROC-AUC improvement across 9 datasets while generalizing robustly out-of-distribution. Qualitative analysis reveals that different LLMs employ qualitatively distinct evolutionary strategies: Claude models consistently design high-feature-count linear estimators, while Gpt-oss-120B gravitates toward simpler and more interpretable positional weighting schemes. Surprisingly, only Sonnet 4.5 and Opus 4.5 reliably leverage increased method complexity to improve performance – Opus 4.6 shows an unexpected regression relative to its predecessor. Overall, our results indicate that LLM-powered evolutionary search is a promising paradigm for automated, interpretable hallucination detector design.
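The evolutionary loop itself is simple; what the paper varies is the mutation operator (an LLM editing candidate programs). A generic sketch with elitism, using a toy scalar objective in place of evolving UQ programs:

```python
import random

def evolve(population, score, mutate, generations, rng):
    """Evolutionary search with elitism: keep the top half, mutate it,
    repeat. Stand-in for the paper's LLM-driven loop, where candidates
    are Python UQ programs and `mutate` is an LLM edit."""
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[: max(1, len(population) // 2)]
        population = parents + [mutate(p, rng) for p in parents]
    return max(population, key=score)

# toy objective: evolve a scalar threshold toward 0.7
score = lambda t: -abs(t - 0.7)
mutate = lambda t, rng: t + rng.uniform(-0.1, 0.1)
best = evolve([0.1, 0.5, 0.9], score, mutate,
              generations=20, rng=random.Random(0))
# elitism guarantees the best score never decreases across generations
```

Because parents survive each generation, the returned candidate is never worse than the best initial one; the LLM-specific behavior the paper studies lies entirely in how `mutate` explores program space.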
[NLP-97] Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution
【速读】: 该论文旨在解决共进化自对弈(co-evolutionary self-play)中因提议者(proposer)快速收敛至狭窄问题分布而导致的多样性崩溃问题,这一现象使得课程(curriculum)对求解器(solver)失去信息量,从而停滞共进化循环。解决方案的关键在于引入词汇dropout(vocabulary dropout),这是一种在策略训练和课程生成过程中对提议者的输出logits施加随机硬掩码(hard and non-stationary mask)的轻量机制,通过阻止提议者锁定于固定词元序列来维持问题空间的多样性,进而提升求解器性能——实验表明,在R-Zero框架下训练Qwen3-4B和Qwen3-8B模型进行数学推理时,该方法显著提升了求解器表现,平均增益达+4.4分,尤其在竞赛级基准测试中效果最显著。
链接: https://arxiv.org/abs/2604.03472
作者: Jacob Dineen,Aswin RRV,Zhikun Xu,Ben Zhou
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Co-evolutionary self-play, where one language model generates problems and another solves them, promises autonomous curriculum learning without human supervision. In practice, the proposer quickly converges to a narrow distribution of problems that satisfy the reward function. This diversity collapse renders the curriculum uninformative for the solver, stalling the co-evolutionary loop. We introduce vocabulary dropout, a random mask applied to the proposer’s output logits during both policy training and curriculum generation, as a lightweight mechanism to sustain diversity. The mask is hard and non-stationary, preventing the proposer from locking into fixed token sequences. Training Qwen3-4B and Qwen3-8B on mathematical reasoning via R-Zero, we find that vocabulary dropout sustains proposer diversity across lexical, semantic, and functional metrics throughout training, and yields solver improvements averaging +4.4 points at 8B, with the largest gains on competition-level benchmarks. Our findings suggest that explicit action-space constraints, analogous to the structural role that game rules play in classical self-play, can help sustain productive co-evolution in language. Vocabulary dropout is one simple instantiation of this principle.
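The mechanism is a hard random mask over the proposer's output logits, redrawn at every call. A minimal sketch:

```python
import random

def vocabulary_dropout(logits, drop_rate, rng):
    """Hard random mask over output logits: dropped tokens get -inf and
    cannot be sampled. Redrawing the mask each step makes it
    non-stationary, so the proposer cannot lock onto fixed sequences."""
    return [float("-inf") if rng.random() < drop_rate else l
            for l in logits]

rng = random.Random(0)
masked = vocabulary_dropout([1.0, 2.0, 3.0, 4.0], drop_rate=0.5, rng=rng)
# surviving logits are unchanged; dropped ones can never be sampled
```

Applying the same mask during both policy training and curriculum generation is what the paper credits with sustaining problem diversity.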
[NLP-98] The Tool Illusion: Rethinking Tool Use in Web Agents
【速读】: 该论文旨在解决当前关于网页代理(web agents)中工具使用(tool use)研究中存在的实证基础薄弱问题,具体包括:工具是否能为网页代理带来稳定收益、有效工具的设计原则是什么,以及工具使用可能引入哪些副作用。其解决方案的关键在于开展一项大规模、受控的实验研究,系统性地覆盖多种工具来源、基础模型、工具使用框架和评估基准,从而在更广泛的条件下重新检验并扩展先前结论,为未来研究提供更可靠的经验依据。
链接: https://arxiv.org/abs/2604.03465
作者: Renze Lou,Baolin Peng,Wenlin Yao,Qianhui Wu,Hao Cheng,Suman Nath,Wenpeng Yin,Jianfeng Gao
机构: 未知
类目: Computation and Language (cs.CL)
备注: preprint
Abstract:As web agents rapidly evolve, an increasing body of work has moved beyond conventional atomic browser interactions and explored tool use as a higher-level action paradigm. Although prior studies have shown the promise of tools, their conclusions are often drawn from limited experimental scales and sometimes non-comparable settings. As a result, several fundamental questions remain unclear: i) whether tools provide consistent gains for web agents, ii) what practical design principles characterize effective tools, and iii) what side effects tool use may introduce. To establish a stronger empirical foundation for future research, we revisit tool use in web agents through an extensive and carefully controlled study across diverse tool sources, backbone models, tool-use frameworks, and evaluation benchmarks. Our findings both revise some prior conclusions and complement others with broader evidence. We hope this study provides a more reliable empirical basis and inspires future research on tool-use web agents.
[NLP-99] Olmo Hybrid: From Theory to Practice and Back
【速读】: 该论文旨在解决当前语言模型架构中是否存在更优替代方案的问题,特别是针对纯Transformer架构在训练效率和表达能力上的局限性。研究发现,虽然非Transformer架构(如线性循环神经网络和混合模型)展现出潜力,但尚缺乏明确证据证明其优势是否足以抵消扩展成本。为此,论文提出以混合模型(hybrid models)作为解决方案,其关键在于将注意力机制与递归结构(如Gated DeltaNet层)结合,在理论上超越纯Transformer和线性RNN的表达能力,例如可执行代码任务;并在实践中通过训练Olmo Hybrid模型验证:该模型在大规模预训练中表现优于纯Transformer基线(Olmo 3),且具有更高的训练效率。进一步理论分析表明,这种增强的表达能力能够转化为更好的缩放效率,从而解释了为何在特定形式问题上提升的表达力能带来下游任务性能的改善。
链接: https://arxiv.org/abs/2604.03444
作者: William Merrill,Yanhong Li,Tyler Romero,Anej Svete,Caia Costello,Pradeep Dasigi,Dirk Groeneveld,David Heineman,Bailey Kuehl,Nathan Lambert,Jacob Morrison,Luca Soldaini,Finbarr Timbers,Pete Walsh,Noah A. Smith,Hannaneh Hajishirzi,Ashish Sabharwal
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Recent work has demonstrated the potential of non-transformer language models, especially linear recurrent neural networks (RNNs) and hybrid models that mix recurrence and attention. Yet there is no consensus on whether the potential benefits of these new architectures justify the risk and effort of scaling them up. To address this, we provide evidence for the advantages of hybrid models over pure transformers on several fronts. First, theoretically, we show that hybrid models do not merely inherit the expressivity of transformers and linear RNNs, but can express tasks beyond both, such as code execution. Putting this theory to practice, we train Olmo Hybrid, a 7B-parameter model largely comparable to Olmo 3 7B but with the sliding window layers replaced by Gated DeltaNet layers. We show that Olmo Hybrid outperforms Olmo 3 across standard pretraining and mid-training evaluations, demonstrating the benefit of hybrid models in a controlled, large-scale setting. We find that the hybrid model scales significantly more efficiently than the transformer, explaining its higher performance. However, it is unclear why greater expressivity on specific formal problems should result in better scaling or superior performance on downstream tasks unrelated to those problems. To explain this apparent gap, we return to theory and argue why increased expressivity should translate to better scaling efficiency, completing the loop. Overall, our results suggest that hybrid models mixing attention and recurrent layers are a powerful extension to the language modeling paradigm: not merely to reduce memory during inference, but as a fundamental way to obtain more expressive models that scale better during pretraining.
[NLP-100] Towards a theory of morphology-driven marking in the lexicon: The case of the state
【速读】: 该论文旨在解决语言中名词范畴(noun category)在不同语言中的表现形式差异问题,特别是语义和/或形态句法特征的显著性变化。其核心挑战在于如何解释同一语言内部及跨语言间名词类型标记模式的多样性。解决方案的关键在于提出一种名为“形态驱动标记”(morphology-driven marking)的形式化模型:该模型将名词组织为具有各自形态模板和无标记形式(unmarked form)的模块化认知集合(modular cognitive sets),从而系统性地解释名词类型在标记上的差异,并通过将这些模式置于句法功能框架内,重新审视标记性(markedness)与状态(state)的概念,主张将状态概念扩展至所有综合语(synthetic languages),并识别出基于句法的屈折类型(syntax-based inflection)——如一致关系(agreement)和语法格(grammatical case)——作为新的子类。
链接: https://arxiv.org/abs/2604.03422
作者: Mohamed El Idrissi
机构: 未知
类目: Computation and Language (cs.CL)
备注: 32 pages, 1 figure
Abstract:All languages have a noun category, but its realisation varies considerably. Depending on the language, semantic and/or morphosyntactic differences may be more or less pronounced. This paper explores these variations, using Riffian as a reference point before extending the analysis to other languages. We propose a formal model termed morphology-driven marking. Nouns are organised into modular cognitive sets, each with its own morphological template and unmarked form. This approach helps explain differences in marking among noun types within and across languages. By situating these patterns within syntactic functions, we also reassess the notions of markedness and state. It is proposed that the concept of state be extended to all synthetic languages and analysed as a novel subcategory of syntax-based inflection, like agreement and grammatical case.
[NLP-101] Are Arabic Benchmarks Reliable? QIMMA's Quality-First Approach to LLM Evaluation
【速读】: 该论文旨在解决现有阿拉伯语大语言模型(Large Language Model, LLM)评估基准中存在的系统性质量缺陷问题,即许多公开基准未经过严格验证,导致评估结果不可靠或存在偏差。解决方案的关键在于提出QIMMA——一个以系统性基准验证为核心的高质量阿拉伯语LLM排行榜,其核心创新是构建了一个多模型评估流程,融合自动化LLM判别与人工审查,用于识别并修正主流阿拉伯语评测数据集中的系统性质量问题;最终形成一个包含超过52k样本的、跨领域、跨任务的高质量评估套件,且所有评估过程通过LightEval和EvalPlus实现透明化,并公开每条样本的推理输出,从而确保可复现性和社区扩展性。
链接: https://arxiv.org/abs/2604.03395
作者: Leen AlQadi,Ahmed Alzubaidi,Mohammed Alyafeai,Hamza Alobeidli,Maitha Alhammadi,Shaikha Alsuwaidi,Omar Alkaabi,Basma El Amel Boussaha,Hakim Hacid
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:We present QIMMA, a quality-assured Arabic LLM leaderboard that places systematic benchmark validation at its core. Rather than aggregating existing resources as-is, QIMMA applies a multi-model assessment pipeline combining automated LLM judgment with human review to surface and resolve systematic quality issues in well-established Arabic benchmarks before evaluation. The result is a curated, multi-domain, multi-task evaluation suite of over 52k samples, grounded predominantly in native Arabic content; code evaluation tasks are the sole exception, as they are inherently language-agnostic. Transparent implementation via LightEval, EvalPlus and public release of per-sample inference outputs make QIMMA a reproducible and community-extensible foundation for Arabic NLP evaluation.
[NLP-102] Noise Steering for Controlled Text Generation: Improving Diversity and Reading-Level Fidelity in Arabic Educational Story Generation
【速读】: 该论文旨在解决阿拉伯语初级阅读评估中生成多样化且符合教学有效性的故事问题,需在词汇、阅读难度和叙事结构的严格约束下避免情节重复,从而保障评估的有效性。其解决方案的关键在于采用噪声引导(noise steering)策略,在推理阶段向Transformer模型的内部表示注入校准的高斯扰动,而非依赖输出层的随机性(如高温采样)。实验表明,残差流噪声注入(Residual stream noise)能显著提升叙事多样性,同时几乎不损害内容质量或约束合规性,并保持早期阅读水平;而注意力熵噪声注入(AENI)进一步稳定了注意力logit噪声,恢复了生成质量,相较高温采样更适配教育内容生成任务。
链接: https://arxiv.org/abs/2604.03380
作者: Haziq Mohammad Khalid,Salsabeel Shapsough,Imran Zualkernan
机构: American University of Sharjah (美国沙迦大学)
类目: Computation and Language (cs.CL)
备注: Under Review
Abstract:Generating diverse, pedagogically valid stories for Arabic early-grade reading assessments requires balancing tight constraints on vocabulary, reading level, and narrative structure against the need to avoid repetitive plots that undermine assessment validity. We investigate noise steering, injecting calibrated Gaussian perturbations into the internal representations of transformer models at inference time, as a training-free diversity method evaluated across five small Arabic-centric language models (7-9B parameters). We compare four injection strategies against high-temperature sampling baselines, measuring diversity, quality, constraint adherence, and reading grade level. Residual stream noise consistently improves narrative diversity with minimal quality or constraint cost and preserves early-grade reading level across all models. Attention entropy noise injection (AENI) stabilizes the otherwise unreliable attention-logit noise while recovering quality. High-temperature sampling inflates reading grade level and causes catastrophic collapse on several models. We find internal representation-level perturbation to be a more suitable diversity strategy than output-level stochasticity for constrained educational content generation.
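At its core, residual-stream noise steering adds a calibrated Gaussian perturbation to a hidden-state vector at inference time. A minimal sketch; the injection site and sigma are the knobs the compared strategies vary:

```python
import random

def inject_noise(residual, sigma, rng):
    """Add zero-mean Gaussian noise to a hidden-state vector, as a
    training-free diversity mechanism applied at inference time."""
    return [h + rng.gauss(0.0, sigma) for h in residual]

hidden = [0.1, -0.3, 0.7]
steered = inject_noise(hidden, sigma=0.05, rng=random.Random(42))
# sigma = 0 recovers the unperturbed representation
unchanged = inject_noise(hidden, sigma=0.0, rng=random.Random(42))
```

Unlike raising the sampling temperature, this perturbs internal representations rather than output probabilities, which is why it can add diversity without inflating reading grade level.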
[NLP-103] VERT: Reliable LLM Judges for Radiology Report Evaluation
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的放射学报告评估方法在跨模态和跨解剖结构场景下鲁棒性不足的问题,以及如何选择最优的模型与提示配置作为LLM裁判以实现可靠评估。其关键解决方案在于提出一种新的LLM基准指标VERT,并通过系统性对比现有指标(RadFact、GREEN、FineRadScore)及多种微调策略(如参数高效微调、少样本学习和集成方法),验证了VERT在多个专家标注数据集(RadEval和RaTE-Eval)上相较于GREEN提升达11.7%的相关性;同时发现对Qwen3 30B进行仅1300样本的微调可带来高达25%的性能增益并减少37.2倍推理时间,表明轻量化适配即可实现高精度、低延迟的放射学报告自动评估。
链接: https://arxiv.org/abs/2604.03376
作者: Federica Bologna,Jean-Philippe Corbeil,Matthew Wilkens,Asma Ben Abacha
机构: Cornell University(康奈尔大学); Microsoft Healthcare Life Sciences(微软医疗健康生命科学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Current literature on radiology report evaluation has focused primarily on designing LLM-based metrics and fine-tuning small models for chest X-rays. However, it remains unclear whether these approaches are robust when applied to reports from other modalities and anatomies. Which model and prompt configurations are best suited to serve as LLM judges for radiology evaluation? We conduct a thorough correlation analysis between expert and LLM-based ratings. We compare three existing LLM-as-a-judge metrics (RadFact, GREEN, and FineRadScore) alongside VERT, our proposed LLM-based metric, using open- and closed-source models (reasoning and non-reasoning) of different sizes across two expert-annotated datasets, RadEval and RaTE-Eval, spanning multiple modalities and anatomies. We further evaluate few-shot approaches, ensembling, and parameter-efficient fine-tuning using RaTE-Eval. To better understand metric behavior, we perform a systematic error detection and categorization study to assess alignment of these metrics against expert judgments and identify areas of lower and higher agreement. Our results show that VERT improves correlation with radiologist judgments by up to 11.7% relative to GREEN. Furthermore, fine-tuning Qwen3 30B yields gains of up to 25% using only 1,300 training samples. The fine-tuned model also reduces inference time by up to 37.2 times. These findings highlight the effectiveness of LLM-based judges and demonstrate that reliable evaluation can be achieved with lightweight adaptation.
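The core measurement behind such comparisons, correlation between judge and expert ratings, is simple to sketch. The per-report scores below are invented placeholders, not data from the paper:

```python
import numpy as np

def judge_expert_corr(judge_scores, expert_scores):
    """Pearson correlation between LLM-judge and radiologist ratings."""
    return float(np.corrcoef(judge_scores, expert_scores)[0, 1])

# Hypothetical per-report ratings on a 1-5 scale (illustrative only).
judge  = [4.0, 2.5, 3.0, 5.0, 1.0]
expert = [4.5, 2.0, 3.5, 5.0, 1.5]
print(round(judge_expert_corr(judge, expert), 3))  # 0.957
```

Correlation studies of this kind also commonly report rank-based measures (Spearman, Kendall) when only the ordering of reports matters.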
[NLP-104] CresOWLve: Benchmarking Creative Problem-Solving Over Real-World Knowledge
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在评估创造性问题解决能力时存在的局限性问题,即现有基准测试仅考察单一认知能力,且多依赖人工构造的谜题或脱离真实场景的假设情境,无法反映现实世界中创造性思维所需的多维知识整合与非显性关联构建。解决方案的关键在于提出CresOWLve基准,该基准通过设计基于真实世界知识的谜题,要求模型综合运用多种创造性思维策略(如类比、发散思维)、跨领域事实检索,并创造性地组合信息以得出答案,从而更全面、贴近实际地评估LLM的创造性问题解决能力。实验表明,即便前沿模型在事实性问答上表现良好,其在创造性任务上的性能仍显著下降(最高达-17%),凸显了当前模型在形成非显性连接方面的根本性不足。
链接: https://arxiv.org/abs/2604.03374
作者: Mete Ismayilzada,Renqing Cuomao,Daniil Yurshevich,Anna Sotnikova,Lonneke van der Plas,Antoine Bosselut
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under review
Abstract:Creative problem-solving requires combining multiple cognitive abilities, including logical reasoning, lateral thinking, analogy-making, and commonsense knowledge, to discover insights that connect seemingly unrelated pieces of information. However, most existing benchmarks for large language models (LLMs) evaluate only specific components of this process. Moreover, many creativity-oriented benchmarks rely on artificially constructed brainteasers or contrived scenarios that do not reflect how creative problem-solving occurs in real-world settings. To address this gap, we introduce CresOWLve, a benchmark for evaluating creative problem-solving using puzzles grounded in real-world knowledge. Problems in CresOWLve require employing multiple creative thinking strategies, retrieving facts from diverse domains, and creatively combining them to arrive at a solution. Evaluating several frontier non-thinking and thinking LLMs, we show that CresOWLve remains highly challenging. Our analysis reveals a consistent performance gap: models perform substantially better on factual questions than on creative ones (up to a -17% drop). While models can often retrieve the relevant knowledge, they struggle to form the non-obvious creative connections required to integrate this information and arrive at the correct answer.
[NLP-105] CoLA: Cross-Modal Low-rank Adaptation for Multimodal Downstream Tasks
【速读】: 该论文旨在解决基础模型在多模态任务中参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)的挑战,特别是针对双流架构(如DINO和BERT)中跨模态交互能力不足的问题。现有PEFT方法(如低秩适应LoRA)仅在单模态内部进行调整,难以有效捕捉不同模态之间的关联。其解决方案的关键在于提出一种名为跨模态低秩适应(Cross-Modal Low-Rank Adaptation, CoLA)的新框架,通过引入一个专门用于跨模态适配的独立路径,与传统的模态内适配路径并行运行,从而在不干扰模态特异性学习的前提下,显著增强跨模态信息融合能力。该设计使得CoLA能够在保持参数效率的同时,在多个视觉-语言(RefCOCO系列)和音频-视觉(AVE、AVS)基准上实现优于LoRA的性能提升,并首次实现了多任务下的参数高效多模态适配。
链接: https://arxiv.org/abs/2604.03314
作者: Wish Suharitdamrong,Tony Alex,Muhammad Awais,Sara Ahmed
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 14 pages, 6 Figures
Abstract:Foundation models have revolutionized AI, but adapting them efficiently for multimodal tasks, particularly in dual-stream architectures composed of unimodal encoders, such as DINO and BERT, remains a significant challenge. Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) enable lightweight adaptation, yet they operate in isolation within each modality, limiting their ability to capture cross-modal interactions. In this paper, we take a step in bridging this gap with Cross-Modal Low-Rank Adaptation (CoLA), a novel PEFT framework that extends LoRA by introducing a dedicated inter-modal adaptation pathway alongside the standard intra-modal one. This dual-path design enables CoLA to adapt unimodal foundation models to multimodal tasks effectively, without interference between modality-specific and cross-modal learning. We evaluate CoLA across a range of vision-language (RefCOCO, RefCOCO+, RefCOCOg) and audio-visual (AVE, AVS) benchmarks, where it consistently outperforms LoRA, achieving relative gains of around 3% and 2%, respectively, while maintaining parameter efficiency. Notably, CoLA enables the first multi-task PEFT framework for visual grounding, bridging a key gap in efficient multimodal adaptation.
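A minimal sketch of the dual-path idea: a frozen base weight, an intra-modal low-rank branch on the modality's own features, and an inter-modal branch fed by the other modality. Dimensions, rank, and initialization are illustrative assumptions rather than CoLA's actual configuration:

```python
import numpy as np

class DualPathAdapter:
    """Sketch: y = W x + B1 (A1 x) + B2 (A2 z), with x from this modality
    and z from the other. Zero-initializing the B matrices keeps the frozen
    base model's behavior intact at the start of training (standard LoRA
    practice)."""
    def __init__(self, d, r=4, seed=0):
        g = np.random.default_rng(seed)
        self.W = g.standard_normal((d, d))          # frozen base weight
        self.A_intra = g.standard_normal((r, d)) * 0.01
        self.B_intra = np.zeros((d, r))
        self.A_inter = g.standard_normal((r, d)) * 0.01
        self.B_inter = np.zeros((d, r))

    def forward(self, x, z):
        out = self.W @ x
        out = out + self.B_intra @ (self.A_intra @ x)  # intra-modal path
        out = out + self.B_inter @ (self.A_inter @ z)  # inter-modal path
        return out

layer = DualPathAdapter(d=8)
x, z = np.ones(8), np.ones(8)
print(np.allclose(layer.forward(x, z), layer.W @ x))  # True: zero-init adapters are inert
```

Keeping the two low-rank paths separate is what lets modality-specific and cross-modal adaptation be trained without interfering with each other.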
[NLP-106] Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)方法中因频繁调用外部知识源而导致的token消耗问题。其核心解决方案是提出“知识包”(Knowledge Packs),即预先计算好的键值缓存(Key-Value Cache, KV cache),能够在不产生额外token成本的情况下提供相同的知识信息。关键创新在于利用因果Transformer结构的特性:对文本F的前向传播所得到的KV缓存,等价于将查询q与F联合输入时产生的缓存结果,这一等价性由因果掩码(causal mask)直接保证。实验表明,在正确格式化提示模板的前提下,该方法在Qwen3-8B和Llama-3.1-8B上实现了700个问题无偏差(zero divergence),最高可节省95% token;此外,由于RoPE旋转key但保留value不变,通过对比delta调整缓存值可实现行为引导(behavioral steering),且该机制无需训练或修改模型权重,能与知识传递并行运行(alpha=0.7)而不相互干扰。
链接: https://arxiv.org/abs/2604.03270
作者: Andrey Pustovit
机构: 未知
类目: Computation and Language (cs.CL)
备注: 12 pages, 3 figures, 8 tables. Code: this https URL
Abstract:RAG wastes tokens. We propose Knowledge Packs: pre-computed KV caches that deliver the same knowledge at zero token cost. For causal transformers, the KV cache from a forward pass on text F is identical to what a joint pass on F+q would produce - this follows directly from the causal mask. The equivalence is exact but fragile: wrong chat template formatting causes 6-7pp degradation, which we believe explains prior claims of KV outperforming RAG. With correct formatting: zero divergences across 700 questions on Qwen3-8B and Llama-3.1-8B, up to 95% token savings. The KV interface also enables behavioral steering that RAG cannot do. Because RoPE rotates keys but leaves values untouched, contrastive deltas on cached values can nudge model behavior while key arithmetic destroys coherence. The effect sits in mid-layer values (33-66%), independent directions are nearly orthogonal (cos~0) and compose, and both channels - knowledge and steering - run simultaneously at alpha=0.7 without interference. No training, no weight modification.
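The causal-mask equivalence the abstract builds on can be verified numerically with a single toy attention head: the K/V cache computed from the knowledge text F alone, then reused when the query arrives, matches a joint forward pass over F + q exactly. (One head with random projections stands in for a full transformer layer here.)

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head attention with a causal mask."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores[np.triu(np.ones(scores.shape, dtype=bool), k=1)] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
F = rng.standard_normal((5, d))   # embeddings of the knowledge text F
q = rng.standard_normal((1, d))   # embedding of the query token

# Joint pass over F + q.
X = np.concatenate([F, q])
joint = causal_attention(X @ Wq, X @ Wk, X @ Wv)

# Cached pass: K/V for F precomputed once (the "knowledge pack"),
# then reused when the query arrives.
K_all = np.concatenate([F @ Wk, q @ Wk])
V_all = np.concatenate([F @ Wv, q @ Wv])
s = (q @ Wq) @ K_all.T / np.sqrt(d)   # the last token attends to everything
w = np.exp(s - s.max()); w /= w.sum()
cached = w @ V_all

print(np.allclose(joint[-1], cached))  # True: the cache is exact
```

The same argument applies layer by layer in a real causal transformer, since K and V at prefix positions never depend on later tokens; the fragility the paper reports comes from prompt formatting, not from this identity.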
[NLP-107] On the First Computer Science Research Paper in an Indian Language and the Future of Science in Indian Languages
【速读】: 该论文旨在解决在印度语言(以泰卢固语为例)中撰写原创计算机科学研究论文的难题,特别是如何构建适用于高级计算机科学概念(如算法、分布式计算和离散数学)的技术术语体系,并克服泰卢固语数学排版工具链不成熟的问题。其解决方案的关键在于:一是利用梵文(Samskrtam)的帕尼尼语法体系(Pāninian grammar)系统性地推导和创制本土化技术术语;二是开发了名为 TeluguTeX 的泰卢固语 XeLaTeX 排版模板,以支持高质量的数学公式排版,从而实现用印度语言进行严谨学术表达的可行性。这一实践为提升印地语系语言(Indic languages)整体科研写作水平提供了可复制的技术路径与语言学基础。
链接: https://arxiv.org/abs/2604.03265
作者: Siddhartha Visveswara Jayanti
机构: Dartmouth College (达特茅斯学院); MIT (麻省理工学院); Princeton University (普林斯顿大学)
类目: General Literature (cs.GL); Computation and Language (cs.CL); Computers and Society (cs.CY); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 15 pages, some text in Telugu
Abstract:I describe my experience writing the first original, modern Computer Science research paper expressed entirely in an Indian language. The paper is in Telugu, a language with approximately 100 million speakers. The paper is in the field of distributed computing and it introduces a technique for proving epistemic logic based lower bounds for multiprocessor algorithms. A key hurdle to writing the paper was developing technical terminology for advanced computer science concepts, including those in algorithms, distributed computing, and discrete mathematics. I overcame this challenge by deriving and coining native language scientific terminology through the powerful, productive, Pāninian grammar of Samskrtam. The typesetting of the paper was an additional challenge, since mathematical typesetting in Telugu is underdeveloped. I overcame this problem by developing a Telugu XeLaTeX template, which I call TeluguTeX. Leveraging this experience of writing an original computer science research paper in an Indian language, I lay out a vision for how to ameliorate the state of scientific writing at all levels in Indic languages – languages whose native speakers exceed one billion people – through the further development of the Sanskrit technical lexicon and through technological internationalization.
[NLP-108] LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling
【速读】: 该论文旨在解决当前长上下文语言模型过度依赖注意力机制来同时处理局部交互与长距离状态信息,从而限制了对序列建模更优分解方式探索的问题。其解决方案的关键在于提出一种混合自回归架构LPC-SM,该架构在单个模块内分离了局部注意力(local attention)、持久记忆(persistent memory)、预测修正(predictive correction)和运行时控制(run-time control),并通过正交新颖性传输(Orthogonal Novelty Transport, ONT)机制调控慢速记忆写入,实现了比单一注意力机制更高效的长序列建模分工。
链接: https://arxiv.org/abs/2604.03263
作者: Keqin Xie
机构: Independent Researcher, Suzhou, China
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); General Literature (cs.GL); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Most current long-context language models still rely on attention to handle both local interaction and long-range state, which leaves relatively little room to test alternative decompositions of sequence modeling. We propose LPC-SM, a hybrid autoregressive architecture that separates local attention, persistent memory, predictive correction, and run-time control within the same block, and we use Orthogonal Novelty Transport (ONT) to govern slow-memory writes. We evaluate a 158M-parameter model in three stages spanning base language modeling, mathematical continuation, and 4096-token continuation. Removing mHC raises the Stage-A final LM loss from 12.630 to 15.127, while adaptive sparse control improves the Stage-B final LM loss from 12.137 to 10.787 relative to a matched fixed-ratio continuation. The full route remains stable at sequence length 4096, where Stage C ends with final LM loss 11.582 and improves the delayed-identifier diagnostic from 14.396 to 12.031 in key cross-entropy. Taken together, these results show that long-context autoregressive modeling can be organized around a broader division of labor than attention alone.
[NLP-109] Why Attend to Everything? Focus is the Key
【速读】: 该论文旨在解决高效注意力机制在保持模型性能的同时实现计算加速的问题,尤其针对现有方法难以在不损害下游任务表现的前提下提升推理效率的困境。其解决方案的关键在于提出Focus方法:通过学习可训练的聚类中心(learnable centroids)将token分组,仅在同组内使用稀疏的远距离注意力(distant attention),而局部注意力(local attention)仍保持全分辨率;由于模型权重冻结,整个过程为纯加性调整,仅需少量参数(如148K)即可显著降低领域困惑度(perplexity, PPL),且对下游任务无性能损失。该方法在多个规模和架构下均优于全注意力机制,并在推理阶段通过top-k硬稀疏化实现2倍加速,进一步分解为两次标准FlashAttention调用可获得8.6倍实际速度提升,同时保持与预训练基线相当的性能。
链接: https://arxiv.org/abs/2604.03260
作者: Hengshuai Yao,Xing Chen,Ahmed Murtadha,Jin Li,Shuai Shao,Yasin Abbasi Yadkori,Guan Wang,Mingli Yuan,William Chen,Sen Song
机构: Sapient
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce Focus, a method that learns which token pairs matter rather than approximating all of them. Learnable centroids assign tokens to groups; distant attention is restricted to same-group pairs while local attention operates at full resolution. Because all model weights stay frozen, Focus is purely additive: centroid-only training (as few as 148K parameters) improves domain perplexity with zero degradation on downstream benchmarks–from 124M to 70B parameters, across five attention architectures. No existing efficient attention method achieves this in the retrofit setting. At 124M, Focus surpasses full attention (30.3 vs 31.4 PPL); trained from scratch at 7B scale (2B tokens), Focus again beats full attention (13.82 vs 13.89 PPL). At inference, restricting each token to its top-k highest-scoring groups discretizes the soft routing into a hard sparsity pattern, yielding 2x speedup while beating the pretrained baseline (41.3 vs 42.8 PPL); decomposing this pattern into two standard FlashAttention calls reaches 8.6x wall-clock speedup at 1M tokens with no custom kernels. Unlike LoRA, centroid routing preserves alignment: instruction-tuned models retain TruthfulQA scores after adaptation, while LoRA degrades at every learning rate and rank. Sinkhorn normalization enforces balanced groups as a hard constraint, and the resulting groups discover interpretable linguistic categories without supervision.
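The centroid routing can be sketched as a hard nearest-centroid assignment that yields a block mask for the distant-attention term, while local attention stays dense. The Euclidean nearest-centroid rule below is a simplified stand-in for the paper's learned, Sinkhorn-balanced routing:

```python
import numpy as np

def group_attention_mask(tokens, centroids):
    """Assign each token to its nearest centroid; distant attention is
    then allowed only between tokens that share a group."""
    d2 = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    groups = d2.argmin(axis=1)                  # (n_tokens,) group ids
    mask = groups[:, None] == groups[None, :]   # (n_tokens, n_tokens)
    return mask, groups

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))   # toy token embeddings
C = rng.standard_normal((2, 4))   # two centroids
mask, groups = group_attention_mask(X, C)
print(mask.shape)  # (6, 6)
```

At inference, restricting each token to its top-k groups turns this soft routing into the hard sparsity pattern the abstract describes, which is what enables the reported speedups.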
[NLP-110] SoLA: Leveraging Soft Activation Sparsity and Low-Rank Decomposition for Large Language Model Compression
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)因参数规模庞大(如百亿级)而导致的部署效率低下的问题,同时避免现有压缩方法对特殊硬件或昂贵后训练的依赖。其核心解决方案是提出一种无需训练的压缩方法 SoLA(Soft activation sparsity and Low-rank decomposition),关键在于通过分析前馈网络(Feed-Forward Network, FFN)中的激活模式,识别并保留对推理贡献显著的少数组件,同时对大部分权重矩阵采用自适应的逐组件低秩分解策略,以最小化分解带来的性能损失。这一机制实现了高效且低成本的模型瘦身,在不进行后训练的情况下显著提升压缩后的语言建模和下游任务准确率。
链接: https://arxiv.org/abs/2604.03258
作者: Xinhao Huang,You-Liang Huang,Zeyi Wen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have demonstrated impressive capabilities across various tasks, but the billion-scale parameters pose deployment challenges. Although existing methods attempt to reduce the scale of LLMs, they require either special hardware support or expensive post-training to maintain model quality. To facilitate efficient and affordable model slimming, we propose a novel training-free compression method for LLMs, named "SoLA", which leverages Soft activation sparsity and Low-rAnk decomposition. SoLA can identify and retain a minority of components significantly contributing to inference, while compressing the majority through low-rank decomposition, based on our analysis of the activation pattern in the feed-forward network (FFN) of modern LLMs. To alleviate the decomposition loss, SoLA is equipped with an adaptive component-wise low-rank allocation strategy to assign appropriate truncation positions for different weight matrices. We conduct extensive experiments on LLaMA-2-7B/13B/70B and Mistral-7B models across a variety of benchmarks. SoLA exhibits remarkable improvement in both language modeling and downstream task accuracy without post-training. For example, with a 30% compression rate on the LLaMA-2-70B model, SoLA surpasses the state-of-the-art method by reducing perplexity from 6.95 to 4.44 and enhancing downstream task accuracy by 10%.
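The retain-plus-decompose split can be sketched with a truncated SVD: a few high-contribution columns are kept exactly and the remainder is factorized at low rank. Deciding which components to keep and the per-matrix rank is precisely what SoLA's activation analysis and adaptive allocation do; the fixed choices below are placeholders:

```python
import numpy as np

def retain_and_decompose(W, keep, rank):
    """Keep columns `keep` of W exactly; approximate the rest as A @ B."""
    rest = np.setdiff1d(np.arange(W.shape[1]), keep)
    U, s, Vt = np.linalg.svd(W[:, rest], full_matrices=False)
    A = U[:, :rank] * s[:rank]   # (d_out, rank)
    B = Vt[:rank]                # (rank, |rest|)
    return W[:, keep], A, B, rest

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 12))  # toy FFN weight
kept, A, B, rest = retain_and_decompose(W, keep=np.array([0, 1]), rank=4)
err = np.linalg.norm(W[:, rest] - A @ B)
print(A.shape, B.shape)  # (16, 4) (4, 10)
```

Storing `kept`, `A`, and `B` instead of `W` is where the compression comes from: the low-rank factors replace most of the dense matrix at a fraction of the parameters.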
[NLP-111] Robust LLM Performance Certification via Constrained Maximum Likelihood Estimation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际部署中难以准确估计其失败率的问题,这一问题直接影响模型的安全性评估与认证。当前实践中存在两种主流方法的局限:一是依赖昂贵的人工标注金标准(gold standard),二是使用自动化标注方案如“LLM-as-a-Judge”可能引入严重偏差。论文提出了一种基于约束最大似然估计(constrained maximum-likelihood estimation, MLE)的新方法,其关键在于融合三类信号源:(i)少量高质量人工标注的校准集、(ii)大规模LLM判官标注数据,以及(iii)通过领域特定约束从已知判官性能统计边界中提取的附加信息。这种整合方式显著提升了估计精度和稳定性,且优于现有先进基线方法(如Prediction-Powered Inference, PPI),为实现可解释、可扩展的LLM失败率认证提供了理论严谨且实用的解决方案。
链接: https://arxiv.org/abs/2604.03257
作者: Minghe Shen,Ananth Balashankar,Adam Fisch,David Madras,Miguel Rodrigues
机构: University College London (伦敦大学学院); Google DeepMind (谷歌深度思维)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The ability to rigorously estimate the failure rates of large language models (LLMs) is a prerequisite for their safe deployment. Currently, however, practitioners often face a tradeoff between expensive human gold standards and potentially severely-biased automatic annotation schemes such as “LLM-as-a-Judge” labeling. In this paper, we propose a new, practical, and efficient approach to LLM failure rate estimation based on constrained maximum-likelihood estimation (MLE). Our method integrates three distinct signal sources: (i) a small, high-quality human-labeled calibration set, (ii) a large corpus of LLM-judge annotations, and, most importantly, (iii) additional side information via domain-specific constraints derived from known bounds on judge performance statistics. We validate our approach through a comprehensive empirical study, benchmarking it against state-of-the-art baselines like Prediction-Powered Inference (PPI). Across diverse experimental regimes – spanning varying judge accuracies, calibration set sizes, and LLM failure rates – our constrained MLE consistently delivers more accurate and lower-variance estimates than existing methods. By moving beyond the “black-box” use of automated judges to a flexible framework, we provide a principled, interpretable, and scalable pathway towards LLM failure-rate certification.
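A heavily stripped-down version of the idea: estimate judge sensitivity and specificity on the small human-labeled set, clip them to assumed bounds standing in for the paper's side-information constraints, and invert the judge's observed flag rate to recover the true failure rate. This is a moment-matching sketch, not the paper's full constrained MLE:

```python
import numpy as np

def constrained_failure_rate(judge_flags, calib_true, calib_judge,
                             sens_bounds=(0.6, 1.0), spec_bounds=(0.6, 1.0)):
    """judge_flags: judge 'failure' labels on the large unlabeled corpus.
    calib_true/calib_judge: human gold and judge labels on the calibration set.
    The bounds are assumed side information about judge performance."""
    sens = np.clip(calib_judge[calib_true == 1].mean(), *sens_bounds)
    spec = np.clip(1.0 - calib_judge[calib_true == 0].mean(), *spec_bounds)
    p_obs = np.mean(judge_flags)
    # Invert p_obs = sens * f + (1 - spec) * (1 - f) for the true rate f.
    f = (p_obs - (1.0 - spec)) / (sens - (1.0 - spec))
    return float(np.clip(f, 0.0, 1.0))

calib_true  = np.array([1, 1, 1, 0, 0, 0, 0, 0])   # human gold labels
calib_judge = np.array([1, 1, 0, 0, 0, 1, 0, 0])   # judge on the same items
flags = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])   # judge on the corpus
print(round(constrained_failure_rate(flags, calib_true, calib_judge), 3))  # 0.214
```

The correction matters because a biased judge's raw flag rate (0.3 here) can substantially overstate the true failure rate once false positives are accounted for.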
[NLP-112] Self-Execution Simulation Improves Coding Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成代码时缺乏对程序执行过程准确估计的问题,这导致其生成的代码难以保证正确性。解决方案的关键在于训练代码大语言模型(Code LLMs)以逐步模拟程序执行过程,并利用这一能力提升竞赛编程(competitive programming)任务中的表现。具体方法包括:基于自然语言执行轨迹进行监督微调、结合真实执行结果生成文本解释,并通过可验证奖励信号进行强化学习优化;同时引入两个互补目标——根据代码和输入预测输出,以及在真实或自预测执行反馈下完成竞赛编程任务。该框架使模型能够对多个候选解进行自我验证,并通过模拟测试执行实现迭代式自我修正,从而显著优于标准推理方法。
链接: https://arxiv.org/abs/2604.03253
作者: Gallil Maimon,Ori Yoran,Felix Kreuk,Michael Hassid,Gal Cohen,Pierre Chambon,Yossi Adi
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:A promising research direction in enabling LLMs to generate consistently correct code involves addressing their inability to properly estimate program execution, particularly for code they generate. In this work, we demonstrate that Code LLMs can be trained to simulate program execution in a step-by-step manner and that this capability can be leveraged to improve competitive programming performance. Our approach combines supervised fine-tuning on natural language execution traces, textual explanations grounded in true execution, with reinforcement learning using verifiable rewards. We introduce two complementary objectives: output prediction given code and inputs, and solving competitive programming tasks with either ground-truth or self-predicted execution feedback. These objectives enable models to perform self-verification over multiple candidate solutions, and iterative self-fixing by simulating test execution. Across multiple competitive programming benchmarks, our method yields consistent improvements over standard reasoning approaches. We further present ablations and analysis to elucidate the role of execution simulation and its limitations.
[NLP-113] Evaluating Digital Inclusiveness of Digital Agri-Food Tools Using Large Language Models: A Comparative Analysis Between Human and AI-Based Evaluations
【速读】: 该论文旨在解决当前评估农业数字工具(agritools)数字包容性时存在的资源密集型问题,即传统多维数字包容性指数(MDII)方法耗时长、成本高,难以在时间敏感或资源受限环境中推广应用。其解决方案的关键在于探索生成式 AI(Generative AI, GenAI)中的大型语言模型(Large Language Models, LLMs)是否能够实现快速、自动化辅助评估,从而补充现有 MDII 工作流程。研究通过对比分析四种主流 LLM(Grok、Gemini、GPT-4o 和 GPT-5)与专家评分的一致性、温度参数敏感性及潜在偏见来源,发现部分模型可在特定维度上近似专家判断,为将 GenAI 整合进包容性数字发展监测提供了早期实证支持。
链接: https://arxiv.org/abs/2604.03252
作者: Githma Pewinya,Carolina Martins,Garcia Mariangel
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: 24 pages, 6 figures, 5 tables
Abstract:Ensuring digital inclusiveness is a critical priority in agri-food systems, particularly in the Global South, where digital divides persist. The Multidimensional Digital Inclusiveness Index (MDII) offers a comprehensive, human-led framework to assess how inclusive digital agricultural tools (agritools) are. However, the current evaluation process is resource intensive, often requiring months to complete. This study explores whether large language models (LLMs) can support a rapid, AI-enabled assessment of digital inclusiveness, complementing the MDII’s existing workflow. Using a comparative analysis, the research benchmarks the performance of four LLMs (Grok, Gemini, GPT-4o, and GPT-5) against prior expert-led evaluations. The study investigates model alignment with human scores, sensitivity to temperature settings, and potential sources of bias. Findings suggest that LLMs can generate evaluative outputs that approximate expert judgment in some dimensions, though reliability varies across models and contexts. This exploratory work provides early evidence for the integration of GenAI into inclusive digital development monitoring, with implications for scaling evaluations in time-sensitive or resource-constrained environments.
[NLP-114] Classifying Problem and Solution Framing in Congressional Social Media
【速读】: 该论文旨在解决美国参议员在社交媒体(Twitter)上发布的帖子如何自动分类为“问题流”(problem stream)或“解决方案流”(solution stream)的问题,这是基于“垃圾桶模型”(Garbage Can Model)的政策制定理论框架。其关键解决方案是采用监督学习方法,利用经过专家标注的3967条推文数据集(其中500条用于测试),通过BERTweet Base模型进行训练和验证,最终在三类分类任务中实现了平均加权F1分数高于0.8的性能,表明该方法能够有效识别政策讨论中的不同语义流类型。
链接: https://arxiv.org/abs/2604.03247
作者: Misha Melnyk,Mitchell Dolny,Joshua D. Elkind,A. Michael Tjhin,Saisha Chebium,Blake VanBerlo,Annelise Russell,Michelle M. Buehlmann,Jesse Hoey
机构: University of Waterloo (滑铁卢大学); University of Kentucky (肯塔基大学); University of North Dakota (北达科他大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:
Abstract:Policy setting in the USA according to the "Garbage Can" model differentiates between "problem"- and "solution"-focused processes. In this paper, we study a large dataset of US Senator postings on Twitter (1.68m tweets in total). Our objective is to develop an automated method to label Senatorial posts as belonging to either the problem or solution streams. Two academic policy experts labeled a subset of 3967 tweets as either problem, solution, or other (anything not problem or solution). We split off a subset of 500 tweets into a test set, with the remaining 3467 used for training. During development, this training set was further split in 60/20/20 proportions into fitting, validation, and development test sets. We investigated supervised learning methods for building problem/solution classifiers directly on the training set, evaluating their performance in terms of F1 score on the validation set, allowing us to rapidly iterate through models and hyperparameters, achieving an average weighted F1 score above 0.8 on cross-validation across the three categories using a BERTweet Base model.
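The support-weighted F1 reported above is straightforward to compute directly; the four toy tweets below are placeholders, not items from the dataset:

```python
import numpy as np

def weighted_f1(y_true, y_pred, labels=("problem", "solution", "other")):
    """Support-weighted F1 across the three tweet classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total, score = len(y_true), 0.0
    for lab in labels:
        tp = np.sum((y_pred == lab) & (y_true == lab))
        fp = np.sum((y_pred == lab) & (y_true != lab))
        fn = np.sum((y_pred != lab) & (y_true == lab))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        score += (y_true == lab).sum() / total * f1  # weight by class support
    return score

y_true = ["problem", "solution", "other", "problem"]
y_pred = ["problem", "solution", "problem", "problem"]
print(round(weighted_f1(y_true, y_pred), 3))  # 0.65
```

Weighting by support means frequent classes dominate the score, which is why per-class F1 is usually reported alongside the weighted average when classes are imbalanced.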
[NLP-115] Scaling DPPs for RAG: Density Meets Diversity
【速读】: 该论文旨在解决标准检索增强生成(Retrieval-Augmented Generation, RAG)管道中因仅依赖点对点相关性评分而忽略检索候选片段之间交互关系所导致的冗余上下文问题,进而影响生成内容的信息密度与覆盖多样性。其解决方案的关键在于提出一种名为ScalDPP的多样性感知检索机制,通过轻量级P-Adapter引入行列式点过程(Determinantal Point Processes, DPPs),实现对chunk间依赖关系的可扩展建模,并设计了一种新颖的集合级损失函数——多样边际损失(Diverse Margin Loss, DML),在DPP几何空间下强制真实互补证据链优于同等规模的冗余替代方案,从而实现信息密度与覆盖多样性的联合优化。
链接: https://arxiv.org/abs/2604.03240
作者: Xun Sun,Baiheng Xie,Li Huang,Qiang Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding generation in external knowledge, yielding relevant responses that are aligned with factual evidence and evolving corpora. Standard RAG pipelines construct context through relevance ranking, performing point-wise scoring between the user query and each corpus chunk. This formulation, however, ignores interactions among retrieved candidates, leading to redundant contexts that dilute density and fail to surface complementary evidence. We argue that effective retrieval should optimize jointly for both density and diversity, ensuring grounding evidence that is dense in information yet diverse in coverage. In this study, we propose ScalDPP, a diversity-aware retrieval mechanism for RAG that incorporates Determinantal Point Processes (DPPs) through a lightweight P-Adapter, enabling scalable modeling of inter-chunk dependencies and complementary context selection. In addition, we develop a novel set-level objective, Diverse Margin Loss (DML), that enforces ground-truth complementary evidence chains to dominate any equally sized redundant alternatives under DPP geometry. Experimental results demonstrate the superiority of ScalDPP, substantiating our core statement in practice.
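The density-meets-diversity objective can be sketched as greedy MAP selection over a relevance-weighted similarity kernel. In the toy corpus below, two near-duplicate high-relevance chunks compete with one complementary chunk, and the greedy pass prefers the diverse pair. (ScalDPP's P-Adapter and learned kernel are not modeled here.)

```python
import numpy as np

def greedy_dpp_select(embeddings, relevance, k):
    """Greedy MAP over L = diag(r) S diag(r): S is cosine similarity
    (diversity term), r is query relevance (density term)."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    L = relevance[:, None] * (X @ X.T) * relevance[None, :]
    selected = []
    for _ in range(k):
        best, best_det = None, -np.inf
        for i in range(len(X)):
            if i in selected:
                continue
            idx = selected + [i]
            det = np.linalg.det(L[np.ix_(idx, idx)])  # volume of the chosen set
            if det > best_det:
                best, best_det = i, det
        selected.append(best)
    return selected

emb = np.array([[1.0, 0.0],      # chunk 0: highly relevant
                [0.999, 0.045],  # chunk 1: near-duplicate of chunk 0
                [0.0, 1.0]])     # chunk 2: complementary evidence
rel = np.array([1.0, 0.98, 0.7])
print(greedy_dpp_select(emb, rel, k=2))  # [0, 2]: the redundant chunk is skipped
```

Pure relevance ranking would return chunks 0 and 1 here; the determinant penalizes their near-identical directions, which is exactly the redundancy the abstract argues against.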
[NLP-116] LLMs-Healthcare: Current Applications and Challenges of Large Language Models in Various Medical Specialties
【速读】: 该论文旨在解决如何系统性地总结和评估大型语言模型(Large Language Models, LLMs)在医疗健康领域的最新应用进展及其挑战的问题。其解决方案的关键在于通过全面综述LLMs在癌症护理、皮肤病学、牙科、神经退行性疾病及心理健康等多学科中的诊断与治疗功能,揭示其在提升医学诊断准确性与患者照护效率方面的创新价值,并深入分析数据多样性处理、临床整合障碍以及伦理与可靠性等关键挑战,从而为未来LLMs在跨专业医疗场景中的安全、高效部署提供理论依据与实践方向。
链接: https://arxiv.org/abs/2311.12882
作者: Ummara Mumtaz,Awais Ahmed,Summaya Mumtaz
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 26 pages and one figure
Abstract:We aim to present a comprehensive overview of the latest advancements in utilizing Large Language Models (LLMs) within the healthcare sector, emphasizing their transformative impact across various medical domains. LLMs have become pivotal in supporting healthcare, including physicians, healthcare providers, and patients. Our review provides insight into the applications of Large Language Models (LLMs) in healthcare, specifically focusing on diagnostic and treatment-related functionalities. We shed light on how LLMs are applied in cancer care, dermatology, dental care, neurodegenerative disorders, and mental health, highlighting their innovative contributions to medical diagnostics and patient care. Throughout our analysis, we explore the challenges and opportunities associated with integrating LLMs in healthcare, recognizing their potential across various medical specialties despite existing limitations. Additionally, we offer an overview of handling diverse data types within the medical field.
[NLP-117] Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency
【速读】: 该论文旨在解决当前语音语言模型在自然语音交互场景下,特别是在多步骤工具调用任务中,缺乏统一且真实世界驱动的评估基准问题。现有研究往往依赖合成数据或简化任务,难以反映实际应用中的复杂性,如语音不流畅(disfluency)处理、实时响应延迟以及对话轮次管理等挑战。解决方案的关键在于构建一个全新的基准 Full-Duplex-Bench-v3 (FDB-v3),其核心特征是基于真实人类语音音频(标注了五类不流畅类型),并配以需跨API链式调用的四类任务场景,从而系统性地评估多个主流模型在准确性、延迟和轮次交接(turn-taking)三个维度的表现。该设计使评估更贴近现实语境,揭示出不同架构(如端到端实时模型 vs. 传统级联流水线)在关键性能指标上的权衡与瓶颈。
链接: https://arxiv.org/abs/2604.04847
作者: Guan-Ting Lin,Chen Chen,Zhehuai Chen,Hung-yi Lee
机构: National Taiwan University (国立台湾大学); NVIDIA (英伟达)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: Work in progress. Demo at this https URL
Abstract:We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use. Unlike prior work, our dataset consists entirely of real human audio annotated for five disfluency categories, paired with scenarios requiring chained API calls across four task domains. We evaluate six model configurations – GPT-Realtime, Gemini Live 2.5, Gemini Live 3.1, Grok, Ultravox v0.7, and a traditional Cascaded pipeline (Whisper → GPT-4o → TTS) – across accuracy, latency, and turn-taking dimensions. GPT-Realtime leads on Pass@1 (0.600) and interruption avoidance (13.5%); Gemini Live 3.1 achieves the fastest latency (4.25 s) but the lowest turn-take rate (78.0%); and the Cascaded baseline, despite a perfect turn-take rate, incurs the highest latency (10.12 s). Across all systems, self-correction handling and multi-step reasoning under hard scenarios remain the most consistent failure modes.
[NLP-118] Large Language Models Align with the Human Brain during Creative Thinking
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成式 AI(Generative AI)任务中与人类大脑创造性思维神经表征之间对齐机制的问题,特别是针对发散性思维(divergent thinking)这一核心创造力认知过程。其关键解决方案在于利用功能磁共振成像(fMRI)数据结合表示相似性分析(Representational Similarity Analysis, RSA),系统评估不同规模和后训练目标的LLM在执行替代用途任务(Alternate Uses Task, AUT)时与人类默认模式网络(default mode network, DMN)和额顶网络(frontoparietal network, FPN)的神经活动的一致性,揭示模型大小和训练目标如何选择性地重塑其内部表征以更贴近或偏离人类创造性脑活动的几何结构。
链接: https://arxiv.org/abs/2604.03480
作者: Mete Ismayilzada,Simone A. Luchini,Abdulkadir Gokce,Badr AlKhamissi,Antoine Bosselut,Antonio Laverghetta Jr.,Lonneke van der Plas,Roger E. Beaty
机构: EPFL; Università della Svizzera italiana (USI); Wesleyan University; Paris Brain Institute (ICM); Pennsylvania State University
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under review
Abstract:Creative thinking is a fundamental aspect of human cognition, and divergent thinking-the capacity to generate novel and varied ideas-is widely regarded as its core generative engine. Large language models (LLMs) have recently demonstrated impressive performance on divergent thinking tests and prior work has shown that models with higher task performance tend to be more aligned to human brain activity. However, existing brain-LLM alignment studies have focused on passive, non-creative tasks. Here, we explore brain alignment during creative thinking using fMRI data from 170 participants performing the Alternate Uses Task (AUT). We extract representations from LLMs varying in size (270M-72B) and measure alignment to brain responses via Representational Similarity Analysis (RSA), targeting the creativity-related default mode and frontoparietal networks. We find that brain-LLM alignment scales with model size (default mode network only) and idea originality (both networks), with effects strongest early in the creative process. We further show that post-training objectives shape alignment in functionally selective ways: a creativity-optimized Llama-3.1-8B-Instruct preserves alignment with high-creativity neural responses while reducing alignment with low-creativity ones; a human behavior fine-tuned model elevates alignment with both; and a reasoning-trained variant shows the opposite pattern, suggesting chain-of-thought training steers representations away from creative neural geometry toward analytical processing. These results demonstrate that post-training objectives selectively reshape LLM representations relative to the neural geometry of human creative thought.
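Representational Similarity Analysis reduces to correlating the pairwise-dissimilarity structure of two systems over the same stimuli. The sketch below uses Pearson correlation over RDM upper triangles and random toy data; RSA studies often use Spearman instead:

```python
import numpy as np

def rsa_alignment(brain_patterns, model_embeddings):
    """Correlate the upper triangles of the two representational
    dissimilarity matrices (RDMs) built over the same stimuli."""
    def rdm(X):
        Xc = X - X.mean(axis=1, keepdims=True)
        Xn = Xc / np.linalg.norm(Xc, axis=1, keepdims=True)
        return 1.0 - Xn @ Xn.T   # correlation distance between items
    iu = np.triu_indices(len(brain_patterns), k=1)
    a, b = rdm(brain_patterns)[iu], rdm(model_embeddings)[iu]
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)
brain = rng.standard_normal((6, 10))           # 6 stimuli x 10 voxels (toy)
model = brain @ rng.standard_normal((10, 4))   # toy model embeddings
score = rsa_alignment(brain, model)
print(round(rsa_alignment(brain, brain), 3))   # 1.0: identical geometries align perfectly
```

Comparing only the dissimilarity structure is what makes RSA usable across systems with incompatible feature spaces, such as voxel responses and LLM hidden states.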
[NLP-119] Generative Chemical Language Models for Energetic Materials Discovery
【速读】: 该论文旨在解决新型高能材料(energetic materials)发现过程中因高质量数据稀缺而导致的瓶颈问题。其解决方案的关键在于采用迁移学习策略,首先在大规模化学数据上预训练生成式分子语言模型(generative molecular language models),随后利用精心筛选的高能材料数据集进行微调,从而将模型能力从主要覆盖药理学领域的化学空间拓展至高能材料领域。此外,论文还强调了基于片段的分子编码方式在构建合成可行结构方面的优势,为加速满足严苛性能要求的下一代高能材料设计提供了有效框架。
Link: https://arxiv.org/abs/2604.03304
Authors: Andrew Salij, R. Seaton Ullberg, Megan C. Davis, Marc J. Cawkwell, Christopher J. Snyder, Cristina Garcia Cardona, Ivana Matanovic, Wilton J. M. Kort-Kamp
Affiliations: Los Alamos National Laboratory
Subjects: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:The discovery of new energetic materials remains a pressing challenge hindered by limited availability of high-quality data. To address this, we have developed generative molecular language models that have been pretrained on extensive chemical data and then fine-tuned with curated energetic materials datasets. This transfer-learning strategy extends the chemical language model capabilities beyond the pharmacological space in which they have been predominantly developed, offering a framework applicable to other data-sparse discovery problems. Furthermore, we discuss the benefits of fragment-based molecular encodings for chemical language models, in particular in constructing synthetically accessible structures. Together, these advances provide a foundation for accelerating the design of next-generation energetic materials with demanding performance requirements.
Information Retrieval
[IR-0] Beyond Hard Negatives: The Importance of Score Distribution in Knowledge Distillation for Dense Retrieval SIGIR2026
[Quick Read]: This paper tackles the problem of poorly composed training data in knowledge distillation (KD): existing methods over-rely on mining hard negatives, which prevents the student from learning the teacher's full preference structure and hurts generalization. The key to the solution is a Stratified Sampling strategy that uniformly covers the teacher's entire score spectrum, preserving the variance and entropy of teacher scores so that the student more faithfully mimics the teacher's relative scoring patterns, significantly outperforming conventional top-K and random sampling on both in-domain and out-of-domain benchmarks.
Link: https://arxiv.org/abs/2604.04734
Authors: Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim
Affiliations: Korea University; Human-inspired AI Research
Subjects: Information Retrieval (cs.IR)
Comments: Accepted to SIGIR 2026 Main Conference
Abstract:Transferring knowledge from a cross-encoder teacher via Knowledge Distillation (KD) has become a standard paradigm for training retrieval models. While existing studies have largely focused on mining hard negatives to improve discrimination, the systematic composition of training data and the resulting teacher score distribution have received relatively less attention. In this work, we highlight that focusing solely on hard negatives prevents the student from learning the comprehensive preference structure of the teacher, potentially hampering generalization. To effectively emulate the teacher score distribution, we propose a Stratified Sampling strategy that uniformly covers the entire score spectrum. Experiments on in-domain and out-of-domain benchmarks confirm that Stratified Sampling, which preserves the variance and entropy of teacher scores, serves as a robust baseline, significantly outperforming top-K and random sampling in diverse settings. These findings suggest that the essence of distillation lies in preserving the diverse range of relative scores perceived by the teacher.
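The stratified idea above can be sketched concretely: bucket the teacher's scores into equal-width strata, then sample round-robin across strata so the student sees the full score spectrum rather than only hard negatives. The following is a hypothetical illustration of the sampling strategy; the function names, bin count, and round-robin policy are assumptions, not the authors' code.

```python
import random

def stratified_sample(scored_docs, k, num_bins=10, seed=0):
    """Pick k negatives that uniformly cover the teacher's score spectrum.

    scored_docs: list of (doc_id, teacher_score) pairs.
    Hypothetical sketch of the Stratified Sampling idea.
    """
    rng = random.Random(seed)
    lo = min(s for _, s in scored_docs)
    hi = max(s for _, s in scored_docs)
    width = (hi - lo) / num_bins or 1.0  # guard against all-equal scores
    bins = [[] for _ in range(num_bins)]
    for doc, s in scored_docs:
        idx = min(int((s - lo) / width), num_bins - 1)
        bins[idx].append((doc, s))
    sampled, i = [], 0
    # Round-robin over bins so every score stratum is represented.
    while len(sampled) < k and i < num_bins * (k + 1):
        b = bins[i % num_bins]
        if b:
            sampled.append(b.pop(rng.randrange(len(b))))
        i += 1
    return sampled
```

Unlike top-K sampling, every stratum of the teacher's score distribution contributes examples, which is what preserves its variance and entropy in the distillation set.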
[IR-1] Ruling Out to Rule In: Contrastive Hypothesis Retrieval for Medical Question Answering
[Quick Read]: This paper addresses retrieval bias caused by "hard negatives" in medical retrieval-augmented generation (RAG) systems: standard retrievers often surface passages that are semantically close to the query but clinically distinct, crowding out the correct diagnosis. The key to the solution is the Contrastive Hypothesis Retrieval (CHR) framework, inspired by clinical differential diagnosis, which explicitly forms two hypotheses: a target hypothesis H^+ (the most likely correct answer) and a mimic hypothesis H^- (the most plausible incorrect alternative), and ranks documents by promoting evidence aligned with H^+ while penalizing content aligned with H^-. This mechanism substantially improves retrieval accuracy, outperforming five baselines on three medical QA benchmarks by up to 10.4 percentage points, with extensive case analysis confirming substantive retrieval redirection rather than mere re-ranking.
Link: https://arxiv.org/abs/2604.04593
Authors: Byeolhee Kim, Min-Kyung Kim, Young-Hak Kim, Tae-Joon Jeon
Affiliations: Asan Medical Center; Yonsei University; University of Ulsan College of Medicine
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Retrieval-augmented generation (RAG) grounds large language models in external medical knowledge, yet standard retrievers frequently surface hard negatives that are semantically close to the query but describe clinically distinct conditions. While existing query-expansion methods improve query representation to mitigate ambiguity, they typically focus on enriching target-relevant semantics without an explicit mechanism to selectively suppress specific, clinically plausible hard negatives. This leaves the system prone to retrieving plausible mimics that overshadow the actual diagnosis, particularly when such mimics are dominant within the corpus. We propose Contrastive Hypothesis Retrieval (CHR), a framework inspired by the process of clinical differential diagnosis. CHR generates a target hypothesis H^+ for the likely correct answer and a mimic hypothesis H^- for the most plausible incorrect alternative, then scores documents by promoting H^+ -aligned evidence while penalizing H^- -aligned content. Across three medical QA benchmarks and three answer generators, CHR outperforms all five baselines in every configuration, with improvements of up to 10.4 percentage points over the next-best method. On the n=587 pooled cases where CHR answers correctly while embedded hypothetical-document query expansion does not, 85.2% have no shared documents between the top-5 retrieval lists of CHR and of that baseline, consistent with substantive retrieval redirection rather than light re-ranking of the same candidates. By explicitly modeling what to avoid alongside what to find, CHR bridges clinical reasoning with retrieval mechanism design and offers a practical path to reducing hard-negative contamination in medical RAG systems.
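CHR's scoring rule — promote H^+-aligned evidence, penalize H^--aligned content — can be sketched as a contrastive combination of embedding similarities. A minimal sketch assuming dense vector representations and a tunable penalty weight alpha (both are assumptions; the paper's exact scoring function may differ):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def chr_score(doc_vec, h_pos, h_neg, alpha=1.0):
    # Promote alignment with the target hypothesis H+ and
    # penalize alignment with the mimic hypothesis H-.
    return cosine(doc_vec, h_pos) - alpha * cosine(doc_vec, h_neg)
```

A document close to the mimic hypothesis is pushed down the ranking even when it is superficially similar to the query, which is how hard-negative contamination is suppressed.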
[IR-2] SLSREC: Self-Supervised Contrastive Learning for Adaptive Fusion of Long- and Short-Term User Interests
[Quick Read]: This paper addresses the problem of user interests evolving across timescales: how to model the divergence between long-term preferences and short-term intentions so as to improve recommendation accuracy and robustness. Conventional methods fuse long- and short-term interests into a single representation, conflating information and reducing precision. The key to the solution is SLSRec, which disentangles long- and short-term interest representations within a self-supervised learning framework, introduces a contrastive learning strategy to calibrate the two representations, and designs an attention-based fusion network that adaptively integrates interests across timescales, thereby capturing evolving user behavior more precisely and improving recommendation performance.
Link: https://arxiv.org/abs/2604.04530
Authors: Wei Zhou, Yue Shen, Junkai Ji, Yinglan Feng, Xing Tang, Xiuqiang He, Liang Feng, Zexuan Zhu
Affiliations: Shenzhen University; Shenzhen Technology University; Chongqing University
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:User interests typically encompass both long-term preferences and short-term intentions, reflecting the dynamic nature of user behaviors across different timeframes. The uneven temporal distribution of user interactions highlights the evolving patterns of interests, making it challenging to accurately capture shifts in interests using comprehensive historical behaviors. To address this, we propose SLSRec, a novel Session-based model with the fusion of Long- and Short-term Recommendations that effectively captures the temporal dynamics of user interests by segmenting historical behaviors over time. Unlike conventional models that combine long- and short-term user interests into a single representation, compromising recommendation accuracy, SLSRec utilizes a self-supervised learning framework to disentangle these two types of interests. A contrastive learning strategy is introduced to ensure accurate calibration of long- and short-term interest representations. Additionally, an attention-based fusion network is designed to adaptively aggregate interest representations, optimizing their integration to enhance recommendation performance. Extensive experiments on three public benchmark datasets demonstrate that SLSRec consistently outperforms state-of-the-art models while exhibiting superior robustness across various settings. We will release all source code upon acceptance.
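The attention-based fusion step described above can be illustrated as a softmax over the affinities of the long- and short-term interest vectors with the current session context. This is a hypothetical sketch; the actual model learns these representations and projections end to end.

```python
import math

def attention_fuse(long_v, short_v, query):
    """Adaptively mix long- and short-term interest vectors, weighting
    each by its dot-product affinity with the current context query."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    logits = [dot(query, long_v), dot(query, short_v)]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # numerically stable softmax
    z = sum(exps)
    w_long, w_short = exps[0] / z, exps[1] / z
    return [w_long * a + w_short * b for a, b in zip(long_v, short_v)]
```

When the context resembles the short-term intent, the softmax shifts weight toward the short-term vector, and vice versa, rather than collapsing both into one fixed representation.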
[IR-3] SuperLocalMemory V3.3: The Living Brain – Biologically-Inspired Forgetting, Cognitive Quantization, and Multi-Channel Retrieval for Zero-LLM Agent Memory Systems
[Quick Read]: This paper targets a fundamental memory deficit of current AI coding agents: despite vast parametric knowledge, they cannot maintain continuous memory across sessions. Existing memory systems typically depend on cloud LLMs for core operations, support only single-channel text retrieval, and lack the cognitive processes central to human memory. The core of the solution is SuperLocalMemory V3.3 ("The Living Brain"), a local-first agent memory system that implements the full cognitive memory taxonomy with mathematical lifecycle dynamics. Key techniques include: (1) Fisher-Rao Quantization-Aware Distance (FRQAD), reaching 100% precision at preferring high-fidelity embeddings; (2) Ebbinghaus adaptive forgetting coupled with lifecycle-aware embedding compression, yielding 6.7x discriminative power; (3) a seven-channel cognitive retrieval architecture spanning semantic, keyword, entity-graph, temporal, spreading-activation, consolidation, and Hopfield associative channels, achieving 70.4% on the LoCoMo benchmark in zero-LLM mode; (4) soft-prompt-based parameterization of long-term implicit memory; and (5) an automated zero-friction cognitive pipeline for full-lifecycle memory management. The approach markedly improves multi-hop (+23.8pp) and adversarial (+12.7pp) performance while running entirely locally, with no cloud LLM required.
Link: https://arxiv.org/abs/2604.04514
Authors: Varun Pratap Bhardwaj
Affiliations: Accenture
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 19 pages, 4 figures, 11 tables. Third paper in the SuperLocalMemory trilogy. Code: this https URL (v3.3.26). npm: superlocalmemory. PyPI: superlocalmemory
Abstract:AI coding agents operate in a paradox: they possess vast parametric knowledge yet cannot remember a conversation from an hour ago. Existing memory systems store text in vector databases with single-channel retrieval, require cloud LLMs for core operations, and implement none of the cognitive processes that make human memory effective. We present SuperLocalMemory V3.3 (“The Living Brain”), a local-first agent memory system implementing the full cognitive memory taxonomy with mathematical lifecycle dynamics. Building on the information-geometric foundations of V3.2 (arXiv:2603.14588), we introduce five contributions: (1) Fisher-Rao Quantization-Aware Distance (FRQAD) – a new metric on the Gaussian statistical manifold achieving 100% precision at preferring high-fidelity embeddings over quantized ones (vs 85.6% for cosine), with zero prior art; (2) Ebbinghaus Adaptive Forgetting with lifecycle-aware quantization – the first mathematical forgetting curve in local agent memory coupled to progressive embedding compression, achieving 6.7x discriminative power; (3) 7-channel cognitive retrieval spanning semantic, keyword, entity graph, temporal, spreading activation, consolidation, and Hopfield associative channels, achieving 70.4% on LoCoMo in zero-LLM Mode A; (4) memory parameterization implementing Long-Term Implicit memory via soft prompts; (5) zero-friction auto-cognitive pipeline automating the complete memory lifecycle. On LoCoMo, V3.3 achieves 70.4% in Mode A (zero-LLM), with +23.8pp on multi-hop and +12.7pp on adversarial. V3.2 achieved 74.8% Mode A and 87.7% Mode C; the 4.4pp gap reflects a deliberate architectural trade-off. SLM V3.3 is open source under the Elastic License 2.0, runs entirely on CPU, with over 5,000 monthly downloads.
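The Ebbinghaus forgetting curve coupled to lifecycle-aware quantization can be sketched as follows: retention R = exp(-t/S) decays with elapsed time t and memory strength S, and a tiering rule maps retention to embedding precision. The thresholds and tier names below are illustrative assumptions, not the values used in SuperLocalMemory.

```python
import math

def retention(elapsed_hours, strength):
    """Ebbinghaus forgetting curve: R = exp(-t / S)."""
    return math.exp(-elapsed_hours / strength)

def lifecycle_tier(r):
    # Hypothetical mapping from retention to embedding precision;
    # fresher / stronger memories keep higher-fidelity embeddings.
    if r > 0.5:
        return "float32"
    if r > 0.1:
        return "int8"
    return "int4"
```

As a memory ages without reinforcement, its retention drops and its embedding is progressively compressed, which is the lifecycle dynamic the abstract describes.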
[IR-4] Retrieval Augmented Conversational Recommendation with Reinforcement Learning
[Quick Read]: This paper addresses two core problems in current LLM-based conversational recommender systems (CRS): existing methods rely on pretrained knowledge without external retrieval mechanisms, making novel items hard to handle; and the lack of a unified corpus makes retrieval augmentation difficult to integrate into CRS. The key to the solution is RAR, a two-stage retrieval-augmented conversational recommendation framework. Its innovations are: a unified corpus of over 300k movies with rich metadata; a dynamic bridge between retrieval and generation, where a retriever first produces candidates from user history and an LLM then refines recommendations using conversational context and retrieved results; and, crucially, a reinforcement learning (RL) mechanism driven by LLM feedback that forms a collaborative feedback loop to iteratively improve the retriever, boosting accuracy and factual consistency, reducing hallucinations, and capturing subtle user intent.
Link: https://arxiv.org/abs/2604.04457
Authors: Zhenrui Yue, Honglei Zhuang, Zhen Qin, Zhankui He, Huimin Zeng, Julian McAuley, Dong Wang
Affiliations: University of Illinois Urbana-Champaign; Google DeepMind; UC San Diego
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Large language models (LLMs) exhibit enhanced capabilities in language understanding and generation. By utilizing their embedded knowledge, LLMs are increasingly used as conversational recommender systems (CRS), achieving improved performance across diverse scenarios. However, existing LLM-based methods rely on pretrained knowledge without external retrieval mechanisms for novel items. Additionally, the lack of a unified corpus poses challenges for integrating retrieval augmentation into CRS. Motivated by these challenges, we present RAR, a novel two-stage retrieval augmented conversational recommendation framework that aligns retrieval and generation to enhance both performance and factuality. To support this framework and provide a unified corpus, we construct a large-scale movie corpus, comprising over 300k movies with rich metadata, such as titles, casts and plot summaries. Leveraging this data, our primary contribution is RAR, the first framework to depart from the standard two-stage CRS pipeline by dynamically bridging retrieval and generation. First, a retriever model generates candidate items based on user history; in the subsequent stage, an LLM refines the recommendations by incorporating conversational context with retrieved results. In addition, we introduce a novel reinforcement learning (RL) method that leverages LLM feedback to iteratively update the retriever. By creating a collaborative feedback loop that reinforces sampled candidate sets with higher ranking metrics, RAR effectively mitigates the misalignment between the retrieval and generation stages. Furthermore, grounding the LLM in factual metadata allows our RL-driven approach to capture subtle user intentions and generate context-aware recommendations with reduced hallucinations. We validate our approach through extensive experiments on multiple benchmarks, where RAR consistently outperforms state-of-the-art baseline methods.
[IR-5] FAVE: Flow-based Average Velocity Establishment for Sequential Recommendation SIGIR2026
[Quick Read]: This paper addresses two efficiency bottlenecks of flow-based generative recommendation: the "Noise-to-Data" paradigm causes prior mismatch, forcing generation to start from uninformative noise and traverse a lengthy recovery trajectory; and linear redundancy, where iterative solvers waste computation modeling deterministic preference transitions. The key to the solution is the Flow-based Average Velocity Establishment (Fave) framework: Stage 1 stabilizes the preference space via dual-end semantic alignment to prevent representation collapse; Stage 2 introduces a semantic anchor prior that uses a masked embedding of the user's interaction history as an informative starting point and learns a global average velocity, consolidating the multi-step trajectory into a single displacement vector, with a JVP (Jacobian-vector product) consistency constraint enforcing trajectory straightness to enable true one-step generation. This substantially improves inference efficiency while maintaining state-of-the-art recommendation performance.
Link: https://arxiv.org/abs/2604.04427
Authors: Ke Shi, Yao Zhang, Feng Guo, Jinyuan Zhang, JunShuo Zhang, Shen Gao, Shuo Shang
Affiliations: University of Electronic Science and Technology of China; State Key Laboratory of Internet Architecture
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Accepted by SIGIR 2026
Abstract:Generative recommendation has emerged as a transformative paradigm for capturing the dynamic evolution of user intents in sequential recommendation. While flow-based methods improve the efficiency of diffusion models, they remain hindered by the “Noise-to-Data” paradigm, which introduces two critical inefficiencies: prior mismatch, where generation starts from uninformative noise, forcing a lengthy recovery trajectory; and linear redundancy, where iterative solvers waste computation on modeling deterministic preference transitions. To address these limitations, we propose a Flow-based Average Velocity Establishment (Fave) framework for one-step generation recommendation that learns a direct trajectory from an informative prior to the target distribution. Fave is structured via a progressive two-stage training strategy. In Stage 1, we establish a stable preference space through dual-end semantic alignment, applying constraints at both the source (user history) and target (next item) to prevent representation collapse. In Stage 2, we directly resolve the efficiency bottlenecks by introducing a semantic anchor prior, which initializes the flow with a masked embedding from the user’s interaction history, providing an informative starting point. Then we learn a global average velocity, consolidating the multi-step trajectory into a single displacement vector, and enforce trajectory straightness via a JVP-based consistency constraint to ensure one-step generation. Extensive experiments on three benchmarks demonstrate that Fave not only achieves state-of-the-art recommendation performance but also delivers an order-of-magnitude improvement in inference efficiency, making it practical for latency-sensitive scenarios.
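Fave's one-step generation reduces to a single displacement from an informative prior: form a semantic anchor from the (masked) history embeddings, then add the learned global average velocity. A toy sketch with plain lists standing in for embeddings; the anchor construction and the velocity function here are assumptions for illustration.

```python
def semantic_anchor(history_embs, mask):
    """Informative prior: mean of the unmasked history item embeddings."""
    kept = [e for e, m in zip(history_embs, mask) if m]
    n = len(kept)
    return [sum(col) / n for col in zip(*kept)]

def one_step_generate(anchor, avg_velocity_fn):
    # One-step flow: x1 = x0 + u(x0), where u is the learned global
    # average velocity. This single displacement replaces a multi-step
    # ODE solve from noise to data.
    return [a + d for a, d in zip(anchor, avg_velocity_fn(anchor))]
```

Starting from an anchor already close to the target distribution is what removes the lengthy noise-to-data recovery trajectory.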
[IR-6] A Logical-Rule Autoencoder for Interpretable Recommendations
[Quick Read]: This paper addresses the lack of intrinsic interpretability in deep learning recommendation models, which typically operate as black boxes whose decisions rest on opaque latent representations, raising concerns in applications requiring transparency and accountability. The key to the solution is the Logical-rule Interpretable Autoencoder (LIA), which introduces a learnable logical rule layer in which each rule neuron carries a gate parameter that automatically selects between AND and OR operators, enabling diverse logical patterns to be discovered directly from data. LIA also encodes negation through the sign of connection weights, achieving functional completeness without increasing input dimensionality, so the model learns explicit, human-readable reconstruction rules that let users directly trace the decision path behind each recommendation.
Link: https://arxiv.org/abs/2604.04270
Authors: Jinhao Pan, Bowen Wei, Ziwei Zhu
Affiliations: George Mason University
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:Most deep learning recommendation models operate as black boxes, relying on latent representations that obscure their decision process. This lack of intrinsic interpretability raises concerns in applications that require transparency and accountability. In this work, we propose a Logical-rule Interpretable Autoencoder (LIA) for collaborative filtering that is interpretable by design. LIA introduces a learnable logical rule layer in which each rule neuron is equipped with a gate parameter that automatically selects between AND and OR operators during training, enabling the model to discover diverse logical patterns directly from data. To support functional completeness without doubling the input dimensionality, LIA encodes negation through the sign of connection weights, providing a parameter-efficient mechanism for expressing both positive and negated item conditions within each rule. By learning explicit, human-readable reconstruction rules, LIA allows users to directly trace the decision process behind each recommendation. Extensive experiments show that our method achieves improved recommendation performance over traditional baselines while remaining fully interpretable. Code and data are available at this https URL.
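A single LIA-style rule neuron can be sketched with the two mechanisms the abstract describes: a gate choosing between AND (soft min) and OR (soft max), and negation encoded in the sign of the connection weight. This is an illustrative reading of the design, not the paper's implementation.

```python
def rule_neuron(inputs, weights, gate):
    """One logical rule neuron (hypothetical sketch of LIA's rule layer).

    inputs:  truth values in [0, 1], one per item condition.
    weights: sign encodes negation (w < 0 means NOT x), so negated
             literals need no extra input dimensions.
    gate:    >= 0.5 selects AND (min over literals), else OR (max).
    """
    lits = [x if w >= 0 else 1.0 - x for x, w in zip(inputs, weights)]
    return min(lits) if gate >= 0.5 else max(lits)
```

Because min/max over literals is readable as a Boolean formula, each learned neuron directly yields a human-readable rule such as "liked A AND NOT B".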
[IR-7] A Semi-Automated Annotation Workflow for Paediatric Histopathology Reports Using Small Language Models
【速读】:该论文旨在解决电子病历(Electronic Patient Record, EPR)系统中大量临床信息被存储于非结构化文本格式而导致难以用于研究与决策的问题。传统大语言模型虽能提取此类信息,但存在本地运行资源消耗高、云端处理引发隐私风险等局限。为此,作者提出一种基于小型语言模型(Small Language Models, SLMs)的资源高效半自动化标注工作流,通过将信息抽取任务建模为基于临床指导实体规则和少量示例的问答(Question-Answering)任务,在仅使用CPU硬件条件下实现对儿科病理报告(以肾活检报告为例)的结构化信息提取。关键创新在于结合临床专家指导的实体定义与少样本提示(few-shot examples),显著提升SLMs在特定临床领域中的准确率(最高达84.3%),同时减少人工标注需求,从而兼顾性能、隐私保护与计算效率。
Link: https://arxiv.org/abs/2604.04168
Authors: Avish Vijayaraghavan, Jaskaran Singh Kawatra, Sebin Sabu, Jonny Sheldon, Will Poulett, Alex Eze, Daniel Key, John Booth, Shiren Patel, Jonny Pearson, Dan Schofield, Jonathan Hope, Pavithra Rajendran, Neil Sebire
Affiliations: Imperial College London
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 36 pages, includes supplementary information
Abstract:Electronic Patient Record (EPR) systems contain valuable clinical information, but much of it is trapped in unstructured text, limiting its use for research and decision-making. Large language models can extract such information but require substantial computational resources to run locally, and sending sensitive clinical data to cloud-based services, even when deidentified, raises significant patient privacy concerns. In this study, we develop a resource-efficient semi-automated annotation workflow using small language models (SLMs) to extract structured information from unstructured EPR data, focusing on paediatric histopathology reports. As a proof-of-concept, we apply the workflow to paediatric renal biopsy reports, a domain chosen for its constrained diagnostic scope and well-defined underlying biology. We develop the workflow iteratively with clinical oversight across three meetings, manually annotating 400 reports from a dataset of 2,111 at Great Ormond Street Hospital as a gold standard, while developing an automated information extraction approach using SLMs. We frame extraction as a Question-Answering task grounded by clinician-guided entity guidelines and few-shot examples, evaluating five instruction-tuned SLMs with a disagreement modelling framework to prioritise reports for clinical review. Gemma 2 2B achieves the highest accuracy at 84.3%, outperforming off-the-shelf models including spaCy (74.3%), BioBERT-SQuAD (62.3%), RoBERTa-SQuAD (59.7%), and GLiNER (60.2%). Entity guidelines improved performance by 7-19% over the zero-shot baseline, and few-shot examples by 6-38%, though their benefits do not compound when combined. These results demonstrate that SLMs can extract structured information from specialised clinical domains on CPU-only infrastructure with minimal clinician involvement. Our code is available at this https URL.
[IR-8] Formalized Information Needs Improve Large-Language-Model Relevance Judgments SIGIR2026
[Quick Read]: This paper addresses the reduced reliability of retrieval evaluation when large language models (LLMs) act as relevance assessors from queries alone. The key to the solution is to have LLMs synthetically formalize information needs into retrieval topics that follow the established structure of human assessment (i.e., descriptions and narratives), which improves the consistency of LLM relevance judgments and their agreement with human assessors, thereby increasing the reliability of retrieval evaluation.
Link: https://arxiv.org/abs/2604.04140
Authors: Jüri Keller, Maik Fröbe, Björn Engelmann, Fabian Haak, Timo Breuer, Birger Larsen, Philipp Schaer
Affiliations: TH Köln, University of Applied Sciences; Friedrich-Schiller-Universität Jena; Aalborg University
Subjects: Information Retrieval (cs.IR)
Comments: Accepted to ACM SIGIR 2026. This is the Author’s Accepted Manuscript
Abstract:Cranfield-style retrieval evaluations with too few or too many relevant documents or with low inter-assessor agreement on relevance can reduce the reliability of observations. In evaluations with human assessors, information needs are often formalized as retrieval topics to avoid an excessive number of relevant documents while maintaining good agreement. However, emerging evaluation setups that use Large Language Models (LLMs) as relevance assessors often use only queries, potentially decreasing the reliability. To study whether LLM relevance assessors benefit from formalized information needs, we synthetically formalize information needs with LLMs into topics that follow the established structure from previous human relevance assessments (i.e., descriptions and narratives). We compare assessors using synthetically formalized topics against the LLM-default query-only assessor on Robust04 and the 2019/2020 editions of TREC Deep Learning. We find that assessors without formalization judge many more documents relevant and have a lower agreement, leading to reduced reliability in retrieval evaluations. Furthermore, we show that the formalized topics improve agreement between human and LLM relevance judgments, even when the topics are not highly similar to their human counterparts. Our findings indicate that LLM relevance assessors should use formalized information needs, as is standard for human assessment, and synthetically formalize topics when no human formalization exists to improve evaluation reliability.
[IR-9] FLAME: Condensing Ensemble Diversity into a Single Network for Efficient Sequential Recommendation SIGIR2026
[Quick Read]: This paper addresses the difficulty of a single network capturing diverse user behavior patterns in sequential recommendation, while avoiding the high computational cost and optimization instability of training ensembles from scratch. The key to the solution is FLAME (Frozen and Learnable networks with Aligned Modular Ensemble): a modular ensemble simulates exponential diversity with only two networks, where one network is pretrained and frozen as a semantic anchor while the other dynamically combines sub-modules to generate rich representation patterns; guided mutual learning then aligns the diverse representations into the learnable network's space, ensuring stable optimization. At inference, only the learnable network is used, achieving ensemble-level performance with no extra overhead.
Link: https://arxiv.org/abs/2604.04038
Authors: WooJoo Kim, JunYoung Kim, JaeHyung Lim, SeongJin Choi, SeongKu Kang, HwanJo Yu
Affiliations: Pohang University of Science and Technology; Korea University
Subjects: Information Retrieval (cs.IR)
Comments: Accepted to SIGIR 2026 full papers track
Abstract:Sequential recommendation requires capturing diverse user behaviors, which a single network often fails to do. While ensemble methods mitigate this by leveraging multiple networks, training them all from scratch leads to high computational cost and instability from noisy mutual supervision. We propose Frozen and Learnable networks with Aligned Modular Ensemble (FLAME), a novel framework that condenses ensemble-level diversity into a single network for efficient sequential recommendation. During training, FLAME simulates exponential diversity using only two networks via modular ensemble. By decomposing each network into sub-modules (e.g., layers or blocks) and dynamically combining them, FLAME generates a rich space of diverse representation patterns. To stabilize this process, we pretrain and freeze one network to serve as a semantic anchor and employ guided mutual learning. This aligns the diverse representations into the space of the remaining learnable network, ensuring robust optimization. Consequently, at inference, FLAME utilizes only the learnable network, achieving ensemble-level performance with zero overhead compared to a single network. Experiments on six datasets show that FLAME outperforms state-of-the-art baselines, achieving up to 7.69× faster convergence and 9.70% improvement in NDCG@20. We provide the source code of FLAME at this https URL.
[IR-10] MisEdu-RAG: A Misconception-Aware Dual-Hypergraph RAG for Novice Math Teachers
[Quick Read]: This paper addresses the challenges novice math teachers face in diagnosing and remediating student mistakes, especially conceptual misconceptions. Existing large language models (LLMs) can generate instructional feedback but only loosely connect pedagogical knowledge with student mistakes, making the guidance less actionable. The key to the solution is MisEdu-RAG, a dual-hypergraph retrieval-augmented generation (RAG) framework that organizes pedagogical knowledge as a concept hypergraph and real student mistake cases as an instance hypergraph; a two-stage retrieval gathers relevant evidence from both layers and generates precise responses grounded in the retrieved cases and pedagogical principles. Experiments show a 10.95% token-F1 improvement and significant gains over baselines across five dimensions of response quality, largest on Diversity and Empowerment, with teacher surveys and interviews confirming usability and effectiveness in real teaching scenarios.
Link: https://arxiv.org/abs/2604.04036
Authors: Zhihan Guo, Rundong Xue, Yuting Lu, Jionghao Lin
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:
Abstract:Novice math teachers often encounter students’ mistakes that are difficult to diagnose and remediate. Misconceptions are especially challenging because teachers must explain what went wrong and how to solve them. Although many existing large language model (LLM) platforms can assist in generating instructional feedback, these LLMs loosely connect pedagogical knowledge and student mistakes, which might make the guidance less actionable for teachers. To address this gap, we propose MisEdu-RAG, a dual-hypergraph-based retrieval-augmented generation (RAG) framework that organizes pedagogical knowledge as a concept hypergraph and real student mistake cases as an instance hypergraph. Given a query, MisEdu-RAG performs a two-stage retrieval to gather connected evidence from both layers and generates a response grounded in the retrieved cases and pedagogical principles. We evaluate on MisstepMath, a dataset of math mistakes paired with teacher solutions, as a benchmark for misconception-aware retrieval and response generation across topics and error types. Evaluation results on MisstepMath show that, compared with baseline models, MisEdu-RAG improves token-F1 by 10.95% and yields up to 15.3% higher five-dimension response quality, with the largest gains on Diversity and Empowerment. To verify its applicability in practical use, we further conduct a pilot study through a questionnaire survey of 221 teachers and interviews with 6 novices. The findings suggest that MisEdu-RAG provides diagnosis results and concrete teaching moves for high-demand misconception scenarios. Overall, MisEdu-RAG demonstrates strong potential for scalable teacher training and AI-assisted instruction for misconception handling. Our code is available on GitHub: this https URL.
[IR-11] Semantic IDs for Recommender Systems at Snapchat: Use Cases, Technical Challenges, and Design Choices SIGIR2026
[Quick Read]: This paper addresses the limited expressiveness and scalability of conventional atomic IDs in recommender systems (RecSys), particularly the difficulty of capturing collaborative-filtering signals and personalizing effectively under high item cardinality. The key to the solution is Semantic IDs (SIDs): items are mapped to ordered code sequences produced by tokenizers such as residual quantization, achieving semantic clustering and better model generalization within a low-cardinality ID space. SIDs drastically reduce ID cardinality and strengthen the model's reasoning over user behavioral histories, with positive effects validated in Snapchat's production ranking models and retrieval stages.
Link: https://arxiv.org/abs/2604.03949
Authors: Clark Mingxuan Ju, Tong Zhao, Leonardo Neves, Liam Collins, Bhuvesh Kumar, Jiwen Ren, Lili Zhang, Wenfeng Zhuo, Vincent Zhang, Xiao Bai, Jinchao Li, Karthik Iyer, Zihao Fan, Yilun Xu, Yiwen Chen, Peicheng Yu, Manish Malik, Neil Shah
Affiliations: Snap Inc.
Subjects: Information Retrieval (cs.IR)
Comments: Accepted to the Industry Track of SIGIR 2026
Abstract:Effective item identifiers (IDs) are an important component for recommender systems (RecSys) in practice, and are commonly adopted in many use cases such as retrieval and ranking. IDs can encode collaborative filtering signals within training data, such that RecSys models can extrapolate during the inference and personalize the prediction based on users’ behavioral histories. Recently, Semantic IDs (SIDs) have become a trending paradigm for RecSys. In comparison to the conventional atomic ID, an SID is an ordered list of codes, derived from tokenizers such as residual quantization, applied to semantic representations commonly extracted from foundation models or collaborative signals. SIDs have drastically smaller cardinality than the atomic counterpart, and induce semantic clustering in the ID space. At Snapchat, we apply SIDs as auxiliary features for ranking models, and also explore SIDs as additional retrieval sources in different ML applications. In this paper, we discuss practical technical challenges we encountered while applying SIDs, experiments we have conducted, and design choices we have iterated to mitigate these challenges. Backed by promising offline results on both internal data and academic benchmarks as well as online A/B studies, SID variants have been launched in multiple production models with positive metrics impact.
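The residual-quantization tokenizer behind SIDs can be sketched as follows: at each level, pick the nearest codeword and pass the residual to the next level, yielding an ordered code list with far smaller cardinality than atomic IDs. A minimal sketch with toy codebooks; in practice the codebooks are learned and the distance choice may differ.

```python
def residual_quantize(vec, codebooks):
    """Map an embedding to a Semantic ID (sketch of residual quantization).

    codebooks: list of levels, each a list of codeword vectors.
    Returns the ordered list of chosen codeword indices, one per level.
    """
    sid, residual = [], list(vec)
    for cb in codebooks:
        # Nearest codeword at this level by squared L2 distance.
        idx = min(range(len(cb)),
                  key=lambda j: sum((r - c) ** 2
                                    for r, c in zip(residual, cb[j])))
        sid.append(idx)
        # Quantize away the chosen codeword; the next level refines what's left.
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return sid
```

Items with similar embeddings share code prefixes, which is what induces the semantic clustering in the ID space that the paper exploits.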
[IR-12] Rank, Don't Generate: Statement-level Ranking for Explainable Recommendation
[Quick Read]: This paper addresses the difficulty of evaluating generated recommendation explanations: existing LLM-generated explanations are prone to hallucination, resist fine-grained factual verification, and lack standardized evaluation. The core of the solution is to reformulate explanation generation as statement-level ranking: the system extracts candidate explanatory statements from user reviews that are factually grounded, atomic, and unique, ranks them by relevance score, and returns the top-k as the explanation. This paradigm mitigates hallucination by construction, supports reproducible evaluation with standard ranking metrics, and explicitly models the importance of recommendation factors. The key innovations are a scalable pipeline combining LLM-based extraction with semantic clustering to build high-quality statement sets, plus the StaR benchmark for systematic evaluation.
Link: https://arxiv.org/abs/2604.03724
Authors: Ben Kabongo, Arthur Satouf, Vincent Guigue
Affiliations: Sorbonne University, CNRS, ISIR; Air Liquide, Université Paris-Saclay; AgroParisTech, UMR MIA-Paris
Subjects: Information Retrieval (cs.IR)
Comments: 11 pages, 6 tables, 5 figures
Abstract:Textual explanations, generated with large language models (LLMs), are increasingly used to justify recommendations. Yet, evaluating these explanations remains a critical challenge. We advocate a shift in objective: rank, don’t generate. We formalize explainable recommendation as a statement-level ranking problem, where systems rank candidate explanatory statements derived from reviews and return the top-k as explanation. This formulation mitigates hallucination by construction and enables fine-grained factual analysis. It also models factor importance through relevance scores and supports standardized, reproducible evaluation with established ranking metrics. Meaningful assessment, however, requires each statement to be explanatory (item facts affecting user experience), atomic (one opinion about one aspect), and unique (paraphrases consolidated), which is challenging to obtain from noisy reviews. We address this with (i) an LLM-based extraction pipeline producing explanatory and atomic statements, and (ii) a scalable, semantic clustering method consolidating paraphrases to enforce uniqueness. Building on this pipeline, we introduce StaR, a benchmark for statement ranking in explainable recommendation, constructed from four Amazon Reviews 2014 product categories. We evaluate popularity-based baselines and state-of-the-art models under global-level (all statements) and item-level (target item statements) ranking. Popularity baselines are competitive in global-level ranking but outperform state-of-the-art models on average in item-level ranking, exposing critical limitations in personalized explanation ranking.
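Because StaR casts explanation as ranking, systems can be scored with standard metrics such as NDCG@k over statement relevance labels. A minimal self-contained implementation of the standard log2-discounted formulation (the paper may use additional metrics):

```python
import math

def ndcg_at_k(ranked_rels, k):
    """NDCG@k for a ranked list of statement relevance grades.

    ranked_rels: relevance grades in the order the system ranked them.
    """
    def dcg(rels):
        # Discounted cumulative gain with log2(rank + 1) discounting.
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0
```

A system that puts every relevant statement first scores 1.0; misplacing relevant statements lower in the list is penalized by the logarithmic discount.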
[IR-13] Fusion and Alignment Enhancement with Large Language Models for Tail-item Sequential Recommendation
Quick Read: This paper targets the tail-item problem in sequential recommendation (SR): because most items have sparse interactions, models struggle to capture their transition patterns accurately. Although prior work enriches tail-item representations with semantic embeddings derived from large language models (LLMs), two challenges remain: (1) collaborative signals and semantic knowledge are fused insufficiently, yielding suboptimal embedding quality; and (2) structural inconsistency between the ID embedding space and the LLM embedding space produces conflicting signals that degrade recommendation accuracy. The key of the proposed Fusion and Alignment Enhancement framework with LLMs for Tail-item Sequential Recommendation (FAERec) is threefold: an adaptive gating mechanism dynamically fuses ID and LLM embeddings for complementary information; a dual-level alignment strategy establishes item-level correspondences between the two spaces via contrastive learning and constrains feature-level correlation patterns across corresponding dimensions; and a curriculum learning scheduler adjusts the weights of the two alignments to avoid premature optimization of the complex feature-level objective, improving embedding consistency and recommendation performance.
Link: https://arxiv.org/abs/2604.03688
Authors: Zhifu Wei, Yizhou Dang, Guibing Guo, Chuang Zhao, Zhu Sun
Affiliations: Northeastern University; Tianjin University; Singapore University of Technology and Design
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Sequential Recommendation (SR) learns user preferences from their historical interaction sequences and provides personalized suggestions. In real-world scenarios, most items exhibit sparse interactions, known as the tail-item problem. This issue limits the model’s ability to accurately capture item transition patterns. To tackle this, large language models (LLMs) offer a promising solution by capturing semantic relationships between items. Despite previous efforts to leverage LLM-derived embeddings for enriching tail items, they still face the following limitations: 1) They struggle to effectively fuse collaborative signals with semantic knowledge, leading to suboptimal item embedding quality. 2) Existing methods overlook the structural inconsistency between the ID and LLM embedding spaces, causing conflicting signals that degrade recommendation accuracy. In this work, we propose a Fusion and Alignment Enhancement framework with LLMs for Tail-item Sequential Recommendation (FAERec), which improves item representations by generating coherently-fused and structurally consistent embeddings. For the information fusion challenge, we design an adaptive gating mechanism that dynamically fuses ID and LLM embeddings. Then, we propose a dual-level alignment approach to mitigate structural inconsistency. The item-level alignment establishes correspondences between ID and LLM embeddings of the same item through contrastive learning, while the feature-level alignment constrains the correlation patterns between corresponding dimensions across the two embedding spaces. Furthermore, the weights of the two alignments are adjusted by a curriculum learning scheduler to avoid premature optimization of the complex feature-level objective. Extensive experiments across three widely used datasets with multiple representative SR backbones demonstrate the effectiveness and generalizability of our framework.
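The adaptive gating fusion described in the abstract can be illustrated with a minimal numpy sketch. Note this is a hedged illustration, not FAERec's actual architecture: the sigmoid gate over concatenated embeddings, the dimensions, and the random parameters are all assumptions for demonstration. The idea is that a learned gate g mixes the ID embedding and LLM embedding elementwise.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(e_id, e_llm, W, b):
    """Elementwise gate over concatenated ID / LLM embeddings.

    e_id, e_llm: (d,) item embeddings from the two spaces.
    W: (d, 2d) gate weights, b: (d,) bias -- learned in practice,
    random here for illustration.
    """
    g = sigmoid(W @ np.concatenate([e_id, e_llm]) + b)  # (d,) gate in (0, 1)
    return g * e_id + (1.0 - g) * e_llm                 # convex mix per dim

rng = np.random.default_rng(0)
d = 8
e_id, e_llm = rng.normal(size=d), rng.normal(size=d)
W, b = rng.normal(size=(d, 2 * d)) * 0.1, np.zeros(d)
fused = gated_fusion(e_id, e_llm, W, b)
# Each fused coordinate lies between the two source coordinates,
# so neither signal is discarded entirely.
assert np.all(fused >= np.minimum(e_id, e_llm) - 1e-9)
assert np.all(fused <= np.maximum(e_id, e_llm) + 1e-9)
```

Because the gate output is strictly inside (0, 1), the fusion is a per-dimension convex combination, which is one simple way such gating keeps both collaborative and semantic information in play.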
[IR-14] LightThinker: From Reasoning Compression to Memory Management
Quick Read: This paper addresses the surging cognitive overhead caused by long thought traces when large language models (LLMs) perform complex reasoning, which limits their efficiency and scalability. The key is LightThinker and its enhanced version LightThinker++: the former dynamically compresses intermediate thoughts into compact semantic representations; the latter further introduces Explicit Adaptive Memory Management, managing memory resources at the behavioral level and training purposeful memory scheduling with a dedicated trajectory synthesis pipeline. This alleviates the logical bottlenecks of static compression and markedly improves reasoning efficiency and accuracy, with especially stable performance gains and cost-effectiveness on long-horizon agentic tasks.
Link: https://arxiv.org/abs/2604.03679
Authors: Yuqi Zhu, Jintian Zhang, Zhenjie Wan, Yujie Luo, Shuofei Qiao, Zhengke Gui, Da Zheng, Lei Liang, Huajun Chen, Ningyu Zhang
Affiliations: Zhejiang University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Work in progress. This is an extended version of LightThinker
Abstract:Large language models (LLMs) excel at complex reasoning, yet their efficiency is limited by the surging cognitive overhead of long thought traces. In this paper, we propose LightThinker, a method that enables LLMs to dynamically compress intermediate thoughts into compact semantic representations. However, static compression often struggles with complex reasoning where the irreversible loss of intermediate details can lead to logical bottlenecks. To address this, we evolve the framework into LightThinker++, introducing Explicit Adaptive Memory Management. This paradigm shifts to behavioral-level management by incorporating explicit memory primitives, supported by a specialized trajectory synthesis pipeline to train purposeful memory scheduling. Extensive experiments demonstrate the framework’s versatility across three dimensions. (1) LightThinker reduces peak token usage by 70% and inference time by 26% with minimal accuracy loss. (2) In standard reasoning, LightThinker++ slashes peak token usage by 69.9% while yielding a +2.42% accuracy gain under the same context budget for maximum performance. (3) Most notably, in long-horizon agentic tasks, it maintains a stable footprint beyond 80 rounds (a 60%-70% reduction), achieving an average performance gain of 14.8% across different complex scenarios. Overall, our work provides a scalable direction for sustaining deep LLM reasoning over extended horizons with minimal overhead.
[IR-15] Are LLM-Based Retrievers Worth Their Cost? An Empirical Study of Efficiency, Robustness and Reasoning Overhead SIGIR2026
Quick Read: This paper examines not just accuracy but the efficiency, robustness, and reliability of confidence signals of large language model (LLM) retrievers on complex queries. Its core solution is a comprehensive evaluation across 12 tasks and 14 retrievers, covering cold-start indexing cost, query latency distributions and throughput, corpus scaling, robustness under controlled query perturbations, and the use of confidence (AUROC) to predict query success; by comparing standard queries with five reasoning-augmented variants, it quantifies reasoning overhead (accuracy gain vs. added latency). Key findings include: some reasoning-specialized retrievers achieve strong effectiveness while remaining high-throughput, whereas large LLM-based bi-encoders gain modest accuracy at substantial latency; reasoning augmentation barely affects sub-1B encoders but shows diminishing returns for top retrievers and can hurt formal math/code domains; and confidence calibration is consistently weak across model families, so raw retrieval scores cannot be used directly for downstream routing without additional calibration.
Link: https://arxiv.org/abs/2604.03676
Authors: Abdelrahman Abdallah, Jamie Holdcroft, Mohammed Ali, Adam Jatowt
Affiliations: University of Innsbruck; UNSW Sydney
Subjects: Information Retrieval (cs.IR)
Comments: Accepted at SIGIR 2026
Abstract:Large language model retrievers improve performance on complex queries, but their practical value depends on efficiency, robustness, and reliable confidence signals in addition to accuracy. We reproduce a reasoning-intensive retrieval benchmark (BRIGHT) across 12 tasks and 14 retrievers, and extend evaluation with cold-start indexing cost, query latency distributions and throughput, corpus scaling, robustness to controlled query perturbations, and confidence use (AUROC) for predicting query success. We also quantify reasoning overhead by comparing standard queries to five provided reasoning-augmented variants, measuring accuracy gains relative to added latency. We find that some reasoning-specialized retrievers achieve strong effectiveness while remaining competitive in throughput, whereas several large LLM-based bi-encoders incur substantial latency for modest gains. Reasoning augmentation incurs minimal latency for sub-1B encoders but exhibits diminishing returns for top retrievers and may reduce performance on formal math/code domains. Confidence calibration is consistently weak across model families, indicating that raw retrieval scores are unreliable for downstream routing without additional calibration. We release all code and artifacts for reproducibility.
[IR-16] User Simulator-Guided Multi-Turn Preference Optimization for Reasoning LLM-based Conversational Recommendation
Quick Read: This paper addresses inaccurate modeling of complex user preferences in conversational recommender systems (CRSs), caused by information-scarce dialogue histories and single-turn recommendation paradigms. The core challenge is that LLM-based user simulators cannot access true preference labels at inference time, so their generated feedback may deviate from actual interests; errors then accumulate over multi-turn interactions and hurt the recommender's generalization. The key of the proposed SMTPO framework is twofold: multi-task supervised fine-tuning (SFT) first improves simulator feedback quality so it better reflects real user needs; then reinforcement learning with fine-grained reward design guides the recommender to progressively align with true preferences, stabilizing multi-turn optimization and improving recommendation performance.
Link: https://arxiv.org/abs/2604.03671
Authors: Xingyuan Xiang, Xiangchen Pan, Wei Wei
Affiliations: Huazhong University of Science and Technology
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Conversational Recommender Systems (CRSs) leverage natural language interactions for personalized recommendation, yet information-scarce dialogue histories and single-turn recommendation paradigms may severely hinder accurate modeling of complex user preferences. To alleviate this issue, recent studies have introduced LLM-based user simulators, which generate natural language feedback and perform simulated multi-turn interactions to assist recommendation. Nevertheless, since simulators cannot access true user preference labels during inference, their feedback may deviate from actual user interests, causing errors to accumulate over multiple interactions and severely affecting the generalization of the recommender. Inspired by the multi-step reasoning capabilities of LLMs and the effectiveness of reinforcement learning in policy optimization, we propose SMTPO, a user simulator-guided multi-turn preference optimization conversational recommendation framework. To align simulator-generated feedback with true user preferences in the absence of explicit labels, we enhance feedback quality via multi-task supervised fine-tuning (SFT), enabling the simulator to better reflect users’ complex and diverse needs. To address the challenge of biased feedback destabilizing multi-turn optimization, we first allow the reasoning LLM-based recommender to learn preference reasoning and recommendation patterns through SFT and then employ reinforcement learning with fine-grained reward design to progressively align with true user preferences, improving recommendation performance. Extensive experiments on public datasets demonstrate the effectiveness and transferability of our method.
[IR-17] MMP-Refer: Multimodal Path Retrieval-augmented LLMs For Explainable Recommendation
Quick Read: This paper addresses two key gaps in current LLM-based explainable recommendation: the rich multimodal information in recommendation datasets is ignored, and collaborative information is not aligned with the LLM's semantic space. The key innovation of the proposed MMP-Refer framework is combining multimodal retrieval paths with retrieval-augmented LLM generation. Specifically, multimodal embeddings are first obtained from a sequential recommendation model based on joint residual coding, and a heuristic search algorithm constructs retrieval paths from them; in the generation phase, a trainable lightweight collaborative adapter maps graph encodings of interaction subgraphs into the LLM's semantic space as soft prompts, strengthening the LLM's understanding of user interaction information and enabling more accurate, more explainable personalized recommendation.
Link: https://arxiv.org/abs/2604.03666
Authors: Xiangchen Pan, Wei Wei
Affiliations: Huazhong University of Science and Technology
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Explainable recommendations help improve the transparency and credibility of recommendation systems, and play an important role in personalized recommendation scenarios. At present, methods for explainable recommendation based on large language models (LLMs) often consider introducing collaborative information to enhance the personalization and accuracy of the model, but ignore the multimodal information in the recommendation dataset; In addition, collaborative information needs to be aligned with the semantic space of LLM. Introducing collaborative signals through retrieval paths is a good choice, but most of the existing retrieval path collection schemes use the existing Explainable GNN algorithms. Although these methods are effective, they are relatively unexplainable and not suitable for the recommendation field. To address the above challenges, we propose MMP-Refer, a framework using MultiModal Retrieval Paths with Retrieval-augmented LLM For Explainable Recommendation. We use a sequential recommendation model based on joint residual coding to obtain multimodal embeddings, and design a heuristic search algorithm to obtain retrieval paths by multimodal embeddings; In the generation phase, we integrated a trainable lightweight collaborative adapter to map the graph encoding of interaction subgraphs to the semantic space of the LLM, as soft prompts to enhance the understanding of interaction information by the LLM. Extensive experiments have demonstrated the effectiveness of our approach. Codes and data are available at this https URL.
[IR-18] Love Me Love My Label: Rethinking the Role of Labels in Prompt Retrieval for Visual In-Context Learning CVPR2026
Quick Read: This paper addresses the strong influence of prompt selection on visual in-context learning (VICL): existing methods rely mainly on image similarity and overlook label consistency, so they may pick visually similar but label-inconsistent prompts that degrade VICL performance. The key innovations of the proposed LaPR (Label-aware Prompt Retrieval) framework are: (1) a joint image-label representation that explicitly incorporates label cues; (2) a mixture-of-experts mechanism with query-adaptive routing, where each expert captures a specific label mode and the router infers adaptive mixture weights to form label-aware prompt representations; and (3) a staged optimization scheme using a VICL performance-guided contrastive loss and a label-guided contrastive loss to improve robustness and label consistency. Experiments show consistent gains on in-context segmentation, detection, and colorization, with good generalization across feature extractors and cross-fold scenarios.
Link: https://arxiv.org/abs/2604.03657
Authors: Tianci Luo, Haohao Pan, Jinpeng Wang, Niu Lian, Xinrui Chen, Bin Chen, Shu-Tao Xia, Chun Yuan
Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; Harbin Institute of Technology, Shenzhen; School of Computer Science and Engineering, Northeastern University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: Accepted to CVPR 2026. 10 pages, 5 figures, 3 tables
Abstract:Visual in-context learning (VICL) enables visual foundation models to handle multiple tasks by steering them with demonstrative prompts. The choice of such prompts largely influences VICL performance, standing out as a key challenge. Prior work has made substantial progress on prompt retrieval and reranking strategies, but mainly focuses on prompt images while overlooking labels. We reveal these approaches sometimes get visually similar but label-inconsistent prompts, which potentially degrade VICL performance. On the other hand, higher label consistency between query and prompts preferably indicates stronger VICL results. Motivated by these findings, we develop a framework named LaPR (Label-aware Prompt Retrieval), which highlights the role of labels in prompt selection. Our framework first designs an image-label joint representation for prompts to incorporate label cues explicitly. Besides, to handle unavailable query labels at test time, we introduce a mixture-of-expert mechanism to the dual encoders with query-adaptive routing. Each expert is expected to capture a specific label mode, while the router infers query-adaptive mixture weights and helps to learn label-aware representation. We carefully design alternative optimization for experts and router, with a VICL performance-guided contrastive loss and a label-guided contrastive loss, respectively. Extensive experiments show promising and consistent improvement of LaPR on in-context segmentation, detection, and colorization tasks. Moreover, LaPR generalizes well across feature extractors and cross-fold scenarios, suggesting the importance of label utilization in prompt retrieval for VICL. Code is available at this https URL.
[IR-19] Joint Behavior-guided and Modality-coherence Conditional Graph Diffusion Denoising for Multimodal Recommendation
Quick Read: This paper addresses two key problems in multimodal recommendation: (1) multimodal features contain redundant information irrelevant to user preferences, and injecting them directly into the interaction graph disturbs collaborative feature learning between users and items; (2) false negative and false positive behaviors caused by system errors (e.g., accidental clicks or non-exposure) introduce feedback bias that degrades the ranking accuracy of training sample pairs and thus the model's recommendation quality. The key is a Joint Behavior-guided and Modality-coherent Conditional Graph Diffusion model (JBM-Diff): a diffusion process conditioned on collaborative features is designed for each modality to remove preference-irrelevant information; multi-view message propagation and feature fusion strengthen the consistency between collaborative features and modal semantic information; and the partial-order consistency of sample pairs is detected from a behavioral perspective to assign credibility, achieving data augmentation and effectively improving recommendation accuracy.
Link: https://arxiv.org/abs/2604.03654
Authors: Xiangchen Pan, Wei Wei
Affiliations: Huazhong University of Science and Technology
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:In recent years, multimodal recommendation has received significant attention and achieved remarkable success in GCN-based recommendation methods. However, there are two key challenges here: (1) There is a significant amount of redundant information in multimodal features that is unrelated to user preferences. Directly injecting multimodal features into the interaction graph can affect the collaborative feature learning between users and items. (2) There are false negative and false positive behaviors caused by system errors such as accidental clicks and non-exposure. This feedback bias can affect the ranking accuracy of training sample pairs, thereby reducing the recommendation accuracy of the model. To address these challenges, this work proposes a Joint Behavior-guided and Modal-consistent Conditional Graph Diffusion Model (JBM-Diff) for joint denoising of multimodal features and user feedback. We design a diffusion model conditioned on collaborative features for each modal feature to remove preference-irrelevant information, and enhance the alignment between collaborative features and modal semantic information through multi-view message propagation and feature fusion. Finally, we detect the partial order consistency of sample pairs from a behavioral perspective based on learned modal preferences, set the credibility for sample pairs, and achieve data augmentation. Extensive experiments on three public datasets demonstrate the effectiveness of this work. Codes are available at this https URL.
[IR-20] Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval CVPR2026
Quick Read: For partially relevant video retrieval (PRVR), existing methods suffer from incomplete global contextual perception and are vulnerable to query ambiguity and local noise from spurious responses. The key is DreamPRVR, a coarse-to-fine representation learning paradigm: a probabilistic variational sampler first generates video-centric global contextual semantic registers, which a text-supervised truncated diffusion model then iteratively refines, building a well-formed textual latent space that makes global perception more reliable; register-augmented Gaussian attention blocks then adaptively fuse the registers into video tokens, enabling context-aware feature learning and precise cross-modal matching.
Link: https://arxiv.org/abs/2604.03653
Authors: Jun Li, Xuhang Lou, Jinpeng Wang, Yuting Wang, Yaowei Wang, Shu-Tao Xia, Bin Chen
Affiliations: Harbin Institute of Technology, Shenzhen; Tsinghua Shenzhen International Graduate School, Tsinghua University; Peng Cheng Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: Accepted to CVPR 2026. 15 pages, 7 figures, 3 tables
Abstract:Partially Relevant Video Retrieval (PRVR) aims to retrieve untrimmed videos based on text queries that describe only partial events. Existing methods suffer from incomplete global contextual perception, struggling with query ambiguity and local noise induced by spurious responses. To address these issues, we propose DreamPRVR, which adopts a coarse-to-fine representation learning paradigm. The model first generates global contextual semantic registers as coarse-grained highlights spanning the entire video and then concentrates on fine-grained similarity optimization for precise cross-modal matching. Concretely, these registers are generated by initializing from the video-centric distribution produced by a probabilistic variational sampler and then iteratively refined via a text-supervised truncated diffusion model. During this process, textual semantic structure learning constructs a well-formed textual latent space, enhancing the reliability of global perception. The registers are then adaptively fused with video tokens through register-augmented Gaussian attention blocks, enabling context-aware feature learning. Extensive experiments show that DreamPRVR outperforms state-of-the-art methods. Code is released at this https URL.
[IR-21] LLM-based Listwise Reranking under the Effect of Positional Bias
Quick Read: This paper addresses positional bias in LLM-based listwise passage reranking: passages placed toward the end of the input list are less likely to be moved to top positions, harming the final ranking. The key of the proposed DebiasFirst method is two complementary mechanisms: positional calibration, which uses inverse propensity scoring to re-weight the contribution of each position in the training loss, mitigating the architecture's inherent positional bias; and position-aware data augmentation, which diversifies input positions during training so that every passage appears at each position with balanced probability, reducing the impact of imbalanced placement of relevant documents. The method markedly improves reranking effectiveness and robustness, is complementary to inference-stage debiasing methods, and offers a practical solution for real applications.
Link: https://arxiv.org/abs/2604.03642
Authors: Jingfen Qiao, Jin Huang, Xinyu Ma, Shuaiqiang Wang, Dawei Yin, Evangelos Kanoulas, Andrew Yates
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:LLM-based listwise passage reranking has attracted attention for its effectiveness in ranking candidate passages. However, these models suffer from positional bias, where passages positioned towards the end of the input are less likely to be moved to top positions in the ranking. We hypothesize that there are two primary sources of positional bias: (1) architectural bias inherent in LLMs and (2) the imbalanced positioning of relevant documents. To address this, we propose DebiasFirst, a method that integrates positional calibration and position-aware data augmentation during fine-tuning. Positional calibration uses inverse propensity scoring to adjust for positional bias by re-weighting the contributions of different positions in the loss function when training. Position-aware augmentation augments training data to ensure that each passage appears equally across varied positions in the input list. This approach markedly enhances both effectiveness and robustness to the original ranking across diverse first-stage retrievers, reducing the dependence of NDCG@10 performance on the position of relevant documents. DebiasFirst also complements the inference-stage debiasing methods, offering a practical solution for mitigating positional bias in reranking.
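The positional calibration idea above can be sketched with inverse propensity scoring. This is a hedged illustration only: the propensity values and the plain averaged loss form below are hypothetical stand-ins, not DebiasFirst's actual training objective. Each position's loss term is divided by an estimated propensity that the model handles that position well, so under-served tail positions are up-weighted.

```python
import numpy as np

def ips_weighted_loss(per_position_loss, propensities):
    """Re-weight per-position losses by inverse propensity.

    per_position_loss: (n,) loss contribution of the passage at each
    input position; propensities: (n,) estimated probability that a
    relevant passage at this position is ranked correctly.
    """
    weights = 1.0 / np.asarray(propensities)
    return float(np.mean(weights * np.asarray(per_position_loss)))

# Toy example: later positions have lower propensity, so their
# losses count for more, counteracting positional bias.
loss = np.array([0.2, 0.2, 0.2, 0.2])
prop = np.array([0.9, 0.7, 0.5, 0.3])
plain = float(np.mean(loss))            # unweighted mean = 0.2
calibrated = ips_weighted_loss(loss, prop)
assert calibrated > plain               # tail positions up-weighted
```

The effect is that gradient signal from passages shown late in the input list is amplified in proportion to how strongly the model neglects that position.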
[IR-22] Beyond Predefined Schemas: TRACE-KG for Context-Enriched Knowledge Graphs from Complex Documents
Quick Read: This paper addresses the limitations of the two mainstream approaches to knowledge graph construction: ontology-driven pipelines enforce consistent typing but require costly schema design and maintenance, while schema-free methods tend to produce fragmented graphs lacking global organization on long technical documents, especially those with dense, context-dependent information. The key is TRACE-KG (Text-dRiven schemA for Context-Enriched Knowledge Graphs), a multimodal framework that, without assuming a predefined ontology, jointly constructs a context-enriched knowledge graph and a reusable data-driven schema. It captures conditional relations through structured qualifiers and organizes entities and relations with full traceability to the source evidence, unifying structural coherence and evidence provenance.
Link: https://arxiv.org/abs/2604.03496
Authors: Mohammad Sadeq Abolhasani, Yang Ba, Yixuan He, Rong Pan
Affiliations: Arizona State University
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:Knowledge graph construction typically relies either on predefined ontologies or on schema-free extraction. Ontology-driven pipelines enforce consistent typing but require costly schema design and maintenance, whereas schema-free methods often produce fragmented graphs with weak global organization, especially in long technical documents with dense, context-dependent information. We propose TRACE-KG (Text-dRiven schemA for Context-Enriched Knowledge Graphs), a multimodal framework that jointly constructs a context-enriched knowledge graph and an induced schema without assuming a predefined ontology. TRACE-KG captures conditional relations through structured qualifiers and organizes entities and relations using a data-driven schema that serves as a reusable semantic scaffold while preserving full traceability to the source evidence. Experiments show that TRACE-KG produces structurally coherent, traceable knowledge graphs and offers a practical alternative to both ontology-driven and schema-free construction pipelines.
[IR-23] Lightweight Query Routing for Adaptive RAG : A Baseline Study on RAG Router-Bench
Quick Read: This paper addresses the efficiency of choosing the best retrieval strategy per query type in retrieval-augmented generation (RAG) pipelines, i.e., dynamically routing each query to a strategy whose cost matches its capability needs in order to save tokens. The key is building and evaluating lightweight classifier-based routing: five classical classifiers are systematically compared under three feature regimes (TF-IDF, MiniLM sentence embeddings, and hand-crafted structural features) on the RAGRouter-Bench benchmark. The study finds that TF-IDF features with an SVM reach a macro-averaged F1 of 0.928 and 93.2% accuracy while simulating 28.1% token savings, indicating that surface keyword patterns predict query complexity more effectively than semantic embeddings.
Link: https://arxiv.org/abs/2604.03455
Authors: Prakhar Bansal, Shivangi Agarwal
Affiliations: Indraprastha Institute of Information Technology Delhi
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 5 pages, 3 tables
Abstract:Retrieval-Augmented Generation pipelines span a wide range of retrieval strategies that differ substantially in token cost and capability. Selecting the right strategy per query is a practical efficiency problem, yet no routing classifiers have been trained on RAGRouter-Bench [wang2026ragrouterbench], a recently released benchmark of 7,727 queries spanning four knowledge domains, each annotated with one of three canonical query types: factual, reasoning, and summarization. We present the first systematic evaluation of lightweight classifier-based routing on this benchmark. Five classical classifiers are evaluated under three feature regimes, namely, TF-IDF, MiniLM sentence embeddings [reimers2019sbert], and hand-crafted structural features, yielding 15 classifier feature combinations. Our best configuration, TF-IDF with an SVM, achieves a macro-averaged F1 of 0.928 and an accuracy of 93.2%, while simulating 28.1% token savings relative to always using the most expensive paradigm. Lexical TF-IDF features outperform semantic sentence embeddings by 3.1 macro-F1 points, suggesting that surface keyword patterns are strong predictors of query-type complexity. Domain-level analysis reveals that medical queries are hardest to route and legal queries most tractable. These results establish a reproducible query-side baseline and highlight the gap that corpus-aware routing must close.
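A lexical query-type router of the kind studied here can be sketched in pure numpy. This is a hedged toy, not the paper's pipeline: it swaps the SVM for a nearest-centroid classifier over TF-IDF vectors, and the training queries and labels below are invented for illustration.

```python
import numpy as np

# Toy router: TF-IDF features + nearest-centroid classification over
# the benchmark's three canonical query types (data is illustrative).
train = [
    ("who wrote this law", "factual"),
    ("when was the treaty signed", "factual"),
    ("why does the ruling contradict precedent", "reasoning"),
    ("explain why the dosage interacts", "reasoning"),
    ("summarize the whole case file", "summarization"),
    ("give an overview summary of the report", "summarization"),
]

vocab = sorted({w for q, _ in train for w in q.split()})
idx = {w: i for i, w in enumerate(vocab)}

def tfidf(texts):
    """L2-normalized TF-IDF rows over the fixed training vocabulary."""
    tf = np.zeros((len(texts), len(vocab)))
    for r, t in enumerate(texts):
        for w in t.split():
            if w in idx:
                tf[r, idx[w]] += 1.0
    df = (tf > 0).sum(axis=0)
    x = tf * (np.log((1 + len(texts)) / (1 + df)) + 1.0)
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)

X = tfidf([q for q, _ in train])
labels = sorted({y for _, y in train})
centroids = np.stack(
    [X[[i for i, (_, y) in enumerate(train) if y == c]].mean(axis=0)
     for c in labels])

def route(query):
    """Assign the query type whose centroid is most similar."""
    return labels[int(np.argmax(centroids @ tfidf([query])[0]))]

assert route("why was the precedent overturned") == "reasoning"
```

Keyword overlap alone separates these toy classes, which loosely mirrors the paper's finding that surface lexical patterns are strong predictors of query-type complexity.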
[IR-24] Align then Train: Efficient Retrieval Adapter Learning
Quick Read: This paper addresses a retrieval mismatch in dense retrieval caused by the asymmetric semantic complexity of queries and documents: users increasingly express intent through long instructions or task descriptions that require strong reasoning and instruction following, while target documents remain relatively simple and static; existing systems typically address this by improving the embedding model, but directly fine-tuning large query embedders is expensive and operationally burdensome. The key is the label-efficient Efficient Retrieval Adapter (ERA) framework with two-stage training: a self-supervised stage first aligns the embedding space of a large query embedder with that of a lightweight document embedder, and a supervised stage then adapts the query-side representation using limited labeled data, bridging both the representation gap between embedding models and the semantic gap between complex queries and simple documents without re-indexing the corpus.
Link: https://arxiv.org/abs/2604.03403
Authors: Seiji Maekawa, Moin Aminnaseri, Pouya Pezeshkpour, Estevam Hruschka
Affiliations: Megagon Labs
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:
Abstract:Dense retrieval systems increasingly need to handle complex queries. In many realistic settings, users express intent through long instructions or task-specific descriptions, while target documents remain relatively simple and static. This asymmetry creates a retrieval mismatch: understanding queries may require strong reasoning and instruction-following, whereas efficient document indexing favors lightweight encoders. Existing retrieval systems often address this mismatch by directly improving the embedding model, but fine-tuning large embedding models to better follow such instructions is computationally expensive, memory-intensive, and operationally burdensome. To address this challenge, we propose Efficient Retrieval Adapter (ERA), a label-efficient framework that trains retrieval adapters in two stages: self-supervised alignment and supervised adaptation. Inspired by the pre-training and supervised fine-tuning stages of LLMs, ERA first aligns the embedding spaces of a large query embedder and a lightweight document embedder, and then uses limited labeled data to adapt the query-side representation, bridging both the representation gap between embedding models and the semantic gap between complex queries and simple documents without re-indexing the corpus. Experiments on the MAIR benchmark, spanning 126 retrieval tasks across 6 domains, show that ERA improves retrieval in low-label settings, outperforms methods that rely on larger amounts of labeled data, and effectively combines stronger query embedders with weaker document embedders across domains.
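The self-supervised alignment stage can be approximated, under strong simplifying assumptions, by learning a linear map from the query embedder's space into the document embedder's space from unlabeled paired encodings. The linear form and the synthetic data below are illustrative only, not ERA's actual adapter architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
d_q, d_d, n = 16, 8, 200

# Synthetic "paired" encodings of the same texts by two embedders:
# lightweight doc-space vectors plus a hidden linear relation to the
# larger query embedder's space (assumption for this toy).
Z = rng.normal(size=(n, d_d))                       # doc embedder outputs
A_true = rng.normal(size=(d_d, d_q))
Q = Z @ A_true + 0.01 * rng.normal(size=(n, d_q))   # query embedder outputs

# Alignment: least-squares map W taking query-space vectors into the
# fixed document space, so the corpus index never needs rebuilding.
W, *_ = np.linalg.lstsq(Q, Z, rcond=None)           # (d_q, d_d)
residual = np.linalg.norm(Q @ W - Z) / np.linalg.norm(Z)
assert residual < 0.05  # spaces align well when a linear relation exists
```

The point of the sketch is the direction of the mapping: only the query side is adapted, which is why the (cheaply indexed) document embeddings can stay untouched.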
[IR-25] BridgeRAG : Training-Free Bridge-Conditioned Retrieval for Multi-Hop Question Answering
Quick Read: This paper addresses efficiency and accuracy in retrieval-augmented generation (RAG) for multi-hop question answering (MHQA): conventional single-step relevance ranking cannot capture the utility of later-hop candidate evidence conditioned on specific bridge evidence. The key of the proposed BridgeRAG is a tripartite scoring function s(q, b, c) over (question, bridge, candidate): a bridge-conditioned LLM judge selects the active reasoning chain without any offline graph structure or proposition index; dual-entity ANN search broadens the second-hop candidate pool; and percentile-rank score fusion combines the signals to improve overall performance. Experiments show that this mechanism significantly outperforms existing training-free methods, achieving the best results on several standard MHQA benchmarks.
Link: https://arxiv.org/abs/2604.03384
Authors: Andre Bacellar
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR)
Comments: 11 pages, 4 figures
Abstract:Multi-hop retrieval is not a single-step relevance problem: later-hop evidence should be ranked by its utility conditioned on retrieved bridge evidence, not by similarity to the original query alone. We present BridgeRAG, a training-free, graph-free retrieval method for retrieval-augmented generation (RAG) over multi-hop questions that operationalizes this view with a tripartite scorer s(q,b,c) over (question, bridge, candidate). BridgeRAG separates coverage from scoring: dual-entity ANN expansion broadens the second-hop candidate pool, while a bridge-conditioned LLM judge identifies the active reasoning chain among competing candidates without any offline graph or proposition index. Across four controlled experiments we show that this conditioning signal is (i) selective: +2.55pp on parallel-chain queries (p < 0.001) vs. ~0 on single-chain subtypes; (ii) irreplaceable: substituting the retrieved passage with generated SVO query text reduces R@5 by 2.1pp, performing worse than even the lowest-SVO-similarity pool passage; (iii) predictable: cos(b,g2) correlates with per-query gain (Spearman rho=0.104, p < 0.001); and (iv) mechanistically precise: bridge conditioning causes productive re-rankings (18.7% flip-win rate on parallel-chain vs. 0.6% on single-chain), not merely more churn. Combined with lightweight coverage expansion and percentile-rank score fusion, BridgeRAG achieves the best published training-free R@5 under matched benchmark evaluation on all three standard MHQA benchmarks without a graph database or any training: 0.8146 on MuSiQue (+3.1pp vs. PropRAG, +6.8pp vs. HippoRAG2), 0.9527 on 2WikiMultiHopQA (+1.2pp vs. PropRAG), and 0.9875 on HotpotQA (+1.35pp vs. PropRAG).
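Percentile-rank score fusion, one ingredient of the pipeline above, can be sketched as follows. Note the equal-weight average and the toy scores are assumptions for illustration; the abstract does not specify the exact fusion weights. Each scorer's raw scores are first converted to percentile ranks over the candidate pool, making heterogeneous score scales comparable, then averaged.

```python
import numpy as np

def percentile_ranks(scores):
    """Map raw scores to percentile ranks in [0, 1] (higher = better)."""
    ranks = np.argsort(np.argsort(scores))        # 0..n-1, distinct scores
    return ranks / (len(scores) - 1)

def fuse(score_lists):
    """Equal-weight fusion of several scorers' percentile ranks."""
    return np.mean([percentile_ranks(np.asarray(s)) for s in score_lists],
                   axis=0)

dense = [0.9, 0.2, 0.5, 0.7]    # e.g. similarity to the original query
bridge = [0.1, 0.8, 0.6, 0.7]   # e.g. bridge-conditioned judge score
fused = fuse([dense, bridge])
best = int(np.argmax(fused))    # candidate 3 ranks well under both signals
assert best == 3
```

Rank-based fusion rewards candidates that are consistently strong across signals, rather than letting one scorer's scale dominate the other.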
Human-Computer Interaction
[HC-0] Comprehensive List of User Deception Techniques in Emails
Quick Read: This paper addresses long-standing deception attacks on email, rooted in existing design and interface conventions that fail to prevent malicious behavior. The key is a systematic, structured catalog of 42 email-based deception techniques covering sender, link, and attachment security indicators as well as the email rendering environment, documented with 64 concrete example implementations. The techniques are presented modularly, explicitly separating the deception goal from its technical implementation, providing a clear, extensible reference for developing targeted countermeasures across infrastructure, email client design, and security awareness education.
Link: https://arxiv.org/abs/2604.04926
Authors: Maxime Veit, Mattia Mossano, Tobias Länge, Melanie Volkamer
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Email remains a central communication medium, yet its long-standing design and interface conventions continue to enable deceptive attacks. This research note presents a structured list of 42 email-based deception techniques, documented with 64 concrete example implementations, organized around the sender, link, and attachment security indicators as well as techniques targeting the email rendering environment. Building on a prior systematic literature review, we consolidate previously reported techniques with newly developed example implementations and introduce novel deception techniques identified through our own examination. Rather than assessing effectiveness or real-world severity, each entry explains the underlying mechanism in isolation, separating the high-level deception goal from its concrete technical implementation. The documented techniques serve as modular building blocks and a structured reference for future work on countermeasures across infrastructure, email client design, and security awareness, supporting researchers as well as developers, operators, and designers working in these areas.
[HC-1] Comparing Human Oversight Strategies for Computer-Use Agents
Quick Read: This paper addresses the lack of a systematic comparative framework for designing user oversight of LLM-powered computer-use agents (CUAs); existing work treats oversight mechanisms as isolated interface features, making structured strategies hard to compare. The paper models CUA oversight as a structural coordination problem defined by delegation structure and engagement level, and uses this framing in a mixed-methods study comparing four oversight strategies in a live web environment. The key finding: effective oversight does not come from simply increasing human involvement, but from structuring supervision so that decision-critical moments become legible during execution and support timely, meaningful intervention, reflected in lower rates of problematic agent actions and better user recognition of risky moments.
Link: https://arxiv.org/abs/2604.04918
Authors: Chaoran Chen, Zhiping Zhang, Zeya Chen, Eryue Xu, Yinuo Yang, Ibrahim Khalilov, Simret A Gebreegziabher, Yanfang Ye, Ziang Xiao, Yaxing Yao, Tianshi Li, Toby Jia-Jun Li
Affiliations: University of Notre Dame; Northeastern University; Illinois Institute of Technology; University of Illinois Urbana-Champaign; Johns Hopkins University
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:LLM-powered computer-use agents (CUAs) are shifting users from direct manipulation to supervisory coordination. Existing oversight mechanisms, however, have largely been studied as isolated interface features, making broader oversight strategies difficult to compare. We conceptualize CUA oversight as a structural coordination problem defined by delegation structure and engagement level, and use this lens to compare four oversight strategies in a mixed-methods study with 48 participants in a live web environment. Our results show that oversight strategy more reliably shaped users’ exposure to problematic actions than their ability to correct them once visible. Plan-based strategies were associated with lower rates of agent problematic-action occurrence, but not equally strong gains in runtime intervention success once such actions became visible. On subjective measures, no single strategy was uniformly best, and the clearest context-sensitive differences appeared in trust. Qualitative findings further suggest that intervention depended not only on what controls users retained, but on whether risky moments became legible as requiring judgment during execution. These findings suggest that effective CUA oversight is not achieved by maximizing human involvement alone. Instead, it depends on how supervision is structured to surface decision-critical moments and support their recognition in time for meaningful intervention.
[HC-2] Exploring Expert Perspectives on Wearable-Triggered LLM Conversational Support for Daily Stress Management
【速读】:该论文旨在解决可穿戴设备触发的压力事件与生成式对话支持之间缺乏有意义连接的问题,尤其是在设计层面的探索不足。解决方案的关键在于提出并实现EmBot——一个将可穿戴设备检测到的压力事件与大语言模型(Large Language Model, LLM)驱动的对话支持相结合的移动应用原型,并通过15位心理健康专家的半结构化访谈作为设计探针,识别出早期设计张力与考量,从而为日常压力管理和心理健康支持系统的未来设计提供指导。
链接: https://arxiv.org/abs/2604.04915
作者: Poorvesh Dongre,Sameer Neupane,Priyanka Jadhav,Nikitha Donekal Chandrashekar,Christian Webb,Denis Gračanin
机构: Harvard Medical School (哈佛医学院); University of Memphis (孟菲斯大学); Omnissa; Virginia Tech (弗吉尼亚理工大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Wearable devices increasingly support stress detection, while LLMs enable conversational mental health support. However, designing systems that meaningfully connect wearable-triggered stress events with generative dialogue remains underexplored, particularly from a design perspective. We present EmBot, a functional mobile application that combines wearable-triggered stress detection with LLM-based conversational support for daily stress management. We used EmBot as a design probe in semi-structured interviews with 15 mental health experts to examine their perspectives and surface early design tensions and considerations that arise from wearable-triggered conversational support, informing the future design of such systems for daily stress management and mental health support.
[HC-3] ClickAIXR: On-Device Multimodal Vision-Language Interaction with Real-World Objects in Extended Reality
【速读】:该论文旨在解决当前扩展现实(XR)中多模态人机交互存在的隐私泄露、延迟高及交互模糊等问题。现有系统通常依赖云端大模型(如ChatGPT)或仅基于凝视选择(如GazePointAR),导致用户数据外传、响应延迟显著,且交互语义不明确。其解决方案的关键在于提出ClickAIXR框架——通过在设备端集成视觉语言模型(Vision-Language Model, VLM),结合控制器点击实现对象级精准选择,使用户可直接对现实世界物体进行自然语言提问并获得文本与语音反馈。整个推理过程本地化执行,既保障了隐私安全又降低了延迟,提升了交互的透明度与可信度。
链接: https://arxiv.org/abs/2604.04905
作者: Dawar Khan,Alexandre Kouyoumdjian,Xinyu Liu,Omar Mena,Dominik Engel,Ivan Viola
机构: King Abdullah University of Science and Technology (KAUST)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注:
Abstract:We present ClickAIXR, a novel on-device framework for multimodal vision-language interaction with objects in extended reality (XR). Unlike prior systems that rely on cloud-based AI (e.g., ChatGPT) or gaze-based selection (e.g., GazePointAR), ClickAIXR integrates an on-device vision-language model (VLM) with a controller-based object selection paradigm, enabling users to precisely click on real-world objects in XR. Once selected, the object image is processed locally by the VLM to answer natural language questions through both text and speech. This object-centered interaction reduces ambiguity inherent in gaze- or voice-only interfaces and improves transparency by performing all inference on-device, addressing concerns around privacy and latency. We implemented ClickAIXR in the Magic Leap SDK (C API) with ONNX-based local VLM inference. We conducted a user study comparing ClickAIXR with Gemini 2.5 Flash and ChatGPT 5, evaluating usability, trust, and user satisfaction. Results show that latency is moderate and user experience is acceptable. Our findings demonstrate the potential of click-based object selection combined with on-device AI to advance trustworthy, privacy-preserving XR interactions. The source code and supplementary materials are available at: this http URL
[HC-4] Demonstrating SIMA-Play: A Serious Game for Forest Management Decision-Making through Board Game and Digital Simulation
【速读】:该论文试图解决的问题是:如何利用棋盘游戏(board games)这一教育工具,有效引导学习者理解森林管理中复杂的长期权衡关系,因为当前这类应用仍显著不足。解决方案的关键在于设计了一款名为SIMA-Play的严肃游戏(serious game),通过整合森林生长模拟数据与信息可视化(information visualization)及游戏机制,使玩家能够在动态环境和市场条件下做出森林管理决策,并获得关于其选择的反馈。该游戏不仅模拟了森林随时间的生长过程,还量化比较了玩家在经济收益与可持续性结果上的表现,从而促进系统思维(systems thinking),帮助玩家更清晰地理解和讨论林业实践中的权衡取舍。
链接: https://arxiv.org/abs/2604.04904
作者: Arka Majhi,Daniel Fernández Galeote,Timo Nummenmaa,Juho Hamari,Aaron Petty,Jari Vauhkonen,Heli Peltola
机构: Tampere University (坦佩雷大学); University of Eastern Finland (东芬兰大学)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: Accepted to the GamiFIN 2026 conference
Abstract:Board games have shown promise as educational tools, but their use in engaging learners with the complex, long-term trade-offs of forest management remains strikingly underdeveloped. Addressing this gap, we investigate how forest growth simulation data can inform decision-making through information visualization and gameplay mechanics. We designed a serious game, SIMA-Play, that enables players to make informed forest management decisions under dynamic environmental and market conditions, simulating forest growth over time and comparing player performance across economic and sustainability outcomes. By using visualization to give players feedback on their choices at the end of the game, SIMA-Play supports systems thinking and makes the trade-offs in forestry practices easier to understand and discuss. The study concludes with a research roadmap that outlines future experiments, longitudinal studies, and digital versions of SIMA-Play to assess its long-term effects on learning and engagement.
[HC-5] When One Sensor Fails: Tolerating Dysfunction in Multi-Sensor Prototypes
【速读】:该论文旨在解决多传感器表面肌电信号(surface electromyography, sEMG)系统中单个传感器失效导致系统可用性下降的问题。其解决方案的关键在于提出了一种可实施的容错机制框架,通过提取手工特征并利用最大Fisher判别比(maximum Fisher discriminant ratio, FDR)量化类别可分性,结合多层感知机(multi-layer perceptron)验证方法的有效性;同时,通过对传感器进行系统性剔除实验与FDR分析,识别出关键传感器与可替代传感器的优先级排序,从而为设备设计提供冗余策略和可靠性保障,适用于临床及实际应用场景。
链接: https://arxiv.org/abs/2604.04832
作者: Freek Hens,Amirhossein Sadough,Aleksa Bokšan,Mahyar Shahsavari,Mohammad Mahdi Dehshibi
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Submitted to the International Journal of Parallel, Emergent and Distributed Systems
Abstract:Surface electromyography (sEMG) sensors are widely used in human-computer interaction, yet the failure of a single sensor can compromise system usability. We propose a methodological framework for implementing a fail-safe mechanism in multi-sensor sEMG systems. Using arm sEMG recordings of rock-paper-scissors gestures, we extracted hand-crafted features and quantified class separability via the maximum Fisher discriminant ratio (FDR). A multi-layer perceptron validated our approach, consistent with prior findings and physiological evidence. Systematic sensor ablations and FDR analysis produced a ranking of crucial versus replaceable sensors. This ranking informs robust device design, sensor redundancy, and reliability in clinical and practical applications.
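摘要提到通过最大Fisher判别比(FDR)量化类别可分性,并结合系统性传感器消融得到关键传感器排序。下面给出一个最小化示意实现(并非论文原始代码,特征矩阵、传感器编号等均为假设):对每个特征在各类别两两之间计算 FDR = (μ_i − μ_j)² / (σ_i² + σ_j²) 并取最大值,再按"移除某传感器后总可分性下降幅度"对传感器排序:

```python
import itertools
import numpy as np

def pairwise_fdr(features, labels):
    """Per-feature Fisher discriminant ratio, maximised over class pairs.
    For classes i, j: FDR = (mu_i - mu_j)^2 / (var_i + var_j)."""
    fdr = np.zeros(features.shape[1])
    for i, j in itertools.combinations(np.unique(labels), 2):
        a, b = features[labels == i], features[labels == j]
        pair = (a.mean(axis=0) - b.mean(axis=0)) ** 2 / (
            a.var(axis=0) + b.var(axis=0) + 1e-12)
        fdr = np.maximum(fdr, pair)
    return fdr

def rank_sensors_by_ablation(features, labels, sensor_of_feature):
    """Rank sensors by the drop in total separability when each sensor's
    feature columns are removed (larger drop = more crucial sensor)."""
    full = pairwise_fdr(features, labels).sum()
    drop = {}
    for s in np.unique(sensor_of_feature):
        keep = sensor_of_feature != s
        drop[s] = full - pairwise_fdr(features[:, keep], labels).sum()
    return sorted(drop, key=drop.get, reverse=True)
```

排序靠前的传感器即"关键"传感器,排序靠后的则可视为冗余、在单传感器失效时可被替代。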
[HC-6] AnyUser: Translating Sketched User Intent into Domestic Robots
【速读】:该论文旨在解决当前服务机器人在家庭环境中难以实现直观、非专家用户交互的问题,尤其是在缺乏预先构建地图或模型的情况下,如何让普通用户通过自然方式(如手绘草图和语言)高效地指令机器人完成家务任务。解决方案的关键在于提出AnyUser系统,其核心创新包括:(1)多模态融合机制,将自由形式的草图、视觉信息与语言输入统一转化为空间语义原语(spatial-semantic primitives),从而理解用户意图;(2)分层策略(hierarchical policy),用于生成鲁棒且可执行的机器人动作序列。该方案无需依赖先验环境模型,在真实机器人平台上验证了其在复杂场景下的任务执行能力,并通过用户研究证明其对不同人群(包括老年人和低技术熟练度用户)具有显著的可用性提升。
链接: https://arxiv.org/abs/2604.04811
作者: Songyuan Yang,Huibin Tan,Kailun Yang,Wenjing Yang,Shaowu Yang
机构: National University of Defense Technology (国防科技大学); Hunan University (湖南大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Accepted to IEEE Transactions on Robotics (T-RO)
Abstract:We introduce AnyUser, a unified robotic instruction system for intuitive domestic task instruction via free-form sketches on camera images, optionally with language. AnyUser interprets multimodal inputs (sketch, vision, language) as spatial-semantic primitives to generate executable robot actions requiring no prior maps or models. Novel components include multimodal fusion for understanding and a hierarchical policy for robust action generation. Efficacy is shown via extensive evaluations: (1) Quantitative benchmarks on the large-scale dataset showing high accuracy in interpreting diverse sketch-based commands across various simulated domestic scenes. (2) Real-world validation on two distinct robotic platforms, a statically mounted 7-DoF assistive arm (KUKA LBR iiwa) and a dual-arm mobile manipulator (Realman RMC-AIDAL), performing representative tasks like targeted wiping and area cleaning, confirming the system’s ability to ground instructions and execute them reliably in physical environments. (3) A comprehensive user study involving diverse demographics (elderly, simulated non-verbal, low technical literacy) demonstrating significant improvements in usability and task specification efficiency, achieving high task completion rates (85.7%-96.4%) and user satisfaction. AnyUser bridges the gap between advanced robotic capabilities and the need for accessible non-expert interaction, laying the foundation for practical assistive robots adaptable to real-world human environments.
[HC-7] A Multi-Agent Framework for Democratizing XR Content Creation in K-12 Classrooms
【速读】:该论文旨在解决生成式 AI (Generative AI) 与扩展现实 (Extended Reality, XR) 在K-12教育中应用时面临的两大核心挑战:一是XR内容创作的技术门槛过高,限制了教师的课堂采用;二是生成式AI固有的概率性可能导致幻觉(hallucination),在教育场景中引发严重后果。解决方案的关键在于提出一个由四个专业化智能体协同工作的多智能体XR内容创作框架:Pedagogical Agent定义符合年级水平的教学目标,Execution Agent组装3D资产与XR内容,Safeguard Agent依据五项安全标准验证内容,Tutor Agent嵌入教学注释与测验题。该系统以教师为中心,无需技术背景即可操作,且适配主流设备,实现了教学意图、安全验证与教育增值的统一。
链接: https://arxiv.org/abs/2604.04728
作者: Yuan Chang,Zhu Li,Jiaming Qu
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Generative AI (GenAI) combined with Extended Reality (XR) offers potential for K-12 education, yet classroom adoption remains limited by the high technical barrier of XR content authoring. Moreover, the probabilistic nature of GenAI introduces risks of hallucination that may cause severe consequences in K-12 education settings. In this work, we present a multi-agent XR authoring framework. Our prototype system coordinates four specialized agents: a Pedagogical Agent outlining grade-appropriate content specifications with learning objectives; an Execution Agent assembling 3D assets and XR contents; a Safeguard Agent validating generated content against five safety criteria; and a Tutor Agent embedding educational notes and quiz questions within the scene. Our teacher-facing system combines pedagogical intent, safety validation, and educational enrichment. It does not require technical expertise and targets commodity devices.
[HC-8] Bounded Autonomy: Controlling LLM Characters in Live Multiplayer Games
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在实时多人游戏中引入的“控制问题”(control problem):如何使LLM角色在保持与游戏世界交互能力的同时,维持与其他活跃角色的社会一致性,并在必要时接受玩家的可控干预。解决方案的核心是提出“有限自主性”(bounded autonomy)控制架构,其关键在于构建三个接口——角色间交互、角色与世界动作执行、玩家对角色的软控制(soft-steering),并结合概率回复链衰减机制、基于嵌入的动作锚定流水线(含回退策略)以及轻量级的“whisper”软控制技术,实现LLM角色在动态社交环境中的稳定、可解释且可干预的行为表现。
链接: https://arxiv.org/abs/2604.04703
作者: Yunjia Guo,Jinghan Zhu,Siyu Wang,Haixin Qiao
机构: Biibit Ltd (Kotoko AI); Kotoko AI
类目: Human-Computer Interaction (cs.HC)
备注: 9 pages, 5 figures, 5 tables, submitted to UIST 2026
Abstract:Large language models (LLMs) are bringing richer dialogue and social behavior into games, but they also expose a control problem that existing game interfaces do not directly address: how should LLM characters participate in live multiplayer interaction while remaining executable in the shared game world, socially coherent with other active characters, and steerable by players when needed? We frame this problem as bounded autonomy, a control architecture for live multiplayer games that organizes LLM character control around three interfaces: agent-agent interaction, agent-world action execution, and player-agent steering. We instantiate bounded autonomy with probabilistic reply-chain decay, an embedding-based action grounding pipeline with fallback, and whisper, a lightweight soft-steering technique that lets players influence a character’s next move without fully overriding autonomy. We deploy this architecture in a live multiplayer social game and study its behavior through analyses of interaction stability, grounding quality, whisper intervention success, and formative interviews. Our results show how bounded autonomy makes LLM character interaction workable in practice, frames controllability as a distinct runtime control problem for LLM characters in live multiplayer games, and provides a concrete exemplar for future games built around this interaction paradigm.
[HC-9] Design Guidelines for Game-Based Refresher Training of Community Health Workers in Low-Resource Contexts
【速读】:该论文旨在解决在资源匮乏的医疗环境中,如何持续提升社区健康工作者(Community Health Workers, CHWs)培训效果与绩效的问题。其解决方案的关键在于通过四年的基于设计的研究项目,整合多种互动式游戏化训练系统(包括基于问答的移动应用、实体与增强现实游戏、卡牌游戏及基于位置的游戏),提炼出八项可推广的设计指南,涵盖情境真实性、自适应学习、混合交互、社会动机、可解释性、职业身份认同及伦理考量等方面,从而支持CHW在实际工作场景中的持续参与、知识迁移与情境适配。
链接: https://arxiv.org/abs/2604.04671
作者: Arka Majhi,Aparajita Mondal,Satish B. Agnihotri
机构: IIT Bombay(印度理工学院孟买分校); Tampere University(坦佩雷大学)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: This paper has been conditionally accepted to the Interactive Health Conference 2026 in Porto, Portugal
Abstract:Community Health Workers (CHWs) play a critical role in delivering primary healthcare services in low-resource settings, yet sustaining their training and performance remains a persistent challenge. Prior research has explored digital and game-based approaches for CHW training. However, limited work has synthesized longitudinal design insights into generalizable guidelines for interactive health interventions. Building on a four-year design-based research program involving multiple game-based refresher training systems, including quiz-based mobile apps, physical and augmented reality games, card-based games, and location-based games, we examine which design guidelines support sustained engagement, learning transfer, and contextual appropriateness in CHW training. We conducted a mixed-methods analysis across deployments with Accredited Social Health Activists and Anganwadi Workers in India, including interviews, field observations, and usage logs. Through thematic synthesis, we derive eight design guidelines addressing contextual realism, adaptive learning, hybrid interaction, social motivation, explainability, professional identity, and ethical considerations. Our findings contribute actionable design knowledge for researchers and practitioners developing interactive health interventions in low-resource healthcare contexts.
[HC-10] Healthcare App Design in Low-Resource Contexts: Challenges Practices and Opportunities
【速读】:该论文旨在解决数字健康技术在低资源环境(low-resource contexts)中设计与部署时面临的适配性问题,尤其是在基础设施不稳定、数字素养较低及缺乏制度支持的背景下,如何有效提升医疗应用的可用性和可持续性。其解决方案的关键在于关注基础设施限制、文化语境、语言多样性以及可用性挑战,并通过跨学科协作(如研究人员、设计师与实践者之间的对话)来识别可行的设计策略与合作机会,从而推动交互健康(Interactive Health, IH)领域在弱势群体中的应用创新。
链接: https://arxiv.org/abs/2604.04669
作者: Arka Majhi,Aparajita Mondal,Satish B. Agnihotri
机构: IIT Bombay(印度理工学院孟买分校); Tampere University(坦佩雷大学)
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: This paper has been conditionally accepted to the Interactive Health Conference 2026 in Porto, Portugal
Abstract:Digital health technologies are increasingly used to improve healthcare access and delivery worldwide. However, many healthcare applications are designed for environments with stable infrastructure, high digital literacy, and strong institutional support. These assumptions often do not hold in low-resource contexts where healthcare delivery often depends on community health workers, caregivers, and informal care networks. Designing effective healthcare applications for such environments requires attention to infrastructural constraints, cultural contexts, language diversity, and usability challenges. This Birds of a Feather session aims to bring together researchers, designers, and practitioners interested in healthcare application design in low-resource contexts. The session will provide an informal forum for discussing challenges encountered in the design and deployment of digital health technologies in underserved settings, sharing field experiences, and identifying opportunities for collaboration within the Interactive Health (IH) community.
[HC-11] On Optimizing Electrode Configuration for Wrist-Worn sEMG-Based Thumb Gesture Recognition
【速读】:该论文旨在解决腕部表面肌电信号(surface electromyography, sEMG)在拇指运动识别中的电极配置策略不明确的问题,特别是相较于前臂sEMG,腕部sEMG的电极布局对解码性能的影响尚未系统研究。解决方案的关键在于通过对比高密度(HD)与低密度(LD)sEMG系统,在多个维度上优化电极配置:发现伸肌侧电极优于屈肌侧电极、单极记录优于双极配置、增加通道数可提升性能但存在边际递减效应,并指出电极空间分布需在覆盖范围与设备紧凑性之间权衡。研究表明,腕戴式sEMG系统的有效性更依赖于电极位置和参考方案的精细化设计,而非单纯扩大电极数量或传感区域。
链接: https://arxiv.org/abs/2604.04623
作者: Wenjuan Zhong,Chenfei Ma,Kianoush Nazarpour
机构: School of Informatics, The University of Edinburgh (爱丁堡大学信息学院)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Thumb gestures provide an effective and unobtrusive input modality for wearable and always-available human-machine interaction. Wrist-worn surface electromyography (sEMG) has emerged as a promising approach for compact and wearable human-machine interfaces. However, compared to forearm sEMG, the impact of electrode configuration on wrist-based decoding performance remains understudied. We systematically investigated electrode configuration strategies for wrist-based thumb-movement recognition using high-density (HD) and low-density (LD) sEMG measurement systems. We considered factors such as muscle region, reference scheme, channel count, and spatial density of the electrode. Experimental results show that 1) extensor-side electrodes outperform flexor-side electrodes (HD: 0.871 vs. 0.821; LD: 0.769 vs. 0.705); 2) monopolar recordings consistently outperform bipolar configurations (15 channel with HD monopolar vs. LD bipolar: 0.885 vs. 0.823); and 3) increasing channel count enhances performance, but exhibits diminishing returns. We further show that electrode spatial distribution introduces a trade-off between spatial coverage and compactness. The findings suggest that the effectiveness of wrist-worn sEMG systems depends less on the deployment of a large number of electrodes in a broad sensing area and more on the optimization of electrode placement and the referencing scheme. This work provides practical guidelines for developing efficient wrist-worn sEMG-based gesture recognition systems.
[HC-12] Computational Analysis of Speech Clarity Predicts Audience Engagement in TED Talks
【速读】:该论文旨在解决“是什么因素使公共演讲能够引起大规模观众共鸣”的问题,聚焦于语言清晰度(clarity)作为核心驱动变量。解决方案的关键在于利用生成式 AI(Generative AI)对1,239个TED演讲文本进行多维度分析,量化其解释清晰度与结构组织性,并将其与YouTube上的点赞数和观看量等参与指标关联。结果显示,语言清晰度是预测观众反应最强的变量(β = .339 for likes; β = .314 for views),且显著超越传统可读性指标,表明话语连贯性比表层语言简化更能预测传播效果。这一发现支持了处理流畅性理论(processing fluency),并揭示了通过语言模型识别潜在沟通质量的可行性,为教育、科学传播和公众演讲中的反馈系统提供了可扩展、可训练的实践路径。
链接: https://arxiv.org/abs/2604.04583
作者: Roni Segal(1),Matan Lary(1),Ralf Schmaelzle(2),Yossi Ben-Zion(1) ((1) Department of Physics, Bar Ilan University, Ramat Gan, Israel, (2) Department of Communication, Michigan State University, East Lansing, MI, USA)
机构: Bar Ilan University (巴伊兰大学); Michigan State University (密歇根州立大学)
类目: Human-Computer Interaction (cs.HC)
备注: Roni Segal and Matan Lary contributed equally to this work
Abstract:What makes a public talk resonate with large audiences? While prior research has emphasized speaker delivery or topic novelty, we reasoned that a core driver of engagement is linguistic clarity. This aligns with theories of processing fluency and cognitive load, which posit that audiences reward speakers who present complex ideas accessibly. We leveraged artificial intelligence to analyze 1,239 TED Talk transcripts (2006–2013), supplemented by a later-phase longitudinal sample. Each transcript was evaluated across 50 independent large language model runs on two dimensions, clarity of explanation and structural organization, and linked to YouTube engagement metrics (likes and views). Clarity emerged as the strongest predictor of audience responses (β = .339 for likes; β = .314 for views), contributing substantial incremental variance (ΔR² ≈ .095) beyond duration, topic, and scientific status. The full model explained 29% of variance in likes and 22.5% in views. This effect was domain-general, remaining invariant across content categories and between scientific and non-scientific talks. Notably, clarity outperformed traditional readability metrics, indicating that discourse coherence predicts engagement more powerfully than surface-level linguistic simplicity. Longitudinal analyses further revealed standardization within TED, characterized by increasing clarity and reduced variability over time. Theoretically, these results support processing fluency accounts: clearer communication reduces cognitive friction and elicits more positive evaluative responses. Practically, transcript-based clarity represents a scalable and trainable strategy for improving public discourse. By demonstrating that language models can reliably capture latent communicative qualities, this study paves the way for feedback systems in education, science communication, and public speaking.
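摘要中报告的增量方差解释(ΔR²)指在仅含基线协变量(时长、主题等)的回归之上加入清晰度变量后 R² 的提升。下面用合成数据示意这一层级回归的计算过程(变量名与数据均为假设,仅演示方法思路):

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit with intercept, via least squares."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid.var() / y.var()

# Synthetic stand-in data: in the study, the outcome would be an
# engagement metric and the predictors would include duration, topic,
# and the LLM-rated clarity score.
rng = np.random.default_rng(1)
n = 500
duration = rng.standard_normal(n)
clarity = rng.standard_normal(n)
log_likes = 0.3 * duration + 0.5 * clarity + rng.standard_normal(n)

r2_base = r_squared(duration[:, None], log_likes)                      # baseline covariates only
r2_full = r_squared(np.column_stack([duration, clarity]), log_likes)   # baseline + clarity
delta_r2 = r2_full - r2_base                                           # incremental variance explained
```

ΔR² 越大,说明清晰度在控制其他协变量后仍解释了越多的观众参与度差异。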
[HC-13] GROW: A Conversational AI Coach for Goals, Reflection, Optimism, and Well-Being
【速读】:该论文旨在解决大学生在学业压力、经济负担和社交期望下所面临的心理健康与幸福感挑战,尤其针对传统校园心理咨询和学生成功项目因污名化、等待名单和时间安排限制而导致的可及性不足问题。现有数字工具多聚焦于情绪打卡或聊天机器人,往往忽视目标设定与个人价值观的一致性。其解决方案的关键在于提出一种以目标为导向的心理健康辅导系统GROW,该系统将SMART原则与接纳承诺疗法(Acceptance and Commitment Therapy, ACT)相结合,通过对话式AI教练帮助学生明确人生抱负、分解为具体行动步骤,并进行进度反思;同时将行动计划同步至Google日历、发送提醒并提供可视化仪表盘,从而增强学生的参与感、责任感与意义感,提升目标实现的可能性。
链接: https://arxiv.org/abs/2604.04548
作者: Keya Shah,Himanshi Lalwani,Hanan Salam
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:College students face well-being challenges driven by academic pressure, financial strain, and social expectations. While campus counseling and student-success programs offer support, access is often limited by stigma, waitlists, and scheduling constraints. Existing digital tools focus on emotional check-ins or chatbots and may overlook structured goal setting and aligning goals with personal values. We present GROW, a goal-centered well-being coaching system that puts values-aligned goals at the center of the student experience. GROW combines the SMART framework with principles from Acceptance and Commitment Therapy in a conversational AI coach that helps students clarify aspirations, break them into concrete steps, and reflect on progress. The system links action plans with Google Calendar, sends reminders, and provides a dashboard that shows progress and engagement. We evaluated GROW through interviews with clinical psychologists, student-success staff, and faculty, followed by a one-week deployment with 30 undergraduates. Findings offer design implications for interactive systems that support engagement, accountability, and sense of purpose in higher education.
[HC-14] How can LLMs Support Policy Researchers? Evaluating an LLM-Assisted Workflow for Large-Scale Unstructured Data
【速读】:该论文旨在解决政策研究中获取和分析公众观点的效率与规模问题,传统方法如访谈、听证会和问卷调查虽能进行主题分析,但存在耗时长、成本高及样本多样性不足的局限。其解决方案的关键在于开发并验证一种基于大语言模型(Large Language Models, LLMs)的辅助主题分析工作流,通过自动化处理海量非结构化文本数据(如Reddit帖子和聊天机器人引导的访谈转录),实现对政策相关话语的高效挖掘与主题提炼,并将其结果与权威政策报告对比,从而为政策研究人员提供一种快速、可扩展且具有参考价值的数据驱动方法。
链接: https://arxiv.org/abs/2604.04479
作者: Yuhan Liu,Shuyao Zhou,Jakob Kaiser,Ella Colby,Jennifer Okwara,Maggie Wang,Varun Nagaraj Rao,Andrés Monroy-Hernández
机构: Princeton University (普林斯顿大学); Nuremberg Institute for Market Decisions (纽伦堡市场决策研究所)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Policy researchers need scalable ways to surface public views, yet they often rely on interviews, listening sessions, and surveys-analyzed thematically-that are slow, expensive, and limited in scale and diversity. LLMs offer new possibilities for thematic analysis of unstructured text, yet we know little about how LLM-assisted workflows perform for policy research. Building on a workflow for LLM-assisted thematic analysis of online forums, we conduct a study with 11 policy researchers, who use an early prototype and see it as a quick, rough-and-ready input to their research. We then extend and scale the workflow to analyze millions of Reddit posts and 1,058 chatbot-led interview transcripts on a policy-relevant topic, treating these sources as rich and scalable data for policy discourse. We compare the synthesized themes to those from authoritative policy reports, identify points of alignment and divergence, and discuss what this implies for policy researchers adopting LLM-assisted workflows.
[HC-15] Croissant Charts: Modulating the Performance of Normal Distribution Visualizations with Affordances
【速读】:该论文旨在解决现有可视化设计评价体系中缺乏对任务性能差异根源解释的问题,即如何从设计层面深入理解为何某些可视化在特定任务中表现更优。其解决方案的关键在于引入可及性(affordance)理论,通过识别静态正态概率密度函数图的当前可及性特征,明确支持概率比较任务的最优可及性,并据此设计出一种新型可及性驱动的可视化——Croissant Chart。该方案经预注册实验(n = 808)验证,证明了可及性设计能显著且可预测地提升用户任务表现,从而为可视化有效性评估提供机制性解释与设计指导。
链接: https://arxiv.org/abs/2604.04432
作者: Racquel Fygenson,Enrico Bertini,Lace M. Padilla
机构: Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA; College of Arts, Media and Design, Northeastern University, Boston, MA, USA
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Affordances, originating in psychology, describe how an object’s design influences the physical and cognitive actions users may take. Past work applied affordance theory to visualization to explain how design decisions can impact the cognitive actions of visualization readers. In this work, we demonstrate that affordances can complement effectiveness rankings by further explaining the root causes behind visualizations’ task performance. To do so, we conduct a case study on static normal probability density function plots, identifying their current affordances. Next, we identify the optimal affordances for a common probability-comparison task and develop a novel affordance-driven visualization, the Croissant Chart, to support them. We empirically validate the design’s effectiveness through a preregistered study (n = 808), demonstrating how affordances can inform predictable changes in task performance. Our findings underscore the potential for affordance-based approaches to enhance visualization effectiveness and inform future design decisions.
[HC-16] Justified or Just Convincing? Error Verifiability as a Dimension of LLM Quality
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险场景中部署时,用户难以判断模型单个回答正确性的问题,尤其是依赖模型生成的推理链或解释(justifications)进行判断时,缺乏统一衡量这些解释是否有助于用户区分正确与错误答案的标准。为此,作者提出“错误可验证性”(error verifiability)这一概念,并设计了一个平衡指标 $ v_\text{bal} $ 来量化解释对用户判断准确性的提升效果,经由人类标注者验证具有高一致性。研究发现,常规方法如后训练和模型规模扩展并不能提升该指标,而关键解决方案在于引入领域适配的外部信息:针对数学推理任务采用“反思重述”(reflect-and-rephrase, RR),针对事实型问答任务采用“最优重述”(oracle-rephrase, OR),二者均通过整合外部知识显著提升了错误可验证性,表明该维度独立于准确性提升,需专门设计、领域感知的方法来优化。
链接: https://arxiv.org/abs/2604.04418
作者: Xiaoyuan Zhu,Kimberly Le Truong,Riccardo Fogliato,Gokul Swamy,Weijian Zhang,Minglai Yang,Longtian Ye,Bangya Liu,Minghao Liu,Andrew Ilyas,Steven Wu
机构: University of Southern California (南加州大学); Carnegie Mellon University (卡内基梅隆大学); Microsoft Core AI; 2077AI
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:As LLMs are deployed in high-stakes settings, users must judge the correctness of individual responses, often relying on model-generated justifications such as reasoning chains or explanations. Yet, no standard measure exists for whether these justifications help users distinguish correct answers from incorrect ones. We formalize this idea as error verifiability and propose v_bal, a balanced metric that measures whether justifications enable raters to accurately assess answer correctness, validated against human raters who show high agreement. We find that neither common approaches, such as post-training and model scaling, nor more targeted interventions improve verifiability. We introduce two methods that succeed at improving verifiability: reflect-and-rephrase (RR) for mathematical reasoning and oracle-rephrase (OR) for factual QA, both of which improve verifiability by incorporating domain-appropriate external information. Together, our results establish error verifiability as a distinct dimension of response quality that does not emerge from accuracy improvements alone and requires dedicated, domain-aware methods to address.
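摘要未给出 v_bal 的具体公式;构造"平衡"指标的一种常见方式是对评估者判断取平衡准确率,即评估者在正确答案与错误答案两类上召回率的均值,使两类样本数量不均时任一类都不占主导。以下实现仅为假设性示意,并非论文的原始定义:

```python
def balanced_verifiability(answer_correct, rater_says_correct):
    """Hypothetical sketch of a balanced verifiability score: the mean of
    the rater's recall on correct answers and on incorrect answers."""
    pairs = list(zip(answer_correct, rater_says_correct))
    pos = sum(1 for c, _ in pairs if c)               # number of correct answers
    neg = len(pairs) - pos                            # number of incorrect answers
    tp = sum(1 for c, r in pairs if c and r)          # correct answers judged correct
    tn = sum(1 for c, r in pairs if not c and not r)  # incorrect answers judged incorrect
    return 0.5 * (tp / pos + tn / neg)
```

按此定义,始终判"正确"的评估者只能得到 0.5,只有同时能识别正确与错误答案的评估者才能接近 1。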
[HC-17] Gradual Cognitive Externalization: A Framework for Understanding How Ambient Intelligence Externalizes Human Cognition
【速读】:该论文试图解决的问题是:为何开发者正在将人类的认知功能(如沟通风格、指导策略和行为模式)编码为可复用的AI代理技能,且这种趋势正在加速?其解决方案的关键在于提出“渐进式认知外化”(Gradual Cognitive Externalization, GCE)框架,该框架认为人类认知功能正通过环境智能的协同适应过程逐步迁移至数字载体中,而非依赖于传统意义上的意识上传(mind uploading)。GCE的核心假设是“行为流形假说”(behavioral manifold hypothesis),即日常认知活动存在于低维流形结构中,具有可学习性和冗余性,从而使得从持续观察中提取并复制人类行为成为可能。论文进一步定义了三种区分工具使用与认知整合的标准(双向适应、功能等价性、因果耦合),并提出可验证的预测与实验协议,旨在量化认知外化的速度及其后续影响。
链接: https://arxiv.org/abs/2604.04387
作者: Zhimin Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Developers are publishing AI agent skills that replicate a colleague’s communication style, encode a supervisor’s mentoring heuristics, or preserve a person’s behavioral repertoire beyond biological death. To explain why, we propose Gradual Cognitive Externalization (GCE), a framework arguing that human cognitive functions are migrating into digital substrates through ambient intelligence co-adaptation rather than mind uploading. GCE rests on the behavioral manifold hypothesis: everyday cognition occupies a low-dimensional manifold that is structured, redundant, and learnable from sustained observation. We document evidence from scheduling assistants, writing tools, recommendation engines, and agent skill ecosystems showing that the preconditions for externalization are already observable. We formalize three criteria separating cognitive integration from tool use (bidirectional adaptation, functional equivalence, causal coupling), derive five testable predictions with theory-constrained thresholds, and provide a concrete experimental protocol. The question is no longer whether minds can be uploaded, but how fast cognitive functions are already migrating into digital substrates and what follows.
[HC-18] Towards Considerate Human-Robot Coexistence: A Dual-Space Framework of Robot Design and Human Perception in Healthcare
【速读】:该论文旨在解决当前人机共存研究中对人类感知动态演变机制理解不足的问题,特别是缺乏对人类如何在时间维度上持续参与并塑造机器人整合过程的深入洞察。解决方案的关键在于提出“人类感知空间”(human perception space)的概念框架,包含四个解释维度——分解程度、时间取向、推理范围和证据来源,并进一步将这一空间与机器人设计空间(robot design space)建模为一个协同演化的闭环系统,其中人类不仅是设计贡献者,更是意义阐释者与社会中介者,在部署各阶段主动影响机器人被理解和接纳的方式。
链接: https://arxiv.org/abs/2604.04374
作者: Yuanchen Bai,Zijian Ding,Ruixiang Han,Niti Parikh,Wendy Ju,Angelique Taylor
机构: Cornell Tech (康奈尔科技学院); Cornell University (康奈尔大学); University of Maryland, College Park (马里兰大学学院公园分校)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:The rapid advancement of robotics, spanning expanded capabilities, more intuitive interaction, and more integration into real-world workflows, is reshaping what it means for humans and robots to coexist. Beyond sharing physical space, this coexistence is increasingly characterized by organizational embeddedness, temporal evolution, social situatedness, and open-ended uncertainty. However, prior work has largely focused on static snapshots of attitudes and acceptance, offering limited insight into how perceptions form and evolve, and what active role humans play in shaping coexistence as a dynamic process. We address these gaps through in-depth follow-up interviews with nine participants from a 14-week co-design study on healthcare robots. We identify the human perception space, including four interpretive dimensions (i.e., degree of decomposition, temporal orientation, scope of reasoning, and source of evidence). We enrich the conceptual framework of human-robot coexistence by conceptualizing the mutual relationship between the human perception space and the robot design space as a co-evolving loop, in which human needs, design decisions, situated interpretations, and social mediation continuously reshape one another over time. Building on this, we propose considerate human-robot coexistence, arguing that humans act not only as design contributors but also as interpreters and mediators who actively shape how robots are understood and integrated across deployment stages.
[HC-19] Decoding Student Dialogue: A Multi-Dimensional Comparison and Bias Analysis of Large Language Models as Annotation Tools
【速读】:该论文旨在解决教育对话自动标注中人工标注耗时的问题,通过评估GPT-5.2与Gemini-3在多种提示策略(少样本提示、单代理提示和多代理反思提示)下的表现,探索生成式AI在教育场景下进行对话编码的可行性与局限性。其解决方案的关键在于采用多维度评估框架(涵盖情感、认知、元认知和行为四个编码维度),并识别出模型性能的高度情境依赖性及系统性偏差模式,从而为自动化标注工具的部署提供基于上下文敏感性的优化路径和偏差缓解策略。
链接: https://arxiv.org/abs/2604.04370
作者: Jie Cao,Zhanxin Hao,Jifan Yu
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: This paper has been accepted in AIED2026
Abstract:Educational dialogue is critical for decoding student learning processes, yet manual annotation remains time-consuming. This study evaluates the efficacy of GPT-5.2 and Gemini-3 using three prompting strategies (few-shot, single-agent, and multi-agent reflection) across diverse subjects, educational levels, and four coding dimensions. Results indicate that while multi-agent prompting achieved the highest accuracy, the results did not reach statistical significance. Accuracy proved highly context-dependent, with significantly higher performance in K-12 datasets compared to university-level data, alongside disciplinary variations within the same educational level. Performance peaked in the affective dimension but remained lowest in the cognitive dimension. Furthermore, analysis revealed four bias patterns: (1) Gemini-3 exhibited a consistent optimistic bias in the affective dimension across all subjects; (2) the cognitive dimension displayed domain-specific directional bias, characterized by systematic underestimation in Mathematics versus overestimation in Psychology; (3) both models are more prone to overestimation than underestimation within the meta-cognitive dimension; and (4) behavioral categories such as question, negotiation, and statements were frequently misclassified. These results underscore the need for context-sensitive deployment and targeted mitigation of directional biases in automated annotation.
[HC-20] Developing Authentic Simulated Learners for Mathematics Teacher Learning: Insights from Three Approaches with Large Language Models
【速读】:该论文旨在解决当前大型语言模型(Large Language Model, LLM)在模拟学生学习行为时存在的真实性不足问题,尤其是零样本(zero-shot)或少样本(few-shot)提示导致的推理不真实、语言不自然,从而误导教师对学生思维的判断。解决方案的关键在于采用三种改进策略:微调(Fine-tuning)、多智能体协作(Multi-agent)和直接偏好优化(Direct Preference Optimization, DPO),这些方法显著提升了模拟学生在认知和语言层面的真实性,并通过实证研究发现,相较于传统提示方式,DPO与多智能体方法更能展现学生策略背后的显式推理过程,从而增强其对教师教学实践的指导价值。
链接: https://arxiv.org/abs/2604.04361
作者: Jie Cao,Ha Nguyen,Selim Yavuz,Boran Yu,Shuguang Wang,Pavneet Kaur Bharaj,Dionne Cross Francis
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: This paper has been accepted in AIED’26
Abstract:Large Language Model (LLM) simulations, where LLMs act as students with varying approaches to learning tasks, can support teachers’ noticing of student thinking. However, simulations using zero- or few-shot prompting often yield inauthentic knowledge and language, directing teachers to unrealistic reasoning. We evaluate three approaches (Fine-tuning, Multi-agent, and Direct Preference Optimization; DPO) to improve the authenticity and pedagogical utility of simulated students. All approaches improve cognitive and linguistic authenticity, compared with few-shot prompts. Interviews with elementary mathematics pre-service teachers and researchers (n = 8) reveal distinct pedagogical affordances. The fine-tuned model produces realistic, brief responses but limits opportunities to extend students’ thinking. Meanwhile, the multi-agent and DPO approaches generate explicit reasoning behind student strategies. We discuss implications for designing LLM simulations that balance authenticity with instructional utility for teacher learning.
[HC-21] Talk2AI: A Longitudinal Dataset of Human–AI Persuasive Conversations
【速读】:该论文旨在解决如何通过大语言模型(Large Language Models, LLMs)驱动的对话来影响人类信念与态度变化的问题,特别是在社会敏感议题如气候变化、数学焦虑和健康误导信息上的说服效果。其解决方案的关键在于构建了一个大规模纵向数据集Talk2AI,包含3080次对话(共30,800轮交互),覆盖770名意大利成年参与者在四个周度会话中与四种不同LLM(GPT-4o、Claude Sonnet 3.7、DeepSeek-chat V3 和 Mistral Large)的互动,并结合详尽的社会人口学特征与心理测量学数据,从而实现对AI中介对话如何随时间塑造观点、信念稳定性和行为意图的细粒度分析。
链接: https://arxiv.org/abs/2604.04354
作者: Alexis Carrillo,Enrique Taietta,Ali Aghazadeh Ardebili,Giuseppe Alessandro Veltri,Massimo Stella
机构: 1. University of Bologna (博洛尼亚大学); 2. University of Rome Tor Vergata (罗马托尔韦加塔大学); 3. Istituto Superiore Mario Boella (马里奥·博埃拉高级研究所)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 17 pages, 2 figures, 7 tables
Abstract:Talk2AI is a large-scale longitudinal dataset of 3,080 conversations (totaling 30,800 turns) between human participants and Large Language Models (LLMs), designed to support research on persuasion, opinion change, and human-AI interaction. The corpus was collected from 770 profiled Italian adults across four weekly sessions in Spring 2025, using a within-subject design in which each participant conversed with a single model (GPT-4o, Claude Sonnet 3.7, DeepSeek-chat V3, or Mistral Large) on three socially relevant topics: climate change, math anxiety, and health misinformation. Each conversation is linked to rich contextual data, including sociodemographic characteristics and psychometric profiles. After each session, participants reported on opinion change, conviction stability, perceived humanness of the AI, and behavioral intentions, enabling fine-grained longitudinal analysis of how AI-mediated dialogue shapes beliefs and attitudes over time.
[HC-22] ReFinE: Streamlining UI Mockup Iteration with Research Findings
【速读】:该论文旨在解决人机交互(Human-Computer Interaction, HCI)研究文献在实际设计流程中难以应用的问题,具体表现为设计师在寻找相关文献、理解技术术语、获取上下文信息以及实现可操作性方面存在困难。解决方案的关键在于提出 ReFinE——一个集成于 Figma 的插件,能够实时识别并提炼与当前设计原型(mockup)上下文相关的HCI研究结论,并通过提供可视化的行动指引,将学术研究成果转化为针对具体设计元素的可执行优化建议,从而降低认知负荷并提升研究证据在用户界面(UI)设计中的整合效率。
链接: https://arxiv.org/abs/2604.04353
作者: Donghoon Shin,Bingcan Guo,Jaewook Lee,Lucy Lu Wang,Gary Hsieh
机构: University of Washington (华盛顿大学); Allen Institute for AI (艾伦人工智能研究所)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Although HCI research papers offer valuable design insights, designers often struggle to apply them in design workflows due to difficulties in finding relevant literature, understanding technical jargon, the lack of contextualization, and limited actionability. To address these challenges, we present ReFinE, a Figma plugin that supports real-time design iteration by surfacing contextualized insights from research papers. ReFinE identifies and synthesizes design implications from HCI literature relevant to the mockup’s design context, and tailors this research evidence to a specific design mockup by providing actionable visual guidance on how to update the mockup. To assess the system’s effectiveness, we conducted a technical evaluation and a user study. Results show that ReFinE effectively synthesizes and contextualizes design implications, reducing cognitive load and improving designers’ ability to integrate research evidence into UI mockups. This work contributes to bridging the gap between research and design practice by presenting a tool for embedding scholarly insights into the UI design process.
[HC-23] Cognibit: From Digital Exhaustion to Real-World Connection Through Gamified Territory Control and LLM-Powered Twin Networking
【速读】:该论文旨在解决传统社交匹配平台在真实人际互动模拟与用户长期参与激励机制方面的不足,尤其针对现有方法难以有效预测用户间兼容性并促进现实世界社交连接的问题。其解决方案的关键在于构建一个由生成式 AI (Generative AI) 驱动的社会发现平台,核心创新包括:(1)利用数字孪生(Digital Twin)技术实现用户代理的自主多轮对话行为仿真,以量化评估人际兼容性;(2)引入游戏化领土征服机制,通过激励现实空间探索来创造自然的线下相遇场景;(3)借助AI伴侣维持跨设备的持久共享记忆,增强用户关系的连续性和沉浸感。该系统基于CogniPair认知架构实现,并在哥伦比亚速配数据集上验证了其有效性,首次将仿真匹配扩展为可部署的社会发现环境,同时揭示了仅在组件层面测试时难以发现的规模化瓶颈。
链接: https://arxiv.org/abs/2604.04351
作者: Wanghao Ye,Sihan Chen,Yiting Wang,Shwai He,Bowei Tian,Guoheng Sun,Ziyi Wang,Ziyao Wang,Yexiao He,Zheyu Shen,Meng Liu,Yuning Zhang,Meng Feng,Yifei Dong,Yanhong Qian,Yang Wang,Siyuan Peng,Yilong Dai,Zhenle Duan,Joshua Liu,Lang Xiong,Hanzhang Qin,Ang Li
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 9 pages main body, 155 pages total with appendices
Abstract:We present an LLM-powered social discovery platform that uses digital twins to autonomously evaluate interpersonal compatibility through behavioral simulation. The platform unifies three key pillars: (1) digital twins that engage in autonomous multi-turn conversations on behalf of users to estimate compatibility, (2) gamified territory conquest mechanics that incentivize real-world exploration and create organic settings for in-person encounters, and (3) AI companions that preserve persistent shared memory across devices. Built upon CogniPair’s cognitive architecture (Ye et al., 2026), validated on the Columbia Speed Dating dataset (551 participants), our system extends prior simulation-only matching into a fully deployed social discovery environment. Through deployment, we derive empirical cost-quality baselines and identify fundamental scaling bottlenecks that remain hidden in component-level testing alone.
[HC-24] EcoAssist: Embedding Sustainability into AI-Assisted Frontend Development
【速读】:该论文旨在解决当前AI编程助手(如GitHub Copilot和Amazon CodeWhisperer)在提升开发者效率的同时,忽视前端代码能耗问题的现状,以及现有能源优化指南难以在实际开发中落地的“研究-实践”鸿沟。其解决方案的关键在于提出EcoAssist——一个集成于集成开发环境(IDE)中的能源感知辅助工具,能够分析AI生成的前端代码、估算其能耗并提供针对性优化建议。通过500个网站基准测试和20名开发者的对照实验验证,EcoAssist平均降低单网站能耗13–16%,同时增强开发者对能源影响的认知且不损害生产力,从而将能源考量直接嵌入AI辅助编码流程,实现可操作的反馈机制。
链接: https://arxiv.org/abs/2604.04332
作者: André Barrocas,Nuno Jardim Nunes,Valentina Nisi,Nikolas Martelaro
机构: Instituto Superior Técnico, University of Lisbon (里斯本大学理工学院); Carnegie Mellon University (卡内基梅隆大学)
类目: Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注: 16 pages, 11 figures, Accepted to ACM Conference on Human Factors in Computing Systems (CHI) 2026, Barcelona, Spain
Abstract:Frontend code, replicated across millions of page views, consumes significant energy and contributes directly to digital emissions. Yet current AI coding assistants, such as GitHub Copilot and Amazon CodeWhisperer, emphasize developer speed and convenience, with energy impact not yet a primary focus. At the same time, existing energy-focused guidelines and metrics have seen limited adoption among practitioners, leaving a gap between research and everyday coding practice. To address this gap, we introduce EcoAssist, an energy-aware assistant integrated into an IDE that analyzes AI-generated frontend code, estimates its energy footprint, and proposes targeted optimizations. We evaluated EcoAssist through benchmarks of 500 websites and a controlled study with 20 developers. Results show that EcoAssist reduced per-website energy by 13-16% on average, increased developers’ awareness of energy use, and maintained developer productivity. This work demonstrates how energy considerations can be embedded directly into AI-assisted coding workflows, supporting developers as they engage with energy implications through actionable feedback.
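下面给出一个极简的页面传输能耗估算示意(Python):按传输字节数乘以单位能耗系数来估算,是可持续网页计算器中常见的简化模型。EcoAssist 论文并未公开其具体估算公式,此处的系数取值与函数名均为示意性假设,不代表论文实现。

```python
# Hypothetical byte-based page-energy estimate, loosely in the spirit of
# EcoAssist; the paper does not publish its exact model.
# Assumption: a fixed energy intensity per gigabyte transferred.

KWH_PER_GB = 0.81  # illustrative transfer-energy coefficient (assumed)

def page_energy_kwh(page_bytes: int, monthly_views: int) -> float:
    """Estimated monthly transfer energy (kWh) for serving one page."""
    gb_per_view = page_bytes / 1e9
    return gb_per_view * KWH_PER_GB * monthly_views

def savings_pct(before_bytes: int, after_bytes: int) -> float:
    """Relative energy saving from shrinking the page payload."""
    return 100.0 * (before_bytes - after_bytes) / before_bytes
```

例如,在该简化模型下,把 2.0 MB 的页面裁剪到 1.7 MB,传输能耗约降低 15%,恰落在论文报告的 13-16% 区间内(仅为示意计算,非论文数据)。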
[HC-25] Effects of Generative AI Errors on User Reliance Across Task Difficulty
【速读】:该论文旨在解决用户对生成式 AI(Generative AI)在任务表现上存在“锯齿状”不一致性(即在人类认为简单的任务上出错,而在人类认为困难的任务上却表现良好)时的反应问题。其核心挑战在于理解这种非直观的错误模式是否会影响用户对 AI 的信任与依赖程度。解决方案的关键在于设计了一种激励相容的实验方法,通过控制图示生成任务的难度(易/难)和人为引入不同比例的错误率(10%、30%、50%),在预注册的3×2实验中系统性地测试用户行为变化。结果表明,尽管观察到更多错误会降低使用意愿,但易任务中的错误并未显著削弱用户依赖,暗示用户在此情境下并不排斥这种“锯齿状”性能表现,从而为未来研究如何调节任务难度与其他错误特征(如错误模式的学习难易度)提供了基础。
链接: https://arxiv.org/abs/2604.04319
作者: Jacy Reese Anthis,Hannah Cha,Solon Barocas,Alexandra Chouldechova,Jake Hofman
机构: Stanford University (斯坦福大学); University of Chicago (芝加哥大学); Microsoft Research (微软研究院)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Published in CHI EA 2026
Abstract:The capabilities of artificial intelligence (AI) lie along a jagged frontier, where AI systems surprisingly fail on tasks that humans find easy and succeed on tasks that humans find hard. To investigate user reactions to this phenomenon, we developed an incentive-compatible experimental methodology based on diagram generation tasks, in which we induce errors in generative AI output and test effects on user reliance. We demonstrate the interface in a preregistered 3x2 experiment (N = 577) with error rates of 10%, 30%, or 50% on easier or harder diagram generation tasks. We confirmed that observing more errors reduces use, but we unexpectedly found that easy-task errors did not significantly reduce use more than hard-task errors, suggesting that people are not averse to jaggedness in this experimental setting. We encourage future work that varies task difficulty at the same time as other features of AI errors, such as whether the jagged error patterns are easily learned.
[HC-26] HeartbeatCam: Self-Triggered Photo Elicitation of Stress Events Using Wearable Sensing
【速读】:该论文旨在解决心理治疗中客户难以准确回忆压力触发事件的具体情境问题,即在两次治疗间隔期间,个体往往无法清晰回忆起当时所处环境(如地点、视觉与听觉信息等)的关键细节,从而影响治疗效果。解决方案的关键在于提出一种名为HeartbeatCam的可穿戴感知系统,其核心机制是利用消费级智能手表检测到的心率变化作为生理应激信号,自动触发开源AR眼镜摄像头采集稀疏的图像-音频片段,形成可后期回放和标注的多模态记录,从而实现通过生理信号与情境信息融合的方式,支持心理健康专业人员与客户共同重构并解读压力触发时刻。
链接: https://arxiv.org/abs/2604.04314
作者: Boyang Zhou,Zara Dana
机构: University of Washington (华盛顿大学); Supportiv Inc. (支持者公司)
类目: Human-Computer Interaction (cs.HC)
备注: Workshop on Everyday Wearable for Personalized Health and Well-Being, CHI 2026
Abstract:People often recognize what triggered their stress only after the moment has passed. In therapy, this can become a recurring problem: clients are asked to remember what happened between sessions, but the details that matter (where they were, what they saw and heard, what was happening around them) are easy to lose. We introduce HeartbeatCam, a wearable sensing system that gathers contextual information during moments of elevated stress. It uses a consumer smartwatch stress signal to trigger capture from an open-source AR glasses camera, recording a sparse image-audio clip that can later be reviewed and annotated. The system adopts an actionable sensing approach to mental healthcare, using physiological signals along with contextual capture to support collaborative interpretation of stress-triggering moments with mental health professionals.
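HeartbeatCam 所述的触发逻辑可以用如下 Python 草图说明:以滑动窗口内的心率均值为基线,当当前心率高出基线某一阈值时触发一次稀疏采集。论文并未给出具体检测器,窗口长度与阈值均为示意性假设。

```python
# Minimal sketch of a stress-triggered capture gate: a smartwatch heart-rate
# stream gates sparse camera capture. Window size and threshold are assumed
# for illustration; the paper does not specify its detector.
from collections import deque

class StressTrigger:
    def __init__(self, window: int = 30, delta_bpm: float = 15.0):
        self.history = deque(maxlen=window)  # recent heart-rate samples
        self.delta = delta_bpm               # rise over baseline that fires

    def update(self, bpm: float) -> bool:
        """Feed one heart-rate sample; True means 'capture a clip now'."""
        fire = (
            len(self.history) == self.history.maxlen
            and bpm - sum(self.history) / len(self.history) >= self.delta
        )
        self.history.append(bpm)
        return fire
```

实际系统还需处理采样缺失、运动伪影与隐私开关等问题,此处仅展示"生理信号驱动采集"这一结构。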
[HC-27] MagicCopy: Bring my data along with me beyond boundaries of apps
【速读】:该论文旨在解决跨应用数据迁移过程中因表格表示形式不一致导致的格式重构成本高、效率低的问题(即传统复制粘贴机制无法适配不同应用程序间多样化的表格展示与结构),从而阻碍了用户在多工具协作中的流畅工作流。其解决方案的关键在于提出 MagicCopy,一种基于人工智能的跨应用复制粘贴系统,通过结合源端和目标端上下文信息,并利用自然语言指令,自动完成数据提取、解析、转换与重新格式化,实现跨应用数据的语义对齐与视觉一致性映射。
链接: https://arxiv.org/abs/2604.04307
作者: Priyan Vaithilingam,Elena L. Glassman,Nathalie Henry Riche,Gonzalo Ramos,Jeevana Priya Inala,Chenglong Wang
机构: Harvard University (哈佛大学); Microsoft (微软)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:People working with data often move their data across multiple applications, because they rely on these apps’ complementing user experiences to best complete their tasks. Since traditional copy-and-paste approaches do not accommodate diverse table representations adopted by different apps, users spend considerable effort to reconstruct data formats and visual representations, making cross-app workflows costly. For example, when transferring a spreadsheet table with conditional formatting to a markup document, users spend substantial time translating its structure into appropriate tags and manually reformat color. This paper introduces MagicCopy, an AI-powered cross-app copy-and-paste, leveraging source and target contexts and user-specified instructions in natural language to automatically extract, parse, transform, and (re)format data from one app to another. In a study with sixteen participants, users quickly learned and applied MagicCopy to move data across three pairs of tools. Participants further explored diverse applications of MagicCopy to support more streamlined cross-application interaction in their workflows.
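作为示意,下面的 Python 片段演示 MagicCopy 所自动化的那类"提取-转换-重排版"操作:把一种应用的表格表示(字典列表)重写为另一种应用的表示(Markdown 表格)。MagicCopy 本身依赖 LLM 与源/目标上下文完成语义对齐,此处仅为确定性的结构示意,函数名为笔者假设。

```python
# Toy version of the kind of transformation MagicCopy automates: re-emitting
# tabular data from one app's representation (list of dicts) in another's
# (a Markdown table). Deterministic sketch of the extract -> transform ->
# reformat shape only; the real system is LLM-driven.

def rows_to_markdown(rows: list[dict]) -> str:
    headers = list(rows[0])
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)
```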
[HC-28] Context Engineering: A Practitioner Methodology for Structured Human-AI Collaboration
【速读】:该论文旨在解决AI生成内容质量受制于提示工程技术但实际效果不佳的问题,核心发现是上下文完整性(context completeness)对输出质量的影响可能比提示技巧更为关键。解决方案的关键在于提出“上下文工程”(Context Engineering),即一套结构化的方法论,通过定义五角色上下文包结构(权威性Authority、示例Exemplar、约束Constraint、评分标准Rubric、元数据Metadata),结合四阶段流程(评审Reviewer→设计Design→构建Builder→审计Auditor),并引入可靠性工程和信息论的正式模型作为事后分析工具,系统性地提升上下文信息的完整性和组织性。实证研究表明,采用结构化上下文组装可显著减少任务迭代次数(从3.8降至2.0次)并提高首次接受率(从32%升至55%),最终成功率达91.5%。
链接: https://arxiv.org/abs/2604.04258
作者: Elias Calboreanu
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 39 pages, 6 figures, 10 tables, 47 references. Submitted to Springer Nature journal. Open-access extraction datasets and methodology artifacts available
Abstract:The quality of AI-generated output is often attributed to prompting technique, but extensive empirical observation suggests that context completeness may be more strongly associated with output quality. This paper introduces Context Engineering, a structured methodology for assembling, declaring, and sequencing the complete informational payload that accompanies a prompt to an AI tool. Context Engineering defines a five-role context package structure (Authority, Exemplar, Constraint, Rubric, Metadata), applies a staged four-phase pipeline (Reviewer to Design to Builder to Auditor), and applies formal models from reliability engineering and information theory as post hoc interpretive lenses on context quality. In an observational study of 200 documented interactions across four AI tools (Claude, ChatGPT, Cowork, Codex), incomplete context was associated with 72% of iteration cycles. Structured context assembly was associated with a reduction from 3.8 to 2.0 average iteration cycles per task and an improvement in first-pass acceptance from 32% to 55%. Among structured interactions, 110 of 200 were accepted on first pass compared with 16 of 50 baseline interactions; when iteration was permitted, the final success rate reached 91.5% (183 of 200). These results are observational and reflect a single-operator dataset without controlled comparison. Preliminary corroboration is provided by a companion production automation system with eleven operating lanes and 2,132 classified tickets.
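论文定义的五角色上下文包(Authority、Exemplar、Constraint、Rubric、Metadata)可以用如下 Python 数据类做一个组装示意;字段内容与拼接顺序为笔者假设,论文只定义角色划分,并未规定这一具体 API。

```python
# Sketch of the five-role context package the paper names (Authority,
# Exemplar, Constraint, Rubric, Metadata). Field contents and assembly
# order are illustrative assumptions, not the paper's exact format.
from dataclasses import dataclass

@dataclass
class ContextPackage:
    authority: str   # ground-truth sources the model must defer to
    exemplar: str    # worked examples of the desired output
    constraint: str  # hard limits (format, length, tooling)
    rubric: str      # how the output will be judged
    metadata: str    # task identifiers, versions, environment

    def assemble(self, prompt: str) -> str:
        """Sequence the declared roles ahead of the prompt itself."""
        parts = [
            ("AUTHORITY", self.authority),
            ("EXEMPLAR", self.exemplar),
            ("CONSTRAINT", self.constraint),
            ("RUBRIC", self.rubric),
            ("METADATA", self.metadata),
        ]
        body = "\n".join(f"[{k}] {v}" for k, v in parts)
        return f"{body}\n[TASK] {prompt}"
```

按该研究的观察,此类结构化组装与首轮通过率从 32% 升至 55% 相关;代码本身只展示"先声明上下文、再给任务"的顺序约定。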
[HC-29] Teacher Professional Development on WhatsApp and LLMs: Early Lessons from Cameroon
【速读】:该论文旨在解决生成式 AI (Generative AI) 在教育领域应用中对低资源环境教师群体的排斥问题,尤其是在数字基础设施薄弱地区,传统基于网页的在线平台难以覆盖教师使用需求。其解决方案的关键在于部署一个基于 WhatsApp 的聊天机器人(chatbot),利用大语言模型(LLM)支持的内容提供教师专业发展(TPD)服务,并通过与在线表单基线对比进行混合方法评估。实证结果显示,该方案在感知易用性和整体体验上显著优于基线,主要得益于 WhatsApp 平台的高熟悉度、低交互开销以及 LLM 内容的模块化设计,尽管仍受制于网络连接不稳定、预付费数据成本和英法双语需求等挑战。研究进一步提出面向多语言、文化适配的交互设计方向,强调以反思性、相关性和持续专业成长为导向的“审慎 AI”(Thoughtful AI)理念。
链接: https://arxiv.org/abs/2604.04139
作者: Vikram Kamath Cannanure,Bruno Yinkfu,Douglas Bryan,Mati Amin,Ingmar Weber
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at AIED 2026 (Practitioners Track). Mixed-methods field study (n=47) conducted in Cameroon; 10 pages, 3 figures
Abstract:AI in education is commonly delivered through web-based systems such as online forms and institutional platforms. However, these approaches can exclude teachers in low-resource contexts, where everyday mobile platforms like WhatsApp serve as primary digital infrastructure. To address this gap, we present a field pilot in Cameroon that deploys a WhatsApp-based chatbot with LLM-supported content for teacher professional development (TPD), compared with an online form baseline. The system was evaluated through a mixed-methods study with 47 primary school teachers, integrating quantitative measures with qualitative insights from interviews and participant feedback. Results show that the chatbot was rated higher in perceived usability and overall experience, while learnability remained comparable. These improvements were driven by platform familiarity, low interaction overhead, and the modular structure of LLM-supported content, but were constrained by connectivity limitations, prepaid data costs, and multilingual needs (English/French). Building on these findings, we outline design directions for multilingual, culturally grounded interaction and for supporting prompting and reflection in AI use. More broadly, this work points to Thoughtful AI that supports reflection, relevance, and sustained professional growth.
[HC-30] Lexical Indicators of Mind Perception in Human-AI Companionship
【速读】:该论文旨在解决人类在与人工智能(Artificial Intelligence, AI)建立陪伴关系时,心智感知(Mind Perception, MP)如何在自然语言中体现及其作用机制的问题。由于以往研究多依赖自我报告,难以捕捉自动化的MP过程与有意识表达之间的差异,作者提出通过分析AI相关讨论中的语言信号来揭示MP的潜在结构。解决方案的关键在于利用心智感知信号词(已知的代理性和体验性MP词汇及从数据中提取的新词)与AI陪伴话题的共现模式,结合归纳与演绎方法识别出一组合理的MP语言指标,并发现这些指标与关于AI陪伴真实性、哲学伦理想象等深层议题密切相关。
链接: https://arxiv.org/abs/2604.04105
作者: Jaime Banks,Jianghui Li
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注:
Abstract:Mind perception (MP) is a psychological phenomenon in which humans automatically infer that another entity has a mind and/or mental capacities, usually understood in two dimensions (perceived agency and experience capacities). Despite MP’s centrality to many social processes, understanding how MP may function in humans’ machine companionship relations is limited. This is in part due to reliance on self reports and the gap between automatic MP processes and more purposeful and norm governed expressions of MP. We here leverage MP signaling language to explore the relationship between MP and AI companionship in humans’ natural language. We systematically collected discussions about companionship from AI dedicated Reddit forums and examined the cooccurrence of words (a) known to signal agentic and experiential MP and those induced from the data and (b) discussion topics related to AI companionship. Using inductive and deductive approaches, we identify a small set of linguistic indicators as reasonable markers of MP in human/AI chat, and some are linked to critical discussions of companion authenticity and philosophical and ethical imaginaries.
[HC-31] BadgeX: IoT-Enhanced Wearable Analytics Meets LLMs for Collaborative Learning
【速读】:该论文旨在解决教育场景中复杂协作动态难以实时捕捉与分析的问题,传统方法往往缺乏对多模态学习行为数据的高效处理能力,导致无法提供及时、理论驱动的学习支持。解决方案的关键在于构建BadgeX系统,该系统融合轻量级可穿戴物联网设备(如智能徽章和智能手机)与大语言模型(Large Language Models, LLMs),通过采集音频、图像、运动及深度等多模态传感器数据,将其转化为结构化特征,并由LLM框架进行语义解析,从而生成基于学习理论的高阶洞察,实现对协作过程的实时可视化与可解释分析。
链接: https://arxiv.org/abs/2604.04093
作者: Zaibei Li,Shunpei Yamaguchi,Qiuchi Li,Daniel Spikol
机构: University of Copenhagen(哥本哈根大学); Beijing Institute of Technology(北京理工大学); Hiroshima City University(广岛市立大学)
类目: Human-Computer Interaction (cs.HC)
备注: 4 pages, 2 figures. Preprint. Work in progress
Abstract:We present BadgeX, a novel system integrating lightweight wearable IoT devices (smart badges/smartphones) with Large Language Models (LLMs) to enable real-time collaborative learning analytics. The system captures multimodal sensor data (e.g., audio, image, motion, depth) from learners, processes it into structured features, and employs an LLM-driven framework to interpret these features, generating high-level insights grounded in learning theory. A pilot study demonstrated the system’s capability to capture rich collaboration traces and for an LLM to produce plausible, theoretically coherent narrative analyses from sensor-derived features. BadgeX aims to lower deployment barriers, making complex collaborative dynamics visible and offering a pathway for real-time support in educational settings.
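以 BadgeX 流水线中"传感器数据到结构化特征"的一步为例,下面的 Python 草图把各徽章的逐秒语音活动信号汇总为"发言占比"特征,供 LLM 进一步解读。特征名称与数据格式均为示意性假设,论文仅列出音频/图像/运动/深度等模态,未公开具体特征模式。

```python
# Illustrative feature-extraction step in a BadgeX-style pipeline: raw
# per-second voice-activity flags (1 = speaking) from each badge become a
# structured "speaking share" feature an LLM could then interpret.
# The feature name and schema are assumptions, not the paper's.

def speaking_share(voice_activity: dict[str, list[int]]) -> dict[str, float]:
    """Fraction of total speech each learner contributed (0.0-1.0)."""
    totals = {who: sum(flags) for who, flags in voice_activity.items()}
    grand = sum(totals.values()) or 1  # avoid division by zero in silence
    return {who: t / grand for who, t in totals.items()}
```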
[HC-32] What Do We Need for an Agentic Society?
【速读】:该论文旨在解决如何构建“智能体社会”(agentic society)这一复杂系统中个体智能对象(agentic objects)之间的协调问题。尽管单个对象可通过生成式AI(Generative AI)等技术实现自主性(autonomy)、反应性(reactivity)、主动性(pro-activeness)和社会能力(social ability),但个体能力并不自动保证群体层面的有效协作。解决方案的关键在于识别并应对协调失败的三种模式:误报(false positives)导致信任破坏、死锁(deadlocks)阻碍行动、对抗性污染(adversarial corruption)扭曲判断。论文由此提出三个核心研究问题——“分享什么”(what to share)、“如何判断”(how to judge)和“何时行动”(when to act),从而为未来构建可信赖、高效且鲁棒的智能体社会提供理论框架与研究路径。
链接: https://arxiv.org/abs/2604.03938
作者: Kwon Ko,Hyoungwook Jin
机构: Stanford University (斯坦福大学); University of Michigan (密歇根大学)
类目: Human-Computer Interaction (cs.HC)
备注: 4 pages, 1 figure
Abstract:Thirty years ago, Wooldridge and Jennings defined intelligent agents through four properties: autonomy, reactivity, pro-activeness, and social ability. Today, advances in AI can empower everyday objects to become such intelligent agents. We call such objects agentic objects and envision that they can form an agentic society: a collective agentic environment that perceives patterns, makes judgments, and takes actions that no single object could achieve alone. However, individual capability does not guarantee coordination. Through an illustrative scenario of a teenager experiencing bullying and depression, we demonstrate both the promise of coordination and its failure modes: false positives that destroy trust, deadlocks that prevent action, and adversarial corruption that poisons judgment. These failures reveal open questions spanning three phases: what to share, how to judge, and when to act. These questions chart a research agenda for building agentic societies.
[HC-33] Enhancing behavioral nudges with large language model-based iterative personalization: A field experiment on electricity and hot-water conservation
【速读】:该论文旨在解决传统行为 nudging(提示)在实际应用中效果受限的问题,尤其是在用户需反复将反馈转化为可操作步骤且环境不断变化的情境下,认知负担较高导致干预效果下降。其解决方案的关键在于引入大语言模型(Large Language Models, LLMs)构建一个可迭代个性化(iterative personalization)的智能代理,通过生成并持续更新针对个体情境的定制化指导,降低用户的认知负荷。实验结果表明,LLM-个性化提示组(T2)在电力和热水节约行为上均显著优于传统文本提示组(C),尤其在前两轮干预中即显现优势,并伴随个性化引导内容的动态调整与用户参与度提升,验证了LLM驱动的迭代个性化是增强行为 nudging 有效性的关键机制。
链接: https://arxiv.org/abs/2604.03881
作者: Zonghan Li,Yi Liu,Chunyan Wang,Song Tong,Kaiping Peng,Feng Ji
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Nudging is widely used to promote behavioral change, but its effectiveness is often limited when recipients must repeatedly translate feedback into workable next steps under changing circumstances. Large language models (LLMs) may help reduce part of this cognitive work by generating personalized guidance and updating it iteratively across intervention rounds. We developed an LLM agent for iterative personalization and tested it in a three-arm randomized experiment among 233 university residents in China, using daily electricity and shower hot-water conservation as objectively measured cases differing in friction. LLM-personalized nudges (T2) produced the largest conservation effects, while image-enhanced conventional nudges (T1) and text-based conventional nudges (C) showed similar outcomes (omnibus p = 0.009). Relative to C, T2 reduced electricity consumption by 0.56 kWh per room-day (p = 0.014), corresponding to an 18.3 percentage-point higher adjusted saving rate. This advantage emerged within the first two intervention rounds, alongside iterative updating of personalized guidance, and persisted thereafter. Hot-water outcomes followed the same direction but were smaller, less precisely estimated, and attenuated over time, consistent with stronger friction in this domain. LLM-personalized nudges emphasized prospective and context-specific guidance and were associated with higher participant engagement. This study provides field evidence that LLM-based iterative personalization can enhance behavioral nudging, with behavioral friction as a potential boundary condition. Larger trials and extension to more behaviors are warranted.
[HC-34] Can Humans Tell? A Dual-Axis Study of Human Perception of LLM -Generated News
【速读】:该论文旨在解决人类是否能够可靠区分由大语言模型(Large Language Model, LLM)生成的新闻文章与人类撰写的文本这一问题。研究通过JudgeGPT平台对2,318条来自六种不同LLM的文本进行独立评估,发现参与者在源属性判断(人类 vs. 机器)上无法达到统计显著的准确率(p > .05),且该结果在所有测试模型中均成立,包括参数量低至7B的开源模型。关键解决方案在于揭示用户侧检测不可靠性,并由此论证应转向系统级防护机制,如基于密码学的内容溯源(cryptographic content provenance),以实现更有效的生成式AI内容治理。
链接: https://arxiv.org/abs/2604.03755
作者: Alexander Loth,Martin Kappes,Marc-Oliver Pahl
机构: Frankfurt University of Applied Sciences(法兰克福应用技术大学); IMT Atlantique(IMT大西洋学院)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 6 pages, 6 figures, 1 table. Accepted at the 18th ACM Web Science Conference (WebSci Companion '26)
Abstract:Can humans tell whether a news article was written by a person or a large language model (LLM)? We investigate this question using JudgeGPT, a study platform that independently measures source attribution (human vs. machine) and authenticity judgment (legitimate vs. fake) on continuous scales. From 2,318 judgments collected from 1,054 participants across content generated by six LLMs, we report five findings: (1) participants cannot reliably distinguish machine-generated from human-written text (p > .05, Welch’s t-test); (2) this inability holds across all tested models, including open-weight models with as few as 7B parameters; (3) self-reported domain expertise predicts judgment accuracy (r = .35, p < .001) whereas political orientation does not (r = -.10, n.s.); (4) clustering reveals distinct response strategies (“Skeptics” vs. “Believers”); and (5) accuracy degrades after approximately 30 sequential evaluations due to cognitive fatigue. The answer, in short, is no: humans cannot reliably tell. These results indicate that user-side detection is not a viable defense and motivate system-level countermeasures such as cryptographic content provenance.
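该研究的主检验为 Welch's t 检验(两组方差不等时的均值比较)。下面用 Python 标准库给出统计量本身的一个极简实现;输入数据为虚构示例,与 JudgeGPT 的真实数据无关。

```python
# Welch's t statistic for two independent samples with unequal variances,
# the headline test in the study (here applied to made-up scores, not the
# JudgeGPT dataset).
from statistics import mean, variance

def welch_t(a: list[float], b: list[float]) -> float:
    """Welch's t statistic: mean difference over pooled standard error."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    return (mean(a) - mean(b)) / (va + vb) ** 0.5
```

两组得分完全相同时统计量为 0;要得到论文中的 p 值,还需按 Welch-Satterthwaite 公式估计自由度并查 t 分布,此处从略。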
[HC-35] 15 Years of Augmented Human(s) Research: Where Do We Stand?
【速读】:该论文旨在解决“增强人类(Augmented Human, AH)研究的具体内涵、核心主题及其会议系列演化路径不清晰”的问题。其解决方案的关键在于通过科学计量学分析方法,对过去15年AH会议(共735篇论文)进行系统性梳理,聚焦地理分布、投稿与引用时间线、作者频次与影响力以及主题建模等维度,从而揭示AH领域的研究演进规律与关键议题,并指出当前领域在定义边界和研究范围上存在的模糊性问题。
链接: https://arxiv.org/abs/2604.03715
作者: Steeven Villa,Abdallah El Ali
机构: LMU Munich (慕尼黑路德维希马克西米利安大学); Centrum Wiskunde & Informatica (荷兰数学与计算机科学研究中心); Utrecht University (乌得勒支大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted to Augmented Humans International Conference 2026
Abstract:The Augmented Human vision broadly seeks to improve or expand baseline human functioning through the restoration or extension of physical, intellectual, and social capabilities. However, given the rapid pace of technology development, we ask: what exactly does Augmented Human research involve, what are its core themes, and how has the Augmented Human(s) conference series evolved over time? To answer this, we conducted a scientometric analysis on the past 15 years of the Augmented Human(s) conference (N=735 papers), focusing on: geographical aspects, submissions and citation timelines, author frequency and popularity, and topic modeling. We find that: (a) the number of papers in the conference exhibits a bimodal distribution, peaking in 2015 and 2025, but showing periods of stagnant growth; (b) key topics over time include Haptics, Wearable Sensing, Vision Eye Tracking, Embodied Interaction, and Sports / Motion; (c) some seminal papers on AH are not published in AH(s), but rather at related venues (e.g., CHI); (d) the conference has an active Japanese HCI community despite its historical Eurocentric location dominance. We contribute a closer look at the trajectory of the AH(s) field, and raise considerations of definitional and research scope ambiguities given the core problems/enhancements the field seeks to address.
[HC-36] Seeking Socially Responsible Consumers: Exploring the Intention-“Search”-Behaviour Gap
【速读】:该论文试图解决消费者在社会负责任消费(Socially Responsible Consumption)中普遍存在的“意图-行为差距”问题,即消费者虽有社会责任意识,但在实际购买决策中却常选择不符合其价值观的产品。研究发现,这一差距部分源于消费者对环境、社会及治理(Environmental, Social, and Governance, ESG)等责任维度的信息缺乏或态度漠视,而非单纯的行为失控;同时,信息获取的困难(如信息可得性差、可靠性低)进一步加剧了该差距。因此,解决方案的关键在于构建能够支持消费者高效获取可靠信息的搜索系统(Search Systems),从而提升其在购买决策中的知情度与责任感,缩小意图与行为之间的鸿沟。
链接: https://arxiv.org/abs/2604.03694
作者: Leif Azzopardi,Frans van de Sluis
机构: University of Strathclyde (斯特拉思克莱德大学); University of Copenhagen (哥本哈根大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:The increasing prominence of Socially Responsible Consumers has brought about a heightened focus on the ethical, environmental, social, and ideological dimensions influencing product purchasing decisions. Despite this emphasis, studies have consistently revealed a significant gap between individuals’ intentions to be socially responsible and their actual purchasing behaviors: they often choose products that do not align with their values. This paper aims to investigate how search influences this gap. Our investigation involves an online survey of 286 participants, where we inquire about their search behaviors and whether they considered various dimensions, ranging from price and features to environmental, social, and governance issues in relation to a recent purchase. Contrary to expectations of a clear intention-behavior gap, our findings suggest that a considerable number of participants exhibited indifference or lack of information regarding these responsible aspects. Meanwhile, difficulties related to searching for and acquiring information, including its limited accessibility and reliability, also contributed to the gap. This suggests that part of the intention-behaviour gap can be framed as an information seeking problem. Moreover, our findings warrant and motivate search systems that help support consumers make more informed and responsible purchasing decisions.
[HC-37] FlueBricks: A Construction Kit of Flute-like Instruments for Acoustic Reasoning
【速读】:该论文试图解决传统声学乐器作为静态 artifacts 缺乏动态交互与可塑性,难以支持用户通过实践深入理解声学原理的问题。解决方案的关键在于提出 FlueBricks——一个基于模块化设计的声学推理工具包,通过构建和定制类笛乐器(flute-like instruments)来实现生成器(generator)、谐振腔(resonator)和连接件(connector)模块的灵活组合,使用户在动手组装与演奏过程中直观探索气动声学特性(aeroacoustic properties),如吹孔设计、管长和音孔位置对起振阈值(onset)、音高(pitch)和音色(timbre)的影响,从而形成“设计-演奏-反馈”的闭环机制,推动声学乐器从静态对象向动态系统转变,促进具身化的声学推理(embodied acoustic reasoning)。
链接: https://arxiv.org/abs/2604.03636
作者: Bo-Yu Chen,Chiao-Wei Huang,Lung-Pan Cheng
机构: National Taiwan University (国立台湾大学)
类目: Human-Computer Interaction (cs.HC); Sound (cs.SD)
备注: Accepted to CHI 2026
Abstract:We present FlueBricks, a construction kit for acoustic reasoning via building and customizing flute-like instruments. By assembling generator, resonator, and connector modules that embody various aeroacoustic properties, users gain deeper understanding of how blowhole, tube length, and tone-hole placement alter onset, pitch, and timbre through hands-on experimentation. This forms a designer-player loop of configuring and playing to form, test, and refine acoustic behaviors (acoustic reasoning), shifting acoustic instruments from static artifacts to dynamic systems. To understand how users engage with this system, we conducted an exploratory study with 12 participants ranging from novices to professional musicians. During their explorations, we observed participants fluently switching between designer and player roles, scaffolding designs from familiar instruments, forming and refining their acoustic understanding of length, tone holes, and generator geometry, reinterpreting modules beyond their intended functions, and using their creations for performative acts such as pedagogical showing and musical expression. These collectively demonstrated FlueBricks’s potential as a pedagogical tool for embodied acoustic reasoning.
[HC-38] Language Scent: Exploring Cross-Language Information Navigation
【速读】:该论文旨在解决多语言用户在跨语言信息获取过程中面临的系统支持不足问题,当前信息通常按语言隔离,导致用户难以高效地在不同语言间切换。其核心解决方案是提出“语言嗅觉”(language scent)概念,将Pirolli和Card的信息嗅觉理论扩展至多语言场景,强调用户基于对目标语言价值的感知来制定元策略进行语言切换。关键设计在于Niffler搜索系统,通过上下文线索、原位工具和反思支持机制增强语言嗅觉,从而促进探索性与细粒度的跨语言信息导航策略形成,并提升信息多样性获取效果。
链接: https://arxiv.org/abs/2604.03604
作者: Jiawen Stefanie Zhu,Katharina Reinecke,Tanushree Mitra
机构: University of Washington (华盛顿大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:While multilingual users often switch between languages when seeking information, this process remains undersupported by current systems where information is typically siloed by language. Our formative study reveals that users’ cross-language transitions are guided by their perceived value of switching to a language, a concept we formalize as language scent. Language scent extends Pirolli and Card’s theory of information scent to multilingual scenarios by considering meta-level strategy formation when navigating between different languages. To support language scent, we designed Niffler, a search system that augments language scent and supports cross-language information navigation through contextual cues, in-situ tools, and reflection support. A lab study with 16 multilingual speakers showed that Niffler facilitated the formation and execution of exploratory and granular search strategies and leads to diverse information being gathered. Our findings establish language scent as a valuable lens on cross-language information seeking, highlighting language’s role in enabling access to broader information and offering concrete implications for the design of multilingual search systems.
[HC-39] AgenticFlict: A Large-Scale Dataset of Merge Conflicts in AI Coding Agent Pull Requests on GitHub
【速读】:该论文旨在解决AI编码代理(AI coding agents)在软件开发中作为主动贡献者时,其生成代码与人类开发者代码之间因合并冲突(merge conflicts)带来的集成挑战问题。现有研究多聚焦于AI辅助开发的生产力提升和接受度,但对代理生成内容在实际协作流程中的整合障碍缺乏系统理解。解决方案的关键在于构建并公开一个大规模、细粒度的文本合并冲突数据集——AgenticFlict,该数据集基于59,000多个仓库的142,000+个AI编码代理拉取请求(Agentic PRs),通过确定性合并模拟识别出29,000+个存在冲突的PR(冲突率为27.67%),并提取了336,000+个细粒度冲突区域,为深入分析AI生成代码的可集成性及优化协同开发流程提供了实证基础。
链接: https://arxiv.org/abs/2604.03551
作者: Daniel Ogenrwot,John Businge
机构: University of Nevada Las Vegas (内华达大学拉斯维加斯分校)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 8 pages, 5 figures
Abstract:Software Engineering 3.0 marks a paradigm shift in software development, in which AI coding agents are no longer just assistive tools but active contributors. While prior empirical studies have examined productivity gains and acceptance patterns in AI-assisted development, the challenges associated with integrating agent-generated contributions remain less understood. In particular, merge conflicts, a fundamental aspect of collaborative software development, remain underexplored in this context. In this paper, we present AgenticFlict, a large-scale dataset of textual merge conflicts in AI coding agent pull requests (Agentic PRs). The dataset comprises 142K+ Agentic PRs collected from 59K+ repositories, of which 107K+ are successfully processed through deterministic merge simulation. Our pipeline identifies 29K+ PRs exhibiting merge conflicts, yielding a conflict rate of 27.67%, and extracts 336K+ fine-grained conflict regions across these instances. Our preliminary exploratory analysis indicates that merge conflicts are both frequent and often substantial in AI-generated contributions, with noticeable variation across agents, emphasizing the need to better understand and manage integration challenges in AI-assisted software development. The dataset, code, and supplementary materials are available on Zenodo: this https URL.
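The fine-grained conflict regions mentioned above are delimited in a Git merge result by standard conflict markers (`<<<<<<<`, `=======`, `>>>>>>>`). As an illustrative sketch, not the authors' actual extraction pipeline, a scanner over those markers might look like:

```python
def extract_conflict_regions(merged_text):
    """Scan a simulated merge result for Git conflict markers and return
    one (ours, theirs) text pair per conflict region."""
    regions, ours, theirs = [], None, None
    in_theirs = False
    for line in merged_text.splitlines():
        if line.startswith("<<<<<<<"):          # start of a conflict region
            ours, theirs, in_theirs = [], [], False
        elif line.startswith("=======") and ours is not None:
            in_theirs = True                    # switch to the incoming side
        elif line.startswith(">>>>>>>") and ours is not None:
            regions.append(("\n".join(ours), "\n".join(theirs)))
            ours, theirs = None, None           # end of region
        elif ours is not None:
            (theirs if in_theirs else ours).append(line)
    return regions
```

Counting PRs whose merge result yields a non-empty region list would then give the dataset's conflict rate.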
[HC-40] Beyond Generation: An Empirical Study on Redefining the Act of Drawing Through an 85% Time Reduction in Picture-Book Production
【速读】:该论文旨在解决传统图画书(picture-book)创作过程中创作者面临的高强度物理劳动与时间消耗问题,以及生成式AI(Generative AI)在应用中可能引发的风格同质化和作者主体性弱化风险。其解决方案的关键在于构建一种“轻度工作”(mild-work)协作流程:通过将AI用于早期草图绘制等机械性任务,显著缩短制作周期(减少85.2%),并将节省的时间重新投入高阶判断(如美学选择、叙事方向与跨场景一致性决策)和完成阶段(包括手工润饰与整合优化),从而实现从重复性劳动向创造性合成的策略性转移,保障专业级出版质量的同时强化创作者的核心作用。
链接: https://arxiv.org/abs/2604.03549
作者: Cosei Kawa
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 10 pages, 6 figures, 1 table. This study provides an empirical comparison between conventional and AI-collaborative picture-book production workflows
Abstract:Conventional picture-book production imposes substantial physical and temporal demands on creators, often constraining opportunities for high-level artistic exploration. While generative AI can drastically accelerate image generation, concerns remain regarding style homogenization and the erosion of authorial agency in professional practice. This study presents an empirical evaluation of an AI-collaborative workflow through the full production of one professional 15-illustration picture-book title, and compares the process with a conventional hand-drawn pipeline by the same creator. Quantitatively, the proposed workflow reduces total production time by 85.2% (from 2,162.8 to 320.4 hours), with the largest substitution observed in early drafting stages. Qualitatively, however, the core contribution is the strategic reallocation of labor: time saved in mechanical rendering is reinvested into high-level Judgment (aesthetic selection, narrative direction, and cross-scene consistency decisions) and Completion (embodied manual retouching and integrative refinement). Notably, 235 hours were devoted to Completion, indicating that publication-quality outcomes still depend on sustained human synthesis to reconcile generative inconsistencies. Our findings suggest that AI-integration, when framed as a “mild-work” partnership, enhances rather than diminishes the creative experience by shifting the creator’s focus from repetitive physical labor to sophisticated aesthetic synthesis.
[HC-41] YT-Pilot: Turning YouTube into Structured Learning Pathways with Context-Aware AI Support
【速读】:该论文旨在解决YouTube等平台上的非正式学习过程中存在的知识碎片化问题,即学习者难以有效规划学习路径、理解视频间的关联性以及跟踪整体进展。现有工具将学习与规划割裂,缺乏持续的交互结构来整合二者。解决方案的关键在于提出YT-Pilot系统,其基于自我调节学习理论(Self-Regulated Learning Theory, SRLT),将“学习路径”(learning pathway)设计为一个持久且面向用户的交互结构,贯穿于目标设定、计划制定、导航、进度追踪及跨视频辅助等环节,从而提升学习者的路径感知清晰度和协同推理能力。
链接: https://arxiv.org/abs/2604.03543
作者: Dina Albassam,Kexin Quan,Mengke Wu,Sanika Pande,ChengXiang Zhai,Yun Huang
机构: University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校); School of Information Sciences(信息科学学院); Computer Science(计算机科学系); SALT Lab(智能与学习技术实验室)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:YouTube is widely used for informal learning, where learners explore lectures and tutorials without a predefined curriculum. However, learning across videos remains fragmented: learners must decide what to watch, how videos relate, and how knowledge builds. Existing tools provide partial support but treat planning and learning as separate activities, lacking a persistent interaction structure that connects them. Grounded in self-regulated learning theory (SRLT), we introduce YT-Pilot, a pathway-aware learning system that operationalizes the learning pathway as a persistent, user-facing interaction structure spanning planning and learning. The pathway coordinates goal setting, planning, navigation, progress tracking, and cross-video assistance. Through a within-subjects study (N=20), we show that YT-Pilot significantly improves perceived goal clarity, pathway coherence, and progress tracking, while shifting interaction toward pathway-level reasoning across multiple resources.
[HC-42] Amplifying Rural Educators Perspectives: A Qualitative Study of Generative AIs Impact in Rural U.S. High Schools
【速读】:该论文旨在解决生成式 AI (Generative AI) 在农村学校教育场景中因资源不均而难以有效落地的问题,尤其关注其可能加剧既有教育不平等的潜在风险。研究表明,尽管农村教师已尝试利用 GenAI 提升教学效率,但基础设施薄弱、技术接受度低及 AI 素养培训缺失等障碍限制了其深度整合。解决方案的关键在于采用面向农村教育情境的差异化设计策略,并推动包容性 GenAI 设计理念,同时重新审视对技术采纳的假设,以确保技术赋能而非进一步扩大城乡教育差距。
链接: https://arxiv.org/abs/2604.03542
作者: Shira Michel,Benjamin Taylor,Sabrina Parra Díaz,Joseph B. Wiggins,Ed Finn,Mahsan Nourani
机构: The Roux Institute at Northeastern University (东北大学罗克斯研究所); Katabasis (卡塔巴西); Arizona State University (亚利桑那州立大学)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: This paper has been accepted to CHI 2026
Abstract:Recent breakthroughs in Generative AI (GenAI) are reshaping educational landscapes, presenting challenges and opportunities. While all contexts present unique challenges, rural schools are historically under-resourced, facing persistent technology-related barriers. To understand and reduce these barriers, we studied 31 rural high school educators across three U.S. states to examine their use of GenAI and understand how GenAI introduces new challenges, opportunities, and may exacerbate existing educational barriers. Results show while rural educators use GenAI to streamline teaching tasks, existing resource disparities restrict meaningful integration. Through rural educators’ voices, we reveal issues like infrastructure barriers, resistance to adoption, and lack of AI literacy training create significant obstacles. Nonetheless, educators envision GenAI can support themselves and their students, but findings emphasize the need for rural-specific design approaches. As a community, embracing inclusive GenAI design and re-examining assumptions about technology adoption in under-served educational contexts is essential to reducing barriers rather than widening them.
[HC-43] Incentives shape how humans co-create with generative AI
【速读】:该论文旨在解决生成式 AI(Generative AI)在提升个体生产力的同时可能引发群体创意趋同的问题,即AI使用可能导致集体多样性下降,限制多元观点和想法的产生。其解决方案的关键在于激励机制的设计:通过将奖励机制从单纯追求质量转向鼓励与同龄人相比的原创性,能够有效缓解AI带来的同质化效应。具体而言,这种激励结构促使参与者更审慎地使用AI——并非减少AI使用,而是选择性地将其用于头脑风暴、校对和针对性修改,而非直接采纳建议,从而在保持效率的同时增强群体产出的多样性。
链接: https://arxiv.org/abs/2604.03529
作者: Nathanael Jo,Manish Raghavan
机构: Massachusetts Institute of Technology (麻省理工学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Generative AI is quickly becoming an integral part of people’s everyday workflows. Early evidence has shown that while generative AI can increase individual-level productivity, it does so at the cost of collective diversity, potentially narrowing the set of ideas and perspectives produced. Our research stands in contrast to this concern: through a pre-registered randomized control trial, we show that incentives mediate AI’s homogenizing force in a creative writing task where participants can use AI interactively. Participants rewarded for originality relative to peers produce collectively more diverse writing than those rewarded for quality alone. This divergence is driven not by abandoning AI, but by how participants use it: those incentivized for originality incorporate fewer AI suggestions verbatim, relying on the model more selectively for brainstorming, proofreading, and targeted edits. Our results reveal that the effects of generative AI depend not only on the technology itself, but also the behavioral strategies and incentive structures surrounding its use.
[HC-44] Explainable Model Routing for Agentic Workflows
【速读】:该论文旨在解决当前智能体工作流(agentic workflows)中模型路由机制缺乏透明性的问题,即现有架构仅关注性能优化而忽视了模型能力与成本之间的权衡,导致开发者难以区分“智能效率”(如合理分配专用模型)与因预算驱动选择模型所引发的潜在失败。解决方案的关键在于提出Topaz框架,其核心创新包括:(i) 基于技能的建模(skill-based profiling),将多基准测试结果整合为细粒度的能力画像;(ii) 可追溯的路由算法,通过预算约束和多目标优化生成清晰的决策路径;(iii) 面向开发者的自然语言解释模块,将技术性追踪转化为可读逻辑说明,从而实现对模型路由决策的审计、理解和可控调优。
链接: https://arxiv.org/abs/2604.03527
作者: Mika Okamoto,Ansel Kaplan Erol,Mark Riedl
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: ACM CHI 2026 Human-Centered Explainable AI (HCXAI) Workshop (Spotlight)
Abstract:Modern agentic workflows decompose complex tasks into specialized subtasks and route them to diverse models to minimize cost without sacrificing quality. However, current routing architectures focus exclusively on performance optimization, leaving underlying trade-offs between model capability and cost unrecorded. Without clear rationale, developers cannot distinguish between intelligent efficiency – using specialized models for appropriate tasks – and latent failures caused by budget-driven model selection. We present Topaz, a framework that introduces formal auditability to agentic routing. Topaz replaces silent model assignments with an inherently interpretable router that incorporates three components: (i) skill-based profiling that synthesizes performance across diverse benchmarks into granular capability profiles, (ii) fully traceable routing algorithms that utilize budget-based and multi-objective optimization to produce clear traces of how skill-match scores were weighed against costs, and (iii) developer-facing explanations that translate these traces into natural language, allowing users to audit system logic and iteratively tune the cost-quality tradeoff. By making routing decisions interpretable, Topaz enables users to understand, trust, and meaningfully steer routed agentic systems.
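The budget-constrained, skill-match routing with a recorded trace can be sketched as follows; the scoring rule, capability profiles, and costs are hypothetical stand-ins for illustration, not Topaz's actual algorithm:

```python
def route(subtask_skills, models, budget):
    """Pick the model with the best skill match within budget and
    return the decision plus the full trace for auditing.

    subtask_skills: {skill: required level}
    models: {name: ({skill: capability level}, cost)}
    """
    def skill_match(profile):
        # worst-case margin between capability and requirement
        return min(profile.get(s, 0.0) - need
                   for s, need in subtask_skills.items())

    trace, best = [], None
    for name, (profile, cost) in models.items():
        m = skill_match(profile)
        trace.append((name, m, cost))        # every candidate is recorded
        if cost <= budget and (best is None or (m, -cost) > (best[1], -best[2])):
            best = (name, m, cost)           # prefer match, break ties by cost
    if best is None:
        raise ValueError("no model within budget")
    return best[0], trace
```

The returned trace is what a developer-facing explanation layer could translate into natural language.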
[HC-45] SwEYEpinch: Exploring Intuitive Efficient Text Entry for Extended Reality via Eye and Hand Tracking
【速读】:该论文旨在解决扩展现实(Extended Reality, XR)环境中文本输入效率低、操作费力的问题,尤其是在对比物理键盘或触屏输入时的性能差距。其解决方案的关键在于提出一种结合注视(gaze)与手势(pinch)的混合交互机制:利用注视进行快速滑动选择虚拟键盘上的字符序列(where),同时通过持续保持捏合手势(manual pinch)来明确滑动的时间边界(when)。该方法通过引入低延迟解码器、时空动态时间规整(spatiotemporal Dynamic Time Warping)和注视过滤策略提升准确性,并进一步集成滑动过程中的词预测与手势取消功能,在不牺牲准确率的前提下显著提升每分钟词数(Words Per Minute, WPM),最终在七天共30次会话的长期实验中实现最高64.7 WPM的稳定性能。
链接: https://arxiv.org/abs/2604.03520
作者: Ziheng “Leo” Li,Xichen He,Mengyuan “Millie” Wu,Zeyi Tong,Haowen Wei,Benjamin Yang,Steven Feiner,Paul Sajda
机构: Columbia University (哥伦比亚大学)
类目: Human-Computer Interaction (cs.HC)
备注: 27 pages, 10 figures. To appear in CHI '26: Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems, April 13-17, 2026, Barcelona, Spain. DOI: https://doi.org/10.1145/3772318.3791820
Abstract:Despite steady progress, text entry in Extended Reality (XR) often remains slower and more effortful than typing on a physical keyboard or touchscreen. We explore a simple idea: use gaze to swipe through a virtual keyboard for the fast, low-effort where and a manual pinch held throughout the swipe for the when, extending and validating it through a series of user studies. We first show that a basic version including a low-latency decoder with spatiotemporal Dynamic Time Warping and fixation filtering outperforms selecting individual keys sequentially, either by finger tapping each or gazing at each while pinching. We then add mid-swipe prediction and in-gesture cancellation, improving words per minute (WPM) without hurting accuracy. We show that this approach is faster and more preferred than previous gaze-swipe approaches, finger tapping with prediction, or hand swiping with the same additions. Furthermore, a seven-day, 30-session study demonstrates sustained learning, with peak performance reaching 64.7 WPM.
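The spatiotemporal Dynamic Time Warping mentioned in the abstract aligns a gaze-swipe trace against a word's template path over the keyboard. A minimal 2-D DTW, independent of the paper's actual decoder, can be written as:

```python
import math

def dtw(path_a, path_b):
    """Dynamic Time Warping distance between two 2-D point sequences,
    e.g. a gaze trace vs. a word's key-center template."""
    n, m = len(path_a), len(path_b)
    # D[i][j] = cost of aligning the first i points of a with first j of b
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(path_a[i - 1], path_b[j - 1])
            D[i][j] = d + min(D[i - 1][j],      # stretch template
                              D[i][j - 1],      # stretch trace
                              D[i - 1][j - 1])  # advance both
    return D[n][m]
```

A decoder could rank dictionary words by this distance to the observed trace; the paper additionally filters fixations and weighs temporal dynamics, which this sketch omits.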
[HC-46] Occupational Diversity and Stratification in Platform Work: A Longitudinal Study of Online Freelancers
【速读】:该论文旨在解决当前平台劳动研究中将平台工作者视为同质群体的倾向,忽视了不同职业背景对工作需求、约束和实践的独特影响,从而阻碍了对平台劳动本质的理解及支持差异化职业现实的社会技术系统的发展。其解决方案的关键在于通过纵向分析108名涵盖五个职业类别的在线自由职业者,揭示职业情境如何塑造劳动者对平台管理机制的理解与应对能力,并据此提出“平台职业分层”(platformic occupational stratification)概念,阐明其作用逻辑与四大机制,为计算机支持的协同工作(CSCW)领域提供基于职业敏感性的研究与设计路径,以回应平台劳动中嵌入式的职业能动性问题。
链接: https://arxiv.org/abs/2604.03517
作者: Pyeonghwa Kim,Taylor Lewandowski,Michael Dunn,Steve Sawyer
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: Accepted to CSCW 2026
Abstract:We focus on occupational diversity in platform-mediated work to advance conceptual and empirical insight into the occupationally embedded nature of platform labor. We pursue this focus in response to a prevailing tendency to treat platform workers as a homogeneous group, overlooking the unique demands, constraints, and practices rooted in specific professions. Such generalizations hinder both understanding of platform work and the development of sociotechnical systems that support differentiated occupational realities. To address this gap, we present a longitudinal analysis of 108 online freelancers spanning five occupational categories. We show that occupational context structures workers’ capacity to interpret and navigate platformic management, shaping distinct experiences across four dimensions of platform work: self-presentation, flexibility, skilling, and platform work sustainability. To articulate how digital labor platforms’ managerial control interacts with occupational embeddedness, we introduce the concept of platformic occupational stratification and discuss four mechanisms that explain its logic and implications for platform-mediated work. These insights contribute to CSCW by informing occupation-sensitive research and design approaches that directly engage with the specific opportunities and challenges rooted in workers’ situated occupational agency in platform-mediated work.
[HC-47] The Augmentation Trap: AI Productivity and the Cost of Cognitive Offloading
【速读】:该论文旨在解决人工智能(AI)工具在长期使用中可能引发的技能侵蚀问题,即尽管AI能短期提升员工生产力,但持续依赖可能导致员工专业能力下降,最终削弱整体生产效率。其解决方案的关键在于构建一个动态模型,将AI带来的生产力提升分解为两个渠道:一个与员工技能无关的独立效应,另一个随技能水平增强而放大的依赖效应。通过这一分解,论文识别出五种部署情境,明确区分有益与有害的AI采纳行为,并揭示了管理者短期导向或外部技能价值存在时可能导致“增强陷阱”——即员工长期产出低于未采用AI的状态。此外,当AI对技能依赖较弱时,不同经验水平的员工可能出现永久性技能分化:资深员工发挥潜能,而新手则因过度依赖AI而技能退化至零,凸显管理激励差异对个体路径选择的决定性影响。
链接: https://arxiv.org/abs/2604.03501
作者: Michael Caosun,Sinan Aral
机构: MIT Sloan School of Management (麻省理工学院斯隆管理学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Experimental evidence confirms that AI tools raise worker productivity, but also that sustained use can erode the expertise on which those gains depend. We develop a dynamic model in which a decision-maker chooses AI usage intensity for a worker over time, trading immediate productivity against the erosion of worker skill. We decompose the tool’s productivity effect into two channels, one independent of worker expertise and one that scales with it. The model produces three main results. First, even a decision-maker who fully anticipates skill erosion rationally adopts AI when front-loaded productivity gains outweigh long-run skill costs, producing steady-state loss: the worker ends up less productive than before adoption. Second, when managers are short-termist or worker skill has external value, the decision-maker’s optimal policy turns steady-state loss into the augmentation trap, leaving the worker worse off than if AI had never been adopted. Third, when AI productivity depends less on worker expertise, workers can permanently diverge in skill: experienced workers realize their full potential while less experienced workers deskill to zero. Small differences in managerial incentives can determine which path a worker takes. The productivity decomposition classifies deployments into five regimes that separate beneficial adoption from harmful adoption and identifies which deployments are vulnerable to the trap.
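The two-channel decomposition and the possibility of steady-state loss can be illustrated with a toy simulation; the functional forms and parameter values below are invented for illustration and are not taken from the paper's model:

```python
def simulate(u, s0=1.0, alpha=0.5, beta=0.3, growth=0.02, erosion=0.05, steps=50):
    """Toy skill/productivity dynamics under AI usage intensity u in [0, 1].

    Productivity has a skill-independent AI channel (alpha) and a channel
    that scales with skill (beta * s); practicing without AI builds skill,
    sustained AI use erodes it.  Returns a list of (skill, productivity)."""
    s, history = s0, []
    for _ in range(steps):
        productivity = s + u * (alpha + beta * s)
        history.append((s, productivity))
        s = max(0.0, s + growth * (1 - u) - erosion * u)
    return history
```

Under these made-up parameters, full AI usage front-loads a productivity gain but erodes skill until long-run output falls below the no-AI path, the pattern the paper calls steady-state loss.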
[HC-48] Messages in a Digital Bottle: A Youth-Coauthored Perspective on LLM Chatbots and Adolescent Loneliness
【速读】:该论文旨在解决数字社交环境中青少年孤独感日益加剧的问题,特别是探讨由大语言模型(Large Language Model, LLM)驱动的聊天机器人如何差异化地影响不同亚群体青少年(如伴有焦虑或抑郁、神经多样性及移民青少年)的孤独体验。其解决方案的关键在于以青少年自身视角为核心解释框架,基于社会计算、发展心理学与人机交互(Human-Computer Interaction, HCI)的跨学科文献,识别出聊天机器人在缓解孤独感方面的潜在情境条件及其可能加剧孤独的风险点,并据此提出三个具有人群敏感性的设计启示,从而推动更具包容性和伦理意识的AI辅助社交干预机制的发展。
链接: https://arxiv.org/abs/2604.03470
作者: Jinyao Liu,Di Fu
机构: University of Surrey (萨里大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Adolescent loneliness is a growing concern in digitally mediated social environments. This work-in-progress presents a youth-authored critical synthesis on chatbots powered by Large Language Model (LLM) and adolescent loneliness. The first author is a 16-year-old Chinese student who recently migrated to the UK. She wrote the first draft of this paper from her lived experience, supervised by the second author. Rather than treating the youth perspective as one data point among many, we foreground it as the primary interpretive lens, grounded in interdisciplinary literature from social computing, developmental psychology, and Human-Computer Interaction (HCI). We examine how chatbots shape experiences of loneliness differently across adolescent subgroups, including those with anxiety or depression, neurodivergent youth, and immigrant adolescents, and identify both conditions under which they may temporarily reduce isolation and breakdowns that risk deepening it. We derive three population-sensitive design implications. The next phase of this work will expand the youth authorship model to a panel of adolescents across these subgroups, empirically validating the framework presented here.
[HC-49] Do Robots Need Body Language? Comparing Communication Modalities for Legible Motion Intent in Human-Shared Spaces
【速读】:该论文旨在解决共享空间中高自由度(High-DoF)机器人运动行为难以被人类理解的问题,从而降低人类适应负担。其核心挑战在于如何通过不同信号模态(如表达性运动、灯光、文本和音频)提升人类对机器人即将执行导航动作的预测准确性、信心及信任度。解决方案的关键在于系统评估多种隐式(如表达性运动)与显式(如文字或声音提示)信号策略的效果,特别是探索多模态线索是否协同增强可解释性,以及冲突线索对用户认知和信任的影响,从而为设计更易理解的人机交互机制提供实证依据。
链接: https://arxiv.org/abs/2604.03451
作者: Jonathan Albert Cohen,Kye Shimizu,Allen Song,Vishnu Bharath,Kent Larson,Pattie Maes
机构: 未知
类目: Robotics (cs.RO); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:Robots in shared spaces often move in ways that are difficult for people to interpret, placing the burden on humans to adapt. High-DoF robots exhibit motion that people read as expressive, intentionally or not, making it important to understand how such cues are perceived. We present an online video study evaluating how different signaling modalities, expressive motion, lights, text, and audio, shape people’s ability to understand a quadruped robot’s upcoming navigation actions (Boston Dynamics Spot). Across four common scenarios, we measure how each modality influences humans’ (1) accuracy in predicting the robot’s next navigation action, (2) confidence in that prediction, and (3) trust in the robot to act safely. The study tests how expressive motions compare to explicit channels, whether aligned multimodal cues enhance interpretability, and how conflicting cues affect user confidence and trust. We contribute initial evidence on the relative effectiveness of implicit versus explicit signaling strategies.
[HC-50] ExpressEdit: Fast Editing of Stylized Facial Expressions with Diffusion Models in Photoshop CVPR2026
【速读】:该论文旨在解决当前生成式AI图像编辑模型在角色面部表情 stylized editing(风格化表情编辑)任务中引入全局噪声和像素漂移的问题,这些问题阻碍了这些模型集成到专业图像编辑软件(如Photoshop)的工作流中。解决方案的关键在于提出ExpressEdit——一个完全开源的Photoshop插件,其设计避免了主流商用模型常见的伪影,且能与Photoshop原生功能(如Liquify)无缝协同;同时,为支持叙事驱动的表情多样性生成,研究者构建了一个包含135个表情标签的丰富数据库,结合示例故事和图像实现检索增强生成(retrieval-augmented generation),从而显著提升编辑速度(单张图像3秒内完成)与艺术可控性。
链接: https://arxiv.org/abs/2604.03448
作者: Kenan Tang,Jiasheng Guo,Jeffrey Lin,Yao Qin
机构: University of California, Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Accepted to CVPR 2026 Workshop on Generative AI for Storytelling (AISTORY)
Abstract:Facial expressions of characters are a vital component of visual storytelling. While current AI image editing models hold promise for assisting artists in the task of stylized expression editing, these models introduce global noise and pixel drift into the edited image, preventing the integration of these models into professional image editing software and workflows. To bridge this gap, we introduce ExpressEdit, a fully open-source Photoshop plugin that is free from common artifacts of proprietary image editing models and robustly synergizes with native Photoshop operations such as Liquify. ExpressEdit seamlessly edits an expression within 3 seconds on a single consumer-grade GPU, significantly faster than popular proprietary models. Moreover, to support the generation of diverse expressions according to different narrative needs, we compile a comprehensive expression database of 135 expression tags enriched with example stories and images designed for retrieval-augmented generation. We open source the code and dataset to facilitate future research and artistic exploration.
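The retrieval-augmented selection over the 135-tag expression database could, in outline, rank tags by similarity between the user's narrative and each tag's example story. The bag-of-words scoring below is a toy stand-in for the embedding-based retrieval such a pipeline would actually use:

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    num = sum(a[k] * b[k] for k in a.keys() & b.keys())
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve_expressions(story, database, k=3):
    """Return the k expression tags whose example stories best match
    the narrative; `database` maps tag -> example story text."""
    q = Counter(story.lower().split())
    scored = [(cosine(q, Counter(ex.lower().split())), tag)
              for tag, ex in database.items()]
    return [tag for _, tag in sorted(scored, reverse=True)[:k]]
```

The retrieved tags and examples would then be passed to the editing model as context for the requested expression.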
[HC-51] Can LLMs Reason About Attention? Towards Zero-Shot Analysis of Multimodal Classroom Behavior
【速读】:该论文旨在解决传统学生参与度评估方法中存在的效率低和隐私风险问题,即依赖耗时的手动观察或可能侵犯隐私的视频记录。其解决方案的关键在于构建一个隐私保护型分析流程:利用OpenPose从课堂视频中提取骨骼关键点信息,并通过Gaze-LLE估计视觉注意力;原始视频帧在完成姿态提取后立即删除,仅保留几何坐标(以JSON格式存储),确保符合FERPA法规要求;随后使用QwQ-32B-Reasoning大语言模型(Large Language Model, LLM)对行为数据进行零样本分析,最终通过Web仪表板呈现注意力热力图与行为摘要,实现高效、隐私安全且具备可解释性的学生行为洞察。
链接: https://arxiv.org/abs/2604.03401
作者: Nolan Platt,Sehrish Nizamani,Alp Tural,Elif Tural,Saad Nizamani,Andrew Katz,Yoonje Lee,Nada Basit
机构: Virginia Tech (弗吉尼亚理工大学); University of Virginia (弗吉尼亚大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 2 figures. Preprint
Abstract:Understanding student engagement usually requires time-consuming manual observation or invasive recording that raises privacy concerns. We present a privacy-preserving pipeline that analyzes classroom videos to extract insights about student attention, without storing any identifiable footage. Our system runs on a single GPU, using OpenPose for skeletal extraction and Gaze-LLE for visual attention estimation. Original video frames are deleted immediately after pose extraction, thus only geometric coordinates (stored as JSON) are retained, ensuring compliance with FERPA. The extracted pose and gaze data is processed by QwQ-32B-Reasoning, which performs zero-shot analysis of student behavior across lecture segments. Instructors access results through a web dashboard featuring attention heatmaps and behavioral summaries. Our preliminary findings suggest that LLMs may show promise for multimodal behavior understanding, although they still struggle with spatial reasoning about classroom layouts. We discuss these limitations and outline directions for improving LLM spatial comprehension in educational analytics contexts.
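The privacy-preserving step of the pipeline (retain only geometric coordinates as JSON, delete each frame immediately after pose extraction) can be sketched as follows; `pose_extractor` is a hypothetical wrapper standing in for the OpenPose call, not the authors' actual interface:

```python
import json
import os

def process_frame(frame_path, pose_extractor, out_dir):
    """Extract skeletal keypoints from one classroom frame, persist only
    the geometric coordinates as JSON, then delete the original frame."""
    keypoints = pose_extractor(frame_path)   # e.g. [{"person": 0, "joints": [...]}]
    out_path = os.path.join(out_dir, os.path.basename(frame_path) + ".json")
    with open(out_path, "w") as f:
        json.dump(keypoints, f)
    os.remove(frame_path)                    # no identifiable footage is retained
    return out_path
```

Only the resulting JSON coordinate files would be forwarded to the LLM for zero-shot behavior analysis.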
[HC-52] Toward Full Autonomous Laboratory Instrumentation Control with Large Language Models
【速读】:该论文旨在解决复杂实验室仪器控制对编程技能要求高所导致的研究人员技术门槛问题,从而阻碍实验定制化与自动化。其解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)及其衍生的AI代理(AI agents),通过自然语言交互生成定制化仪器控制脚本,并进一步实现自主操作实验设备及迭代优化控制策略的能力,从而显著降低实验自动化门槛,推动实验室自动化向更普及、高效的方向发展。
链接: https://arxiv.org/abs/2604.03286
作者: Yong Xie,Kexin He,Andres Castellanos-Gomez
机构: 未知
类目: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Human-Computer Interaction (cs.HC)
备注: 16 pages, 5 figures. Accepted manuscript published in Small Structures. Supporting data and code available at this https URL
Abstract:The control of complex laboratory instrumentation often requires significant programming expertise, creating a barrier for researchers lacking computational skills. This work explores the potential of large language models (LLMs), such as ChatGPT, and LLM-based artificial intelligence (AI) agents to enable efficient programming and automation of scientific equipment. Through a case study involving the implementation of a setup that can be used as a single-pixel camera or a scanning photocurrent microscope, we demonstrate how ChatGPT can facilitate the creation of custom scripts for instrumentation control, significantly reducing the technical barrier for experimental customization. Building on this capability, we further illustrate how LLM-assisted tools can be extended into autonomous AI agents capable of independently operating laboratory instruments and iteratively refining control strategies. This approach underscores the transformative role of LLM-based tools and AI agents in democratizing laboratory automation and accelerating scientific progress.
[HC-53] VIGIL: An Extensible System for Real-Time Detection and Mitigation of Cognitive Bias Triggers
【速读】:该论文旨在解决生成式 AI(Generative AI)在在线信息传播中引发的认知偏差诱导问题,即通过利用人类认知偏见和认知局限进行说服或操纵,从而对公共话语造成潜在但隐蔽的危害。现有工具主要聚焦于事实核查与信息源可信度评估,却未能有效识别和缓解此类基于认知机制的误导性内容。解决方案的关键在于提出 VIGIL(VIrtual GuardIan angeL),这是一个实时浏览器扩展程序,具备三项核心技术:一是滚动同步的实时认知偏差触发点检测;二是基于大语言模型(LLM)的可逆改写功能,以降低误导性表述的影响;三是支持从完全离线到云端的隐私分级推理架构。此外,VIGIL 具备良好的可扩展性,已集成经自然语言处理(NLP)基准严格验证的第三方插件,且开源发布以促进进一步研究与应用。
链接: https://arxiv.org/abs/2604.03261
作者: Bo Kang,Sander Noels,Tijl De Bie
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:The rise of generative AI is posing increasing risks to online information integrity and civic discourse. Most concretely, such risks can materialise in the form of mis- and disinformation. As a mitigation, media-literacy and transparency tools have been developed to address factuality of information and the reliability and ideological leaning of information sources. However, a subtler but possibly no less harmful threat to civic discourse is the use of persuasion or manipulation by exploiting human cognitive biases and related cognitive limitations. To the best of our knowledge, no tools exist to directly detect and mitigate the presence of triggers of such cognitive biases in online information. We present VIGIL (VIrtual GuardIan angeL), the first browser extension for real-time cognitive bias trigger detection and mitigation, providing in-situ scroll-synced detection, LLM-powered reformulation with full reversibility, and privacy-tiered inference from fully offline to cloud. VIGIL is built to be extensible with third-party plugins; several plugins that have been rigorously validated against NLP benchmarks are already included. It is open-sourced at this https URL.
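"LLM-powered reformulation with full reversibility" implies keeping each original span so an edit can be undone. A minimal sketch of such a reversible rewriter (hypothetical interface, not VIGIL's actual API):

```python
class ReversibleReformulator:
    """Replace detected bias-trigger spans in page text while recording
    the original wording so every edit can be reverted."""

    def __init__(self):
        self.history = []          # stack of (start, end, original, replacement)

    def apply(self, text, span, replacement):
        start, end = span
        self.history.append((start, end, text[start:end], replacement))
        return text[:start] + replacement + text[end:]

    def revert_last(self, text):
        start, end, original, replacement = self.history.pop()
        return text[:start] + original + text[start + len(replacement):]
```

In the real extension the replacement text would come from an LLM (offline or cloud, per the privacy tier) and detection from a plugin; this sketch only shows the reversibility bookkeeping.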
[HC-54] State of the Art Report for Smart Habitat for Older Persons – Working Group 3 – Healthcare ALT
【速读】:该论文旨在解决老年人居家健康与智能老龄化(Smart and Healthy Ageing at Home)中跨领域协同不足的问题,尤其聚焦于家居环境、信息通信技术(Information and Communication Technologies, ICT)与医疗保健三大领域的现状与整合挑战。其解决方案的关键在于通过COST Action CA16226“Sheld-on”网络下三个工作组的独立研究,系统梳理各领域在智能家具与居住环境、ICT应用及医疗健康服务方面的最新进展、产业实践与成功案例,并将成果汇编为可互操作的三部分“状态报告”,为后续第四工作组——即跨领域协同制定居家健康老龄化综合解决方案——提供科学依据与实践基础,从而推动多学科融合以实现老年人在家庭、社区和工作场所中的高质量生活。
链接: https://arxiv.org/abs/2604.03255
作者: Birgitta Langhammer,Oscar Martinez Mozos,Ana Mendes,Joana Madureira,Lina Seduikyte,Martin Weigl,Heidi Salonen,Veronika Kotradyova,Ondrej Krejcar,Sarmite Mikulioniene,Willeke van Staalduinen,Carina Dantas,Petra Maresova,Barakovic Sabina,Barakovic Husic Jasmina,Jonathan Gomez-Raja
机构: 未知
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: COST Action CA16226 Indoor living space improvement: Smart Habitat for the Elderly. Sheld-on Furniture, Habitat, Active and Healthy Ageing, ICT, Healthcare Report on the State of the Art of the Working Group 1: Furniture and Habitat Industries, Working Group 2: ITC Developments, and Working Group 3: Healthcare
Abstract:This document reports the State of the Art of science and practice on three topics related to smart and healthy ageing at home: furniture and habitats, Information and Communication Technologies (ICT), and healthcare. The reports were prepared by the working groups of COST Action CA16226, Sheld-on. Sheld-on is a network of researchers, user representatives, industry members, and other stakeholders. The three domains covered in this report were the areas of interest for three working groups from the COST Action. The aim of each working group was to assess the State of the Art for disciplinary understanding, identification of advances in smart furniture and habitat, products, industries and success stories. The findings on these topics of all working groups are compiled here. Due to the different backgrounds of the members of each of the working groups, the document is divided in three separate parts that can be considered as separate State of the Art reports. The goal of this document is to be used as input in the fourth working group of Sheld-on COST Action: Solutions for Ageing Well at Home, in the Community, and at Work, where experts from the three different domains converge to a single working group in order to achieve the action objectives.
[HC-55] BLK-Assist: A Methodological Framework for Artist-Led Co-Creation with Generative AI Models
【速读】:该论文旨在解决如何在保护艺术家隐私和数据所有权的前提下,实现针对特定艺术家风格的扩散模型微调问题。传统方法往往依赖于大规模公开数据集或未经许可的数据使用,难以满足艺术创作中对版权与风格忠实性的要求。解决方案的关键在于提出一个模块化框架BLK-Assist,其核心是采用参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)技术,整合三个子系统:BLK-Conceptor(基于LoRA的语义草图生成)、BLK-Stencil(基于LayerDiffuse的透明度保持资产生成)以及BLK-Upscale(结合Real-ESRGAN与纹理条件扩散的高分辨率增强),从而在不暴露原始数据的情况下,实现高质量、风格一致的艺术生成,并支持可复现的、基于授权的艺术家-AI协同创作流程。
链接: https://arxiv.org/abs/2604.03249
作者: Daniel Grimes,Rachel M. Harrison
机构: Ophiuchus LLC (Ophiuchus LLC); Black Forest Labs (Black Forest Labs)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:
Abstract:This paper presents BLK-Assist, a modular framework for artist-specific fine-tuning of diffusion models using parameter-efficient methods. The system is implemented as a case study with a single professional artist’s proprietary corpus and consists of three components: BLK-Conceptor (LoRA-adapted conceptual sketch generation), BLK-Stencil (LayerDiffuse-based transparency-preserving asset generation), and BLK-Upscale (hybrid Real-ESRGAN and texture-conditioned diffusion for high-resolution outputs). We document dataset composition, preprocessing, training configurations, and inference workflows to enable reproducibility with publicly available models to illustrate a privacy-preserving, consent-based approach to human-AI co-creation that maintains stylistic fidelity to the source corpus and can be adapted for other artists under similar constraints.
[HC-56] Incidental Interaction: Technology to Support Elder Strength Training through Everyday Movements
【速读】:该论文旨在解决老年人对正式锻炼计划依从性低的问题,从而促进健康老龄化。现有技术多依赖专用设备或明确的运动任务,难以融入日常生活。其解决方案的关键在于提出“偶然交互”(Incidental Interaction)新范式,将日常行为如坐、站、举物等转化为有意识的力量训练机会,通过重复动作(“做两次”)并结合运动质量指标提供实时反馈与进度追踪,无需用户改变习惯或使用额外设备,即可提升功能性体能。
链接: https://arxiv.org/abs/2604.03241
作者: Arturo Vazquez Galvez,Christopher Tacca,Isobel Margaret Thompson,Alexander Dawid Bincalar,Christoph Tremmel,Martin Warner,Richard Gomer,Alexander Ng,Chris Freeman,m.c. Schraefel
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Strength training is a key determinant of healthy aging, yet adherence to formal exercise programs among older adults remains low. While many technologies aim to encourage physical activity in older adults, they typically rely on dedicated devices, wearables, or explicit exercise tasks. They therefore do not embed task practice into daily life. Our new approach, termed Incidental Interaction, instead transforms everyday actions into opportunities for deliberate strength building. It thereby operationalizes everyday movements such as sitting, standing, or lifting objects as strength exercises, encouraging participants to repeat them to build functional capacity. This repetition is encapsulated in the phrase “do it twice”, and is combined with movement quality metrics to provide feedback and support progression, without requiring users to adopt new routines or equipment. We illustrate the concept by designing and implementing an ecosystem of instrumented everyday objects and pressure-sensitive mats embedded into ordinary furniture, providing real-time feedback, progress tracking, and motivational cues. To evaluate technical efficacy, we report on two structured pilot deployments with elders (2 week and 4 week studies, n=7).
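摘要中压力垫与“做两次”提示所依赖的重复动作计数,可以用如下假设性草图说明(阈值与信号形态均为演示假设,并非论文的实际算法):

```python
# 假设性草图:从压力垫读数中统计坐-站循环次数,
# 用于支持"做两次"(do it twice)提示。阈值 50 仅为演示取值。

def count_repetitions(pressure, threshold=50):
    """pressure: 压力读数序列(坐下时读数高,起立后读数低)。
    每完成一次"起立后再坐下"的循环计为一次重复。"""
    reps = 0
    seated = pressure[0] > threshold
    for v in pressure[1:]:
        if seated and v <= threshold:
            seated = False          # 起立:压力降到阈值以下
        elif not seated and v > threshold:
            seated = True           # 坐下:压力回升,完成一次循环
            reps += 1
    return reps
```

真实系统还会结合运动质量指标给出实时反馈,这里仅示意计数逻辑。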
[HC-57] Measuring Human Preferences in RLHF is a Social Science Problem
【速读】:该论文试图解决基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)中一个核心假设的有效性问题,即标注响应是否真实反映人类偏好。作者指出,行为科学文献表明,人们常在无真实观点的情况下作出回应、基于情境线索临时构建偏好,或对相同问题产生不同理解,而这些现象在价值导向的判断中尤为普遍,却未被纳入机器学习实践。解决方案的关键在于将 RLHF 中的人类偏好测量视为社会科学研究问题,并提出一个分类框架,区分真实偏好(genuine preferences)、非态度(non-attitudes)、建构偏好(constructed preferences)及测量伪影(measurement artifacts),同时提供诊断工具以识别并排除无效信号,从而在训练前确保所采集数据的有效性。
链接: https://arxiv.org/abs/2604.03238
作者: Bijean Ghafouri,Eun Cheol Choi,Priyanka Dey,Emilio Ferrara
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:RLHF assumes that annotation responses reflect genuine human preferences. We argue this assumption warrants systematic examination, and that behavioral science offers frameworks that bring clarity to when it holds and when it breaks down. Behavioral scientists have documented for sixty years that people routinely produce responses without holding genuine opinions, construct preferences on the spot based on contextual cues, and interpret identical questions differently. These phenomena are pervasive for precisely the value-laden judgments that matter most for alignment, yet this literature has not yet been systematically integrated into ML practice. We argue that the ML community must treat measurement validity as logically prior to preference aggregation. Specifically, we contend that measuring human preferences in RLHF is a social science problem. We present a taxonomy distinguishing genuine preferences from non-attitudes, constructed preferences, and measurement artifacts, along with diagnostic approaches for detecting each. This framework has two important implications. First, it raises the question of whether current RLHF practice may be systematically modeling noise as signal and elicitation artifacts as human values. Second, it provides a path forward by suggesting diagnostic tools that can distinguish valid preferences from artifacts before they enter the training pipeline.
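摘要提出用诊断工具区分真实偏好与“非态度”。行为科学中一个常见的诊断是重测一致性:同一标注者对同一偏好题的两轮回答若不稳定,该信号更可能是噪声而非态度。下面是该思路的假设性 Python 草图(函数名与阈值均为演示假设,非论文原始方法):

```python
# 假设性草图:用重测一致性(test-retest agreement)标记可能的"非态度"标注者。

def retest_agreement(first_pass, second_pass):
    """两轮标注(如 'A'/'B' 选择列表)的逐题一致率。"""
    assert len(first_pass) == len(second_pass)
    same = sum(a == b for a, b in zip(first_pass, second_pass))
    return same / len(first_pass)

def flag_non_attitudes(annotations, threshold=0.8):
    """annotations: {标注者 id: (第一轮回答, 第二轮回答)}。
    一致率低于阈值的标注者被标记,其数据在进入训练流水线前需复核。"""
    return [aid for aid, (r1, r2) in annotations.items()
            if retest_agreement(r1, r2) < threshold]
```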
[HC-58] The Persuasion Paradox: When LLM Explanations Fail to Improve Human-AI Team Performance
【速读】:该论文旨在解决生成式 AI (Generative AI) 中自然语言解释对人机协作任务性能影响不明确的问题,尤其是解释是否能提升人类用户在与AI协同决策中的准确性与错误恢复能力。研究发现,尽管流畅的解释会显著增强用户信心和依赖度(即“说服悖论”),但其对任务准确性的提升并不稳定,甚至可能削弱用户从AI错误中恢复的能力。解决方案的关键在于:摒弃将解释视为通用优化手段的传统思路,转而设计以校准依赖关系(calibrated reliance)和有效错误恢复为核心的交互机制,例如通过暴露模型不确定性概率或采用选择性自动化策略来替代单纯依赖解释,从而实现更可靠的人机协作表现。
链接: https://arxiv.org/abs/2604.03237
作者: Ruth Cohen,Lu Feng,Ayala Bloch,Sarit Kraus
机构: Bar-Ilan University (巴伊兰大学); University of Virginia (弗吉尼亚大学); Ariel University (阿里尔大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:While natural-language explanations from large language models (LLMs) are widely adopted to improve transparency and trust, their impact on objective human-AI team performance remains poorly understood. We identify a Persuasion Paradox: fluent explanations systematically increase user confidence and reliance on AI without reliably improving, and in some cases undermining, task accuracy. Across three controlled human-subject studies spanning abstract visual reasoning (RAVEN matrices) and deductive logical reasoning (LSAT problems), we disentangle the effects of AI predictions and explanations using a multi-stage reveal design and between-subjects comparisons. In visual reasoning, LLM explanations increase confidence but do not improve accuracy beyond the AI prediction alone, and substantially suppress users’ ability to recover from model errors. Interfaces exposing model uncertainty via predicted probabilities, as well as a selective automation policy that defers uncertain cases to humans, achieve significantly higher accuracy and error recovery than explanation-based interfaces. In contrast, for language-based logical reasoning tasks, LLM explanations yield the highest accuracy and recovery rates, outperforming both expert-written explanations and probability-based support. This divergence reveals that the effectiveness of narrative explanations is strongly task-dependent and mediated by cognitive modality. Our findings demonstrate that commonly used subjective metrics such as trust, confidence, and perceived clarity are poor predictors of human-AI team performance. Rather than treating explanations as a universal solution, we argue for a shift toward interaction designs that prioritize calibrated reliance and effective error recovery over persuasive fluency. 
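摘要中的“选择性自动化”策略(将低置信度案例转交人工,而非展示解释)可以用如下假设性草图说明,阈值与数据结构均为演示假设:

```python
# 假设性草图:按模型预测概率进行选择性自动化——
# 置信度达标的案例自动采纳模型预测,其余转交人工判断。

def selective_automation(cases, threshold=0.8):
    """cases: [(案例 id, 模型预测, 预测概率)] 列表。
    返回 (自动采纳列表, 转人工案例 id 列表)。"""
    automated, deferred = [], []
    for case_id, prediction, prob in cases:
        if prob >= threshold:
            automated.append((case_id, prediction))
        else:
            deferred.append(case_id)
    return automated, deferred
```

这一策略不依赖解释文本,因而避免了摘要所述“流畅解释抬高依赖却不提升准确率”的问题。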
[HC-59] BLADE: Better Language Answers through Dialogue and Explanations
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)驱动的教育助教在教学中直接提供答案导致学习过程被短路的问题,如削弱学生的探索、自我解释和对课程材料的参与。其解决方案的关键在于提出BLADE(Better Language Answers through Dialogue and Explanations),一个基于检索增强生成(Retrieval-Augmented Generation, RAG)框架的对话式助教系统,通过动态调用课程内容中的教学相关片段来引导学生接触原始资源,而非直接给出答案,从而促进概念理解与主动学习。
链接: https://arxiv.org/abs/2604.03236
作者: Chathuri Jayaweera,Bonnie J. Dorr
机构: University of Florida (佛罗里达大学)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注: Contains 9 figures
Abstract:Large language model (LLM)-based educational assistants often provide direct answers that short-circuit learning by reducing exploration, self-explanation, and engagement with course materials. We present BLADE (Better Language Answers through Dialogue and Explanations), a grounded conversational assistant that guides learners to relevant instructional resources rather than supplying immediate solutions. BLADE uses a retrieval-augmented generation (RAG) framework over curated course content, dynamically surfacing pedagogically relevant excerpts in response to student queries. Instead of delivering final answers, BLADE prompts direct engagement with source materials to support conceptual understanding. We conduct an impact study in an undergraduate computer science course, with different course resource configurations and show that BLADE improves students’ navigation of course resources and conceptual performance compared to simply providing the full inventory of course resources. These results demonstrate the potential of grounded conversational AI to reinforce active learning and evidence-based reasoning.
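摘要所述“检索课程片段引导学生,而非直接作答”的流程,可用如下假设性草图说明。真实系统基于 RAG 框架做稠密检索,这里以词重叠打分代替,仅作演示:

```python
# 假设性草图:BLADE 式"引导而非作答"的检索流程玩具版。
# 对学生提问检索最相关的课程资源并返回出处,不生成最终答案。

def tokenize(text):
    return set(text.lower().split())

def retrieve_excerpts(query, corpus, top_k=2):
    """corpus: [(资源名, 片段文本)] 列表。
    按与 query 的词重叠数降序排序,返回前 top_k 个资源名。"""
    q = tokenize(query)
    scored = sorted(corpus, key=lambda item: len(q & tokenize(item[1])), reverse=True)
    return [name for name, _ in scored[:top_k]]
```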
[HC-60] Toward a Universal Color Naming System: A Clustering-Based Approach using Multisource Data
【速读】:该论文旨在解决跨行业色彩命名缺乏统一标准的问题,这一问题导致设计、技术与通信领域中颜色标识混乱,且现有系统存在大量感知上难以区分的重叠色阶,违背人类实际可辨识的颜色类别数量。解决方案的关键在于构建一个基于聚类的多源数据框架:首先收集来自20个不同来源的超过19,555组RGB颜色值及其名称,经清洗和归一化后转换至感知均匀的CIELAB色彩空间,并采用CIEDE2000颜色差异度量进行K-means聚类,最终识别出280个最优簇;每个簇通过名称频率分析确定代表性标签,从而形成符合自然语言模式的标准化色彩命名体系。该方法为生成式AI、视觉搜索和设计系统等应用场景提供了可感知基础的统一颜色标注方案。
链接: https://arxiv.org/abs/2604.03235
作者: Aruzhan Sabitkyzy,Maksat Shagyrov,Pakizar Shamoi
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to Wiley for consideration
Abstract:Is it coral, salmon, or peach? What seems like a simple color can have many names, and without a standard, these variations create confusion across design, technology, and communication. Color naming is a fundamental task across industries such as fashion, cosmetics, web design, and visualization tools. However, the lack of universally accepted color naming standards leads to inconsistent color standards across platforms, applications, and industries. Moreover, these systems include hundreds or thousands of overlapping, perceptually indistinct shades, despite the fact that humans typically distinguish only a limited number of unique color categories in practice. In this study, we propose a clustering-based multisource data framework to build a standardized color-naming system. We collected a dataset of over 19,555 RGB values paired with color names from 20 diverse sources. After data cleaning and normalization, we converted the colors to the perceptually uniform CIELAB color space and applied K-means clustering using the CIEDE2000 color difference metric, identifying 280 optimal clusters. For each cluster, we performed a frequency analysis of the associated names to assign representative labels. The resulting system reflects naturally occurring linguistic patterns. We demonstrate its effectiveness in automatic annotation and content-based image retrieval on a clothing dataset. This approach opens new opportunities for standardized, perceptually grounded color labeling in practical applications such as generative AI, visual search, and design systems.
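论文的“聚类后按名称频次取代表标签”流程可用如下玩具版草图说明。真实流程在 CIELAB 空间用 CIEDE2000 度量做 K-means 并取 280 个簇;这里为保持自包含,直接在 RGB 空间用欧氏距离和固定迭代次数演示,初始中心的选取也做了简化:

```python
# 假设性草图:聚类 + 名称频次分析的最小流程(纯标准库,仅示意机制)。
from collections import Counter

def kmeans(points, k, iters=10):
    """极简 k-means:取前 k 个点作初始中心(演示用,非 k-means++)。"""
    centers = points[:k]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[i].append(p)
        # 空簇保留原中心,非空簇取均值
        centers = [tuple(sum(d) / len(g) for d in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups

def representative_name(cluster_points, names):
    """names: {RGB 点: 名称};取簇内出现频次最高的名称作为代表标签。"""
    return Counter(names[p] for p in cluster_points).most_common(1)[0][0]
```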
[HC-61] Copilot-Assisted Second-Thought Framework for Brain-to-Robot Hand Motion Decoding
【速读】:该论文旨在解决从脑电图(EEG)中高精度预测运动学参数(如手部抓握与提举任务中的轨迹)的问题,以提升运动相关脑-机接口(BCI)的性能。其核心解决方案是提出一种CNN-注意力混合模型,结合卷积神经网络(CNN)对局部特征的提取能力与Transformer架构中注意力机制对长序列依赖关系的建模优势,从而有效解码EEG信号中的手部运动信息;进一步通过引入EEG-肌电图(EMG)多模态融合策略显著提升解码准确性,并设计了一个基于有限状态机的“副驾驶”(copilot)后处理框架,利用运动状态感知的评判器过滤低置信度的解码点,在保留超过80%数据的前提下将纯EEG解码的皮尔逊相关系数(PCC)从0.89提升至0.93,显著增强了轨迹保真度。
链接: https://arxiv.org/abs/2603.27492
作者: Yizhe Li(1),Shixiao Wang(1),Jian K. Liu(1) ((1) University of Birmingham, Birmingham, United Kingdom)
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Motor kinematics prediction (MKP) from electroencephalography (EEG) is an important research area for developing movement-related brain-computer interfaces (BCIs). While traditional methods often rely on convolutional neural networks (CNNs) or recurrent neural networks (RNNs), Transformer-based models have shown strong ability in modeling long sequential EEG data. In this study, we propose a CNN-attention hybrid model for decoding hand kinematics from EEG during grasp-and-lift tasks, achieving strong performance in within-subject experiments. We further extend this approach to EEG-EMG multimodal decoding, which yields substantially improved results. Within-subject tests achieve PCC values of 0.9854, 0.9946, and 0.9065 for the X, Y, and Z axes, respectively, computed on the midpoint trajectory between the thumb and index finger, while cross-subject tests result in 0.9643, 0.9795, and 0.5852. The decoded trajectories from both modalities are then used to control a Franka Panda robotic arm in a MuJoCo simulation. To enhance trajectory fidelity, we introduce a copilot framework that filters low-confidence decoded points using a motion-state-aware critic within a finite-state machine. This post-processing step improves the overall within-subject PCC of EEG-only decoding to 0.93 while excluding fewer than 20% of the data points.
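摘要中“副驾驶”后处理的核心是按评判器置信度过滤解码点(保留至少 80% 数据),再以皮尔逊相关系数(PCC)评估轨迹保真度。下面是该思路的假设性简化草图(有限状态机评判器以外部给定的置信度序列代替):

```python
# 假设性草图:置信度过滤 + PCC 评估(纯标准库,仅示意后处理机制)。
from math import sqrt

def pearson(x, y):
    """皮尔逊相关系数(PCC)。"""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def filter_by_confidence(decoded, truth, conf, keep_ratio=0.8):
    """保留置信度最高的 keep_ratio 比例的解码点,按时间顺序返回
    过滤后的 (decoded, truth)。"""
    k = max(1, int(len(decoded) * keep_ratio))
    order = sorted(range(len(decoded)), key=lambda i: conf[i], reverse=True)[:k]
    order.sort()
    return [decoded[i] for i in order], [truth[i] for i in order]
```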
[HC-62] From Paper to Program: A Multi-Stage LLM-Assisted Workflow for Accelerating Quantum Many-Body Algorithm Development
【速读】:该论文旨在解决将量子多体理论转化为可扩展软件的传统方法耗时长达数月的问题,以及大语言模型(Large Language Models, LLMs)在零样本生成张量网络算法时因空间推理错误和内存瓶颈导致频繁失败的局限性。其解决方案的关键在于设计了一个多阶段工作流,模拟物理研究团队的协作模式:首先生成一个数学严谨的LaTeX规范作为中间蓝图,从而约束编码LLM仅输出精确、无矩阵存储的O(D3)复杂度操作;在此基础上成功构建了密度矩阵重整化群(Density-Matrix Renormalization Group, DMRG)引擎,准确刻画了自旋-1/2海森堡模型的临界纠缠标度和自旋-1 AKLT模型的对称保护拓扑(Symmetry-Protected Topological, SPT)序,且在16种主流基础模型组合测试中实现100%成功率,将原本需数月的研发周期压缩至24小时内(约14小时有效工作时间)。
链接: https://arxiv.org/abs/2604.04089
作者: Yi Zhou
机构: Institute of Physics, Chinese Academy of Sciences (中国科学院物理研究所)
类目: Computational Physics (physics.comp-ph); Strongly Correlated Electrons (cond-mat.str-el); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Translating quantum many-body theory into scalable software traditionally requires months of effort. Zero-shot generation of tensor network algorithms by Large Language Models (LLMs) frequently fails due to spatial reasoning errors and memory bottlenecks. We resolve this using a multi-stage workflow that mimics a physics research group. By generating a mathematically rigorous LaTeX specification as an intermediate blueprint, we constrain the coding LLM to produce exact, matrix-free O(D^3) operations. We validate this approach by generating a Density-Matrix Renormalization Group (DMRG) engine that accurately captures the critical entanglement scaling of the Spin-1/2 Heisenberg model and the symmetry-protected topological (SPT) order of the Spin-1 AKLT model. Testing across 16 combinations of leading foundation models yielded a 100% success rate. By compressing a months-long development cycle into under 24 hours (~14 active hours), this framework offers a highly reproducible paradigm for accelerating computational physics research.
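摘要中“矩阵自由 O(D^3) 操作”的含义可以用如下示意说明(基于 DMRG 的一般知识整理,并非该论文的原始公式):有效哈密顿量作用在试探态上时不显式构造矩阵,而是依次收缩左环境、算符与右环境张量,每一步对键维数 D 的代价都不超过 O(D^3)。

```latex
% 示意:有效哈密顿量的矩阵自由作用,D 为键维数。
% 显式存储 H_eff 的规模随 D^4 增长;按下式依次收缩则无需构造矩阵:
H_{\mathrm{eff}}\,|\psi\rangle
  \;=\; L \cdot \bigl( W \cdot ( R \cdot |\psi\rangle ) \bigr),
\qquad \text{cost} \sim O(D^3)\ \text{per contraction}.
```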
计算机视觉
[CV-0] Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision CVPR2026
【速读】:该论文旨在解决传统两阶段流水线在虚拟试衣(image-based virtual try-on)与姿态驱动动画(pose-driven animation)中存在的人体身份漂移(identity drift)、服装形变(garment distortion)及前后视角不一致(front-back inconsistency)等问题。其解决方案的关键在于提出一个统一框架Vanast,通过端到端的单步生成方式实现服装转移的人体动画视频合成,并构建大规模三元组监督数据(triplet supervision),包括生成保持身份一致性的换装图像、捕捉完整上下装三元组以突破单一服装姿态视频对的限制,以及无需依赖服装目录图像即可组装多样野外场景三元组;同时引入双模块(Dual Module)架构优化视频扩散变换器(video diffusion transformers)的训练稳定性与生成质量,在保留预训练生成能力的基础上显著提升服装准确性、姿态贴合度和身份一致性,支持零样本服装插值(zero-shot garment interpolation)。
链接: https://arxiv.org/abs/2604.04934
作者: Hyunsoo Cha,Wonjung Woo,Byungjun Kim,Hanbyul Joo
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026, Project Page: this https URL
Abstract:We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often results in identity drift, garment distortion, and front-back inconsistency. Our model addresses these issues by performing the entire process in a single unified step to achieve coherent synthesis. To enable this setting, we construct large-scale triplet supervision. Our data generation pipeline includes generating identity-preserving human images in alternative outfits that differ from garment catalog images, capturing full upper and lower garment triplets to overcome the single-garment-posed video pair limitation, and assembling diverse in-the-wild triplets without requiring garment catalog images. We further introduce a Dual Module architecture for video diffusion transformers to stabilize training, preserve pretrained generative quality, and improve garment accuracy, pose adherence, and identity preservation while supporting zero-shot garment interpolation. Together, these contributions allow Vanast to produce high-fidelity, identity-consistent animation across a wide range of garment types.
[CV-1] PointTPA: Dynamic Network Parameter Adaptation for 3D Scene Understanding CVPR2026
【速读】:该论文旨在解决场景级点云理解中因几何多样性、类别分布不均衡及空间布局差异大而导致的性能瓶颈问题,尤其针对现有方法在推理阶段依赖静态网络参数、难以适应动态场景数据的局限性。其解决方案的关键在于提出PointTPA(Test-time Parameter Adaptation)框架,通过引入基于序列化邻域分组(Serialization-based Neighborhood Grouping, SNG)构建局部一致的点云块,并设计动态参数投影器(Dynamic Parameter Projector, DPP)生成与输入相关的自适应权重,使骨干网络能够在测试时根据场景特征动态调整行为,同时保持极低的参数开销(仅占主干参数的2%以下)。该机制显著提升了3D场景理解的泛化能力,在ScanNet验证集上达到78.4%的mIoU,优于多种参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法。
链接: https://arxiv.org/abs/2604.04933
作者: Siyuan Liu,Chaoqun Zheng,Xin Zhou,Tianrui Feng,Dingkang Liang,Xiang Bai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026. The code is available at this https URL
Abstract:Scene-level point cloud understanding remains challenging due to diverse geometries, imbalanced category distributions, and highly varied spatial layouts. Existing methods improve object-level performance but rely on static network parameters during inference, limiting their adaptability to dynamic scene data. We propose PointTPA, a Test-time Parameter Adaptation framework that generates input-aware network parameters for scene-level point clouds. PointTPA adopts a Serialization-based Neighborhood Grouping (SNG) to form locally coherent patches and a Dynamic Parameter Projector (DPP) to produce patch-wise adaptive weights, enabling the backbone to adjust its behavior according to scene-specific variations while maintaining a low parameter overhead. Integrated into the PTv3 structure, PointTPA demonstrates strong parameter efficiency by introducing two lightweight modules of less than 2% of the backbone’s parameters. Despite this minimal parameter overhead, PointTPA achieves 78.4% mIoU on ScanNet validation, surpassing existing parameter-efficient fine-tuning (PEFT) methods across multiple benchmarks, highlighting the efficacy of our test-time dynamic network parameter adaptation mechanism in enhancing 3D scene understanding. The code is available at this https URL.
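摘要中动态参数投影器(DPP)“由输入生成逐 patch 自适应权重”的机制,可用如下玩具版草图说明。真实实现基于 PTv3 骨干与学习得到的投影层,这里的标量权重与线性调制仅为演示假设:

```python
# 假设性草图:输入相关的逐 patch 权重调制(仅示意机制,非 PointTPA 实现)。

def patch_mean(patch):
    """patch: 点特征列表(每点为等长元组);返回逐维均值作为 patch 特征。"""
    dims = len(patch[0])
    return [sum(p[d] for p in patch) / len(patch) for d in range(dims)]

def dynamic_weights(base_weight, patch, proj):
    """base_weight: 骨干的静态标量权重;proj: 投影向量。
    返回该 patch 的输入相关自适应权重:静态权重乘以线性调制项。"""
    feat = patch_mean(patch)
    scale = 1.0 + sum(f * w for f, w in zip(feat, proj))
    return base_weight * scale
```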
[CV-2] LoMa: Local Feature Matching Revisited
【速读】:该论文旨在解决局部特征匹配(Local Feature Matching)在数据驱动方法中进展缓慢的问题,尤其是在大规模、多样化数据集上的训练不足导致性能提升受限。其关键解决方案是提出名为LoMa的新型框架,通过融合大规模且多样化的数据混合、现代训练策略、模型容量与计算资源的扩展,显著提升了特征匹配性能。此外,为突破现有基准测试(如WxBS、InLoc等)因图像对过于简单而导致的性能饱和问题,作者构建了包含1000个高难度图像对的新数据集HardMatch,并基于人工标注获取真实对应关系,从而更全面地评估模型能力。实验表明,LoMa在多个基准上均大幅超越当前最优方法ALIKED+LightGlue,验证了其有效性。
链接: https://arxiv.org/abs/2604.04931
作者: David Nordström,Johan Edstedt,Georg Bökman,Jonathan Astermark,Anders Heyden,Viktor Larsson,Mårten Wadenbäck,Michael Felsberg,Fredrik Kahl
机构: Chalmers University of Technology (查尔姆斯理工大学); Lund University (隆德大学); KTH Royal Institute of Technology (皇家理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Local feature matching has long been a fundamental component of 3D vision systems such as Structure-from-Motion (SfM), yet progress has lagged behind the rapid advances of modern data-driven approaches. The newer approaches, such as feed-forward reconstruction models, have benefited extensively from scaling dataset sizes, whereas local feature matching models are still only trained on a few mid-sized datasets. In this paper, we revisit local feature matching from a data-driven perspective. In our approach, which we call LoMa, we combine large and diverse data mixtures, modern training recipes, scaled model capacity, and scaled compute, resulting in remarkable gains in performance. Since current standard benchmarks mainly rely on collecting sparse views from successful 3D reconstructions, the evaluation of progress in feature matching has been limited to relatively easy image pairs. To address the resulting saturation of benchmarks, we collect 1000 highly challenging image pairs from internet data into a new dataset called HardMatch. Ground truth correspondences for HardMatch are obtained via manual annotation by the authors. In our extensive benchmarking suite, we find that LoMa makes outstanding progress across the board, outperforming the state-of-the-art method ALIKED+LightGlue by +18.6 mAA on HardMatch, +29.5 mAA on WxBS, +21.4 (1m, 10°) on InLoc, +24.2 AUC on RUBIK, and +12.4 mAA on IMC 2022. We release our code and models publicly at this https URL.
[CV-3] Rethinking Model Efficiency: Multi-Agent Inference with Large Models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在推理过程中因输出 token 数量差异导致的端到端延迟不一致问题,尤其是大模型虽性能优越但可能因长序列生成而效率低下,而小模型虽响应快却难以达到同等性能的问题。其解决方案的关键在于提出一种多智能体推理框架,通过保留大模型对短响应的高效性,并在必要时复用小模型中的关键推理 token 来增强大模型的推理能力,从而在保持低延迟的同时逼近大模型独立推理的性能表现。
链接: https://arxiv.org/abs/2604.04929
作者: Sixun Dong,Juhua Hu,Steven Li,Wei Wen,Qi Qian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Most vision-language models (VLMs) apply a large language model (LLM) as the decoder, where the response tokens are generated sequentially through autoregression. Therefore, the number of output tokens can be the bottleneck of the end-to-end latency. However, different models may require vastly different numbers of output tokens to achieve comparable performance. In this work, we conduct a comprehensive analysis of the latency across different components of VLMs on simulated data. The experiment shows that a large model with fewer output tokens can be more efficient than a small model with a long output sequence. The empirical study on diverse real-world benchmarks confirms the observation that a large model can achieve better or comparable performance as a small model with significantly fewer output tokens. To leverage the efficiency of large models, we propose a multi-agent inference framework that keeps large models with short responses but transfers the key reasoning tokens from the small model when necessary. The comparison on benchmark tasks demonstrates that by reusing the reasoning tokens from small models, it can help approach the performance of a large model with its own reasoning, which confirms the effectiveness of our proposal.
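摘要所述多智能体推理框架的控制流可用如下假设性草图说明:大模型默认给出短回答,仅当其置信度不足时才调用小模型并复用其推理 token。其中 large_model、small_model_reasoning 均为演示用桩函数,并非真实模型 API:

```python
# 假设性草图:大模型短回答 + 按需转移小模型推理 token 的路由逻辑。

def large_model(question, reasoning=None):
    """桩函数:无辅助推理时对"难题"置信度低;给定推理 token 后重新作答。"""
    if reasoning is None:
        conf = 0.4 if "hard" in question else 0.9
        return "short answer", conf
    return f"answer using: {reasoning}", 0.95

def small_model_reasoning(question):
    """桩函数:小模型产出的关键推理 token(此处为占位字符串)。"""
    return "step-by-step reasoning tokens"

def multi_agent_infer(question, conf_threshold=0.7):
    answer, conf = large_model(question)
    if conf >= conf_threshold:
        return answer
    # 仅在必要时调用小模型,将其推理 token 转移给大模型
    return large_model(question, reasoning=small_model_reasoning(question))[0]
```

这样大模型的输出序列保持很短,端到端延迟主要由少数转移推理 token 的案例承担。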
[CV-4] Fully Procedural Synthetic Data from Simple Rules for Multi-View Stereo
【速读】:该论文旨在解决多视图立体重建(Multi-View Stereo, MVS)中高质量训练数据稀缺的问题,特别是如何高效生成大规模、多样化且具有挑战性的合成数据以提升MVS模型性能。其解决方案的关键在于提出SimpleProc——一种基于极简规则驱动的全程序化数据生成器,利用非均匀有理B样条(Non-Uniform Rational Basis Splines, NURBS)以及基础的位移和纹理模式,仅用少量规则即可生成高质量训练样本。实验表明,即使在8,000张图像的小规模下,该方法也优于同等规模的人工标注数据;在352,000张图像的大规模场景中,其性能可媲美甚至超越使用超过692,000张人工标注图像训练的模型,验证了该方案在数据生成效率与质量上的显著优势。
链接: https://arxiv.org/abs/2604.04925
作者: Zeyu Ma,Alexander Raistrick,Jia Deng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this paper, we explore the design space of procedural rules for multi-view stereo (MVS). We demonstrate that we can generate effective training data using SimpleProc: a new, fully procedural generator driven by a very small set of rules using Non-Uniform Rational Basis Splines (NURBS), as well as basic displacement and texture patterns. At a modest scale of 8,000 images, our approach achieves superior results compared to manually curated images (at the same scale) sourced from games and real-world objects. When scaled to 352,000 images, our method yields performance comparable to–and in several benchmarks, exceeding–models trained on over 692,000 manually curated images. The source code and the data are available at this https URL.
[CV-5] Your Pre-trained Diffusion Model Secretly Knows Restoration
【速读】:该论文旨在解决预训练扩散模型在All-in-One Restoration (AiOR)任务中难以直接利用其先验知识进行图像修复的问题,传统方法依赖微调或Control-Net类模块来引入修复能力,但存在复杂性和资源开销。解决方案的关键在于发现预训练扩散模型本身具备内在的修复行为,可通过在文本编码器输出层直接学习提示嵌入(prompt embeddings)来解锁这一能力;同时,为克服朴素提示学习因前向加噪过程与反向采样轨迹不一致导致的不稳定问题,提出基于扩散桥(diffusion bridge)框架的训练策略,确保从噪声退化状态到干净图像的去噪路径一致性,从而无需微调或专用控制模块即可实现高性能、高泛化的图像修复效果。
链接: https://arxiv.org/abs/2604.04924
作者: Sudarshan Rajagopalan,Vishal M. Patel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL
Abstract:Pre-trained diffusion models have enabled significant advancements in All-in-One Restoration (AiOR), offering improved perceptual quality and generalization. However, diffusion-based restoration methods primarily rely on fine-tuning or Control-Net style modules to leverage the pre-trained diffusion model’s priors for AiOR. In this work, we show that these pre-trained diffusion models inherently possess restoration behavior, which can be unlocked by directly learning prompt embeddings at the output of the text encoder. Interestingly, this behavior is largely inaccessible through text prompts and text-token embedding optimization. Furthermore, we observe that naive prompt learning is unstable because the forward noising process using degraded images is misaligned with the reverse sampling trajectory. To resolve this, we train prompts within a diffusion bridge formulation that aligns training and inference dynamics, enforcing a coherent denoising path from noisy degraded states to clean images. Building on these insights, we introduce our lightweight learned prompts on the pre-trained WAN video model and FLUX image models, converting them into high-performing restoration models. Extensive experiments demonstrate that our approach achieves competitive performance and generalization across diverse degradations, while avoiding fine-tuning and restoration-specific control modules.
[CV-6] A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens CVPR2026
【速读】:该论文旨在解决视频世界建模中生成多样化未来状态的挑战,特别是现有生成式世界模型计算成本过高、难以高效实现多假设预测的问题。其核心解决方案是提出DeltaTok和DeltaWorld:DeltaTok通过编码视觉基础模型(Vision Foundation Model, VFM)特征空间中连续帧间的差异,将三维时空视频表示压缩为一维时间序列的“delta”令牌;DeltaWorld基于这些紧凑的delta tokens构建生成式世界模型,支持并行生成多个未来假设,并仅对最优结果进行监督训练,从而在推理阶段实现单次前向传播即输出多样化的合理未来预测。此方法显著降低参数量(>35倍)与计算复杂度(>2000倍FLOPs),同时提升预测准确性。
链接: https://arxiv.org/abs/2604.04913
作者: Tommie Kerssies,Gabriele Berton,Ju He,Qihang Yu,Wufei Ma,Daan de Geus,Gijs Dubbelman,Liang-Chieh Chen
机构: Amazon(亚马逊); Eindhoven University of Technology (埃因霍温理工大学); Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026. Code and weights: this https URL
Abstract:Anticipating diverse future states is a central challenge in video world modeling. Discriminative world models produce a deterministic prediction that implicitly averages over possible futures, while existing generative world models remain computationally expensive. Recent work demonstrates that predicting the future in the feature space of a vision foundation model (VFM), rather than a latent space optimized for pixel reconstruction, requires significantly fewer world model parameters. However, most such approaches remain discriminative. In this work, we introduce DeltaTok, a tokenizer that encodes the VFM feature difference between consecutive frames into a single continuous “delta” token, and DeltaWorld, a generative world model operating on these tokens to efficiently generate diverse plausible futures. Delta tokens reduce video from a three-dimensional spatio-temporal representation to a one-dimensional temporal sequence, for example yielding a 1,024x token reduction with 512x512 frames. This compact representation enables tractable multi-hypothesis training, where many futures are generated in parallel and only the best is supervised. At inference, this leads to diverse predictions in a single forward pass. Experiments on dense forecasting tasks demonstrate that DeltaWorld forecasts futures that more closely align with real-world outcomes, while having over 35x fewer parameters and using 2,000x fewer FLOPs than existing generative world models. Code and weights: this https URL.
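DeltaTok 的核心思想(把相邻帧的 VFM 特征差编码为单个“delta”令牌,将三维时空表示压缩为一维时间序列)可用如下最小示意说明。这里直接以逐元素差充当 delta 令牌;真实系统会再经 tokenizer 压缩为单个连续令牌:

```python
# 假设性草图:由相邻帧特征差构造 delta 令牌序列(仅示意思想)。

def delta_tokens(frame_features):
    """frame_features: 按时间排列的特征向量列表(每帧一个向量)。
    返回长度为 T-1 的相邻帧差序列,即一维时间序列表示。"""
    return [
        [b - a for a, b in zip(prev, curr)]
        for prev, curr in zip(frame_features, frame_features[1:])
    ]
```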
[CV-7] SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing
【速读】:该论文旨在解决当前图像空间编辑(image spatial editing)模型在细粒度空间操控能力上的不足,尤其是难以实现对物体布局和相机视角的精确控制。其核心挑战在于缺乏一个能同时评估感知合理性与几何保真度的系统性评测基准,以及高质量、大规模且带有精确标注的训练数据稀缺。解决方案的关键在于:首先提出SpatialEdit-Bench基准,通过视角重建和构图分析联合衡量编辑结果的几何一致性与视觉真实性;其次构建了包含50万张合成图像的SpatialEdit-500k数据集,利用可控的Blender渲染管线生成多样背景与系统化相机轨迹,提供对象级与相机级操作的精确真实标签;最终基于此数据训练出SpatialEdit-16B模型,在通用编辑任务中表现优异,并显著优于现有方法在空间操纵任务上的性能。
链接: https://arxiv.org/abs/2604.04911
作者: Yicheng Xiao,Wenhu Zhang,Lin Song,Yukang Chen,Wenbo Li,Nan Jiang,Tianhe Ren,Haokun Lin,Wei Huang,Haoyang Huang,Xiu Li,Nan Duan,Xiaojuan Qi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL
Abstract:Image spatial editing performs geometry-driven transformations, allowing precise control over object layout and camera viewpoints. Current models are insufficient for fine-grained spatial manipulations, motivating a dedicated assessment suite. Our contributions are listed: (i) We introduce SpatialEdit-Bench, a complete benchmark that evaluates spatial editing by jointly measuring perceptual plausibility and geometric fidelity via viewpoint reconstruction and framing analysis. (ii) To address the data bottleneck for scalable training, we construct SpatialEdit-500k, a synthetic dataset generated with a controllable Blender pipeline that renders objects across diverse backgrounds and systematic camera trajectories, providing precise ground-truth transformations for both object- and camera-centric operations. (iii) Building on this data, we develop SpatialEdit-16B, a baseline model for fine-grained spatial editing. Our method achieves competitive performance on general editing while substantially outperforming prior methods on spatial manipulation tasks. All resources will be made public at this https URL.
[CV-8] FileGram: Grounding Agent Personalization in File-System Behavioral Traces
【速读】:该论文旨在解决当前协作式人工智能(AI)代理在本地文件系统中进行个性化时面临的严重数据约束问题,即由于严格的隐私限制和难以联合收集多模态真实世界行为轨迹,导致可扩展训练与评估受限,且现有方法过于聚焦交互层面而忽视了文件系统操作中的密集行为痕迹。其解决方案的关键在于提出 FileGram 框架,该框架将代理记忆与个性化建立在文件系统行为痕迹基础上,包含三个核心组件:(1) FileGramEngine——一个基于角色驱动的数据引擎,可模拟真实工作流并大规模生成细粒度多模态动作序列;(2) FileGramBench——一个基于文件系统行为痕迹的诊断基准,用于评估记忆系统的用户画像重建、轨迹解耦、角色漂移检测及多模态定位能力;(3) FileGramOS——一种自底向上的记忆架构,直接从原子级操作和内容差异构建用户画像,通过查询时抽象机制编码为过程、语义和情景三类通道,从而实现更精准的个性化记忆建模。
链接: https://arxiv.org/abs/2604.04901
作者: Shuai Liu,Shulin Tian,Kairui Hu,Yuhao Dong,Zhe Yang,Bo Li,Jingkang Yang,Chen Change Loy,Ziwei Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL , Code: this https URL
Abstract:Coworking AI agents operating within local file systems are rapidly emerging as a paradigm in human-AI interaction; however, effective personalization remains limited by severe data constraints, as strict privacy barriers and the difficulty of jointly collecting multimodal real-world traces prevent scalable training and evaluation, and existing methods remain interaction-centric while overlooking dense behavioral traces in file-system operations; to address this gap, we propose FileGram, a comprehensive framework that grounds agent memory and personalization in file-system behavioral traces, comprising three core components: (1) FileGramEngine, a scalable persona-driven data engine that simulates realistic workflows and generates fine-grained multimodal action sequences at scale; (2) FileGramBench, a diagnostic benchmark grounded in file-system behavioral traces for evaluating memory systems on profile reconstruction, trace disentanglement, persona drift detection, and multimodal grounding; and (3) FileGramOS, a bottom-up memory architecture that builds user profiles directly from atomic actions and content deltas rather than dialogue summaries, encoding these traces into procedural, semantic, and episodic channels with query-time abstraction; extensive experiments show that FileGramBench remains challenging for state-of-the-art memory systems and that FileGramEngine and FileGramOS are effective, and by open-sourcing the framework, we hope to support future research on personalized memory-centric file-system agents.
[CV-9] HorizonWeaver: Generalizable Multi-Level Semantic Editing for Driving Scenes CVPR
【速读】:该论文旨在解决自动驾驶场景生成中安全性验证的瓶颈问题,即如何在真实道路测试之外,实现大规模、可控且逼真的驾驶场景生成,以满足对复杂交通环境的全面安全评估需求。现有基于指令的图像编辑方法因训练数据偏向物体中心或艺术化内容,在密集且安全关键的驾驶布局中表现不佳。解决方案的关键在于提出HorizonWeaver框架,其核心贡献包括:(1)构建大规模真实-合成配对数据集(基于Boreas、nuScenes和Argoverse2),提升跨域泛化能力;(2)引入语言引导掩码(Language-Guided Masks),通过语义增强的掩码与提示实现细粒度编辑控制;(3)设计内容保留与指令对齐联合损失函数,确保场景一致性与指令忠实性。该方案在L1、CLIP、DINO等指标上优于基线方法,并显著提升用户偏好(+46.4%)和BEV分割IoU(+33%)。
链接: https://arxiv.org/abs/2604.04887
作者: Mauricio Soroco,Francesco Pittaluga,Zaid Tasneem,Abhishek Aich,Bingbing Zhuang,Wuyang Chen,Manmohan Chandraker,Ziyu Jiang
机构: Simon Fraser University (西蒙弗雷泽大学); NEC Labs America (NEC美国实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR Findings 2026
Abstract:Ensuring safety in autonomous driving requires scalable generation of realistic, controllable driving scenes beyond what real-world testing provides. Yet existing instruction guided image editors, trained on object-centric or artistic data, struggle with dense, safety-critical driving layouts. We propose HorizonWeaver, which tackles three fundamental challenges in driving scene editing: (1) multi-level granularity, requiring coherent object- and scene-level edits in dense environments; (2) rich high-level semantics, preserving diverse objects while following detailed instructions; and (3) ubiquitous domain shifts, handling changes in climate, layout, and traffic across unseen environments. The core of HorizonWeaver is a set of complementary contributions across data, model, and training: (1) Data: Large-scale dataset generation, where we build a paired real/synthetic dataset from Boreas, nuScenes, and Argoverse2 to improve generalization; (2) Model: Language-Guided Masks for fine-grained editing, where semantics-enriched masks and prompts enable precise, language-guided edits; and (3) Training: Content preservation and instruction alignment, where joint losses enforce scene consistency and instruction fidelity. Together, HorizonWeaver provides a scalable framework for photorealistic, instruction-driven editing of complex driving scenes, collecting 255K images across 13 editing categories and outperforming prior methods in L1, CLIP, and DINO metrics, achieving +46.4% user preference and improving BEV segmentation IoU by +33%. Project page: this https URL
[CV-10] DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing
【速读】:该论文旨在解决视频混剪(video mashup)创作中因缺乏跨层级多模态协同而造成的专业级流畅性不足问题,具体表现为视觉过渡突兀和音频错位。其解决方案的关键在于将视频混剪建模为多模态一致性满足问题(Multimodal Coherency Satisfaction Problem, MMCSP),并提出DIRECT框架——一个模拟专业制作流程的分层多智能体系统,包含三个级联层级:编剧(Screenwriter)负责基于源内容的全局结构锚定,导演(Director)生成自适应编辑意图与指导,编辑(Editor)则执行意图引导的镜头序列编辑并进行细粒度优化,从而实现语义、视觉与听觉维度的协同统一。
链接: https://arxiv.org/abs/2604.04875
作者: Ke Li,Maoliang Li,Jialiang Chen,Jiayu Chen,Zihao Zheng,Shaoqi Wang,Xiang Chen
机构: Peking University (北京大学); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:
Abstract:Video mashup creation represents a complex video editing paradigm that recomposes existing footage to craft engaging audio-visual experiences, demanding intricate orchestration across semantic, visual, and auditory dimensions and multiple levels. However, existing automated editing frameworks often overlook the cross-level multimodal orchestration to achieve professional-grade fluidity, resulting in disjointed sequences with abrupt visual transitions and musical misalignment. To address this, we formulate video mashup creation as a Multimodal Coherency Satisfaction Problem (MMCSP) and propose the DIRECT framework. Simulating a professional production pipeline, our hierarchical multi-agent framework decomposes the challenge into three cascade levels: the Screenwriter for source-aware global structural anchoring, the Director for instantiating adaptive editing intent and guidance, and the Editor for intent-guided shot sequence editing with fine-grained optimization. We further introduce Mashup-Bench, a comprehensive benchmark with tailored metrics for visual continuity and auditory alignment. Extensive experiments demonstrate that DIRECT significantly outperforms state-of-the-art baselines in both objective metrics and human subjective evaluation. Project page and code: this https URL
[CV-11] Free-Range Gaussians: Non-Grid-Aligned Generative 3D Gaussian Reconstruction
【速读】:该论文旨在解决多视角重建中因使用像素或体素对齐的高斯分布导致的冗余表示、未观测区域出现孔洞或模糊等问题。其核心解决方案是提出自由范围高斯(Free-Range Gaussians),通过在高斯参数空间进行流匹配(flow matching),实现非网格对齐的3D高斯预测,从而支持以非格点对齐的3D数据进行监督,并在未观测区域合成合理内容。此外,为提升效率与结构保持能力,引入分层补丁机制将空间相关的高斯聚类为联合Transformer标记,使序列长度减半;同时在训练中采用时间步加权渲染损失,在推理时结合光度梯度引导和无分类器引导策略,显著提升了重建质量和鲁棒性。
链接: https://arxiv.org/abs/2604.04874
作者: Ahan Shabanov,Peter Hedman,Ethan Weber,Zhengqin Li,Denis Rozumny,Gael Le Lan,Naina Dhingra,Lei Luo,Andrea Vedaldi,Christian Richardt,Andrea Tagliasacchi,Bo Zhu,Numair Khan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:We present Free-Range Gaussians, a multi-view reconstruction method that predicts non-pixel, non-voxel-aligned 3D Gaussians from as few as four images. This is done through flow matching over Gaussian parameters. Our generative formulation of reconstruction allows the model to be supervised with non-grid-aligned 3D data, and enables it to synthesize plausible content in unobserved regions. Thus, it improves on prior methods that produce highly redundant grid-aligned Gaussians, and suffer from holes or blurry conditional means in unobserved regions. To handle the number of Gaussians needed for high-quality results, we introduce a hierarchical patching scheme to group spatially related Gaussians into joint transformer tokens, halving the sequence length while preserving structure. We further propose a timestep-weighted rendering loss during training, and photometric gradient guidance and classifier-free guidance at inference to improve fidelity. Experiments on Objaverse and Google Scanned Objects show consistent improvements over pixel and voxel-aligned methods while using significantly fewer Gaussians, with large gains when input views leave parts of the object unobserved.
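摘要中"对高斯参数做 flow matching"可以用标准的线性(rectified-flow)插值路径做一个极简示意(参数维度与线性路径均为示意性假设,并非论文实现):在噪声样本 x0 与目标高斯参数 x1 之间插值,模型回归的目标速度恒为 x1 - x0。

```python
import numpy as np

# 对单个高斯的扁平化参数向量(位置、尺度、旋转、不透明度、颜色)做
# flow matching 的极简示意;参数布局与线性路径均为假设。
rng = np.random.default_rng(0)
dim = 14                                   # 每个高斯的参数向量维度(假设)
x0 = rng.standard_normal(dim)              # 噪声样本
x1 = rng.standard_normal(dim)              # 目标高斯参数
t = 0.3

x_t = (1 - t) * x0 + t * x1                # 线性路径上的插值点
v_target = x1 - x0                         # 模型回归的目标速度

# 验证路径的定义性质: d x_t / d t = x1 - x0
eps = 1e-6
x_t_eps = (1 - (t + eps)) * x0 + (t + eps) * x1
print(np.allclose((x_t_eps - x_t) / eps, v_target, atol=1e-4))
```

训练时模型在 (x_t, t) 处预测 v_target;采样时从噪声出发沿预测速度积分,即可生成非网格对齐的高斯参数。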
[CV-12] Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations CVPR2026
【速读】:该论文旨在解决大视觉语言模型(Large Vision-Language Models, LVLMs)在视觉推理任务中普遍存在幻觉(hallucination)的问题,尤其是现有检测方法依赖全局图像层面的粗粒度评估,难以识别那些在局部区域呈现弱但分散相关性的幻觉token。解决方案的关键在于提出一种基于patch-level(图像块级别)的细粒度 hallucination 检测框架,通过分析模型各层中token与图像区域之间的精细交互关系,识别出两类幻觉token的特征:(i) 注意力分布稀疏且非局部化,区别于忠实token的紧凑聚焦模式;(ii) 无法与任何视觉区域建立有意义的语义对齐。基于此,作者设计了一种轻量且可解释的检测方法,结合patch级统计特征与隐藏层表示,在token级别实现高达90%的检测准确率,验证了细粒度结构分析在提升幻觉检测能力上的有效性。
链接: https://arxiv.org/abs/2604.04863
作者: Tuan Dung Nguyen,Minh Khoi Ho,Qi Chen,Yutong Xie,Nguyen Cam-Tu,Minh Khoi Nguyen,Dang Huy Pham Nguyen,Anton van den Hengel,Johan W. Verjans,Phi Le Nguyen,Vu Minh Hieu Phan
机构: Hanoi University of Science and Technology (河内科技大学); Australian Institute for Machine Learning, University of Adelaide (阿德莱德大学机器学习研究所); Nanjing University (南京大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Hanoi-Amsterdam High School for the Gifted (河内-阿姆斯特丹天才高中)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR2026 Main Track
Abstract:Large vision-language models (LVLMs) achieve strong performance on visual reasoning tasks but remain highly susceptible to hallucination. Existing detection methods predominantly rely on coarse, whole-image measures of how an object token relates to the input image. This global strategy is limited: hallucinated tokens may exhibit weak but widely scattered correlations across many local regions, which aggregate into deceptively high overall relevance, thus evading the current global hallucination detectors. We begin with a simple yet critical observation: a faithful object token must be strongly grounded in a specific image region. Building on this insight, we introduce a patch-level hallucination detection framework that examines fine-grained token-level interactions across model layers. Our analysis uncovers two characteristic signatures of hallucinated tokens: (i) they yield diffuse, non-localized attention patterns, in contrast to the compact, well-focused attention seen in faithful tokens; and (ii) they fail to exhibit meaningful semantic alignment with any visual region. Guided by these findings, we develop a lightweight and interpretable detection method that leverages patch-level statistical features, combined with hidden-layer representations. Our approach achieves up to 90% accuracy in token-level hallucination detection, demonstrating the superiority of fine-grained structural analysis for detecting hallucinations.
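论文指出幻觉 token 的注意力分布"弥散且非局部化",而忠实 token 呈紧凑聚焦模式。下面基于注意力熵给出一个极简判别示意(阈值与判别规则均为示意性假设,并非论文的实际检测器):

```python
import numpy as np

def attention_entropy(attn):
    """某个 token 对图像 patch 的注意力分布的熵。
    弥散的注意力 -> 高熵;紧凑聚焦的注意力 -> 低熵。"""
    p = attn / attn.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def looks_diffuse(attn, frac=0.9):
    """示意性规则: 熵接近均匀分布上界 log(P) 时标记为弥散。
    frac 阈值为假设, 非论文设定。"""
    return attention_entropy(attn) > frac * np.log(len(attn))

num_patches = 576                          # 例如 24x24 的 patch 网格
focused = np.zeros(num_patches); focused[100] = 1.0   # 有视觉锚定的 token
diffuse = np.ones(num_patches) / num_patches          # 弥散分布的 token

print(looks_diffuse(focused), looks_diffuse(diffuse))
```

论文的完整方法还结合了隐藏层表示等 patch 级统计特征,此处仅演示"弥散 vs. 聚焦"这一单一信号。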
[CV-13] Unified Vector Floorplan Generation via Markup Representation CVPR2026
【速读】:该论文旨在解决住宅平面图(residential floorplan)自动生成中的多样性与泛化能力不足问题,尤其是现有生成模型在面对异构条件任务(如基于场地边界、房间邻接图或部分布局生成)时表现不佳的问题。其解决方案的关键在于提出了一种通用的表示方法——Floorplan Markup Language (FML),该方法将平面图信息编码为单一结构化的语法,从而将整个生成任务转化为下一个标记预测(next token prediction)问题;在此基础上构建的基于Transformer的生成模型FMLM,能够在多种条件下生成高质量且功能合理的平面图,且仅需一个统一模型即可超越以往针对特定任务设计的最优方法。
链接: https://arxiv.org/abs/2604.04859
作者: Kaede Shiohara,Toshihiko Yamasaki
机构: The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026. Webpage: this https URL
Abstract:Automatic residential floorplan generation has long been a central challenge bridging architecture and computer graphics, aiming to make spatial design more efficient and accessible. While early methods based on constraint satisfaction or combinatorial optimization ensure feasibility, they lack diversity and flexibility. Recent generative models achieve promising results but struggle to generalize across heterogeneous conditional tasks, such as generation from site boundaries, room adjacency graphs, or partial layouts, due to their suboptimal representations. To address this gap, we introduce Floorplan Markup Language (FML), a general representation that encodes floorplan information within a single structured grammar, which casts the entire floorplan generation problem into a next token prediction task. Leveraging FML, we develop a transformer-based generative model, FMLM, capable of producing high-fidelity and functional floorplans under diverse conditions. Comprehensive experiments on the RPLAN dataset demonstrate that FMLM, despite being a single model, surpasses the previous task-specific state-of-the-art methods.
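FML 将平面图信息编码为单一结构化序列,从而把整个生成问题转化为下一个标记预测。下面用一个假想的类 FML 字符串做极简示意(真实 FML 语法未在摘要中给出,此处的标记与字段纯属虚构):

```python
# 假想的类 FML 序列化: 一张平面图变成一条 token 序列,
# 生成任务即逐 token 的下一个标记预测(标记语法为虚构示例)。
plan = "<plan> <room type=living x=0 y=0 w=5 h=4/> <room type=bed x=5 y=0 w=3 h=4/> </plan>"
tokens = plan.split()

# 下一个标记预测的训练样本: 由前缀 tokens[:i+1] 预测 tokens[i+1]
pairs = [(tokens[:i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]
print(len(pairs), pairs[0])
```

这种表示的好处在于:场地边界、邻接图或部分布局等条件都可以作为序列前缀拼入,同一个自回归模型即可覆盖异构条件任务。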
[CV-14] The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models CVPR2026
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在自动驾驶场景中因微调导致的灾难性遗忘(catastrophic forgetting)问题,即在适应驾驶特定数据的过程中,模型原有的预训练世界知识被严重侵蚀,从而削弱其泛化能力与应用价值。解决方案的关键在于提出一种名为Drive Expert Adapter (DEA) 的新框架,该框架通过将适应过程从权重空间迁移至提示(prompt)空间,并基于场景特征动态路由到不同的知识专家模块,从而在不修改模型基础参数的前提下提升驾驶任务性能并有效缓解知识退化问题。
链接: https://arxiv.org/abs/2604.04857
作者: Runhao Mao,Hanshi Wang,Yixiang Yang,Qianli Ma,Jingmeng Zhou,Zhipeng Zhang
机构: AutoLab, School of Artificial Intelligence, Shanghai Jiao Tong University (上海交通大学人工智能学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: received by cvpr2026
Abstract:The integration of Vision-Language Models (VLMs) into autonomous driving promises to solve long-tail scenarios, but this paradigm faces the critical and unaddressed challenge of catastrophic forgetting. The very fine-tuning process used to adapt these models to driving-specific data simultaneously erodes their invaluable pre-trained world knowledge, creating a self-defeating paradox that undermines the core reason for their use. This paper provides the first systematic investigation into this phenomenon. We introduce a new large-scale dataset of 180K scenes, which enables the first-ever benchmark specifically designed to quantify catastrophic forgetting in autonomous driving. Our analysis reveals that existing methods suffer from significant knowledge degradation. To address this, we propose the Drive Expert Adapter (DEA), a novel framework that circumvents this trade-off by shifting adaptation from the weight space to the prompt space. DEA dynamically routes inference through different knowledge experts based on scene-specific cues, enhancing driving-task performance without corrupting the model’s foundational parameters. Extensive experiments demonstrate that our approach not only achieves state-of-the-art results on driving tasks but also effectively mitigates catastrophic forgetting, preserving the essential generalization capabilities that make VLMs a transformative force for autonomous systems. Data and model are released at FidelityDrivingBench.
[CV-15] InfBaGel: Human-Object-Scene Interaction Generation with Dynamic Perception and Iterative Refinement ICLR2026
【速读】:该论文旨在解决人-物-场景交互(Human-Object-Scene Interactions, HOSI)生成中的关键挑战,包括动态物体与场景变化的推理困难、标注数据稀缺以及物理冲突(如碰撞和穿透)问题。解决方案的核心在于提出一种粗粒度到细粒度的指令条件交互生成框架,该框架显式对齐一致性模型(consistency model)的迭代去噪过程,并引入动态感知策略,在每个去噪步骤中利用前序优化轨迹更新场景上下文,从而保证交互的一致性;同时设计了一种无需精细场景几何信息的“ bump-aware guidance”机制以减少物理伪影,实现实时生成;此外,通过混合训练策略合成伪HOSI样本(将体素化场景占据注入HOI数据集),并联合高保真HSI数据进行训练,有效缓解数据稀缺问题,同时保持真实场景感知能力。
链接: https://arxiv.org/abs/2604.04843
作者: Yude Zou,Junji Gong,Xing Gao,Zixuan Li,Tianxing Chen,Guanjie Zheng
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Sichuan University (四川大学); Shenzhen University (深圳大学); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICLR 2026
Abstract:Human-object-scene interactions (HOSI) generation has broad applications in embodied AI, simulation, and animation. Unlike human-object interaction (HOI) and human-scene interaction (HSI), HOSI generation requires reasoning over dynamic object-scene changes, yet suffers from limited annotated data. To address these issues, we propose a coarse-to-fine instruction-conditioned interaction generation framework that is explicitly aligned with the iterative denoising process of a consistency model. In particular, we adopt a dynamic perception strategy that leverages trajectories from the preceding refinement to update scene context and condition subsequent refinement at each denoising step of the consistency model, yielding consistent interactions. To further reduce physical artifacts, we introduce a bump-aware guidance that mitigates collisions and penetrations during sampling without requiring fine-grained scene geometry, enabling real-time generation. To overcome data scarcity, we design a hybrid training strategy that synthesizes pseudo-HOSI samples by injecting voxelized scene occupancy into HOI datasets and jointly trains with high-fidelity HSI data, allowing interaction learning while preserving realistic scene awareness. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both HOSI and HOI generation, and strong generalization to unseen scenes. Project page: this https URL
[CV-16] Less Detail Better Answers: Degradation-Driven Prompting for VQA CVPR
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在视觉问答(Visual Question Answering, VQA)任务中因高分辨率细节引入噪声而导致幻觉或推理错误的问题。其解决方案的关键在于提出一种降级驱动提示(Degradation-Driven Prompting, DDP)框架,通过有策略地降低图像保真度,迫使模型聚焦于关键结构信息;具体包括针对物理属性任务采用80p下采样、结构视觉辅助(如白底遮罩和正交线)与上下文学习(In-Context Learning, ICL)相结合的方法,以及针对感知现象任务引入任务分类阶段与专用工具(如模糊掩码和对比度增强)来引导模型忽略干扰纹理并提升推理准确性。
链接: https://arxiv.org/abs/2604.04838
作者: Haoxuan Han,Weijie Wang,Zeyu Zhang,Yefei He,Bohan Zhuang
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures, CVPRW
Abstract:Recent advancements in Vision-Language Models (VLMs) have significantly pushed the boundaries of Visual Question Answering (VQA). However, high-resolution details can sometimes become noise that leads to hallucinations or reasoning errors. In this paper, we propose Degradation-Driven Prompting (DDP), a novel framework that improves VQA performance by strategically reducing image fidelity to force models to focus on essential structural information. We evaluate DDP across two distinct tasks. Physical attributes targets images prone to human misjudgment, where DDP employs a combination of 80p downsampling, structural visual aids (white background masks and orthometric lines), and In-Context Learning (ICL) to calibrate the model’s focus. Perceptual phenomena addresses various machine-susceptible visual anomalies and illusions, including Visual Anomaly (VA), Color (CI), Motion (MI), Gestalt (GI), Geometric (GSI), and Visual Illusions (VI). For this task, DDP integrates a task-classification stage with specialized tools such as blur masks and contrast enhancement alongside downsampling. Our experimental results demonstrate that less is more: by intentionally degrading visual inputs and providing targeted structural prompts, DDP enables VLMs to bypass distracting textures and achieve superior reasoning accuracy on challenging visual benchmarks.
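DDP 的核心降级操作之一是 80p 下采样。下面用最近邻采样给出一个极简示意(论文实际使用的重采样核未说明,此处实现仅作说明):把图像较短边缩至 80 像素并保持宽高比。

```python
import numpy as np

def downsample_80p(img):
    """DDP 风格的保真度降级: 将图像较短边缩至 80 像素 ('80p'),
    此处用最近邻索引实现 (重采样核为示意性假设)。"""
    h, w = img.shape[:2]
    scale = 80 / min(h, w)
    nh, nw = max(1, round(h * scale)), max(1, round(w * scale))
    ys = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    return img[ys][:, xs]

img = np.zeros((480, 640, 3), dtype=np.uint8)   # 示例输入: 480p 图像
small = downsample_80p(img)
print(small.shape)                               # 较短边变为 80
```

降级后的图像丢弃了干扰性纹理细节,只保留粗粒度结构,再配合结构化视觉辅助与提示输入模型。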
[CV-17] E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes
【速读】:该论文旨在解决机器人视觉-语言-动作(Vision-Language-Action, VLA)模型在感知阶段遭遇极端低光、运动模糊和黑位截断(black clipping)等传感退化时,其操作鲁棒性显著下降的问题。解决方案的关键在于提出 E-VLA 框架:不依赖从事件重建图像,而是直接利用事件流中的运动与结构线索来维持语义感知一致性与感知-动作一致性,并采用轻量级、与预训练兼容的事件融合策略(如将事件累积图叠加至 RGB 图像)。作者还构建了真实世界同步 RGB-事件-动作数据集以验证有效性。实验表明,即使是最简单的无参数融合方式也能显著提升任务成功率:在 20 lux 低光照下 Pick-Place 成功率从 0% 提升至 60%,严重运动模糊(1000 ms 曝光)下 Sorting 成功率从 5% 提升至 32.5%,证明事件驱动感知可有效增强 VLA 模型的鲁棒性。
链接: https://arxiv.org/abs/2604.04834
作者: Jiajun Zhai,Hao Shi,Shangwei Guo,Kailun Yang,Kaiwei Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: Code and dataset will be available at this https URL
Abstract:Robotic Vision-Language-Action (VLA) models generalize well for open-ended manipulation, but their perception is fragile under sensing-stage degradations such as extreme low light, motion blur, and black clipping. We present E-VLA, an event-augmented VLA framework that improves manipulation robustness when conventional frame-based vision becomes unreliable. Instead of reconstructing images from events, E-VLA directly leverages motion and structural cues in event streams to preserve semantic perception and perception-action consistency under adverse conditions. We build an open-source teleoperation platform with a DAVIS346 event camera and collect a real-world synchronized RGB-event-action manipulation dataset across diverse tasks and illumination settings. We also propose lightweight, pretrained-compatible event integration strategies and study event windowing and fusion for stable deployment. Experiments show that even a simple parameter-free fusion, i.e., overlaying accumulated event maps onto RGB images, could substantially improve robustness in dark and blur-heavy scenes: on Pick-Place at 20 lux, success increases from 0% (image-only) to 60% with overlay fusion and to 90% with our event adapter; under severe motion blur (1000 ms exposure), Pick-Place improves from 0% to 20-25%, and Sorting from 5% to 32.5%. Overall, E-VLA provides systematic evidence that event-driven perception can be effectively integrated into VLA models, pointing toward robust embodied intelligence beyond conventional frame-based imaging. Code and dataset will be available at this https URL.
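摘要中"最简单的无参数融合"即把事件累积图叠加到 RGB 帧上。下面给出一个极简示意(事件元组格式与混合权重 alpha 均为示意性假设):

```python
import numpy as np

def accumulate_events(events, h, w):
    """把一个时间窗内的事件 (x, y, polarity in {-1,+1}) 累积为单通道图。"""
    acc = np.zeros((h, w), dtype=np.float32)
    for x, y, p in events:
        acc[y, x] += p
    return acc

def overlay_fusion(rgb, event_map, alpha=0.5):
    """无参数融合示意: 将归一化后的事件累积图叠加到 RGB 帧上,
    对应论文报告的最简单 overlay 基线; alpha 为示意性混合权重。"""
    m = np.abs(event_map)
    if m.max() > 0:
        m = m / m.max()
    return (1 - alpha) * rgb + alpha * (m[..., None] * 255.0)

h, w = 4, 4
rgb = np.zeros((h, w, 3), dtype=np.float32)        # 例如近乎全黑的低光帧
events = [(1, 2, +1), (1, 2, +1), (3, 0, -1)]
fused = overlay_fusion(rgb, accumulate_events(events, h, w))
print(fused.shape)
```

在低光场景中 RGB 帧接近全黑,而事件相机仍能记录运动边缘,叠加后这些边缘重新变得可见,这正是 overlay 基线能把成功率从 0% 拉起的原因。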
[CV-18] Multi-Modal Sensor Fusion using Hybrid Attention for Autonomous Driving
【速读】:该论文旨在解决自动驾驶中3D目标检测的精度问题,即如何有效融合相机与毫米波雷达这两种互补传感器的数据以提升检测性能。其核心挑战在于相机提供丰富语义信息但深度估计不可靠,而雷达虽能精确测量距离和速度,却存在几何信息稀疏的问题。解决方案的关键是提出MMF-BEV框架,通过引入可变形注意力机制(Deformable Attention)实现跨模态特征对齐:构建基于BEVDepth的相机分支和RadarBEVNet的雷达分支,并利用可变形交叉注意力模块进行特征融合,从而在View-of-Delft (VoD) 4D雷达数据集上实现更鲁棒、准确的多模态感知。
链接: https://arxiv.org/abs/2604.04797
作者: Mayank Mayank,Bharanidhar Duraisamy,Florian Geiß,Abhinav Valada
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 pages, 8 figures
Abstract:Accurate 3D object detection for autonomous driving requires complementary sensors. Cameras provide dense semantics but unreliable depth, while millimeter-wave radar offers precise range and velocity measurements with sparse geometry. We propose MMF-BEV, a radar-camera BEV fusion framework that leverages deformable attention for cross-modal feature alignment on the View-of-Delft (VoD) 4D radar dataset [1]. MMF-BEV builds a BEVDepth [2] camera branch and a RadarBEVNet [3] radar branch, each enhanced with Deformable Self-Attention, and fuses them via a Deformable Cross-Attention module. We evaluate three configurations: camera-only, radar-only, and hybrid fusion. A sensor contribution analysis quantifies per-distance modality weighting, providing interpretable evidence of sensor complementarity. A two-stage training strategy - pre-training the camera branch with depth supervision, then jointly training radar and fusion modules - stabilizes learning. Experiments on VoD show that MMF-BEV consistently outperforms unimodal baselines and achieves competitive results against prior fusion methods across all object classes in both the full annotated area and near-range Region of Interest.
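摘要中的 Deformable Cross-Attention 融合可以用普通单头交叉注意力做一个退化版示意(省略了可变形偏移预测与参考点采样,残差式融合方式为假设,非论文实现):相机 BEV 单元作为查询,聚合雷达 BEV 特征。

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats, d):
    """单头交叉注意力: 相机 BEV 单元 (查询) 聚合雷达 BEV 特征 (键/值)。
    这是论文 Deformable Cross-Attention 的退化版示意 (省略偏移与采样)。"""
    scores = q_feats @ kv_feats.T / np.sqrt(d)
    return softmax(scores) @ kv_feats

n_cells, d = 16, 8
rng = np.random.default_rng(0)
cam = rng.standard_normal((n_cells, d))      # 相机分支的 BEV 特征
rad = rng.standard_normal((n_cells, d))      # 雷达分支的 BEV 特征
fused = cam + cross_attention(cam, rad, d)   # 残差式融合 (假设)
print(fused.shape)
```

可变形注意力在此基础上为每个查询额外预测少量采样偏移,只在这些位置取键/值,从而适配稀疏的雷达几何并降低计算量。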
[CV-19] AvatarPointillist: AutoRegressive 4D Gaussian Avatarization CVPR2026
【速读】:该论文旨在解决从单张肖像图像生成动态4D高斯化身(4D Gaussian Avatars)的挑战,尤其是如何在保证高保真度和可控性的同时实现高效、自适应的点云构建与动画绑定。其解决方案的关键在于提出了一种基于自回归(Autoregressive, AR)的解码器-only Transformer架构,该架构通过逐点生成的方式动态调整点密度和总点数以适配主体复杂度,并在生成过程中联合预测每个点的绑定信息(binding information),从而实现逼真的动画效果;随后,一个专门设计的高斯解码器将生成的点云转换为可渲染的高斯属性,且通过将解码器条件化于自回归生成器的潜在特征,实现了两阶段间的有效交互,显著提升了生成质量与一致性。
链接: https://arxiv.org/abs/2604.04787
作者: Hongyu Liu,Xuan Wang,Yating Wang,Zijian Wu,Ziyu Wan,Yue Ma,Runtao Liu,Boyao Zhou,Yujun Shen,Qifeng Chen
机构: Ant Group(蚂蚁集团); City University of Hong Kong (香港城市大学); HKUST (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by the CVPR 2026 main conference. Project page: this https URL
Abstract:We introduce AvatarPointillist, a novel framework for generating dynamic 4D Gaussian avatars from a single portrait image. At the core of our method is a decoder-only Transformer that autoregressively generates a point cloud for 3D Gaussian Splatting. This sequential approach allows for precise, adaptive construction, dynamically adjusting point density and the total number of points based on the subject’s complexity. During point generation, the AR model also jointly predicts per-point binding information, enabling realistic animation. After generation, a dedicated Gaussian decoder converts the points into complete, renderable Gaussian attributes. We demonstrate that conditioning the decoder on the latent features from the AR generator enables effective interaction between stages and markedly improves fidelity. Extensive experiments validate that AvatarPointillist produces high-quality, photorealistic, and controllable avatars. We believe this autoregressive formulation represents a new paradigm for avatar generation, and we will release our code to inspire future research.
[CV-20] CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models
【速读】:该论文旨在解决真实场景中图像退化(如模糊、噪声、压缩失真和光照不良)对多模态理解性能的严重损害问题。现有统一多模态模型虽具备生成能力,但未能有效利用其生成路径来恢复退化图像的细粒度结构,导致在处理退化输入时表现不佳。解决方案的关键在于提出CLEAR框架,通过三个渐进步骤实现生成与推理能力的深度融合:(1) 在退化感知数据集上进行监督微调,建立“先生成后回答”的推理模式;(2) 引入潜空间桥梁(Latent Representation Bridge),替代传统的解码-重编码路径,构建生成与推理之间的可优化直接连接;(3) 设计交错式GRPO强化学习方法,在答案正确性奖励下联合优化文本推理与视觉生成。该方案显著提升了模型在退化输入下的鲁棒性,同时保持了对清晰图像的良好性能。
链接: https://arxiv.org/abs/2604.04780
作者: Xiangzhao Hao,Zefeng Zhang,Zhenyu Zhang,Linhao Yu,Yao Chen,Yiqian Zhang,Haiyun Guo,Shuohuan Wang,Yu Sun
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Baidu Inc. (百度公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Image degradation from blur, noise, compression, and poor illumination severely undermines multimodal understanding in real-world settings. Unified multimodal models that combine understanding and generation within a single architecture are a natural fit for this challenge, as their generative pathway can model the fine-grained visual structure that degradation destroys. Yet these models fail to leverage their own generative capacity on degraded inputs. We trace this disconnect to two compounding factors: existing training regimes never ask the model to invoke generation during reasoning, and the standard decode-reencode pathway does not support effective joint optimization. We present CLEAR, a framework that connects the two capabilities through three progressive steps: (1) supervised fine-tuning on a degradation-aware dataset to establish the generate-then-answer reasoning pattern; (2) a Latent Representation Bridge that replaces the decode-reencode detour with a direct, optimizable connection between generation and reasoning; (3) Interleaved GRPO, a reinforcement learning method that jointly optimizes text reasoning and visual generation under answer-correctness rewards. We construct MMD-Bench, covering three degradation severity levels across six standard multimodal benchmarks. Experiments show that CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance. Our analysis further reveals that removing pixel-level reconstruction supervision leads to intermediate visual states with higher perceptual quality, suggesting that task-driven optimization and visual quality are naturally aligned.
[CV-21] Think in Strokes Not Pixels: Process-Driven Image Generation via Interleaved Reasoning
【速读】:该论文旨在解决当前统一多模态模型在文本到图像生成任务中缺乏对中间状态(intermediate states)显式建模的问题,即模型通常以单步方式生成图像,无法模拟人类绘画过程中逐步迭代、反思与修正的思维链。其解决方案的关键在于提出一种**过程驱动的图像生成(process-driven image generation)**范式,将生成过程分解为四个交替进行的阶段:文本规划(textual planning)、视觉草图(visual drafting)、文本反思(textual reflection)和视觉精修(visual refinement)。该方法通过显式地约束每一步的文本与视觉中间状态,实现跨模态的协同演进:文本推理指导视觉状态演化,而生成的视觉中间结果又反过来约束后续文本推理,从而形成可解释且可监督的多步生成轨迹。关键创新在于引入密集的阶段性监督机制,确保视觉中间状态的空间与语义一致性,并保持文本中间状态对先前视觉知识的继承性,同时具备识别并纠正提示违反内容的能力。
链接: https://arxiv.org/abs/2604.04746
作者: Lei Zhang,Junjiao Tian,Zhipeng Fan,Kunpeng Li,Jialiang Wang,Weifeng Chen,Markos Georgopoulos,Felix Juefei-Xu,Yuxiang Bao,Julian McAuley,Manling Li,Zecheng He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Humans paint images incrementally: they plan a global layout, sketch a coarse draft, inspect, and refine details, and most importantly, each step is grounded in the evolving visual states. However, can unified multimodal models trained on text-image interleaved datasets also imagine the chain of intermediate states? In this paper, we introduce process-driven image generation, a multi-step paradigm that decomposes synthesis into an interleaved reasoning trajectory of thoughts and actions. Rather than generating images in a single step, our approach unfolds across multiple iterations, each consisting of 4 stages: textual planning, visual drafting, textual reflection, and visual refinement. The textual reasoning explicitly conditions how the visual state should evolve, while the generated visual intermediate in turn constrains and grounds the next round of textual reasoning. A core challenge of process-driven generation stems from the ambiguity of intermediate states: how can models evaluate each partially-complete image? We address this through dense, step-wise supervision that maintains two complementary constraints: for the visual intermediate states, we enforce the spatial and semantic consistency; for the textual intermediate states, we preserve the prior visual knowledge while enabling the model to identify and correct prompt-violating elements. This makes the generation process explicit, interpretable, and directly supervisable. To validate proposed method, we conduct experiments under various text-to-image generation benchmarks.
[CV-22] Discovering Failure Modes in Vision-Language Models using RL
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在处理基础视觉概念(如计数、空间推理和视角理解)时易出现错误的问题,而这些问题往往在人工设计的评测中被忽视,导致对模型脆弱性的认知不完整。解决方案的关键在于提出一种基于强化学习(Reinforcement Learning, RL)的自动化框架,通过训练一个提问代理(questioner agent),根据候选VLM的响应动态生成复杂度递增的查询,聚焦于细粒度视觉细节和技能组合的变化,从而无监督地识别出36种新型失败模式,显著提升了对VLM盲区的探测能力与可扩展性。
链接: https://arxiv.org/abs/2604.04733
作者: Kanishk Jain,Qian Yang,Shravan Nayak,Parisa Kordjamshidi,Nishanth Anand,Aishwarya Agrawal
机构: Mila – Québec AI Institute (Mila – 魁北克人工智能研究所); Université de Montréal (蒙特利尔大学); Michigan State University (密歇根州立大学); McGill University (麦吉尔大学); Canada CIFAR AI Chair (加拿大 CIFAR 人工智能主席)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-language Models (VLMs), despite achieving strong performance on multimodal benchmarks, often misinterpret straightforward visual concepts that humans identify effortlessly, such as counting, spatial reasoning, and viewpoint understanding. Previous studies manually identified these weaknesses and found that they often stem from deficits in specific skills. However, such manual efforts are costly, unscalable, and subject to human bias, which often overlooks subtle details in favor of salient objects, resulting in an incomplete understanding of a model’s vulnerabilities. To address these limitations, we propose a Reinforcement Learning (RL)-based framework to automatically discover the failure modes or blind spots of any candidate VLM on a given data distribution without human intervention. Our framework trains a questioner agent that adaptively generates queries based on the candidate VLM’s responses to elicit incorrect answers. Our approach increases question complexity by focusing on fine-grained visual details and distinct skill compositions as training progresses, consequently identifying 36 novel failure modes in which VLMs struggle. We demonstrate the broad applicability of our framework by showcasing its generalizability across various model combinations.
[CV-23] Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs CVPR
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在移动、嵌入式及边缘设备上部署时,因键值缓存(Key-Value Cache, KV cache)内存和带宽开销随上下文长度线性增长而导致的推理效率瓶颈问题。现有KV缓存量化方法通常采用固定精度或人工设计规则,造成对低重要性token过度压缩而对高信息量token压缩不足,从而引发可避免的准确率下降。解决方案的关键在于提出一种基于学习的自适应KV缓存量化框架,其核心是通过轻量级token级特征(如词频、质量得分、注意力方差和基于熵的不确定性)构建一个数据驱动的控制器,在解码过程中动态选择2-bit、4-bit、8-bit或FP16等不同位宽的精度分配策略,使比特分配与token重要性成正比,从而在显著降低KV缓存内存占用和解码延迟的同时,保持接近FP16精度的性能表现。
链接: https://arxiv.org/abs/2604.04722
作者: Sayed Pedram Haeri Boroujeni,Niloufar Mehrabi,Patrick Woods,Gabriel Hillesheim,Abolfazl Razi
机构: Clemson University (克莱姆森大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
Abstract:Large Language Models (LLMs) have achieved remarkable progress across reasoning, generation, and decision-making tasks, yet deploying them on mobile, embedded, and edge devices remains particularly challenging. On-device LLM inference is heavily constrained by the memory and bandwidth overhead of the key-value (KV) cache, which grows linearly with context length and often dominates decoding cost. Existing KV-cache quantization schemes typically rely on fixed precision or hand-crafted heuristics, thereby wasting bits on low-impact tokens while over-compressing informative ones, leading to avoidable accuracy degradation. Inspired by Huffman coding’s principle of variable-length allocation, we propose adaptive KV-cache quantization, a learned policy that assigns bit-width proportional to token importance, minimizing expected memory and latency without sacrificing competitive accuracy. Our framework extracts lightweight token-level features, including token frequency, quality score, attention variance, and entropy-based uncertainty, and feeds them into a compact data-driven controller that dynamically selects KV precision from 2-bit, 4-bit, 8-bit, FP16 during decoding. This adaptive precision policy reduces KV memory footprint and latency while improving accuracy compared to static KV quantization and rule-based baselines, and maintaining competitive accuracy close to FP16 inference across standard LLM benchmarks. Extensive experiments across multiple commonsense reasoning benchmarks using SmolLM-135M, SmolLM-360M, and SmolLM-1.7B demonstrate that our controller consistently improves the accuracy-latency trade-off. For instance, with SmolLM-360M on HellaSwag, our method reduces decoding latency (ms/token) by 17.75% relative to static KV quantization, improves accuracy by 7.60 points, and remains within only 0.30 points of FP16 inference.
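The bit-allocation idea above (more precision for more informative tokens) can be sketched with a toy, hand-written policy. The paper's controller is learned from data, so the feature weights and thresholds below are purely illustrative assumptions, not the authors' settings:

```python
# Hypothetical sketch of an importance-proportional KV-cache bit-width
# policy. The paper trains a compact controller on token-level features;
# this hand-written scoring rule merely stands in for it, so the feature
# weights and thresholds are illustrative, not the paper's.

def token_importance(freq: float, attn_var: float, entropy: float) -> float:
    """Combine lightweight token features into one importance score in [0, 1].

    Rare tokens (low frequency), high attention variance, and high
    entropy-based uncertainty are all treated as signs of informativeness.
    """
    return 0.4 * (1.0 - freq) + 0.3 * attn_var + 0.3 * entropy

def select_kv_precision(score: float) -> int:
    """Map an importance score to a KV-cache bit-width (FP16 kept as 16)."""
    if score < 0.25:
        return 2   # low-impact token: compress aggressively
    if score < 0.5:
        return 4
    if score < 0.75:
        return 8
    return 16      # high-impact token: keep full FP16 precision

# Three tokens of increasing importance receive increasing bit budgets.
features = [(0.9, 0.1, 0.1), (0.4, 0.5, 0.5), (0.1, 0.9, 0.9)]
widths = [select_kv_precision(token_importance(f, v, e)) for f, v, e in features]
```

The point of the sketch is only the shape of the mechanism: bits grow monotonically with estimated token importance, so low-impact tokens stop wasting cache memory.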
[CV-24] OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
【速读】:该论文旨在解决当前人工智能领域中世界模型(World Model)缺乏统一定义与标准化框架的问题,从而阻碍了跨任务模型的整合与高效协作。解决方案的关键在于提出一个清晰的定义:世界模型是围绕感知构建、具备交互能力和长期记忆能力的模型或框架,用于理解和预测复杂环境;在此基础上,作者开发了OpenWorldLib——一个集成多任务模型的标准化推理框架,实现了不同任务模型间的高效复用与协同推理,为未来世界模型研究提供了系统性支撑。
链接: https://arxiv.org/abs/2604.04707
作者: DataFlow Team,Bohan Zeng,Daili Hua,Kaixin Zhu,Yifan Dai,Bozhou Li,Yuran Wang,Chengzhuo Tong,Yifan Yang,Mingkun Chang,Jianbin Zhao,Zhou Liu,Hao Liang,Xiaochen Ma,Ruichuan An,Junbo Niu,Zimo Meng,Tianyi Bai,Meiyi Qiang,Huanyao Zhang,Zhiyou Xiao,Tianyu Guo,Qinhan Yu,Runhao Zhao,Zhengpin Li,Xinyi Huang,Yisheng Pan,Yiwen Tang,Yang Shi,Yue Ding,Xinlong Chen,Hongcheng Gao,Minglei Shi,Jialong Wu,Zekun Wang,Yuanxing Zhang,Xintao Wang,Pengfei Wan,Yiren Song,Mike Zheng Shou,Wentao Zhang
机构: Peking University (北京大学); Tsinghua University (清华大学); Chinese Academy of Sciences (中国科学院); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Alibaba Cloud (阿里云); Baidu (百度); Tencent (腾讯); Microsoft (微软); Huawei (华为); Google (谷歌); Meta (Meta); Stability.AI (Stability.AI); Anthropic (Anthropic); Character.ai (Character.ai); Claude (Claude)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 6 figures
Abstract:World models have garnered significant attention as a promising research direction in artificial intelligence, yet a clear and unified definition remains lacking. In this paper, we introduce OpenWorldLib, a comprehensive and standardized inference framework for Advanced World Models. Drawing on the evolution of world models, we propose a clear definition: a world model is a model or framework centered on perception, equipped with interaction and long-term memory capabilities, for understanding and predicting the complex world. We further systematically categorize the essential capabilities of world models. Based on this definition, OpenWorldLib integrates models across different tasks within a unified framework, enabling efficient reuse and collaborative inference. Finally, we present additional reflections and analyses on potential future directions for world model research. Code link: this https URL
[CV-25] Explainable Machine Learning for Sepsis Outcome Prediction Using a Novel Romanian Electronic Health Record Dataset
【速读】:该论文旨在解决脓毒症(sepsis)预后预测中模型性能与临床可解释性之间的平衡问题。其解决方案的关键在于构建并分析可解释的机器学习(explainable machine learning, XAI)模型,利用来自罗马尼亚大型急诊医院的12,286例住院患者的电子健康记录(Electronic Health Record, EHR)数据,涵盖人口统计学信息、国际疾病分类(ICD-10)诊断及600种实验室检测指标。研究通过训练五种机器学习模型,在保留临床可解释性的前提下捕捉复杂的变量分布,并采用SHapley Additive exPlanations(SHAP)方法评估特征重要性,从而识别出如嗜酸性粒细胞减少(eosinopenia)、尿素水平、天冬氨酸氨基转移酶(aspartate aminotransferase)、血小板计数等关键预测因子。实验表明,在“死亡 vs. 恢复”任务中达到AUC=0.983、准确率=0.93的最优性能,验证了该方法在临床场景中的高适用性与潜在价值。
链接: https://arxiv.org/abs/2604.04698
作者: Andrei-Alexandru Bunea,Ovidiu Ghibea,Dan-Matei Popovici,Ion Daniel,Octavian Andronic
机构: Carol Davila University of Medicine and Pharmacy, Faculty of Medicine, Bucharest, Romania (布加勒斯特卡罗尔·戴维拉医科大学医学院); POLITEHNICA Bucharest National University of Science and Technology, Bucharest, Romania (布加勒斯特波利特赫尼卡国立科学与技术大学); General Surgery Department, University Emergency Hospital of Bucharest, Bucharest, Romania (布加勒斯特大学急救医院普外科); Innovation and eHealth Center, Carol Davila University of Medicine and Pharmacy, Bucharest, Romania (布加勒斯特卡罗尔·戴维拉医科大学创新与数字健康中心)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We develop and analyze explainable machine learning (ML) models for sepsis outcome prediction using a novel Electronic Health Record (EHR) dataset from 12,286 hospitalizations at a large emergency hospital in Romania. The dataset includes demographics, International Classification of Diseases (ICD-10) diagnostics, and 600 types of laboratory tests. This study aims to identify clinically strong predictors while achieving state-of-the-art results across three classification tasks: (1) deceased vs. discharged, (2) deceased vs. recovered, and (3) recovered vs. ameliorated. We trained five ML models to capture complex distributions while preserving clinical interpretability. Experiments explored the trade-off between feature richness and patient coverage, using subsets of the 10–50 most frequent laboratory tests. Model performance was evaluated using accuracy and area under the curve (AUC), and explainability was assessed using SHapley Additive exPlanations (SHAP). The highest performance was obtained for the deceased vs. recovered case study (AUC=0.983, accuracy=0.93). SHAP analysis identified several strong predictors such as cardiovascular comorbidities, urea levels, aspartate aminotransferase, platelet count, and eosinophil percentage. Eosinopenia emerged as a top predictor, highlighting its value as an underutilized marker that is not included in current assessment standards, while the high performance suggests the applicability of these models in clinical settings.
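The AUC reported above (0.983 for deceased vs. recovered) can be read as a rank statistic: the probability that a randomly chosen positive case receives a higher risk score than a randomly chosen negative one, ties counted as half. A minimal sketch with invented toy scores, not the paper's data:

```python
# Reading the reported AUC as a rank statistic: the probability that a
# randomly chosen positive case (e.g. deceased) receives a higher model risk
# score than a randomly chosen negative case (e.g. recovered), with ties
# counted as half. The toy scores below are invented for illustration.

def auc(pos_scores, neg_scores):
    """Pairwise AUC: P(score_pos > score_neg) + 0.5 * P(score_pos == score_neg)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

deceased = [0.9, 0.8, 0.75, 0.6]   # hypothetical risk scores, positive class
recovered = [0.4, 0.3, 0.6, 0.2]   # hypothetical risk scores, negative class
auc_val = auc(deceased, recovered)  # 15.5 winning half-pairs out of 16 -> 0.96875
```

An AUC of 0.983 therefore means the model ranks a deceased patient above a recovered one in about 98% of such random pairings.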
[CV-26] 3D Gaussian Splatting for Annular Dark Field Scanning Transmission Electron Microscopy Tomography Reconstruction
【速读】:该论文旨在解决稀疏视角(sparse-view)条件下Annular Dark Field Scanning Transmission Electron Microscopy (ADF-STEM)断层扫描重建中图像质量下降的问题,即在减少电子剂量以保护敏感样品的同时,如何维持三维结构保真度(fidelity)并抑制缺失楔形(missing wedge)伪影。其解决方案的关键在于提出一种名为 DenZa-Gaussian 的基于3D高斯泼溅(3D Gaussian Splatting, 3DGS)的方法,包含三个核心组件:首先将局部散射强度建模为可学习的标量场 denza,以弥合3DGS与ADF-STEM成像物理之间的不匹配;其次引入系数 γ 对不同倾斜角度下的散射强度进行归一化,确保 denza 的稳定性;最后设计包含2D傅里叶振幅损失项的损失函数,有效抑制稀疏视角下的缺失楔形伪影,从而显著提升重建图像的质量和投影一致性。
链接: https://arxiv.org/abs/2604.04693
作者: Beiyuan Zhang,Hesong Li,Ruiwen Shao,Ying Fu
机构: Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Annular Dark Field Scanning Transmission Electron Microscopy (ADF-STEM) tomography reconstructs nanoscale materials in 3D by integrating multi-view tilt-series images, enabling precise analysis of their structural and compositional features. Although integrating more tilt views improves 3D reconstruction, it requires extended electron exposure that risks damaging dose-sensitive materials and introduces drift and misalignment, making it difficult to balance reconstruction fidelity with sample preservation. In practice, sparse-view acquisition is frequently required, yet conventional ADF-STEM methods degrade under limited views, exhibiting artifacts and reduced structural fidelity. To resolve these issues, in this paper, we adapt 3D Gaussian Splatting (3DGS) to this domain with three key components. We first model the local scattering strength as a learnable scalar field, denza, to address the mismatch between 3DGS and ADF-STEM imaging physics. Then we introduce a coefficient γ to stabilize scattering across tilt angles, ensuring consistent denza via scattering view normalization. Finally, we incorporate a loss function that includes a 2D Fourier amplitude term to suppress missing wedge artifacts in sparse-view reconstruction. Experiments on 45-view and 15-view tilt series show that DenZa-Gaussian produces high-fidelity reconstructions and 2D projections that align more closely with original tilts, demonstrating superior robustness under sparse-view conditions.
[CV-27] Batch Loss Score for Dynamic Data Pruning CVPR2026
【速读】:该论文旨在解决动态数据剪枝(Dynamic Data Pruning)中样本重要性评估的难题,即在深度学习训练过程中,如何高效、准确地识别并剔除低信息量样本以加速训练,而无需对复杂模型或损失函数计算逐样本损失(per-sample loss),后者往往实现困难且计算开销大。其解决方案的关键在于提出一种名为“批次损失得分”(Batch Loss Score, BLS)的新指标,利用可直接获取的批次损失(batch loss)通过指数移动平均(Exponential Moving Average, EMA)机制构建样本重要性代理分数;理论上证明EMA相当于一阶低通滤波器,能有效抑制由批次随机组成带来的噪声,从而近似得到单个样本对整体损失的稳定贡献,为BLS提供了坚实的理论基础。该方法仅需三行代码即可集成到现有训练流程中,并可无缝替代原有基于逐样本损失的方法,显著提升了在复杂场景下的实用性与普适性。
链接: https://arxiv.org/abs/2604.04681
作者: Qing Zhou,Bingxuan Zhao,Tao Yang,Hongyuan Zhang,Junyu Gao,Qi Wang
机构: Northwestern Polytechnical University (西北工业大学); The University of Hong Kong (香港大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2026 accepted
Abstract:Dynamic data pruning accelerates deep learning by selectively omitting less informative samples during training. While per-sample loss is a common importance metric, obtaining it can be challenging or infeasible for complex models or loss functions, often requiring significant implementation effort. This work proposes the Batch Loss Score (BLS), a computationally efficient alternative using an Exponential Moving Average (EMA) of readily available batch losses to assign scores to individual samples. We frame the batch loss, from the perspective of a single sample, as a noisy measurement of its scaled individual loss, with noise originating from stochastic batch composition. It is formally shown that the EMA mechanism functions as a first-order low-pass filter, attenuating high-frequency batch composition noise. This yields a score approximating the smoothed and persistent contribution of the individual sample to the loss, providing a theoretical grounding for BLS as a proxy for sample importance. BLS demonstrates remarkable code integration simplicity (three-line injection) and readily adapts existing per-sample loss-based methods (one-line proxy). Its effectiveness is demonstrated by enhancing two such methods to losslessly prune 20%–50% of samples across 14 datasets, 11 tasks and 18 models, highlighting its utility and broad applicability, especially for complex scenarios where per-sample loss is difficult to access. Code is available at this https URL.
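The core mechanism, an EMA of batch losses acting as a low-pass filter on batch-composition noise, can be sketched in a few lines. The decay constant, batch size, and toy per-sample losses below are illustrative assumptions, not the paper's settings:

```python
# Minimal sketch of the Batch Loss Score (BLS) idea: each sample's score is
# an exponential moving average (EMA) of the losses of the batches it was
# drawn into, so the EMA acts as a first-order low-pass filter on the noise
# coming from random batch composition. Decay, batch size, and the toy
# per-sample losses below are illustrative assumptions.
import random

def update_bls(scores, batch_indices, batch_loss, decay=0.9):
    """EMA update: only samples present in this batch get their score refreshed."""
    for i in batch_indices:
        scores[i] = decay * scores[i] + (1.0 - decay) * batch_loss

random.seed(0)
n_samples, batch_size = 100, 10
# Toy ground truth: sample 0 is "hard" (high loss), the rest are "easy".
true_loss = [2.0] + [0.2] * (n_samples - 1)
scores = [0.0] * n_samples

for _ in range(500):
    batch = random.sample(range(n_samples), batch_size)
    # From one sample's perspective, the batch loss is a noisy measurement
    # of its own (scaled) individual loss.
    batch_loss = sum(true_loss[i] for i in batch) / batch_size
    update_bls(scores, batch, batch_loss)

# After smoothing, the hard sample clearly outranks every easy one, so a
# pruner keeping high-score samples would retain it.
```

Note how no per-sample loss is ever computed: only the batch-level scalar is observed, which is what makes the score cheap to integrate.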
[CV-28] ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging
【速读】:该论文旨在解决从超高清无人机(UAV)影像中实时重建深度信息的难题,尤其针对灾害响应等时间敏感的地学任务。现有方法受限于宽基线视差、大图像尺寸、低纹理或镜面表面、遮挡以及严格的计算约束,而生成式扩散模型虽能实现无需任务特定微调的快速密集预测,但其概率推理特性导致度量准确性与连续帧间的时间一致性不足。解决方案的关键在于提出ZeD-MAP框架——通过将测试时的扩散深度模型转化为类似SLAM的映射流水线,引入增量式的基于聚类的束调整(Bundle Adjustment, BA),对流式输入的无人机帧进行分组并周期性优化位姿与稀疏三维特征点,再将这些度量一致的参考点重投影至选定帧作为引导,从而提升扩散模型的几何精度与时间一致性。该方法在约50米飞行高度下实现了亚米级精度(XY平面误差约0.87 m,Z方向约0.12 m),同时保持每帧1.47–4.91秒的实时处理性能。
链接: https://arxiv.org/abs/2604.04667
作者: Selim Ahmet Iz,Francesco Nex,Norman Kerle,Henry Meissner,Ralf Berger
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:
Abstract:Real-time depth reconstruction from ultra-high-resolution UAV imagery is essential for time-critical geospatial tasks such as disaster response, yet remains challenging due to wide-baseline parallax, large image sizes, low-texture or specular surfaces, occlusions, and strict computational constraints. Recent zero-shot diffusion models offer fast per-image dense predictions without task-specific retraining, and require fewer labelled datasets than transformer-based predictors while avoiding the rigid capture geometry requirement of classical multi-view stereo. However, their probabilistic inference prevents reliable metric accuracy and temporal consistency across sequential frames and overlapping tiles. We present ZeD-MAP, a cluster-level framework that converts a test-time diffusion depth model into a metrically consistent, SLAM-like mapping pipeline by integrating incremental cluster-based bundle adjustment (BA). Streamed UAV frames are grouped into overlapping clusters; periodic BA produces metrically consistent poses and sparse 3D tie-points, which are reprojected into selected frames and used as metric guidance for diffusion-based depth estimation. Validation on ground-marker flights captured at approximately 50 m altitude (GSD is approximately 0.85 cm/px, corresponding to 2,650 square meters ground coverage per frame) with the DLR Modular Aerial Camera System (MACS) shows that our method achieves sub-meter accuracy, with approximately 0.87 m error in the horizontal (XY) plane and 0.12 m in the vertical (Z) direction, while maintaining per-image runtimes between 1.47 and 4.91 seconds. Results are subject to minor noise from manual point-cloud annotation. These findings show that BA-based metric guidance provides consistency comparable to classical photogrammetric methods while significantly accelerating processing, enabling real-time 3D map generation.
[CV-29] Synthesis4AD: Synthetic Anomalies are All You Need for 3D Anomaly Detection
【速读】:该论文旨在解决工业场景下3D异常检测性能受限于异常样本稀缺性和长尾分布的问题。其核心解决方案是提出Synthesis4AD,一个端到端的范式,通过大规模、高保真合成异常数据来学习更具判别性的表示。关键创新在于:1)基于可控合成引擎MPAS构建的3D-DefectStudio平台,可注入几何上真实的缺陷并生成精确的逐点异常掩码;2)引入多模态大语言模型(Multimodal Large Language Model, MLLM)自动将产品设计信息转化为可执行的异常合成指令,实现知识驱动的异常数据规模化生成;3)设计基于空间分布归一化和几何忠实数据增强的训练策略,提升Point Transformer架构在非结构化点云上的鲁棒性与泛化能力。
链接: https://arxiv.org/abs/2604.04658
作者: Yihan Sun,Yuqi Cheng,Junjie Zu,Yuxiang Tan,Guoyang Xie,Yucheng Wang,Yunkang Cao,Weiming Shen
机构: Huazhong University of Science and Technology (华中科技大学); Hunan University (湖南大学); Contemporary Amperex Technology Ltd. (宁德时代); A*STAR (新加坡科技研究局); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Industrial 3D anomaly detection performance is fundamentally constrained by the scarcity and long-tailed distribution of abnormal samples. To address this challenge, we propose Synthesis4AD, an end-to-end paradigm that leverages large-scale, high-fidelity synthetic anomalies to learn more discriminative representations for 3D anomaly detection. At the core of Synthesis4AD is 3D-DefectStudio, a software platform built upon the controllable synthesis engine MPAS, which injects geometrically realistic defects guided by higher-dimensional support primitives while simultaneously generating accurate point-wise anomaly masks. Furthermore, Synthesis4AD incorporates a multimodal large language model (MLLM) to interpret product design information and automatically translate it into executable anomaly synthesis instructions, enabling scalable and knowledge-driven anomalous data generation. To improve the robustness and generalization of the downstream detector on unstructured point clouds, Synthesis4AD further introduces a training pipeline based on spatial-distribution normalization and geometry-faithful data augmentations, which alleviates the sensitivity of Point Transformer architectures to absolute coordinates and improves feature learning under realistic data variations. Extensive experiments demonstrate state-of-the-art performance on Real3D-AD, MulSen-AD, and a real-world industrial parts dataset. The proposed synthesis method MPAS and the interactive system 3D-DefectStudio will be publicly released at this https URL.
[CV-30] raining-Free Refinement of Flow Matching with Divergence-based Sampling
【速读】:该论文旨在解决流模型(Flow-based models)在生成过程中因样本间速度场冲突导致的中间状态误导问题,即当多个样本在相同中间状态处的速度方向不一致时,平均速度场可能将样本引向低密度区域,从而降低生成质量。解决方案的关键在于提出一种无需训练的Flow Divergence Sampler(FDS)框架,其核心发现是:通过计算推理阶段可得的边际速度场散度(divergence of the marginal velocity field),能够量化该误导程度,并利用此信号在每一步求解前优化中间状态,引导样本向更少歧义的区域迁移。该方法兼容标准求解器和现成的流模型骨干网络,在文本到图像合成等多类生成任务中显著提升保真度(fidelity)。
链接: https://arxiv.org/abs/2604.04646
作者: Yeonwoo Cha,Jaehoon Yoo,Semin Kim,Yunseo Park,Jinhyeon Kwon,Seunghoon Hong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL
Abstract:Flow-based models learn a target distribution by modeling a marginal velocity field, defined as the average of sample-wise velocities connecting each sample from a simple prior to the target data. When sample-wise velocities conflict at the same intermediate state, however, this averaged velocity can misguide samples toward low-density regions, degrading generation quality. To address this issue, we propose the Flow Divergence Sampler (FDS), a training-free framework that refines intermediate states before each solver step. Our key finding reveals that the severity of this misguidance is quantified by the divergence of the marginal velocity field that is readily computable during inference with a well-optimized model. FDS exploits this signal to steer states toward less ambiguous regions. As a plug-and-play framework compatible with standard solvers and off-the-shelf flow backbones, FDS consistently improves fidelity across various generation tasks including text-to-image synthesis, and inverse problems.
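The signal FDS relies on, the divergence of the velocity field, can be illustrated on a toy analytic field using central finite differences. In FDS the field comes from a trained flow model and the divergence is evaluated during inference; the 2D fields below are assumptions made only for illustration:

```python
# Toy illustration of the quantity FDS inspects: the divergence of a
# velocity field, estimated here by central finite differences on simple
# analytic 2D fields (not a trained flow model).

def divergence(v, x, y, eps=1e-4):
    """Central-difference estimate of div v = dvx/dx + dvy/dy at (x, y)."""
    dvx_dx = (v(x + eps, y)[0] - v(x - eps, y)[0]) / (2 * eps)
    dvy_dy = (v(x, y + eps)[1] - v(x, y - eps)[1]) / (2 * eps)
    return dvx_dx + dvy_dy

contracting = lambda x, y: (-x, -y)  # trajectories converge on the origin
expanding = lambda x, y: (x, y)      # trajectories spread apart

div_c = divergence(contracting, 0.5, -0.3)  # ≈ -2: samples being pulled together
div_e = divergence(expanding, 0.5, -0.3)    # ≈ +2: samples being pushed apart
```

The sign and magnitude of this scalar distinguish regions where trajectories concentrate from regions where they disperse, which is the kind of per-state signal FDS exploits to refine intermediate states before each solver step.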
[CV-31] Preserving Forgery Artifacts: AI-Generated Video Detection at Native Scale ICLR2026
【速读】:该论文旨在解决当前AI生成视频检测方法中存在的两大核心问题:一是现有检测模型依赖固定分辨率的预处理操作(如缩放和裁剪),导致高频伪造痕迹丢失及空间失真;二是训练与评估所用数据集过时,无法反映现代生成模型的高保真特性。其解决方案的关键在于提出一个大规模、多样化的视频数据集(超过140K条来自15种先进生成器的视频)和一个基于Qwen2.5-VL Vision Transformer的新型检测框架,该框架支持原生尺度(native-scale)处理——即在可变空间分辨率和时间长度下直接运行,从而有效保留高频伪影和时空不一致性特征,显著提升检测性能并建立新的基准。
链接: https://arxiv.org/abs/2604.04634
作者: Zhengcen Li,Chenyang Jiang,Hang Zhao,Shiyang Zhou,Yunyang Mo,Feng Gao,Fan Yang,Qiben Shan,Shaocong Wu,Jingyong Su
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); Peng Cheng Laboratory (鹏城实验室); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICLR 2026 Camera Ready
Abstract:The rapid advancement of video generation models has enabled the creation of highly realistic synthetic media, raising significant societal concerns regarding the spread of misinformation. However, current detection methods suffer from critical limitations. They rely on preprocessing operations like fixed-resolution resizing and cropping. These operations not only discard subtle, high-frequency forgery traces but also cause spatial distortion and significant information loss. Furthermore, existing methods are often trained and evaluated on outdated datasets that fail to capture the sophistication of modern generative models. To address these challenges, we introduce a comprehensive dataset and a novel detection framework. First, we curate a large-scale dataset of over 140K videos from 15 state-of-the-art open-source and commercial generators, along with Magic Videos benchmark designed specifically for evaluating ultra-realistic synthetic content. In addition, we propose a novel detection framework built on the Qwen2.5-VL Vision Transformer, which operates natively at variable spatial resolutions and temporal durations. This native-scale approach effectively preserves the high-frequency artifacts and spatiotemporal inconsistencies typically lost during conventional preprocessing. Extensive experiments demonstrate that our method achieves superior performance across multiple benchmarks, underscoring the critical importance of native-scale processing and establishing a robust new baseline for AI-generated video detection.
[CV-32] InCTRLv2: Generalist Residual Models for Few-Shot Anomaly Detection and Segmentation
【速读】:该论文旨在解决当前异常检测(Anomaly Detection, AD)模型在跨域泛化能力上的局限性,即大多数现有方法依赖于特定目标数据集的大量训练样本,难以直接应用于未见过的数据分布。为应对这一挑战,作者提出了InCTRLv2框架,其核心创新在于构建了一个双分支结构,通过两种互补的异常感知机制实现少样本通用异常检测与分割(Generalist Anomaly Detection and Segmentation, GADS)。关键解决方案包括:i) 主分支引入判别式异常分数学习(Discriminative Anomaly Score Learning, DASL),利用正常与异常样本共同建模语义引导的异常-正常空间,支持从异常性和正常性两个角度对查询样本进行分类;ii) 辅助分支采用单类异常分数学习(One-class Anomaly Score Learning, OASL),仅使用正常样本学习广义正常模式,聚焦于以偏离正常语义的方式识别异常。两个分支均受大规模视觉-语言模型提供的丰富视觉-文本语义先验指导,从而形成“异常-正常区分”与“正常偏离度”双重语义视角,显著提升了模型在多种场景下的性能与泛化能力。
链接: https://arxiv.org/abs/2604.04632
作者: Jiawen Zhu,Mengjia Niu,Guansong Pang
机构: Singapore Management University (新加坡管理大学); Imperial College London (伦敦帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While recent anomaly detection (AD) methods have made substantial progress in recognizing abnormal patterns within specific domains, most of them are specialist models that are trained on large training samples from a specific target dataset, struggling to generalize to unseen datasets. To address this limitation, the paradigm of Generalist Anomaly Detection (GAD) has emerged in recent years, aiming to learn a single generalist model to detect anomalies across diverse domains without retraining. To this end, this work introduces InCTRLv2, a novel few-shot Generalist Anomaly Detection and Segmentation (GADS) framework that significantly extends our previously proposed GAD model, InCTRL. Building on the idea of learning in-context residuals with few-shot normal examples to detect anomalies as in InCTRL, InCTRLv2 introduces two new, complementary perspectives of anomaly perception under a dual-branch framework. This is accomplished by two novel modules upon InCTRL: i) Discriminative Anomaly Score Learning (DASL) with both normal and abnormal data in the main branch, which learns a semantic-guided abnormality and normality space that supports the classification of query samples from both the abnormality and normality perspectives; and ii) One-class Anomaly Score Learning (OASL) using only the normal data, which learns generalized normality patterns in a semantic space via an auxiliary branch, focusing on detecting anomalies through the lens of normality solely. Both branches are guided by rich visual-text semantic priors encoded by large-scale vision-language models. Together, they offer a dual semantic perspective for AD: one emphasizes normal-abnormal discriminations, while the other emphasizes normality-deviated semantics. Extensive experiments on ten AD datasets demonstrate that InCTRLv2 achieves SotA performance in both anomaly detection and segmentation tasks across various settings.
[CV-33] Multimodal Backdoor Attack on VLMs for Autonomous Driving via Graffiti and Cross-Lingual Triggers
【速读】:该论文旨在解决视觉语言模型(Visual Language Model, VLM)在自动驾驶等安全关键系统中面临的后门攻击问题,特别是现有攻击方法依赖单一模态、显式且易被检测的触发器,难以在真实场景中构建隐蔽且稳定的攻击通道。其解决方案的关键在于提出GLA(Graffiti-based and Language-aware Backdoor Attack),引入两种自然主义触发器:一是通过稳定扩散图像修复(Stable Diffusion Inpainting)生成的涂鸦类视觉模式,可无缝融合进城市环境;二是跨语言文本触发器,在保持语义一致性的同时引入分布偏移,从而构建鲁棒的语言侧触发信号。实验表明,该方法仅需10%的数据污染率即可实现90%的攻击成功率(ASR)和0%的误报率(FPR),且不损害模型在干净样本上的性能,反而提升BLEU-1指标,显著增强了攻击的隐蔽性与有效性。
链接: https://arxiv.org/abs/2604.04630
作者: Jiancheng Wang,Lidan Liang,Yong Wang,Zengzhen Su,Haifeng Xia,Yuanting Yan,Wei Wang
机构: 中国电子科学研究院(China Electronic Science and Technology Institute); 中山大学(Sun Yat-sen University); 安徽大学(Anhui University)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This is a submission to the “Pattern Analysis and Applications”. The manuscript includes 14 pages and 6 figures. All authors have approved the submission, and there is no conflict of interest to declare
Abstract:Visual language models (VLMs) are rapidly being integrated into safety-critical systems such as autonomous driving, making them an important attack surface for potential backdoor attacks. Existing backdoor attacks mainly rely on unimodal, explicit, and easily detectable triggers, making it difficult to construct both covert and stable attack channels in autonomous driving scenarios. To this end, we propose GLA (Graffiti-based and Language-aware Backdoor Attack), which introduces two naturalistic triggers: graffiti-based visual patterns generated via stable diffusion inpainting, which seamlessly blend into urban scenes, and cross-language text triggers, which introduce distributional shifts while maintaining semantic consistency to build robust language-side trigger signals. Experiments on DriveVLM show that GLA requires only a 10% poisoning ratio to achieve a 90% Attack Success Rate (ASR) and a 0% False Positive Rate (FPR). More insidiously, the backdoor does not weaken the model on clean tasks, but instead improves metrics such as BLEU-1, making it difficult for traditional performance-degradation-based detection methods to identify the attack. This study reveals underestimated security threats in self-driving VLMs and provides a new attack paradigm for backdoor evaluation in safety-critical multimodal systems.
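For readers unfamiliar with the two numbers quoted above, ASR and FPR have conventional definitions: success rate over trigger-carrying inputs, and spurious-activation rate over clean inputs. A sketch with invented toy model outputs (the definitions are standard, the data is not from the paper):

```python
# Standard definitions of the two metrics quoted in the abstract: Attack
# Success Rate (ASR) over trigger-carrying inputs and False Positive Rate
# (FPR) over clean inputs. The toy model outputs below are invented.

def attack_success_rate(triggered_outputs, target_behavior):
    """Fraction of triggered inputs on which the attacker's target behavior fires."""
    hits = sum(1 for out in triggered_outputs if out == target_behavior)
    return hits / len(triggered_outputs)

def false_positive_rate(clean_outputs, target_behavior):
    """Fraction of clean inputs that spuriously produce the target behavior."""
    hits = sum(1 for out in clean_outputs if out == target_behavior)
    return hits / len(clean_outputs)

target = "emergency_brake"
triggered = [target] * 9 + ["keep_lane"]      # 9 of 10 triggered inputs hijacked
clean = ["keep_lane", "slow_down", "keep_lane", "turn_left"]
asr = attack_success_rate(triggered, target)  # 0.9, i.e. a 90% ASR
fpr = false_positive_rate(clean, target)      # 0.0, i.e. a 0% FPR
```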
[CV-34] Beyond Semantics: Uncovering the Physics of Fakes via Universal Physical Descriptors for Cross-Modal Synthetic Detection
【速读】:该论文旨在解决当前深度伪造检测方法在面对多样化的生成式 AI (Generative AI) 内容时表现出的泛化能力不足问题,特别是现有检测模型容易过拟合特定生成架构(如 GANs 和扩散模型)而导致适应性差。其解决方案的关键在于识别并利用一组稳定且鲁棒的物理特征(physical features),这些特征能够跨不同数据集和生成模型持续区分真实图像与 AI 生成图像。研究提出了一种新颖的特征选择算法,从中筛选出五个核心像素级特征(包括拉普拉斯方差、Sobel 统计量和残差噪声方差),并通过将其编码为文本嵌入值的方式集成到 CLIP 等多模态模型中,从而增强图像-文本表示学习的同时降低对语言信息可靠性的依赖。此方法在多个 Genimage 基准测试中达到领先性能,证明了基于物理基础特征的融合策略可有效提升视觉-语言模型的可信度与抗幻觉能力。
链接: https://arxiv.org/abs/2604.04608
作者: Mei Qiu,Jianqiang Zhao,Yanyun Qu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid advancement of AI-generated content (AIGC) has blurred the boundaries between real and synthetic images, exposing the limitations of existing deepfake detectors that often overfit to specific generative models. This adaptability crisis calls for a fundamental reexamination of the intrinsic physical characteristics that distinguish natural from AI-generated images. In this paper, we address two critical research questions: (1) What physical features can stably and robustly discriminate AI-generated images across diverse datasets and generative architectures? (2) Can these objective pixel-level features be integrated into multimodal models like CLIP to enhance detection performance while mitigating the unreliability of language-based information? To answer these questions, we conduct a comprehensive exploration of 15 physical features across more than 20 datasets generated by various GANs and diffusion models. We propose a novel feature selection algorithm that identifies five core physical features, including Laplacian variance, Sobel statistics, and residual noise variance, that exhibit consistent discriminative power across all tested datasets. These features are then converted into text-encoded values and integrated with semantic captions to guide image-text representation learning in CLIP. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple GenImage benchmarks, with near-perfect accuracy (99.8%) on datasets such as Wukong and SDv1.4. By bridging pixel-level authenticity with semantic understanding, this work pioneers the use of physically grounded features for trustworthy vision-language modeling and opens new directions for mitigating hallucinations and textual inaccuracies in large multimodal models.
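Two of the named features, Laplacian variance and residual noise variance, are simple pixel statistics. A sketch in plain NumPy, where the kernel choices and toy images are assumptions of this illustration rather than the paper's exact definitions:

```python
# Illustrative computation of two of the pixel-level "physical" features the
# paper selects: Laplacian variance (a sharpness/texture proxy) and residual
# noise variance (image minus a local mean). Kernel choices and the toy
# images are assumptions of this sketch, not the paper's exact recipe.
import numpy as np

def laplacian_variance(img: np.ndarray) -> float:
    """Variance of the 4-neighbour Laplacian response over the image interior."""
    lap = (-4 * img[1:-1, 1:-1]
           + img[:-2, 1:-1] + img[2:, 1:-1]
           + img[1:-1, :-2] + img[1:-1, 2:])
    return float(lap.var())

def residual_noise_variance(img: np.ndarray) -> float:
    """Variance of the residual after subtracting a 3x3 box-filtered mean."""
    mean3 = sum(img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
                for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0
    return float((img[1:-1, 1:-1] - mean3).var())

rng = np.random.default_rng(0)
flat = np.full((64, 64), 0.5)                        # perfectly smooth image
noisy = flat + 0.1 * rng.standard_normal((64, 64))   # same image plus sensor-like noise
# Noise raises both statistics, which is what makes such features usable as
# generator-agnostic discriminators between natural and synthetic images.
```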
[CV-35] LP-GEMM: Integrating Layout Propagation into GEMM Operations
【速读】:该论文旨在解决科学计算与现代机器学习工作负载中,连续依赖的通用矩阵乘法(GEMM)操作因BLAS接口限制而导致的冗余数据打包与解包问题。传统BLAS库在每次GEMM调用时必须独立完成输入矩阵的打包及输出恢复到标准内存布局,这在序列化GEMM场景下造成大量不必要的计算资源浪费。其解决方案的关键在于提出LP-GEMM——一种将GEMM内核分解为可支持数据布局传播的新结构,允许在连续GEMM操作间传递中间数据的内存布局信息,从而消除冗余打包/解包过程,同时保持BLAS语义正确性边界。实验证明,该方法在x86(AVX-512)和RISC-V(RVV 1.0)架构上对MLP-like和Attention-like工作负载均实现显著加速,平均速度提升达2.25倍于OpenBLAS,并在Llama-3.2推理路径的纯BLAS级实现中验证了其实际可行性。
链接: https://arxiv.org/abs/2604.04599
作者: César Guedes Carneiro,Lucas Alvarenga,Guido Araujo,Sandro Rigo
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:In Scientific Computing and modern Machine Learning (ML) workloads, sequences of dependent General Matrix Multiplications (GEMMs) often dominate execution time. While state-of-the-art BLAS libraries aggressively optimize individual GEMM calls, they remain constrained by the BLAS API, which requires each call to independently pack input matrices and restore outputs to a canonical memory layout. In sequential GEMMs, these constraints cause redundant packing and unpacking, wasting valuable computational resources. This paper introduces LP-GEMM, a decomposition of the GEMM kernel that enables packing-layout propagation across sequential GEMM operations. This approach eliminates unnecessary data repacking while preserving full BLAS semantic correctness at the boundaries. We evaluate LP-GEMM on x86 (AVX-512) and RISC-V (RVV 1.0) architectures across MLP-like and Attention-like workloads. Our results show average speedups of 2.25x over OpenBLAS on Intel x86 for sequential GEMMs and competitive gains relative to vendor-optimized libraries such as Intel MKL. We demonstrate the practicality of the approach beyond microbenchmarks by implementing a standalone C++ version of the Llama-3.2 inference path using exclusively BLAS-level GEMM calls. These results confirm that leveraging data layout propagation between operations can significantly boost performance.
[CV-36] Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在资源受限场景下部署效率低的问题,以及小型视觉-语言模型难以精准捕捉细粒度任务相关视觉区域导致的细粒度推理性能下降问题。解决方案的关键在于:首先,用液态基础模型(Liquid Foundation Model, LFM)替代传统的Transformer解码器,从而实现线性时间复杂度的推理;其次,提出Token-Grid相关性模块(Token-Grid Correlation Module),通过轻量级计算文本token与图像patch之间的相关性,并借助状态空间模型结合FiLM条件调节机制,动态强化与文本提示相关的视觉区域,显著提升视觉定位精度和整体推理效率。
链接: https://arxiv.org/abs/2604.04579
作者: Quoc-Huy Trinh,Mustapha Abdullahi,Bo Zhao,Debesh Jha
机构: Aalto University (阿尔托大学); University of South Dakota (南达科他大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: substantial text overlap with arXiv:2511.11177
Abstract:Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as personal assistants, document understanding, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, leading to degraded performance on fine-grained reasoning tasks that limit their effectiveness in the real world. To address these issues, we introduce Firebolt-VL, an efficient vision-language model that replaces the Transformer-based decoder with a Liquid Foundation Model (LFM) decoder. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and modulates via the state-space model with FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Firebolt-VL achieves accurate, fine-grained understanding with significantly improved efficiency. Our model and code are available at: this https URL
[CV-37] PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis CVPR2026
【速读】:该论文旨在解决扩散模型(Diffusion Models)生成的伪真值视图在稀疏视角新视图合成(Sparse-View Novel View Synthesis, NVS)中存在光度和几何不一致性的难题,这些问题若直接用于监督3D重建(如3D Gaussian Splatting, 3DGS)会损害重建质量。解决方案的关键在于提出部分参考图像质量评估(Partial-Reference Image Quality Assessment, PR-IQA)框架:首先在不同视角图像的重叠区域计算几何一致的部分质量图,随后通过跨注意力机制引入参考视图上下文进行质量补全,从而生成稠密全图质量映射;该质量映射用于指导3DGS中的监督区域选择,仅保留高置信度区域进行优化,显著提升了重建精度与一致性。
链接: https://arxiv.org/abs/2604.04576
作者: Inseong Choi,Siwoo Lee,Seung-Hun Nam,Soohwan Song
机构: Dongguk University (东国大学); NAVER WEBTOON AI (NAVER WEBTOON AI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026. Project Page: this https URL
Abstract:Diffusion models are promising for sparse-view novel view synthesis (NVS), as they can generate pseudo-ground-truth views to aid 3D reconstruction pipelines like 3D Gaussian Splatting (3DGS). However, these synthesized images often contain photometric and geometric inconsistencies, and their direct use for supervision can impair reconstruction. To address this, we propose Partial-Reference Image Quality Assessment (PR-IQA), a framework that evaluates diffusion-generated views using reference images from different poses, eliminating the need for ground truth. PR-IQA first computes a geometrically consistent partial quality map in overlapping regions. It then performs quality completion to inpaint this partial map into a dense, full-image map. This completion is achieved via a cross-attention mechanism that incorporates reference-view context, ensuring cross-view consistency and enabling thorough quality assessment. When integrated into a diffusion-augmented 3DGS pipeline, PR-IQA restricts supervision to high-confidence regions identified by its quality maps. Experiments demonstrate that PR-IQA outperforms existing IQA methods, achieving full-reference-level accuracy without ground-truth supervision. Thus, our quality-aware 3DGS approach more effectively filters inconsistencies, producing superior 3D reconstructions and NVS results. The project page is available at this https URL.
[CV-38] Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models CVPR2026
【速读】:该论文旨在解决当前后验去遗忘(post-hoc unlearning)方法在移除文本到图像扩散模型中特定概念(如不适当内容)时,对生成式能力(特别是组合性文本到图像生成能力)造成的潜在损害问题。现有研究多聚焦于擦除成功率(erasure success),忽视了去遗忘操作对属性绑定、空间推理和计数等核心生成能力的影响。解决方案的关键在于通过系统性的实证研究,引入T2I-CompBench++和GenEval等组合性评估基准,揭示当前主流去遗忘方法在擦除效果与语义结构保持之间存在显著权衡:强擦除性能常导致组合生成能力退化,而保留结构完整性则难以实现鲁棒擦除。这一发现强调了未来去遗忘目标需显式纳入语义保真度(semantic preservation)的考量,以平衡概念移除与生成质量。
链接: https://arxiv.org/abs/2604.04575
作者: Arian Komaei Koma,Seyed Amir Kasaei,Ali Aghayari,AmirMahdi Sadeghzadeh,Mohammad Hossein Rohban
机构: Sharif University of Technology (Sharif大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026 Workshop on Machine Unlearning for Computer Vision
Abstract:Post-hoc unlearning has emerged as a practical mechanism for removing undesirable concepts from large text-to-image diffusion models. However, prior work primarily evaluates unlearning through erasure success; its impact on broader generative capabilities remains poorly understood. In this work, we conduct a systematic empirical study of concept unlearning through the lens of compositional text-to-image generation. Focusing on nudity removal in Stable Diffusion 1.4, we evaluate a diverse set of state-of-the-art unlearning methods using T2I-CompBench++ and GenEval, alongside established unlearning benchmarks. Our results reveal a consistent trade-off between unlearning effectiveness and compositional integrity: methods that achieve strong erasure frequently incur substantial degradation in attribute binding, spatial reasoning, and counting. Conversely, approaches that preserve compositional structure often fail to provide robust erasure. These findings highlight limitations of current evaluation practices and underscore the need for unlearning objectives that explicitly account for semantic preservation beyond targeted suppression.
[CV-39] APE: A two-stage parameter-efficient adaptation framework for foundation models in OCT-OCTA analysis
【速读】:该论文旨在解决光学相干断层扫描(Optical Coherence Tomography, OCT)和OCT血管成像(OCT Angiography, OCTA)图像自动化分析在资源受限临床环境中部署困难的问题,尤其是现有从头训练方法对大规模数据和模型规模的依赖性,以及基于基础模型(Foundation Models, FMs)的迁移学习所面临的数据域偏移(domain shift)和任务不匹配(task misalignment)挑战。其解决方案的关键在于提出一种两阶段自适应框架TAPE(Two-stage Adaptation Framework via Parameter-Efficient Fine-tuning),通过参数高效微调(Parameter-Efficient Fine-tuning, PEFT)将适应过程解耦为域对齐(domain alignment)与任务拟合(task fitting)两个阶段;其中,在域适应阶段创新性地引入基于掩码图像建模(masked image modeling)的PEFT策略,显著提升了医学图像域适应的效率与效果,从而实现了在多种病理场景下更优的泛化性能和更高的参数效率。
链接: https://arxiv.org/abs/2604.04571
作者: Xiaofei Su,Zengshuo Wang,Minghe Sun,Xin Zhao,Mingzhu Sun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 2 figures, accepted by IEEE ISBI 2026
Abstract:Automated analysis of optical coherence tomography (OCT) and OCT angiography (OCTA) images is critical for robust ophthalmic diagnosis. Existing mainstream methods trained from scratch rely heavily on massive data and model scale, thereby hindering their practical deployment in resource-constrained clinical settings. Although transfer learning based on foundation models (FMs) is promising, it still faces significant challenges: domain shift and task misalignment. To address these, we propose TAPE: A Two-stage Adaptation Framework via Parameter-Efficient Fine-tuning, which strategically decouples adaptation into domain alignment and task fitting for downstream segmentation. The domain adaptation stage notably applies parameter-efficient fine-tuning (PEFT) in the context of masked image modeling for medical image domain adaptation, a novel approach to the best of our knowledge. Applying TAPE to retinal layer segmentation on both universal (masked auto-encoder, MAE) and specialized (RETFound) FMs, it demonstrates superior parameter efficiency and achieves state-of-the-art generalization performance across diverse pathologies.
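TAPE 的核心是"冻结基础模型主干、只训练少量 PEFT 参数"。下面的小示例示意如何按参数名划分可训练/冻结集合;adapter、lora 等标记名均为笔者假设,并非论文的具体参数化方案:

```python
# Parameter partition for a two-stage PEFT adaptation: only parameters whose
# names carry a PEFT marker (e.g. adapter/lora) are trained; the foundation
# backbone stays frozen. Marker names are illustrative assumptions.

def split_params(param_names, peft_markers=("adapter", "lora")):
    trainable, frozen = [], []
    for name in param_names:
        if any(marker in name for marker in peft_markers):
            trainable.append(name)   # updated in both adaptation stages
        else:
            frozen.append(name)      # pretrained backbone, kept fixed
    return trainable, frozen
```

在域适应与任务拟合两个阶段中,可训练集合保持不变,仅更换训练目标(掩码图像建模 → 分割损失)。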
[CV-40] Visual Prompt Based Reasoning for Offroad Mapping using Multimodal LLM s
【速读】:该论文旨在解决非结构化地形(off-road)自主驾驶中传统方法依赖多个独立模型进行地形分类、高程估计及滑移或坡度条件量化的问题,这些问题导致训练复杂、数据集任务特异性高且需繁琐调优。解决方案的关键在于提出一种零样本(zero-shot)统一框架,利用SAM2实现环境分割,并结合视觉语言模型(VLM)对分割后的图像(含数字标签的掩码)进行语义推理,从而直接识别可行驶区域。该方法摒弃了显式的地形特定模型,转而依赖VLM的内在推理能力,实现了从感知到决策的端到端集成,在高分辨率分割数据集上超越现有可训练模型,并在Isaac Sim非结构化环境中实现了完整的导航栈。
链接: https://arxiv.org/abs/2604.04564
作者: Abdelmoamen Nasser,Yousef Baba’a,Murad Mebrahtu,Nadya Abdel Madjid,Jorge Dias,Majid Khonji
机构: Khalifa University (哈利法大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Traditional approaches to off-road autonomy rely on separate models for terrain classification, height estimation, and quantifying slip or slope conditions. Utilizing several models requires training each component separately, having task specific datasets, and fine-tuning. In this work, we present a zero-shot approach leveraging SAM2 for environment segmentation and a vision-language model (VLM) to reason about drivable areas. Our approach involves passing to the VLM both the original image and the segmented image annotated with numeric labels for each mask. The VLM is then prompted to identify which regions, represented by these numeric labels, are drivable. Combined with planning and control modules, this unified framework eliminates the need for explicit terrain-specific models and relies instead on the inherent reasoning capabilities of the VLM. Our approach surpasses state-of-the-art trainable models on high resolution segmentation datasets and enables full stack navigation in our Isaac Sim offroad environment.
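下面用一个极简示意还原"给 SAM2 掩码编号 + 提示 VLM 选出可行驶区域"的流程;提示词措辞与 VLM 的返回格式("1, 3" 这类字符串)均为假设,论文未给出具体接口:

```python
# Visual-prompting sketch: number the SAM2 masks, ask a VLM which numbered
# regions are drivable, then keep only the chosen masks. Prompt wording and
# the VLM answer format are assumptions.

def build_prompt(num_masks):
    labels = ", ".join(str(i) for i in range(1, num_masks + 1))
    return ("The segmented image overlays numeric labels " + labels +
            " on distinct regions. List the labels of drivable regions.")

def drivable_masks(mask_ids, vlm_answer):
    """Keep the masks whose labels appear in the VLM's answer string."""
    chosen = {int(tok) for tok in vlm_answer.replace(",", " ").split()}
    return [m for m in mask_ids if m in chosen]
```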
[CV-41] Temporal Inversion for Learning Interval Change in Chest X-Rays CVPR2026
【速读】:该论文旨在解决现有医学视觉-语言预训练模型在分析胸部X光片(CXR)时忽视时间序列变化的问题,即模型通常孤立地处理单张影像,而未充分建模前后影像之间的动态演变关系,这限制了其在临床实践中对病灶进展或消退的准确判断能力。解决方案的关键在于提出TILA(Temporal Inversion-aware Learning and Alignment)框架,其核心创新是利用“时间反转”(temporal inversion)——即将图像对顺序颠倒作为监督信号,增强模型对方向性变化的敏感性;该框架贯穿预训练、微调和推理阶段,通过引入逆序感知目标,显式学习时间顺序信息,从而提升模型在进展分类与时间嵌入对齐上的性能。
链接: https://arxiv.org/abs/2604.04563
作者: Hanbin Ko,Kyeongmin Jeon,Doowoong Choi,Chang Min Park
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026. 10 pages, 5 figures
Abstract:Recent advances in vision–language pretraining have enabled strong medical foundation models, yet most analyze radiographs in isolation, overlooking the key clinical task of comparing prior and current images to assess interval change. For chest radiographs (CXRs), capturing interval change is essential, as radiologists must evaluate not only the static appearance of findings but also how they evolve over time. We introduce TILA (Temporal Inversion-aware Learning and Alignment), a simple yet effective framework that uses temporal inversion, reversing image pairs, as a supervisory signal to enhance the sensitivity of existing temporal vision-language models to directional change. TILA integrates inversion-aware objectives across pretraining, fine-tuning, and inference, complementing conventional appearance modeling with explicit learning of temporal order. We also propose a unified evaluation protocol to assess order sensitivity and consistency under temporal inversion, and introduce MS-CXR-Tretrieval, a retrieval evaluation set constructed through a general protocol that can be applied to any temporal CXR dataset. Experiments on public datasets and real-world hospital cohorts demonstrate that TILA consistently improves progression classification and temporal embedding alignment when applied to multiple existing architectures.
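TILA 的监督信号来自"时间反转":把 (先前影像, 当前影像) 对颠倒顺序并翻转进展标签。下面的数据构造示例演示这一思路;标签词表 worsened/improved/unchanged 为示意性假设:

```python
# Temporal-inversion augmentation: each labeled (prior, current) pair also
# yields the reversed pair with the opposite progression label.
# The label vocabulary is an illustrative assumption.

OPPOSITE = {"worsened": "improved", "improved": "worsened",
            "unchanged": "unchanged"}

def with_inversions(pairs):
    """pairs: list of (prior_img, current_img, label) tuples."""
    out = []
    for prior, current, label in pairs:
        out.append((prior, current, label))
        out.append((current, prior, OPPOSITE[label]))  # inverted supervision
    return out
```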
[CV-42] Relational Epipolar Graphs for Robust Relative Camera Pose Estimation
【速读】:该论文旨在解决视觉同时定位与建图(Visual Simultaneous Localization and Mapping, VSLAM)中相对相机位姿估计的问题,尤其是由噪声匹配点引起的精度下降挑战。传统方法依赖随机假设采样和迭代优化,而基于学习的方法通常缺乏显式的几何结构约束。其解决方案的关键在于将相对位姿估计重构为在极线对应图(epipolar correspondence graph)上的关系推理问题:将匹配的关键点作为节点,邻近关键点通过边连接;利用图操作(如剪枝、消息传递和池化)联合估计四元数旋转、平移向量及本质矩阵(Essential Matrix, EM)。通过最小化包含L₂误差、本质矩阵Frobenius范数差异、奇异值差异、航向角差异和尺度差异的多任务损失函数,实现鲁棒的相对位姿估计。实验表明,该方法在室内和室外基准测试中相比经典与学习引导方法更具抗密集噪声和大基线变化的能力,凸显了全局关系一致性建模的有效性。
链接: https://arxiv.org/abs/2604.04554
作者: Prateeth Rao,Sachit Rao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 21 pages, 10 figures, yet to be submitted to IJCV
Abstract:A key component of Visual Simultaneous Localization and Mapping (VSLAM) is estimating relative camera poses using matched keypoints. Accurate estimation is challenged by noisy correspondences. Classical methods rely on stochastic hypothesis sampling and iterative estimation, while learning-based methods often lack explicit geometric structure. In this work, we reformulate relative pose estimation as a relational inference problem over epipolar correspondence graphs, where matched keypoints are nodes and nearby ones are connected by edges. Graph operations such as pruning, message passing, and pooling estimate a quaternion rotation, translation vector, and the Essential Matrix (EM). Minimizing a loss comprising (i) \mathcal{L}_2 differences with ground truth (GT), (ii) Frobenius norm between estimated and GT EMs, (iii) singular value differences, (iv) heading angle differences, and (v) scale differences, yields the relative pose between image pairs. The dense detector-free method LoFTR is used for matching. Experiments on indoor and outdoor benchmarks show improved robustness to dense noise and large baseline variation compared to classical and learning-guided approaches, highlighting the effectiveness of global relational consensus.
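摘要列出了五项损失;下面只示意其中两项(位姿 L2 误差与本质矩阵的 Frobenius 范数差)的组合方式,单位权重 w_e 与对其余三项的省略均为笔者的简化假设,并非论文的完整损失:

```python
import math

# Two of the abstract's five loss terms: L2 pose differences and the
# Frobenius gap between estimated and ground-truth essential matrices.
# The unit weight w_e and the omission of the other terms are assumptions.

def frobenius(M):
    return math.sqrt(sum(x * x for row in M for x in row))

def pose_loss(q_est, t_est, E_est, q_gt, t_gt, E_gt, w_e=1.0):
    l2_q = math.sqrt(sum((a - b) ** 2 for a, b in zip(q_est, q_gt)))
    l2_t = math.sqrt(sum((a - b) ** 2 for a, b in zip(t_est, t_gt)))
    E_diff = [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(E_est, E_gt)]
    return l2_q + l2_t + w_e * frobenius(E_diff)
```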
[CV-43] StableTTA: Training-Free Test-Time Adaptation that Improves Model Accuracy on ImageNet1K to 96%
【速读】:该论文旨在解决集成学习(Ensemble Methods)在提升预测性能时面临的稳定性不足与计算资源消耗过高的问题。其核心挑战在于聚合策略中存在的冲突导致预测结果不稳定,进而影响模型的可靠性和部署效率。解决方案的关键在于提出一种无需训练的稳定化聚合方法——StableTTA(Stable Test-Time Adaptation),通过优化聚合机制显著提升预测稳定性与计算效率,在ImageNet-1K数据集上实现了最高达32.82%的top-1准确率提升,同时使轻量级模型在参数量不足ViT的5%、计算成本降低约89.1%的情况下超越其性能,为资源受限设备上的高精度推理提供了可行路径。
链接: https://arxiv.org/abs/2604.04552
作者: Zheng Li,Jerry Cheng,Huanying Helen Gu
机构: New York Institute of Technology (纽约理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 7 figures, 3 tables
Abstract:Ensemble methods are widely used to improve predictive performance, but their effectiveness often comes at the cost of increased memory usage and computational complexity. In this paper, we identify a conflict in aggregation strategies that negatively impacts prediction stability. We propose StableTTA, a training-free method to improve aggregation stability and efficiency. Empirical results on ImageNet-1K show gains of 10.93–32.82% in top-1 accuracy, with 33 models achieving over 95% accuracy and several surpassing 96%. Notably, StableTTA allows lightweight architectures to outperform ViT by 11.75% in top-1 accuracy while using less than 5% of parameters and reducing computational cost by approximately 89.1% (in GFLOPs), enabling high-accuracy inference on resource-constrained devices.
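StableTTA 的具体稳定化聚合规则在摘要中未公开;下面仅给出它所改进的对象,即最朴素的测试时增强(TTA)聚合基线:对多份增强视图的 softmax 概率取平均后再取 argmax:

```python
import math

# Baseline test-time-augmentation aggregation: average softmax probabilities
# over augmented views, then take the argmax. StableTTA's stabilized rule is
# not specified in the abstract; this only shows the plain baseline.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def tta_predict(model, views):
    """model: view -> logits; views: augmented versions of one input."""
    probs = [softmax(model(v)) for v in views]
    n_cls = len(probs[0])
    avg = [sum(p[c] for p in probs) / len(probs) for c in range(n_cls)]
    return max(range(n_cls), key=avg.__getitem__)
```

摘要指出,不同聚合策略之间的冲突会造成预测不稳定;该基线正是此类冲突的来源之一。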
[CV-44] G-EDF-Loc: 3D Continuous Gaussian Distance Field for Robust Gradient-Based 6DoF Localization
【速读】:该论文旨在解决大规模环境下6自由度(6-DoF)位姿估计的鲁棒性与实时性问题,特别是在里程计严重退化或缺乏惯性测量单元(IMU)先验信息时仍能保持高精度定位。其解决方案的关键在于提出了一种名为G-EDF的新型连续且内存高效的三维距离场表示方法,该方法采用自适应空间划分的块稀疏高斯混合模型建模欧几里得距离场(Euclidean Distance Field, EDF),确保块间C^1连续性以消除边界伪影,并利用该连续地图的解析梯度实现Eikonal一致性优化,从而在无需IMU辅助的情况下实现高保真空间重建与实时定位。
链接: https://arxiv.org/abs/2604.04525
作者: José E. Maese,Lucía Coto-Elena,Luis Merino,Fernando Caballero
机构: Universidad Pablo de Olavide (巴勃罗·德·奥拉维德大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper presents a robust 6-DoF localization framework based on a direct, CPU-based scan-to-map registration pipeline. The system leverages G-EDF, a novel continuous and memory-efficient 3D distance field representation. The approach models the Euclidean Distance Field (EDF) using a Block-Sparse Gaussian Mixture Model with adaptive spatial partitioning, ensuring C^1 continuity across block transitions and mitigating boundary artifacts. By leveraging the analytical gradients of this continuous map, which maintain Eikonal consistency, the proposed method achieves high-fidelity spatial reconstruction and real-time localization. Experimental results on large-scale datasets demonstrate that G-EDF-Loc performs competitively against state-of-the-art methods, exhibiting exceptional resilience even under severe odometry degradation or in the complete absence of IMU priors.
[CV-45] Reproducibility study on how to find Spurious Correlations, Shortcut Learning, Clever Hans or Group-Distributional non-robustness and how to fix them
【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在高风险领域(如医疗诊断和自动驾驶)中因依赖伪相关特征(spurious correlations)而导致的可靠性问题,即模型未能学习到因果相关的特征,反而捕捉了数据中的混淆信号(confounding signals)。其核心挑战在于不同研究社区使用术语各异但目标一致的方法(如分布鲁棒优化、不变风险最小化、捷径学习等)缺乏整合。解决方案的关键在于通过系统性对比分析,在有限数据和严重子群体不平衡的约束下评估基于可解释人工智能(Explainable Artificial Intelligence, XAI)的修正方法与非XAI基线方法的效果,结果表明:XAI方法整体优于非XAI方法,其中反事实知识蒸馏(Counterfactual Knowledge Distillation, CFKD)在提升泛化性能方面最具一致性;同时指出当前多数方法受限于对群体标签的依赖,而这类标签在实践中难以获取,且少数群体样本稀缺导致验证集不可靠,阻碍了模型部署于安全关键场景。
链接: https://arxiv.org/abs/2604.04518
作者: Ole Delzer,Sidney Bender
机构: Technische Universität Berlin (柏林工业大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 62 pages, 27 figures
Abstract:Deep Neural Networks (DNNs) are increasingly utilized in high-stakes domains like medical diagnostics and autonomous driving where model reliability is critical. However, the research landscape for ensuring this reliability is terminologically fractured across communities that pursue the same goal of ensuring models rely on causally relevant features rather than confounding signals. While frameworks such as distributionally robust optimization (DRO), invariant risk minimization (IRM), shortcut learning, simplicity bias, and the Clever Hans effect all address model failure due to spurious correlations, researchers typically only reference work within their own domains. This reproducibility study unifies these perspectives through a comparative analysis of correction methods under challenging constraints like limited data availability and severe subgroup imbalance. We evaluate recently proposed correction methods based on explainable artificial intelligence (XAI) techniques alongside popular non-XAI baselines using both synthetic and real-world datasets. Findings show that XAI-based methods generally outperform non-XAI approaches, with Counterfactual Knowledge Distillation (CFKD) proving most consistently effective at improving generalization. Our experiments also reveal that the practical application of many methods is hindered by a dependency on group labels, as manual annotation is often infeasible and automated tools like Spectral Relevance Analysis (SpRAy) struggle with complex features and severe imbalance. Furthermore, the scarcity of minority group samples in validation sets renders model selection and hyperparameter tuning unreliable, posing a significant obstacle to the deployment of robust and trustworthy models in safety-critical areas.
[CV-46] MPTF-Net: Multi-view Pyramid Transformer Fusion Network for LiDAR-based Place Recognition
【速读】:该论文旨在解决基于激光雷达(LiDAR)的场景识别(LPR)在复杂或重复环境中因传统鸟瞰图(BEV)表示方法无法捕捉细粒度几何结构而导致匹配性能下降的问题。现有方法多依赖于简单的统计聚合方式构建BEV特征,难以建模局部几何复杂性和强度分布,从而影响全局定位与回环检测的准确性。解决方案的关键在于提出一种多视角多尺度金字塔Transformer融合网络(MPTF-Net),其核心创新是引入基于正态分布变换(Normal Distribution Transform, NDT)的多通道BEV编码机制,显式建模局部几何复杂性与强度分布,提供抗噪的结构先验;同时设计定制化的金字塔Transformer模块,在多个空间尺度上融合Range Image Views(RIV)与NDT-BEV之间的跨视图交互关系,从而显著提升特征表达能力与匹配鲁棒性。
链接: https://arxiv.org/abs/2604.04513
作者: Shuyuan Li,Zihang Wang,Xieyuanli Chen,Wenkai Zhu,Xiaoteng Fang,Peizhou Ni,Junhao Yang,Dong Kong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:LiDAR-based place recognition (LPR) is essential for global localization and loop-closure detection in large-scale SLAM systems. Existing methods typically construct global descriptors from Range Images or BEV representations for matching. BEV is widely adopted due to its explicit 2D spatial layout encoding and efficient retrieval. However, conventional BEV representations rely on simple statistical aggregation, which fails to capture fine-grained geometric structures, leading to performance degradation in complex or repetitive environments. To address this, we propose MPTF-Net, a novel multi-view multi-scale pyramid Transformer fusion network. Our core contribution is a multi-channel NDT-based BEV encoding that explicitly models local geometric complexity and intensity distributions via Normal Distribution Transform, providing a noise-resilient structural prior. To effectively integrate these features, we develop a customized pyramid Transformer module that captures cross-view interactive correlations between Range Image Views (RIV) and NDT-BEV at multiple spatial scales. Extensive experiments on the nuScenes, KITTI and NCLT datasets demonstrate that MPTF-Net achieves state-of-the-art performance, specifically attaining a Recall@1 of 96.31% on the nuScenes Boston split while maintaining an inference latency of only 10.02 ms, making it highly suitable for real-time autonomous unmanned systems.
[CV-47] MedROI: Codec-Agnostic Region of Interest-Centric Compression for Medical Images
【速读】:该论文旨在解决医学影像档案因数据量快速增长而带来的存储与传输效率问题,传统压缩编码器通常对全图像或包含非诊断背景区域的体积进行压缩,导致冗余信息占用额外带宽和存储空间。其解决方案的关键在于提出一种与编解码器无关、即插即用的感兴趣区域(Region of Interest, ROI)中心框架 MedROI:该框架通过轻量级基于强度的阈值分割提取紧密包裹组织的边界框,并仅保留一个固定长度为 54 字节的元数据记录以支持解压时的空间还原;随后将裁剪后的 ROI 使用任意现有 2D 或 3D 编码器压缩,无需架构修改或重新训练,从而显著提升压缩比并降低编码/解码时间,同时在 ROI 内保持重建质量稳定。
链接: https://arxiv.org/abs/2604.04511
作者: Jiwon Kim,Ikbeom Jang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical imaging archives are growing rapidly in both size and resolution, making efficient compression increasingly important for storage and data transfer. Most existing codecs compress full images/volumes (including non-diagnostic background) or apply differential ROI coding that still preserves background bits. We propose MedROI, a codec-agnostic, plug-and-play ROI-centric framework that discards background voxels prior to compression. MedROI extracts a tight tissue bounding box via lightweight intensity-based thresholding and stores a fixed 54-byte metadata record to enable spatial restoration during decompression. The cropped ROI is then compressed using any existing 2D or 3D codec without architectural modifications or retraining. We evaluate MedROI on 200 T1-weighted brain MRI volumes from ADNI using 6 codec configurations spanning conventional codecs (JPEG2000 2D/3D, HEIF) and neural compressors (LIC_TCM, TCM+AuxT, BCM-Net, SirenMRI). MedROI yields statistically significant improvements in compression ratio and encoding/decoding time for most configurations (two-sided t-test with multiple-comparison correction), while maintaining comparable reconstruction quality when measured within the ROI; HEIF is the primary exception in compression-ratio gains. For example, on JPEG2000 2D (lv3), MedROI improves CR from 20.35 to 27.37 while reducing average compression time from 1.701s to 1.380s. Code is available at this https URL.
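MedROI 的流程可以用一个二维玩具示例示意:按强度阈值求紧致包围盒、裁剪 ROI、记录还原所需的元数据,解压时再还原到原始画幅。真实系统的元数据记录为固定 54 字节;此处用 6 个 int32(24 字节)的 struct 布局作简化假设:

```python
import struct

# 2D toy of MedROI's crop-then-restore idea: threshold a tight tissue
# bounding box, crop the ROI, and keep a small fixed-size metadata record
# for lossless spatial restoration. The real record is 54 bytes; packing
# six int32 values (24 bytes) here is a simplifying assumption.

def tight_bbox(img, thresh=0):
    rows = [i for i, row in enumerate(img) if any(v > thresh for v in row)]
    cols = [j for j in range(len(img[0])) if any(row[j] > thresh for row in img)]
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1

def crop_with_meta(img, thresh=0):
    r0, r1, c0, c1 = tight_bbox(img, thresh)
    meta = struct.pack("<6i", len(img), len(img[0]), r0, r1, c0, c1)
    roi = [row[c0:c1] for row in img[r0:r1]]
    return roi, meta

def restore(roi, meta):
    h, w, r0, r1, c0, c1 = struct.unpack("<6i", meta)
    out = [[0] * w for _ in range(h)]  # background restored as zeros
    for i, row in enumerate(roi):
        out[r0 + i][c0:c1] = row
    return out
```

裁剪后的 ROI 可交给任意现有编解码器压缩,框架本身与编解码器无关。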
[CV-48] Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward CVPR2026
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在推理过程中对文本线索过度依赖、缺乏视觉证据支撑以及生成内容可能不真实或虚构的问题,即提升模型的可解释性和忠实性(faithfulness)。解决方案的关键在于提出Saliency-R1框架,其核心创新是一种高效且无需额外计算开销的显著性图(saliency map)技术,能够精准定位图像中对生成token起关键作用的区域,并进一步追踪视觉信息在推理链中的流动路径,从而揭示模型思维过程与视觉上下文之间的对齐情况;同时,利用显著性图与人工标注边界框的重叠度作为奖励函数,结合分组相对策略优化(Group Relative Policy Optimization, GRPO)算法,引导模型在推理时聚焦于相关视觉区域,显著增强模型输出的可信度与可解释性。
链接: https://arxiv.org/abs/2604.04500
作者: Shizhan Gong,Minda Hu,Qiyuan Zhang,Chen Ma,Qi Dou
机构: The Chinese University of Hong Kong (香港中文大学); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:Vision-language models (VLMs) have achieved remarkable success across diverse tasks. However, concerns about their trustworthiness persist, particularly regarding tendencies to lean more on textual cues than visual evidence and the risk of producing ungrounded or fabricated responses. To address these issues, we propose Saliency-R1, a framework for improving the interpretability and faithfulness of VLMs reasoning. Specifically, we introduce a novel saliency map technique that efficiently highlights critical image regions contributing to generated tokens without additional computational overhead. This can further be extended to trace how visual information flows through the reasoning process to the final answers, revealing the alignment between the thinking process and the visual context. We use the overlap between the saliency maps and human-annotated bounding boxes as the reward function, and apply Group Relative Policy Optimization (GRPO) to align the salient parts and critical regions, encouraging models to focus on relevant areas when conduct reasoning. Experiments show Saliency-R1 improves reasoning faithfulness, interpretability, and overall task performance.
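奖励函数的核心是"显著性图与人工标注框的重叠度"。下面给出一个假设性的实现:计算显著性质量落在标注框内的比例;论文中与 GRPO 结合的确切奖励形式可能不同:

```python
# Hypothetical saliency-alignment reward: the fraction of saliency mass that
# falls inside a human-annotated box. The exact reward combined with GRPO in
# the paper may differ.

def saliency_reward(saliency, box):
    """saliency: 2D grid of non-negative weights; box: (r0, r1, c0, c1)."""
    r0, r1, c0, c1 = box
    total = sum(sum(row) for row in saliency)
    if total == 0:
        return 0.0
    inside = sum(saliency[i][j] for i in range(r0, r1) for j in range(c0, c1))
    return inside / total
```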
[CV-49] The Indra Representation Hypothesis for Multimodal Alignment
【速读】:该论文旨在解决当前单模态基础模型(unimodal foundation models)在学习过程中趋向于收敛到内部抽象表示的问题,这些表示虽能独立刻画样本特征,但缺乏对样本间关系的建模能力,从而限制了其表达能力和跨模型、跨模态的对齐效果。解决方案的关键在于提出"因陀罗表示假设"(Indra Representation Hypothesis),受印度哲学中"因陀罗网"(Indra’s Net)的关联本体论启发,将样本表示重新定义为一种相对于其他样本的关系谱系(relational profile),并通过范畴论中的V-enriched Yoneda嵌入进行形式化,确保该表示在给定代价函数下具有唯一性、完备性和结构保持性。实验表明,基于角度距离实现的因陀罗表示可在不依赖训练的情况下显著提升不同架构与模态之间的鲁棒性和对齐性能。
链接: https://arxiv.org/abs/2604.04496
作者: Jianglin Lu,Hailing Wang,Kuo Yang,Yitian Zhang,Simon Jenni,Yun Fu
机构: Northeastern University (东北大学); Khoury College of Computer Science (科里计算机科学学院); Adobe Research (Adobe研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent studies have uncovered an interesting phenomenon: unimodal foundation models tend to learn convergent representations, regardless of differences in architecture, training objectives, or data modalities. However, these representations are essentially internal abstractions of samples that characterize samples independently, leading to limited expressiveness. In this paper, we propose The Indra Representation Hypothesis, inspired by the philosophical metaphor of Indra’s Net. We argue that representations from unimodal foundation models are converging to implicitly reflect a shared relational structure underlying reality, akin to the relational ontology of Indra’s Net. We formalize this hypothesis using the V-enriched Yoneda embedding from category theory, defining the Indra representation as a relational profile of each sample with respect to others. This formulation is shown to be unique, complete, and structure-preserving under a given cost function. We instantiate the Indra representation using angular distance and evaluate it in cross-model and cross-modal scenarios involving vision, language, and audio. Extensive experiments demonstrate that Indra representations consistently enhance robustness and alignment across architectures and modalities, providing a theoretically grounded and practical framework for training-free alignment of unimodal foundation models. Our code is available at this https URL.
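摘要提到用角度距离实例化因陀罗表示,即把每个样本的嵌入替换为它与所有其他样本的角度距离向量(关系谱系)。下面是一个最小实现,除以 pi 的 [0, 1] 归一化方式为笔者假设:

```python
import math

# Minimal angular-distance instantiation of the Indra representation: each
# sample is re-described by its distances to every other sample. The [0, 1]
# normalization by pi is an assumption.

def angular_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (nu * nv)))  # clamp for float safety
    return math.acos(cos) / math.pi

def indra_profile(embeddings):
    """Relational profile: row i holds sample i's distances to all samples."""
    return [[angular_distance(u, v) for v in embeddings] for u in embeddings]
```

由于该谱系只依赖样本间的相对几何,不同模型/模态的嵌入空间可借此在无训练条件下进行比较与对齐。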
[CV-50] A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在监督微调过程中易遭受后门攻击的问题,即攻击者通过低比例污染数据植入触发模式,使模型在特定输入下输出预设的有害响应,而正常任务性能却难以保障。其核心挑战在于如何在低中毒比例下有效抑制后门成功率,同时不损害模型正常的生成能力,二者存在本质冲突。解决方案的关键在于提出一种基于补丁增强(patch augmentation)与跨视图一致性正则化(cross-view regularity)的统一防御框架:一方面利用补丁级数据增强和跨视图输出差异正则化,迫使模型对非语义扰动不再产生异常不变的后门响应;另一方面引入输出熵约束以避免过度抑制,从而在显著降低后门触发成功率的同时保持高质量的正常文本生成能力。
链接: https://arxiv.org/abs/2604.04488
作者: Tianmeng Fang,Yong Wang,Zetai Kong,Zengzhen Su,Jun Wang,Chengjin Yu,Wei Wang
机构: University of Melbourne (墨尔本大学); China Electric Power Research Institute (中国电力科学研究院); Anhui University (安徽大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 26 pages, 3 figures
Abstract:Multimodal large language models have become an important infrastructure for unified processing of visual and linguistic tasks. However, such models are highly susceptible to backdoor implantation during supervised fine-tuning and will steadily output the attacker’s predefined harmful responses once a specific trigger pattern is activated. The core challenge of backdoor defense lies in suppressing attack success under low poisoning ratios while preserving the model’s normal generation ability. These two objectives are inherently conflicting. Strong suppression often degrades benign performance, whereas weak regularization fails to mitigate backdoor behaviors. To this end, we propose a unified defense framework based on patch augmentation and cross-view regularity, which simultaneously constrains the model’s anomalous behaviors in response to triggered patterns from both the feature representation and output distribution levels. Specifically, patch-level data augmentation is combined with cross-view output difference regularization to exploit the fact that backdoor responses are abnormally invariant to non-semantic perturbations and to proactively pull apart the output distributions of the original and perturbed views, thereby significantly suppressing the success rate of backdoor triggering. At the same time, we avoid over-suppression of the model during defense by imposing output entropy constraints, ensuring the quality of normal command generation. Experimental results across three models, two tasks, and six attacks show that our proposed defense method effectively reduces the attack success rate while maintaining a high level of normal text generation capability. Our work enables the secure, controlled deployment of large-scale multimodal models in realistic low-frequency poisoning and covert triggering scenarios.
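下面用玩具代码示意该防御的两个约束:对原始/扰动视图的输出分布"拉开距离"(发散度低于阈值即惩罚),同时用熵下限防止过度抑制。散度选用 KL、margin 与 h_min 的取值均为假设:

```python
import math

# Toy cross-view regularizer: penalize output distributions that remain
# abnormally identical under a patch perturbation (KL below a margin), plus
# an entropy floor against over-suppression. The KL divergence choice and
# the margin / h_min values are assumptions.

def kl(p, q, eps=1e-12):
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))

def entropy(p, eps=1e-12):
    return -sum(a * math.log(a + eps) for a in p)

def defense_penalty(p_orig, p_patched, margin=0.1, h_min=0.5):
    pull_apart = max(0.0, margin - kl(p_orig, p_patched))      # views too similar
    entropy_floor = max(0.0, h_min - entropy(p_orig))          # over-suppressed
    return pull_apart + entropy_floor
```

直觉上:后门样本对补丁扰动"异常不变",两视图分布几乎相同,KL 接近 0,惩罚最大;正常样本分布自然分开,惩罚趋于 0。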
[CV-51] raining-Free Image Editing with Visual Context Integration and Concept Alignment
【速读】:该论文旨在解决图像编辑中如何高效、一致地注入视觉上下文(visual context)的问题,特别是现有基于训练的方法存在数据收集和训练成本高的缺陷,而无训练(training-free)方法依赖扩散反演(diffusion inversion)时又面临一致性差与灵活性不足的挑战。其解决方案的关键在于提出 VicoEdit,一种无需训练且无需反演的方法,通过直接将源图像转换为目标图像来注入视觉上下文,从而避免因反演导致的轨迹偏移;同时设计了一种基于概念对齐引导的后验采样策略,显著提升编辑的一致性。实验证明,该方法在性能上甚至优于当前最先进的训练型模型。
链接: https://arxiv.org/abs/2604.04487
作者: Rui Song,Guo-Hua Wang,Qing-Guo Chen,Weihua Luo,Tongda Xu,Zhening Liu,Yan Wang,Zehong Lin,Jun Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In image editing, it is essential to incorporate a context image to convey the user’s precise requirements, such as subject appearance or image style. Existing training-based visual context-aware editing methods incur data collection effort and training cost. On the other hand, the training-free alternatives are typically established on diffusion inversion, which struggles with consistency and flexibility. In this work, we propose VicoEdit, a training-free and inversion-free method to inject the visual context into the pretrained text-prompted editing model. More specifically, VicoEdit directly transforms the source image into the target one based on the visual context, thereby eliminating the need for inversion that can lead to deviated trajectories. Moreover, we design a posterior sampling approach guided by concept alignment to enhance the editing consistency. Empirical results demonstrate that our training-free method achieves even better editing performance than the state-of-the-art training-based models.
[CV-52] MVis-Fold: A Three-Dimensional Microvascular Structure Inference Model for Super-Resolution Ultrasound
Quick Read: This paper addresses the challenge of three-dimensional microvascular reconstruction for super-resolution ultrasound (SRUS), namely how to achieve high-fidelity reconstruction of 3D microvascular networks from 2D SRUS images. The key to the solution is the proposed microvascular visualization fold model (MVis-Fold), which adopts a cross-scale network architecture to precisely compute key parameters in 3D space that conventional 2D SRUS cannot readily obtain, enabling high-accuracy 3D quantitative analysis of the microvascular structure of solid tumors.
Link: https://arxiv.org/abs/2604.04477
Authors: Jincao Yao(1, 2, 3, 4),Ke Zhang(1),Yahan Zhou(1),Jiafei Shen(1),Jie Liu(1),Mudassar Ali(5),Bojian Feng(1),Jiye Chen(1),Jinlong Fan(2),Ping Liang(6),Dong Xu(1, 2, 3, 4) ((1) Department of Diagnostic Ultrasound Imaging & Interventional Therapy, Zhejiang Cancer Hospital, Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, China, (2) Research Center of Interventional Medicine and Engineering, Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, China, (3) Wenling Institute of Big Data and Artificial Intelligence in Medicine, Taizhou, China, (4) Zhejiang Provincial Research Center for Innovative Technology and Equipment in Interventional Oncology, Zhejiang Cancer Hospital, Hangzhou, China, (5) College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China, (6) Department of Ultrasound, Chinese PLA General Hospital, Chinese PLA Medical School, Beijing, China)
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Super-resolution ultrasound (SRUS) technology has overcome the resolution limitations of conventional ultrasound, enabling micrometer-scale imaging of microvasculature. However, due to the nature of imaging principles, three-dimensional reconstruction of microvasculature from SRUS remains an open challenge. We developed microvascular visualization fold (MVis-Fold), an innovative three-dimensional microvascular reconstruction model that integrates a cross-scale network architecture. This model can perform high-fidelity inference and reconstruction of three-dimensional microvascular networks from two-dimensional SRUS images. It precisely calculates key parameters in three-dimensional space that traditional two-dimensional SRUS cannot readily obtain. We validated the model’s accuracy and reliability in three-dimensional microvascular reconstruction of solid tumors. This study establishes a foundation for three-dimensional quantitative analysis of microvasculature. It provides new tools and methods for diagnosis and monitoring of various diseases.
[CV-53] Beyond Standard Benchmarks: A Systematic Audit of Vision-Language Models' Robustness to Natural Semantic Variation Across Diverse Tasks ICPR2026
Quick Read: This paper addresses the insufficient evaluation of current vision-language models (VLMs) under natural adversarial scenarios, focusing on their robustness and practical applicability in downstream tasks such as zero-shot image classification, semantic segmentation, and visual question answering. Existing work relies largely on standard benchmarks and lacks systematic evaluation against real-world natural adversarial perturbations (e.g., typographic attacks, ImageNet-A, and natural-language-induced adversarial examples). The key to the solution is a systematic evaluation framework covering diverse natural adversarial scenarios, together with an empirical analysis of mainstream VLMs (CLIP, robust CLIP, BLIP2, and SigLIP2) that reveals how different models degrade on natural adversarial examples, identifies failure modes, and thereby points toward future research on robust and fair multimodal pattern recognition.
Link: https://arxiv.org/abs/2604.04473
Authors: Jia Chengyu,AprilPyone MaungMaung,Huy H. Nguyen,Jinyin Chen,Isao Echizen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICPR 2026
Abstract:Recent advances in vision-language models (VLMs) trained on web-scale image-text pairs have enabled impressive zero-shot transfer across a diverse range of visual tasks. However, comprehensive and independent evaluation beyond standard benchmarks is essential to understand their robustness, limitations, and real-world applicability. This paper presents a systematic evaluation framework for VLMs under natural adversarial scenarios for diverse downstream tasks, which has been overlooked in previous evaluation works. We evaluate a wide range of VLMs (CLIP, robust CLIP, BLIP2, and SigLIP2) on curated adversarial datasets (typographic attacks, ImageNet-A, and natural language-induced adversarial examples). We measure the natural adversarial performance of selected VLMs for zero-shot image classification, semantic segmentation, and visual question answering. Our analysis reveals that robust CLIP models can amplify natural adversarial vulnerabilities, and CLIP models significantly reduce performance for natural language-induced adversarial examples. Additionally, we provide interpretable analyses to identify failure modes. We hope our findings inspire future research in robust and fair multimodal pattern recognition.
[CV-54] Group-DINOmics: Incorporating People Dynamics into DINO for Self-supervised Group Activity Feature Learning CVPR2026
Quick Read: This paper addresses unsupervised learning of Group Activity Features (GAFs), i.e., extracting features that characterize group dynamics without group activity annotations. Prior methods rely on low-level static local features for GAF learning and struggle to capture complex group interactions and scene context. The key to the solution is two dynamics-aware and group-aware pretext tasks: person flow estimation, which models the local motion dynamics of each individual as an important cue for understanding group activities; and group-relevant object location estimation, which guides the model to learn scene context (e.g., spatial relations between people and objects) as global features. Combined with the local and global features provided by DINO, the method achieves group-dynamics-aware GAF learning and state-of-the-art group activity retrieval and recognition on public datasets.
Link: https://arxiv.org/abs/2604.04467
Authors: Ryuki Tezuka,Chihiro Nakatani,Norimichi Ukita
Affiliations: Toyota Technological Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR2026 Findings
Abstract:This paper proposes Group Activity Feature (GAF) learning without group activity annotations. Unlike prior work, which uses low-level static local features to learn GAFs, we propose leveraging dynamics-aware and group-aware pretext tasks, along with local and global features provided by DINO, for group-dynamics-aware GAF learning. To adapt DINO and GAF learning to local dynamics and global group features, our pretext tasks use person flow estimation and group-relevant object location estimation, respectively. Person flow estimation is used to represent the local motion of each person, which is an important cue for understanding group activities. In contrast, group-relevant object location estimation encourages GAFs to learn scene context (e.g., spatial relations of people and objects) as global features. Comprehensive experiments on public datasets demonstrate the state-of-the-art performance of our method in group activity retrieval and recognition. Our ablation studies verify the effectiveness of each component in our method. Code: this https URL.
[CV-55] Beyond Few-Step Inference: Accelerating Video Diffusion Transformer Model Serving with Inter-Request Caching Reuse
Quick Read: This paper addresses the high computational cost of video diffusion transformer (DiT) models at inference, caused by the iterative denoising process. Existing caching methods mainly exploit similarity within a single request to skip redundant denoising steps, with limited benefit; they are nearly ineffective on the 4-step distilled models used in industrial settings. The key to the solution is Chorus, a caching mechanism that accelerates serving by mining similarity across requests. It adopts a three-stage caching strategy: the first stage fully reuses latent features from similar requests; the second stage applies inter-request caching to specific latent regions during intermediate denoising steps and combines it with Token-Guided Attention Amplification to improve semantic alignment between the generated video and the conditional prompts, extending full reuse to later denoising steps. Overall, Chorus achieves up to 45% speedup.
Link: https://arxiv.org/abs/2604.04451
Authors: Hao Liu,Ye Huang,Chenghuan Huang,Zhenyi Zheng,Jiangsu Du,Ziyang Ma,Jing Lyu,Yutong Lu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Video Diffusion Transformer (DiT) models are a dominant approach for high-quality video generation but suffer from high inference cost due to iterative denoising. Existing caching approaches primarily exploit similarity within the diffusion process of a single request to skip redundant denoising steps. In this paper, we introduce Chorus, a caching approach that leverages similarity across requests to accelerate video diffusion model serving. Chorus achieves up to 45% speedup on industrial 4-step distilled models, where prior intra-request caching approaches are ineffective. Particularly, Chorus employs a three-stage caching strategy along the denoising process. Stage 1 performs full reuse of latent features from similar requests. Stage 2 exploits inter-request caching in specific latent regions during intermediate denoising steps. This stage is combined with Token-Guided Attention Amplification to improve semantic alignment between the generated video and the conditional prompts, thereby extending the applicability of full reuse to later denoising steps.
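Stage 1 of the inter-request strategy described above can be sketched as a similarity-keyed cache: if a new request's prompt embedding is close enough to a cached one, its stored latents are reused instead of re-denoising. This is a minimal sketch of the idea, not the Chorus implementation; the cosine-similarity key, the threshold value, and the linear scan are all illustrative assumptions.

```python
import numpy as np

class LatentReuseCache:
    """Toy inter-request reuse cache: stores (prompt embedding, latents)
    pairs and returns cached latents for sufficiently similar prompts."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (unit-norm embedding, latents)

    def lookup(self, emb):
        emb = emb / np.linalg.norm(emb)
        best, best_sim = None, -1.0
        for e, lat in self.entries:
            sim = float(emb @ e)          # cosine similarity (both unit-norm)
            if sim > best_sim:
                best, best_sim = lat, sim
        return (best, best_sim) if best_sim >= self.threshold else (None, best_sim)

    def insert(self, emb, latents):
        self.entries.append((emb / np.linalg.norm(emb), latents))
```

In a real serving system the lookup would use an approximate-nearest-neighbor index rather than a linear scan, and a miss would fall through to the normal denoising path.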
[CV-56] Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection CVPR2026
Quick Read: This paper addresses the severe performance degradation of open-vocabulary object detection (OVOD) models when transferred to downstream tasks with substantial domain shifts. The core challenges are the scarcity and weak semantics of category labels in domain-specific tasks, and the inability of existing models to capture auxiliary semantics beyond coarse-grained category labels. The key to the solution is HSA-DINO, a parameter-efficient semantic augmentation framework with two innovations: (1) a multi-scale prompt bank that leverages image feature pyramids to capture hierarchical semantics and select domain-specific local semantic prompts, progressively enriching textual representations from coarse to fine; and (2) a semantic-aware router that dynamically selects the appropriate semantic augmentation strategy at inference, preventing parameter updates from degrading the generalization ability of the pre-trained OVOD model. Experiments on OV-COCO, several vertical-domain datasets, and modified benchmark settings show that the method outperforms prior state-of-the-art approaches, achieving a better trade-off between domain adaptability and open-vocabulary generalization.
Link: https://arxiv.org/abs/2604.04444
Authors: Weihao Cao,Runqi Wang,Xiaoyue Duan,Jinchao Zhang,Ang Yang,Liping Jing
Affiliations: Beijing Jiaotong University; WeChat AI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026
Abstract:Open-vocabulary object detection (OVOD) enables models to detect any object category, including unseen ones. Benefiting from large-scale pre-training, existing OVOD methods achieve strong detection performance on general scenarios (e.g., OV-COCO) but suffer severe performance drops when transferred to downstream tasks with substantial domain shifts. This degradation stems from the scarcity and weak semantics of category labels in domain-specific task, as well as the inability of existing models to capture auxiliary semantics beyond coarse-grained category label. To address these issues, we propose HSA-DINO, a parameter-efficient semantic augmentation framework for enhancing open-vocabulary object detection. Specifically, we propose a multi-scale prompt bank that leverages image feature pyramids to capture hierarchical semantics and select domain-specific local semantic prompts, progressively enriching textual representations from coarse to fine-grained levels. Furthermore, we introduce a semantic-aware router that dynamically selects the appropriate semantic augmentation strategy during inference, thereby preventing parameter updates from degrading the generalization ability of the pre-trained OVOD model. We evaluate HSA-DINO on OV-COCO, several vertical domain datasets, and modified benchmark settings. The results show that HSA-DINO performs favorably against previous state-of-the-art methods, achieving a superior trade-off between domain adaptability and open-vocabulary generalization.
[CV-57] Estimating Central Peripheral and Temporal Visual Contributions to Human Decision Making in Atari Games
Quick Read: This paper investigates how much different visual information sources (peripheral visual information, explicit gaze information, and past-state information) each contribute to human decision making in dynamic visual environments. The key to the solution is a controlled ablation framework built on Atari-HEAD, a large-scale Atari gameplay dataset with synchronized eye tracking: action-prediction networks are trained under six conditions that selectively include or exclude these information sources, quantifying each source's effect on decision accuracy. Experiments show that peripheral information contributes by far the most (accuracy drops of 35.27-43.90% when it is removed), while gaze and past-state information contribute less, with broader ranges, indicating that human decision making depends heavily on information beyond the current point of gaze.
Link: https://arxiv.org/abs/2604.04439
Authors: Henrik Krauss,Takehisa Yairi
Affiliations: Department of Advanced Interdisciplinary Studies, The University of Tokyo; Research Center for Advanced Science and Technology, The University of Tokyo
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We study how different visual information sources contribute to human decision making in dynamic visual environments. Using Atari-HEAD, a large-scale Atari gameplay dataset with synchronized eye-tracking, we introduce a controlled ablation framework as a means to reverse-engineer the contribution of peripheral visual information, explicit gaze information in form of gaze maps, and past-state information from human behavior. We train action-prediction networks under six settings that selectively include or exclude these information sources. Across 20 games, peripheral information shows by far the strongest contribution, with median prediction-accuracy drops in the range of 35.27-43.90% when removed. Gaze information yields smaller drops of 2.11-2.76%, while past-state information shows a broader range of 1.52-15.51%, with the upper end likely more informative due to reduced peripheral-information leakage. To complement aggregate accuracies, we cluster states by true-action probabilities assigned by the different model configurations. This analysis identifies coarse behavioral regimes, including focus-dominated, periphery-dominated, and more contextual decision situations. These results suggest that human decision making in Atari depends strongly on information beyond the current focus of gaze, while the proposed framework provides a way to estimate such information-source contributions from behavior.
[CV-58] HandDreamer: Zero-Shot Text to 3D Hand Model Generation using Corrective Hand Shape Guidance CVPR
Quick Read: This paper addresses the shortcomings of current text-based 3D hand model generation in zero-shot settings, such as unnatural structures, view inconsistencies, and loss of detail. Existing methods based on Score Distillation Sampling (SDS) generalize poorly to hands, mainly because ambiguity in the probability landscape described by the text prompt causes different views to converge to different modes of the distribution. The key to the solution is HandDreamer, which introduces MANO hand model initialization and a skeleton-guided diffusion process to provide a strong structural prior and ensure pose and view consistency, together with a novel corrective hand shape guidance loss that drives all views of the 3D hand model toward consistent modes without geometric distortions, significantly improving generation quality and stability.
Link: https://arxiv.org/abs/2604.04425
Authors: Green Rosh,Prateek Kukreja,Vishakha SR,Pawan Prasad B H
Affiliations: Samsung R&D Institute India Bangalore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
Abstract:The emergence of virtual reality has necessitated the generation of detailed and customizable 3D hand models for interaction in the virtual world. However, the current methods for 3D hand model generation are both expensive and cumbersome, offering very little customizability to the users. While recent advancements in zero-shot text-to-3D synthesis have enabled the generation of diverse and customizable 3D models using Score Distillation Sampling (SDS), they do not generalize very well to 3D hand model generation, resulting in unnatural hand structures, view-inconsistencies and loss of details. To address these limitations, we introduce HandDreamer, the first method for zero-shot 3D hand model generation from text prompts. Our findings suggest that view-inconsistencies in SDS is primarily caused due to the ambiguity in the probability landscape described by the text prompt, resulting in similar views converging to different modes of the distribution. This is particularly aggravated for hands due to the large variations in articulations and poses. To alleviate this, we propose to use MANO hand model based initialization and a hand skeleton guided diffusion process to provide a strong prior for the hand structure and to ensure view and pose consistency. Further, we propose a novel corrective hand shape guidance loss to ensure that all the views of the 3D hand model converges to view-consistent modes, without leading to geometric distortions. Extensive evaluations demonstrate the superiority of our method over the state-of-the-art methods, paving a new way forward in 3D hand model generation.
[CV-59] BoxComm: Benchmarking Category-Aware Commentary Generation and Narration Rhythm in Boxing
Quick Read: This paper addresses the notable shortcomings of multimodal large language models (MLLMs) in generating commentary for combat sports: existing benchmarks focus exclusively on team sports (e.g., soccer and basketball), overlooking the ability to recognize fleeting, subtle yet semantically decisive actions and to produce tactical analysis. The key to the solution has three parts. First, BoxComm, a large-scale dataset of 445 World Boxing Championship match videos with over 52K sentences of professional broadcast commentary. Second, a structured commentary taxonomy that categorizes each sentence as play-by-play, tactical, or contextual, providing the first category-level annotation for sports commentary benchmarks. Third, two novel and complementary evaluation protocols: category-conditioned generation, which measures whether models can produce commentary of a specified type given video context, and commentary rhythm assessment, which checks whether freely generated commentary exhibits appropriate temporal pacing and type distribution. Experiments show that mainstream MLLMs struggle on both evaluations, while an improved baseline, EIC-Gen, which incorporates detected punch events as structured action cues, yields consistent gains, highlighting the importance of perceiving fleeting and subtle events for combat sports commentary generation.
Link: https://arxiv.org/abs/2604.04419
Authors: Kaiwen Wang,Kaili Zheng,Rongrong Deng,Yiming Shi,Chenyi Guo,Ji Wu
Affiliations: Tsinghua University; Beijing Sport University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent multimodal large language models (MLLMs) have shown strong capabilities in general video understanding, driving growing interest in automatic sports commentary generation. However, existing benchmarks for this task focus exclusively on team sports such as soccer and basketball, leaving combat sports entirely unexplored. Notably, combat sports present distinct challenges: critical actions unfold within milliseconds with visually subtle yet semantically decisive differences, and professional commentary contains a substantially higher proportion of tactical analysis compared to team sports. In this paper, we present BoxComm, a large-scale dataset comprising 445 World Boxing Championship match videos with over 52K commentary sentences from professional broadcasts. We propose a structured commentary taxonomy that categorizes each sentence into play-by-play, tactical, or contextual, providing the first category-level annotation for sports commentary benchmarks. Building on this taxonomy, we introduce two novel and complementary evaluations tailored to sports commentary generation: (1) category-conditioned generation, which evaluates whether models can produce accurate commentary of a specified type given video context; and (2) commentary rhythm assessment, which measures whether freely generated commentary exhibits appropriate temporal pacing and type distribution over continuous video segments, capturing a dimension of commentary competence that prior benchmarks have not addressed. Experiments on multiple state-of-the-art MLLMs reveal that current models struggle on both evaluations. We further propose EIC-Gen, an improved baseline incorporating detected punch events to supply structured action cues, yielding consistent gains and highlighting the importance of perceiving fleeting and subtle events for combat sports commentary.
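A minimal version of the rhythm assessment described above can compare the type distribution of generated commentary against a reference segment. This is an illustrative sketch, not the benchmark's actual metric: the total-variation distance and the three category names are assumptions based on the taxonomy in the abstract.

```python
import numpy as np

CATEGORIES = ("play_by_play", "tactical", "contextual")

def category_distribution(sentences):
    """Empirical distribution of commentary types over a segment."""
    counts = np.array([sum(s == c for s in sentences) for c in CATEGORIES], float)
    return counts / counts.sum()

def rhythm_distance(generated, reference):
    """Toy rhythm score: total-variation distance between the type
    distributions of generated and reference commentary (0 = identical
    mix of types, 1 = completely disjoint)."""
    p, q = category_distribution(generated), category_distribution(reference)
    return 0.5 * float(np.abs(p - q).sum())
```

A fuller metric would also score temporal pacing, e.g., by computing this distance per time window rather than over the whole segment.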
[CV-60] 3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image CVPR2026
Quick Read: This paper addresses two difficulties in compositional 3D scene generation from a single-view image: existing feed-forward generation methods generalize poorly to complex scenes, while per-instance optimization methods are inefficient due to time-consuming pose optimization. The key to the solution is a novel in-place completion paradigm, 3D-Fixer, which uses the fragmented geometry obtained from geometry estimation methods as a spatial anchor and generates complete 3D assets directly at their original locations, preserving layout fidelity without explicit pose alignment. It further introduces a coarse-to-fine generation scheme, a dual-branch conditioning network, and an Occlusion-Robust Feature Alignment (ORFA) strategy to resolve boundary ambiguity under occlusion and stabilize training.
Link: https://arxiv.org/abs/2604.04406
Authors: Ze-Xin Yin,Liu Liu,Xinjie Wang,Wei Sui,Zhizhong Su,Jian Yang,Jin Xie
Affiliations: Nankai University; Nanjing University; Horizon Robotics; D-Robotics
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 10 figures, CVPR 2026, project page: this https URL
Abstract:Compositional 3D scene generation from a single view requires the simultaneous recovery of scene layout and 3D assets. Existing approaches mainly fall into two categories: feed-forward generation methods and per-instance generation methods. The former directly predict 3D assets with explicit 6DoF poses through efficient network inference, but they generalize poorly to complex scenes. The latter improve generalization through a divide-and-conquer strategy, but suffer from time-consuming pose optimization. To bridge this gap, we introduce 3D-Fixer, a novel in-place completion paradigm. Specifically, 3D-Fixer extends 3D object generative priors to generate complete 3D assets conditioned on the partially visible point cloud at the original locations, which are cropped from the fragmented geometry obtained from the geometry estimation methods. Unlike prior works that require explicit pose alignment, 3D-Fixer uses fragmented geometry as a spatial anchor to preserve layout fidelity. At its core, we propose a coarse-to-fine generation scheme to resolve boundary ambiguity under occlusion, supported by a dual-branch conditioning network and an Occlusion-Robust Feature Alignment (ORFA) strategy for stable training. Furthermore, to address the data scarcity bottleneck, we present ARSG-110K, the largest scene-level dataset to date, comprising over 110K diverse scenes and 3M annotated images with high-fidelity 3D ground truth. Extensive experiments show that 3D-Fixer achieves state-of-the-art geometric accuracy, which significantly outperforms baselines such as MIDI and Gen3DSR, while maintaining the efficiency of the diffusion process. Code and data will be publicly available at this https URL.
[CV-61] UENR-600K: A Large-Scale Physically Grounded Dataset for Nighttime Video Deraining
Quick Read: This paper addresses the poor generalization of nighttime video deraining models. The core challenge is that nighttime rain interacts in complex ways with artificial light sources, taking on various colors and appearing locally illuminated; existing small-scale synthetic datasets use only 2D rain overlays, which cannot reproduce these physical effects, so models degrade sharply on real nighttime rain. The key to the solution is UENR-600K, a large-scale, physically grounded nighttime deraining dataset of 600,000 1080p frame pairs built with Unreal Engine, which simulates rain as 3D particles in virtual environments to accurately capture physical properties such as color refraction, occlusion, and rain curtains. On top of this high-quality data, the authors establish a new baseline by adapting the Wan 2.2 video generation model, treating deraining as a video-to-video generation task and exploiting strong generative priors to almost entirely close the sim-to-real gap, substantially improving generalization to real nighttime videos.
Link: https://arxiv.org/abs/2604.04402
Authors: Pei Yang,Hai Ci,Beibei Lin,Yiren Song,Mike Zheng Shou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Nighttime video deraining is uniquely challenging because raindrops interact with artificial lighting. Unlike daytime white rain, nighttime rain takes on various colors and appears locally illuminated. Existing small-scale synthetic datasets rely on 2D rain overlays and fail to capture these physical properties, causing models to generalize poorly to real-world night rain. Meanwhile, capturing real paired nighttime videos remains impractical because rain effects cannot be isolated from other degradations like sensor noise. To bridge this gap, we introduce UENR-600K, a large-scale, physically grounded dataset containing 600,000 1080p frame pairs. We utilize Unreal Engine to simulate rain as 3D particles within virtual environments. This approach guarantees photorealism and physically real raindrops, capturing correct details like color refractions, scene occlusions, rain curtains. Leveraging this high-quality data, we establish a new state-of-the-art baseline by adapting the Wan 2.2 video generation model. Our baseline treat deraining as a video-to-video generation task, exploiting strong generative priors to almost entirely bridge the sim-to-real gap. Extensive benchmarking demonstrates that models trained on our dataset generalize significantly better to real-world videos. Project page: this https URL.
[CV-62] BiTDiff: Fine-Grained 3D Conducting Motion Generation via BiMamba-Transformer Diffusion
Quick Read: This paper addresses the dual challenges of long-sequence modeling and high-quality motion synthesis in 3D conducting motion generation: the lack of a large-scale fine-grained 3D conducting motion dataset, and the inability of existing methods to balance generation quality and efficiency. The key to the solution is twofold. First, CM-Data, the first public fine-grained SMPL-X dataset for 3D conducting motion. Second, BiTDiff, a framework built on a BiMamba-Transformer hybrid architecture for efficient long-sequence modeling (BiMamba for memory-efficient temporal modeling, Transformer for cross-modal semantic alignment), combined with a diffusion-based generative strategy that uses physical-consistency losses and a hand-/body-specific forward-kinematics design to achieve high-fidelity, fine-grained conducting motion synthesis. BiTDiff additionally supports training-free joint-level motion editing, offering flexibility for downstream human-AI interaction design.
Link: https://arxiv.org/abs/2604.04395
Authors: Tianzhi Jia,Kaixing Yang,Xiaole Yang,Xulong Tang,Ke Qiu,Shikui Wei,Yao Zhao
Affiliations: Beijing Jiaotong University; Renmin University of China; ADVANCE.AI; Malou Tech Inc
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 10 pages, 5 figures
Abstract:3D conducting motion generation aims to synthesize fine-grained conductor motions from music, with broad potential in music education, virtual performance, digital human animation, and human-AI co-creation. However, this task remains underexplored due to two major challenges: (1) the lack of large-scale fine-grained 3D conducting datasets and (2) the absence of effective methods that can jointly support long-sequence generation with high quality and efficiency. To address the data limitation, we develop a quality-oriented 3D conducting motion collection pipeline and construct CM-Data, a fine-grained SMPL-X dataset with about 10 hours of conducting motion data. To the best of our knowledge, CM-Data is the first and largest public dataset for 3D conducting motion generation. To address the methodological limitation, we propose BiTDiff, a novel framework for 3D conducting motion generation, built upon a BiMamba-Transformer hybrid model architecture for efficient long-sequence modeling and a Diffusion-based generative strategy with human-kinematic decomposition for high-quality motion synthesis. Specifically, BiTDiff introduces auxiliary physical-consistency losses and a hand-/body-specific forward-kinematics design for better fine-grained motion modeling, while leveraging BiMamba for memory-efficient long-sequence temporal modeling and Transformer for cross-modal semantic alignment. In addition, BiTDiff supports training-free joint-level motion editing, enabling downstream human-AI interaction design. Extensive quantitative and qualitative experiments demonstrate that BiTDiff achieves state-of-the-art (SOTA) performance for 3D conducting motion generation on the CM-Data dataset. Code will be available upon acceptance.
[CV-63] Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning CVPR2026
Quick Read: This paper addresses the lack of evidence-consistency verification in current large multimodal model (LMM) video reasoning, where a single inference pass outputs an answer without ensuring the reasoning aligns with video evidence. The key to the solution is a dual paradigm, Reinforce to Learn, Elect to Reason (RLER), with two stages. In training (RLER-Training), the policy is optimized with group-relative reinforcement learning and three task-driven reward signals (a frame-sensitive reward, a think-transparency reward, and an anti-repetition reward), so the model emits structured, machine-checkable evidence. At inference (RLER-Inference), a training-free orchestrator performs a weighted election over candidate answers based on evidence consistency, confidence, transparency, and non-redundancy, closing the loop between producing and using evidence and thereby improving the reliability and interpretability of video reasoning without enlarging the model.
Link: https://arxiv.org/abs/2604.04379
Authors: Songyuan Yang,Weijiang Yu,Jilin Ma,Ziyu Liu,Guijian Tang,Wenjing Yang,Huibin Tan,Nong Xiao
Affiliations: National University of Defense Technology; Sun Yat-sen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2026. Camera-ready version
Abstract:Video reasoning has advanced with large multimodal models (LMMs), yet their inference is often a single pass that returns an answer without verifying whether the reasoning is evidence-aligned. We introduce Reinforce to Learn, Elect to Reason (RLER), a dual paradigm that decouples learning to produce evidence from obtaining a reliable answer. In RLER-Training, we optimize the policy with group-relative reinforcement learning (RL) and 3 novel task-driven rewards: Frame-sensitive reward grounds reasoning on explicit key frames, Think-transparency reward shapes readable and parsable reasoning traces, and Anti-repetition reward boosts information density. These signals teach the model to emit structured, machine-checkable evidence and potentiate reasoning capabilities. In RLER-Inference, we apply a train-free orchestrator that generates a small set of diverse candidates, parses their answers and cited frames, scores them by evidence consistency, confidence, transparency, and non-redundancy, and then performs a robust evidence-weighted election. This closes the loop between producing and using evidence, improving reliability and interpretability without enlarging the model. We comprehensively evaluate RLER against various open-source and RL-based LMMs on 8 representative benchmarks. RLER achieves state of the art across all benchmarks and delivers an average improvement of 6.3% over base models, while using on average 3.1 candidates per question, indicating a favorable balance between compute and quality. The results support a simple thesis: making evidence explicit during learning and electing by evidence during inference is a robust path to trustworthy video reasoning.
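The evidence-weighted election in the inference stage can be sketched as a weighted vote over candidate answers. This is a toy reading of the abstract, not the RLER orchestrator itself: the four score names follow the abstract, but the equal-weight sum and the candidate format are illustrative assumptions.

```python
from collections import defaultdict

def elect_answer(candidates):
    """Toy evidence-weighted election: each candidate carries an answer and
    scores for evidence consistency, confidence, transparency, and
    non-redundancy; the answer with the highest total weight wins."""
    totals = defaultdict(float)
    for c in candidates:
        weight = (c["consistency"] + c["confidence"]
                  + c["transparency"] + c["non_redundancy"])
        totals[c["answer"]] += weight
    return max(totals, key=totals.get)
```

Note how two moderately scored candidates that agree can outvote a single high-scoring outlier, which is the point of electing by evidence rather than taking the single most confident pass.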
[CV-64] Graph-to-Frame RAG : Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning CVPR2026
Quick Read: This paper addresses the performance bottleneck in video reasoning caused by missing external knowledge, in particular the attention dilution and increased cognitive load that arise when retrieval-augmented LMM systems force heterogeneous information (such as textual or multi-clip evidence) into a single attention space. The key to the solution is Graph-to-Frame RAG (G2F-RAG), a training-free and auditable retrieval-augmentation method whose core innovation is representing external knowledge as a structured knowledge graph delivered in the visual space: a hierarchical multi-agent controller dynamically decides whether external knowledge is needed, retrieves a minimal sufficient subgraph, and renders it as a single reasoning frame appended to the video, so the LMM reasons jointly in a unified visual domain, reducing cognitive load and leaving an explicit, inspectable evidence trail.
Link: https://arxiv.org/abs/2604.04372
Authors: Songyuan Yang,Weijiang Yu,Ziyu Liu,Guijian Tang,Wenjing Yang,Huibin Tan,Nong Xiao
Affiliations: National University of Defense Technology; Sun Yat-sen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2026. Camera-ready version
Abstract:When video reasoning requires external knowledge, many systems with large multimodal models (LMMs) adopt retrieval augmentation to supply the missing context. Appending textual or multi-clip evidence, however, forces heterogeneous signals into a single attention space. We observe diluted attention and higher cognitive load even on non-long videos. The bottleneck is not only what to retrieve but how to represent and fuse external knowledge with the video. We present Graph-to-Frame RAG (G2F-RAG), a training free and auditable paradigm that delivers knowledge in the visual space. On the offline stage, an agent builds a problem-agnostic video knowledge graph that integrates entities, events, spatial relations, and linked world knowledge. On the online stage, a hierarchical multi-agent controller decides whether external knowledge is needed, retrieves a minimal sufficient subgraph, and renders it as a single reasoning frame appended to the video. LMMs then perform joint reasoning in a unified visual domain. This design reduces cognitive load and leaves an explicit, inspectable evidence trail. G2F-RAG is plug-and-play across backbones and scales. It yields consistent gains on diverse public benchmarks, with larger improvements in knowledge-intensive settings. Ablations further confirm that knowledge representation and delivery matter. G2F-RAG reframes retrieval as visual space knowledge fusion for robust and interpretable video reasoning.
[CV-65] Integer-Only Operations on Extreme Learning Machine Test Time Classification
Quick Read: This paper addresses the excessive test-time computational cost of network classifiers based on the extreme learning machine (ELM), particularly for energy-sensitive embedded systems and data centers. The key to the solution is three techniques backed by theoretical analysis and empirical validation: (i) input weights can be drawn from a ternary set, eliminating floating-point multiplications; (ii) a proof that normalized and non-normalized test signals yield identical classification accuracy, simplifying preprocessing; and (iii) an integer representation of the output weights that enables integer-only arithmetic with minimal loss of classification accuracy. Experiments show that these techniques can substantially reduce test-time computation on hardware platforms such as FPGAs without compromising performance, improving energy efficiency.
Link: https://arxiv.org/abs/2604.04363
Authors: Emerson Lopes Machado,Cristiano Jacques Miosso,Ricardo Pezzuol Jacobi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages. Originally written in 2015; archived in 2026
Abstract:We present a theoretical analysis and empirical evaluations of a novel set of techniques for computational cost reduction of test time operations of network classifiers based on extreme learning machine (ELM). By exploring some characteristics we derived from these models, we show that the classification at test time can be performed using solely integer operations without compromising the classification accuracy. Our contributions are as follows: (i) We show empirical evidence that the input weights values can be drawn from the ternary set with limited reduction of the classification accuracy. This has the computational advantage of dismissing multiplications; (ii) We prove the classification accuracy of normalized and non-normalized test signals are the same; (iii) We show how to create an integer version of the output weights that results in a limited reduction of the classification accuracy. We tested our techniques on 5 computer vision datasets commonly used in the literature and the results indicate that our techniques can allow the reduction of the computational cost of the operations necessary for the classification at test time in FPGAs. This is important in embedded applications, where power consumption is limited, and crucial in data centers of large corporations, where power consumption is expensive.
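The integer-only forward pass can be sketched as follows. This is an illustrative toy in the spirit of the abstract, not the authors' pipeline: the hard-threshold activation stands in for whatever integer-friendly nonlinearity the paper uses, and the quantization scales, sizes, and seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_integer_predict(X_int, W_ternary, beta_int):
    """Integer-only ELM test-time sketch: ternary input weights {-1, 0, +1}
    reduce the hidden projection to additions/subtractions, and quantized
    integer output weights keep the whole forward pass in integer
    arithmetic. A step activation replaces the usual sigmoid here."""
    H = X_int @ W_ternary            # integer matmul; ternary weights need no multiplies
    H = (H > 0).astype(np.int64)     # integer step activation
    scores = H @ beta_int            # integer output projection
    return scores.argmax(axis=1)

# toy integer data: 4 samples, 3 features, 5 hidden units, 2 classes
X = rng.integers(0, 8, size=(4, 3))
W = rng.integers(-1, 2, size=(3, 5))      # ternary weights in {-1, 0, 1}
beta = rng.integers(-4, 5, size=(5, 2))   # quantized output weights
pred = elm_integer_predict(X, W, beta)
```

On an FPGA, every operation here maps to adders, comparators, and an argmax tree, with no floating-point units involved.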
[CV-66] Spatially-Weighted CLIP for Street-View Geo-localization
Quick Read: This paper addresses a limitation of CLIP-based street-view geo-localization methods in handling non-matching samples: they treat all non-matching samples as equally negative, ignoring geographic proximity. The key to the solution is a spatial weighting mechanism that models geographic correlation via Tobler's First Law of Geography: a location-as-text representation encodes geographic positions, and the one-hot InfoNCE targets are replaced with spatially weighted soft labels computed from geodesic distance; a neighborhood-consistency regularization further preserves local spatial structure in the embedding space. This design shifts the model from semantic alignment to geographic alignment, markedly improving localization accuracy and spatial coherence, especially under long-tail distributions.
Link: https://arxiv.org/abs/2604.04357
Authors: Ting Han,Fengjiao Li,Chunsong Chen,Haoling Huang,Yiping Chen,Meiliu Wu
Affiliations: Sun Yat-Sen University; University of Glasgow
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This paper proposes Spatially-Weighted CLIP (SW-CLIP), a novel framework for street-view geo-localization that explicitly incorporates spatial autocorrelation into vision-language contrastive learning. Unlike conventional CLIP-based methods that treat all non-matching samples as equally negative, SW-CLIP leverages Tobler’s First Law of Geography to model geographic relationships through distance-aware soft supervision. Specifically, we introduce a location-as-text representation to encode geographic positions and replace one-hot InfoNCE targets with spatially weighted soft labels derived from geodesic distance. Additionally, a neighborhood-consistency regularization is employed to preserve local spatial structure in the embedding space. Experiments on a multi-city dataset demonstrate that SW-CLIP significantly improves geo-localization accuracy, reduces long-tail errors, and enhances spatial coherence compared to standard CLIP. The results highlight the importance of shifting from semantic alignment to geographic alignment for robust geo-localization and provide a general paradigm for integrating spatial principles into multimodal representation learning.
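The distance-aware soft targets described above can be sketched directly: a softmax over negative geodesic distance gives nearby locations non-zero target mass, replacing the one-hot InfoNCE target. This is an illustrative reconstruction in the spirit of SW-CLIP, not the paper's code; the haversine distance and the temperature `tau_km` are assumptions.

```python
import numpy as np

def haversine(a, b, radius_km=6371.0):
    """Great-circle distance in km between (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (a[0], a[1], b[0], b[1]))
    h = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(h))

def spatial_soft_targets(coords, tau_km=100.0):
    """Distance-aware soft labels replacing one-hot InfoNCE targets:
    row i is a softmax over -distance(i, j) / tau_km, so each sample's
    target mass decays smoothly with geodesic distance."""
    n = len(coords)
    D = np.array([[haversine(coords[i], coords[j]) for j in range(n)]
                  for i in range(n)])
    logits = -D / tau_km
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)
```

These targets would then be used as the label distribution in a cross-entropy loss over the image-text similarity logits, in place of the usual identity-matrix target.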
[CV-67] OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text CVPR2026
【速读】:该论文旨在解决现有视频条件音频生成模型在合成完整听觉场景时的局限性,即多数方法仅关注与可见声源对应的屏幕内(on-screen)环境音,忽视了屏幕外(off-screen)声音事件;而近期的联合文本-视频到音频生成模型虽尝试涵盖屏幕内外声音,但无法处理人类语音。为此,作者提出OmniSonic,其核心创新在于基于流匹配(flow-matching)的扩散框架,并采用TriAttn-DiT架构实现对屏幕内环境声、屏幕外环境声和语音三类条件的并行跨注意力处理,同时引入Mixture-of-Experts(MoE)门控机制动态调节各模态贡献权重,从而实现统一且全面的音频生成。
链接: https://arxiv.org/abs/2604.04348
作者: Weiguo Pian,Saksham Singh Kushwaha,Zhimin Chen,Shijian Deng,Kai Wang,Yunhui Guo,Yapeng Tian
机构: The University of Texas at Dallas (德克萨斯大学达拉斯分校); Clemson University (克莱姆森大学); University of Toronto (多伦多大学)
类目: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: CVPR 2026
Abstract:In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting off-screen auditory events. While recent holistic joint text-video-to-audio generation models aim to produce auditory scenes with both on- and off-screen sounds, they are limited to non-speech audio and lack the ability to generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow-matching-based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention operations to process on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with a Mixture-of-Experts (MoE) gating mechanism that adaptively balances their contributions during generation. Furthermore, we construct UniHAGen-Bench, a new benchmark with over one thousand samples covering three representative on/off-screen speech-environment scenarios. Extensive experiments show that OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation. Project page: this https URL
[CV-68] GA-GS: Generation-Assisted Gaussian Splatting for Static Scene Reconstruction
【速读】:该论文旨在解决单目视频中动态物体遮挡导致的静态场景重建难题,尤其在缺乏背景信息的遮挡区域难以恢复的问题。现有方法依赖背景信息进行重建,无法有效处理被动态物体长期遮挡的区域。解决方案的关键在于提出GA-GS(Generation-Assisted Gaussian Splatting)方法,其核心创新是利用生成式模型辅助重建被遮挡区域:首先通过运动感知模块分割并移除动态区域,再使用扩散模型对遮挡区域进行图像修复(inpainting),生成伪真值监督信号;同时引入可学习的真实性标量(authenticity scalar)对每个高斯原语(Gaussian primitive)的不透明度进行动态调制,实现真实性感知的渲染与监督,从而平衡真实背景与生成内容的贡献。
链接: https://arxiv.org/abs/2604.04331
作者: Yedong Shen,Shiqi Zhang,Sha Zhang,Yifan Duan,Xinran Zhang,Wenhao Yu,Lu Zhang,Jiajun Deng,Yanyong Zhang
机构: University of Science and Technology of China (中国科学技术大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Reconstructing static 3D scene from monocular video with dynamic objects is important for numerous applications such as virtual reality and autonomous driving. Current approaches typically rely on background for static scene reconstruction, limiting the ability to recover regions occluded by dynamic objects. In this paper, we propose GA-GS, a Generation-Assisted Gaussian Splatting method for Static Scene Reconstruction. The key innovation of our work lies in leveraging generation to assist in reconstructing occluded regions. We employ a motion-aware module to segment and remove dynamic regions, and then use a diffusion model to inpaint the occluded areas, providing pseudo-ground-truth supervision. To balance contributions from real background and generated region, we introduce a learnable authenticity scalar for each Gaussian primitive, which dynamically modulates opacity during splatting for authenticity-aware rendering and supervision. Since no existing dataset provides ground-truth static scenes for videos with dynamic objects, we construct a dataset named Trajectory-Match, using a fixed-path robot to record each scene with/without dynamic objects, enabling quantitative evaluation in reconstruction of occluded regions. Extensive experiments on both the DAVIS and our dataset show that GA-GS achieves state-of-the-art performance in static scene reconstruction, especially in challenging scenarios with large-scale, persistent occlusions.
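摘要中“为每个高斯原语引入可学习真实性标量并在 splatting 时调制不透明度”的机制,可粗略示意如下(以 sigmoid 调制为假设形式,非论文原始实现):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def modulated_opacity(alpha, authenticity_logit):
    """用可学习真实性标量(此处以 logit 表示)调制高斯原语的不透明度。
    训练中真实背景区域的标量趋大(调制系数接近 1),
    扩散模型生成区域的标量趋小,其渲染与监督权重随之被压低。"""
    return sigmoid(authenticity_logit) * alpha

real_alpha = modulated_opacity(0.8, 4.0)    # 高真实性:基本保留原不透明度
gen_alpha = modulated_opacity(0.8, -2.0)    # 低真实性(生成内容):被显著压低
```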
[CV-69] HighFM: Towards a Foundation Model for Learning Representations from High-Frequency Earth Observation Data
【速读】:该论文旨在解决当前地球观测(Earth Observation, EO)领域中,由于依赖高分辨率但低重访率卫星影像导致的难以实时监测快速演变灾害事件的问题。现有基础模型(Foundation Models, FMs)虽在遥感任务中表现优异,但受限于时间分辨率不足,难以满足应急响应对时效性的要求。其解决方案的关键在于提出HighFM——首个面向高时间分辨率多光谱遥感数据的基础模型架构,通过利用来自Meteosat Second Generation(MSG)平台的超过2TB SEVIRI影像数据,基于SatMAE掩码自编码框架进行预训练,并引入细粒度的时间编码机制以捕捉短期时空变化特征;在此基础上进一步微调用于云掩膜和活跃火点检测任务,在平衡准确率与IoU指标上显著优于传统基线和近期地理空间基础模型,验证了高时间密度静止轨道数据在实时地球观测中的潜力。
链接: https://arxiv.org/abs/2604.04306
作者: Stella Girtsou,Konstantinos Alexis,Giorgos Giannopoulos,Harris Kontoes
机构: National Observatory of Athens (雅典国家天文台); National Technical University of Athens (雅典国立技术大学); National and Kapodistrian University of Athens (雅典国立卡波迪斯特里安大学); Athena Research Center (雅典娜研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The increasing frequency and severity of climate related disasters have intensified the need for real time monitoring, early warning, and informed decision-making. Earth Observation (EO), powered by satellite data and Machine Learning (ML), offers powerful tools to meet these challenges. Foundation Models (FMs) have revolutionized EO ML by enabling general-purpose pretraining on large scale remote sensing datasets. However most existing models rely on high-resolution satellite imagery with low revisit rates limiting their suitability for fast-evolving phenomena and time critical emergency response. In this work, we present HighFM, a first cut approach towards a FM for high temporal resolution, multispectral EO data. Leveraging over 2 TB of SEVIRI imagery from the Meteosat Second Generation (MSG) platform, we adapt the SatMAE masked autoencoding framework to learn robust spatiotemporal representations. To support real time monitoring, we enhance the original architecture with fine grained temporal encodings to capture short term variability. The pretrained models are then finetuned on cloud masking and active fire detection tasks. We benchmark our SEVIRI pretrained Vision Transformers against traditional baselines and recent geospatial FMs, demonstrating consistent gains across both balanced accuracy and IoU metrics. Our results highlight the potential of temporally dense geostationary data for real-time EO, offering a scalable path toward foundation models for disaster detection and tracking.
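摘要提到为捕捉短期变化而增强“细粒度时间编码”;一种常见的实现思路(此处仅为假设性示意,非论文原始设计)是对一天内的分钟数做多频率正弦/余弦编码,使编码天然具有日周期性:

```python
import math

def time_of_day_encoding(minute_of_day, dims=8):
    """对 [0, 1440) 的分钟数做多频率 sin/cos 编码,返回 dims 维向量"""
    enc = []
    for k in range(dims // 2):
        freq = 2 ** k                                   # 频率按 2 的幂递增
        angle = 2 * math.pi * freq * minute_of_day / 1440.0
        enc.append(math.sin(angle))
        enc.append(math.cos(angle))
    return enc

e_noon = time_of_day_encoding(720)      # 正午
e_midnight = time_of_day_encoding(0)    # 午夜
```

这种编码对静止轨道数据的高重访率(SEVIRI 约每 15 分钟一景)尤为重要,使模型能区分一天内相近但不同的观测时刻。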
[CV-70] A Persistent Homology Design Space for 3D Point Cloud Deep Learning
【速读】:该论文旨在解决将持久同调(Persistent Homology, PH)有效整合进3D点云深度学习框架中的问题,当前PH的应用多为非系统性、架构边缘化的处理方式,缺乏统一的设计范式。其解决方案的关键在于提出一个统一的“基于持久同调的3D点云学习设计空间”(3DPHDL),明确界定从复杂结构构建、滤波策略、持久性表示到神经网络主干和预测任务之间的交互机制,并识别出六个可被拓扑结构作为结构归纳偏置(structural inductive bias)注入的路径:采样、邻域图构建、优化动态、自监督、输出校准及内部网络正则化。通过在ModelNet40分类与ShapeNetPart分割任务上的受控实证研究,验证了该框架能显著提升拓扑敏感判别能力和部件一致性,同时揭示了表达能力与组合复杂度之间的权衡关系。
链接: https://arxiv.org/abs/2604.04299
作者: Prachi Kudeshia,Jiju Poovvancheri,Amr Ghoneim,Dong Chen
机构: Saint Mary’s University (圣玛丽大学); Nanjing Forestry University (南京林业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 27 pages, 12 figures, 5 tables
Abstract:Persistent Homology (PH) offers stable, multi-scale descriptors of intrinsic shape structure by capturing connected components, loops, and voids that persist across scales, providing invariants that complement purely geometric representations of 3D data. Yet, despite strong theoretical guarantees and increasing empirical adoption, its integration into deep learning for point clouds remains largely ad hoc and architecturally peripheral. In this work, we introduce a unified design space for Persistent-Homology driven learning in 3D point clouds (3DPHDL), formalizing the interplay between complex construction, filtration strategy, persistence representation, neural backbone, and prediction task. Beyond the canonical pipeline of diagram computation and vectorization, we identify six principled injection points through which topology can act as a structural inductive bias reshaping sampling, neighborhood graphs, optimization dynamics, self-supervision, output calibration, and even internal network regularization. We instantiate this framework through a controlled empirical study on ModelNet40 classification and ShapeNetPart segmentation, systematically augmenting representative backbones (PointNet, DGCNN, and Point Transformer) with persistence diagrams, images, and landscapes, and analyzing their impact on accuracy, robustness to noise and sampling variation, and computational scalability. Our results demonstrate consistent improvements in topology-sensitive discrimination and part consistency, while revealing meaningful trade-offs between representational expressiveness and combinatorial complexity. By viewing persistent homology not merely as an auxiliary feature but as a structured component within the learning pipeline, this work provides a systematic framework for incorporating topological reasoning into 3D point cloud learning.
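摘要中列举的持久性表示里,持久性景观(persistence landscape)的逐点取值可由各出生-死亡对的三角函数取第 k 大得到。以下为该标准定义的极简实现(示例 diagram 为假设数据):

```python
def landscape_value(diagram, t, k=1):
    """第 k 层持久性景观在 t 处的取值:
    每个 (b, d) 贡献三角函数 max(0, min(t - b, d - t)),再取第 k 大。"""
    vals = sorted((max(0.0, min(t - b, d - t)) for b, d in diagram), reverse=True)
    return vals[k - 1] if k <= len(vals) else 0.0

dgm = [(0.0, 4.0), (1.0, 3.0)]          # 两个持久性点 (出生, 死亡)
lam1 = landscape_value(dgm, 2.0, k=1)   # 第一层:max(2, 1) = 2
lam2 = landscape_value(dgm, 2.0, k=2)   # 第二层:1
```

这类向量化把离散的 persistence diagram 变为可直接送入神经网络主干的连续函数表示。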
[CV-71] Hierarchical Semantic Correlation-Aware Masked Autoencoder for Unsupervised Audio-Visual Representation Learning ICME2026
【速读】:该论文旨在解决从弱配对、无标签的多模态语料中学习对齐的多模态嵌入(multimodal embeddings)这一挑战,具体难点包括:预提取特征的限制、视频片段包含多个事件以及虚假共现关系。其解决方案的关键在于提出一种双路径教师-学生框架 HSC-MAE(Hierarchical Semantic Correlation-Aware Masked Autoencoder),通过在三个互补层次上强制语义一致性来实现:(i) 全局层面基于 DCCA(Deep Canonical Correlation Analysis)的规范几何相关性,将音视频嵌入映射到共享模态不变子空间;(ii) 局部层面基于教师挖掘的软 top-k 相似度的邻域语义相关性,保留语义相似实例间的多正例结构;(iii) 样本层面基于掩码自编码的条件充分性相关性,确保在部分观测下个体嵌入仍保留判别性语义内容。该方法通过学生路径进行掩码特征重建与加权软 top-k InfoNCE 训练,并由教师路径提供稳定的规范几何和软正例,从而实现高质量、结构化的音频-视觉表示学习。
链接: https://arxiv.org/abs/2604.04229
作者: Donghuo Zeng,Hao Niu,Masato Taya
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注: 6 pages, 2 tables, 4 figures. Accepted by IEEE ICME 2026
Abstract:Learning aligned multimodal embeddings from weakly paired, label-free corpora is challenging: pipelines often provide only pre-extracted features, clips contain multiple events, and spurious co-occurrences. We propose HSC-MAE (Hierarchical Semantic Correlation-Aware Masked Autoencoder), a dual-path teacher-student framework that enforces semantic consistency across three complementary levels of representation - from coarse to fine: (i) global-level canonical-geometry correlation via DCCA, which aligns audio and visual embeddings within a shared modality-invariant subspace; (ii) local-level neighborhood-semantics correlation via teacher-mined soft top-k affinities, which preserves multi-positive relational structure among semantically similar instances; and (iii) sample-level conditional-sufficiency correlation via masked autoencoding, which ensures individual embeddings retain discriminative semantic content under partial observation. Concretely, a student MAE path is trained with masked feature reconstruction and affinity-weighted soft top-k InfoNCE; an EMA teacher operating on unmasked inputs via the CCA path supplies stable canonical geometry and soft positives. Learnable multi-task weights reconcile competing objectives, and an optional distillation loss transfers teacher geometry into the student. Experiments on AVE and VEGAS demonstrate substantial mAP improvements over strong unsupervised baselines, validating that HSC-MAE yields robust and well-structured audio-visual representations.
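其中“教师挖掘的软 top-k 相似度”可如下示意:仅保留教师给出的相似度最高的 k 个候选并做 softmax,得到多正例权重(温度 tau 为假设值,非论文设定):

```python
import math

def soft_topk_weights(sims, k=2, tau=0.1):
    """sims: 教师对各候选样本的相似度。保留 top-k 并 softmax 归一化为软正例权重。"""
    idx = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    m = max(sims[i] for i in idx)
    exps = {i: math.exp((sims[i] - m) / tau) for i in idx}
    s = sum(exps.values())
    return {i: v / s for i, v in exps.items()}

w = soft_topk_weights([0.9, 0.2, 0.8, 0.1], k=2)   # 仅索引 0 和 2 进入正例集
```

这些权重随后作为 soft InfoNCE 中多正例的加权系数,保留语义相近实例间的关系结构,而非把除配对样本外的全部样本一律视为负例。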
[CV-72] DriveVA: Video Action Models are Zero-Shot Drivers
【速读】:该论文旨在解决自动驾驶中世界模型(world model)在未见场景、传感器域和环境条件下的泛化能力不足,以及规划与场景演化之间视频-轨迹一致性差的问题。解决方案的关键在于提出DriveVA,一种基于DiT(Diffusion Transformer)架构的联合解码框架,通过共享潜在生成过程同时预测未来视觉画面与动作序列(轨迹),从而实现规划与场景演化的紧密对齐;此外,引入视频续写策略以增强长时间滚动预测的一致性,显著提升了闭环性能与跨域泛化能力。
链接: https://arxiv.org/abs/2604.04198
作者: Mengmeng Liu,Diankun Zhang,Jiuming Liu,Jianfeng Cui,Hongwei Xie,Guang Chen,Hangjun Ye,Michael Ying Yang,Francesco Nex,Hao Cheng
机构: University of Twente (特温特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Generalization is a central challenge in autonomous driving, as real-world deployment requires robust performance under unseen scenarios, sensor domains, and environmental conditions. Recent world-model-based planning methods have shown strong capabilities in scene understanding and multi-modal future prediction, yet their generalization across datasets and sensor configurations remains limited. In addition, their loosely coupled planning paradigm often leads to poor video-trajectory consistency during visual imagination. To overcome these limitations, we propose DriveVA, a novel autonomous driving world model that jointly decodes future visual forecasts and action sequences in a shared latent generative process. DriveVA inherits rich priors on motion dynamics and physical plausibility from well-pretrained large-scale video generation models to capture continuous spatiotemporal evolution and causal interaction patterns. To this end, DriveVA employs a DiT-based decoder to jointly predict future action sequences (trajectories) and videos, enabling tighter alignment between planning and scene evolution. We also introduce a video continuation strategy to strengthen long-duration rollout consistency. DriveVA achieves an impressive closed-loop PDM score of 90.9 on the challenging NAVSIM benchmark. Extensive experiments also demonstrate the zero-shot capability and cross-domain generalization of DriveVA, which reduces average L2 error and collision rate by 78.9% and 83.3% on nuScenes and 52.5% and 52.4% on Bench2Drive (built on CARLA v2) compared with the state-of-the-art world-model-based planner.
[CV-73] Graphic-Design-Bench: A Comprehensive Benchmark for Evaluating AI on Graphic Design Tasks
【速读】:该论文旨在解决当前人工智能模型在专业图形设计任务中评估不足的问题,尤其是缺乏针对布局结构、字体准确性、图层编辑、矢量图形生成及动画逻辑等核心设计能力的系统性测评基准。解决方案的关键在于构建首个专注于专业图形设计全流程的综合性评测基准——GraphicDesignBench (GDB),其包含50项任务,覆盖布局(layout)、排版(typography)、信息图表(infographics)、模板语义(template design semantics)和动画(animation)五大维度,并在理解与生成两种模式下进行评估,所有任务均基于真实设计模板(来自LICA数据集)。通过标准化指标体系(涵盖空间精度、感知质量、文本保真度、语义对齐度和结构有效性),GDB为前沿闭源模型提供了可复现、高严谨性的性能测试平台,揭示了现有模型在复杂布局空间推理、矢量代码生成、细粒度字体感知和动画时序分解等方面的显著短板,从而推动AI向具备设计协作能力的方向发展。
链接: https://arxiv.org/abs/2604.04192
作者: Adrienne Deganutti,Elad Hirsch,Haonan Zhu,Jaejung Seol,Purvanshi Mehta
机构: Google(谷歌); OpenAI; Anthropic; LICA
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We introduce GraphicDesignBench (GDB), the first comprehensive benchmark suite designed specifically to evaluate AI models on the full breadth of professional graphic design tasks. Unlike existing benchmarks that focus on natural-image understanding or generic text-to-image synthesis, GDB targets the unique challenges of professional design work: translating communicative intent into structured layouts, rendering typographically faithful text, manipulating layered compositions, producing valid vector graphics, and reasoning about animation. The suite comprises 50 tasks organized along five axes: layout, typography, infographics, template design semantics and animation, each evaluated under both understanding and generation settings, and grounded in real-world design templates drawn from the LICA layered-composition dataset. We evaluate a set of frontier closed-source models using a standardized metric taxonomy covering spatial accuracy, perceptual quality, text fidelity, semantic alignment, and structural validity. Our results reveal that current models fall short on the core challenges of professional design: spatial reasoning over complex layouts, faithful vector code generation, fine-grained typographic perception, and temporal decomposition of animations remain largely unsolved. While high-level semantic understanding is within reach, the gap widens sharply as tasks demand precision, structure, and compositional awareness. GDB provides a rigorous, reproducible testbed for tracking progress toward AI systems that can function as capable design collaborators. The full evaluation framework is publicly available.
[CV-74] AURA: Always-On Understanding and Real-Time Assistance via Video Streams
【速读】:该论文旨在解决现有视频大语言模型(Video Large Language Models, VideoLLMs)在处理实时视频流时存在的局限性,即大多数系统为离线模式,难以支持连续观察与及时响应,且现有流式方案多依赖解耦的触发-响应流水线或仅限于字幕式叙述,限制了其在开放问题问答和长时交互中的有效性。解决方案的关键在于提出AURA(Always-On Understanding and Real-Time Assistance),这是一个端到端的流式视觉交互框架,通过整合上下文管理、数据构建、训练目标与部署优化,使统一的VideoLLM能够持续处理视频流并支持实时问答与主动响应,从而实现稳定、高效的长时交互能力。
链接: https://arxiv.org/abs/2604.04184
作者: Xudong Lu,Yang Bo,Jinpeng Chen,Shuhan Li,Xintong Guo,Huankang Guan,Fang Liu,Dunyuan Xu,Peiwen Sun,Heyang Sun,Rui Liu,Hongsheng Li
机构: Huawei Research (华为研究); CUHK MMLab (香港中文大学多媒体实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.
[CV-75] Scale-Aware Vision-Language Adaptation for Extreme Far-Distance Video Person Re-identification
【速读】:该论文旨在解决极端远距离视频行人重识别(Extreme far-distance video person re-identification)中的性能下降问题,其核心挑战包括尺度压缩、分辨率退化、运动模糊以及航拍与地面视角不匹配等。解决方案的关键在于:首先,将基于CLIP的基线模型视觉主干从ViT-B/16升级为更强大的ViT-L/14,并引入**骨干感知的选择性微调(backbone-aware selective fine-tuning)**以稳定大Transformer模型的适应过程;其次,设计一种轻量级的时间注意力池化机制,用于抑制低质量帧并增强信息丰富帧的权重;同时保留适配器驱动和提示条件交叉视图学习以缓解航拍到地面的域偏移,并通过改进优化策略与k-互惠重排序进一步提升检索精度。这些方法共同提升了模型在极端远距离场景下的鲁棒性与准确性。
链接: https://arxiv.org/abs/2604.04183
作者: Ashwat Rajbhandari,Bharatesh Chakravarthi
机构: Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Extreme far-distance video person re-identification (ReID) is particularly challenging due to scale compression, resolution degradation, motion blur, and aerial-ground viewpoint mismatch. As camera altitude and subject distance increase, models trained on close-range imagery degrade significantly. In this work, we investigate how large-scale vision-language models can be adapted to operate reliably under these conditions. Starting from a CLIP-based baseline, we upgrade the visual backbone from ViT-B/16 to ViT-L/14 and introduce backbone-aware selective fine-tuning to stabilize adaptation of the larger transformer. To address noisy and low-resolution tracklets, we incorporate a lightweight temporal attention pooling mechanism that suppresses degraded frames and emphasizes informative observations. We retain adapter-based and prompt-conditioned cross-view learning to mitigate aerial-ground domain shifts, and further refine retrieval using improved optimization and k-reciprocal re-ranking. Experiments on the DetReIDX stress-test benchmark show that our approach achieves mAP scores of 46.69 (A2G), 41.23 (G2A), and 22.98 (A2A), corresponding to an overall mAP of 35.73. These results show that large-scale vision-language backbones, when combined with stability-focused adaptation, significantly enhance robustness in extreme far-distance video person ReID.
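摘要中的“轻量级时间注意力池化”(抑制退化帧、突出信息帧)可示意为:对每帧打分后做 softmax,再对帧级特征加权平均(此处打分为给定标量,实际应由网络学习):

```python
import math

def temporal_attention_pool(frame_feats, scores):
    """frame_feats: 每帧特征向量列表;scores: 每帧的注意力打分。
    softmax(scores) 作为权重对帧特征加权平均,低分(退化)帧被抑制。"""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(frame_feats[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, frame_feats)) for d in range(dim)]
    return pooled, weights

feats = [[1.0, 0.0], [0.0, 1.0]]
pooled, weights = temporal_attention_pool(feats, [2.0, 0.0])   # 第 1 帧质量更高
```

在远距离低分辨率的 tracklet 中,运动模糊严重的帧会被赋予低分,从而不会稀释序列级表示。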
[CV-76] GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models
【速读】:该论文旨在解决科学文献中图形可视化表达的挑战,即如何让生成式 AI 模型(如视觉-语言模型)根据论文的标题、摘要、引言和图注等文本信息,自动生成能够清晰传达核心研究思想的图表。这一任务不仅要求模型具备图像生成能力,更需融合科学理解与视觉合成能力,实现从文本到结构化视觉内容的高保真映射。解决方案的关键在于构建 GENFIG1 基准数据集,该数据集由顶级深度学习会议论文中的高质量图文对组成,并引入与专家判断高度一致的自动化评估指标,从而系统性地评测模型在概念理解、关键信息提取和视觉叙事一致性方面的综合能力。
链接: https://arxiv.org/abs/2604.04172
作者: Yaohan Guan,Pristina Wang,Najim Dehak,Alan Yuille,Jieneng Chen,Daniel Khashabi
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:In many science papers, “Figure 1” serves as the primary visual summary of the core research idea. These figures are visually simple yet conceptually rich, often requiring significant effort and iteration by human authors to get right, highlighting the difficulty of science visual communication. With this intuition, we introduce GENFIG1, a benchmark for generative AI models (e.g., Vision-Language Models). GENFIG1 evaluates models for their ability to produce figures that clearly express and motivate the central idea of a paper (title, abstract, introduction, and figure caption) as input. Solving GENFIG1 requires more than producing visually appealing graphics: the task entails reasoning for text-to-image generation that couples scientific understanding with visual synthesis. Specifically, models must (i) comprehend and grasp the technical concepts of the paper, (ii) identify the most salient ones, and (iii) design a coherent and aesthetically effective graphic that conveys those concepts visually and is faithful to the input. We curate the benchmark from papers published at top deep-learning conferences, apply stringent quality control, and introduce an automatic evaluation metric that correlates well with expert human judgments. We evaluate a suite of representative models on GENFIG1 and demonstrate that the task presents significant challenges, even for the best-performing systems. We hope this benchmark serves as a foundation for future progress in multimodal AI.
[CV-77] Incomplete Multi-View Multi-Label Classification via Shared Codebook and Fused-Teacher Self-Distillation
【速读】:该论文旨在解决多视图多标签学习中双缺失(dual-missing)场景下的挑战,即同时存在视图缺失和标签缺失的情况。现有方法主要依赖对比学习或信息瓶颈理论来学习一致表示,但缺乏显式的结构约束,难以捕获稳定且判别性强的共享语义。其解决方案的关键在于引入一种更结构化的机制:通过多视图共享码本(shared codebook)和跨视图重建学习离散的一致表示,从而在有限的共享码本嵌入空间内自然对齐不同视图并减少特征冗余;同时,在决策层设计权重估计方法以评估各视图保留标签相关结构的能力,并据此加权融合预测结果;此外,提出融合教师自蒸馏框架(fused-teacher self-distillation),利用融合预测指导视图特定分类器训练,并将全局知识反馈至单视图分支,提升模型在标签缺失条件下的泛化能力。
链接: https://arxiv.org/abs/2604.04170
作者: Xu Yan,Jun Yin,Shiliang Sun,Minghua Wan
机构: Shanghai Maritime University (上海海事大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Although multi-view multi-label learning has been extensively studied, research on the dual-missing scenario, where both views and labels are incomplete, remains largely unexplored. Existing methods mainly rely on contrastive learning or information bottleneck theory to learn consistent representations under missing-view conditions, but loss-based alignment without explicit structural constraints limits the ability to capture stable and discriminative shared semantics. To address this issue, we introduce a more structured mechanism for consistent representation learning: we learn discrete consistent representations through a multi-view shared codebook and cross-view reconstruction, which naturally align different views within the limited shared codebook embeddings and reduce feature redundancy. At the decision level, we design a weight estimation method that evaluates the ability of each view to preserve label correlation structures, assigning weights accordingly to enhance the quality of the fused prediction. In addition, we introduce a fused-teacher self-distillation framework, where the fused prediction guides the training of view-specific classifiers and feeds the global knowledge back into the single-view branches, thereby enhancing the generalization ability of the model under missing-label conditions. The effectiveness of our proposed method is thoroughly demonstrated through extensive comparative experiments with advanced methods on five benchmark datasets. Code is available at this https URL.
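“多视图共享码本上的离散一致表示”的核心步骤——最近码字量化——可如下示意:不同视图的连续特征被映射到同一有限码本中最近的码字,从而在码本空间内自然对齐(码本与特征均为演示用假设数据):

```python
def quantize(feat, codebook):
    """返回与 feat 欧氏距离最近的码字索引,即其在共享码本上的离散表示"""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: d2(feat, codebook[i]))

codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]   # 多视图共享的码本
view_a = [0.9, 1.1]    # 视图 A 的特征
view_b = [1.2, 0.8]    # 视图 B 对同一样本的特征
code_a = quantize(view_a, codebook)
code_b = quantize(view_b, codebook)
```

两个视图的特征虽不相等,却落到同一码字上:有限码本充当了显式的结构约束,这正是摘要中区别于纯损失对齐的关键。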
[CV-78] Hierarchical Co-Embedding of Font Shapes and Impression Tags
【速读】:该论文旨在解决字体(font)与印象描述(impression description)之间对应关系的非一一映射问题,即某些印象可兼容多种字体风格,而另一些则强烈限制字体选择,这种约束强度的差异被称为风格特异性(style specificity)。为解决此问题,作者提出了一种双 entailment 约束的双曲共嵌入框架(hyperbolic co-embedding framework),其关键在于将字体图像和印象描述(单标签或标签集)嵌入到共享的双曲空间中,并引入两个互补的语义蕴含约束:从印象到字体的蕴含(impression-to-font entailment)以及印象间低到高风格特异性的蕴含(low-to-high style-specificity entailment)。该设计在几何上诱导出径向结构——低风格特异性印象靠近原点,高风格特异性印象远离原点,从而提供了一个可解释的几何度量来量化印象对字体风格的约束强度。
链接: https://arxiv.org/abs/2604.04158
作者: Yugo Kubota,Kaito Shiku,Seiichi Uchida
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Font shapes can evoke a wide range of impressions, but the correspondence between fonts and impression descriptions is not one-to-one: some impressions are broadly compatible with diverse styles, whereas others strongly constrain the set of plausible fonts. We refer to this graded constraint strength as style specificity. In this paper, we propose a hyperbolic co-embedding framework that models font–impression correspondence through entailment rather than simple paired alignment. Font images and impression descriptions, represented as single tags or tag sets, are embedded in a shared hyperbolic space with two complementary entailment constraints: impression-to-font entailment and low-to-high style-specificity entailment among impressions. This formulation induces a radial structure in which low style-specificity impressions lie near the origin and high style-specificity impressions lie farther away, yielding an interpretable geometric measure of how strongly an impression constrains font style. Experiments on the MyFonts dataset demonstrate improved bidirectional retrieval over strong one-to-one baselines. In addition, traversal and tag-level analyses show that the learned space captures a coherent progression from ambiguous to more style-specific impressions and provides a meaningful, data-driven quantification of style specificity.
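双曲共嵌入通常基于 Poincaré 球模型,其标准距离公式如下(摘要未给出具体公式,此处仅作示意)。低风格特异性印象靠近原点、高特异性远离原点的径向结构,可由到原点的双曲距离直接读出:

```python
import math

def poincare_dist(u, v):
    """Poincaré 球模型中的双曲距离(标准公式,非论文自定义)"""
    du = sum(x * x for x in u)
    dv = sum(x * x for x in v)
    duv = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1 + 2 * duv / ((1 - du) * (1 - dv))
    return math.acosh(arg)

origin = [0.0, 0.0]
low_spec = [0.2, 0.0]    # 低风格特异性印象:靠近原点
high_spec = [0.8, 0.0]   # 高风格特异性印象:远离原点
d_low = poincare_dist(origin, low_spec)
d_high = poincare_dist(origin, high_spec)
```

双曲空间的体积随半径指数增长,故远离原点处“容得下”更多彼此区分的高特异性印象,这与摘要所述蕴含层级天然契合。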
[CV-79] Uncertainty-Aware Test-Time Adaptation for Cross-Region Spatio-Temporal Fusion of Land Surface Temperature
【速读】:该论文旨在解决深度学习模型在遥感应用中因域偏移(domain shift)导致的泛化能力不足问题,尤其针对时空融合(spatio-temporal fusion, STF)任务中的地表温度估计回归任务。现有测试时适应(test-time adaptation, TTA)方法多适用于分类任务,难以直接迁移至回归场景。其解决方案的关键在于提出一种不确定性感知的TTA框架,仅更新预训练STF模型中的融合模块,通过认知不确定性(epistemic uncertainty)、土地利用/覆被一致性(land use and land cover consistency)以及偏差校正(bias correction)三个机制进行引导,无需源数据或目标标签样本即可实现有效适应。实验表明,在四个气候差异显著的目标区域(罗马、开罗、马德里和蒙彼利埃),该方法在均方根误差(RMSE)和平均绝对误差(MAE)上分别平均提升24.2%和27.9%,验证了其有效性与鲁棒性。
链接: https://arxiv.org/abs/2604.04153
作者: Sofiane Bouaziz,Adel Hafiane,Raphael Canals,Rachid Nedjai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to IGARSS 2026
Abstract:Deep learning models have shown great promise in diverse remote sensing applications. However, they often struggle to generalize across geographic regions unseen during training due to domain shifts. Domain shifts occur when data distributions differ between the training region and new target regions, due to variations in land cover, climate, and environmental conditions. Test-time adaptation (TTA) has emerged as a solution to such shifts, but existing methods are primarily designed for classification and are not directly applicable to regression tasks. In this work, we address the regression task of spatio-temporal fusion (STF) for land surface temperature estimation. We propose an uncertainty-aware TTA framework that updates only the fusion module of a pre-trained STF model, guided by epistemic uncertainty, land use and land cover consistency, and bias correction, without requiring source data or labeled target samples. Experiments on four target regions with diverse climates, namely Rome in Italy, Cairo in Egypt, Madrid in Spain, and Montpellier in France, show consistent improvements in RMSE and MAE for a pre-trained model in Orléans, France. The average gains are 24.2% and 27.9%, respectively, even with limited unlabeled target data and only 10 TTA epochs.
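“以认知不确定性引导的 TTA”可用 MC-Dropout 式的多次随机前向来示意:以多次预测的方差度量认知不确定性,并据此降低高不确定区域在适配损失中的权重(加权形式 1/(1+βσ²) 为假设,非论文原式):

```python
def epistemic_weight(predictions, beta=1.0):
    """predictions: 同一像素/样本的多次随机前向预测值。
    返回 (均值, 方差, 权重);方差越大(认知不确定性越高),适配权重越小。"""
    n = len(predictions)
    mean = sum(predictions) / n
    var = sum((p - mean) ** 2 for p in predictions) / n
    weight = 1.0 / (1.0 + beta * var)
    return mean, var, weight

# 稳定区域:多次预测一致 → 低不确定性、高权重
m1, v1, w1 = epistemic_weight([300.1, 300.0, 300.2])
# 域偏移区域:预测分散 → 高不确定性、低权重
m2, v2, w2 = epistemic_weight([295.0, 305.0, 310.0])
```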
[CV-80] OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
【速读】:该论文旨在解决基于流匹配(flow-matching)模型的生成式AI在使用GRPO(Group Relative Policy Optimization,组相对策略优化)进行后训练时存在的样本效率低下问题,其根本原因在于GRPO采用的在线策略(on-policy)训练范式导致大量轨迹无法复用。解决方案的关键在于提出OP-GRPO(Off-Policy GRPO),通过三个核心机制实现:首先,主动选择高质量轨迹并动态存入回放缓冲区以供后续迭代重用;其次,引入序列级重要性采样修正方法,在保留GRPO裁剪机制稳定性的同时缓解离策略样本带来的分布偏移;最后,理论与实证表明晚期去噪步骤会产生病态的离策略比值,因此通过截断轨迹至早期去噪步骤来缓解此问题。实验表明,OP-GRPO在图像和视频生成任务上仅需Flow-GRPO平均34.2%的训练步数即可达到相当或更优性能,显著提升训练效率且不牺牲生成质量。
链接: https://arxiv.org/abs/2604.04142
作者: Liyu Zhang,Kehan Li,Tingrui Han,Tao Zhao,Yuxuan Sheng,Shibo He,Chao Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Post-training via GRPO has demonstrated remarkable effectiveness in improving the generation quality of flow-matching models. However, GRPO suffers from inherently low sample efficiency due to its on-policy training paradigm. To address this limitation, we present OP-GRPO, the first Off-Policy GRPO framework tailored for flow-matching models. First, we actively select high-quality trajectories and adaptively incorporate them into a replay buffer for reuse in subsequent training iterations. Second, to mitigate the distribution shift introduced by off-policy samples, we propose a sequence-level importance sampling correction that preserves the integrity of GRPO’s clipping mechanism while ensuring stable policy updates. Third, we theoretically and empirically show that late denoising steps yield ill-conditioned off-policy ratios, and mitigate this by truncating trajectories at late steps. Across image and video generation benchmarks, OP-GRPO achieves comparable or superior performance to Flow-GRPO with only 34.2% of the training steps on average, yielding substantial gains in training efficiency while maintaining generation quality.
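“序列级重要性采样修正 + 保留 GRPO 裁剪机制”可用 PPO 风格的裁剪目标粗略示意(裁剪阈值 eps=0.2 为常见取值,属假设;对数概率与优势值均为演示数据):

```python
import math

def clipped_offpolicy_objective(logp_new, logp_old, advantage, eps=0.2):
    """序列级重要性比 r = exp(logp_new - logp_old);
    目标取 min(r*A, clip(r, 1-eps, 1+eps)*A),限制离策略样本的更新幅度。"""
    r = math.exp(logp_new - logp_old)
    r_clip = max(1 - eps, min(1 + eps, r))
    return min(r * advantage, r_clip * advantage), r

# 回放缓冲区中轻度离策略的轨迹:比值接近 1,未触发裁剪
obj_in, r_in = clipped_offpolicy_objective(-10.0, -10.1, advantage=1.0)
# 分布偏移较大的轨迹:比值被裁剪到 1 + eps,防止过大更新
obj_out, r_out = clipped_offpolicy_objective(-9.0, -10.0, advantage=1.0)
```

摘要第三点指出晚期去噪步骤上这一比值会变得病态(过大或过小),截断轨迹相当于直接丢弃这些步骤的比值项。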
[CV-81] Rethinking Exposure Correction for Spatially Non-uniform Degradation
【速读】:该论文旨在解决真实场景下图像曝光校正中普遍存在的空间非均匀退化问题,即同一幅图像内可能同时存在多种类型的曝光错误(如过曝、欠曝等),而现有方法多基于全局均匀假设,难以有效应对这种复杂性。其解决方案的关键在于提出一种面向空间非均匀性的全新校正范式:首先设计了一个空间信号编码器(Spatial Signal Encoder),用于预测空间自适应的调制权重,以指导多个查找表(Look-up Tables)进行局部化的图像变换;其次引入基于HSL的颜色保真补偿模块,提升色彩还原质量;此外,还提出了一种基于不确定性的非均匀损失函数,能够根据局部恢复不确定性动态调整优化焦点,从而更贴合真实曝光错误的空间异质特性。
链接: https://arxiv.org/abs/2604.04136
作者: Ao Li,Jiawei Sun,Le Dong,Zhenyu Wang,Weisheng Dong
机构: Xidian University (西安电子科技大学); Hangzhou Institute of Technology (杭州研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Real-world exposure correction is fundamentally challenged by spatially non-uniform degradations, where diverse exposure errors frequently coexist within a single image. However, existing exposure correction methods are still largely developed under a predominantly uniform assumption. Architecturally, they typically rely on globally aggregated modulation signals that capture only the overall exposure trend. From the optimization perspective, conventional reconstruction losses are usually derived under a shared global scale, thus overlooking the spatially varying correction demands across regions. To address these limitations, we propose a new exposure correction paradigm explicitly designed for spatial non-uniformity. Specifically, we introduce a Spatial Signal Encoder to predict spatially adaptive modulation weights, which are used to guide multiple look-up tables for image transformation, together with an HSL-based compensation module for improved color fidelity. Beyond the architectural design, we propose an uncertainty-inspired non-uniform loss that dynamically allocates the optimization focus based on local restoration uncertainties, better matching the heterogeneous nature of real-world exposure errors. Extensive experiments demonstrate that our method achieves superior qualitative and quantitative performance compared with state-of-the-art methods. Code is available at this https URL.
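“基于不确定性的非均匀损失”的一种常见形式(Kendall & Gal 风格的异方差加权,此处作为假设性示意而非论文原式)是用每区域可学习的对数方差对重建误差加权:

```python
import math

def nonuniform_loss(errors, log_vars):
    """errors: 各区域的重建误差;log_vars: 各区域可学习的对数方差 s_i。
    L = Σ_i [ exp(-s_i) * err_i + s_i ],高不确定区域的误差被自动降权,
    正则项 s_i 防止模型把所有区域都标为高不确定。"""
    return sum(math.exp(-s) * e + s for e, s in zip(errors, log_vars))

uniform = nonuniform_loss([1.0, 4.0], [0.0, 0.0])    # 等权:1 + 4 = 5
adaptive = nonuniform_loss([1.0, 4.0], [0.0, 1.5])   # 困难区域被降权
```

这使优化焦点随局部恢复难度而变,契合同一图像内过曝、欠曝并存的空间非均匀特性。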
[CV-82] NTIRE 2026 3D Restoration and Reconstruction in Real-world Adverse Conditions: RealX3D Challenge Results
【速读】:该论文旨在解决在极端低光照和烟雾退化等真实世界不利条件下,三维场景重建(3D Reconstruction)的鲁棒性问题。为应对这一挑战,作者构建了RealX3D基准数据集,并组织了NTIRE 2026 3D Restoration and Reconstruction (3DRR) 挑战赛,共吸引279名参与者、33支队伍提交有效结果。解决方案的关键在于识别出能够适应恶劣环境的鲁棒重建流程,研究发现表现优异的方法普遍遵循若干共享的设计原则,如多尺度特征融合、物理感知先验建模以及对退化模式的显式建模,这些策略显著提升了在复杂环境下的重建质量与稳定性。
链接: https://arxiv.org/abs/2604.04135
作者: Shuhong Liu,Chenyu Bao,Ziteng Cui,Xuangeng Chu,Bin Ren,Lin Gu,Xiang Chen,Mingrui Li,Long Ma,Marcos V. Conde,Radu Timofte,Yun Liu,Ryo Umagami,Tomohiro Hashimoto,Zijian Hu,Yuan Gan,Tianhan Xu,Yusuke Kurose,Tatsuya Harada,Junwei Yuan,Gengjia Chang,Xining Ge,Mache You,Qida Cao,Zeliang Li,Xinyuan Hu,Hongde Gu,Changyue Shi,Jiajun Ding,Zhou Yu,Jun Yu,Seungsang Oh,Fei Wang,Donggun Kim,Zhiliang Wu,Seho Ahn,Xinye Zheng,Kun Li,Yanyan Wei,Weisi Lin,Dizhe Zhang,Yuchao Chen,Meixi Song,Hanqing Wang,Haoran Feng,Lu Qi,Jiaao Shan,Yang Gu,Jiacheng Liu,Shiyu Liu,Kui Jiang,Junjun Jiang,Runyu Zhu,Sixun Dong,Qingxia Ye,Zhiqiang Zhang,Zhihua Xu,Zhiwei Wang,Phan The Son,Zhimiao Shi,Zixuan Guo,Xueming Fu,Lixia Han,Changhe Liu,Zhenyu Zhao,Manabu Tsukada,Zheng Zhang,Zihan Zhai,Tingting Li,Ziyang Zheng,Yuhao Liu,Dingju Wang,Jeongbin You,Younghyuk Kim,Il-Youp Kwak,Mingzhe Lyu,Junbo Yang,Wenhan Yang,Hongsen Zhang,Jinqiang Cui,Hong Zhang,Haojie Guo,Hantang Li,Qiang Zhu,Bowen He,Xiandong Meng,Debin Zhao,Xiaopeng Fan,Wei Zhou,Linzhe Jiang,Linfeng Li,Louzhe Xu,Qi Xu,Hang Song,Chenkun Guo,Weizhi Nie,Yufei Li,Xingan Zhan,Zhanqi Shi,Dufeng Zhang
机构: Nanjing University of Science and Technology; Korea University; Insta360 research; Harbin Institute of Technology; China University of Mining and Technology (Beijing); National University of Defense Technology; The University of Tokyo; Hefei University of Technology; Hefei Comprehensive National Science Center; Nanyang Technological University; United Arab Emirates University; Shanghai Jiao Tong University; Southern University of Science and Technology; Pengcheng Laboratory; Huazhong University of Science and Technology; National University of Singapore; Xi’an Jiaotong University; Fujian Normal University; Xidian University; Tianjin University; INSAIT; KAUST; Information Engineering University; Hangzhou Dianzi University; Harbin Institute of Technology (Shenzhen); Nanjing University of Aeronautics and Astronautics; University of Science and Technology of China; Intelligent Perception and Image Understanding Lab, Xidian University; Korea University; Chung-Ang University; Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory; Xidian University; Hunan University; Hainan University; Tianjin University; Hunan University; Harbin Institute of Technology (Shenzhen)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper presents a comprehensive review of the NTIRE 2026 3D Restoration and Reconstruction (3DRR) Challenge, detailing the proposed methods and results. The challenge seeks to identify reconstruction pipelines that remain robust under real-world adverse conditions, specifically extreme low-light and smoke-degraded environments, as captured by our RealX3D benchmark. A total of 279 participants registered for the competition, of whom 33 teams submitted valid results. We thoroughly evaluate the submitted approaches against state-of-the-art baselines, revealing significant progress in 3D reconstruction under adverse conditions. Our analysis highlights shared design principles among top-performing methods and provides insights into effective strategies for handling 3D scene degradation.
[CV-83] Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks
【速读】:该论文旨在解决当前CT(计算机断层扫描)基础模型在临床任务中应用时存在的两大问题:一是缺乏大规模的图像-文本配对数据以训练可靠的视觉-语言系统;二是下游任务通常依赖于对模型主干网络(backbone)的全量或部分微调,计算成本高且难以普及。解决方案的关键在于提出一种无需语言监督的3D CT基础模型VoxelFM,其通过自蒸馏(self-distillation)机制结合DINO框架学习语义丰富的视觉表征,从而实现仅用轻量级探测器(lightweight probes)即可在冻结主干网络的情况下高效迁移至多种下游任务。实验表明,VoxelFM在七类临床相关任务中表现优异,甚至优于显式进行语言对齐训练的模型,证明了高质量视觉特征提取比复杂视觉-语言建模更适用于当前CT基础模型的实际应用。
链接: https://arxiv.org/abs/2604.04133
作者: Rubén Moreno-Aguado,Alba Magallón,Victor Moreno,Yingying Fang,Guang Yang
机构: Imperial College London (帝国理工学院); University of Manchester (曼彻斯特大学); Catalan Institute of Oncology (加泰罗尼亚肿瘤研究所); Institut d’Investigació Biomèdica de Bellvitge (贝尔维特生物医学研究所); Consortium for Biomedical Research in Epidemiology and Public Health (流行病学与公共卫生生物医学研究联盟); University of Barcelona (巴塞罗那大学); King’s College London (伦敦国王学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:There is substantial interest in developing artificial intelligence systems to support radiologists across tasks ranging from segmentation to report generation. Existing computed tomography (CT) foundation models have largely focused on building generalist vision-language systems capable of tasks such as question answering and report generation. However, training reliable vision-language systems requires paired image-text data at a scale that remains unavailable in CT. Moreover, adapting the underlying visual representations to downstream tasks typically requires partial or full backbone fine-tuning, a computationally demanding process inaccessible to many research groups. Instead, foundation models should prioritise learning robust visual representations that enable efficient transfer to new tasks with minimal labelled data and without backbone fine-tuning. We present VoxelFM, a 3D CT foundation model trained with self-distillation using the DINO framework, which learns semantically rich features without language supervision. We evaluated VoxelFM across seven categories of clinically relevant downstream tasks using frozen backbone representations with lightweight probes: classification, regression, survival analysis, instance retrieval, localisation, segmentation, and report generation. VoxelFM matched or outperformed four existing CT foundation models across all task categories. Despite receiving no language supervision during pre-training, VoxelFM surpassed models explicitly trained with language-alignment objectives, including on report generation. Our results indicate that current CT foundation models perform significantly better as feature extractors for lightweight probes rather than as vision encoders for vision-language models. Model weights and training code are publicly available.
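The "frozen backbone + lightweight probe" protocol central to VoxelFM's evaluation can be sketched as a small linear softmax probe trained on fixed features: only the probe's weights are learned, never the backbone. This is a generic illustration of linear probing under that assumption, not VoxelFM's actual probe code; all names and the toy features below are made up for the example.

```python
import numpy as np

def train_linear_probe(feats, labels, n_classes, lr=0.5, steps=200):
    """Fit a linear softmax probe on frozen backbone features.
    The backbone is never updated; only W and b are learned."""
    d = feats.shape[1]
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = feats @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / len(feats)                      # softmax CE gradient
        W -= lr * feats.T @ G
        b -= lr * G.sum(axis=0)
    return W, b

# Toy "frozen features": two linearly separable clusters.
rng = np.random.default_rng(1)
f0 = rng.normal(-2.0, 0.3, (20, 4))
f1 = rng.normal(2.0, 0.3, (20, 4))
X = np.vstack([f0, f1])
y = np.array([0] * 20 + [1] * 20)
W, b = train_linear_probe(X, y, 2)
acc = ((X @ W + b).argmax(1) == y).mean()
```

The appeal of this setup is exactly the paper's point: adapting to a new task costs only a tiny gradient-descent loop over `W` and `b`, with no backbone fine-tuning.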
[CV-84] SARES-DEIM: Sparse Mixture-of-Experts Meets DETR for Robust SAR Ship Detection
【速读】:该论文旨在解决合成孔径雷达(SAR)图像中舰船检测面临的三大挑战:固有的相干斑点噪声、复杂的海岸杂波干扰以及小尺度目标的特征易丢失问题。传统基于光学图像设计的目标检测器在应对SAR特有的退化现象时鲁棒性不足,且在空间下采样过程中会损失细粒度的舰船特征。解决方案的关键在于提出SARES-DEIM框架,其核心创新包括两个模块:一是SARESMoE(SAR-aware Expert Selection Mixture-of-Experts),通过稀疏门控机制将特征路由至专门处理频域与小波域的专家网络,实现对斑点噪声和语义杂波的有效过滤并保持高计算效率;二是Space-to-Depth Enhancement Pyramid(SDEP)颈部结构,用于保留浅层阶段的高分辨率空间信息,显著提升小目标定位精度。
链接: https://arxiv.org/abs/2604.04127
作者: Fenghao Song,Shaojing Yang,Xi Zhou
机构: Yunnan Normal University (云南师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures, published to JSTARS(IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing)
Abstract:Ship detection in Synthetic Aperture Radar (SAR) imagery is fundamentally challenged by inherent coherent speckle noise, complex coastal clutter, and the prevalence of small-scale targets. Conventional detectors, primarily designed for optical imagery, often exhibit limited robustness against SAR-specific degradation and suffer from the loss of fine-grained ship signatures during spatial downsampling. To address these limitations, we propose SARES-DEIM, a domain-aware detection framework grounded in the DEtection TRansformer (DETR) paradigm. Central to our approach is SARESMoE (SAR-aware Expert Selection Mixture-of-Experts), a module leveraging a sparse gating mechanism to selectively route features toward specialized frequency and wavelet experts. This sparsely-activated architecture effectively filters speckle noise and semantic clutter while maintaining high computational efficiency. Furthermore, we introduce the Space-to-Depth Enhancement Pyramid (SDEP) neck to preserve high-resolution spatial cues from shallow stages, significantly improving the localization of small targets. Extensive experiments on two benchmark datasets demonstrate the superiority of SARES-DEIM. Notably, on the challenging HRSID dataset, our model achieves a mAP50:95 of 76.4% and a mAP50 of 93.8%, outperforming state-of-the-art YOLO-series and specialized SAR detectors.
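The sparse gating idea behind SARESMoE (score every expert with a gate, but run only the top-k) can be sketched as follows. The expert functions and gate weights here are toy stand-ins for illustration, not the paper's frequency/wavelet experts, and `sparse_moe` is a hypothetical name.

```python
import math

def sparse_moe(x, experts, gate_weights, k=2):
    """Route input x through the top-k experts chosen by a linear gate.
    Only the selected experts are evaluated, keeping compute sparse;
    their outputs are mixed with softmax-renormalized gate scores."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in gate_weights]
    topk = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exp_s = [math.exp(scores[i]) for i in topk]
    z = sum(exp_s)
    out = [0.0] * len(x)
    for i, e in zip(topk, exp_s):
        y = experts[i](x)                        # only top-k experts run
        out = [o + (e / z) * yi for o, yi in zip(out, y)]
    return out, topk

experts = [
    lambda v: [2 * a for a in v],    # toy "frequency" expert
    lambda v: [-a for a in v],       # toy "wavelet" expert
    lambda v: [a + 1 for a in v],    # extra expert, left inactive here
]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [1.0, 0.0]
out, topk = sparse_moe(x, experts, gate_weights, k=2)
```

The key property is that the third expert is never evaluated for this input, which is what makes the mixture "sparsely activated".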
[CV-85] Efficient Onboard Spacecraft Pose Estimation with Event Cameras and Neuromorphic Hardware CVPR2026
【速读】:该论文旨在解决航天器在复杂空间环境中进行高精度六自由度(6-DoF)位姿估计的难题,尤其针对传统帧式图像在极端光照、高对比度和快速目标运动下易饱和或模糊的问题。解决方案的关键在于将事件相机(event camera)的异步变化驱动数据与BrainChip Akida类脑神经形态处理器相结合,构建端到端的低延迟、低功耗感知流水线:首先基于SPADES数据集训练轻量级MobileNet风格的关键点回归网络,采用量化感知训练(8/4比特)并转换为Akida兼容的脉冲神经网络(spiking neural network, SNN),进而实现在Akida V1硬件上的实时推理;同时设计基于热图的模型用于Akida V2,并在Akida Cloud上验证其更高的位姿估计精度。这一方案首次实现了在类脑硬件上完成航天器位姿估计的全流程部署,为未来自主空间任务提供了可行的低功耗感知路径。
链接: https://arxiv.org/abs/2604.04117
作者: Arunkumar Rathinam,Jules Lecomte,Jost Reelsen,Gregor Lenz,Axel von Arnim,Djamila Aouada
机构: University of Luxembourg (卢森堡大学); Fortiss GmbH (Fortiss有限公司); Technical University of Munich (慕尼黑工业大学); Paddington Robotics (帕丁顿机器人公司)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: AI4SPACE workshop at CVPR 2026
Abstract:Reliable relative pose estimation is a key enabler for autonomous rendezvous and proximity operations, yet space imagery is notoriously challenging due to extreme illumination, high contrast, and fast target motion. Event cameras provide asynchronous, change-driven measurements that can remain informative when frame-based imagery saturates or blurs, while neuromorphic processors can exploit sparse activations for low-latency, energy-efficient inference. This paper presents a spacecraft 6-DoF pose-estimation pipeline that couples event-based vision with the BrainChip Akida neuromorphic processor. Using the SPADES dataset, we train compact MobileNet-style keypoint regression networks on lightweight event-frame representations, apply quantization-aware training (8/4-bit), and convert the models to Akida-compatible spiking neural networks. We benchmark three event representations and demonstrate real-time, low-power inference on Akida V1 hardware. We additionally design a heatmap-based model targeting Akida V2 and evaluate it on Akida Cloud, yielding improved pose accuracy. To our knowledge, this is the first end-to-end demonstration of spacecraft pose estimation running on Akida hardware, highlighting a practical route to low-latency, low-power perception for future autonomous space missions.
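A lightweight event-frame representation of the kind fed to the keypoint networks can be sketched by accumulating signed polarity events into a 2D grid over a time window. This is a generic accumulation sketch under that assumption, not the paper's actual preprocessing; the event tuples below are toy data.

```python
def accumulate_events(events, h, w, t0, t1):
    """Accumulate polarity events (t, x, y, p) within [t0, t1) into a
    signed 2D event frame: +1 per positive event, -1 per negative one.
    Such frames give a dense input a conventional CNN can consume."""
    frame = [[0] * w for _ in range(h)]
    for t, x, y, p in events:
        if t0 <= t < t1:
            frame[y][x] += 1 if p > 0 else -1
    return frame

# Toy event stream: two positive events at (x=2, y=1), one negative at
# the origin, and one event arriving after the accumulation window.
evts = [(0.001, 2, 1, 1), (0.002, 2, 1, 1), (0.003, 0, 0, -1), (0.020, 3, 3, 1)]
f = accumulate_events(evts, 4, 4, 0.0, 0.010)
```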
[CV-86] Hypothesis Graph Refinement: Hypothesis-Driven Exploration with Cascade Error Correction for Embodied Navigation
【速读】:该论文旨在解决具身智能体在部分可观测环境中进行长期导航时,因依赖不准确的语义预测而导致记忆结构错误累积的问题。现有基于图的导航系统通常将未探索区域视为语义未知,导致探索效率低下;而视觉语言模型(VLM)虽能提供前沿语义预测,但错误预测一旦被纳入记忆便会传播并引发结构性误差,仅靠置信度衰减无法有效纠正。解决方案的关键在于提出假设图精化(Hypothesis Graph Refinement, HGR)框架,其核心机制包括:(1)语义假设模块,通过上下文感知的语义分布估计与目标相关性、旅行成本及不确定性综合排序探索目标;(2)验证驱动的级联修正机制,在现场观测与预测语义不一致时,回溯并删除被证伪的假设节点及其所有下游依赖节点,从而实现图结构的收缩与错误修正,保持长期记忆可靠性。
链接: https://arxiv.org/abs/2604.04108
作者: Peixin Chen,Guoxi Zhang,Jianwei Ma,Qing Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Embodied agents must explore partially observed environments while maintaining reliable long-horizon memory. Existing graph-based navigation systems improve scalability, but they often treat unexplored regions as semantically unknown, leading to inefficient frontier search. Although vision-language models (VLMs) can predict frontier semantics, erroneous predictions may be embedded into memory and propagate through downstream inferences, causing structural error accumulation that confidence attenuation alone cannot resolve. These observations call for a framework that can leverage semantic predictions for directed exploration while systematically retracting errors once new evidence contradicts them. We propose Hypothesis Graph Refinement (HGR), a framework that represents frontier predictions as revisable hypothesis nodes in a dependency-aware graph memory. HGR introduces (1) semantic hypothesis module, which estimates context-conditioned semantic distributions over frontiers and ranks exploration targets by goal relevance, travel cost, and uncertainty, and (2) verification-driven cascade correction, which compares on-site observations against predicted semantics and, upon mismatch, retracts the refuted node together with all its downstream dependents. Unlike additive map-building, this allows the graph to contract by pruning erroneous subgraphs, keeping memory reliable throughout long episodes. We evaluate HGR on multimodal lifelong navigation (GOAT-Bench) and embodied question answering (A-EQA, EM-EQA). HGR achieves 72.41% success rate and 56.22% SPL on GOAT-Bench, and shows consistent improvements on both QA benchmarks. Diagnostic analysis reveals that cascade correction eliminates approximately 20% of structurally redundant hypothesis nodes and reduces revisits to erroneous regions by 4.5x, with specular and transparent surfaces accounting for 67% of corrected prediction errors.
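The verification-driven cascade correction can be illustrated as a transitive deletion over a dependency graph: when on-site observation refutes a hypothesis node, every downstream node that depended on it is retracted with it, so the graph contracts rather than only growing. A minimal sketch with a hypothetical graph (the node names are invented for the example):

```python
def retract(graph, node):
    """Remove a refuted hypothesis node and, transitively, every
    downstream dependent (cascade correction).
    `graph` maps node -> list of dependent child nodes."""
    removed = set()
    stack = [node]
    while stack:
        n = stack.pop()
        if n in removed:
            continue
        removed.add(n)
        stack.extend(graph.get(n, []))
    for n in removed:                       # prune the refuted subgraph
        graph.pop(n, None)
    for children in graph.values():         # drop dangling edges
        children[:] = [c for c in children if c not in removed]
    return removed

# Hypothesis "h1" spawned two dependent sub-hypotheses, then got refuted.
g = {"root": ["h1"], "h1": ["h2", "h3"], "h2": [], "h3": [], "other": []}
gone = retract(g, "h1")
```

Note that `root` and `other` survive with their edges cleaned, which is the memory-reliability property the framework is after.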
[CV-87] A Physics-Informed Behavior-Aware Digital Twin for Robust Multimodal Forecasting of Core Body Temperature in Precision Livestock Farming
【速读】:该论文旨在解决精准畜牧业中动物热应激(heat stress)预测的准确性与及时性问题,以保障动物福利并优化养殖管理。其核心解决方案是构建一个融合物理信息的数字孪生(digital twin, DT)框架,并结合不确定性感知、专家加权的堆叠集成模型进行多模态生理温度(Core Body Temperature, CBT)预测。关键在于:1)DT通过常微分方程(ODE)建模代谢产热与散热过程,引入高斯过程捕捉个体差异、卡尔曼滤波实现数据同化、马尔可夫链模拟行为状态转移;2)将DT输出的生理指标与原始传感器数据融合,经多尺度时序分析和跨模态特征工程形成综合特征集;3)采用三阶段堆叠集成策略,第一阶段训练不同模态的LightGBM“专家”模型,第二阶段提取元特征,第三阶段使用Optuna调参的LightGBM元模型生成最终CBT预测,并通过自助法量化预测不确定性,验证其覆盖概率(PICP)。该方法在2小时前瞻预测中达到R²=0.783、F1=84.25%、PICP=92.38%,实现了物理机制驱动与数据驱动协同的鲁棒、可解释且具备不确定性估计的热应激预警系统。
链接: https://arxiv.org/abs/2604.04098
作者: Riasad Alvi,Mohaimenul Azam Khan Raiaan,Sadia Sultana Chowa,Arefin Ittesafun Abian,Reem E Mohamed,Md Rafiqul Islam,Yakub Sebastian,Sheikh Izzal Azid,Sami Azam
机构: Applied Artificial INtelligence and Intelligent Systems (AAIINS) Laboratory, Dhaka 1217, Bangladesh; Department of Computer Science and Engineering, United International University, Dhaka 1212, Bangladesh; Department of Data Science and Artificial Intelligence, Monash University, Melbourne 3800, Australia; Department of Software Systems Cybersecurity, Monash University, Melbourne 3800, Australia; Energy and Resources Institute, Faculty of Science and Technology, Charles Darwin University, NSW 2000, Australia; Faculty of Science and Technology, Charles Darwin University, Casuarina, NT, 0810 Australia; School of Engineering and Energy, Murdoch University, Murdoch, WA 6150, Australia
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Precision livestock farming requires accurate and timely heat stress prediction to ensure animal welfare and optimize farm management. This study presents a physics-informed digital twin (DT) framework combined with an uncertainty-aware, expert-weighted stacked ensemble for multimodal forecasting of Core Body Temperature (CBT) in dairy cattle. Using the high-frequency, heterogeneous MmCows dataset, the DT integrates an ordinary differential equation (ODE)-based thermoregulation model that simulates metabolic heat production and dissipation, a Gaussian process for capturing cow-specific deviations, a Kalman filter for aligning predictions with real-time sensor data, and a behavioral Markov chain that models activity-state transitions under varying environmental conditions. The DT's key physiological outputs, such as predicted CBT, heat stress probability, and behavioral state distributions, are fused with raw sensor data and enriched through multi-scale temporal analysis and cross-modal feature engineering to form a comprehensive feature set. The predictive methodology is designed as a three-stage stacked ensemble: stage 1 trains modality-specific LightGBM ‘expert’ models on distinct feature groups, stage 2 collects their predictions as meta-features, and stage 3 applies an Optuna-tuned LightGBM meta-model to yield the final CBT forecast. Predictive uncertainty is quantified via bootstrapping and validated using Prediction Interval Coverage Probability (PICP). Ablation analysis confirms that incorporating DT-derived features and multimodal fusion substantially enhances performance. The proposed framework achieves a cross-validated R² of 0.783, F1 score of 84.25% and PICP of 92.38% for 2-hour ahead forecasting, providing a robust, uncertainty-aware, and physically principled system for early heat stress detection and precision livestock management.
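The Kalman-filter step that aligns the ODE model's CBT prediction with real-time sensor data reduces, in the scalar case, to a gain-weighted blend of prediction and measurement. A minimal sketch of that standard update (the variances here are illustrative, not the paper's calibrated values):

```python
def kalman_update(x_pred, P_pred, z, R):
    """One scalar Kalman update: fuse the model's CBT prediction
    (mean x_pred, variance P_pred) with a sensor reading z of
    variance R. The gain K weights the measurement residual."""
    K = P_pred / (P_pred + R)          # Kalman gain in [0, 1]
    x = x_pred + K * (z - x_pred)      # corrected state estimate
    P = (1 - K) * P_pred               # reduced posterior variance
    return x, P, K

# Model predicts 38.9 C with high variance; sensor reads 39.3 C
# with lower variance, so the estimate is pulled toward the sensor.
x, P, K = kalman_update(38.9, 0.16, 39.3, 0.04)
```

Because the posterior variance shrinks with every assimilation step, the DT's CBT trajectory stays anchored to observations without discarding the physics prior.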
[CV-88] LAA-X: Unified Localized Artifact Attention for Quality-Agnostic and Generalizable Face Forgery Detection CVPR2024
【速读】:该论文旨在解决深度伪造(deepfake)检测中模型泛化能力不足的问题,即现有方法在面对高质量伪造内容或未见过的篡改类型时性能显著下降。其关键解决方案是提出一种显式注意力机制——Localized Artifact Attention X (LAA-X),该框架通过多任务学习设计辅助任务,引导模型聚焦于易产生伪造痕迹的局部区域;同时结合基于混合的数据合成策略增强训练多样性,从而提升对未知篡改方式的鲁棒性与泛化能力。LAA-X兼容CNN和Transformer骨干网络,形成LAA-Net与LAA-Former两个版本,在多个基准测试上达到或超越当前最优水平。
链接: https://arxiv.org/abs/2604.04086
作者: Dat Nguyen,Enjie Ghorbel,Anis Kacem,Marcella Astrid,Djamila Aouada
机构: CVI2, SnT, University of Luxembourg, Luxembourg (CVI2, SnT, 卢森堡大学, 卢森堡); CRISTAL laboratory, ENSI, University of Manouba, Tunisia (CRISTAL 实验室, ENSI, 马努巴大学, 突尼斯)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Journal version of LAA-Net (CVPR 2024)
Abstract:In this paper, we propose Localized Artifact Attention X (LAA-X), a novel deepfake detection framework that is both robust to high-quality forgeries and capable of generalizing to unseen manipulations. Existing approaches typically rely on binary classifiers coupled with implicit attention mechanisms, which often fail to generalize beyond known manipulations. In contrast, LAA-X introduces an explicit attention strategy based on a multi-task learning framework combined with blending-based data synthesis. Auxiliary tasks are designed to guide the model toward localized, artifact-prone (i.e., vulnerable) regions. The proposed framework is compatible with both CNN and transformer backbones, resulting in two different versions, namely, LAA-Net and LAA-Former, respectively. Despite being trained only on real and pseudo-fake samples, LAA-X competes with state-of-the-art methods across multiple benchmarks. Code and pre-trained weights for LAA-Net (this https URL) and LAA-Former (this https URL) are publicly available.
[CV-89] Intelligent Traffic Monitoring with YOLOv11: A Case Study in Real-Time Vehicle Detection
【速读】:该论文旨在解决传统交通监控系统在实时性、准确性及部署灵活性方面的不足,特别是在无云依赖环境下实现高效车辆检测与计数的问题。其解决方案的关键在于构建一个离线、实时的交通监控系统,该系统结合预训练的YOLOv11目标检测模型与BoT-SORT/ByteTrack多目标跟踪算法,在PyTorch/OpenCV框架下实现,并通过Qt开发桌面用户界面以提升可用性。该方案利用轻量级深度学习模型和本地化处理流程,在多样场景中实现了高达95.83%的计数准确率和高F1分数(车辆类0.90–1.00,卡车类0.82–1.00),验证了其在典型天气条件下具备良好的鲁棒性和实用性,为智慧城市中的AI驱动交通管理提供了可落地的技术路径。
链接: https://arxiv.org/abs/2604.04080
作者: Shkelqim Sherifi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 2025 International Conference on Computer and Applications (ICCA)
Abstract:Recent advancements in computer vision, driven by artificial intelligence, have significantly enhanced monitoring systems. One notable application is traffic monitoring, which leverages computer vision alongside deep learning-based object detection and counting. We present an offline, real-time traffic monitoring system that couples a pre-trained YOLOv11 detector with BoT-SORT/ByteTrack for multi-object tracking, implemented in PyTorch/OpenCV and wrapped in a Qt-based desktop UI. The CNN pipeline enables efficient vehicle detection and counting from video streams without cloud dependencies. Across diverse scenes, the system achieves 66.67-95.83% counting accuracy. Class-wise detection yields high precision (cars: 0.97-1.00; trucks: 1.00) with strong recall (cars: 0.82-1.00; trucks: 0.70-1.00), resulting in F1 scores of 0.90-1.00 for cars and 0.82-1.00 for trucks. While adverse weather conditions may negatively impact this performance, results remain robust in typical conditions. By integrating lightweight models with an accessible, cloud-independent interface, this paper contributes to the modernization and development of future smart cities by showing the capacity of AI-driven traffic monitoring systems.
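On top of a tracker such as ByteTrack, vehicle counting is typically implemented as once-per-track crossing of a virtual line. The sketch below assumes per-track centroid y-coordinates over time; it is a generic illustration of that counting step, not the paper's exact implementation, and all names are hypothetical.

```python
def count_line_crossings(tracks, line_y):
    """Count tracked vehicles whose centroid crosses a virtual line
    (moving downward in image coordinates), at most once per track ID.
    `tracks` maps track_id -> list of centroid y-positions over time."""
    counted = set()
    count = 0
    for track_id, ys in tracks.items():
        for a, b in zip(ys, ys[1:]):
            if track_id not in counted and a < line_y <= b:
                counted.add(track_id)
                count += 1
    return count

# Track 1 crosses y=50, track 2 never reaches it, track 3 starts past it.
tracks = {1: [10, 40, 80], 2: [5, 20, 30], 3: [60, 90, 120]}
n = count_line_crossings(tracks, 50)
```

Deduplicating per track ID is what prevents a vehicle lingering near the line from being counted more than once.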
[CV-90] Detecting Media Clones in Cultural Repositories Using a Positive Unlabeled Learning Approach
【速读】:该论文旨在解决文物数字资源库(AtticPOT)中跨记录重复项(cross-record duplicates)的自动发现问题,尤其在缺乏明确负样本的情况下实现高效、可解释的去重。其核心解决方案是将该任务建模为正例-未标记(Positive-Unlabeled, PU)学习问题:利用每个文物的单个锚点样本,通过增强视图训练轻量级的Clone Encoder,并基于潜在空间中的l₂范数设定可解释阈值对整个库进行评分,从而筛选出需人工审核的候选重复项。该方法避免了显式负样本的构建,同时提供透明的操作点,适用于 curator-in-the-loop 的工作流,显著优于传统方法(如SVDD),在AtticPOT数据集上F1提升7.70点至90.79。
链接: https://arxiv.org/abs/2604.04071
作者: V. Sevetlidis,V. Arampatzakis,M. Karta,I. Mourthos,D. Tsiafaki,G. Pavlidis
机构: Archimedes; ATHENA RC
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CAA 2026 International Conference
Abstract:We formulate curator-in-the-loop duplicate discovery in the AtticPOT repository as a Positive-Unlabeled (PU) learning problem. Given a single anchor per artefact, we train a lightweight per-query Clone Encoder on augmented views of the anchor and score the unlabeled repository with an interpretable threshold on the latent l_2 norm. The system proposes candidates for curator verification, uncovering cross-record duplicates that were not verified a priori. On CIFAR-10 we obtain F1=96.37 (AUROC=97.97); on AtticPOT we reach F1=90.79 (AUROC=98.99), improving F1 by +7.70 points over the best baseline (SVDD) under the same lightweight backbone. Qualitative “find-similar” panels show stable neighbourhoods across viewpoint and condition. The method avoids explicit negatives, offers a transparent operating point, and fits de-duplication, record linkage, and curator-in-the-loop workflows.
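The interpretable scoring rule, a threshold on the latent l2 norm, can be sketched as follows, assuming the Clone Encoder is trained so that augmented views of the anchor map near the origin (so small-norm items become clone candidates for curator review). That direction, and all names and vectors below, are assumptions for illustration, not the paper's code.

```python
import math

def score_candidates(latents, tau):
    """Flag repository items whose latent l2 norm falls at or below
    tau as clone candidates, sorted by norm (most anchor-like first).
    `latents` maps item id -> latent vector."""
    flags = []
    for item_id, v in latents.items():
        norm = math.sqrt(sum(c * c for c in v))
        if norm <= tau:
            flags.append((item_id, norm))
    return sorted(flags, key=lambda t: t[1])

# Toy latents: "c" and "a" sit near the origin, "b" is far away.
latents = {"a": [0.1, 0.2], "b": [3.0, 4.0], "c": [0.05, 0.0]}
candidates = score_candidates(latents, 1.0)
```

The threshold `tau` is the transparent operating point the abstract refers to: raising it trades curator workload for recall.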
[CV-91] 4C4D: 4 Camera 4D Gaussian Splatting CVPR2026
【速读】:该论文旨在解决从极稀疏视角(仅需4台便携式相机)视频中恢复4D动态场景的问题,以实现时序一致的新型视角渲染。此前方法通常依赖于密集多视角采集(数十甚至上百个相机视点),而本文提出4C4D框架,通过引入一种基于神经衰减函数(Neural Decaying Function)的高斯不透明度建模机制,显著提升了4D高斯溅射(4D Gaussian Splatting, 4DGS)在几何建模方面的性能。其核心创新在于认识到在稀疏设置下几何学习比外观建模更具挑战性,并通过该设计引导梯度更聚焦于几何信息的学习,从而有效缓解了4DGS中几何与外观建模之间的固有失衡问题。
链接: https://arxiv.org/abs/2604.04063
作者: Junsheng Zhou,Zhifan Yang,Liang Han,Wenyuan Zhang,Kanle Shi,Shenkun Xu,Yu-Shen Liu
机构: Tsinghua University (清华大学); Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026. Project page: this https URL
Abstract:This paper tackles the challenge of recovering 4D dynamic scenes from videos captured by as few as four portable cameras. Learning to model scene dynamics for temporally consistent novel-view rendering is a foundational task in computer graphics, where previous works often require dense multi-view captures using camera arrays of dozens or even hundreds of views. We propose 4C4D, a novel framework that enables high-fidelity 4D Gaussian Splatting from video captures of extremely sparse cameras. Our key insight is that geometric learning under sparse settings is substantially more difficult than modeling appearance. Driven by this observation, we introduce a Neural Decaying Function on Gaussian opacities for enhancing the geometric modeling capability of 4D Gaussians. This design mitigates the inherent imbalance between geometry and appearance modeling in 4DGS by encouraging the 4DGS gradients to focus more on geometric learning. Extensive experiments across sparse-view datasets with varying camera overlaps show that 4C4D achieves superior performance over prior art. Project page at: this https URL.
[CV-92] DINO-VO: Learning Where to Focus for Enhanced State Estimation
【速读】:该论文旨在解决传统单目视觉里程计(Visual Odometry, VO)系统在复杂场景下因依赖启发式特征提取策略而导致精度和鲁棒性下降的问题,尤其是在大规模室外环境中表现不佳。其核心解决方案是提出一种端到端的单目视觉里程计系统——DINO Patch Visual Odometry (DINO-VO),关键创新在于引入一个可微分的自适应补丁选择器(differentiable adaptive patch selector),以提升所提取补丁的质量,并结合多任务特征提取模块与可微分束调整(differentiable bundle adjustment, BA)模块,利用逆深度先验来协同学习外观与几何信息,从而实现特征学习与状态估计的有效融合,显著增强系统在合成、室内及室外等多种场景下的泛化能力。
链接: https://arxiv.org/abs/2604.04055
作者: Qi Chen,Guanghao Li,Sijia Hu,Xin Gao,Junpeng Ma,Xiangyang Xue,Jian Pu
机构: Fudan University(复旦大学); Shanghai Innovation Institute(上海创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:We present DINO Patch Visual Odometry (DINO-VO), an end-to-end monocular visual odometry system with strong scene generalization. Current Visual Odometry (VO) systems often rely on heuristic feature extraction strategies, which can degrade accuracy and robustness, particularly in large-scale outdoor environments. DINO-VO addresses these limitations by incorporating a differentiable adaptive patch selector into the end-to-end pipeline, improving the quality of extracted patches and enhancing generalization across diverse datasets. Additionally, our system integrates a multi-task feature extraction module with a differentiable bundle adjustment (BA) module that leverages inverse depth priors, enabling the system to learn and utilize appearance and geometric information effectively. This integration bridges the gap between feature learning and state estimation. Extensive experiments on the TartanAir, KITTI, EuRoC, and TUM datasets demonstrate that DINO-VO exhibits strong generalization across synthetic, indoor, and outdoor environments, achieving state-of-the-art tracking accuracy.
[CV-93] ORA: Topological Representation Alignment for 3D Shape Assembly
【速读】:该论文旨在解决3D形状组装中流匹配(flow-matching)方法缺乏显式跨部件交互引导的问题,即模型在学习点级速度场以推动部件向装配配置移动时,未能有效利用部件间的拓扑关系信息。解决方案的关键在于提出一种拓扑优先的表示对齐框架TORA(Topology-First Representation Alignment),其核心机制是在训练过程中将冻结的预训练3D编码器(teacher)所提取的几何结构关系蒸馏到流匹配主干网络(student)中:首先通过token-wise余弦匹配注入几何描述符,进而引入中心核对齐(Centered Kernel Alignment, CKA)损失来对齐学生与教师表示之间的相似性结构,从而增强拓扑一致性。实验表明,这种基于几何和接触特性的对齐策略在Transformer后期层效果最佳,且不增加推理开销,显著提升收敛速度(最高达6.9倍)和分布内精度,并增强域迁移下的鲁棒性。
链接: https://arxiv.org/abs/2604.04050
作者: Nahyuk Lee,Zhiang Chen,Marc Pollefeys,Sunghwan Hong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Flow-matching methods for 3D shape assembly learn point-wise velocity fields that transport parts toward assembled configurations, yet they receive no explicit guidance about which cross-part interactions should drive the motion. We introduce TORA, a topology-first representation alignment framework that distills relational structure from a frozen pretrained 3D encoder into the flow-matching backbone during training. We first realize this via a simple instantiation, token-wise cosine matching, which injects the learned geometric descriptors from the teacher representation. We then extend this by employing a Centered Kernel Alignment (CKA) loss to match the similarity structure between student and teacher representations for enhanced topological alignment. Through systematic probing of diverse 3D encoders, we show that geometry- and contact-centric teacher properties, not semantic classification ability, govern alignment effectiveness, and that alignment is most beneficial at later transformer layers where spatial structure naturally emerges. TORA introduces zero inference overhead while yielding two consistent benefits: faster convergence (up to 6.9x) and improved accuracy in-distribution, along with greater robustness under domain shift. Experiments on five benchmarks spanning geometric, semantic, and inter-object assembly demonstrate state-of-the-art performance, with particularly pronounced gains in zero-shot transfer to unseen real-world and synthetic datasets. Project page: this https URL.
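The CKA objective can be made concrete with the standard linear CKA formula (Frobenius-norm form on mean-centered features). The sketch below is the textbook computation, not TORA's training code; note that it equals 1.0 whenever one representation is an orthogonal transform of the other, which is exactly the "same similarity structure" invariance the loss exploits.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representations
    (n samples x d features each). Returns a value in [0, 1];
    1.0 means the similarity structures match up to rotation/scale."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") *
                   np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
S = rng.normal(size=(16, 8))                 # toy "student" tokens
Q, _ = np.linalg.qr(rng.normal(size=(8, 8))) # random orthogonal matrix
cka = linear_cka(S, S @ Q)                   # rotated copy of S
```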
[CV-94] ATSS: Detecting AI-Generated Videos via Anomalous Temporal Self-Similarity
【速读】:该论文旨在解决生成式 AI 视频(AIGV)检测中因缺乏对全局时间演化逻辑建模而导致的性能瓶颈问题。现有方法多依赖局部伪影或短期时序不一致,难以捕捉 AIGV 中由文本或图像提示驱动的确定性轨迹所引发的异常时序自相似性(Anomalous Temporal Self-Similarity, ATSS)。解决方案的关键在于提出一种多模态检测框架 ATSS,其核心创新是通过帧级描述构建视觉、文本及跨模态相似性矩阵,并利用双方向交叉注意力融合模块联合量化内在时序异常,从而有效建模模态内与模态间的动态关系,显著提升对多种生成模型的泛化能力。
链接: https://arxiv.org/abs/2604.04029
作者: Hang Wang,Chao Shen,Lei Zhang,Zhi-Qi Cheng
机构: Xi’an Jiaotong University (西安交通大学); The Hong Kong Polytechnic University (香港理工大学); University of Washington, Tacoma (华盛顿大学塔科马分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 4 figures
Abstract:AI-generated videos (AIGVs) have achieved unprecedented photorealism, posing severe threats to digital forensics. Existing AIGV detectors focus mainly on localized artifacts or short-term temporal inconsistencies, thus often fail to capture the underlying generative logic governing global temporal evolution, limiting AIGV detection performance. In this paper, we identify a distinctive fingerprint in AIGVs, termed anomalous temporal self-similarity (ATSS). Unlike real videos that exhibit stochastic natural dynamics, AIGVs follow deterministic anchor-driven trajectories (e.g., text or image prompts), inducing unnaturally repetitive correlations across visual and semantic domains. To exploit this, we propose the ATSS method, a multimodal detection framework that exploits this insight via a triple-similarity representation and a cross-attentive fusion mechanism. Specifically, ATSS reconstructs semantic trajectories by leveraging frame-wise descriptions to construct visual, textual, and cross-modal similarity matrices, which jointly quantify the inherent temporal anomalies. These matrices are encoded by dedicated Transformer encoders and integrated via a bidirectional cross-attentive fusion module to effectively model intra- and inter-modal dynamics. Extensive experiments on four large-scale benchmarks, including GenVideo, EvalCrafter, VideoPhy, and VidProM, demonstrate that ATSS significantly outperforms state-of-the-art methods in terms of AP, AUC, and ACC metrics, exhibiting superior generalization across diverse video generation models. Code and models of ATSS will be released at this https URL.
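A temporal self-similarity matrix of the kind ATSS builds can be sketched as pairwise cosine similarity over per-frame feature vectors; a near-deterministic generated clip yields unnaturally high off-diagonal values compared with stochastic natural motion. The toy features below are hypothetical stand-ins, not actual encoder outputs.

```python
import numpy as np

def self_similarity(frames):
    """Cosine self-similarity matrix over per-frame feature vectors.
    Entry (i, j) is the cosine similarity between frames i and j;
    repetitive off-diagonal structure is the ATSS fingerprint."""
    F = np.asarray(frames, dtype=float)
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    return F @ F.T

# Toy near-static "generated" clip: frames barely change direction.
static = [[1.0, 0.0], [0.99, 0.01], [0.98, 0.02]]
S = self_similarity(static)
mean_offdiag = (S.sum() - np.trace(S)) / (S.size - len(S))
```

In the full method, analogous matrices are built for visual, textual, and cross-modal features and then encoded jointly; this sketch only shows the single-modality primitive.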
[CV-95] 1.x-Distill: Breaking the Diversity, Quality and Efficiency Barrier in Distribution Matching Distillation
【速读】:该论文旨在解决扩散模型(Diffusion Models)在少步数蒸馏(few-step distillation)过程中存在的多样性崩溃(diversity collapse)和保真度下降(fidelity degradation)问题,尤其是在压缩至两步或更少时性能显著退化的问题。其核心解决方案是提出1.x-Distill框架,首次突破了传统少步蒸馏方法对整数步长的限制,引入了分数步生成(1.x-step generation)这一实用范式;关键创新包括:(1)重新分析教师模型中条件引导因子(CFG)的作用并提出简单有效的改进以抑制模式崩溃;(2)设计分阶段聚焦蒸馏(Stagewise Focused Distillation),通过两阶段策略先保留多样性地学习粗结构,再利用推理一致的对抗蒸馏优化细节;(3)构建轻量级补偿模块实现Distill–Cache协同训练,自然融合块级缓存机制提升蒸馏效率。实验表明,该方法在SD3-Medium和SD3.5-Large上分别实现了1.67和1.74有效非分数步数(Effective NFE),相比原始28×2 NFE采样提速高达33倍,同时保持更优的质量与多样性。
链接: https://arxiv.org/abs/2604.04018
作者: Haoyu Li,Tingyan Wen,Lin Qi,Zhe Wu,Yihuang Chen,Xing Zhou,Lifei Zhu,Xueqian Wang,Kai Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Diffusion models produce high-quality text-to-image results, but their iterative denoising is computationally expensive. Distribution Matching Distillation (DMD) emerges as a promising path to few-step distillation, but suffers from diversity collapse and fidelity degradation when reduced to two steps or fewer. We present 1.x-Distill, the first fractional-step distillation framework that breaks the integer-step constraint of prior few-step methods and establishes 1.x-step generation as a practical regime for distilled diffusion models. Specifically, we first analyze the overlooked role of teacher CFG in DMD and introduce a simple yet effective modification to suppress mode collapse. Then, to improve performance under extreme steps, we introduce Stagewise Focused Distillation, a two-stage strategy that learns coarse structure through diversity-preserving distribution matching and refines details with inference-consistent adversarial distillation. Furthermore, we design a lightweight compensation module for Distill–Cache co-Training, which naturally incorporates block-level caching into our distillation process. Experiments on SD3-Medium and SD3.5-Large show that 1.x-Distill surpasses prior few-step methods, achieving better quality and diversity at 1.67 and 1.74 effective NFEs, respectively, with up to 33x speedup over original 28x2 NFE sampling.
[CV-96] HOIGS: Human-Object Interaction Gaussian Splatting
Quick Read: This paper tackles high-fidelity reconstruction of complex human-object interactions (HOI) in dynamic scenes. Existing Gaussian Splatting methods either rely on human pose priors while neglecting dynamic objects, or approximate all motion with a single field, making interaction-driven deformation hard to capture. The key idea of the proposed Human-Object Interaction Gaussian Splatting (HOIGS) is a cross-attention-based HOI module that explicitly models interaction-induced deformation between humans and objects, combined with heterogeneous feature extraction: HexPlane for humans and a Cubic Hermite Spline (CHS) for objects. Fusing these heterogeneous features improves motion-coupling modeling and deformation estimation in challenging scenarios involving occlusion, contact, and object manipulation.
Link: https://arxiv.org/abs/2604.04016
Authors: Taewoo Kim, Suwoong Yeom, Jaehyun Pyun, Geonho Cha, Dongyoon Wee, Joonsik Nam, Yun-Seong Jeong, Kyeongbo Kong, Suk-Ju Kang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 24 pages, 9 figures
Abstract:Reconstructing dynamic scenes with complex human-object interactions is a fundamental challenge in computer vision and graphics. Existing Gaussian Splatting methods either rely on human pose priors while neglecting dynamic objects, or approximate all motions within a single field, limiting their ability to capture interaction-rich dynamics. To address this gap, we propose Human-Object Interaction Gaussian Splatting (HOIGS), which explicitly models interaction-induced deformation between humans and objects through a cross-attention-based HOI module. Distinct deformation baselines are employed to extract features: HexPlane for humans and Cubic Hermite Spline (CHS) for objects. By integrating these heterogeneous features, HOIGS effectively captures interdependent motions and improves deformation estimation in scenarios involving occlusion, contact, and object manipulation. Comprehensive experiments on multiple datasets demonstrate that our method consistently outperforms state-of-the-art human-centric and 4D Gaussian approaches, highlighting the importance of explicitly modeling human-object interactions for high-fidelity reconstruction.
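The Cubic Hermite Spline (CHS) used for the object deformation path interpolates between keyframed values with endpoint tangents. A minimal scalar sketch of one Hermite segment (variable names are illustrative):

```python
def cubic_hermite(p0, p1, m0, m1, t):
    """Evaluate a cubic Hermite segment at t in [0, 1].

    p0, p1: endpoint values; m0, m1: endpoint tangents.
    """
    t2, t3 = t * t, t * t * t
    h00 = 2 * t3 - 3 * t2 + 1   # basis for p0
    h10 = t3 - 2 * t2 + t       # basis for m0
    h01 = -2 * t3 + 3 * t2      # basis for p1
    h11 = t3 - t2               # basis for m1
    return h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1
```

The segment passes exactly through p0 at t = 0 and p1 at t = 1, with slopes m0 and m1 there, which makes it a natural choice for smooth object motion between keyframes.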
[CV-97] OASIC: Occlusion-Agnostic and Severity-Informed Classification
Quick Read: This paper addresses the challenge severe occlusion poses for computer vision, whose two root causes are (1) the loss of visible information about the target object and (2) the distracting patterns introduced by the occluders themselves. The solution tackles both at once: at test time, the occluder's distracting patterns are removed via masking based on visual anomaly detection, independent of the occlusion type; during training, random masking simulates occlusions of varying severity to improve robustness. The authors further find that occlusion severity can be estimated at test time, and that a model optimized for a specific severity performs best at that severity. Building on these insights, the proposed OASIC (Occlusion-Agnostic Severity-Informed Classification) model estimates the occlusion severity of a test image, masks the occluded region, and selects the best-matched model for classification, yielding large gains (AUC_occ +18.5 vs. standard training, +23.7 vs. finetuning on unoccluded images).
Link: https://arxiv.org/abs/2604.04012
Authors: Kay Gijzen (1, 2), Gertjan J. Burghouts (2), Daniël M. Pelt (1) ((1) Leiden University, (2) TNO)
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 14 pages, 5 figures
Abstract: Severe occlusions of objects pose a major challenge for computer vision. We show that two root causes are (1) the loss of visible information and (2) the distracting patterns caused by the occluders. Our approach addresses both causes at the same time. First, the distracting patterns are removed at test-time, via masking of the occluding patterns. This masking is independent of the type of occlusion, by handling the occlusion through the lens of visual anomalies w.r.t. the object of interest. Second, to deal with less visual details, we follow standard practice by masking random parts of the object during training, for various degrees of occlusions. We discover that (a) it is possible to estimate the degree of the occlusion (i.e. severity) at test-time, and (b) that a model optimized for a specific degree of occlusion also performs best on a similar degree during test-time. Combining these two insights brings us to a severity-informed classification model called OASIC: Occlusion Agnostic Severity Informed Classification. We estimate the severity of occlusion for a test image, mask the occluder, and select the model that is optimized for the degree of occlusion. This strategy performs better than any single model optimized for any smaller or broader range of occlusion severities. Experiments show that combining gray masking with adaptive model selection improves AUC_occ by +18.5 over standard training on occluded images and +23.7 over finetuning on unoccluded images.
[CV-98] A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning
Quick Read: This paper targets the vulnerability of audio-visual multi-modal large language models (MLLMs) in safety-critical applications, where coordinated cross-modal attacks remain underexplored. The key contribution is Multi-Modal Typography, a systematic study of coordinated typographic perturbations across the audio, visual, and text modalities, which exposes significant cross-modal fragility: the coordinated attack reaches an 83.43% attack success rate versus 34.93% for single-modality attacks, establishing coordinated multi-modal perturbation as a far more potent threat that must be included in the safety evaluation of multi-modal reasoning systems.
Link: https://arxiv.org/abs/2604.03995
Authors: Tianle Chen, Deepti Ghadiyaram
Affiliations: Boston University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments:
Abstract: As audio-visual multi-modal large language models (MLLMs) are increasingly deployed in safety-critical applications, understanding their vulnerabilities is crucial. To this end, we introduce Multi-Modal Typography, a systematic study examining how typographic attacks across multiple modalities adversely influence MLLMs. While prior work focuses narrowly on unimodal attacks, we expose the cross-modal fragility of MLLMs. We analyze the interactions between audio, visual, and text perturbations and reveal that a coordinated multi-modal attack creates a significantly more potent threat than single-modality attacks (attack success rate = 83.43% vs 34.93%). Our findings across multiple frontier MLLMs, tasks, and common-sense reasoning and content moderation benchmarks establish multi-modal typography as a critical and underexplored attack strategy in multi-modal reasoning. Code and data will be publicly available.
[CV-99] High-Fidelity Mural Restoration via a Unified Hybrid Mask-Aware Transformer
Quick Read: This paper addresses two core challenges in the digital restoration of ancient murals: reconstructing structurally plausible content in large missing regions while strictly preserving the authenticity of undamaged areas. The key is the unified Hybrid Mask-Aware Transformer (HMAT) framework, whose main components are: (1) a Mask-Aware Dynamic Filtering module for robust local texture modeling; (2) a Transformer bottleneck for long-range structural inference; (3) a mask-conditional style fusion module that dynamically guides generation according to the damage morphology; and (4) a Teacher-Forcing Decoder with hard-gated skip connections that enforces high fidelity in valid regions and focuses reconstruction on the missing areas. Evaluations on the DHMural and Nine-Colored Deer datasets confirm its advantages in structural coherence and visual faithfulness.
Link: https://arxiv.org/abs/2604.03984
Authors: Jincheng Jiang, Qianhao Han, Chi Zhang, Zheng Zheng
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 3 figures
Abstract: Ancient murals are valuable cultural artifacts, but many have suffered severe degradation due to environmental exposure, material aging, and human activity. Restoring these artworks is challenging because it requires both reconstructing large missing structures and strictly preserving authentic, undamaged regions. This paper presents the Hybrid Mask-Aware Transformer (HMAT), a unified framework for high-fidelity mural restoration. HMAT integrates Mask-Aware Dynamic Filtering for robust local texture modeling with a Transformer bottleneck for long-range structural inference. To further address the diverse morphology of degradation, we introduce a mask-conditional style fusion module that dynamically guides the generative process. In addition, a Teacher-Forcing Decoder with hard-gated skip connections is designed to enforce fidelity in valid regions and focus reconstruction on missing areas. We evaluate HMAT on the DHMural dataset and a curated Nine-Colored Deer dataset under varying degradation levels. Experimental results demonstrate that the proposed method achieves competitive performance compared to state-of-the-art approaches, while producing more structurally coherent and visually faithful restorations. These findings suggest that HMAT provides an effective solution for the digital restoration of cultural heritage murals.
[CV-100] Gram-Anchored Prompt Learning for Vision-Language Models via Second-Order Statistics
Quick Read: This paper addresses the limited robustness of parameter-efficient prompt learning for adapting vision-language models (VLMs) to downstream tasks when it relies solely on first-order visual features (spatial feature maps). Although first-order features support fine-grained semantic discrimination, they are highly susceptible to domain shifts and local noise. The key of the proposed Gram-Anchored Prompt Learning (GAPL) is an additional Gram-matrix stream based on second-order statistics that augments the standard first-order spatial interaction, anchoring prompts to global structural consistency so that language representations can dynamically adapt to statistical distribution shifts across domains.
Link: https://arxiv.org/abs/2604.03980
Authors: Minglei Chen, Weilong Wang, Jiang Duan, Ye Deng
Affiliations: Southwestern University of Finance and Economics
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Parameter-efficient prompt learning has become the de facto standard for adapting Vision-Language Models (VLMs) to downstream tasks. Existing approaches predominantly focus on aligning text prompts with first-order visual features (i.e., spatial feature maps). While effective for fine-grained semantic discrimination, we argue that relying solely on first-order information is insufficient for robust adaptation, as these spatially entangled features are highly susceptible to domain shifts and local noise. In this work, we propose \textbfGram-Anchored Prompt Learning (GAPL) for Vision-Language Models via Second-Order Statistics, a framework that synergizes local semantic alignment with global structural consistency. Methodologically, we introduce an additional second-order statistical stream via \textbfGram matrices that augments the standard first-order spatial interaction. By anchoring prompts to these second-order priors, our approach enables language representations to dynamically adapt to statistical distribution shifts across diverse domains. Extensive experiments indicate the effectiveness of the second-order features, and show compelling performances of GAPL on various benchmarks.
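The second-order statistics in GAPL are Gram matrices of channel features: the entry (i, j) is the inner product of channels i and j, which discards spatial layout and captures global co-activation structure. A hedged, list-based sketch (normalization choice is an assumption):

```python
def gram_matrix(features):
    """Second-order statistics of a feature map.

    `features` is a list of C channels, each flattened to length N;
    the result is a C x C matrix of channel co-activations divided
    by the spatial size N.
    """
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
            for fi in features]
```

Because the Gram matrix is invariant to where in the image activations occur, it is a natural anchor for distribution-level consistency under domain shift.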
[CV-101] Hierarchical Point-Patch Fusion with Adaptive Patch Codebook for 3D Shape Anomaly Detection
Quick Read: This paper addresses the poor generalization of existing deep learning methods for 3D shape anomaly detection across anomaly types and scales, especially for global geometric errors (e.g., planar shifts, angular misalignment), and their sensitivity to noisy or incomplete local point clouds during training. The key is a hierarchical point-patch anomaly scoring network that jointly models regional part features and local point features for robust anomaly reasoning, together with an adaptive patchification module with self-supervised decomposition that captures complex structural deviations, substantially improving detection performance and generalization across diverse anomaly types.
Link: https://arxiv.org/abs/2604.03972
Authors: Xueyang Kang, Zizhao Li, Tian Lan, Dong Gong, Kourosh Khoshelham, Liangliang Nan
Affiliations: The University of Melbourne; Tsinghua University; The University of New South Wales; Delft University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, 6 tables
Abstract:3D shape anomaly detection is a crucial task for industrial inspection and geometric analysis. Existing deep learning approaches typically learn representations of normal shapes and identify anomalies via out-of-distribution feature detection or decoder-based reconstruction. They often fail to generalize across diverse anomaly types and scales, such as global geometric errors (e.g., planar shifts, angle misalignments), and are sensitive to noisy or incomplete local points during training. To address these limitations, we propose a hierarchical point-patch anomaly scoring network that jointly models regional part features and local point features for robust anomaly reasoning. An adaptive patchification module integrates self-supervised decomposition to capture complex structural deviations. Beyond evaluations on public benchmarks (Anomaly-ShapeNet and Real3D-AD), we release an industrial test set with real CAD models exhibiting planar, angular, and structural defects. Experiments on public and industrial datasets show superior AUC-ROC and AUC-PR performance, including over 40% point-level improvement on the new industrial anomaly type and average object-level gains of 7% on Real3D-AD and 4% on Anomaly-ShapeNet, demonstrating strong robustness and generalization.
[CV-102] VLA-Forget: Vision-Language-Action Unlearning for Embodied Foundation Models ACL-2026
Quick Read: This paper addresses the unlearning challenge faced when deploying vision-language-action (VLA) models for robotic manipulation: removing unsafe, spurious, or privacy-sensitive behaviors without degrading perception, language grounding, or action control. Conventional approaches unlearn only in the vision stack or the language backbone, but in VLA models undesirable knowledge is distributed across perception, cross-modal alignment, and reasoning/action generation, so such partial strategies are often insufficient, and unlearning baselines designed for a single modality can leave residual forgetting or incur unnecessary utility loss in embodied settings. The key is the proposed VLA-Forget framework, which combines two mechanisms: ratio-aware selective editing for perceptual and cross-modal specificity, and layer-selective reasoning/action unlearning for utility-preserving forgetting. It stages updates over the visual encoder, cross-modal projector, and upper action-generating transformer blocks to jointly optimize three objectives: targeted forgetting, perceptual preservation, and reasoning retention, improving forgetting efficacy by 10%, perceptual specificity by 22%, and reasoning/task-success retention by 9%, while reducing post-quantization behavior recovery by 55%.
Link: https://arxiv.org/abs/2604.03956
Authors: Ravi Ranjan, Agoritsa Polyzou
Affiliations: Florida International University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 9 figures, submitted to ACL-2026
Abstract:Vision-language-action (VLA) models are emerging as embodied foundation models for robotic manipulation, but their deployment introduces a new unlearning challenge: removing unsafe, spurious, or privacy-sensitive behaviors without degrading perception, language grounding, and action control. In OpenVLA-style policies, behavior is produced through a fused visual encoder, a cross-modal projector, and a language backbone that predicts tokenized robot actions, so undesirable knowledge can be distributed across perception, alignment, and reasoning/action layers rather than confined to a single module. Consequently, partial unlearning applied only to the vision stack or only to the language backbone is often insufficient, while conventional unlearning baselines designed for standalone vision or language models may leave residual forgetting or incur unnecessary utility loss in embodied settings. We propose VLA-Forget, a hybrid unlearning framework that combines ratio-aware selective editing for perception and cross-modal specificity with layer-selective reasoning/action unlearning for utility-preserving forgetting. VLA-Forget jointly optimizes three objectives: targeted forgetting, perceptual preservation, and reasoning retention, through staged updates over the visual encoder, projector, and upper action-generating transformer blocks. Across forget-set behavior probes and retain-task evaluations, VLA-Forget improves forgetting efficacy by 10%, preserves perceptual specificity by 22%, retains reasoning and task success by 9%, and reduces post-quantization recovery by 55% relative to strong unlearning baselines.
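Utility-preserving unlearning of this kind is commonly regularized with an elastic-weight-consolidation-style penalty that holds parameters important to retained behavior near their anchored values. A minimal sketch over flat parameter lists (the function and its exact form are illustrative, not taken from the paper):

```python
def ewc_penalty(params, anchor_params, fisher, lam=1.0):
    """EWC-style regularizer: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    Parameters important to earlier behavior (large Fisher value F_i)
    are held close to their anchored values theta*, so forgetting
    updates elsewhere do not erase retained capabilities.
    """
    return 0.5 * lam * sum(f * (p - a) ** 2
                           for p, a, f in zip(params, anchor_params, fisher))
```

Adding this term to the forgetting loss makes the trade-off explicit: lam tunes how strongly retention is enforced against the targeted-forgetting objective.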
[CV-103] Multimodal Structure Learning: Disentangling Shared and Specific Topology via Cross-Modal Graphical Lasso
Quick Read: This paper addresses how to accurately model conditional dependencies between visual and linguistic features in multimodal representation learning, in the face of high-dimensional noise, modality misalignment, and the confounding of shared versus category-specific topologies. The key of the proposed Cross-Modal Graphical Lasso (CM-GLasso) is to strictly align multimodal features into a shared latent space via a novel text-visualization strategy and a unified vision-language encoder, and to extract spatially aware cross-modal priors through a cross-attention distillation mechanism. It further unifies tailored graphical-model estimation with Common-Specific Structure Learning (CSSL) in a joint objective optimized via the alternating direction method of multipliers (ADMM), disentangling invariant and class-specific precision matrices in a single optimization step and avoiding multi-step error accumulation.
Link: https://arxiv.org/abs/2604.03953
Authors: Fei Wang, Yutong Zhang, Xiong Wang
Affiliations: Stony Brook University; Sichuan University; USTC
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted to a conference
Abstract: Learning interpretable multimodal representations inherently relies on uncovering the conditional dependencies between heterogeneous features. However, applying sparse graph estimation techniques, such as Graphical Lasso (GLasso), to visual-linguistic domains is severely bottlenecked by high-dimensional noise, modality misalignment, and the confounding of shared versus category-specific topologies. In this paper, we propose Cross-Modal Graphical Lasso (CM-GLasso) to overcome these fundamental limitations. By coupling a novel text-visualization strategy with a unified vision-language encoder, we strictly align multimodal features into a shared latent space. We introduce a cross-attention distillation mechanism that condenses high-dimensional patches into explicit semantic nodes, naturally extracting spatial-aware cross-modal priors. Furthermore, we unify tailored GLasso estimation and Common-Specific Structure Learning (CSSL) into a joint objective optimized via the Alternating Direction Method of Multipliers (ADMM). This formulation guarantees the simultaneous disentanglement of invariant and class-specific precision matrices without multi-step error accumulation. Extensive experiments across eight benchmarks covering both natural and medical domains demonstrate that CM-GLasso establishes a new state-of-the-art in generative classification and dense semantic segmentation tasks.
[CV-104] SafeCtrl: Region-Aware Safety Control for Text-to-Image Diffusion via Detect-Then-Suppress ICME
Quick Read: This paper addresses the generation of visually harmful content (e.g., sexually explicit, violent, and horror imagery) by text-to-image diffusion models, targeting two weaknesses of existing safety interventions: a severe trade-off between safety and contextual fidelity, where erasing unsafe concepts degrades the detail of safe content, and vulnerability to adversarial prompt attacks that bypass safety mechanisms. The key of the proposed SafeCtrl framework is a detect-then-suppress paradigm: an attention-guided Detect module first precisely localizes risk regions, and a localized Suppress module trained with image-level Direct Preference Optimization (DPO) then neutralizes harmful semantics only within the detected regions while preserving the surrounding scene, achieving precise and robust safety control.
Link: https://arxiv.org/abs/2604.03941
Authors: Lingyun Zhang, Yu Xie, Zhongli Fang, Yu Liu, Ping Chen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 5 figures, accepted to 2026 IEEE International Conference on Multimedia and Expo (ICME)
Abstract:The widespread deployment of text-to-image diffusion models is significantly challenged by the generation of visually harmful content, such as sexually explicit content, violence, and horror imagery. Common safety interventions, ranging from input filtering to model concept erasure, often suffer from two critical limitations: (1) a severe trade-off between safety and context preservation, where removing unsafe concepts degrades the fidelity of the safe content, and (2) vulnerability to adversarial attacks, where safety mechanisms are easily bypassed. To address these challenges, we propose SafeCtrl, a Region-Aware safety control framework operating on a Detect-Then-Suppress paradigm. Unlike global safety interventions, SafeCtrl first employs an attention-guided Detect module to precisely localize specific risk regions. Subsequently, a localized Suppress module, optimized via image-level Direct Preference Optimization (DPO), neutralizes harmful semantics only within the detected areas, effectively transforming unsafe objects into safe alternatives while leaving the surrounding context intact. Extensive experiments across multiple risk categories demonstrate that SafeCtrl achieves a superior trade-off between safety and fidelity compared to state-of-the-art methods. Crucially, our approach exhibits improved resilience against adversarial prompt attacks, offering a precise and robust solution for responsible generation.
[CV-105] Supervised Dimensionality Reduction Revisited: Why LDA on Frozen CNN Features Deserves a Second Look
Quick Read: This paper addresses inefficiency in ride-hailing dispatch caused by demand patterns that vary substantially with time of day, season, and special events, with the core challenge of anticipating demand and dynamically adjusting capacity allocation to reduce rider wait times. The key is a regime-calibrated approach: historical trip data is clustered into demand regimes using a six-metric similarity ensemble (Kolmogorov-Smirnov distance, Wasserstein-1 distance, feature distance, variance ratio, event pattern, and temporal proximity), and the current operating period is matched to its most similar historical analogues to build a calibrated demand prior. This prior drives two policies, an LP-based fleet repositioning policy and batch dispatch with Hungarian matching, which together reduce mean wait times by 31.1% (p < 0.001), with significant improvements in tail waits (P95) and wait-time inequality (Gini coefficient). The method requires no training, is deterministic and explainable, and generalizes across cities (validated on Chicago).
Link: https://arxiv.org/abs/2604.03928
Authors: Indar Kumar, Girish Karhana, Sai Krishna Jasti, Ankit Hemant Lade
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures, 6 tables. Code available at this https URL
Abstract: Effective ride-hailing dispatch requires anticipating demand patterns that vary substantially across time-of-day, day-of-week, season, and special events. We propose a regime-calibrated approach that (i) segments historical trip data into demand regimes, (ii) matches the current operating period to the most similar historical analogues via a six-metric similarity ensemble (Kolmogorov-Smirnov, Wasserstein-1, feature distance, variance ratio, event pattern, temporal proximity), and (iii) uses the resulting calibrated demand prior to drive both an LP-based fleet repositioning policy and batch dispatch with Hungarian matching. In ablation, a distributional-only subset is strongest on mean wait, while the full ensemble is retained as a robustness-oriented default. Evaluated on 5.2 million NYC TLC trips across 8 diverse scenarios (winter/summer, weekday/weekend/holiday, morning/evening/night) with 5 random seeds each, our method reduces mean rider wait times by 31.1% (bootstrap 95% CI: [26.5, 36.6]%; Friedman chi-sq = 80.0, p = 4.25e-18; Cohen's d = 7.5-29.9 across scenarios). The improvement extends to the tail: P95 wait drops 37.6% and the Gini coefficient of wait times improves from 0.441 to 0.409 (7.3% relative). The two contributions compose multiplicatively and are independently validated: calibration provides 16.9% reduction; LP repositioning adds a further 15.5%. The approach requires no training, is deterministic and explainable, generalizes to Chicago (23.3% wait reduction via NYC-built regime library), and is robust across fleet sizes (32-47% improvement for 0.5-2x fleet scaling). We provide comprehensive ablation studies, formal statistical tests, and routing-fidelity validation with OSRM.
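Batch dispatch with Hungarian matching assigns drivers to riders so that the total pickup cost is minimized. A brute-force sketch that reproduces the optimal assignment for small batches (a production system would use the Hungarian algorithm proper, e.g. `scipy.optimize.linear_sum_assignment`; the cost semantics below are illustrative):

```python
from itertools import permutations

def min_cost_assignment(cost):
    """Exhaustively find the driver->rider assignment minimizing total cost.

    `cost[i][j]` is e.g. the estimated pickup time for driver i serving
    rider j. Factorial time: only a reference for small batches.
    """
    n = len(cost)
    best_total, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_total, best_perm = total, perm
    return best_total, list(best_perm)
```

The calibrated demand prior affects this step indirectly: repositioning driven by the prior shapes the cost matrix that the per-batch matching then minimizes.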
[CV-106] Interpreting Video Representations with Spatio-Temporal Sparse Autoencoders
Quick Read: This paper addresses the loss of temporal coherence when standard sparse autoencoders (SAEs) are applied to video representations: although SAEs extract interpretable, monosemantic features, hard TopK selection destabilizes feature assignments across frames, reducing autocorrelation by 36%. The key is to introduce spatio-temporal contrastive objectives and Matryoshka hierarchical grouping, where the contrastive loss weight provides a tunable trade-off between reconstruction fidelity and temporal coherence, effectively recovering and even exceeding the raw video's temporal coherence. Experiments show that contrastive SAE features improve action classification by +3.9% and text-video retrieval by up to 2.8x R@1, and causal ablation confirms that contrastive training concentrates predictive signal into a small number of identifiable features, improving interpretability and task transfer.
Link: https://arxiv.org/abs/2604.03919
Authors: Atahan Dokme, Sriram Vishwanath
Affiliations: Georgia Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 2 figures, 5 tables. Submitted to ACM Multimedia 2026
Abstract: We present the first systematic study of Sparse Autoencoders (SAEs) on video representations. Standard SAEs decompose video into interpretable, monosemantic features but destroy temporal coherence: hard TopK selection produces unstable feature assignments across frames, reducing autocorrelation by 36%. We propose spatio-temporal contrastive objectives and Matryoshka hierarchical grouping that recover and even exceed raw temporal coherence. The contrastive loss weight controls a tunable trade-off between reconstruction and temporal coherence. A systematic ablation on two backbones and two datasets shows that different configurations excel at different goals: reconstruction fidelity, temporal coherence, action discrimination, or interpretability. Contrastive SAE features improve action classification by +3.9% over raw features and text-video retrieval by up to 2.8x R@1. A cross-backbone analysis reveals that standard monosemanticity metrics contain a backbone-alignment artifact: both DINOv2 and VideoMAE produce equally monosemantic features under neutral (CLIP) similarity. Causal ablation confirms that contrastive training concentrates predictive signal into a small number of identifiable features.
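The hard TopK selection that the paper identifies as the source of temporal instability keeps only the k largest pre-activations per frame. A minimal list-based sketch (names are illustrative):

```python
def topk_activation(pre_acts, k):
    """Hard TopK used in TopK sparse autoencoders.

    Keeps the k largest pre-activations and zeroes the rest; because
    the surviving set can flip between adjacent frames, feature
    assignments become temporally unstable.
    """
    kept = sorted(range(len(pre_acts)),
                  key=lambda i: pre_acts[i], reverse=True)[:k]
    kept = set(kept)
    return [a if i in kept else 0.0 for i, a in enumerate(pre_acts)]
```

Two near-tied pre-activations on consecutive frames can swap in and out of the kept set, which is exactly the frame-to-frame flickering the contrastive objective is designed to penalize.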
[CV-107] Learning 3D Reconstruction with Priors in Test Time CVPR2026
Quick Read: This paper addresses how multiview Transformers (MVTs) can exploit priors such as camera poses, intrinsics, and depth to improve 3D vision tasks without retraining or modifying pre-trained image-only models. The key is a test-time constrained optimization (TCO) framework: instead of feeding priors into the architecture, it casts them as constraints on the predictions and optimizes the network at inference time. The optimization loss combines a self-supervised objective (cross-view consistency of the multi-view predictions, implemented as photometric or geometric losses) with prior penalty terms that convert any available priors into constraints on the corresponding output modalities, yielding large accuracy gains on tasks such as point map estimation and camera pose estimation.
Link: https://arxiv.org/abs/2604.03878
Authors: Lei Zhou, Haoyu Wu, Akshat Dave, Dimitris Samaras
Affiliations: Stony Brook University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR2026. Code link: this https URL
Abstract:We introduce a test-time framework for multiview Transformers (MVTs) that incorporates priors (e.g., camera poses, intrinsics, and depth) to improve 3D tasks without retraining or modifying pre-trained image-only networks. Rather than feeding priors into the architecture, we cast them as constraints on the predictions and optimize the network at inference time. The optimization loss consists of a self-supervised objective and prior penalty terms. The self-supervised objective captures the compatibility among multi-view predictions and is implemented using photometric or geometric loss between renderings from other views and each view itself. Any available priors are converted into penalty terms on the corresponding output modalities. Across a series of 3D vision benchmarks, including point map estimation and camera pose estimation, our method consistently improves performance over base MVTs by a large margin. On the ETH3D, 7-Scenes, and NRGBD datasets, our method reduces the point-map distance error by more than half compared with the base image-only models. Our method also outperforms retrained prior-aware feed-forward methods, demonstrating the effectiveness of our test-time constrained optimization (TCO) framework for incorporating priors into 3D vision tasks.
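The structure of the test-time loss can be sketched as a self-supervised consistency term plus optional prior penalties. The loss forms below (L1 consistency, squared prior penalty) and all names are illustrative assumptions, not the paper's exact objective:

```python
def tco_loss(pred_depth, rendered_depth, prior_depth=None, w_prior=1.0):
    """Test-time constrained optimization loss, sketched for depth.

    Self-supervised term: agreement between a view's own prediction and
    the re-rendering of that view from the other views' predictions.
    Prior term: if a depth prior is available, penalize deviation from it.
    """
    n = len(pred_depth)
    self_sup = sum(abs(p - r) for p, r in zip(pred_depth, rendered_depth)) / n
    prior = 0.0
    if prior_depth is not None:
        prior = sum((p - q) ** 2 for p, q in zip(pred_depth, prior_depth)) / n
    return self_sup + w_prior * prior
```

At inference, the network weights (or a subset) are updated by gradient descent on this loss for the given scene, so any combination of available priors can be plugged in without architectural changes.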
[CV-108] Training a Student Expert via Semi-Supervised Foundation Model Distillation CVPR
Quick Read: This paper addresses the heavy compute cost of deploying pre-trained vision foundation models (VFMs) and the expensive annotations their adaptation requires. The key is a semi-supervised knowledge distillation (SSKD) framework that compresses VFMs into compact experts using limited labeled and abundant unlabeled data. Central to the approach is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to mine informative negatives and enforce clear inter-instance margins; maintaining this contrastive signal across both the domain-adaptation and distillation stages aligns teacher and student embeddings and exploits the semantics of unlabeled images more effectively.
Link: https://arxiv.org/abs/2604.03841
Authors: Pardis Taghavi, Tian Liu, Renjie Li, Reza Langari, Zhengzhong Tu
Affiliations: Texas A&M University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 14 pages, 9 figures
Abstract: Foundation models deliver strong perception but are often too computationally heavy to deploy, and adapting them typically requires costly annotations. We introduce a semi-supervised knowledge distillation (SSKD) framework that compresses pre-trained vision foundation models (VFMs) into compact experts using limited labeled and abundant unlabeled data, and instantiate it for instance segmentation where per-pixel labels are particularly expensive. The framework unfolds in three stages: (1) domain adaptation of the VFM(s) via self-training with contrastive calibration, (2) knowledge transfer through a unified multi-objective loss, and (3) student refinement to mitigate residual pseudo-label bias. Central to our approach is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to extract informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and more effectively leverage unlabeled images. On Cityscapes and ADE20K, our ~11x smaller student improves over its zero-shot VFM teacher(s) by +11.9 and +8.6 AP, surpasses adapted teacher(s) by +3.4 and +1.5 AP, and outperforms state-of-the-art SSKD methods on benchmarks.
[CV-109] Beyond Task-Driven Features for Object Detection
Quick Read: This paper addresses the problem that features learned by modern object detectors, while optimized for the task loss, often capture shortcut correlations that fail to reflect the underlying annotation structure, limiting transfer, interpretability, and robustness when task definitions change or supervision becomes sparse. The key is an annotation-guided feature augmentation framework: dense spatial feature grids are constructed from annotation-guided latent spaces and fused with feature pyramid representations to influence the region proposal and detection heads, aligning features with annotation geometry and thereby improving object focus, reducing background sensitivity, and strengthening generalization to unseen or weakly supervised tasks.
Link: https://arxiv.org/abs/2604.03839
Authors: Meilun Zhou, Alina Zare
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for Oral Presentation at the 46th IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2026, Washington D.C., United States. 4 pages and 4 figures
Abstract:Task-driven features learned by modern object detectors optimize end task loss yet often capture shortcut correlations that fail to reflect underlying annotation structure. Such representations limit transfer, interpretability, and robustness when task definitions change or supervision becomes sparse. This paper introduces an annotation-guided feature augmentation framework that injects embeddings into an object detection backbone. The method constructs dense spatial feature grids from annotation-guided latent spaces and fuses them with feature pyramid representations to influence region proposal and detection heads. Experiments across wildlife and remote sensing datasets evaluate classification, localization, and data efficiency under multiple supervision regimes. Results show consistent improvements in object focus, reduced background sensitivity, and stronger generalization to unseen or weakly supervised tasks. The findings demonstrate that aligning features with annotation geometry yields more meaningful representations than purely task optimized features.
[CV-110] Task-Guided Multi-Annotation Triplet Learning for Remote Sensing Representations
Quick Read: This paper addresses the inability of multi-task triplet-loss methods that rely on static weights to adapt to interactions among tasks. Existing methods balance supervision from different annotation types with fixed weights, which requires manual tuning and ignores how tasks jointly shape a shared representation. The key of the proposed task-guided multi-annotation triplet loss is to select the triplets most informative across tasks via a mutual-information criterion, thereby changing which samples influence representation learning rather than merely adjusting loss magnitudes. This yields notable gains on downstream classification and regression, showing that task-aware triplet selection produces a more effective shared representation.
Link: https://arxiv.org/abs/2604.03837
Authors: Meilun Zhou, Alina Zare
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for Oral Presentation at the 46th IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2026, Washington D.C., United States. 4 pages and 2 figures
Abstract: Prior multi-task triplet loss methods relied on static weights to balance supervision between various types of annotation. However, static weighting requires tuning and does not account for how tasks interact when shaping a shared representation. To address this, the proposed task-guided multi-annotation triplet loss removes this dependency by selecting triplets through a mutual-information criterion that identifies triplets most informative across tasks. This strategy modifies which samples influence the representation rather than adjusting loss magnitudes. Experiments on an aerial wildlife dataset compare the proposed task-guided selection against several triplet loss setups for shaping a representation in an effective multi-task manner. The results show improved classification and regression performance and demonstrate that task-aware triplet selection produces a more effective shared representation for downstream tasks.
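The underlying triplet loss pulls an anchor toward a positive and pushes it away from a negative by a margin; task-guided selection changes which (anchor, positive, negative) triples enter this loss. A minimal sketch with squared Euclidean distance (the distance and margin choices are illustrative):

```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(a, p) - d(a, n) + margin) with squared L2 distance."""
    def d(x, y):
        return sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return max(0.0, d(anchor, positive) - d(anchor, negative) + margin)
```

A triplet whose loss is already zero contributes no gradient, which is why mining informative triplets (here, by a cross-task mutual-information criterion) matters more than reweighting the loss itself.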
[CV-111] SPARK-IL: Spectral Retrieval-Augmented RAG for Knowledge-driven Deepfake Detection via Incremental Learning
Quick Read: This paper addresses the detection of images produced by generative AI, in particular the poor generalization of existing detectors to unseen generative models. The key is the SPARK-IL framework, which combines dual-path spectral analysis with incremental learning: a partially frozen ViT-L/14 encoder extracts semantic features while a parallel path embeds raw RGB pixels for frequency-domain modeling; both paths undergo multi-band Fourier decomposition, are processed by Kolmogorov-Arnold Networks (KAN), and are fused via cross-attention. At inference, similar spectral signatures are retrieved from a Milvus database and predictions are made by majority voting, while an elastic weight consolidation strategy preserves learned knowledge as the database expands.
Link: https://arxiv.org/abs/2604.03833
Authors: Hessen Bougueffa Eutamene, Abdellah Zakaria Sellam, Abdelmalik Taleb-Ahmed, Abdenour Hadid
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Detecting AI-generated images remains a significant challenge because detectors trained on specific generators often fail to generalize to unseen models; however, while pixel-level artifacts vary across models, frequency-domain signatures exhibit greater consistency, providing a promising foundation for cross-generator detection. To address this, we propose SPARK-IL, a retrieval-augmented framework that combines dual-path spectral analysis with incremental learning by utilizing a partially frozen ViT-L/14 encoder for semantic representations alongside a parallel path for raw RGB pixel embeddings. Both paths undergo multi-band Fourier decomposition into four frequency bands, which are individually processed by Kolmogorov-Arnold Networks (KAN) with mixture-of-experts for band-specific transformations before the resulting spectral embeddings are fused via cross-attention with residual connections. During inference, this fused embedding retrieves the k nearest labeled signatures from a Milvus database using cosine similarity to facilitate predictions via majority voting, while an incremental learning strategy expands the database and employs elastic weight consolidation to preserve previously learned transformations. Evaluated on the UniversalFakeDetect benchmark across 19 generative models – including GANs, face-swapping, and diffusion methods – SPARK-IL achieves a 94.6% mean accuracy, with the code to be publicly released at this https URL.
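The multi-band Fourier decomposition partitions spectral coefficients into frequency bands that are then processed by separate band-specific networks. A hedged sketch of the band partitioning over ordered 1D coefficients (equal-width band edges are an assumption; the paper's exact split is not specified here):

```python
def split_into_bands(coeffs, bands=4):
    """Partition ordered frequency coefficients into contiguous bands.

    `coeffs` runs from low to high frequency; each returned band can be
    fed to its own band-specific transformation.
    """
    n = len(coeffs)
    edges = [round(i * n / bands) for i in range(bands + 1)]
    return [coeffs[edges[i]:edges[i + 1]] for i in range(bands)]
```

Band-wise processing is what lets the detector weight low-frequency content and high-frequency generator artifacts differently, which is where cross-generator spectral signatures tend to be most consistent.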
[CV-112] ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos CVPR2026
[Quick Read]: This paper tackles the temporal localization of activity-level forgeries in videos, i.e., identifying segments in which human actions have been manipulated. Such forgeries distort event semantics by altering human behavior, making them highly deceptive and a serious threat to media authenticity and public trust. The key contribution is ActivityForensics, the first large-scale benchmark for this task, containing over 6K forged activity segments seamlessly blended into their original video context with high visual consistency, together with Temporal Artifact Diffuser (TADiff), a diffusion-based feature regularizer that exposes latent forgery artifacts for effective localization.
Link: https://arxiv.org/abs/2604.03819
Authors: Peijun Bao, Anwei Luo, Gang Pan, Alex C. Kot, Xudong Jiang
Affiliations: Zhejiang University; Nanyang Technological University; Jiangxi University of Finance and Economics; Jiangxi Provincial Key Laboratory of Multimedia Intelligent Processing; Shenzhen MSU-BIT University; VinUniversity
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: [CVPR 2026] The first benchmark for action-level deepfake localization
Abstract:Temporal forgery localization aims to temporally identify manipulated segments in videos. Most existing benchmarks focus on appearance-level forgeries, such as face swapping and object removal. However, recent advances in video generation have driven the emergence of activity-level forgeries that modify human actions to distort event semantics, resulting in highly deceptive forgeries that critically undermine media authenticity and public trust. To overcome this issue, we introduce ActivityForensics, the first large-scale benchmark for localizing manipulated activity in videos. It contains over 6K forged video segments that are seamlessly blended into the video context, rendering high visual consistency that makes them almost indistinguishable from authentic content to the human eye. We further propose Temporal Artifact Diffuser (TADiff), a simple yet effective baseline that exposes artifact cues through a diffusion-based feature regularizer. Based on ActivityForensics, we introduce comprehensive evaluation protocols covering intra-domain, cross-domain, and open-world settings, and benchmark a wide range of state-of-the-art forgery localizers to facilitate future research. The dataset and code are available at this https URL.
[CV-113] InCaRPose: In-Cabin Relative Camera Pose Estimation Model and Dataset CVPR2026
[Quick Read]: This paper addresses the difficulty of accurately estimating relative camera pose in constrained, highly distorted environments such as in-cabin automotive monitoring (ICAM), where errors directly affect the metric-scale distances required for safety-relevant perception in autonomous driving. The key contribution is InCaRPose, a Transformer-based model that extracts features with a frozen pretrained backbone (e.g., DINOv3) and uses a Transformer decoder to capture the geometric relationship between reference and target views, producing absolute metric-scale translation within the physically plausible range of in-cabin camera mounts in a single inference step. Trained exclusively on synthetic data, the model generalizes to real cabin environments without requiring identical camera intrinsics, and maintains accurate rotation and translation estimates with limited training data while supporting real-time inference.
Link: https://arxiv.org/abs/2604.03814
Authors: Felix Stillger, Lukas Hahn, Frederik Hasecke, Tobias Meisen
Affiliations: University of Wuppertal; Aptiv
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at the CVPR 2026 Workshop on Autonomous Driving (WAD)
Abstract:Camera extrinsic calibration is a fundamental task in computer vision. However, precise relative pose estimation in constrained, highly distorted environments, such as in-cabin automotive monitoring (ICAM), remains challenging. We present InCaRPose, a Transformer-based architecture designed for robust relative pose prediction between image pairs, which can be used for camera extrinsic calibration. By leveraging frozen backbone features such as DINOv3 and a Transformer-based decoder, our model effectively captures the geometric relationship between a reference and a target view. Unlike traditional methods, our approach achieves absolute metric-scale translation within the physically plausible adjustment range of in-cabin camera mounts in a single inference step, which is critical for ICAM, where accurate real-world distances are required for safety-relevant perception. We specifically address the challenges of highly distorted fisheye cameras in automotive interiors by training exclusively on synthetic data. Our model is capable of generalization to real-world cabin environments without relying on the exact same camera intrinsics and additionally achieves competitive performance on the public 7-Scenes dataset. Despite having limited training data, InCaRPose maintains high precision in both rotation and translation, even with a ViT-Small backbone. This enables real-time performance for time-critical inference, such as driver monitoring in supervised autonomous driving. We release our real-world In-Cabin-Pose test dataset consisting of highly distorted vehicle-interior images and our code at this https URL.
[CV-114] Bridging Restoration and Diagnosis: A Comprehensive Benchmark for Retinal Fundus Enhancement
[Quick Read]: This paper addresses the lack of scientifically grounded, clinically relevant, and unified evaluation standards for generative fundus image enhancement. Existing metrics such as PSNR and SSIM fail to capture clinically critical features (e.g., lesion preservation and vessel morphology consistency), no unified protocol covers both paired and unpaired enhancement methods, and current evaluations offer few actionable insights for improving future models. The key contribution is EyeBench-V2, a benchmark with three components: (1) multi-dimensional clinically aligned evaluation via downstream tasks such as vessel segmentation, diabetic retinopathy grading, generalization to unseen noise, and lesion segmentation; (2) an expert-guided design including a newly curated dataset and a structured manual assessment protocol performed by medical experts, quantifying clinically sensitive aspects such as lesion structure alterations, background color shifts, and introduced artifacts; and (3) a rigorous task-oriented analysis that gives clinical researchers evidence for informed decisions and exposes the limitations of current methods to guide next-generation clinically aligned enhancement models.
Link: https://arxiv.org/abs/2604.03806
Authors: Xuanzhao Dong, Wenhui Zhu, Xiwen Chen, Hao Wang, Xin Li, Yujian Xiong, Jiajun Cheng, Zhipeng Wang, Shao Tang, Oana Dumitrascu, Yalin Wang
Affiliations: Arizona State University; Clemson University; LinkedIn Corporation; Mayo Clinic
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Over the past decade, generative models have demonstrated success in enhancing fundus images. However, the evaluation of these models remains a challenge. A benchmark for fundus image enhancement is needed for three main reasons: (1) Conventional denoising metrics such as PSNR and SSIM fail to capture clinically relevant features, such as lesion preservation and vessel morphology consistency, limiting their applicability in real-world settings; (2) There is a lack of unified evaluation protocols that address both paired and unpaired enhancement methods, particularly those guided by clinical expertise; and (3) An evaluation framework should provide actionable insights to guide future advancements in clinically aligned enhancement models. To address these gaps, we introduce EyeBench-V2, a benchmark designed to bridge the gap between enhancement model performance and clinical utility. Our work offers three key contributions: (1) Multi-dimensional clinical-alignment through downstream evaluations: Beyond standard enhancement metrics, we assess performance across clinically meaningful tasks including vessel segmentation, diabetic retinopathy (DR) grading, generalization to unseen noise patterns, and lesion segmentation. (2) Expert-guided evaluation design: We curate a novel dataset enabling fair comparisons between paired and unpaired enhancement methods, accompanied by a structured manual assessment protocol by medical experts, which evaluates clinically critical aspects such as lesion structure alterations, background color shifts, and the introduction of artificial structures. (3) Actionable insights: Our benchmark provides a rigorous, task-oriented analysis of existing generative models, equipping clinical researchers with the evidence needed to make informed decisions, while also identifying limitations in current methods to inform the design of next-generation enhancement models.
[CV-115] Rényi Attention Entropy for Patch Pruning ICPR2026
[Quick Read]: This paper addresses the quadratic cost of self-attention in Transformers with respect to the number of tokens, which is especially burdensome in vision tasks with many patches. The key idea is an entropy-based criterion for patch importance: the Shannon entropy of each patch's attention distribution measures how concentrated its attention is, so low-entropy patches (attention focused on a few positions) are kept as informative while high-entropy patches (attention spread across many locations) are pruned as redundant. The criterion is further extended to Rényi entropy, which emphasizes sharp attention peaks and supports pruning strategies that adapt to task needs and computational limits, yielding a better trade-off between accuracy and computation.
Link: https://arxiv.org/abs/2604.03803
Authors: Hiroaki Aizawa, Yuki Igaue
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to ICPR2026
Abstract:Transformers are strong baselines in both vision and language because self-attention captures long-range dependencies across tokens. However, the cost of self-attention grows quadratically with the number of tokens. Patch pruning mitigates this cost by estimating per-patch importance and removing redundant patches. To identify informative patches for pruning, we introduce a criterion based on the Shannon entropy of the attention distribution. Low-entropy patches, which receive selective and concentrated attention, are kept as important, while high-entropy patches with attention spread across many locations are treated as redundant. We also extend the criterion from Shannon to Rényi entropy, which emphasizes sharp attention peaks and supports pruning strategies that adapt to task needs and computational limits. In experiments on fine-grained image recognition, where patch selection is critical, our method reduced computation while preserving accuracy. Moreover, adjusting the pruning policy through the Rényi entropy measure yields further gains and improves the trade-off between accuracy and computation.
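The entropy criterion above can be sketched in a few lines of NumPy: score each patch by the entropy of its attention row and keep the lowest-entropy patches. The function names and the keep-ratio interface are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def renyi_entropy(p, alpha=2.0, eps=1e-12):
    """Rényi entropy of a discrete distribution; alpha -> 1 recovers Shannon."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    if abs(alpha - 1.0) < 1e-9:                     # Shannon limit
        return float(-np.sum(p * np.log(p + eps)))
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))

def prune_patches(attention, keep_ratio=0.5, alpha=2.0):
    """Keep the patches whose attention rows have the lowest entropy."""
    scores = np.array([renyi_entropy(row, alpha) for row in attention])
    n_keep = max(1, int(round(len(scores) * keep_ratio)))
    return np.sort(np.argsort(scores)[:n_keep])     # kept patch indices

# Rows 0 and 2 attend selectively; rows 1 and 3 spread attention uniformly.
attn = np.array([[0.97, 0.01, 0.01, 0.01],
                 [0.25, 0.25, 0.25, 0.25],
                 [0.01, 0.97, 0.01, 0.01],
                 [0.25, 0.25, 0.25, 0.25]])
print(prune_patches(attn, keep_ratio=0.5))  # [0 2]
```

Raising `alpha` above 1 weights the largest attention probabilities more heavily, which is why the Rényi variant rewards sharp attention peaks.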
[CV-116] HistoFusionNet: Histogram-Guided Fusion and Frequency-Adaptive Refinement for Nighttime Image Dehazing
[Quick Read]: This paper addresses nighttime image dehazing, a difficult low-level vision problem in which haze, glow, non-uniform illumination, color distortion, and sensor noise co-occur and invalidate assumptions commonly used by daytime dehazing methods. The key contribution is HistoFusionNet, a Transformer-enhanced architecture that combines histogram-guided representation learning with frequency-adaptive feature refinement: histogram transformer blocks group features by their dynamic-range characteristics to model long-range dependencies and aggregate similarly degraded regions under complex nighttime lighting, while a frequency-aware refinement branch adaptively fuses complementary low- and high-frequency cues to recover structure, suppress artifacts, and enhance local detail. The unified design handles the heterogeneous degradations of real nighttime hazy scenes and ranked 1st in the NTIRE 2026 Nighttime Image Dehazing Challenge, demonstrating its effectiveness and robustness.
Link: https://arxiv.org/abs/2604.03800
Authors: Mohammad Heydari, Wei Dong, Shahram Shirani, Jun Chen, Han Zhou
Affiliations: McMaster University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Nighttime image dehazing remains a challenging low-level vision problem due to the joint presence of haze, glow, non-uniform illumination, color distortion, and sensor noise, which often invalidate assumptions commonly used in daytime dehazing. To address these challenges, we propose HistoFusionNet, a transformer-enhanced architecture tailored for nighttime image dehazing by combining histogram-guided representation learning with frequency-adaptive feature refinement. Built upon a multi-scale encoder-decoder backbone, our method introduces histogram transformer blocks that model long-range dependencies by grouping features according to their dynamic-range characteristics, enabling more effective aggregation of similarly degraded regions under complex nighttime lighting. To further improve restoration fidelity, we incorporate a frequency-aware refinement branch that adaptively exploits complementary low- and high-frequency cues, helping recover scene structures, suppress artifacts, and enhance local details. This design yields a unified framework that is particularly well suited to the heterogeneous degradations encountered in real nighttime hazy scenes. Extensive experiments and highly competitive performance of our method on the NTIRE 2026 Nighttime Image Dehazing Challenge benchmark demonstrate the effectiveness of the proposed method. Our team ranked 1st among 22 participating teams, highlighting the robustness and competitive performance of HistoFusionNet. The code is available at: this https URL
[CV-117] Next-Scale Autoregressive Models for Text-to-Motion Generation CVPR2026
[Quick Read]: This paper addresses the mismatch between standard autoregressive (AR) next-token prediction and the temporal structure required for text-conditioned motion generation, where token-by-token AR models struggle to capture long-range motion structure and global semantic consistency. The key contribution is MoScale, a next-scale AR framework that generates motion hierarchically from coarse to fine temporal resolutions: global semantics are provided at the coarsest scale and progressively refined, establishing a causal hierarchy better suited to human motion. Cross-scale hierarchical refinement and in-scale bidirectional re-prediction further improve robustness and generation quality under limited text-motion data, yielding state-of-the-art performance, high training efficiency, and zero-shot transfer to diverse motion generation and editing tasks.
Link: https://arxiv.org/abs/2604.03799
Authors: Zhiwei Zheng, Shibo Jin, Lingjie Liu, Mingmin Zhao
Affiliations: University of Pennsylvania
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026
Abstract:Autoregressive (AR) models offer stable and efficient training, but standard next-token prediction is not well aligned with the temporal structure required for text-conditioned motion generation. We introduce MoScale, a next-scale AR framework that generates motion hierarchically from coarse to fine temporal resolutions. By providing global semantics at the coarsest scale and refining them progressively, MoScale establishes a causal hierarchy better suited for long-range motion structure. To improve robustness under limited text-motion data, we further incorporate cross-scale hierarchical refinement for improving per-scale initial predictions and in-scale temporal refinement for selective bidirectional re-prediction. MoScale achieves SOTA text-to-motion performance with high training efficiency, scales effectively with model size, and generalizes zero-shot to diverse motion generation and editing tasks.
[CV-118] Confidence-Driven Facade Refinement of 3D Building Models Using MLS Point Clouds
[Quick Read]: This paper addresses the problem that coarse CityGML building models derived from Airborne Laser Scanning (ALS) cannot meet the precision demands of digital twins, with particularly poor facade geometry. The key contribution is an automated refinement framework that uses the coarse model as a geometric prior: surface matching identifies outdated faces, and a binary integer optimization with hard constraints selects optimal replacement faces from candidate MLS data, enabling accurate facade correction in complex urban environments while guaranteeing topologically valid, watertight, and manifold geometry.
Link: https://arxiv.org/abs/2604.03797
Authors: Xiaoyu Huang
Affiliations: Technical University of Munich
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Digital twins require continuous maintenance to meet the increasing demand for high-precision geospatial data. However, traditional coarse CityGML building models, typically derived from Airborne Laser Scanning (ALS), often exhibit significant geometric deficiencies, particularly regarding facade accuracy due to the nadir perspective of airborne sensors. Integrating these coarse models with high-precision Mobile Laser Scanning (MLS) data is essential to recover detailed facade geometry. Unlike reconstruction-from-scratch approaches that discard existing semantic information and rely heavily on complete data coverage, this work presents an automated refinement framework that utilizes the coarse model as a geometric prior. This method enables targeted updates to facade geometry even in complex urban environments. It integrates surface matching to identify outdated surfaces and employs a binary integer optimization to select optimal faces from candidate data. Crucially, hard constraints are enforced within the optimization to ensure the topological validity of the refined output. Experimental results demonstrate that the proposed approach effectively corrects facade misalignments, reducing the Cloud-to-Mesh RMSE by approximately 36% and achieving centimeter-level alignment. Furthermore, the framework guarantees strictly watertight and manifold geometry, providing a robust solution for upgrading ALS-derived city models.
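The face-selection step described above, a binary integer optimization with hard constraints, can be illustrated with a brute-force solver over a handful of candidate faces. The scoring and pairwise-conflict encoding below are simplified assumptions; the paper's actual formulation is richer, and realistic instances would use an ILP solver rather than enumeration.

```python
from itertools import product

def select_faces(scores, conflicts):
    """Pick the subset of candidate faces maximizing total score, subject to
    hard constraints forbidding conflicting pairs from both being chosen."""
    n = len(scores)
    best, best_val = None, float("-inf")
    for x in product([0, 1], repeat=n):              # enumerate all subsets
        if any(x[i] and x[j] for i, j in conflicts):
            continue                                 # violates a hard constraint
        val = sum(s for s, xi in zip(scores, x) if xi)
        if val > best_val:
            best, best_val = list(x), val
    return best, best_val

# Faces 0 and 2 overlap geometrically, so at most one of them may be selected.
choice, value = select_faces([3.0, 2.0, 4.0], conflicts=[(0, 2)])
print(choice, value)  # [0, 1, 1] 6.0
```

Enumeration is exponential in the number of candidates, which is exactly why such problems are normally handed to a dedicated integer-programming solver.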
[CV-119] When Does Multimodal AI Help? Diagnostic Complementarity of Vision-Language Models and CNNs for Spectrum Management in Satellite-Terrestrial Networks
[Quick Read]: This paper clarifies the previously unexplored performance boundary between vision-language models (VLMs) and lightweight convolutional neural networks (CNNs) on spectrum-heatmap understanding, focusing on spectrum sensing in cooperative non-terrestrial and terrestrial network (NTN-TN) systems. The key contribution is SpectrumQA, the first diagnostic benchmark, with 108K visual question-answer pairs across four granularity levels: scene classification (L1), regional reasoning (L2), spatial localization (L3), and semantic reasoning (L4). Experiments show that the CNN excels at spatial localization (IoU = 0.552) and severity classification (72.9% accuracy), while the VLM uniquely enables semantic reasoning (F1 = 0.576 with only three in-context examples), a capability absent from CNNs; chain-of-thought (CoT) prompting markedly improves VLM reasoning but has no effect on spatial tasks, confirming that the complementarity stems from architectural differences. A deterministic task-type router that delegates supervised tasks to the CNN and reasoning tasks to the VLM achieves a composite score of 0.616, a 39.1% improvement over the CNN alone, and VLM representations prove more robust across scenarios.
Link: https://arxiv.org/abs/2604.03774
Authors: Yuanhang Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures
Abstract:The adoption of vision-language models (VLMs) for wireless network management is accelerating, yet no systematic understanding exists of where these large foundation models outperform lightweight convolutional neural networks (CNNs) for spectrum-related tasks. This paper presents the first diagnostic comparison of VLMs and CNNs for spectrum heatmap understanding in non-terrestrial network and terrestrial network (NTN-TN) cooperative systems. We introduce SpectrumQA, a benchmark comprising 108K visual question-answer pairs across four granularity levels: scene classification (L1), regional reasoning (L2), spatial localization (L3), and semantic reasoning (L4). Our experiments on three NTN-TN scenarios with a frozen Qwen2-VL-7B and a trained ResNet-18 reveal a clear task-dependent complementarity: CNN achieves 72.9% accuracy at severity classification (L1) and 0.552 IoU at spatial localization (L3), while VLM uniquely enables semantic reasoning (L4) with F1=0.576 using only three in-context examples, a capability fundamentally absent in CNN architectures. Chain-of-thought (CoT) prompting further improves VLM reasoning by 12.6% (F1: 0.209 to 0.233) while having zero effect on spatial tasks, confirming that the complementarity is rooted in architectural differences rather than prompting limitations. A deterministic task-type router that delegates supervised tasks to CNN and reasoning tasks to VLM achieves a composite score of 0.616, a 39.1% improvement over CNN alone. We further show that VLM representations exhibit stronger cross-scenario robustness, with smaller performance degradation in 5 out of 6 transfer directions. These findings provide actionable guidelines: deploy CNNs for spatial localization and VLMs for semantic spectrum reasoning, rather than treating them as substitutes.
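The deterministic router described in the abstract (supervised tasks to the CNN, reasoning tasks to the VLM) reduces to a lookup table over task types. The level labels and the assignment of L2 to the VLM are illustrative assumptions inferred from the reported results, not the authors' exact code.

```python
# Supervised perception goes to the CNN, open-ended reasoning to the VLM.
ROUTING_TABLE = {
    "L1_scene_classification": "cnn",   # CNN: 72.9% severity accuracy
    "L2_regional_reasoning":   "vlm",
    "L3_spatial_localization": "cnn",   # CNN: 0.552 IoU
    "L4_semantic_reasoning":   "vlm",   # VLM-only capability, F1 = 0.576
}

def route(task_type):
    """Delegate a spectrum-heatmap query to the better-suited model."""
    if task_type not in ROUTING_TABLE:
        raise ValueError(f"unknown task type: {task_type}")
    return ROUTING_TABLE[task_type]

print(route("L3_spatial_localization"))  # cnn
```

Because the mapping is fixed rather than learned, the router adds no inference overhead, which is consistent with the 39.1% composite-score gain coming purely from task allocation.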
[CV-120] M2StyleGS: Multi-Modality 3D Style Transfer with Gaussian Splatting
[Quick Read]: This paper addresses the inflexibility of conventional 3D style transfer methods that rely on a fixed reference image, whereas users of applications such as virtual or augmented reality often prefer textual descriptions or diverse imagery as style references. The key contribution is M2StyleGS, a real-time stylization technique that represents 3D scenes with 3D Gaussian Splatting (3DGS) and uses multimodal knowledge refined by CLIP as the style reference. A precise feature alignment called subdivisive flow avoids abnormal transformations and strengthens the projection of the combined CLIP text-visual feature onto the VGG style feature, while an observation loss improves consistency between generated views and the reference style and a suppression loss counteracts drift of the reference color information during decoding, enabling high-quality, consistent stylized novel views from text or image inputs.
Link: https://arxiv.org/abs/2604.03773
Authors: Xingyu Miao, Xueqi Qiu, Haoran Duan, Yawen Huang, Xian Wu, Jingjing Deng, Yang Long
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Conventional 3D style transfer methods rely on a fixed reference image to apply artistic patterns to 3D scenes. However, in practical applications such as virtual or augmented reality, users often prefer more flexible inputs, including textual descriptions and diverse imagery. In this work, we introduce a novel real-time styling technique M2StyleGS to generate a sequence of precisely color-mapped views. It utilizes 3D Gaussian Splatting (3DGS) as a 3D presentation and multi-modality knowledge refined by CLIP as a reference style. M2StyleGS resolves the abnormal transformation issue by employing a precise feature alignment, namely subdivisive flow, it strengthens the projection of the mapped CLIP text-visual combination feature to the VGG style feature. In addition, we introduce observation loss, which assists in the stylized scene better matching the reference style during the generation, and suppression loss, which suppresses the offset of reference color information throughout the decoding process. By integrating these approaches, M2StyleGS can employ text or images as references to generate a set of style-enhanced novel views. Our experiments show that M2StyleGS achieves better visual quality and surpasses the previous work by up to 32.92% in terms of consistency.
[CV-121] ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs
[Quick Read]: This paper addresses three shortcomings of existing image-captioning benchmarks: limited diversity in caption length, the absence of recent multimodal large language models (MLLMs), and insufficient human annotations, which together introduce bias. The key contribution is ICBench, a new large-scale benchmark covering 12 content categories with 40K short and long captions generated by 10 advanced MLLMs on 2K images, annotated through extensive human subjective studies with fine-grained mean opinion scores (MOSs). The authors further propose ITIScore, an automated metric based on an image-to-text-to-image framework that measures caption quality via reconstruction consistency; experiments show strong alignment with human judgments and zero-shot generalization to other public captioning datasets.
Link: https://arxiv.org/abs/2604.03765
Authors: Zitong Xu, Huiyu Duan, Shengyao Qin, Guangyu Yao, Guangji Ma, Xiongkuo Min, Ke Gu, Guangtao Zhai, Patrick Le Callet
Affiliations: Shanghai Jiao Tong University; University of Electronic Science and Technology of China; Beijing University of Technology; Institut Universitaire de France; University of Nantes
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advances in multimodal large language models (MLLMs) have greatly improved image understanding and captioning capabilities. However, existing image captioning benchmarks typically suffer from limited diversity in caption length, the absence of recent advanced MLLMs, and insufficient human annotations, which potentially introduce bias and limit the ability to comprehensively assess the performance of modern MLLMs. To address these limitations, we present a new large-scale image captioning benchmark, termed ICBench, which covers 12 content categories and consists of both short and long captions generated by 10 advanced MLLMs on 2K images, resulting in 40K captions in total. We conduct extensive human subjective studies to obtain mean opinion scores (MOSs) across fine-grained evaluation dimensions, where short captions are assessed in terms of fluency, relevance, and conciseness, while long captions are evaluated based on fluency, relevance, and completeness. Furthermore, we propose an automated evaluation metric, ITIScore, based on an image-to-text-to-image framework, which measures caption quality through reconstruction consistency. Experimental results demonstrate strong alignment between our automatic metric and human judgments, as well as robust zero-shot generalization ability on other public captioning datasets. Both the dataset and model will be released upon publication.
[CV-122] Real-time Neural Six-way Lightmaps
[Quick Read]: This paper addresses the difficulty of rendering participating media such as smoke in real time with both visual realism and efficiency. Existing six-way lightmap techniques approximate lighting with precomputed lightmaps but are limited to pre-simulated animation sequences and ignore camera movement. The key contribution is a neural six-way lightmaps method: ray marching with a large sampling distance first generates a camera-view guiding map that approximates smoke scattering and silhouette, and a neural network is then trained to predict the corresponding six-way lightmaps from this guiding map. This supports real-time interaction, including smoke-obstacle interaction, camera movement, and light changes, and integrates seamlessly into existing game engine pipelines.
Link: https://arxiv.org/abs/2604.03748
Authors: Wei Li, Hanxiao Sun, Tao Huang, Haoxiang Wang, Tongtong Wang, Zherong Pan, Kui Wu
Affiliations: Shanghai Jiao Tong University; LIGHTSPEED; Tsinghua University
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 Pages, 16 Figures
Abstract:Participating media are a pervasive and intriguing visual effect in virtual environments. Unfortunately, rendering such phenomena in real-time is notoriously difficult due to the computational expense of estimating the volume rendering equation. While the six-way lightmaps technique has been widely used in video games to render smoke with a camera-oriented billboard and approximate lighting effects using six precomputed lightmaps, achieving a balance between realism and efficiency, it is limited to pre-simulated animation sequences and is ignorant of camera movement. In this work, we propose a neural six-way lightmaps method to strike a long-sought balance between dynamics and visual realism. Our approach first generates a guiding map from the camera view using ray marching with a large sampling distance to approximate smoke scattering and silhouette. Then, given a guiding map, we train a neural network to predict the corresponding six-way lightmaps. The resulting lightmaps can be seamlessly used in existing game engine pipelines. This approach supports visually appealing rendering effects while enabling real-time user interactivity, including smoke-obstacle interaction, camera movement, and light change. By conducting a series of comprehensive benchmarks, we demonstrate that our method is well-suited for real-time applications, such as games and VR/AR.
[CV-123] Shower-Aware Dual-Stream Voxel Networks for Structural Defect Detection in Cosmic-Ray Muon Tomography
[Quick Read]: This paper addresses the insufficient accuracy of voxel-level segmentation of structural defects in reinforced concrete (honeycombing, shear fracture, corrosion voids, and delamination) using cosmic-ray muon tomography, where conventional reconstruction methods such as POCA and MLSD rely solely on muon scattering angles and struggle to distinguish complex defect types. The key contribution is SA-DSVN, which processes scattering kinematics (9 channels) and secondary electromagnetic shower multiplicities (40 channels) through independent encoders fused via cross-attention. Experiments show that shower multiplicity alone provides most of the discriminative power, raising defect-mean Dice from 0.535 (scattering only) to 0.685 (shower only), and the full model reaches 96.3% voxel accuracy and 100% volume-level detection sensitivity on the validation set.
Link: https://arxiv.org/abs/2604.03741
Authors: Parthiv Dasgupta, Sambhav Agarwal, Palash Dutta, Raja Karmakar, Sudeshna Goswami
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph)
Comments: 8 pages, 10 figures, 4 tables. Includes supplementary data via Zenodo DOI: https://doi.org/10.5281/zenodo.19355077 . This work introduces SA-DSVN for 3D voxel segmentation in muon tomography, utilizing secondary electromagnetic shower multiplicities. (pp. 1, 3)
Abstract:We present SA-DSVN, a 3D convolutional architecture for voxel-level segmentation of structural defects in reinforced concrete using cosmic-ray muon tomography. Unlike conventional reconstruction methods (POCA, MLSD) that rely solely on muon scattering angles, our approach jointly processes scattering kinematics (9 channels) and secondary electromagnetic shower multiplicities (40 channels) through independent encoder streams fused via cross-attention. Training data were generated using Vega, a cloud-native Geant4 simulation framework, producing 4.5 million muon events across 900 volumes containing four defect types - honeycombing, shear fracture, corrosion voids, and delamination - embedded within a dense 7x7 rebar cage. A five-variant ablation study demonstrates that the shower multiplicity stream alone accounts for the majority of discriminative power, raising defect-mean Dice from 0.535 (scattering only) to 0.685 (shower only). On 60 independently simulated validation volumes, the model achieves 96.3% voxel accuracy, per-defect Dice scores of 0.59-0.81, and 100% volume-level detection sensitivity at 10 ms inference per volume. These results establish secondary shower multiplicity as a previously unexploited but highly effective feature for learned muon tomographic reconstruction.
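The per-defect Dice scores reported above are instances of the standard overlap metric on binary voxel masks. A minimal NumPy version, independent of the paper's code, is:

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice coefficient 2|A∩B| / (|A| + |B|) between two binary voxel masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return float(2.0 * intersection / (pred.sum() + target.sum() + eps))

# A 2x2x2 toy volume: the predicted defect covers the whole true defect
# plus two extra voxels, so the overlap is partial.
pred = np.zeros((2, 2, 2), dtype=bool);   pred[0] = True        # 4 voxels
target = np.zeros((2, 2, 2), dtype=bool); target[0, 0] = True   # 2 voxels
print(round(dice_score(pred, target), 3))  # 0.667
```

The `eps` term only guards against a division by zero when both masks are empty; it does not change scores on realistic volumes.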
[CV-124] Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation CVPR2026
[Quick Read]: This paper addresses reference confusion in multi-reference video generation: when multiple reference images are visually very similar, the model fails to retrieve the correct context, leading to inaccurate character control. The key contribution is PoCo (Position Embedding as a Context Controller), which introduces position encoding as additional context control beyond semantic retrieval, enabling precise token-level matching based on positional side information while preserving implicit semantic consistency, thereby achieving accurate character control and improved cross-shot consistency.
Link: https://arxiv.org/abs/2604.03738
Authors: Binyuan Huang, Yuning Lu, Weinan Jia, Hualiang Wang, Mu Liu, Daiqing Yang
Affiliations: Wuhan University; University of Science and Technology of China; Hong Kong University of Science and Technology; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026
Abstract:Recent proprietary models such as Sora2 demonstrate promising progress in generating multi-shot videos conditioned on multiple reference characters. However, academic research on this problem remains limited. We study this task and identify a core challenge: when reference images exhibit highly similar appearances, the model often suffers from reference confusion, where semantically similar tokens degrade the model’s ability to retrieve the correct context. To address this, we introduce PoCo (Position Embedding as a Context Controller), which incorporates position encoding as additional context control beyond semantic retrieval. By employing side information of tokens, PoCo enables precise token-level matching while preserving implicit semantic consistency modeling. Building on PoCo, we develop a multi-reference and multi-shot video generation model capable of reliably controlling characters with extremely similar visual traits. Extensive experiments demonstrate that PoCo improves cross-shot consistency and reference fidelity compared with various baselines.
[CV-125] SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation CVPR2026
[Quick Read]: This paper addresses the difficulty of jointly controlling camera motion and object dynamics in video generation: existing methods typically handle only one motion type or rely on ambiguous 2D cues that entangle camera-induced parallax with true object movement, preventing structurally consistent, semantically unambiguous generation. The key contributions of SymphoMotion are two mechanisms: a Camera Trajectory Control mechanism that combines explicit camera paths with geometry-aware cues for stable, structurally consistent viewpoint transitions, and an Object Dynamics Control mechanism that fuses 2D visual guidance with 3D trajectory embeddings for depth-aware, spatially coherent object manipulation. The authors also construct RealCOD-25K, filling the gap in large-scale real-world annotated data for unified motion control and enabling further research in this direction.
Link: https://arxiv.org/abs/2604.03723
Authors: Guiyu Zhang, Yabo Chen, Xunzhi Xiang, Junchao Huang, Zhongyu Wang, Li Jiang
Affiliations: The Chinese University of Hong Kong, Shenzhen; Shanghai Jiao Tong University; Nanjing University; Beihang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026
Abstract:Controlling both camera motion and object dynamics is essential for coherent and expressive video generation, yet current methods typically handle only one motion type or rely on ambiguous 2D cues that entangle camera-induced parallax with true object movement. We present SymphoMotion, a unified motion-control framework that jointly governs camera trajectories and object dynamics within a single model. SymphoMotion features a Camera Trajectory Control mechanism that integrates explicit camera paths with geometry-aware cues to ensure stable, structurally consistent viewpoint transitions, and an Object Dynamics Control mechanism that combines 2D visual guidance with 3D trajectory embeddings to enable depth-aware, spatially coherent object manipulation. To support large-scale training and evaluation, we further construct RealCOD-25K, a comprehensive real-world dataset containing paired camera poses and object-level 3D trajectories across diverse indoor and outdoor scenes, addressing a key data gap in unified motion control. Extensive experiments and user studies show that SymphoMotion significantly outperforms existing methods in visual fidelity, camera controllability, and object-motion accuracy, establishing a new benchmark for unified motion control in video generation. Code and data are publicly available at this https URL.
[CV-126] CGHair: Compact Gaussian Hair Reconstruction with Card Clustering CVPR2026
[Quick Read]: This paper addresses the heavy computation and storage costs of high-fidelity hair reconstruction from multi-view images: current 3D Gaussian Splatting (3DGS) methods achieve realistic rendering but often require millions of primitives. The key idea exploits structural and visual similarity within a hairstyle by clustering strands into representative hair cards and grouping these into shared texture codebooks, greatly reducing storage while maintaining comparable visual quality. A generative prior accelerated method further reconstructs the initial strand geometry from the image set, yielding a 4-fold reduction in reconstruction time and an over 200x lower memory footprint.
Link: https://arxiv.org/abs/2604.03716
Authors: Haimin Luo, Srinjay Sarkar, Albert Mosella-Montoro, Francisco Vicente Carrasco, Fernando De la Torre
Affiliations: Carnegie Mellon University; ShanghaiTech University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted to CVPR 2026. This arXiv version is not the final published version
Abstract:We present a compact pipeline for high-fidelity hair reconstruction from multi-view images. While recent 3D Gaussian Splatting (3DGS) methods achieve realistic results, they often require millions of primitives, leading to high storage and rendering costs. Observing that hair exhibits structural and visual similarities across a hairstyle, we cluster strands into representative hair cards and group these into shared texture codebooks. Our approach integrates this structure with 3DGS rendering, significantly reducing reconstruction time and storage while maintaining comparable visual quality. In addition, we propose a generative prior accelerated method to reconstruct the initial strand geometry from a set of images. Our experiments demonstrate a 4-fold reduction in strand reconstruction time and achieve comparable rendering performance with over 200x lower memory footprint.
[CV-127] Learning Superpixel Ensemble and Hierarchy Graphs for Melanoma Detection
[Quick Read]: This paper addresses the automatic detection of melanoma in dermoscopic images, in particular improving feature representation and classification under complex backgrounds and morphological variation. The key contribution is a graph structure learning approach built on two graph-theoretic representations, superpixel ensemble graphs (SEG) and superpixel hierarchy graphs (SHG): multi-level superpixel maps yield subgraphs with different node counts (20-100), with texture, geometric, and color features as nodal signals. Both handcrafted Gaussian edge weights and optimization-based learned weights are explored, along with edge-threshold pruning (25%, 50%, 75%) to improve robustness. Experiments show that learned superpixel ensemble graphs with textural nodal signals achieve 99.00% accuracy and 99.59% AUC, substantially outperforming conventional methods and validating graph structure learning for medical image analysis.
Link: https://arxiv.org/abs/2604.03710
Authors: Asmaa M. Elwer, Muhammad A. Rushdi, Mahmoud H. Annaby
Affiliations: Cairo University; New Giza University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Graph signal processing (GSP) is becoming a major tool in biomedical signal and image analysis. In most GSP techniques, graph structures and edge weights have been typically set via statistical and computational methods. More recently, graph structure learning methods offered more reliable and flexible data representations. In this work, we introduce a graph learning approach for melanoma detection in dermoscopic images based on two graph-theoretic representations: superpixel ensemble graphs (SEG) and superpixel hierarchy graphs (SHG). For these two types of graphs, superpixel maps of a skin lesion image are respectively generated at multiple levels without and with parent-child constraints among superpixels at adjacent levels, where each level corresponds to a subgraph with a different number of nodes (20, 40, 60, 80, or 100 nodes). Two edge weight assignment techniques are explored: handcrafted Gaussian weights and learned weights based on optimization methods. The graph nodal signals are assigned based on texture, geometric, and color superpixel features. In addition, the effect of graph edge thresholding is investigated by applying different thresholds (25%, 50%, and 75%) to prune the weakest edges and analyze the impact of pruning on the melanoma detection performance. Experimental evaluation of the proposed method is performed with different classifiers trained and tested on the publicly available ISIC2017 dataset. Data augmentation is applied to alleviate class imbalance by adding more melanoma images from the ISIC archive. The results show that learned superpixel ensemble graphs with textural nodal signals give the highest performance reaching an accuracy of 99.00% and an AUC of 99.59%.
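As a rough illustration of the handcrafted Gaussian weighting and edge-thresholding steps described above (the feature values, bandwidth, and graph size are invented for this sketch; only the Gaussian form and the percentage thresholds come from the abstract):

```python
import numpy as np

def gaussian_edge_weights(features, sigma=1.0):
    """Dense edge weights w_ij = exp(-||f_i - f_j||^2 / (2 sigma^2))."""
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(w, 0.0)  # no self-loops
    return w

def prune_weakest_edges(w, percent):
    """Zero out the weakest `percent`% of edges (threshold on upper triangle)."""
    vals = w[np.triu_indices_from(w, k=1)]
    thresh = np.percentile(vals, percent)
    return np.where(w >= thresh, w, 0.0)

rng = np.random.default_rng(0)
feats = rng.normal(size=(20, 8))   # e.g. a 20-node superpixel subgraph, 8-dim features
w = gaussian_edge_weights(feats)
w50 = prune_weakest_edges(w, 50)   # drop the weakest 50% of edges
```

The same pruning call with 25 or 75 reproduces the other two thresholds studied in the paper.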
[CV-128] XSeg: A Large-scale X-ray Contraband Segmentation Benchmark For Real-World Security Screening CVPR2026
【速读】:该论文旨在解决X射线违禁品检测中因依赖边界框标注而导致模型泛化能力弱、性能受限的问题,核心挑战在于缺乏像素级监督信号及真实世界数据的不足。解决方案的关键在于构建了目前最大规模的X射线违禁品分割数据集XSeg(包含98,644张图像和295,932个实例掩码),并提出自适应点SAM(Adaptive Point SAM, APSAM)模型以实现高效精准的掩码标注。APSAM通过引入能量感知编码器(Energy-Aware Encoder)增强掩码解码器的初始化,显著提升对叠放物品的敏感性,并设计自适应点生成器,仅需单个粗略点提示即可获得精确掩码标签,从而大幅降低人工标注成本并提高标注效率。
链接: https://arxiv.org/abs/2604.03706
作者: Hongxia Gao,Litao Li,Yixin Chen,Jiali Wen,Kaijie Zhang,Qianyun Liu
机构: Xi’an Jiaotong University (西安交通大学); South China University of Technology (华南理工大学); Shenzhen Loop Area Institute (深圳河套地区研究院); Pazhou Laboratory (琶洲实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 8 figures, Accepted to CVPR 2026
Abstract:X-ray contraband detection is critical for public safety. However, current methods primarily rely on bounding box annotations, which limit model generalization and performance due to the lack of pixel-level supervision and real-world data. To address these limitations, we introduce XSeg. To the best of our knowledge, XSeg is the largest X-ray contraband segmentation dataset to date, including 98,644 images and 295,932 instance masks, and contains the latest 30 common contraband categories. The images are sourced from public datasets and our synthesized data, filtered through a custom data cleaning pipeline to remove low-quality samples. To enable accurate and efficient annotation and reduce manual labeling effort, we propose Adaptive Point SAM (APSAM), a specialized mask annotation model built upon the Segment Anything Model (SAM). We address SAM’s poor cross-domain generalization and limited capability in detecting stacked objects by introducing an Energy-Aware Encoder that enhances the initialization of the mask decoder, significantly improving sensitivity to overlapping items. Additionally, we design an Adaptive Point Generator that allows users to obtain precise mask labels with only a single coarse point prompt. Extensive experiments on XSeg demonstrate the superior performance of APSAM.
[CV-129] VidNum-1.4K: A Comprehensive Benchmark for Video-based Numerical Reasoning
【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在视频理解任务中缺乏真正数值推理能力的问题,即现有基准测试往往局限于单一场景或仅将计数视为浅层回归任务,未能评估模型在复杂真实世界多媒体内容中进行多步骤数值逻辑推断的能力。其解决方案的关键在于构建了一个名为VidNum-1.4K的综合性视频问答(VideoQA)基准,包含1,379对严格人工标注的视频-问题对,并采用三级层次结构设计,从直接视觉感知逐步过渡到基于时间证据的组合式数值推理(如算术运算、比较和逻辑推理),从而系统性地检验模型对现实动态的理解深度。这一设计使VidNum-1.4K成为衡量下一代视频智能模型是否具备稳定“内部世界模型”的关键诊断工具。
链接: https://arxiv.org/abs/2604.03701
作者: Shaoyang Cui,Lingbei Meng
机构: Tsinghua University (清华大学); Shenzhen Loop Area Institute (深圳河套地区研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 5 figures, under review at ACMMM 2026 Dataset Track
Abstract:Video-based numerical reasoning provides a premier arena for testing whether Vision-Language Models (VLMs) truly “understand” real-world dynamics, as accurate numerical deduction necessitates a profound grasp of temporal events, object permanence, and compositional logic beyond superficial pattern matching. However, existing benchmarks are often confined to narrow domains, such as repetitive athletic motions, or treat simple counting merely as a superficial regression task, failing to assess multi-step numerical logic within the inherent complexity of real-world multimedia content. We introduce VidNum-1.4K, a comprehensive VideoQA benchmark comprising 1,379 strictly human-annotated video-question pairs designed to evaluate genuine numerical reasoning across highly diverse environments, encompassing object, action, and event quantification. The VidNum-1.4K is uniquely structured into a three-level hierarchy that evolves from direct visual perception to video-based compositional numerical reasoning, requiring models to perform arithmetic operations, comparisons, and logical deductions grounded in temporal evidence. Our evaluations across a diverse suite of state-of-the-art VLMs reveal a striking reasoning gap: while the Gemini-3.1-pro barely reaches a 60% accuracy threshold, representative open-source families struggle heavily in the 25%–45% range. These findings demonstrate that current VLMs still lack a stable “internal world model”, positioning VidNum-1.4K as a demanding diagnostic testbed for the next generation of numerical video intelligence.
[CV-130] SGTA: Scene-Graph Based Multi-Modal Traffic Agent for Video Understanding
【速读】:该论文旨在解决交通视频理解中复杂场景下多模态信息融合与可解释决策的问题,尤其针对视频问答(VideoQA)任务中的语义推理难题。其解决方案的关键在于提出一种基于场景图的多模态交通代理框架(Scene-Graph Based Multi-Modal Traffic Agent, SGTA),通过从路侧视频中构建结构化的交通场景图(scene graph),结合符号化图查询与视觉输入进行工具驱动的推理,并采用ReAct机制实现大语言模型(Large Language Model, LLM)与外部工具调用的交错推理过程,从而在保证高精度的同时提供透明、可追踪的决策路径。
链接: https://arxiv.org/abs/2604.03697
作者: Xingcheng Zhou,Mingyu Liu,Walter Zimmer,Jiajie Zhang,Alois Knoll
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present Scene-Graph Based Multi-Modal Traffic Agent (SGTA), a modular framework for traffic video understanding that combines structured scene graphs with multi-modal reasoning. It constructs a traffic scene graph from roadside videos using detection, tracking, and lane extraction, followed by tool-based reasoning over both symbolic graph queries and visual inputs. SGTA adopts ReAct to process interleaved reasoning traces from large language models with tool invocations, enabling interpretable decision-making for complex video questions. Experiments on selected TUMTraffic VideoQA dataset sample demonstrate that SGTA achieves competitive accuracy across multiple question types while providing transparent reasoning steps. These results highlight the potential of integrating structured scene representations with multi-modal agents for traffic video understanding.
[CV-131] FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning
【速读】:该论文旨在解决当前3D场景理解方法在功能性关系建模上的局限性,即现有方法通常孤立地考虑对象对之间的功能关系,无法捕捉人类用于消除歧义的全局场景依赖性。解决方案的关键在于提出FunFact框架,该框架从带姿态的RGB-D图像中构建概率性的开放词汇功能3D场景图(functional 3D scene graph),首先建立以对象和部件为中心的3D地图,并利用基础模型(foundation models)生成语义合理的功能关系候选;随后将这些候选转化为因子图变量,并通过大语言模型(LLM)推导的常识先验与几何先验进行约束,从而实现对所有功能边及其边缘分布的联合概率推理,显著提升了置信度校准精度。
链接: https://arxiv.org/abs/2604.03696
作者: Zhengyu Fu,René Zurbrügg,Kaixian Qu,Marc Pollefeys,Marco Hutter,Hermann Blum,Zuria Bauer
机构: ETH Zürich (苏黎世联邦理工学院); Microsoft (微软); University of Bonn (波恩大学); Lamarr Institute (拉马尔研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent work in 3D scene understanding is moving beyond purely spatial analysis toward functional scene understanding. However, existing methods often consider functional relationships between object pairs in isolation, failing to capture the scene-wide interdependence that humans use to resolve ambiguity. We introduce FunFact, a framework for constructing probabilistic open-vocabulary functional 3D scene graphs from posed RGB-D images. FunFact first builds an object- and part-centric 3D map and uses foundation models to propose semantically plausible functional relations. These candidates are converted into factor graph variables and constrained by both LLM-derived common-sense priors and geometric priors. This formulation enables joint probabilistic inference over all functional edges and their marginals, yielding substantially better calibrated confidence scores. To benchmark this setting, we introduce FunThor, a synthetic dataset based on AI2-THOR with part-level geometry and rule-based functional annotations. Experiments on SceneFun3D, FunGraph3D, and FunThor show that FunFact improves node and relation discovery recall and significantly reduces calibration error for ambiguous relations, highlighting the benefits of holistic probabilistic modeling for functional scene understanding. See our project page at this https URL
[CV-132] ResGuard: Enhancing Robustness Against Known Original Attacks in Deep Watermarking
【速读】:该论文旨在解决深度学习图像水印框架中对已知原始攻击(Known Original Attack, KOA)的脆弱性问题,即当攻击者拥有多个原始图像与对应水印图像对时,可利用这些信息设计针对性的移除策略,从而有效消除水印且保持视觉质量。现有“编码器-噪声层-解码器”(END)架构因生成的嵌入残差缺乏图像依赖性,导致其在不同图像间具有可迁移性,成为KOA攻击的突破口。解决方案的关键在于提出ResGuard模块,其核心是引入残差特异性增强损失(residual specificity enhancement loss),强制残差与宿主图像紧密耦合以提升图像依赖性,并辅以一个模拟KOA噪声层在训练中注入残差风格扰动,使解码器在嵌入不一致性下仍能稳定工作。该方案显著提升了KOA场景下的水印提取准确率,从59.87%提升至99.81%。
链接: https://arxiv.org/abs/2604.03693
作者: Hanyi Wang,Han Fang,Yupeng Qiu,Shilin Wang,Ee-Chien Chang
机构: Shanghai Jiao Tong University (上海交通大学); University of Science and Technology of China (中国科学技术大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep learning-based image watermarking commonly adopts an “Encoder-Noise Layer-Decoder” (END) architecture to improve robustness against random channel distortions, yet it often overlooks intentional manipulations introduced by adversaries with additional knowledge. In this paper, we revisit this paradigm and expose a critical yet underexplored vulnerability: the Known Original Attack (KOA), where an adversary has access to multiple original-watermarked image pairs, enabling various targeted suppression strategies. We show that even a simple residual-based removal approach, namely estimating an embedding residual from known pairs and subtracting it from unseen watermarked images, can almost completely remove the watermark while preserving visual quality. This vulnerability stems from the insufficient image dependency of residuals produced by END frameworks, which makes them transferable across images. To address this, we propose ResGuard, a plug-and-play module that enhances KOA robustness by enforcing image-dependent embedding. Its core lies in a residual specificity enhancement loss, which encourages residuals to be tightly coupled with their host images and thus improves image dependency. Furthermore, an auxiliary KOA noise layer injects residual-style perturbations during training, allowing the decoder to remain reliable under stronger embedding inconsistencies. Integrated into existing frameworks, ResGuard boosts KOA robustness, improving average watermark extraction accuracy from 59.87% to 99.81%.
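The simple residual-based removal attack that the paper uses to expose the KOA vulnerability can be sketched in a few lines. This is an idealized toy (purely additive, image-independent embedding on random arrays); real END residuals carry some image dependence, which is exactly what ResGuard strengthens:

```python
import numpy as np

def estimate_residual(originals, watermarked):
    """KOA: average the embedding residual over known original/watermarked pairs."""
    return np.mean(watermarked - originals, axis=0)

rng = np.random.default_rng(0)
residual = 0.02 * rng.standard_normal((8, 8))   # image-independent embedding residual
originals = rng.random((5, 8, 8))               # known original images
watermarked = originals + residual              # idealized END-style additive embedding

est = estimate_residual(originals, watermarked)
unseen = rng.random((8, 8)) + residual          # a watermarked image the attacker never saw
cleaned = unseen - est                          # attack: subtract the estimated residual
```

Because the toy residual is transferable across images, `cleaned` loses the watermark entirely; enforcing image-dependent residuals (ResGuard's specificity loss) is what breaks this transferability.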
[CV-133] SciLT: Long-Tailed Classification in Scientific Image Domains
【速读】:该论文旨在解决基础模型(foundation models)在科学图像长尾识别任务中性能受限的问题,尤其关注在视觉特征和监督信号与自然图像存在显著差异的科学数据域下,传统微调策略效果有限的现象。其解决方案的关键在于提出SciLT框架,通过自适应特征融合机制整合中间层(penultimate-layer)与最终层(final-layer)特征,并引入双监督学习策略,从而有效提升尾部类别的识别性能,实现头尾类别间的平衡表现。
链接: https://arxiv.org/abs/2604.03687
作者: Jiahao Chen,Bing Su
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Long-tailed recognition has benefited from foundation models and fine-tuning paradigms, yet existing studies and benchmarks are mainly confined to natural image domains, where pre-training and fine-tuning data share similar distributions. In contrast, scientific images exhibit distinct visual characteristics and supervision signals, raising questions about the effectiveness of fine-tuning foundation models in such settings. In this work, we investigate scientific long-tailed recognition under a purely visual and parameter-efficient fine-tuning (PEFT) paradigm. Experiments on three scientific benchmarks show that fine-tuning foundation models yields limited gains, and reveal that penultimate-layer features play an important role, particularly for tail classes. Motivated by these findings, we propose SciLT, a framework that exploits multi-level representations through adaptive feature fusion and dual-supervision learning. By jointly leveraging penultimate- and final-layer features, SciLT achieves balanced performance across head and tail classes. Extensive experiments demonstrate that SciLT consistently outperforms existing methods, establishing a strong and practical baseline for scientific long-tailed recognition and providing valuable guidance for adapting foundation models to scientific data with substantial domain shifts.
[CV-134] DSERT-RoLL: Robust Multi-Modal Perception for Diverse Driving Conditions with Stereo Event-RGB-Thermal Cameras, 4D Radar and Dual-LiDAR CVPR2026
【速读】:该论文旨在解决自动驾驶感知中因传感器数据稀缺(尤其是事件相机和4D雷达等新型传感器)导致的模型训练与评估困难问题,以及多模态传感器融合策略缺乏统一基准和系统性研究的问题。解决方案的关键在于构建了一个多模态驾驶数据集DSERT-RoLL,其包含立体事件相机、RGB相机、热成像相机、4D雷达和双LiDAR,并在多种天气与光照条件下采集,提供精确的2D/3D边界框标注及轨迹ID,从而支持跨传感器组合的公平比较;同时提出了一种统一特征空间融合框架,将各传感器特有信息映射至共享表示空间,显著提升了复杂环境下的3D目标检测鲁棒性。
链接: https://arxiv.org/abs/2604.03685
作者: Hoonhee Cho,Jae-Young Kang,Yuhwan Jeong,Yunseo Yang,Wonyoung Lee,Youngho Kim,Kuk-Jin Yoon
机构: KAIST
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026
Abstract:In this paper, we present DSERT-RoLL, a driving dataset that incorporates stereo event, RGB, and thermal cameras together with 4D radar and dual LiDAR, collected across diverse weather and illumination conditions. The dataset provides precise 2D and 3D bounding boxes with track IDs and ego vehicle odometry, enabling fair comparisons within and across sensor combinations. It is designed to alleviate data scarcity for novel sensors such as event cameras and 4D radar and to support systematic studies of their behavior. We establish unified 3D and 2D benchmarks that enable direct comparison of characteristics and strengths across sensor families and within each family. We report baselines for representative single modality and multimodal methods and provide protocols that encourage research on different fusion strategies and sensor combinations. In addition, we propose a fusion framework that integrates sensor specific cues into a unified feature space and improves 3D detection robustness under varied weather and lighting.
[CV-135] DiffSparse: Accelerating Diffusion Transformers with Learned Token Sparsity ICLR2026
【速读】:该论文旨在解决扩散模型(Diffusion Models)在图像生成任务中因多步推理机制导致的高计算成本问题,尤其针对基于Transformer架构的扩散模型在少步采样时加速效果不佳的问题。现有方法依赖层或token缓存技术以降低计算开销,但受限于低效的特征缓存策略、人工设计的稀疏分配方式,以及仍需保留部分步骤完整前向计算等缺陷,难以实现高效加速。其解决方案的关键在于提出一种可微分的逐层稀疏优化框架,通过引入一个可学习网络结合动态规划求解器,端到端地优化各层稀疏度分配,并采用两阶段训练策略避免传统方法中对全步处理的依赖,从而显著减少token计算成本并提升整体效率。实验表明,该方法在保持甚至超越原始生成质量的同时,实现了高达54%的计算成本降低。
链接: https://arxiv.org/abs/2604.03674
作者: Haowei Zhu,Ji Liu,Ziqiong Liu,Dong Li,Junhai Yong,Bin Wang,Emad Barsoum
机构: Advanced Micro Devices, Inc.(超威半导体公司); Tsinghua University (清华大学); BNRist
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR 2026
Abstract:Diffusion models demonstrate outstanding performance in image generation, but their multi-step inference mechanism requires immense computational cost. Previous works accelerate inference by leveraging layer or token cache techniques to reduce computational cost. However, these methods fail to achieve superior acceleration performance in few-step diffusion transformer models due to inefficient feature caching strategies, manually designed sparsity allocation, and the practice of retaining complete forward computations in several steps in these token cache methods. To tackle these challenges, we propose a differentiable layer-wise sparsity optimization framework for diffusion transformer models, leveraging token caching to reduce token computation costs and enhance acceleration. Our method optimizes layer-wise sparsity allocation in an end-to-end manner through a learnable network combined with a dynamic programming solver. Additionally, our proposed two-stage training strategy eliminates the need for full-step processing in existing methods, further improving efficiency. We conducted extensive experiments on a range of diffusion-transformer models, including DiT-XL/2, PixArt- \alpha , FLUX, and Wan2.1. Across these architectures, our method consistently improves efficiency without degrading sample quality. For example, on PixArt- \alpha with 20 sampling steps, we reduce computational cost by 54% while achieving generation metrics that surpass those of the original model, substantially outperforming prior approaches. These results demonstrate that our method delivers large efficiency gains while often improving generation quality.
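The layer-wise allocation idea (not the paper's learned network, just the dynamic-programming side) can be illustrated with a small DP: pick one sparsity level per layer so total token compute stays under a budget while a per-layer quality-loss proxy is minimized. The cost and loss tables below are made up for illustration:

```python
def allocate_sparsity(costs, losses, budget):
    """costs[l][k] / losses[l][k]: compute cost and quality-loss proxy of
    choosing sparsity level k for layer l (integer costs assumed).
    Returns (min total loss, chosen level per layer)."""
    dp = {0: (0.0, [])}  # budget used -> (best loss, level choices)
    for layer_costs, layer_losses in zip(costs, losses):
        nxt = {}
        for used, (loss, choice) in dp.items():
            for k, (c, q) in enumerate(zip(layer_costs, layer_losses)):
                u = used + c
                if u > budget:
                    continue
                cand = (loss + q, choice + [k])
                if u not in nxt or cand[0] < nxt[u][0]:
                    nxt[u] = cand
        dp = nxt
    return min(dp.values(), key=lambda t: t[0])

# Two layers, three sparsity levels each (denser = costlier but lower loss).
costs = [[4, 2, 1], [4, 2, 1]]
losses = [[0.0, 0.1, 0.5], [0.0, 0.3, 0.9]]
best_loss, levels = allocate_sparsity(costs, losses, budget=5)  # -> 0.4, [1, 1]
```

DiffSparse makes the loss proxy differentiable and learned end-to-end; this sketch only shows the exact solve under fixed tables.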
[CV-136] Leveraging Gaze and Set-of-Mark in VLLMs for Human-Object Interaction Anticipation from Egocentric Videos ICPR
【速读】:该论文旨在解决第一人称视觉(Egocentric Vision)场景中人类-物体交互行为的预测问题,这是构建智能辅助系统以指导用户日常活动并理解其短期与长期目标的关键技术瓶颈。解决方案的核心在于:1)通过Set-of-Mark提示(Set-of-Mark prompting)增强视觉定位能力,提升模型对关键交互对象的识别精度;2)利用用户最近注视点轨迹来推断用户意图,从而捕捉交互前的行为模式;3)引入一种新颖的逆指数采样策略(inverse exponential sampling strategy) 用于视频帧选择,有效建模交互发生前的时序动态特征。实验在HD-EPIC数据集上验证了该方法在性能上的优越性,并表明其具有模型无关性(model-agnostic nature)。
链接: https://arxiv.org/abs/2604.03667
作者: Daniele Materia,Francesco Ragusa,Giovanni Maria Farinella
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to International Conference on Pattern Recognition (ICPR) 2026
Abstract:The ability to anticipate human-object interactions is highly desirable in an intelligent assistive system in order to guide users during daily life activities and understand their short and long-term goals. Creating systems with such capabilities requires to approach several complex challenges. This work addresses the problem of human-object interaction anticipation in Egocentric Vision using Vision Large Language Models (VLLMs). We tackle key limitations in existing approaches by improving visual grounding capabilities through Set-of-Mark prompting and understanding user intent via the trajectory formed by the user’s most recent gaze fixations. To effectively capture the temporal dynamics immediately preceding the interaction, we further introduce a novel inverse exponential sampling strategy for input video frames. Experiments conducted on the egocentric dataset HD-EPIC demonstrate that our method surpasses state-of-the-art approaches for the considered task, showing its model-agnostic nature.
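The "inverse exponential sampling" idea, denser frame sampling close to the anticipated interaction and sparser further back in time, can be sketched as follows. The exact spacing function and rate are our assumptions; the paper only names the strategy:

```python
import numpy as np

def inverse_exponential_indices(num_frames, num_samples, rate=3.0):
    """Pick `num_samples` frame indices from [0, num_frames), spaced so that
    sampling density grows exponentially toward the most recent frame."""
    u = np.linspace(0.0, 1.0, num_samples)
    # offsets back from the last frame shrink exponentially as u -> 1
    offsets = (np.exp(rate * (1.0 - u)) - 1.0) / (np.exp(rate) - 1.0)
    idx = np.round((num_frames - 1) * (1.0 - offsets)).astype(int)
    return np.unique(idx)

idx = inverse_exponential_indices(num_frames=120, num_samples=8)
# gaps between consecutive indices shrink toward the final (most recent) frame
```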
[CV-137] Motion-Adaptive Multi-Scale Temporal Modelling with Skeleton-Constrained Spatial Graphs for Efficient 3D Human Pose Estimation IJCNN2026
【速读】:该论文旨在解决单目视频中3D人体姿态估计任务中复杂时空依赖建模的效率与适应性问题,尤其针对现有方法在密集注意力机制或固定建模方案下难以有效捕捉异质运动动态和关节特异性空间交互的局限。其解决方案的关键在于提出MASC-Pose框架,包含两个核心模块:一是自适应多尺度时序建模(Adaptive Multi-scale Temporal Modelling, AMTM)模块,用于在不同时间尺度上自适应地捕获多样化的运动动态;二是骨骼约束的自适应图卷积网络(Skeleton-constrained Adaptive GCN, SAGCN),实现基于关节特异性的空间交互建模。通过联合优化自适应时序推理与高效空间聚合,该方法在保持高计算效率的同时显著提升了精度。
链接: https://arxiv.org/abs/2604.03652
作者: Ruochen Li,Shuang Chen,Wenke E,Farshad Arvin,Amir Atapour-Abarghouei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IJCNN 2026, full paper
Abstract:Accurate 3D human pose estimation from monocular videos requires effective modelling of complex spatial and temporal dependencies. However, existing methods often face challenges in efficiency and adaptability when modelling spatial and temporal dependencies, particularly under dense attention or fixed modelling schemes. In this work, we propose MASC-Pose, a Motion-Adaptive multi-scale temporal modelling framework with Skeleton-Constrained spatial graphs for efficient 3D human pose estimation. Specifically, it introduces an Adaptive Multi-scale Temporal Modelling (AMTM) module to adaptively capture heterogeneous motion dynamics at different temporal scales, together with a Skeleton-constrained Adaptive GCN (SAGCN) for joint-specific spatial interaction modelling. By jointly enabling adaptive temporal reasoning and efficient spatial aggregation, our method achieves strong accuracy with high computational efficiency. Extensive experiments on Human3.6M and MPI-INF-3DHP datasets demonstrate the effectiveness of our approach.
[CV-138] ART: Adaptive Relational Transformer for Pedestrian Trajectory Prediction with Temporal-Aware Relations
【速读】:该论文旨在解决现实场景中行人轨迹预测的准确性与计算效率之间的矛盾问题,即现有基于图结构或Transformer架构的方法要么引入不必要的计算开销,要么难以有效建模人类交互的多样性与时变特性。其解决方案的关键在于提出自适应关系Transformer(Adaptive Relational Transformer, ART),通过引入时序感知关系图(Temporal-Aware Relation Graph, TARG)显式捕捉成对交互的动态演化过程,并结合自适应交互剪枝机制(Adaptive Interaction Pruning, AIP)高效减少冗余计算,从而在ETH/UCY和NBA等多个基准数据集上实现最优预测精度与高计算效率的平衡。
链接: https://arxiv.org/abs/2604.03649
作者: Ruochen Li,Ziyi Chang,Junyan Hu,Jiannan Li,Amir Atapour-Abarghouei,Hubert P. H. Shum
机构: Durham University (杜伦大学); Singapore Management University (新加坡管理大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate prediction of real-world pedestrian trajectories is crucial for a wide range of robot-related applications. Recent approaches typically adopt graph-based or transformer-based frameworks to model interactions. Despite their effectiveness, these methods either introduce unnecessary computational overhead or struggle to represent the diverse and time-varying characteristics of human interactions. In this work, we present an Adaptive Relational Transformer (ART), which introduces a Temporal-Aware Relation Graph (TARG) to explicitly capture the evolution of pairwise interactions and an Adaptive Interaction Pruning (AIP) mechanism to reduce redundant computations efficiently. Extensive evaluations on ETH/UCY and NBA benchmarks show that ART delivers state-of-the-art accuracy with high computational efficiency.
[CV-139] Stabilizing Unsupervised Self-Evolution of MLLM s via Continuous Softened Retracing reSampling
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在无监督自进化过程中,由于依赖多数投票机制选择伪黄金答案而导致的反馈信号质量下降问题,这种机制可能受模型内在偏见影响,而非客观正确性。解决方案的关键在于提出连续软化重推理采样(Continuous Softened Retracing Re-sampling, CSRS)框架:首先引入重推理再推理机制(Retracing Re-inference Mechanism, RRM),通过从锚点重新推理以拓展对长尾推理路径的探索;其次设计软化频率奖励(Softened Frequency Reward, SFR),用连续奖励信号替代二值奖励,基于采样推理集中的答案频率进行校准;最后结合视觉语义扰动(Visual Semantic Perturbation, VSP),确保模型优先关注数学逻辑而非视觉表面特征。该方案显著提升了Qwen2.5-VL-7B在MathVision等基准上的推理性能,并在几何任务中达到无监督自进化领域的最先进水平。
链接: https://arxiv.org/abs/2604.03647
作者: Yunyao Yu,Zhengxian Wu,Zhuohong Chen,Hangrui Xu,Zirui Liao,Xiangwen Deng,Zhifang Liu,Senyuan Shi,Haoqian Wang
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 6 figures
Abstract:In the unsupervised self-evolution of Multimodal Large Language Models, the quality of feedback signals during post-training is pivotal for stable and effective learning. However, existing self-evolution methods predominantly rely on majority voting to select the most frequent output as the pseudo-golden answer, which may stem from the model’s intrinsic biases rather than guaranteeing the objective correctness of the reasoning paths. To counteract the degradation, we propose \textbfContinuous \textbfSoftened \textbfRetracing re\textbfSampling (\textbfCSRS) in MLLM self-evolution. Specifically, we introduce a Retracing Re-inference Mechanism (\textbfRRM) that the model re-inferences from anchor points to expand the exploration of long-tail reasoning paths. Simultaneously, we propose Softened Frequency Reward (\textbfSFR), which replaces binary rewards with continuous signals, calibrating reward based on the answers’ frequency across sampled reasoning sets. Furthermore, incorporated with Visual Semantic Perturbation (\textbfVSP), CSRS ensures the model prioritizes mathematical logic over visual superficiality. Experimental results demonstrate that CSRS significantly enhances the reasoning performance of Qwen2.5-VL-7B on benchmarks such as MathVision. We achieve state-of-the-art (SOTA) results in unsupervised self-evolution on geometric tasks. Our code is available at this https URL.
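A minimal reading of the Softened Frequency Reward is to replace the 0/1 majority-vote reward with each rollout's answer frequency across the sampled reasoning set; the paper's actual calibration may differ, but the contrast can be sketched as:

```python
from collections import Counter

def binary_majority_reward(answers):
    """Baseline: reward 1 only for rollouts matching the most frequent answer."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

def softened_frequency_reward(answers):
    """SFR-style: each rollout is rewarded by how often its answer appears."""
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]

# Six sampled rollouts producing three distinct final answers.
answers = ["42", "42", "42", "17", "17", "9"]
hard = binary_majority_reward(answers)      # minority rollouts get zero signal
soft = softened_frequency_reward(answers)   # minority rollouts keep a graded signal
```

The soft variant keeps a nonzero gradient signal for long-tail reasoning paths instead of collapsing everything onto the (possibly biased) majority answer.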
[CV-140] ComPrivDet: Efficient Privacy Object Detection in Compressed Domains Through Inference Reuse
【速读】:该论文旨在解决大规模视频分析场景下隐私保护与计算效率之间的矛盾问题,特别是在压缩域中高效检测隐私对象(如人脸和车牌)以降低延迟并避免全帧解码带来的资源开销。其解决方案的关键在于提出ComPrivDet方法,通过复用I帧的推理结果,并利用压缩域特征线索判断新对象的存在性,从而智能跳过或轻量级优化P帧和B帧的检测过程,实现了在保持高精度(如人脸检测准确率达99.75%)的同时显著减少推理次数(超过80%的推理被跳过),相较现有压缩域检测方法平均提升9.84%准确率且延迟降低75.95%。
链接: https://arxiv.org/abs/2604.03640
作者: Yunhao Yao,Zhiqiang Wang,Ruiqi Li,Haoran Cheng,Puhan Luo,Xiangyang Li
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: 6 pages, 6 figures
Abstract:As the Internet of Things (IoT) becomes deeply embedded in daily life, users are increasingly concerned about privacy leakage, especially from video data. Since frame-by-frame protection in large-scale video analytics (e.g., smart communities) introduces significant latency, a more efficient solution is to selectively protect frames containing privacy objects (e.g., faces). Existing object detectors require fully decoded videos or per-frame processing in compressed videos, leading to decoding overhead or reduced accuracy. Therefore, we propose ComPrivDet, an efficient method for detecting privacy objects in compressed video by reusing I-frame inference results. By identifying the presence of new objects through compressed-domain cues, ComPrivDet either skips P- and B-frame detections or efficiently refines them with a lightweight detector. ComPrivDet maintains 99.75% accuracy in private face detection and 96.83% in private license plate detection while skipping over 80% of inferences. It averages 9.84% higher accuracy with 75.95% lower latency than existing compressed-domain detection methods.
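The reuse-or-refine control flow can be sketched as below; the per-frame "energy" cue and its threshold are placeholders standing in for whatever compressed-domain signals ComPrivDet actually extracts:

```python
def process_gop(frames, heavy_detector, light_detector, new_object_cue):
    """Run the heavy detector on the I-frame, then reuse its detections for
    P/B-frames unless a compressed-domain cue suggests new objects appeared."""
    results, skipped = [], 0
    cached = None
    for frame in frames:
        if frame["type"] == "I" or cached is None:
            cached = heavy_detector(frame)          # full inference on I-frames
        elif new_object_cue(frame):
            cached = light_detector(frame, cached)  # lightweight refinement
        else:
            skipped += 1                            # skip inference entirely
        results.append(cached)
    return results, skipped

# Toy GOP: one I-frame then three P-frames, one of which trips the cue.
frames = [{"type": "I", "energy": 0.0}] + [
    {"type": "P", "energy": e} for e in (0.1, 0.9, 0.1)]
out, skipped = process_gop(
    frames,
    heavy_detector=lambda f: ["face@I"],
    light_detector=lambda f, prev: prev + ["face@P"],
    new_object_cue=lambda f: f["energy"] > 0.5,
)
```

With most P/B-frames below the cue threshold, the heavy detector runs only once per GOP, which is how the reported >80% inference skipping arises.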
[CV-141] SAGE-GAN: Towards Realistic and Robust Segmentation of Spatially Ordered Nanoparticles via Attention-Guided GANs
【速读】:该论文旨在解决电子显微镜图像中纳米颗粒特征精准分析的难题,特别是传统手动方法耗时高、自动化分割技术在复杂形貌和成像伪影下表现不佳的问题。其关键解决方案在于提出一种两步式框架:首先利用自注意力驱动的U-Net(Attention U-Net)从真实图像数据集中学习纳米颗粒的关键物理与形态学特征,忽略背景噪声;其次将训练好的Attention U-Net嵌入循环一致性生成对抗网络(CycleGAN)架构中,生成逼真的合成电子显微图像及其对应的标注掩膜(mask),从而实现无需人工干预的自主数据增强,并通过循环一致性确保合成图像与真值掩膜之间的直接对应关系,提升分割模型的泛化能力和准确性。
链接: https://arxiv.org/abs/2604.03637
作者: Anindya Pal,Varun Ajith,Saumik Bhattacharya,Sayantari Ghosh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 7 figures, journal submission
Abstract:Precise analysis of nanoparticles for characterization in electron microscopy images is essential for advancing nanomaterial development. Yet it remains challenging due to the time-consuming nature of manual methods and the shortcomings of traditional automated segmentation techniques, especially when dealing with complex shapes and imaging artifacts. While conventional methods yield promising results, they depend on a large volume of labeled training data, which is both difficult to acquire and highly time-consuming to generate. In order to overcome these challenges, we have developed a two-step solution: Firstly, our system learns to segment the key features of nanoparticles from a dataset of real images using a self-attention driven U-Net architecture that focuses on important physical and morphological details while ignoring background features and noise. Secondly, this trained Attention U-Net is embedded in a cycle-consistent generative adversarial network (CycleGAN) framework, inspired by the cGAN-Seg model introduced by Abzargar et al. This integration allows for the creation of highly realistic synthetic electron microscopy image-mask pairs that naturally reflect the structural patterns learned by the Attention U-Net. Consequently, the model can accurately detect features in a diverse array of real-world nanoparticle images and autonomously augment the training dataset without requiring human input. Cycle consistency enforces a direct correspondence between synthetic images and ground-truth masks, ensuring realistic features, which is crucial for accurate segmentation training.
[CV-142] A Generative Foundation Model for Multimodal Histopathology
【速读】:该论文旨在解决复杂疾病诊断与治疗中多模态数据(如组织病理学、分子RNA谱和临床文本)不完整的问题,传统计算方法依赖于特定任务的模型,仅针对单一源-目标模态对进行训练,导致泛化能力受限。其解决方案的关键在于提出了一种名为MuPD(Multimodal Pathology Diffusion)的生成式基础模型,该模型通过带有解耦交叉模态注意力机制的扩散Transformer,将HE染色组织图像、RNA分子特征和临床文本统一嵌入到共享潜在空间中;该模型在1亿张组织切片图像块、160万组文本-组织图像对及1080万组RNA-组织图像对上预训练,覆盖34个人体器官,从而支持多种跨模态合成任务,且无需或仅需极少的任务微调,显著优于现有专用模型,在图像生成质量、分类准确率提升和虚拟染色性能等方面均实现突破。
链接: https://arxiv.org/abs/2604.03635
作者: Jinxi Xiang,Mingjie Li,Siyu Hou,Yijiang Chen,Xiangde Luo,Yuanfeng Ji,Xiang Zhou,Ehsan Adeli,Akshay Chaudhari,Curtis P. Langlotz,Kilian M. Pohl,Ruijiang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 33 pages, 9 figures
Abstract:Accurate diagnosis and treatment of complex diseases require integrating histological, molecular, and clinical data, yet in practice these modalities are often incomplete owing to tissue scarcity, assay cost, and workflow constraints. Existing computational approaches attempt to impute missing modalities from available data but rely on task-specific models trained on narrow, single source-target pairs, limiting their generalizability. Here we introduce MuPD (Multimodal Pathology Diffusion), a generative foundation model that embeds hematoxylin and eosin (HE)-stained histology, molecular RNA profiles, and clinical text into a shared latent space through a diffusion transformer with decoupled cross-modal attention. Pretrained on 100 million histology image patches, 1.6 million text-histology pairs, and 10.8 million RNA-histology pairs spanning 34 human organs, MuPD supports diverse cross-modal synthesis tasks with minimal or no task-specific fine-tuning. For text-conditioned and image-to-image generation, MuPD synthesizes histologically faithful tissue architectures, reducing Fréchet inception distance (FID) scores by 50% relative to domain-specific models and improving few-shot classification accuracy by up to 47% through synthetic data augmentation. For RNA-conditioned histology generation, MuPD reduces FID by 23% compared with the next-best method while preserving cell-type distributions across five cancer types. As a virtual stainer, MuPD translates HE images to immunohistochemistry and multiplex immunofluorescence, improving average marker correlation by 37% over existing approaches. These results demonstrate that a single, unified generative model pretrained across heterogeneous pathology modalities can substantially outperform specialized alternatives, providing a scalable computational framework for multimodal histopathology.
[CV-143] L-SPINE: A Low-Precision SIMD Spiking Neural Compute Engine for Resource-efficient Edge Inference
【Quick Read】: This paper addresses the memory overhead, inefficient scaling operations, and limited parallelism that hinder hardware deployment of Spiking Neural Networks (SNNs) for edge intelligence. The key to its solution is L-SPINE, a low-precision SIMD-enabled spiking neural compute engine whose unified multi-precision datapath supports 2-bit, 4-bit, and 8-bit operations and implements neuron dynamics and synaptic accumulation with a multiplier-less shift-add model, substantially reducing resource usage and power. Measured on an FPGA, L-SPINE achieves very low LUT and flip-flop (FF) utilization, millisecond-scale system latency, and sub-watt system power (milliwatt-scale at the neuron level), reducing inference latency from seconds to milliseconds versus CPU/GPU platforms and delivering up to three orders of magnitude better energy efficiency while remaining scalable.
Link: https://arxiv.org/abs/2604.03626
Authors: Sonu Kumar, Mukul Lokhande, Santosh Kumar Vishvakarma
Affiliations: Unknown
Categories: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Image and Video Processing (eess.IV)
Comments:
Abstract:Spiking Neural Networks (SNNs) offer a promising solution for energy-efficient edge intelligence; however, their hardware deployment is constrained by memory overhead, inefficient scaling operations, and limited parallelism. This work proposes L-SPINE, a low-precision SIMD-enabled spiking neural compute engine for efficient edge inference. The architecture features a unified multi-precision datapath supporting 2-bit, 4-bit, and 8-bit operations, leveraging a multiplier-less shift-add model for neuron dynamics and synaptic accumulation. Implemented on an AMD VC707 FPGA, the proposed neuron requires only 459 LUTs and 408 FFs, achieving a critical delay of 0.39 ns and 4.2 mW power. At the system level, L-SPINE achieves 46.37K LUTs, 30.4K FFs, 2.38 ms latency, and 0.54 W power. Compared to CPU and GPU platforms, it reduces inference latency from seconds to milliseconds, achieving an up to three orders-of-magnitude improvement in energy efficiency. Quantisation analysis shows that INT2/INT4 configurations significantly reduce memory footprint with minimal accuracy loss. These results establish L-SPINE as a scalable and efficient solution for real-time edge SNN deployment.
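The multiplier-less shift-add neuron update described above can be illustrated with a minimal sketch. The leak constant, power-of-two weight encoding, and threshold below are assumptions for illustration, not values from the paper:

```python
def shift_add_lif_step(v, spikes_in, weight_shifts, leak_shift=3, threshold=64):
    """One multiplier-less LIF update: leak via right shift, synapses via shifts/adds."""
    # Leaky integration: v <- v - (v >> leak_shift), approximating exponential decay
    v = v - (v >> leak_shift)
    # Synaptic accumulation: with power-of-two weights, w * s becomes a left shift
    for s, sh in zip(spikes_in, weight_shifts):
        if s:  # presynaptic neuron spiked
            v += 1 << sh
    fired = v >= threshold
    if fired:
        v = 0  # hard reset on spike
    return v, fired
```

Because every operation is a shift, add, or compare, the datapath needs no multiplier, which is what keeps the per-neuron LUT/FF cost low.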
[CV-144] Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling? CVPR2026
【Quick Read】: This paper tackles the difficulty of modeling long-range spatiotemporal dynamics in functional MRI (fMRI), where the high dimensionality of the four-dimensional signal makes voxel-based models memory-bound and limits the temporal windows they can capture. The key idea, TABLeT (Two-dimensionally Autoencoded Brain Latent Transformer), is to tokenize fMRI volumes with a pretrained 2D natural-image autoencoder, compressing each volume into a compact set of continuous latent tokens, and then to model long sequences with a lightweight Transformer encoder, enabling efficient and scalable spatiotemporal modeling under limited VRAM. This markedly improves compute and memory efficiency, and its advantages are validated on large-scale benchmarks including UK-Biobank, the Human Connectome Project, and ADHD-200.
Link: https://arxiv.org/abs/2604.03619
Authors: Peter Yongho Kim, Juhyeon Park, Jungwoo Park, Jubin Choi, Jungwoo Seo, Jiook Cha, Taesup Moon
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026
Abstract:Modeling long-range spatiotemporal dynamics in functional Magnetic Resonance Imaging (fMRI) remains a key challenge due to the high dimensionality of the four-dimensional signals. Prior voxel-based models, although demonstrating excellent performance and interpretation capabilities, are constrained by prohibitive memory demands and thus can only capture limited temporal windows. To address this, we propose TABLeT (Two-dimensionally Autoencoded Brain Latent Transformer), a novel approach that tokenizes fMRI volumes using a pre-trained 2D natural image autoencoder. Each 3D fMRI volume is compressed into a compact set of continuous tokens, enabling long-sequence modeling with a simple Transformer encoder with limited VRAM. Across large-scale benchmarks including the UK-Biobank (UKB), Human Connectome Project (HCP), and ADHD-200 datasets, TABLeT outperforms existing models in multiple tasks, while demonstrating substantial gains in computational and memory efficiency over the state-of-the-art voxel-based method given the same input. Furthermore, we develop a self-supervised masked token modeling approach to pre-train TABLeT, which improves the model’s performance for various downstream tasks. Our findings suggest a promising approach for scalable and interpretable spatiotemporal modeling of brain activity. Our code is available at this https URL.
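The tokenization step (compressing each 3D volume into a compact set of tokens with a 2D encoder) can be sketched in miniature. The real model uses a pretrained natural-image autoencoder; the patch mean-pooling "encoder" below is a hypothetical stand-in purely for illustration:

```python
def encode_slice(slice2d, patch=4):
    """Stub 2D 'encoder': mean-pool non-overlapping patches into latent tokens."""
    h, w = len(slice2d), len(slice2d[0])
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            vals = [slice2d[a][b] for a in range(i, i + patch) for b in range(j, j + patch)]
            tokens.append(sum(vals) / len(vals))
    return tokens

def tokenize_volume(volume):
    """Compress a 3D volume (a list of 2D axial slices) into one flat token list."""
    tokens = []
    for sl in volume:
        tokens.extend(encode_slice(sl))
    return tokens
```

A T-timepoint scan then becomes a sequence of T such token sets, short enough for a plain Transformer encoder to attend over long temporal windows.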
[CV-145] PortraitCraft: A Benchmark for Portrait Composition Understanding and Generation
【Quick Read】: This paper addresses the limitations of existing datasets and benchmarks for portrait composition analysis and controllable generation: current work focuses on coarse aesthetic scoring, generic image aesthetics, or unconstrained portrait generation, with little systematic support for structured composition understanding or generation under explicit composition constraints. The key contribution is PortraitCraft, a unified benchmark for portrait composition understanding and generation, built on roughly 50,000 curated real portrait images with multi-level structured supervision: global composition scores, annotations over 13 composition attributes, attribute-level explanation texts, visual question answering pairs, and composition-oriented captions for generation. On this dataset, the paper establishes two complementary tasks: composition understanding (evaluated via score prediction, fine-grained attribute reasoning, and image-grounded VQA) and composition-aware generation (portraits generated from structured descriptions under explicit constraints), together with standardized evaluation protocols and baseline results from representative multimodal models, providing a comprehensive benchmark for fine-grained portrait understanding, interpretable aesthetic assessment, and controllable portrait generation.
Link: https://arxiv.org/abs/2604.03611
Authors: Yuyang Sha, Zijie Lou, Youyun Tang, Xiaochao Qu, Haoxiang Li, Ting Liu, Luoqi Liu
Affiliations: Meitu Inc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Portrait composition plays a central role in portrait aesthetics and visual communication, yet existing datasets and benchmarks mainly focus on coarse aesthetic scoring, generic image aesthetics, or unconstrained portrait generation. This limits systematic research on structured portrait composition analysis and controllable portrait generation under explicit composition requirements. In this paper, we introduce PortraitCraft, a unified benchmark for portrait composition understanding and generation. PortraitCraft is built on a dataset of approximately 50,000 curated real portrait images with structured multi-level supervision, including global composition scores, annotations over 13 composition attributes, attribute-level explanation texts, visual question answering pairs, and composition-oriented textual descriptions for generation. Based on this dataset, we establish two complementary benchmark tasks for composition understanding and composition-aware generation within a unified framework. The first evaluates portrait composition understanding through score prediction, fine-grained attribute reasoning, and image-grounded visual question answering, while the second evaluates portrait generation from structured composition descriptions under explicit composition constraints. We further define standardized evaluation protocols and provide reference baseline results with representative multimodal models. PortraitCraft provides a comprehensive benchmark for future research on fine-grained portrait understanding, interpretable aesthetic assessment, and controllable portrait generation.
[CV-146] Stochastic Generative Plug-and-Play Priors
【Quick Read】: This paper studies how to use score-based diffusion models (SBDMs) as priors within the plug-and-play (PnP) image reconstruction framework without relying on reverse diffusion sampling. Conventional PnP methods use generic denoisers; SBDMs have strong generative power, but their combination with PnP has lacked theoretical justification and is not straightforward to apply. The key contribution is a score-based interpretation of PnP that justifies using pretrained SBDMs directly as PnP priors, plus a stochastic generative PnP (SGPnP) framework that injects noise to better exploit expressive generative priors, improving robustness on severely ill-posed inverse problems; the theory shows this noise injection amounts to optimizing a Gaussian-smoothed objective and helps escape strict saddle points, improving convergence stability.
Link: https://arxiv.org/abs/2604.03603
Authors: Chicago Y. Park, Edward P. Chandler, Yuyang Hu, Michael T. McCann, Cristina Garcia-Cardona, Brendt Wohlberg, Ulugbek S. Kamilov
Affiliations: University of Wisconsin–Madison; Washington University in St. Louis; Los Alamos National Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:
Abstract:Plug-and-play (PnP) methods are widely used for solving imaging inverse problems by incorporating a denoiser into optimization algorithms. Score-based diffusion models (SBDMs) have recently demonstrated strong generative performance through a denoiser trained across a wide range of noise levels. Despite their shared reliance on denoisers, it remains unclear how to systematically use SBDMs as priors within the PnP framework without relying on reverse diffusion sampling. In this paper, we establish a score-based interpretation of PnP that justifies using pretrained SBDMs directly within PnP algorithms. Building on this connection, we introduce a stochastic generative PnP (SGPnP) framework that injects noise to better leverage the expressive generative SBDM priors, thereby improving robustness in severely ill-posed inverse problems. We provide a new theory showing that this noise injection induces optimization on a Gaussian-smoothed objective and promotes escape from strict saddle points. Experiments on challenging inverse tasks, such as multi-coil MRI reconstruction and large-mask natural image inpainting, demonstrate consistent improvement over conventional PnP methods and achieve performance competitive with diffusion-based solvers.
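A toy instance of the noise-injected PnP iteration: a gradient step on a simple quadratic data term, a plug-in denoiser treated as a black box (in the paper this would be a pretrained score-based denoiser), then Gaussian noise injection. All step sizes and noise levels below are illustrative assumptions:

```python
import random

def sgpnp(y, denoise, steps=60, step_size=0.5, sigma=0.0, seed=0):
    """Stochastic generative PnP sketch on the data term f(x) = 0.5 * ||x - y||^2."""
    rng = random.Random(seed)
    x = [0.0] * len(y)
    for _ in range(steps):
        # Data-consistency gradient step: grad f(x) = x - y
        x = [xi - step_size * (xi - yi) for xi, yi in zip(x, y)]
        # Prior step: apply the (pretrained) denoiser
        x = denoise(x)
        # Noise injection: equivalent to optimizing a Gaussian-smoothed objective
        x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x
```

With `sigma=0` and an identity denoiser this reduces to plain gradient descent on the data term; nonzero `sigma` adds the stochastic smoothing that the paper argues helps escape strict saddle points.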
[CV-147] SBF: An Effective Representation to Augment Skeleton for Video-based Human Action Recognition CVPR
【Quick Read】: This paper addresses the limited performance of 2D-skeleton-based video human action recognition (HAR) in common scenes, which stems from the skeleton failing to capture action-relevant depth information, human body contours, and human-object interactions. The key is a new intermediate representation, Scale-Body-Flow (SBF), with three complementary components: a scale map encoding each joint's scale (and hence depth information), a body map outlining the human subject, and a flow map capturing pixel-wise human-object motion interaction. To generate SBF efficiently, the authors further design SFSNet, a segmentation network supervised only by existing skeletons and optical flow, with no extra annotation overhead. Experiments show that the method achieves notably higher HAR accuracy while keeping compactness and efficiency comparable to state-of-the-art skeleton-only methods.
Link: https://arxiv.org/abs/2604.03590
Authors: Zhuoxuan Peng, Yiyi Ding, Yang Lin, S.-H. Gary Chan
Affiliations: The Hong Kong University of Science and Technology; The Hong Kong University of Science and Technology (Guangzhou)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ABAW2026 (CVPR Workshop)
Abstract:Many modern video-based human action recognition (HAR) approaches use 2D skeleton as the intermediate representation in their prediction pipelines. Despite overall encouraging results, these approaches still struggle in many common scenes, mainly because the skeleton does not capture critical action-related information pertaining to the depth of the joints, contour of the human body, and interaction between the human and objects. To address this, we propose an effective approach to augment skeleton with a representation capturing action-related information in the pipeline of HAR. The representation, termed Scale-Body-Flow (SBF), consists of three distinct components, namely a scale map volume given by the scale (and hence depth information) of each joint, a body map outlining the human subject, and a flow map indicating human-object interaction given by pixel-wise optical flow values. To predict SBF, we further present SFSNet, a novel segmentation network supervised by the skeleton and optical flow without extra annotation overhead beyond the existing skeleton extraction. Extensive experiments across different datasets demonstrate that our pipeline based on SBF and SFSNet achieves significantly higher HAR accuracy with similar compactness and efficiency as compared with the state-of-the-art skeleton-only approaches.
[CV-148] HAD: Combining Hierarchical Diffusion with Metric-Decoupled RL for End-to-End Driving
【Quick Read】: This paper targets two core challenges in end-to-end driving planning: selecting an optimal trajectory directly from a large candidate space is hard to optimize, and the Gaussian perturbations used in diffusion often introduce unrealistic trajectories that complicate denoising; meanwhile, reinforcement learning (RL) training lacks structured reward signals, making policy optimization inefficient. The key innovations of the proposed HAD framework are: (1) a Hierarchical Diffusion Policy that decomposes planning into a coarse-to-fine generation process for more stable optimization; (2) Structure-Preserved Trajectory Expansion, which maintains kinematic constraints when generating candidates to keep them realistic; and (3) Metric-Decoupled Policy Optimization (MDPO), which decouples multi-objective rewards to enable structured RL optimization and markedly improves policy learning. Experiments show new state-of-the-art performance on both the NAVSIM and HUGSIM benchmarks.
Link: https://arxiv.org/abs/2604.03581
Authors: Wenhao Yao, Xinglong Sun, Zhenxin Li, Shiyi Lan, Zi Wang, Jose M. Alvarez, Zuxuan Wu
Affiliations: Fudan University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 7 figures
Abstract:End-to-end planning has emerged as a dominant paradigm for autonomous driving, where recent models often adopt a scoring-selection framework to choose trajectories from a large set of candidates, with diffusion-based decoding showing strong promise. However, directly selecting from the entire candidate space remains difficult to optimize, and Gaussian perturbations used in diffusion often introduce unrealistic trajectories that complicate the denoising process. In addition, for training these models, reinforcement learning (RL) has shown promise, but existing end-to-end RL approaches typically rely on a single coupled reward without structured signals, limiting optimization effectiveness. To address these challenges, we propose HAD, an end-to-end planning framework with a Hierarchical Diffusion Policy that decomposes planning into a coarse-to-fine process. To improve trajectory generation, we introduce Structure-Preserved Trajectory Expansion, which produces realistic candidates while maintaining kinematic structure. For policy learning, we develop Metric-Decoupled Policy Optimization (MDPO) to enable structured RL optimization across multiple driving objectives. Extensive experiments show that HAD achieves new state-of-the-art performance on both NAVSIM and HUGSIM, outperforming prior arts by a huge margin: +2.3 EPDMS on NAVSIM and +4.9 Route Completion on HUGSIM.
[CV-149] Physics-Informed Untrained Learning for RGB-Guided Superresolution Single-Pixel Hyperspectral Imaging
【Quick Read】: This paper addresses the severely ill-posed inverse problem of recovering high-fidelity spatial and spectral detail in single-pixel imaging (SPI) at extremely low sampling rates. Existing deep-learning methods typically depend on large-scale pretraining datasets, which are often impractical for hyperspectral imaging. The key is an end-to-end physics-informed framework that uses untrained neural networks and RGB guidance to jointly optimize hyperspectral reconstruction and super-resolution without any external training data: a regularized least-squares initialization with RGB-derived grayscale priors (LS-RGP), followed by an Untrained Hyperspectral Recovery Network (UHRNet) refined with measurement consistency and hybrid regularization, and finally a Transformer-based Untrained Super-Resolution Network (USRNet) that upsamples by transferring high-frequency spatial detail from the RGB image via cross-modal attention. The method significantly improves reconstruction accuracy and spectral fidelity, and its practicality is validated on a physical single-pixel imaging system.
Link: https://arxiv.org/abs/2604.03572
Authors: Hao Zhang, Bilige Xu, Lichen Wei, Xu Ma, Wenyi Ren
Affiliations: College of Science, Northwest Agriculture and Forestry University, Yangling 712100, China; Key Laboratory of Photoelectronic Imaging Technology and System, School of Optics and Photonics, Beijing Institute of Technology, Beijing, China; State Key Laboratory of Digital Manufacturing Equipment and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Comments: 9 pages, 13 figures, 5 tables
Abstract:Single-pixel imaging (SPI) offers a cost-effective route to hyperspectral acquisition but struggles to recover high-fidelity spatial and spectral details under extremely low sampling rates, a severely ill-posed inverse problem. While deep learning has shown potential, existing data-driven methods demand large-scale pretraining datasets that are often impractical in hyperspectral imaging. To overcome this limitation, we propose an end-to-end physics-informed framework that leverages untrained neural networks and RGB guidance for joint hyperspectral reconstruction and super-resolution without any external training data. The framework comprises three physically grounded stages: (1) a Regularized Least-Squares method with RGB-derived Grayscale Priors (LS-RGP) that initializes the solution by exploiting cross-modal structural correlations; (2) an Untrained Hyperspectral Recovery Network (UHRNet) that refines the reconstruction through measurement consistency and hybrid regularization; and (3) a Transformer-based Untrained Super-Resolution Network (USRNet) that upsamples the spatial resolution via cross-modal attention, transferring high-frequency details from the RGB guide. Extensive experiments on benchmark datasets demonstrate that our approach significantly surpasses state-of-the-art algorithms in both reconstruction accuracy and spectral fidelity. Moreover, a proof-of-concept experiment using a physical single-pixel imaging system validates the framework’s practical applicability, successfully reconstructing a 144-band hyperspectral data cube at a mere 6.25% sampling rate. The proposed method thus provides a robust, data-efficient solution for computational hyperspectral imaging.
[CV-150] LOGER: Local–Global Ensemble for Robust Deepfake Detection in the Wild
【Quick Read】: This paper tackles the robustness of deepfake detection in the wild, where difficulty arises from ever-evolving manipulation techniques and uncontrolled real-world degradations. The key is a local-global ensemble framework (LOGER) whose two branches model forensic cues at different granularities: the global branch uses multi-resolution heterogeneous vision foundation models to extract holistic semantic and statistical anomalies, while the local branch focuses on suspicious regions via Multiple Instance Learning with top-k aggregation, preventing normal regions from diluting forgery traces, with dual-level supervision keeping local responses discriminative. Because the two branches differ in both granularity and backbone, their prediction errors are largely decorrelated, so logit-space fusion yields a more robust final decision.
Link: https://arxiv.org/abs/2604.03558
Authors: Fei Wu, Dagong Lu, Mufeng Yao, Xinlei Xu, Fengjun Guo
Affiliations: Shanghai Jiao Tong University; INTSIG Information
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2nd place (out of 94 teams) in the NTIRE 2026 Robust Deepfake Detection Challenge
Abstract:Robust deepfake detection in the wild remains challenging due to the ever-growing variety of manipulation techniques and uncontrolled real-world degradations. Forensic cues for deepfake detection reside at two complementary levels: global-level anomalies in semantics and statistics that require holistic image understanding, and local-level forgery traces concentrated in manipulated regions that are easily diluted by global averaging. Since no single backbone or input scale can effectively cover both levels, we propose LOGER, a LOcal–Global Ensemble framework for Robust deepfake detection. The global branch employs heterogeneous vision foundation model backbones at multiple resolutions to capture holistic anomalies with diverse visual priors. The local branch performs patch-level modeling with a Multiple Instance Learning top-k aggregation strategy that selectively pools only the most suspicious regions, mitigating evidence dilution caused by the dominance of normal patches; dual-level supervision at both the aggregated image level and individual patch level keeps local responses discriminative. Because the two branches differ in both granularity and backbone, their errors are largely decorrelated, a property that logit-space fusion exploits for more robust prediction. LOGER achieves 2nd place in the NTIRE 2026 Robust Deepfake Detection Challenge, and further evaluation on multiple public benchmarks confirms its strong robustness and generalization across diverse manipulation methods and real-world degradation conditions.
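The top-k Multiple Instance Learning pooling in the local branch can be sketched as follows; `k` and the example logits are illustrative, not values from the paper:

```python
def mil_topk_score(patch_logits, k=3):
    """Image-level forgery score from patch logits: average only the k most
    suspicious patches so sparse local traces are not diluted by normal regions."""
    top = sorted(patch_logits, reverse=True)[:k]
    return sum(top) / len(top)
```

For patch logits `[5.0, 0.0, 0.0, 0.0, 4.0]`, plain global averaging gives 1.8, burying the two suspicious patches, while top-2 pooling gives 4.5, which is why the selective pooling preserves discriminative local evidence.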
[CV-151] HEDGE: Heterogeneous Ensemble for Detection of AI-GEnerated Images in the Wild
【Quick Read】: This paper addresses the robustness of detecting AI-generated (AIGC) images in real-world settings, where existing methods degrade under rapidly evolving generative models and varied real-world distortions. The key is to introduce structured heterogeneity along three axes, building complementary detection routes: diverse training data with strong augmentation (Route A), multi-scale feature extraction to capture fine-grained forensic cues (Route B), and backbone diversity for better generalization (Route C). Branch outputs are fused via logit-space weighted averaging, refined by a lightweight dual-gating mechanism that handles branch-level outliers and majority-dominated fusion errors, yielding highly robust detection.
Link: https://arxiv.org/abs/2604.03555
Authors: Fei Wu, Dagong Lu, Mufeng Yao, Xinlei Xu, Fengjun Guo
Affiliations: Shanghai Jiao Tong University; INTSIG Information
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4th place (out of 193 teams) in the NTIRE 2026 Robust AI-Generated Image Detection in the Wild Challenge
Abstract:Robust detection of AI-generated images in the wild remains challenging due to the rapid evolution of generative models and varied real-world distortions. We argue that relying on a single training regime, resolution, or backbone is insufficient to handle all conditions, and that structured heterogeneity across these dimensions is essential for robust detection. To this end, we propose HEDGE, a Heterogeneous Ensemble for Detection of AI-GEnerated images, that introduces complementary detection routes along three axes: diverse training data with strong augmentation, multi-scale feature extraction, and backbone heterogeneity. Specifically, Route A progressively constructs DINOv3-based detectors through staged data expansion and augmentation escalation, Route B incorporates a higher-resolution branch for fine-grained forensic cues, and Route C adds a MetaCLIP2-based branch for backbone diversity. All outputs are fused via logit-space weighted averaging, refined by a lightweight dual-gating mechanism that handles branch-level outliers and majority-dominated fusion errors. HEDGE achieves 4th place in the NTIRE 2026 Robust AI-Generated Image Detection in the Wild Challenge and attains state-of-the-art performance with strong robustness on multiple AIGC image detection benchmarks.
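Logit-space weighted averaging with a simple outlier gate can be sketched as below. The gating rule here (halving the weight of a branch whose logit is far from the median) is an illustrative simplification, not the paper's exact dual-gating mechanism:

```python
def fuse_logits(branch_logits, weights, gate_margin=4.0):
    """Logit-space weighted averaging with a light outlier gate: a branch whose
    logit deviates strongly from the median is down-weighted before fusion."""
    n = len(branch_logits)
    med = sorted(branch_logits)[n // 2]
    gated = [wi if abs(li - med) <= gate_margin else 0.5 * wi
             for li, wi in zip(branch_logits, weights)]
    total = sum(gated)
    return sum(li * wi for li, wi in zip(branch_logits, gated)) / total
```

Fusing in logit space (before the sigmoid) keeps each branch's confidence scale intact, so a single saturated branch cannot silently dominate the averaged probability.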
[CV-152] CRAFT: Video Diffusion for Bimanual Robot Data Generation
【Quick Read】: This paper addresses the problem that bimanual robot learning from demonstrations is limited by the high cost and narrow visual diversity of real-world data, which weakens policy robustness under changes in viewpoint, object configuration, and embodiment. The key is Canny-guided Robot Data Generation using Video Diffusion Transformers (CRAFT), a scalable video-diffusion framework for bimanual demonstration generation: by conditioning video diffusion on edge-based structural cues, it synthesizes temporally coherent manipulation videos with automatic action labels, supporting a unified augmentation pipeline spanning object pose changes, camera viewpoints, lighting and backgrounds, cross-embodiment transfer, and multi-view synthesis. This markedly increases the visual diversity and physical plausibility of training data, and CRAFT outperforms conventional augmentation strategies and plain data scaling on both simulated and real-world tasks.
Link: https://arxiv.org/abs/2604.03552
Authors: Jason Chen, I-Chun Arthur Liu, Gaurav Sukhatme, Daniel Seita
Affiliations: University of Southern California
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Bimanual robot learning from demonstrations is fundamentally limited by the cost and narrow visual diversity of real-world data, which constrains policy robustness across viewpoints, object configurations, and embodiments. We present Canny-guided Robot Data Generation using Video Diffusion Transformers (CRAFT), a video diffusion-based framework for scalable bimanual demonstration generation that synthesizes temporally coherent manipulation videos while producing action labels. By conditioning video diffusion on edge-based structural cues extracted from simulator-generated trajectories, CRAFT produces physically plausible trajectory variations and supports a unified augmentation pipeline spanning object pose changes, camera viewpoints, lighting and background variations, cross-embodiment transfer, and multi-view synthesis. We leverage a pre-trained video diffusion model to convert simulated videos, along with action labels from the simulation trajectories, into action-consistent demonstrations. Starting from only a few real-world demonstrations, CRAFT generates a large, visually diverse set of photorealistic training data, bypassing the need to replay demonstrations on the real robot (Sim2Real). Across simulated and real-world bimanual tasks, CRAFT improves success rates over existing augmentation strategies and straightforward data scaling, demonstrating that diffusion-based video generation can substantially expand demonstration diversity and improve generalization for dual-arm manipulation tasks. Our project website is available at: this https URL
[CV-153] Determined by User Needs: A Salient Object Detection Rationale Beyond Conventional Visual Stimuli
【Quick Read】: This paper addresses a limitation of existing salient object detection (SOD) methods, which rely solely on passive visual stimuli and ignore users' proactive needs. Conventional SOD assumes the object with the strongest visual stimulus is the user's focus, overlooking that a user may hold a specific intent before seeing the image (e.g., "white apple"); this oversight means detections fail to meet real application needs and limits the accuracy of downstream tasks such as salient object ranking. The paper proposes a new task, User Salient Object Detection (UserSOD), whose core is to detect salient objects that align with users' pre-existing proactive needs, better matching the cognitive mechanisms of real scenarios. The main challenge for this new task is the lack of datasets for model training and testing, calling for purpose-built, intent-driven annotated datasets.
Link: https://arxiv.org/abs/2604.03526
Authors: Chenglizhao Chen, Shujian Zhang, Luming Li, Wenfeng Song, Shuai Li
Affiliations: China University of Petroleum (East China); Beijing Information Science and Technology University; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Existing salient object detection (SOD) methods adopt a passive visual stimulus-based rationale: objects with the strongest visual stimuli are perceived as the user's primary focus (i.e., salient objects). They ignore the decisive role of users' proactive needs in segmenting salient objects. If a user has a need before seeing an image, the user's salient objects align with their needs; e.g., if a user's need is "white apple", when this user sees an image, the user's primary focus is on the "white apple" or "the most white apple-like" objects in the image. Such an oversight not only fails to satisfy users, but also limits the development of downstream tasks. For instance, in salient object ranking tasks, focusing solely on visual stimuli-based salient objects is insufficient for analyzing the fine-grained relationships between users' viewing order (usually determined by the user's needs) and scenes, which may result in wrong ranking results. Clearly, it is essential to detect salient objects based on user needs. Thus, we advocate a User Salient Object Detection (UserSOD) task, which focuses on detecting salient objects aligned with users' proactive needs when users have needs. The main challenge for this new task is the lack of datasets for model training and testing.
[CV-154] Optimizing Neurorobot Policy under Limited Demonstration Data through Preference Regret
【Quick Read】: This paper addresses two key problems in robot reinforcement learning from demonstrations (RLfD): expert data is scarce and costly to collect in real-world settings, and conventional imitation learning assumes independently and identically distributed (i.i.d.) data, so errors gradually compound along test-time trajectories and degrade performance. The key is the "Master Your Own Expertise" (MYOE) framework, whose core innovation is a Queryable Mixture-of-Preferences State Space Model (QMoP-SSM) that estimates the desired goal at every time step and uses it to compute a "preference regret" for optimizing the robot control policy, enabling complex behaviors to be learned from limited demonstration data with markedly better generalization and robustness.
Link: https://arxiv.org/abs/2604.03523
Authors: Viet Dung Nguyen, Yuhang Song, Anh Nguyen, Jamison Heard, Reynold Bailey, Alexander Ororbia
Affiliations: Rochester Institute of Technology; Advanced Micro Devices, Inc.; University of Liverpool
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures, 4 tables
Abstract:Robot reinforcement learning from demonstrations (RLfD) assumes that expert data is abundant; this is usually unrealistic in the real world given data scarcity as well as high collection cost. Furthermore, imitation learning algorithms assume that the data is independently and identically distributed, which ultimately results in poorer performance as gradual errors emerge and compound within test-time trajectories. We address these issues by introducing the “master your own expertise” (MYOE) framework, a self-imitation framework that enables robotic agents to learn complex behaviors from limited demonstration data samples. Inspired by human perception and action, we propose and design what we call the queryable mixture-of-preferences state space model (QMoP-SSM), which estimates the desired goal at every time step. These desired goals are used in computing the “preference regret”, which is used to optimize the robot control policy. Our experiments demonstrate the robustness, adaptability, and out-of-sample performance of our agent compared to other state-of-the-art RLfD schemes. The GitHub repository that supports this work can be found at: this https URL.
[CV-155] Multimodal Urban Tree Detection from Satellite and Street-Level Imagery via Annotation-Efficient Deep Learning Strategies
【Quick Read】: This paper addresses two core challenges in automating and scaling precise urban tree mapping: traditional manual surveys are costly and inefficient for large-area environmental monitoring and post-disaster assessment, and existing automatic detection methods generalize poorly across urban scenes while depending on large amounts of annotation. The key is a multimodal framework fusing high-resolution satellite imagery with ground-level Google Street View: satellite imagery first localizes tree candidates, and the corresponding street views are then retrieved for refined detection, greatly reducing unproductive street-level sampling. Domain adaptation transfers knowledge from existing annotated data, and the paper compares semi-supervised learning, active learning, and a hybrid of the two, finding that the hybrid strategy achieves the best performance under limited annotation (F1-score of 0.90), effectively easing annotation cost while improving accuracy.
Link: https://arxiv.org/abs/2604.03505
Authors: In Seon Kim, Ali Moghimi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Beyond the immediate biophysical benefits, urban trees play a foundational role in environmental sustainability and disaster mitigation. Precise mapping of urban trees is essential for environmental monitoring, post-disaster assessment, and strengthening policy. However, the transition from traditional, labor-intensive field surveys to scalable automated systems remains limited by high annotation costs and poor generalization across diverse urban scenarios. This study introduces a multimodal framework that integrates high-resolution satellite imagery with ground-level Google Street View to enable scalable and detailed urban tree detection under limited-annotation conditions. The framework first leverages satellite imagery to localize tree candidates and then retrieves targeted ground-level views for detailed detection, significantly reducing inefficient street-level sampling. To address the annotation bottleneck, domain adaptation is used to transfer knowledge from an existing annotated dataset to a new region of interest. To further minimize human effort, we evaluated three learning strategies: semi-supervised learning, active learning, and a hybrid approach combining both, using a transformer-based detection model. The hybrid strategy achieved the best performance with an F1-score of 0.90, representing a 12% improvement over the baseline model. In contrast, semi-supervised learning exhibited progressive performance degradation due to confirmation bias in pseudo-labeling, while active learning steadily improved results through targeted human intervention to label uncertain or incorrect predictions. Error analysis further showed that active and hybrid strategies reduced both false positives and false negatives. Our findings highlight the importance of a multimodal approach and guided annotation for scalable, annotation-efficient urban tree mapping to strengthen sustainable city planning.
[CV-156] Sim2Real-AD: A Modular Sim-to-Real Framework for Deploying VLM-Guided Reinforcement Learning in Real-World Autonomous Driving
【Quick Read】: This paper addresses zero-shot transfer of VLM-guided reinforcement learning (RL) policies trained in simulation (CARLA) to real autonomous vehicles, where the core challenge is the mismatch between simulation and reality in observation space (e.g., monocular images) and action semantics (e.g., simulator-coupled action commands). The key is the modular Sim2Real-AD framework with four components: a Geometric Observation Bridge (GOB) that converts monocular front-view images into simulator-compatible bird's-eye-view (BEV) observations; Physics-Aware Action Mapping (PAM) that translates policy outputs into platform-agnostic physical control commands; a Two-Phase Progressive Training (TPT) strategy that stabilizes adaptation by separating action-space from observation-space transfer; and a Real-time Deployment Pipeline (RDP) for closed-loop perception, policy inference, control conversion, and safety monitoring. Without any real-world RL training data, the framework achieves zero-shot closed-loop deployment on a full-scale real vehicle, with success rates of 90%, 80%, and 75% in car-following, obstacle-avoidance, and stop-sign-interaction scenarios, respectively.
Link: https://arxiv.org/abs/2604.03497
Authors: Zilin Huang, Zhengyang Wan, Zihao Sheng, Boyue Wang, Junwei You, Yue Leng, Sikai Chen
Affiliations: University of Wisconsin-Madison; Google
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 36 pages, 21 figures
Abstract:Deploying reinforcement learning policies trained in simulation to real autonomous vehicles remains a fundamental challenge, particularly for VLM-guided RL frameworks whose policies are typically learned with simulator-native observations and simulator-coupled action semantics that are unavailable on physical platforms. This paper presents Sim2Real-AD, a modular framework for zero-shot sim-to-real transfer of CARLA-trained VLM-guided RL policies to full-scale vehicles without any real-world RL training data. The framework decomposes the transfer problem into four components: a Geometric Observation Bridge (GOB) that converts monocular front-view images into simulator-compatible bird’s-eye-view (BEV) observations, a Physics-Aware Action Mapping (PAM) that translates policy outputs into platform-agnostic physical commands, a Two-Phase Progressive Training (TPT) strategy that stabilizes adaptation by separating action-space and observation-space transfer, and a Real-time Deployment Pipeline (RDP) that integrates perception, policy inference, control conversion, and safety monitoring for closed-loop execution. Simulation experiments show that the framework preserves the relative performance ordering of representative RL algorithms across different reward paradigms and validate the contribution of each module. Zero-shot deployment on a full-scale Ford E-Transit achieves success rates of 90%, 80%, and 75% in car-following, obstacle avoidance, and stop-sign interaction scenarios, respectively. To the best of our knowledge, this study is among the first to demonstrate zero-shot closed-loop deployment of a CARLA-trained VLM-guided RL policy on a full-scale real vehicle without any real-world RL training data. The demo video and code are available at: this https URL.
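A physics-aware action mapping in the spirit of PAM might look like the sketch below: the policy's normalized outputs in [-1, 1] are clipped and scaled to platform-agnostic physical commands. The limits (2 m/s² acceleration, 4 m/s² braking, 30° steering) are hypothetical values for illustration, not the paper's:

```python
def physics_aware_map(action, max_accel=2.0, max_decel=4.0, max_steer_deg=30.0):
    """Map a normalized (accel, steer) pair in [-1, 1] to physical commands:
    longitudinal acceleration in m/s^2 and steering angle in degrees."""
    a, s = action
    a = max(-1.0, min(1.0, a))  # clip to the policy's action range
    s = max(-1.0, min(1.0, s))
    # Asymmetric longitudinal scaling: braking authority exceeds acceleration
    accel = a * max_accel if a >= 0 else a * max_decel
    steer = s * max_steer_deg
    return accel, steer
```

Expressing commands in physical units (rather than simulator throttle/brake fractions) is what makes the interface portable across vehicle platforms.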
[CV-157] RAIN-FIT: Learning of Fitting Surfaces and Noise Distribution from Large Data Sets
【Quick Read】: This paper addresses estimating a surface containing a given set of points from noisy measurements; the core challenge is accurately estimating both the surface shape and the parameters of the noise distribution without hyperparameter tuning or data preprocessing. The key is to assume the surface is described by the zero set of a function in the span of a given set of feature functions, combined with a parametric description of the noise distribution, yielding an optimization framework with computational complexity linear in the number of samples that scales to dimensions beyond 3D. The method jointly estimates the surface and noise parameters by minimizing a constrained objective, comes with convergence guarantees, and experimentally outperforms state-of-the-art algorithms such as Poisson Reconstruction and Encoder-X on 2D and 3D shape reconstruction.
Link: https://arxiv.org/abs/2604.03491
Authors: Omar M. Sleem, Sahand Kiani, Constantino M. Lagoa
Affiliations: Kyocera International, Inc.; Pennsylvania State University
Categories: Systems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments:
Abstract:This paper proposes a method for estimating a surface that contains a given set of points from noisy measurements. More precisely, by assuming that the surface is described by the zero set of a function in the span of a given set of features and a parametric description of the distribution of the noise, a computationally efficient method is described that estimates both the surface and the noise distribution parameters. In the provided examples, polynomial and sinusoidal basis functions were used. However, any chosen basis that satisfies the outlined conditions mentioned in the paper can be approximated as a combination of trigonometric, exponential, and/or polynomial terms, making the presented approach highly generalizable. The proposed algorithm exhibits linear computational complexity in the number of samples. Our approach requires no hyperparameter tuning or data preprocessing and effectively handles data in dimensions beyond 2D and 3D. The theoretical results demonstrating the convergence of the proposed algorithm have been provided. To highlight the performance of the proposed method, comprehensive numerical results are conducted, evaluating our method against state-of-the-art algorithms, including Poisson Reconstruction and the Neural Network-based Encoder-X, on 2D and 3D shapes. The results demonstrate the superiority of our method under the same conditions.
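As a minimal instance of fitting a zero-set surface in the span of given features, the sketch below fits a circle x² + y² + a·x + b·y + c = 0 to points by linear least squares. The paper additionally estimates the noise-distribution parameters, which this toy example omits:

```python
def fit_circle_zero_set(points):
    """Fit the zero set x^2 + y^2 + a*x + b*y + c = 0 by linear least squares."""
    # Normal equations A^T A p = A^T r with features [x, y, 1], target -(x^2 + y^2)
    ata = [[0.0] * 3 for _ in range(3)]
    atb = [0.0] * 3
    for x, y in points:
        feats = [x, y, 1.0]
        rhs = -(x * x + y * y)
        for i in range(3):
            atb[i] += feats[i] * rhs
            for j in range(3):
                ata[i][j] += feats[i] * feats[j]
    # Solve the 3x3 system by Gauss-Jordan elimination with partial pivoting
    m = [row[:] + [atb[i]] for i, row in enumerate(ata)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [mr - f * mc for mr, mc in zip(m[r], m[col])]
    a, b, c = (m[i][3] / m[i][i] for i in range(3))
    cx, cy = -a / 2.0, -b / 2.0
    radius = (cx * cx + cy * cy - c) ** 0.5
    return cx, cy, radius
```

The cost is one pass over the samples to accumulate the normal equations, which is the linear-in-samples complexity the abstract highlights; richer feature bases (higher-degree polynomials, sinusoids) enlarge only the small linear system, not the per-sample work.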
[CV-158] Fine-tuning DeepSeek-OCR-2 for Molecular Structure Recognition
【速读】:该论文旨在解决光学化学结构识别(Optical Chemical Structure Recognition, OCSR)问题,即如何将印刷文献中的二维分子图转化为机器可读格式。其核心挑战在于直接应用视觉语言模型(Vision-Language Models)进行端到端识别时存在训练不稳定性,且全参数监督微调常失效。解决方案的关键在于:首先将OCSR任务建模为图像条件下的SMILES生成任务,并提出一种两阶段渐进式监督微调策略——初期采用参数高效微调方法LoRA(Low-Rank Adaptation)稳定训练,随后过渡至分层学习率的局部全参数微调(split learning rates),同时在大规模合成数据(PubChem渲染图)与真实专利图像(USPTO-MOL)组合语料上训练,最终构建出性能优异的MolSeek-OCR模型,在序列级精确匹配准确率上达到当前最优图像到序列模型水平。
链接: https://arxiv.org/abs/2604.03476
作者: Haocheng Tang,Xingyu Dang,Junmei Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注:
Abstract:Optical Chemical Structure Recognition (OCSR) is critical for converting 2D molecular diagrams from printed literature into machine-readable formats. While Vision-Language Models have shown promise in end-to-end OCR tasks, their direct application to OCSR remains challenging, and direct full-parameter supervised fine-tuning often fails. In this work, we adapt DeepSeek-OCR-2 for molecular optical recognition by formulating the task as image-conditioned SMILES generation. To overcome training instabilities, we propose a two-stage progressive supervised fine-tuning strategy: starting with parameter-efficient LoRA and transitioning to selective full-parameter fine-tuning with split learning rates. We train our model on a large-scale corpus combining synthetic renderings from PubChem and realistic patent images from USPTO-MOL to improve coverage and robustness. Our fine-tuned model, MolSeek-OCR, demonstrates competitive capabilities, achieving exact matching accuracies comparable to the best-performing image-to-sequence model. However, it remains inferior to state-of-the-art image-to-graph models. Furthermore, we explore reinforcement-style post-training and data-curation-based refinement, finding that they fail to improve the strict sequence-level fidelity required for exact SMILES matching.
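摘要中"先 LoRA、再分层学习率(split learning rates)全参微调"的两阶段策略,可以用一个极简的学习率分配函数示意。分组名与数值均为假设,并非论文的真实超参数配置:

```python
# 示意性草图:两阶段渐进式微调的学习率分配(分组名与数值均为假设,非论文配置)
def stage_lr(stage, group):
    """stage 1 只训练 LoRA 适配器,其余参数冻结;
    stage 2 全参微调,但按分组使用不同学习率(split learning rates)。"""
    if stage == 1:
        return 1e-4 if group == "lora" else 0.0  # 冻结非 LoRA 参数
    # stage 2: 主干使用更小的学习率以保持训练稳定
    return {"lora": 1e-4, "head": 5e-5, "backbone": 1e-5}[group]
```

实际训练时,这样的函数可用于构造优化器的 per-parameter-group 学习率。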
[CV-159] SpectralSplat: Appearance-Disentangled Feed-Forward Gaussian Splatting for Driving Scenes
【速读】:该论文旨在解决前馈式3D高斯点绘(Feed-forward 3D Gaussian Splatting)方法在自动驾驶场景重建中存在几何与瞬时外观属性(如光照、天气和时间)耦合的问题,这一耦合限制了场景的再照明(relighting)、外观迁移(appearance transfer)以及多遍历数据在不同环境条件下的一致渲染能力。解决方案的关键在于提出SpectralSplat方法,通过将颜色预测分解为两个由共享多层感知机(MLP)驱动的流:一个与外观无关的基础流(base stream)和一个受外观条件控制的适配流(adapted stream),其中外观嵌入由DINOv2特征提取得到;同时引入混合再照明管道生成成对观测数据,并辅以一致性、重建、跨外观及基础颜色损失联合训练,从而实现外观与几何的有效解耦;此外,还设计了可适应外观的时间历史机制,存储外观无关特征,使累积的高斯点能在任意目标外观下重新渲染,从而在保持原始重建质量的同时支持可控外观迁移和时序一致的再照明。
链接: https://arxiv.org/abs/2604.03462
作者: Quentin Herau,Tianshuo Xu,Depu Meng,Jiezhi Yang,Chensheng Peng,Spencer Sherk,Yihan Hu,Wei Zhan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
备注: Under review
Abstract:Feed-forward 3D Gaussian Splatting methods have achieved impressive reconstruction quality for autonomous driving scenes, yet they entangle scene geometry with transient appearance properties such as lighting, weather, and time of day. This coupling prevents relighting, appearance transfer, and consistent rendering across multi-traversal data captured under varying environmental conditions. We present SpectralSplat, a method that disentangles appearance from geometry within a feed-forward Gaussian Splatting framework. Our key insight is to factor color prediction into an appearance-agnostic base stream and an appearance-conditioned adapted stream, both produced by a shared MLP conditioned on a global appearance embedding derived from DINOv2 features. To enforce disentanglement, we train with paired observations generated by a hybrid relighting pipeline that combines physics-based intrinsic decomposition with diffusion-based generative refinement, and supervise with complementary consistency, reconstruction, cross-appearance, and base color losses. We further introduce an appearance-adaptable temporal history that stores appearance-agnostic features, enabling accumulated Gaussians to be re-rendered under arbitrary target appearances. Experiments demonstrate that SpectralSplat preserves the reconstruction quality of the underlying backbone while enabling controllable appearance transfer and temporally consistent relighting across driving sequences.
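摘要中"外观无关基础流 + 外观条件适配流"的颜色分解,可以用一个极简线性版本示意(并非 SpectralSplat 的真实网络,维度与权重均为虚构):

```python
import numpy as np

# 极简线性示意(非 SpectralSplat 原网络):颜色 = 外观无关基础流 + 外观条件适配流
rng = np.random.default_rng(1)
D, A = 8, 4                              # 特征维度与外观嵌入维度(假设值)
W_base = 0.1 * rng.standard_normal((3, D))
W_feat = 0.1 * rng.standard_normal((3, D))
W_app = 0.1 * rng.standard_normal((3, A))

def render_color(feat, app_emb):
    base = W_base @ feat                                  # 与外观无关的基础颜色
    adapted = (W_feat @ feat) * np.tanh(W_app @ app_emb)  # 被外观嵌入逐通道调制
    return base + adapted

feat = rng.standard_normal(D)
c_neutral = render_color(feat, np.zeros(A))  # 外观嵌入为零时只剩基础流
c_styled = render_color(feat, rng.standard_normal(A))
```

当外观嵌入为零向量时调制项消失,输出退化为纯基础流,这正是解耦结构希望保证的性质。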
[CV-160] RDFace: A Benchmark Dataset for Rare Disease Facial Image Analysis under Extreme Data Scarcity and Phenotype-Aware Synthetic Generation CVPR2026
【速读】:该论文旨在解决罕见病面部表型识别中因数据稀缺和不同疾病间表型高度相似而导致的AI诊断模型训练困难问题。其解决方案的关键在于构建了一个经过伦理审核、标注标准化的儿科罕见病面部图像基准数据集RDFace,包含103种罕见遗传病的456张图像(平均每病种4.4例),并结合基于DreamBooth和FastGAN的合成数据增强技术,通过面部关键点相似性过滤确保生成图像的表型保真度,从而在极低数据条件下将诊断准确率提升最高达13.7%。该方法不仅提升了模型性能,还通过视觉-语言模型验证了合成图像的语义有效性(报告相似度达0.84),为罕见病AI研究提供了可复现、公平且具备医学影像完整性保障的评估框架。
链接: https://arxiv.org/abs/2604.03454
作者: Ganlin Feng,Yuxi Long,Hafsa Ali,Erin Lou,Fahad Butt,Qian Liu,Yang Wang,Pingzhao Hu
机构: Western University (西安大略大学); Concordia University (康考迪亚大学); University of Toronto (多伦多大学); University of Winnipeg (温尼伯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026. 8 pages main paper + appendix
Abstract:Rare diseases often manifest with distinctive facial phenotypes in children, offering valuable diagnostic cues for clinicians and AI-assisted screening systems. However, progress in this field is severely limited by the scarcity of curated, ethically sourced facial data and the high similarity among phenotypes across different conditions. To address these challenges, we introduce RDFace, a curated benchmark dataset comprising 456 pediatric facial images spanning 103 rare genetic conditions (average 4.4 samples per condition). Each ethically verified image is paired with standardized metadata. RDFace enables the development and evaluation of data-efficient AI models for rare disease diagnosis under real-world low-data constraints. We benchmark multiple pretrained vision backbones using cross-validation and explore synthetic augmentation with DreamBooth and FastGAN. Generated images are filtered via facial landmark similarity to maintain phenotype fidelity and merged with real data, improving diagnostic accuracy by up to 13.7% in ultra-low-data regimes. To assess semantic validity, phenotype descriptions generated by a vision-language model from real and synthetic images achieve a report similarity score of 0.84. RDFace establishes a transparent, benchmark-ready dataset for equitable rare disease AI research and presents a scalable framework for evaluating both diagnostic performance and the integrity of synthetic medical imagery.
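摘要中"通过面部关键点相似性过滤合成图像"的步骤,可以用如下草图示意。相似度定义(去中心归一化后的余弦相似度)与阈值均为假设,仅说明过滤逻辑,并非论文原始流程:

```python
import numpy as np

# 示意:用关键点相似度过滤合成人脸图像(相似度定义与阈值均为假设,非论文原流程)
def landmark_similarity(a, b):
    """对两组 (N,2) 关键点去中心并整体归一化后取余弦相似度,范围 [-1, 1]。"""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    a = a / (np.linalg.norm(a) + 1e-8)
    b = b / (np.linalg.norm(b) + 1e-8)
    return float((a * b).sum())

def keep_synthetic(real_lm, synth_lm, thresh=0.9):
    """相似度达到阈值才保留合成图,以维持表型保真度。"""
    return landmark_similarity(real_lm, synth_lm) >= thresh

lm = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0], [0.5, 0.4]])
```

相同的关键点布局相似度为 1,会被保留;布局差异过大的合成图则被过滤掉。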
[CV-161] Inference-Path Optimization via Circuit Duplication in Frozen Visual Transformers for Marine Species Classification
【速读】:该论文旨在解决水下物种分类中因标注成本高和环境变化导致的全监督模型迁移能力受限的问题。其核心解决方案是采用无需微调或修改模型权重的推理时优化方法——Circuit Duplication,该方法通过在前向传播过程中对选定范围的Transformer层进行两次遍历(即“电路复制”),从而提升基于冻结DINOv3嵌入的分类性能。实验表明,该策略在AQUA20数据集上显著优于标准冻结前向传播,尤其在类别特定电路选择设置下,宏观F1达到0.875,接近全监督ConvNeXt基准(0.889),且4种物种实现超越,验证了该方法在标签效率与分类精度之间的有效平衡。
链接: https://arxiv.org/abs/2604.03428
作者: Thomas Manuel Rost
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: pre study, more ablations to come
Abstract:Automated underwater species classification is constrained by annotation cost and environmental variation that limits the transferability of fully supervised models. Recent work has shown that frozen embeddings from self-supervised vision foundation models already provide a strong label-efficient baseline for marine image classification. Here we investigate whether this frozen-embedding regime can be improved at inference time, without fine-tuning or changing model weights. We apply Circuit Duplication, an inference-time method originally proposed for Large Language Models, in which a selected range of transformer layers is traversed twice during the forward pass. We evaluate on the class-imbalanced AQUA20 benchmark using frozen DINOv3 embeddings under two settings: global circuit selection, where a single duplicated circuit is chosen for the full dataset, and class-specific circuit selection, where each species may receive a different optimal circuit. Both settings use simple semi-supervised downstream classifiers. Circuit Duplication consistently improves over the standard frozen forward pass. At the maximum label budget, class-specific selection reaches a macro F1 of 0.875, closing the gap to the fully supervised ConvNeXt benchmark (0.889) to 1.4 points without any gradient-based training. Four species exceed their fully supervised reference, with octopus improving by +12.1 F1 points. Across all budgets, roughly 75% of classes prefer a class-specific circuit, indicating a genuinely class-dependent benefit. To our knowledge, this is the first application of Circuit Duplication to computer vision.
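"选定层区间在前向传播中遍历两次"的 Circuit Duplication 思想,可以用玩具层示意遍历顺序(此处用简单函数代替 Transformer 层,并非 DINOv3 的真实实现):

```python
# 玩具示意:Circuit Duplication,前向传播时对选定层区间 [a, b] 遍历两次
# 此处用简单函数代替 Transformer 层,仅演示遍历顺序
def forward(x, layers, dup_range=None):
    n = len(layers)
    order = list(range(n))
    if dup_range is not None:
        a, b = dup_range
        order = list(range(0, b + 1)) + list(range(a, n))  # 第 a..b 层走两遍
    for i in order:
        x = layers[i](x)
    return x

layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
```

例如 `dup_range=(1, 1)` 时遍历顺序为第 0、1、1、2 层;`dup_range=(0, 2)` 则等价于整网走两遍。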
[CV-162] Automated Segmentation and Tracking of Group Housed Pigs Using Foundation Models
【速读】:该论文旨在解决精准畜牧业中群体饲养仔猪自动化监测的难题,传统方法依赖大量标注数据的监督学习模型,存在标签成本高、重复训练和农场特异性调优等问题。解决方案的关键在于构建以基础模型(Foundation Models, FM)为核心的轻量化工作流:利用预训练视觉-语言模型作为通用视觉主干网络,通过模块化后处理逻辑实现农场特定适应,结合时空推理(如检测、短时视频分割与长时跟踪)提升系统鲁棒性与身份一致性。实验表明,该方案在夜间和遮挡场景下仍保持高精度,并能在连续132分钟视频中稳定追踪个体,验证了FM先验知识与任务专用逻辑融合的有效性,为可扩展、低标签依赖、长时间监控提供了新路径。
链接: https://arxiv.org/abs/2604.03426
作者: Ye Bi,Bimala Acharya,David Rosero,Juan Steibel
机构: Iowa State University (爱荷华州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Foundation models (FM) are reshaping computer vision by reducing reliance on task-specific supervised learning and leveraging general visual representations learned at scale. In precision livestock farming, most pipelines remain dominated by supervised learning models that require extensive labeled data, repeated retraining, and farm-specific tuning. This study presents an FM-centered workflow for automated monitoring of group-housed nursery pigs, in which pretrained vision-language FM serve as general visual backbones and farm-specific adaptation is achieved through modular post-processing. Grounding-DINO was first applied to 1,418 annotated images to establish a baseline detection performance. While detection accuracy was high under daytime conditions, performance degraded under night-vision and heavy occlusion, motivating the integration of temporal tracking logic. Building on these detections, short-term video segmentation with Grounded-SAM2 was evaluated on 550 one-minute video clips; after post-processing, over 80% of 4,927 active tracks were fully correct, with most remaining errors arising from inaccurate masks or duplicated labels. To support identity consistency over an extended time, we further developed a long-term tracking pipeline integrating initialization, tracking, matching, mask refinement, re-identification, and post-hoc quality control. This system was evaluated on a continuous 132-minute video and maintained stable identities throughout. On 132 uniformly sampled ground-truth frames, the system achieved a mean region similarity (J) of 0.83, contour accuracy (F) of 0.92, JF of 0.87, MOTA of 0.99, and MOTP of 90.7%, with no identity switches. Overall, this work demonstrates how FM prior knowledge can be combined with lightweight, task-specific logic to enable scalable, label-efficient, and long-duration monitoring in pig production.
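摘要评估中使用的"区域相似度 J"就是预测掩码与真值掩码的 IoU,可以用展平的二值掩码直接示意其计算方式:

```python
# 示意:分割评估中的区域相似度 J(即掩码 IoU),输入为展平的二值掩码
def region_similarity(pred, gt):
    inter = sum(1 for p, g in zip(pred, gt) if p and g)  # 交集像素数
    union = sum(1 for p, g in zip(pred, gt) if p or g)   # 并集像素数
    return inter / union if union else 1.0               # 双空掩码约定为 1

pred = [1, 1, 0, 0, 1]
gt   = [1, 0, 1, 0, 1]
```

上例交集为 2 个像素、并集为 4 个像素,J = 0.5;掩码完全一致时 J = 1。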
[CV-163] Zero-Shot Quantization via Weight-Space Arithmetic
【速读】:该论文旨在解决模型在后训练量化(Post-Training Quantization, PTQ)过程中鲁棒性下降的问题,尤其针对极低比特部署场景下如何无需接收端数据和量化感知训练(Quantization-Aware Training, QAT)即可提升模型对PTQ噪声的抗干扰能力。解决方案的关键在于发现并利用权重空间中可迁移的“量化向量”(quantization vector)——该向量通过简单权重空间运算从源任务模型中提取,并可直接用于修补目标模型,从而显著增强其对PTQ噪声的鲁棒性(最高提升达60%),实现零样本、低成本的性能优化。这一方法表明量化鲁棒性并非仅由特定任务训练所得,而是权重空间几何结构中可复用的通用特性。
链接: https://arxiv.org/abs/2604.03420
作者: Daniele Solombrino,Antonio Andrea Gargiulo,Adrian Robert Minut,Luca Zhou,Alessandro Zirilli,Emanuele Rodolà
机构: Sapienza University of Rome (罗马大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We show that robustness to post-training quantization (PTQ) is a transferable direction in weight space. We call this direction the quantization vector: extracted from a donor task by simple weight-space arithmetic, it can be used to patch a receiver model and improve robustness to PTQ-induced noise by as much as 60%, without receiver-side quantization-aware training (QAT). Because the method requires no receiver training data, it provides a zero-shot, low-cost alternative to QAT for extremely low-bit deployment. We demonstrate this on Vision Transformer (ViT) models. More broadly, our results suggest that quantization robustness is not merely a byproduct of task-specific training, but a reusable feature of weight-space geometry that can be transferred rather than retrained.
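"通过权重空间算术提取量化向量并修补接收模型"的操作本身非常直接,可以用虚构的玩具权重示意(真实场景中这些是 ViT 的全部参数,此处的数值纯属演示):

```python
import numpy as np

# 玩具示意:从捐赠任务提取 quantization vector 并修补接收模型(权重为虚构数值)
donor_base = np.array([0.30, -1.20, 0.75])    # 捐赠任务的普通微调权重
donor_robust = np.array([0.25, -1.10, 0.70])  # 同一任务上对量化更鲁棒的权重
q_vec = donor_robust - donor_base             # 权重空间算术提取的 quantization vector

receiver = np.array([1.00, 0.40, -0.55])      # 接收任务的模型权重
patched = receiver + q_vec                    # 零样本修补,无需接收端数据或 QAT
```

论文的核心主张即:这一方向在权重空间中可迁移,修补后的接收模型对 PTQ 噪声更鲁棒。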
[CV-164] KiToke: Kernel-based Interval-aware Token Compression for Video Large Language Models
【速读】:该论文旨在解决视频大语言模型(Video LLMs)在推理过程中因视觉 token数量庞大而导致的高计算成本问题。现有方法通常依赖局部或分段启发式策略进行token压缩,难以有效减少时空冗余并保持关键视觉信息。其解决方案的关键在于提出一种无需训练、查询无关的token压缩方法KiToke:通过基于核函数的全局冗余度量估计token多样性,实现内容自适应选择;同时引入轻量级的时间间隔构建与区间感知的token合并机制,以维持时序连贯性。该方法能显著提升token利用效率,在极端token保留比例下(低至1%)仍保持优异性能,优于现有无训练压缩方法。
链接: https://arxiv.org/abs/2604.03414
作者: Haifeng Huang,Yang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video Large Language Models (Video LLMs) achieve strong performance on video understanding tasks but suffer from high inference costs due to the large number of visual tokens. We propose KiToke, a training-free, query-agnostic token compression approach that reduces spatiotemporal redundancy while preserving critical visual information. Our method estimates token diversity globally using a kernel-based redundancy measure, enabling content-adaptive selection that remains effective under extreme token budgets, and further introduces a lightweight temporal interval construction with interval-aware token merging to maintain temporal coherence. Unlike prior methods that rely on local or segment-level heuristics, KiToke explicitly captures global redundancy across an entire video, leading to more efficient token utilization. Extensive experiments on multiple video understanding benchmarks and Video LLM backbones demonstrate that KiToke consistently outperforms existing training-free compression methods, with particularly large gains at aggressive retention ratios down to 1%.
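"基于核的全局冗余度量、内容自适应选取 token"的思路,可以用一个核化的贪心多样性选取草图示意(RBF 核与贪心准则为假设的简化版本,并非 KiToke 的原始公式):

```python
import numpy as np

# 示意:基于 RBF 核冗余度的贪心多样性选取(核化 farthest-point 思路,非 KiToke 原式)
def select_tokens(X, k, gamma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * d2)                    # 核相似度:越大表示越冗余
    chosen = [0]
    while len(chosen) < k:
        redundancy = K[:, chosen].max(axis=1)  # 每个候选与已选集合的最大相似度
        redundancy[chosen] = np.inf            # 屏蔽已选 token
        chosen.append(int(np.argmin(redundancy)))  # 选最不冗余的 token
    return sorted(chosen)

# 三个"簇":{0,1}、{2,3}、{4},理想选取应覆盖每个簇
tokens = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1], [10.0, 0.0]])
sel = select_tokens(tokens, k=3)
```

在极端保留比例下,这类全局冗余度量能避免把预算浪费在彼此高度相似的 token 上。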
[CV-165] Banana100: Breaking NR-IQA Metrics by 100 Iterative Image Replications with Nano Banana Pro CVPR2026
【速读】:该论文旨在解决多轮迭代图像编辑中出现的图像质量退化问题,即在多模态代理系统(multi-modal agentic systems)进行连续编辑时,细微的伪影会逐步累积,导致图像噪声显著增加并偏离原始指令。其关键解决方案是构建了一个名为Banana100的综合性数据集,包含通过100轮迭代编辑生成的28,000张退化图像,涵盖多样纹理和内容,用于系统性研究此类失败现象;同时揭示了当前主流无参考图像质量评估(no-reference image quality assessment, NR-IQA)指标无法有效识别严重退化图像的问题,强调需开发更鲁棒的生成与评估机制以保障多模态代理系统的稳定性与安全性。
链接: https://arxiv.org/abs/2604.03400
作者: Kenan Tang,Praveen Arunshankar,Andong Hua,Anthony Yang,Yao Qin
机构: University of California, Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to CVPR 2026 Workshop on Agentic AI for Visual Media
Abstract:The multi-step, iterative image editing capabilities of multi-modal agentic systems have transformed digital content creation. Although latest image editing models faithfully follow instructions and generate high-quality images in single-turn edits, we identify a critical weakness in multi-turn editing, which is the iterative degradation of image quality. As images are repeatedly edited, minor artifacts accumulate, rapidly leading to a severe accumulation of visible noise and a failure to follow simple editing instructions. To systematically study these failures, we introduce Banana100, a comprehensive dataset of 28,000 degraded images generated through 100 iterative editing steps, including diverse textures and image content. Alarmingly, image quality evaluators fail to detect the degradation. Among 21 popular no-reference image quality assessment (NR-IQA) metrics, none of them consistently assign lower scores to heavily degraded images than to clean ones. The dual failures of generators and evaluators may threaten the stability of future model training and the safety of deployed agentic systems, if the low-quality synthetic data generated by multi-turn edits escape quality filters. We release the full code and data to facilitate the development of more robust models, helping to mitigate the fragility of multi-modal agentic systems.
[CV-166] ViBA: Implicit Bundle Adjustment with Geometric and Temporal Consistency for Robust Visual Matching
【速读】:该论文旨在解决现有图像关键点检测与描述方法依赖于精确姿态和深度标注数据集的问题,这类依赖限制了模型的可扩展性和泛化能力,并常导致导航与定位性能下降。其解决方案的关键在于提出ViBA框架,该框架通过将几何优化与特征学习相结合,实现对无约束视频流的持续在线训练;其核心机制包括:(1)用于帧间对应关系的初始跟踪网络,(2)基于深度的异常值过滤,以及(3)可微分的全局束调整(Bundle Adjustment, BA),通过最小化重投影误差联合优化相机位姿与特征位置。该设计利用BA带来的几何一致性与跨帧的长期时序一致性,强制生成稳定且准确的特征表示,从而在EuRoC和UMA数据集上显著降低平均绝对平移误差(ATE)12–18%、绝对旋转误差(ARE)5–10%,同时保持实时推理速度(36–91 FPS),并在未见序列中维持超过90%的定位精度,展现出优异的鲁棒性与泛化能力。
链接: https://arxiv.org/abs/2604.03377
作者: Xiaoji Niu,Yuqing Wang,Yan Wang,Hailiang Tang,Tisheng Zhang
机构: Wuhan University (武汉大学); Hubei Technology Innovation Center for Spatiotemporal Information and Positioning Navigation (湖北省时空信息与定位导航技术创新中心); Hubei Luojia Laboratory (湖北省珞珈实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Most existing image keypoint detection and description methods rely on datasets with accurate pose and depth annotations, limiting scalability and generalization, and often degrading navigation and localization performance. We propose ViBA, a sustainable learning framework that integrates geometric optimization with feature learning for continuous online training on unconstrained video streams. Embedded in a standard visual odometry pipeline, it consists of an implicitly differentiable geometric residual framework: (i) an initial tracking network for inter-frame correspondences, (ii) depth-based outlier filtering, and (iii) differentiable global bundle adjustment that jointly refines camera poses and feature positions by minimizing reprojection errors. By combining geometric consistency from BA with long-term temporal consistency across frames, ViBA enforces stable and accurate feature representations. We evaluate ViBA on EuRoC and UMA datasets. Compared with state-of-the-art methods such as SuperPoint+SuperGlue, ALIKED, and LightGlue, ViBA reduces mean absolute translation error (ATE) by 12-18% and absolute rotation error (ARE) by 5-10% across sequences, while maintaining real-time inference speeds (FPS 36-91). When evaluated on unseen sequences, it retains over 90% localization accuracy, demonstrating robust generalization. These results show that ViBA supports continuous online learning with geometric and temporal consistency, consistently improving navigation and localization in real-world scenarios.
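束调整(BA)最小化的重投影误差,可以对单个点用针孔模型直接示意(内参与观测坐标均为假设值):

```python
import numpy as np

# 示意:BA 中最小化的单点重投影误差(针孔模型,内参为假设值)
def reproject(P, R, t, K):
    uvw = K @ (R @ P + t)        # 世界点 -> 相机系 -> 像平面齐次坐标
    return uvw[:2] / uvw[2]      # 透视除法得到像素坐标

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
P = np.array([0.2, -0.1, 4.0])
uv = reproject(P, np.eye(3), np.zeros(3), K)
residual = uv - np.array([345.0, 227.5])  # 与观测到的特征像素坐标之差
```

BA 对所有观测的此类残差求和并联合优化位姿与点位;ViBA 将该残差做成可微分模块以反传至特征网络。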
[CV-167] YOLOv11 Demystified: A Practical Guide to High-Performance Object Detection
【速读】:该论文旨在解决实时目标检测中精度与速度难以兼顾的问题,特别是在小目标检测和复杂场景下特征提取能力不足的挑战。其解决方案的关键在于引入三项核心模块:C3K2块(改进的特征融合结构)、空间金字塔池化-快速(SPPF)模块以及C2PSA(跨阶段部分带空间注意力)模块,这些设计显著增强了模型的空间特征处理能力,同时保持了高推理效率,从而在不牺牲实时性的情况下提升了平均精度(mAP)。
链接: https://arxiv.org/abs/2604.03349
作者: Nikhileswara Rao Sulake
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Paper accepted to CVC 2026 conference, but not continued due to no financial support
Abstract:YOLOv11 is the latest iteration in the You Only Look Once (YOLO) series of real-time object detectors, introducing novel architectural modules to improve feature extraction and small-object detection. In this paper, we present a detailed analysis of YOLOv11, including its backbone, neck, and head components. The model's key innovations, the C3K2 blocks, Spatial Pyramid Pooling - Fast (SPPF), and C2PSA (Cross Stage Partial with Spatial Attention) modules, enhance spatial feature processing while preserving speed. We compare YOLOv11 performance to prior YOLO versions on standard benchmarks, highlighting improvements in mean Average Precision (mAP) and inference speed. Our results demonstrate that YOLOv11 achieves superior accuracy without sacrificing real-time capabilities, making it well-suited for applications in autonomous driving, surveillance, and video analytics. This work formalizes YOLOv11 in a research context, providing a clear reference for future studies.
[CV-168] Mixture-of-Experts in Remote Sensing: A Survey
【速读】:该论文旨在解决遥感数据(remote sensing data)分析与解释中因传感器模态多样性和地球观测数据时空动态性带来的独特挑战。其解决方案的关键在于引入混合专家模型(Mixture-of-Experts, MoE),通过动态路由机制将输入分配给针对特定任务方面设计的专用专家模块,从而实现对复杂遥感任务的高效建模与处理。
链接: https://arxiv.org/abs/2604.03342
作者: Yongchuan Cui,Peng Liu,Lajiao Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Remote sensing data analysis and interpretation present unique challenges due to the diversity in sensor modalities and spatiotemporal dynamics of Earth observation data. Mixture-of-Experts (MoE) model has emerged as a powerful paradigm that addresses these challenges by dynamically routing inputs to specialized experts designed for different aspects of a task. However, despite rapid progress, the community still lacks a comprehensive review of MoE for remote sensing. This survey provides the first systematic overview of MoE applications in remote sensing, covering fundamental principles, architectural designs, and key applications across a variety of remote sensing tasks. The survey also outlines future trends to inspire further research and innovation in applying MoE to remote sensing.
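综述所围绕的 MoE 核心机制是"门控网络把输入动态路由到少数专家"。下面用玩具专家函数示意典型的 top-k 软门控(与任何具体遥感模型无关):

```python
import numpy as np

# 示意:MoE 的 top-k 门控路由(玩具专家函数,非任何具体遥感模型)
def moe_forward(x, experts, gate_W, k=2):
    logits = gate_W @ x
    topk = np.argsort(logits)[-k:]                # 得分最高的 k 个专家
    w = np.exp(logits[topk] - logits[topk].max())
    w = w / w.sum()                               # 仅在 top-k 上做 softmax
    return sum(wi * experts[i](x) for wi, i in zip(w, topk))

experts = [lambda v: 1.0 * v, lambda v: 2.0 * v, lambda v: 3.0 * v]
gate_W = np.array([[10.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
out = moe_forward(np.array([1.0, 0.0]), experts, gate_W)
```

稀疏的 top-k 路由让每个输入只激活少量专家,这也是 MoE 在大模型中换取容量而不增推理开销的原因。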
[CV-169] Learning Additively Compositional Latent Actions for Embodied AI
【速读】:该论文旨在解决现有隐式动作学习(Latent Action Learning, LAL)方法在无结构先验约束下,所学隐式动作(latent actions)常混入无关场景细节或未来观测信息,导致动作与状态变化解耦不足、运动幅度校准失准的问题。其解决方案的关键在于提出加性可组合隐式动作模型(Additively Compositional Latent Action Model, AC-LAM),通过在短时程内对隐式动作空间施加场景感知的加性组合结构约束(AC constraints),强制隐式动作空间满足代数结构特性(如单位元、逆元和循环一致性),从而抑制非加性组合信息,使学到的隐式动作更具结构性、运动特异性且位移校准更准确,进而为下游策略学习提供更强监督信号。
链接: https://arxiv.org/abs/2604.03340
作者: Hangxing Wei,Xiaoyu Chen,Chuheng Zhang,Tim Pearce,Jianyu Chen,Alex Lamb,Li Zhao,Jiang Bian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Latent action learning infers pseudo-action labels from visual transitions, providing an approach to leverage internet-scale video for embodied AI. However, most methods learn latent actions without structural priors that encode the additive, compositional structure of physical motion. As a result, latents often entangle irrelevant scene details or information about future observations with true state changes and miscalibrate motion magnitude. We introduce Additively Compositional Latent Action Model (AC-LAM), which enforces scene-wise additive composition structure over short horizons on the latent action space. These AC constraints encourage simple algebraic structure in the latent action space~(identity, inverse, cycle consistency) and suppress information that does not compose additively. Empirically, AC-LAM learns more structured, motion-specific, and displacement-calibrated latent actions and provides stronger supervision for downstream policy learning, outperforming state-of-the-art LAMs across simulated and real-world tabletop tasks.
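摘要中的加性组合(AC)约束,本质是要求隐式动作在拼接与取逆时满足简单的代数关系。下面只演示约束的残差形式(并非 AC-LAM 训练目标的全貌):

```python
import numpy as np

# 示意:隐式动作空间上的加性组合(AC)约束形式(仅演示约束,非完整训练目标)
def ac_residuals(z_ab, z_bc, z_ac, z_ba):
    comp = np.linalg.norm(z_ab + z_bc - z_ac)  # 组合一致:a->b 加 b->c 应等于 a->c
    inv = np.linalg.norm(z_ab + z_ba)          # 逆元:a->b 加 b->a 应为零向量(单位元)
    return comp, inv

comp, inv = ac_residuals(np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                         np.array([1.0, 1.0]), np.array([-1.0, 0.0]))
```

训练时把这类残差作为损失项压低,就能抑制无法加性组合的场景细节或未来信息混入隐式动作。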
[CV-170] Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction
【速读】:该论文旨在解决单目RGB图像中深度估计的固有尺度模糊性与缺乏显式几何线索的问题,同时克服现有方法因复杂网络结构导致训练成本高、计算开销大且未能充分挖掘像素间空间依赖关系的局限。其解决方案的关键在于提出一种基于Swin Transformer骨干网络的多层级感知条件随机场(CRF)模型,包含三项协同创新:(1) 自适应混合金字塔特征融合(HPF)策略,结合多尺度空间金字塔池化与双向特征聚合,有效整合全局与局部上下文信息;(2) 分层感知适配器(HA),通过轻量级广播模块实现编码器内跨层级特征交互,降低计算复杂度并提升表征能力;(3) 全连接CRF解码器配合动态缩放注意力机制,建模细粒度像素级空间关系,并引入偏置学习单元防止极端值崩溃,保障训练稳定性。该方法在NYU Depth v2、KITTI和MatterPort3D数据集上达到最优性能,显著优于现有方法。
链接: https://arxiv.org/abs/2604.03339
作者: Wuqi Su,Huilun Song,Chen Zhao,Chi Xu
机构: Zhejiang Gongshang University (浙江工商大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Monocular depth estimation from a single RGB image remains a fundamental challenge in computer vision due to inherent scale ambiguity and the absence of explicit geometric cues. Existing approaches typically rely on increasingly complex network architectures to regress depth maps, which escalates training costs and computational overhead without fully exploiting inter-pixel spatial dependencies. We propose a multilevel perceptual conditional random field (CRF) model built upon the Swin Transformer backbone that addresses these limitations through three synergistic innovations: (1) an adaptive hybrid pyramid feature fusion (HPF) strategy that captures both short-range and long-range dependencies by combining multi-scale spatial pyramid pooling with biaxial feature aggregation, enabling effective integration of global and local contextual information; (2) a hierarchical awareness adapter (HA) that enriches cross-level feature interactions within the encoder through lightweight broadcast modules with learnable dimensional scaling, reducing computational complexity while enhancing representational capacity; and (3) a fully-connected CRF decoder with dynamic scaling attention that models fine-grained pixel-level spatial relationships, incorporating a bias learning unit to prevent extreme-value collapse and ensure stable training. Extensive experiments on NYU Depth v2, KITTI, and MatterPort3D datasets demonstrate that our method achieves state-of-the-art performance, reducing Abs Rel to 0.088 (-7.4%) and RMSE to 0.316 (-5.4%) on NYU Depth v2, while attaining near-perfect threshold accuracy ( \delta < 1.25^3 \approx 99.8% ) on KITTI with only 194M parameters and 21ms inference time.
[CV-171] Significance and Stability Analysis of Gene-Environment Interaction using RGxEStat
【速读】:该论文旨在解决基因型与环境互作(Genotype-by-Environment interaction, GxE)对作物表型预测准确性的影响问题,即GxE交互作用导致不同基因型在不同环境下表现不稳定,从而降低育种选择效率。其解决方案的关键在于提出两种核心分析模型:一是基于混合效应模型的显著性分析,用于判断基因型或GxE互作是否显著影响表型;二是稳定性分析,用以深入解析基因型与环境间的交互关系及其相对优劣。此外,作者开发了轻量级交互式工具RGxEStat,集成模型构建、求解与可视化功能,无需用户掌握复杂的SAS或R编程,即可高效完成育种数据分析,显著缩短研究周期。
链接: https://arxiv.org/abs/2604.03337
作者: Meng’en Qin,Zhe Li,Xiaohui Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Genotype-by-Environment (GxE) interactions influence the performance of genotypes across diverse environments, reducing the predictability of phenotypes in target environments. In-depth analysis of GxE interactions facilitates the identification of how genetic advantages or defects are expressed or suppressed under specific environmental conditions, thereby enabling genetic selection and enhancing breeding practices. This paper introduces two key models for GxE interaction research. Specifically, it includes significance analysis based on the mixed effect model to determine whether genes or GxE interactions significantly affect phenotypic traits; stability analysis, which further investigates the interactive relationships between genes and environments, as well as the relative superiority or inferiority of genotypes across environments. Additionally, this paper presents RGxEStat, a lightweight interactive tool, which is developed by the authors and integrates the construction, solution, and visualization of the aforementioned models. Designed to eliminate the need for breeders and agronomists to learn complex SAS or R programming, RGxEStat provides a user-friendly interface for streamlined breeding data analysis, significantly accelerating research cycles. Codes and datasets are available at this https URL.
[CV-172] Bridging the Dimensionality Gap: A Taxonomy and Survey of 2D Vision Model Adaptation for 3D Analysis
【速读】:该论文旨在解决将二维视觉中取得显著成功的卷积神经网络(Convolutional Neural Networks, CNNs)和视觉Transformer(Vision Transformers, ViTs)扩展至三维(3D)分析时所面临的根本性挑战,即2D图像的规则密集网格与3D数据(如点云和网格)不规则、稀疏特性之间的本质差异。其解决方案的关键在于提出一个统一的分类框架,将现有方法归纳为三类:(1) 数据中心型方法,通过将3D数据投影到2D空间以利用现成的2D模型;(2) 架构中心型方法,设计原生的3D网络结构;(3) 混合方法,协同融合上述两种范式,兼顾大规模2D数据中的视觉先验与3D模型的显式几何推理能力。该框架有助于系统性地理解不同策略在计算复杂度、预训练依赖性和几何归纳偏置保留等方面的权衡,从而推动3D基础模型、自监督学习(Self-Supervised Learning, SSL)及多模态信号深度融合等方向的发展。
链接: https://arxiv.org/abs/2604.03334
作者: Akshat Pandya,Bhavuk Jain
机构: Independent Researcher
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: VISAPP 2026
Abstract:The remarkable success of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in 2D vision has spurred significant research in extending these architectures to the complex domain of 3D analysis. Yet, a core challenge arises from a fundamental dichotomy between the regular, dense grids of 2D images and the irregular, sparse nature of 3D data such as point clouds and meshes. This survey provides a comprehensive review and a unified taxonomy of adaptation strategies that bridge this gap, classifying them into three families: (1) Data-centric methods that project 3D data into 2D formats to leverage off-the-shelf 2D models, (2) Architecture-centric methods that design intrinsic 3D networks, and (3) Hybrid methods, which synergistically combine the two modeling paradigms to benefit from both rich visual priors of large 2D datasets and explicit geometric reasoning of 3D models. Through this framework, we qualitatively analyze the fundamental trade-offs between these families concerning computational complexity, reliance on large-scale pre-training, and the preservation of geometric inductive biases. We discuss key open challenges and outline promising future research directions, including the development of 3D foundation models, advancements in self-supervised learning (SSL) for geometric data, and the deeper integration of multi-modal signals.
[CV-173] CoLoRSMamba: Conditional LoRA-Steered Mamba for Supervised Multimodal Violence Detection
【速读】:该论文旨在解决视频中暴力行为检测任务中音频信息利用不足的问题,尤其是在真实场景下声音环境复杂、音频与视觉内容关联性弱的情况下。解决方案的关键在于提出一种定向的视频到音频多模态架构 CoLoRSMamba,其核心创新是通过 CLS-guided 条件 LoRA(Low-Rank Adaptation)将 VideoMamba 和 AudioMamba 在每一层进行耦合:VideoMamba 的 CLS token 生成通道级调制向量和稳定门控机制,动态调节 AudioMamba 中状态空间参数(Delta, B, C)的投影,包括步长路径,从而实现无需跨标记注意力机制即可获得场景感知的音频动态建模。该方法在保留音频语义对齐的同时显著提升了多模态融合效率,在 NTU-CCTV 和 DVD 数据集上均优于现有音频仅、视频仅及多模态基线模型,并展现出更优的准确率-效率权衡。
链接: https://arxiv.org/abs/2604.03329
作者: Damith Chamalke Senadeera,Dimitrios Kollias,Gregory Slabaugh
机构: Digital Environmental Research Institute, Queen Mary University of London (伦敦玛丽女王大学数字环境研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
备注:
Abstract:Violence detection benefits from audio, but real-world soundscapes can be noisy or weakly related to the visible scene. We present CoLoRSMamba, a directional Video to Audio multimodal architecture that couples VideoMamba and AudioMamba through CLS-guided conditional LoRA. At each layer, the VideoMamba CLS token produces a channel-wise modulation vector and a stabilization gate that adapt the AudioMamba projections responsible for the selective state-space parameters (Delta, B, C), including the step-size pathway, yielding scene-aware audio dynamics without token-level cross-attention. Training combines binary classification with a symmetric AV-InfoNCE objective that aligns clip-level audio and video embeddings. To support fair multimodal evaluation, we curate audio-filtered clip level subsets of the NTU-CCTV and DVD datasets from temporal annotations, retaining only clips with available audio. On these subsets, CoLoRSMamba outperforms representative audio-only, video-only, and multimodal baselines, achieving 88.63% accuracy / 86.24% F1-V on NTU-CCTV and 75.77% accuracy / 72.94% F1-V on DVD. It further offers a favorable accuracy-efficiency tradeoff, surpassing several larger models with fewer parameters and FLOPs.
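摘要中"视频 CLS token 产生通道级调制向量与稳定门,去调节音频分支的选择性状态空间参数(如步长 Δ)"的机制,可以用一个线性玩具版本示意(并非 CoLoRSMamba 的真实条件 LoRA 实现,权重均为假设):

```python
import numpy as np

# 示意:用视频 CLS token 生成通道级调制向量与稳定门,调节音频分支的步长参数 Delta
# (线性玩具版本,非 CoLoRSMamba 的真实条件 LoRA 实现)
def modulate_delta(delta, cls, W_mod, gate):
    m = np.tanh(W_mod @ cls)         # 由 CLS 导出的逐通道调制向量,范围 (-1, 1)
    return delta * (1.0 + gate * m)  # gate=0 时退化为未调制的 Delta,起稳定作用

delta = np.array([0.5, 1.0, 2.0])
cls = np.array([0.3, -0.2])
W_mod = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

稳定门保证调制幅度受控:门关闭时音频分支完全退化为未调制的原始动态,避免视频条件噪声破坏训练。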
[CV-174] Review and Evaluation of Point-Cloud based Leaf Surface Reconstruction Methods for Agricultural Applications
【Quick Read】: This paper addresses accurate leaf-surface reconstruction from complex real-world plant point clouds, a requirement for agricultural applications such as phenotyping. The key contribution is a systematic comparison of nine representative surface reconstruction methods (parametric, triangulation-based, implicit, and learning-based) on three public datasets (LAST-STRAW, Pheno4D, and Crops3D), evaluating surface-area estimation accuracy, smoothness, robustness to noise and missing data, and computational cost, thereby providing practical guidance for selecting reconstruction techniques on resource-constrained robotic platforms.
Link: https://arxiv.org/abs/2604.03328
Authors: Arif Ahmed, Parikshit Maini
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Accurate reconstruction of leaf surfaces from 3D point cloud is essential for agricultural applications such as phenotyping. However, real-world plant data (i.e., irregular 3D point cloud) are often complex to reconstruct plant parts accurately. A wide range of surface reconstruction methods has been proposed, including parametric, triangulation-based, implicit, and learning based approaches, yet their relative performance for leaf surface reconstruction remains insufficiently understood. In this work, we present a comparative study of nine representative surface reconstruction methods for leaf surfaces. We evaluate these methods on three publicly available datasets: LAST-STRAW, Pheno4D, and Crops3D - spanning diverse species, sensors, and sensing environments, ranging from clean high-resolution indoor scans to noisy low-resolution field settings. The analysis highlights the trade-offs between surface area estimation accuracy, smoothness, robustness to noise and missing data, and computational cost across different methods. These factors affect the cost and constraints of robotic hardware used in agricultural applications. Our results show that each method exhibits distinct advantages depending on application and resource constraints. The findings provide practical guidance for selecting surface reconstruction techniques for resource constrained robotic platforms.
[CV-175] Safety-Aligned 3D Object Detection: Single-Vehicle Cooperative and End-to-End Perspectives
【Quick Read】: This paper tackles the disconnect between standard perception metrics (such as mAP and NDS) and actual safety impact in connected and autonomous vehicles (CAVs), where perception underperforms in safety-critical scenarios. Although deep learning has raised perception accuracy, its statistical nature cannot guarantee that high-risk false or missed detections are minimized, and existing training objectives and benchmarks treat all errors equally even though only a subset is safety-critical. The key of the solution is two tools: the safety-oriented evaluation metric NDS-USC, which characterizes high-impact errors, and the safety-aware loss function EC-IoU, which explicitly accounts for safety-relevant errors during training. The study further shows that safety-aware hardening with EC-IoU, applied across single-vehicle perception, AV-infrastructure cooperative perception, and an end-to-end perception-to-planning framework, reduces collision rate by nearly 30% and improves system-level safety, offering a practical path toward "Vision Zero".
Link: https://arxiv.org/abs/2604.03325
Authors: Brian Hsuan-Cheng Liao, Chih-Hong Cheng, Hasan Esen, Alois Knoll
Affiliations: DENSO AUTOMOTIVE Deutschland GmbH; Carl von Ossietzky University of Oldenburg; Technical University of Munich
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 10 pages, 9 figures, 6 tables
Abstract:Perception plays a central role in connected and autonomous vehicles (CAVs), underpinning not only conventional modular driving stacks, but also cooperative perception systems and recent end-to-end driving models. While deep learning has greatly improved perception performance, its statistical nature makes perfect predictions difficult to attain. Meanwhile, standard training objectives and evaluation benchmarks treat all perception errors equally, even though only a subset is safety-critical. In this paper, we investigate safety-aligned evaluation and optimization for 3D object detection that explicitly characterize high-impact errors. Building on our previously proposed safety-oriented metric, NDS-USC, and safety-aware loss function, EC-IoU, we make three contributions. First, we present an expanded study of single-vehicle 3D object detection models across diverse neural network architectures and sensing modalities, showing that gains under standard metrics such as mAP and NDS may not translate to safety-oriented criteria represented by NDS-USC. With EC-IoU, we reaffirm the benefit of safety-aware fine-tuning for improving safety-critical detection performance. Second, we conduct an ego-centric, safety-oriented evaluation of AV-infrastructure cooperative object detection models, underscoring its superiority over vehicle-only models and demonstrating a safety impact analysis that illustrates the potential contribution of cooperative models to “Vision Zero.” Third, we integrate EC-IoU into SparseDrive and show that safety-aware perception hardening can reduce collision rate by nearly 30% and improve system-level safety directly in an end-to-end perception-to-planning framework. Overall, our results indicate that safety-aligned perception evaluation and optimization offer a practical path toward enhancing CAV safety across single-vehicle, cooperative, and end-to-end autonomy settings.
[CV-176] VitaTouch: Property-Aware Vision-Tactile-Language Model for Robotic Quality Inspection in Manufacturing
【Quick Read】: This paper addresses the limitations of vision-only quality inspection in smart manufacturing, which struggles to identify intrinsic material properties (e.g., hardness, roughness) and is vulnerable to occlusion and reflection. The key is VitaTouch, a property-aware vision-tactile-language model for material-property inference and natural-language attribute description: modality-specific encoders and a dual Q-Former extract language-relevant visual and tactile features, which are compressed into prefix tokens for a large language model, while contrastive learning explicitly couples the vision and touch modalities, enabling accurate property recognition and natural-language material descriptions.
Link: https://arxiv.org/abs/2604.03322
Authors: Junyi Zong, Qingxuan Jia, Meixian Shi, Tong Li, Jiayuan Li, Zihang Lv, Gang Chen, Fang Deng
Affiliations: Beijing University of Posts and Telecommunications; Zhongguancun Academy; Beijing Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 11 pages, 6 figures
Abstract:Quality inspection in smart manufacturing requires identifying intrinsic material and surface properties beyond visible geometry, yet vision-only methods remain vulnerable to occlusion and reflection. We propose VitaTouch, a property-aware vision-tactile-language model for material-property inference and natural-language attribute description. VitaTouch uses modality-specific encoders and a dual Q-Former to extract language-relevant visual and tactile features, which are compressed into prefix tokens for a large language model. We align each modality with text and explicitly couple vision and touch through contrastive learning. We also construct VitaSet, a multimodal dataset with 186 objects, 52k images, and 5.1k human-verified instruction-answer pairs. VitaTouch achieves the best performance on HCT and the overall TVL benchmark, while remaining competitive on SSVTP. On VitaSet, it reaches 88.89% hardness accuracy, 75.13% roughness accuracy, and 54.81% descriptor recall; the material-description task further achieves a peak semantic similarity of 0.9009. With LoRA-based fine-tuning, VitaTouch attains 100.0%, 96.0%, and 92.0% accuracy for 2-, 3-, and 5-category defect recognition, respectively, and delivers 94.0% closed-loop recognition accuracy and 94.0% end-to-end sorting success in 100 laboratory robotic trials. More details are available at the project page: this https URL
[CV-177] Robust Multi-Source Covid-19 Detection in CT Images CVPR2026
【Quick Read】: This paper addresses the performance drop of deep learning models for COVID-19 detection on multi-center chest CT scans, where differences in scanners, imaging protocols, and patient populations bias the learned features toward centres that contribute more data. The key is a multi-task learning framework that predicts both the COVID-19 diagnosis and the originating data centre, sharing an EfficientNet-B7 backbone so the feature extractor learns representations that hold across centres; to counter the uneven data distribution, a logit-adjusted cross-entropy loss is applied to the source-classification head so that underrepresented centres are not overlooked, improving generalization.
Link: https://arxiv.org/abs/2604.03320
Authors: Asmita Yuki Pritha, Jason Xu, Daniel Ding, Justin Li, Aryana Hou, Xin Wang, Shu Hu
Affiliations: Capstone School Dhaka, Bangladesh; Carmel High School, Carmel, Indiana, USA; Clarkstown High School South, West Nyack, New York, USA; University at Albany, State University of New York, Albany, New York, USA; Purdue University, West Lafayette, Indiana, USA
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures, 3 tables. Accepted at the 3rd Workshop on New Trends in AI-Generated Media and Security (AIMS) @ CVPR 2026
Abstract:Deep learning models for COVID-19 detection from chest CT scans generally perform well when the training and test data originate from the same institution, but they often struggle when scans are drawn from multiple centres with differing scanners, imaging protocols, and patient populations. One key reason is that existing methods treat COVID-19 classification as the sole training objective, without accounting for the data source of each scan. As a result, the learned representations tend to be biased toward centres that contribute more training data. To address this, we propose a multi-task learning approach in which the model is trained to predict both the COVID-19 diagnosis and the originating data centre. The two tasks share an EfficientNet-B7 backbone, which encourages the feature extractor to learn representations that hold across all four participating centres. Since the training data is not evenly distributed across sources, we apply a logit-adjusted cross-entropy loss [1] to the source classification head to prevent underrepresented centres from being overlooked. Our pre-processing follows the SSFL framework with KDS [2], selecting eight representative slices per scan. Our method achieves an F1 score of 0.9098 and an AUC-ROC of 0.9647 on a validation set of 308 scans. The code is publicly available at this https URL.
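The logit-adjusted cross-entropy applied to the source-classification head has a compact standard form: each logit is shifted by the log of its class prior so that frequent centres stop dominating the loss. A minimal sketch of this logit adjustment (the `tau` temperature and list-based interface are illustrative; the paper's exact setting may differ):

```python
import math

def logit_adjusted_ce(logits, label, class_priors, tau=1.0):
    """Cross-entropy with logits shifted by tau * log(class prior).

    Frequent classes receive a larger additive offset, so an example
    from a rare class (small prior) incurs a larger loss unless the
    model compensates with a stronger logit for that class.
    """
    adjusted = [z + tau * math.log(p) for z, p in zip(logits, class_priors)]
    # numerically stable log-sum-exp for the normalizer
    m = max(adjusted)
    log_norm = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return log_norm - adjusted[label]
```

With uniform priors the adjustment cancels and the loss reduces to the ordinary cross-entropy, which is a useful sanity check.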
[CV-178] EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs CVPR2026
【Quick Read】: This paper addresses two challenges facing multimodal large language models (MLLMs) in spatial cognition: existing methods rely on 3D priors or geometric supervision to improve spatial reasoning, incurring heavy data-preparation and alignment costs, while purely 2D approaches struggle to capture cross-frame spatial relationships, limiting multi-frame reasoning. The key is the EgoMind framework with two core modules: Role-Play Caption, which jointly constructs a coherent linguistic scene graph across frames to enable geometry-free spatial understanding, and Progressive Spatial Analysis, which reasons step by step toward task-specific questions. With only 5K auto-generated SFT samples and 20K RL samples, EgoMind achieves competitive results on VSI-Bench, SPAR-Bench, SITE-Bench, and SPBench, demonstrating the potential of linguistic reasoning for spatial cognition.
Link: https://arxiv.org/abs/2604.03318
Authors: Zhenghao Chen, Huiqun Wang, Di Huang
Affiliations: Beihang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026
Abstract:Multimodal large language models (MLLMs) are increasingly being applied to spatial cognition tasks, where they are expected to understand and interact with complex environments. Most existing works improve spatial reasoning by introducing 3D priors or geometric supervision, which enhances performance but incurs substantial data preparation and alignment costs. In contrast, purely 2D approaches often struggle with multi-frame spatial reasoning due to their limited ability to capture cross-frame spatial relationships. To address these limitations, we propose EgoMind, a Chain-of-Thought framework that enables geometry-free spatial reasoning through Role-Play Caption, which jointly constructs a coherent linguistic scene graph across frames, and Progressive Spatial Analysis, which progressively reasons toward task-specific questions. With only 5K auto-generated SFT samples and 20K RL samples, EgoMind achieves competitive results on VSI-Bench, SPAR-Bench, SITE-Bench, and SPBench, demonstrating its effectiveness in strengthening the spatial reasoning capabilities of MLLMs and highlighting the potential of linguistic reasoning for spatial cognition. Code and data are released at this https URL.
[CV-179] Gaze to Insight: A Scalable AI Approach for Detecting Gaze Behaviours in Face-to-Face Collaborative Learning
【Quick Read】: This paper addresses two obstacles facing existing machine-learning approaches to detecting gaze behaviours in face-to-face collaborative learning: model training typically requires large amounts of human-annotated data, making it costly and hard to scale, and models lack robustness across educational configurations, limiting real-world applicability. The key is a scalable AI approach that leverages pretrained and foundation models for annotation-free automatic detection: pretrained YOLO11 for person tracking, YOLOE-26 with text-prompt capability for education-related object detection, and the Gaze-LLE model for gaze-target prediction. Without any human-annotated data, the approach delivers superior and more stable performance in complex contexts, markedly improving cross-configuration robustness.
Link: https://arxiv.org/abs/2604.03317
Authors: Junyuan Liang, Qi Zhou, Sahan Bulathwela, Mutlu Cukurova
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 6 figures, 2 tables, accepted by the 27th International Conference on Artificial Intelligence in Education (AIED 2026)
Abstract:Previous studies have illustrated the potential of analysing gaze behaviours in collaborative learning to provide educationally meaningful information for students to reflect on their learning. Over the past decades, machine learning approaches have been developed to automatically detect gaze behaviours from video data. Yet, since these approaches often require large amounts of labelled data for training, human annotation remains necessary. Additionally, researchers have questioned the cross-configuration robustness of machine learning models developed, as training datasets often fail to encompass the full range of situations encountered in educational contexts. To address these challenges, this study proposes a scalable artificial intelligence approach that leverages pretrained and foundation models to automatically detect gaze behaviours in face-to-face collaborative learning contexts without requiring human-annotated data. The approach utilises pretrained YOLO11 for person tracking, YOLOE-26 with text-prompt capability for education-related object detection, and the Gaze-LLE model for gaze target prediction. The results indicate that the proposed approach achieves an F1-score of 0.829 in detecting students’ gaze behaviours from video data, with strong performance for laptop-directed gaze and peer-directed gaze, yet weaker performance for other gaze targets. Furthermore, when compared to other supervised machine learning approaches, the proposed method demonstrates superior and more stable performance in complex contexts, highlighting its better cross-configuration robustness. The implications of this approach for supporting students’ collaborative learning in real-world environments are also discussed.
[CV-180] When Sinks Help or Hurt: Unified Framework for Attention Sink in Large Vision-Language Models
【Quick Read】: This paper investigates how attention sinks affect cross-modal performance in large vision-language models (LVLMs), in particular the trade-off between global scene modeling and fine-grained local perception. Visual sinks fall into two categories: V-sinks, which propagate from the vision transformer encoder, and L-sinks, which emerge in deep LLM layers; while both effectively encode global scene priors, their dominance in attention allocation suppresses the fine-grained visual evidence required for local perception, hurting downstream tasks. The key of the solution is Layer-wise Sink Gating (LSG), a lightweight plug-and-play module that dynamically rescales the attention contributions of V-sinks versus the remaining visual tokens, balancing global reasoning and precise local perception without modifying the LVLM backbone, trained via standard next-token prediction with no task-specific supervision.
Link: https://arxiv.org/abs/2604.03316
Authors: Jiho Choi, Jaemin Kim, Sanghwan Kim, Seunghoon Hong, Jin-Hwi Park
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: preprint
Abstract:Attention sinks are defined as tokens that attract disproportionate attention. While these have been studied in single modality transformers, their cross-modal impact in Large Vision-Language Models (LVLM) remains largely unexplored: are they redundant artifacts or essential global priors? This paper first categorizes visual sinks into two distinct categories: ViT-emerged sinks (V-sinks), which propagate from the vision encoder, and LLM-emerged sinks (L-sinks), which arise within deep LLM layers. Based on the new definition, our analysis reveals a fundamental performance trade-off: while sinks effectively encode global scene-level priors, their dominance can suppress the fine-grained visual evidence required for local perception. Furthermore, we identify specific functional layers where modulating these sinks most significantly impacts downstream performance. To leverage these insights, we propose Layer-wise Sink Gating (LSG), a lightweight, plug-and-play module that dynamically scales the attention contributions of V-sink and the rest visual tokens. LSG is trained via standard next-token prediction, requiring no task-specific supervision while keeping the LVLM backbone frozen. In most layers, LSG yields improvements on representative multimodal benchmarks, effectively balancing global reasoning and precise local evidence.
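LSG's core operation, rescaling how much attention flows to sink tokens versus the remaining visual tokens, reduces to gating followed by renormalization. A toy sketch assuming precomputed attention rows and known sink indices; the fixed `sink_gate`/`rest_gate` values stand in for LSG's learned per-layer scales:

```python
def sink_gated_attention(attn_weights, sink_idx, sink_gate, rest_gate):
    """Rescale attention mass on sink tokens, then renormalize each row.

    attn_weights: list of attention rows (each sums to 1)
    sink_idx:     set of column indices identified as sink tokens
    sink_gate:    scale applied to sink columns (e.g. < 1 to suppress)
    rest_gate:    scale applied to all other visual tokens
    """
    gated = []
    for row in attn_weights:
        scaled = [w * (sink_gate if i in sink_idx else rest_gate)
                  for i, w in enumerate(row)]
        total = sum(scaled)
        # renormalize so the row remains a valid attention distribution
        gated.append([w / total for w in scaled])
    return gated
```

Setting `sink_gate` below 1 shifts attention mass from the sink toward the remaining tokens while keeping each row a proper distribution.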
[CV-181] StoryBlender: Inter-Shot Consistent and Editable 3D Storyboard with Spatial-temporal Dynamics
【Quick Read】: This paper addresses two core challenges in automating storyboarding for visual storytelling: inter-shot consistency and explicit editability. 2D diffusion-based methods produce high-quality imagery but suffer from identity drift of characters and scenes and weak geometric control, while traditional 3D animation workflows are consistent and editable yet require expert-heavy, labor-intensive authoring. The key is the StoryBlender framework, built on a three-stage pipeline: (1) Semantic-Spatial Grounding, which constructs a continuity memory graph to decouple global assets from shot-specific variables; (2) Canonical Asset Materialization, which instantiates entities in a unified coordinate space to maintain visual identity; and (3) Spatial-Temporal Dynamics, which achieves layout design and cinematic evolution through visual metrics. By orchestrating multiple agents hierarchically within an engine-verified feedback loop, the system iteratively self-corrects spatial hallucinations, producing native 3D scenes that support direct, precise editing while preserving stable multi-shot consistency.
Link: https://arxiv.org/abs/2604.03315
Authors: Bingliang Li, Zhenhong Sun, Jiaming Bian, Yuehao Wu, Yifu Wang, Hongdong Li, Yatao Bian, Huadong Mo, Daoyi Dong
Affiliations: University of New South Wales; Australian National University; Central South University; National University of Singapore; University of Technology Sydney; Vertex Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Storyboarding is a core skill in visual storytelling for film, animation, and games. However, automating this process requires a system to achieve two properties that current approaches rarely satisfy simultaneously: inter-shot consistency and explicit editability. While 2D diffusion-based generators produce vivid imagery, they often suffer from identity drift along with limited geometric control; conversely, traditional 3D animation workflows are consistent and editable but require expert-heavy, labor-intensive authoring. We present StoryBlender, a grounded 3D storyboard generation framework governed by a Story-centric Reflection Scheme. At its core, we propose the StoryBlender system, which is built on a three-stage pipeline: (1) Semantic-Spatial Grounding, to construct a continuity memory graph to decouple global assets from shot-specific variables for long-horizon consistency; (2) Canonical Asset Materialization, to instantiate entities in a unified coordinate space to maintain visual identity; and (3) Spatial-Temporal Dynamics, to achieve layout design and cinematic evolution through visual metrics. By orchestrating multiple agents in a hierarchical manner within a verification loop, StoryBlender iteratively self-corrects spatial hallucinations via engine-verified feedback. The resulting native 3D scenes support direct, precise editing of cameras and visual assets while preserving unwavering multi-shot continuity. Experiments demonstrate that StoryBlender significantly improves consistency and editability over both diffusion-based and 3D-grounded baselines. Code, data, and demonstration video will be available on this https URL
[CV-182] CardioSAM: Topology-Aware Decoder Design for High-Precision Cardiac MRI Segmentation
【Quick Read】: This paper addresses insufficient boundary precision in segmenting cardiac structures from cardiovascular magnetic resonance (CMR) images, which is critical for reliable diagnosis and treatment. Deep learning methods generalize well but fall short of the strict boundary accuracy required clinically. The key is CardioSAM, a hybrid architecture whose core components are: (1) a frozen Segment Anything Model (SAM) encoder for generalized feature extraction, and (2) a lightweight trainable cardiac-specific decoder containing a Cardiac-Specific Attention module that incorporates anatomical topological priors and a Boundary Refinement Module that sharpens tissue-interface delineation. On the ACDC benchmark, CardioSAM achieves a Dice coefficient of 93.39% and an HD95 of 4.2 mm, surpassing strong baselines such as nnU-Net and exceeding reported inter-expert agreement, indicating potential for clinical application.
Link: https://arxiv.org/abs/2604.03313
Authors: Ujjwal Jain
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Accurate segmentation of cardiac structures in cardiovascular magnetic resonance (CMR) images is essential for reliable diagnosis and treatment of cardiovascular diseases. However, manual segmentation remains time-consuming and suffers from significant inter-observer variability. Recent advances in deep learning, particularly foundation models such as the Segment Anything Model (SAM), demonstrate strong generalization but often lack the boundary precision required for clinical applications. To address this limitation, we propose CardioSAM, a hybrid architecture that combines the generalized feature extraction capability of a frozen SAM encoder with a lightweight, trainable cardiac-specific decoder. The proposed decoder introduces two key innovations: a Cardiac-Specific Attention module that incorporates anatomical topological priors, and a Boundary Refinement Module designed to improve tissue interface delineation. Experimental evaluation on the ACDC benchmark demonstrates that CardioSAM achieves a Dice coefficient of 93.39%, IoU of 87.61%, pixel accuracy of 99.20%, and HD95 of 4.2 mm. The proposed method surpasses strong baselines such as nnU-Net by +3.89% Dice and exceeds reported inter-expert agreement levels (91.2%), indicating its potential for reliable and clinically applicable cardiac segmentation.
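The headline Dice coefficient reported above is a standard overlap metric for segmentation; for reference, a minimal implementation on flattened binary masks:

```python
def dice_coefficient(pred, target):
    """Dice overlap between two binary masks given as flat 0/1 lists.

    Dice = 2 * |pred AND target| / (|pred| + |target|); by convention,
    two empty masks count as a perfect match.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 if denom == 0 else 2.0 * intersection / denom
```

Identical masks score 1.0, disjoint masks 0.0, and partially overlapping masks fall in between (CardioSAM's 93.39% corresponds to a Dice of 0.9339).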
[CV-183] PollutionNet: A Vision Transformer Framework for Climatological Assessment of NO_2 and SO_2 Using Satellite-Ground Data Fusion
【Quick Read】: This paper addresses the tension between temporal resolution and spatial coverage in monitoring atmospheric nitrogen dioxide (NO₂) and sulfur dioxide (SO₂): satellite observations offer broad coverage but suffer data gaps, while ground sensors provide high temporal resolution but limited spatial extent. The key is PollutionNet, a Vision Transformer-based framework that fuses Sentinel-5P TROPOMI vertical column density (VCD) data with ground-level observations; its self-attention mechanism captures complex spatiotemporal dependencies in pollutant concentrations, and in a case study of Ireland (2020-2021) it outperforms conventional CNN and RNN baselines (RMSE of 6.89 μg/m³ for NO₂ and 4.49 μg/m³ for SO₂), improving the accuracy and scalability of pollution assessment.
Link: https://arxiv.org/abs/2604.03311
Authors: Prasanjit Dey, Soumyabrata Dev, Bianca Schoen-Phelan
Affiliations: University College Dublin; Technological University Dublin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: This manuscript is currently under review at Theoretical and Applied Climatology (Springer)
Abstract:Accurate assessment of atmospheric nitrogen dioxide (NO_2) and sulfur dioxide (SO_2) is essential for understanding climate-air quality interactions, supporting environmental policy, and protecting public health. Traditional monitoring approaches face limitations: satellite observations provide broad spatial coverage but suffer from data gaps, while ground-based sensors offer high temporal resolution but limited spatial extent. To address these challenges, we propose PollutionNet, a Vision Transformer-based framework that integrates Sentinel-5P TROPOMI vertical column density (VCD) data with ground-level observations. By leveraging self-attention mechanisms, PollutionNet captures complex spatiotemporal dependencies that are often missed by conventional CNN and RNN models. Applied to Ireland (2020-2021), our case study demonstrates that PollutionNet achieves state-of-the-art performance (RMSE: 6.89 μg/m³ for NO_2, 4.49 μg/m³ for SO_2), reducing prediction errors by up to 14% compared to baseline models. Beyond accuracy gains, PollutionNet provides a scalable and data-efficient tool for applied climatology, enabling robust pollution assessments in regions with sparse monitoring networks. These results highlight the potential of advanced machine learning approaches to enhance climate-related air quality research, inform environmental management, and support sustainable policy decisions.
[CV-184] Diffusion Path Alignment for Long-Range Motion Generation and Domain Transitions
【Quick Read】: This paper addresses coherent transitions across semantically distinct motion domains in long-range human motion generation, a capability essential for applications such as dance choreography yet largely unexplored. The key is an inference-time optimization framework inspired by diffusion-based stochastic optimal control: a control-energy objective explicitly regularizes the transition trajectories of a pretrained diffusion model, yielding transitions with high fidelity and temporal coherence and providing the first general framework for controlled long-range human motion generation with explicit transition modeling.
Link: https://arxiv.org/abs/2604.03310
Authors: Haichao Wang, Alexander Okupnik, Yuxing Han, Gene Wen, Johannes Schneider, Kyriakos Flouris
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Long-range human movement generation remains a central challenge in computer vision and graphics. Generating coherent transitions across semantically distinct motion domains remains largely unexplored. This capability is particularly important for applications such as dance choreography, where movements must fluidly transition across diverse stylistic and semantic motifs. We propose a simple and effective inference-time optimization framework inspired by diffusion-based stochastic optimal control. Specifically, a control-energy objective that explicitly regularizes the transition trajectories of a pretrained diffusion model. We show that optimizing this objective at inference time yields transitions with fidelity and temporal coherence. This is the first work to provide a general framework for controlled long-range human motion generation with explicit transition modeling.
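As rough intuition for a control-energy objective, one can penalize the squared second differences of a trajectory and refine it at inference time with fixed endpoints. The sketch below is a drastically simplified 1-D stand-in: the paper's objective acts on the control terms of a pretrained diffusion model under stochastic optimal control, not directly on positions, so only the shape of the idea is illustrated here.

```python
def control_energy(traj):
    """Sum of squared second differences of a 1-D trajectory - a simple
    discrete proxy for control effort (zero for a straight line)."""
    return sum((traj[i + 1] - 2 * traj[i] + traj[i - 1]) ** 2
               for i in range(1, len(traj) - 1))

def smooth_transition(traj, steps=300, lr=0.02):
    """Gradient descent on the control energy with both endpoints fixed,
    mimicking inference-time refinement of a transition segment."""
    traj = list(traj)
    for _ in range(steps):
        grads = [0.0] * len(traj)
        for i in range(1, len(traj) - 1):
            a = traj[i + 1] - 2 * traj[i] + traj[i - 1]
            # d(a^2)/dx contributions for the three points touching a
            grads[i - 1] += 2 * a
            grads[i] += -4 * a
            grads[i + 1] += 2 * a
        for i in range(1, len(traj) - 1):  # endpoints stay anchored to their domains
            traj[i] -= lr * grads[i]
    return traj
```

The refined trajectory keeps its boundary poses (the two motion domains being bridged) while the interior relaxes toward a low-energy, smooth transition.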
[CV-185] TreeGaussian: Tree-Guided Cascaded Contrastive Learning for Hierarchical Consistent 3D Gaussian Scene Segmentation and Understanding
【Quick Read】: This paper addresses the inability of existing 3D Gaussian Splatting (3DGS) methods to model hierarchical semantic structure and capture whole-part relationships in complex scenes, as well as the suboptimal feature learning and segmentation caused by dense pairwise comparisons and inconsistent hierarchical labels from 2D priors. The key is the TreeGaussian framework: a multi-level object tree explicitly models hierarchical semantic relationships, and a two-stage cascaded contrastive learning strategy progressively refines feature representations from global to local, mitigating saturation and stabilizing training; a Consistent Segmentation Detection (CSD) mechanism and a graph-based denoising module further align segmentation across views and suppress unstable Gaussian points, markedly improving the consistency, quality, and robustness of 3D scene segmentation.
Link: https://arxiv.org/abs/2604.03309
Authors: Jingbin You, Zehao Li, Hao Jiang, Xinzhu Ma, Shuqin Gao, Honglong Zhao, Congcong Zheng, Tianlu Mao, Feng Dai, Yucheng Zhang, Zhaoqi Wang
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:3D Gaussian Splatting (3DGS) has emerged as a real-time, differentiable representation for neural scene understanding. However, existing 3DGS-based methods struggle to represent hierarchical 3D semantic structures and capture whole-part relationships in complex scenes. Moreover, dense pairwise comparisons and inconsistent hierarchical labels from 2D priors hinder feature learning, resulting in suboptimal segmentation. To address these limitations, we introduce TreeGaussian, a tree-guided cascaded contrastive learning framework that explicitly models hierarchical semantic relationships and reduces redundancy in contrastive supervision. By constructing a multi-level object tree, TreeGaussian enables structured learning across object-part hierarchies. In addition, we propose a two-stage cascaded contrastive learning strategy that progressively refines feature representations from global to local, mitigating saturation and stabilizing training. A Consistent Segmentation Detection (CSD) mechanism and a graph-based denoising module are further introduced to align segmentation modes across views while suppressing unstable Gaussian points, enhancing segmentation consistency and quality. Extensive experiments, including open-vocabulary 3D object selection, 3D point cloud understanding, and ablation studies, demonstrate the effectiveness and robustness of our approach.
[CV-186] Edge-Based Standing-Water Detection via FSM-Guided Tiering and Multi-Model Consensus
【Quick Read】: This paper presents an adaptive edge architecture for detecting standing water in agricultural fields, where it threatens vehicle mobility and crop health. The key is a finite-state machine (FSM) decision engine that fuses camera input with environmental sensors (humidity, pressure, temperature) to dynamically select between local and offloaded inference tiers, trading accuracy, latency, and energy under intermittent connectivity and motion-dependent compute budgets; combined with a multi-model YOLO ensemble and diurnal-baseline sensor fusion, the system achieves more robust flood detection while maintaining bounded tail latency and lower energy use in a real agricultural setting.
Link: https://arxiv.org/abs/2604.03308
Authors: Oliver Aleksander Larsen, Mahyar T. Moghaddam
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the In Practice Track of IEEE ICSA 2026. 10 pages
Abstract:Standing water in agricultural fields threatens vehicle mobility and crop health. This paper presents a deployed edge architecture for standing-water detection using Raspberry-Pi-class devices with optional Jetson acceleration. Camera input and environmental sensors (humidity, pressure, temperature) are combined in a finite-state machine (FSM) that acts as the architectural decision engine. The FSM-guided control plane selects between local and offloaded inference tiers, trading accuracy, latency, and energy under intermittent connectivity and motion-dependent compute budgets. A multi-model YOLO ensemble provides image scores, while diurnal-baseline sensor fusion adjusts caution using environmental anomalies. All decisions are logged per frame, enabling bit-identical hardware-in-the-loop replays. Across ten configurations and sensor variants on identical field sequences with frame-level ground truth, we show that the combination of adaptive tiering, multi-model consensus, and diurnal sensor fusion improves flood-detection performance over static local baselines, uses less energy than a naive always-heavy offload policy, and maintains bounded tail latency in a real agricultural setting.
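The FSM-guided tier selection described above can be illustrated as a small decision function over connectivity, energy budget, motion, and the fused caution score. All thresholds, inputs, and tier names below are hypothetical, not the deployed policy:

```python
def select_tier(connected, battery_level, moving, caution):
    """Toy FSM-style choice between on-device and offloaded inference.

    connected:     whether the offload link is currently available
    battery_level: remaining energy budget in [0, 1]
    moving:        whether the vehicle is in motion (compute-constrained)
    caution:       fused environmental-anomaly score in [0, 1]
    """
    if not connected:
        return "local"            # offloading is impossible
    if caution > 0.8 and battery_level > 0.2:
        return "offload"          # high flood risk: spend energy on the heavy tier
    if moving and battery_level < 0.5:
        return "local"            # conserve energy while the vehicle drives
    return "offload" if battery_level > 0.5 else "local"
```

Because every branch is an explicit, loggable rule, such a policy supports the paper's per-frame decision logging and bit-identical hardware-in-the-loop replays.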
[CV-187] V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
【Quick Read】: This paper addresses the perception-related hallucinations of multimodal large language models (MLLMs) in fine-grained tasks, rooted in reasoning that is confined to the language domain and treats visual input as a static premise rather than a dynamic participant, leaving the model unable to revisit visual details to ground its evolving reasoning. The key is V-Reflection, a "think-then-look" visual-reflection framework that turns the MLLM into an active interrogator: latent states act as dynamic probes that actively query the visual feature space during reasoning, grounding each step in task-critical evidence. A two-stage distillation strategy first establishes stable pixel-to-latent targets via a Box-Guided Compression (BCM) module, then maps hidden states to dynamic probes via a Dynamic Autoregressive Compression (DAC) module, internalizing the BCM teacher's spatial expertise into the DAC student; at inference both modules remain inactive, preserving purely end-to-end autoregressive decoding for efficiency as well as accuracy.
Link: https://arxiv.org/abs/2604.03307
Authors: Jiazhou Zhou, Yucheng Chen, Hongyang Li, Qing Jiang, Hu Zhou, Ying-Cong Chen, Lei Zhang
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); International Digital Economy Academy; Nanyang Technological University; Centre of AI in Medicine; South China University of Technology; The Hong Kong Polytechnic University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Main paper 14 pages with supplementary 7 pages
Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable success, yet they remain prone to perception-related hallucinations in fine-grained tasks. This vulnerability arises from a fundamental limitation: their reasoning is largely restricted to the language domain, treating visual input as a static, reasoning-agnostic preamble rather than a dynamic participant. Consequently, current models act as passive observers, unable to re-examine visual details to ground their evolving reasoning states. To overcome this, we propose V-Reflection, a framework that transforms the MLLM into an active interrogator through a “think-then-look” visual reflection mechanism. During reasoning, latent states function as dynamic probes that actively interrogate the visual feature space, grounding each reasoning step for task-critical evidence. Our approach employs a two-stage distillation strategy. First, the Box-Guided Compression (BCM) module establishes stable pixel-to-latent targets through explicit spatial grounding. Next, a Dynamic Autoregressive Compression (DAC) module maps the model’s hidden states into dynamic probes that interrogate the global visual feature map. By distilling the spatial expertise of the BCM teacher into the DAC student, V-Reflection internalizes the ability to localize task-critical evidence. During inference, both modules remain entirely inactive, maintaining a purely end-to-end autoregressive decoding in the latent space with optimal efficiency. Extensive experiments demonstrate the effectiveness of our V-Reflection across six perception-intensive benchmarks, significantly narrowing the fine-grained perception gap. Visualizations confirm that latent reasoning autonomously localizes task-critical visual evidence.
[CV-188] Deep Image Clustering Based on Curriculum Learning and Density Information
【Quick Read】: This paper addresses two weaknesses of existing deep clustering methods on complex image data: the lack of an effective model-learning strategy, which limits robustness and performance, and the error accumulation caused by relying solely on point-to-point distances to cluster centers for assignment. The key is the first introduction of density information into the training strategy for image clustering: a curriculum learning scheme grounded in the density of the input data sets a more reasonable learning pace and improves model stability, while density cores, rather than individual cluster centers, guide cluster assignment, improving the accuracy and robustness of the resulting clusters.
Link: https://arxiv.org/abs/2604.03306
Authors: Haiyang Zheng, Ruilin Zhang, Hongpeng Wang
Affiliations: Harbin Institute of Technology, Shenzhen; Peng Cheng Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Image clustering is one of the crucial techniques in multimedia analytics and knowledge discovery. Recently, the Deep clustering method (DC), characterized by its ability to perform feature learning and cluster assignment jointly, surpasses the performance of traditional ones on image data. However, existing methods rarely consider the role of model learning strategies in improving the robustness and performance of clustering complex image data. Furthermore, most approaches rely solely on point-to-point distances to cluster centers for partitioning the latent representations, resulting in error accumulation throughout the iterative process. In this paper, we propose a robust image clustering method (IDCL) which, to our knowledge for the first time, introduces a model training strategy using density information into image clustering. Specifically, we design a curriculum learning scheme grounded in the density information of input data, with a more reasonable learning pace. Moreover, we employ the density core rather than the individual cluster center to guide the cluster assignment. Finally, extensive comparisons with state-of-the-art clustering approaches on benchmark datasets demonstrate the superiority of the proposed method, including robustness, rapid convergence, and flexibility in terms of data scale, number of clusters, and image context.
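The idea of a density-driven curriculum can be sketched with a k-nearest-neighbour density estimate that orders samples from dense (easy, near a density core) to sparse (hard, likely boundary or noise). The estimator below is a common stand-in; IDCL's actual density computation and pacing schedule may differ.

```python
import math

def knn_density(points, k=2):
    """Per-point density as the inverse mean distance to the k nearest
    neighbours (higher value = denser neighbourhood)."""
    densities = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        densities.append(1.0 / (sum(dists[:k]) / k + 1e-12))
    return densities

def curriculum_order(points, k=2):
    """Schedule sample indices from densest (presented first) to
    sparsest (presented last), a simple density-based learning pace."""
    dens = knn_density(points, k)
    return sorted(range(len(points)), key=lambda i: -dens[i])
```

Training on dense samples first exposes the model to prototypical examples before the ambiguous low-density ones, which is the pacing intuition behind such a curriculum.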
[CV-189] HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis
【速读】:该论文旨在解决当前手-物体交互(Hand-Object Interaction, HOI)视频合成方法中依赖二维控制信号导致空间表达能力不足、难以有效利用三维条件数据的问题。解决方案的关键在于提出HVG-3D框架,其核心创新是基于扩散模型并引入3D ControlNet,通过显式编码来自3D输入的几何与运动线索,实现生成过程中的三维推理;同时设计混合流水线以灵活构建输入和条件信号,从而在训练和推理阶段均实现精确的空间与时间控制,显著提升视频的时空一致性与可控性。
链接: https://arxiv.org/abs/2604.03305
作者: Mingjin Chen,Junhao Chen,Zhaoxin Fan,Yujian Lee,Zichen Dang,Lili Wang,Yawen Cui,Lap-Pui Chau,Yi Wang
机构: The Hong Kong Polytechnic University (香港理工大学); Beihang University (北京航空航天大学); Tsinghua University (清华大学); Beijing Normal-Hong Kong Baptist University (北京师范大学-香港浸会大学); State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University (虚拟现实技术与系统全国重点实验室,北京航空航天大学计算机科学与工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Recent methods have made notable progress in the visual quality of hand-object interaction video synthesis. However, most approaches rely on 2D control signals that lack spatial expressiveness and limit the utilization of synthetic 3D conditional data. To address these limitations, we propose HVG-3D, a unified framework for 3D-aware hand-object interaction (HOI) video synthesis conditioned on explicit 3D representations. Specifically, HVG-3D is designed with two core components: (i) a 3D-aware HOI video generation diffusion architecture, augmented with a 3D ControlNet, that encodes geometric and motion cues from 3D inputs for explicit 3D reasoning during video synthesis; and (ii) a hybrid pipeline for constructing input and condition signals, enabling flexible and precise control during both training and inference. During inference, given a single real image and a 3D control signal from either simulation or real data, HVG-3D generates high-fidelity, temporally consistent videos with precise spatial and temporal control. Experiments on the TASTE-Rob dataset demonstrate that HVG-3D achieves state-of-the-art spatial fidelity, temporal coherence, and controllability, while enabling effective utilization of both real and simulated data.
[CV-190] Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在理解连续体物体动力学方面的显著不足,尤其是在直观物理推理(intuitive physics understanding)这一基础环节上的缺陷。为精准评估该能力,作者提出了两项基准任务:下一帧选择(Next Frame Selection, NFS)和时间一致性验证(Temporal Coherence Verification, TCV),实验表明即使最先进的MLLMs在这些任务上表现不佳。解决方案的关键在于提出场景动态场(Scene Dynamic Field, SDF),该方法通过在多任务微调框架中引入物理模拟器,以低成本方式显著提升模型对流体等物理现象的理解能力,在未见过的物理领域也展现出良好泛化性能,最高可实现20.7%的性能提升。
链接: https://arxiv.org/abs/2604.03302
作者: Nanxi Li,Xiang Wang,Yuanjie Chen,Haode Zhang,Hong Li,Yong-Lu Li
机构: Shanghai Jiao Tong University (上海交通大学); Tianjin University (天津大学); Shanghai Innovation Institute (上海创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in image and video understanding, their ability to comprehend the physical world has become an increasingly important research focus. Despite their improvements, current MLLMs struggle significantly with high-level physics reasoning. In this work, we investigate the first step of physical reasoning, i.e., intuitive physics understanding, revealing substantial limitations in understanding the dynamics of continuum objects. To isolate and evaluate this specific capability, we introduce two fundamental benchmark tasks: Next Frame Selection (NFS) and Temporal Coherence Verification (TCV). Our experiments demonstrate that even state-of-the-art MLLMs perform poorly on these foundational tasks. To address this limitation, we propose Scene Dynamic Field (SDF), a concise approach that leverages physics simulators within a multi-task fine-tuning framework. SDF substantially improves performance, achieving up to 20.7% gains on fluid tasks while showing strong generalization to unseen physical domains. This work not only highlights a critical gap in current MLLMs but also presents a promising cost-efficient approach for developing more physically grounded MLLMs. Our code and data are available at this https URL.
[CV-191] Embedding-Only Uplink for Onboard Retrieval Under Shift in Remote Sensing ICLR2026
【速读】:该论文旨在解决遥感图像处理中的下行链路瓶颈问题,即在带宽受限条件下如何高效地将卫星捕获的原始像素数据传输至地面站。传统方法依赖于上传完整图像,但在资源受限场景下不可行。为此,作者提出一种仅上传紧凑嵌入(embedding)加元数据的轻量化方案,由星载系统基于向量搜索进行灾情优先级排序。其关键创新在于:无论是否存在显式的遥感数据分布偏移(如跨时间、跨事件、跨云层覆盖或跨城市区域),只要嵌入被成功上行,即可在不增加额外通信成本的前提下,根据任务需求灵活选择最优决策头(decision head)——例如,kNN检索在云分类任务中表现更优,而类别中心点则在时序变化检测中显著优于检索方法。这表明嵌入-only 上行架构是实现高效、任务自适应遥感智能的核心基础。
链接: https://arxiv.org/abs/2604.03301
作者: Sangcheol Sim
机构: Telepix
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at the Machine Learning for Remote Sensing (ML4RS) Workshop, ICLR 2026
Abstract:Downlink bottlenecks motivate onboard systems that prioritize hazards without transmitting raw pixels. We study a strict setting where a ground station uplinks only compact embeddings plus metadata, and an onboard system performs vector search to triage new captures. We ask whether this embedding-only pipeline remains useful under explicit remote-sensing shift: cross-time (pre/post-event), cross-event/location (different disasters), cross-site cloud (15 geographic sites), and cross-city AOI holdout (buildings). Using OlmoEarth embeddings on a scaled public multi-task benchmark (27 Sentinel-2 L2A scenes, 15 cloud sites, 5 SpaceNet-2 AOIs; 10 seeds), we find that all effective methods rely on the same uplinked embeddings, but the optimal decision head is task-dependent: kNN retrieval is significantly superior for cloud classification (0.92 vs. centroid 0.91; p<0.01, Wilcoxon), while class centroids dominate temporal change detection (0.85 vs. retrieval 0.48; p<0.01). These results show that embedding-only uplink is the key enabler: once embeddings are onboard, the system can select the best head per task at no additional uplink cost, with all telemetry under 1 KB per query.
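论文中“同一份上行嵌入、按任务选择不同决策头”的做法可以用如下示意代码说明(非论文官方实现,`knn_head`、`centroid_head` 等命名与实现细节均为笔者假设):kNN 头对最近的 k 个存储嵌入做多数投票,类别中心头取最近的各类均值嵌入,二者共享同一个嵌入库,因此切换决策头不产生额外上行开销:

```python
import numpy as np

def knn_head(query, bank_emb, bank_labels, k=5):
    """Majority vote over the k nearest stored embeddings."""
    d = np.linalg.norm(bank_emb - query, axis=1)
    votes = bank_labels[np.argsort(d)[:k]]
    vals, counts = np.unique(votes, return_counts=True)
    return vals[np.argmax(counts)]

def centroid_head(query, bank_emb, bank_labels):
    """Assign the label of the nearest per-class mean embedding."""
    classes = np.unique(bank_labels)
    cents = np.stack([bank_emb[bank_labels == c].mean(axis=0)
                      for c in classes])
    return classes[np.argmin(np.linalg.norm(cents - query, axis=1))]
```

按摘要的结论,云分类更适合 kNN 头,而时序变化检测更适合中心点头;上面的代码仅演示两种头在同一嵌入空间上的工作方式。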
[CV-192] MoViD: View-Invariant 3D Human Pose Estimation via Motion-View Disentanglement
【速读】:该论文旨在解决3D人体姿态估计在实际应用中因视角变化导致的泛化能力差、训练数据需求量大以及推理延迟高的问题。解决方案的关键在于提出一种视角不变的框架MoViD,其核心思想是将视角信息从运动特征中解耦出来:通过引入一个视点估测器来建模关键关节关系以预测视角信息,并设计正交投影模块实现运动与视角特征的分离,再结合物理引导的对比对齐策略增强跨视角一致性。这一机制显著提升了模型在未见视角下的鲁棒性与效率,同时支持在边缘设备上实现实时推理(15 FPS)。
链接: https://arxiv.org/abs/2604.03299
作者: Yejia Liu,Hengle Jiang,Haoxian Liu,Runxi Huang,Xiaomin Ouyang
机构: Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:3D human pose estimation is a key enabling technology for applications such as healthcare monitoring, human-robot collaboration, and immersive gaming, but real-world deployment remains challenged by viewpoint variations. Existing methods struggle to generalize to unseen camera viewpoints, require large amounts of training data, and suffer from high inference latency. We propose MoViD, a viewpoint-invariant 3D human pose estimation framework that disentangles viewpoint information from motion features. The key idea is to extract viewpoint information from intermediate pose features and leverage it to enhance both the robustness and efficiency of pose estimation. MoViD introduces a view estimator that models key joint relationships to predict viewpoint information, and an orthogonal projection module to disentangle motion and view features, further enhanced through physics-grounded contrastive alignment across views. For real-time edge deployment, MoViD employs a frame-by-frame inference pipeline with a view-aware strategy that adaptively activates flip refinement based on the estimated viewpoint. Evaluations on nine public datasets and newly collected multiview UAV and gait analysis datasets show that MoViD reduces pose estimation error by over 24.2% compared to state-of-the-art methods, maintains robust performance under severe occlusions with 60% less training data, and achieves real-time inference at 15 FPS on NVIDIA edge devices.
[CV-193] XAttnRes: Cross-Stage Attention Residuals for Medical Image Segmentation
【速读】:该论文借鉴大语言模型(Large Language Models, LLMs)中注意力残差(Attention Residuals)的思想,旨在解决分割网络中传统固定残差连接(fixed residual connections)在特征聚合上的局限性,即无法根据任务需求动态调整跨阶段信息流的问题。其核心挑战在于如何实现跨阶段(cross-stage)的可学习、选择性特征聚合,以提升分割网络的性能。解决方案的关键是提出Cross-Stage Attention Residuals(XAttnRes),通过维护一个全局特征历史池(global feature history pool),汇聚编码器和解码器各阶段的输出,并利用轻量级伪查询注意力机制(lightweight pseudo-query attention)实现对先前表示的选择性聚合;同时引入空间对齐与通道投影步骤,以极低的计算开销处理多尺度特征,从而在不依赖预设跳跃连接的情况下显著改善模型性能。
链接: https://arxiv.org/abs/2604.03297
作者: Xinyu Liu,Qing Xu,Zhen Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:In the field of Large Language Models (LLMs), Attention Residuals have recently demonstrated that learned, selective aggregation over all preceding layer outputs can outperform fixed residual connections. We propose Cross-Stage Attention Residuals (XAttnRes), a mechanism that maintains a global feature history pool accumulating both encoder and decoder stage outputs. Through lightweight pseudo-query attention, each stage selectively aggregates from all preceding representations. To bridge the gap between the same-dimensional Transformer layers in LLMs and the multi-scale encoder-decoder stages in segmentation networks, XAttnRes introduces spatial alignment and channel projection steps that handle cross-resolution features with negligible overhead. When added to existing segmentation networks, XAttnRes consistently improves performance across four datasets and three imaging modalities. We further observe that XAttnRes alone, even without skip connections, achieves performance on par with the baseline, suggesting that learned aggregation can recover the inter-stage information flow traditionally provided by predetermined connections.
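XAttnRes 中“伪查询注意力在全局特征历史池上做选择性聚合”可以用如下极简示意代码理解(非论文官方实现:这里用全局平均池化得到各阶段的键,`pseudo_query` 代表一个可学习向量,且假设各阶段特征已通过空间对齐与通道投影变为相同形状,这些细节均为笔者假设):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def xattn_residual(history, pseudo_query):
    """Aggregate a list of same-shaped stage feature maps (C, H, W) by
    attending a learned pseudo-query over their channel-wise summaries."""
    # one key per stage: global-average-pooled channel descriptor (C,)
    keys = np.stack([h.mean(axis=(1, 2)) for h in history])    # (S, C)
    scores = keys @ pseudo_query / np.sqrt(len(pseudo_query))  # (S,)
    w = softmax(scores)
    # learned, selective aggregation replaces a fixed residual sum
    return sum(wi * h for wi, h in zip(w, history))
```

与固定残差连接把各阶段输出等权相加不同,这里的聚合权重由注意力得分决定,可随任务学习。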
[CV-194] 3D-IDE: 3D Implicit Depth Emergent CVPR2026
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在室内场景理解中,现有方法在二维(2D)与三维(3D)表征融合时面临的性能瓶颈问题,尤其是依赖显式深度编码或外部3D基础模型导致的推理延迟高、结构耦合性强等局限。其解决方案的关键在于提出“3D隐式深度涌现”(3D-Implicit Depth Emergence)框架,核心思想是基于隐式几何涌现原理(Implicit Geometric Emergence Principle),通过引入细粒度几何验证器和全局表征约束机制构建信息瓶颈,强制模型最大化视觉特征与3D结构之间的互信息,从而在统一视觉表征中自然涌现出3D感知能力。该方法无需显式深度预测或姿态估计,在推理阶段实现零延迟开销,且显著优于当前最优方法(SOTA),同时在多个3D场景理解任务上保持优异性能。
链接: https://arxiv.org/abs/2604.03296
作者: Chushan Zhang,Ruihan Lu,Jinguang Tong,Yikai Wang,Hongdong Li
机构: Australian National University (澳大利亚国立大学); The University of Queensland (昆士兰大学); FreiNexus; Beijing Normal University (北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2026 accepted. Project page: this https URL
Abstract:Leveraging 3D information within Multimodal Large Language Models (MLLMs) has recently shown significant advantages for indoor scene understanding. However, existing methods, including those using explicit ground-truth 3D positional encoding and those grafting external 3D foundation models for implicit geometry, struggle with the trade-off in 2D-3D representation fusion, leading to suboptimal deployment. To this end, we propose 3D-Implicit Depth Emergence, a method that reframes 3D perception as an emergent property derived from geometric self-supervision rather than explicit encoding. Our core insight is the Implicit Geometric Emergence Principle: by strategically leveraging privileged geometric supervision through mechanisms like a fine-grained geometry validator and global representation constraints, we construct an information bottleneck. This bottleneck forces the model to maximize the mutual information between visual features and 3D structures, allowing 3D awareness to emerge naturally within a unified visual representation. Unlike existing approaches, our method enables 3D perception to emerge implicitly, disentangling features in dense regions and, crucially, eliminating depth and pose dependencies during inference with zero latency overhead. This paradigm shift from external grafting to implicit emergence represents a fundamental rethinking of 3D knowledge integration in visual-language models. Extensive experiments demonstrate that our method surpasses SOTA on multiple 3D scene understanding benchmarks. Our approach achieves a 55% reduction in inference latency while maintaining strong performance across diverse downstream tasks, underscoring the effectiveness of meticulously designed auxiliary objectives for dependency-free 3D understanding. Source code can be found at this http URL.
[CV-195] Event-Driven Neuromorphic Vision Enables Energy-Efficient Visual Place Recognition
【速读】:该论文旨在解决自主机器人在动态真实环境中实现可靠视觉位置识别(Visual Place Recognition, VPR)的问题,传统深度神经网络因计算和能耗过高而难以满足实时部署需求。解决方案的关键在于提出SpikeVPR,一种受哺乳动物导航系统启发的类脑神经形态方法,结合事件相机与脉冲神经网络(Spiking Neural Networks, SNNs),通过少量样本生成紧凑且具备不变性的位置描述符,并采用替代梯度学习(surrogate gradient learning)进行端到端训练,同时引入EventDilation这一新颖的数据增强策略以提升对速度和时间变化的鲁棒性。该方案在Brisbane-Event-VPR和NSAVP两个挑战性基准上达到与先进深度网络相当的性能,但参数量减少50倍,能耗分别降低30倍和250倍,显著提升了在移动和神经形态平台上的实时部署能力。
链接: https://arxiv.org/abs/2604.03277
作者: Geoffroy Keime,Nicolas Cuperlier,Benoit R. Cottereau
机构: IPAL, CNRS IRL 2955, Singapore; Univ Toulouse, CNRS, CerCo UMR 5549, Toulouse, France; Laboratoire ETIS UMR 8051, CY Cergy-Paris Université, ENSEA, CNRS, Cergy, France
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 40 pages single column, v1
Abstract:Reliable visual place recognition (VPR) under dynamic real-world conditions is critical for autonomous robots, yet conventional deep networks remain limited by high computational and energy demands. Inspired by the mammalian navigation system, we introduce SpikeVPR, a bio-inspired and neuromorphic approach combining event-based cameras with spiking neural networks (SNNs) to generate compact, invariant place descriptors from few exemplars, achieving robust recognition under extreme changes in illumination, viewpoint, and appearance. SpikeVPR is trained end-to-end using surrogate gradient learning and incorporates EventDilation, a novel augmentation strategy enhancing robustness to speed and temporal variations. Evaluated on two challenging benchmarks (Brisbane-Event-VPR and NSAVP), SpikeVPR achieves performance comparable to state-of-the-art deep networks while using 50 times fewer parameters and consuming 30 and 250 times less energy, enabling real-time deployment on mobile and neuromorphic platforms. These results demonstrate that spike-based coding offers an efficient pathway toward robust VPR in complex, changing environments.
[CV-196] A reconfigurable smart camera implementation for jet flames characterization based on an optimized segmentation model
【速读】:该论文旨在解决工业场景中火灾早期分割与特征识别缺乏实时解决方案的问题,特别是在喷射火焰(jet flame)检测方面的延迟和效率瓶颈。其关键解决方案是基于SoC FPGA(System-on-Chip Field-Programmable Gate Array)平台实现一个完整的边缘端处理流水线,通过优化UNet语义分割模型以适配FPGA的可重构逻辑资源,从而在设备本地完成实时图像处理任务。具体而言,研究团队利用Xilinx Vitis工具链将原始模型参数从750万压缩至约5.9万(减少125倍),显著降低计算复杂度,并结合多线程与批归一化等技术进一步优化性能,最终实现30帧每秒(FPS)的处理速率,同时保持Dice Score等评估指标的准确性,有效缩短了整体系统延迟。
链接: https://arxiv.org/abs/2604.03267
作者: Gerardo Valente Vazquez-Garcia,Carmina Perez Guerrero,Eduardo Garduño,Miguel Gonzalez-Mendoza,Adriana Palacios,Gerardo Rodriguez-Hernandez,Vahid Foroughi,Alba Àgueda,Elsa Pastor,Gilberto Ochoa-Ruiz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Paper submitted to EAAI (Elsevier) for peer review
Abstract:In this work we present a novel framework for fire safety management in industrial settings through the implementation of a smart camera platform for jet flames characterization. The approach seeks to alleviate the lack of real-time solutions for industrial early fire segmentation and characterization. As a case study, we demonstrate how a SoC FPGA, running optimized Artificial Intelligence (AI) models, can be leveraged to implement a full edge processing pipeline for jet flames analysis. In this paper we extend previous work on computer-vision jet fire segmentation by creating a novel experimental set-up and system implementation for addressing this issue, which can be replicated in other fire safety applications. The proposed platform is designed to carry out image processing tasks in real-time and on device, reducing video processing overheads, and thus the overall latency. This is achieved by optimizing a UNet segmentation model to make it amenable to SoC FPGA implementation; the optimized model can then be efficiently mapped onto the SoC reconfigurable logic for massively parallel execution. For our experiments, we have chosen the Ultra96 platform, as it also provides the means for implementing full-fledged intelligent systems using the SoC peripherals, as well as other Operating System (OS) capabilities (i.e., multi-threading) for systems management. For optimizing the model we made use of the Vitis (Xilinx) framework, which enabled us to optimize the full-precision model from 7.5 million parameters to 59,095 parameters (125x fewer), which translated into a 2.9x reduction in processing latency. Further optimization (multi-threading and batch normalization) led to an improvement of 7.5x in terms of latency, yielding a performance of 30 Frames Per Second (FPS) without sacrificing accuracy in terms of the evaluated metrics (Dice Score).
[CV-197] SafeScreen: A Safety-First Screening Framework for Personalized Video Retrieval for Vulnerable Users
【速读】:该论文旨在解决开放域视频平台中因以参与度为导向的推荐算法可能使脆弱用户(如儿童或痴呆症患者)暴露于不适当甚至有害内容的问题,尤其在需要满足个性化安全约束的照护场景中。解决方案的关键在于提出SafeScreen框架,其核心是将安全性作为前置条件而非次优考量,通过三个关键组件实现:(i) 基于用户画像提取个性化的安全标准,(ii) 利用自适应问题生成与多模态VideoRAG分析进行证据驱动的安全评估,(iii) 采用大语言模型(LLM)综合判断内容的安全性、适宜性和相关性,从而在无需预先标注安全标签的前提下,实现实时、可解释的视频筛选,确保内容在展示前符合个体化安全要求。
链接: https://arxiv.org/abs/2604.03264
作者: Wenzheng Zhao,Madhava Kalyan Gadiputi,Fengpei Yuan
机构: Worcester Polytechnic Institute (伍斯特理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 11 pages, 3 figures, 7 tables. Under review for ACM ICMI 2026
Abstract:Open-domain video platforms offer rich, personalized content that could support health, caregiving, and educational applications, but their engagement-optimized recommendation algorithms can expose vulnerable users to inappropriate or harmful material. These risks are especially acute in child-directed and care settings (e.g., dementia care), where content must satisfy individualized safety constraints before being shown. We introduce SafeScreen, a safety-first video screening framework that retrieves and presents personalized video while enforcing individualized safety constraints. Rather than ranking videos by relevance or popularity, SafeScreen treats safety as a prerequisite and performs sequential approval or rejection of candidate videos through an automated pipeline. SafeScreen integrates three key components: (i) profile-driven extraction of individualized safety criteria, (ii) evidence-grounded assessments via adaptive question generation and multimodal VideoRAG analysis, and (iii) LLM-based decision-making that verifies safety, appropriateness, and relevance before content exposure. This design enables explainable, real-time screening of uncurated video repositories without relying on precomputed safety labels. We evaluate SafeScreen in a dementia-care reminiscence case study using 30 synthetic patient profiles and 90 test queries. Results demonstrate that SafeScreen prioritizes safety over engagement, diverging from YouTube’s engagement-optimized rankings in 80-93% of cases, while maintaining high levels of safety coverage, sensibleness, and groundedness, as validated by both LLM-based evaluation and domain experts.
[CV-198] Unsharp Measurement with Adaptive Gaussian POVMs for Quantum-Inspired Image Processing
【速读】:该论文旨在解决传统图像处理方法在灰度图像概率变换中难以兼顾结构信息保留与灵活性的问题,尤其是现有方法多依赖于分割或阈值处理,缺乏对像素强度分布的直接建模能力。解决方案的关键在于提出一种基于量子测量理论的框架,将图像强度值嵌入有限维希尔伯特空间(Hilbert space),并利用高斯模型构建数据自适应的正算子值测度(Positive Operator-Valued Measures, POVMs),从而实现对像素强度可观测量的非锐利(unsharp)测量;通过引入非线性锐化参数 γ 控制测量局部化程度,实现从非锐利测量到投影测量的连续过渡,体现概率平滑与结构定位之间的权衡,同时引入参数 k(高斯中心数)调节变换过程中的图像分辨率,实验表明该方法能有效实现数据自适应的图像变换并保持结构完整性。
链接: https://arxiv.org/abs/2604.04685
作者: Debashis Saikia,Bikash K. Behera,Mayukha Pal,Prasanta K. Panigrahi
机构: Indian Institute of Science Education and Research, Thiruvananthapuram, India; Bikash’s Quantum (OPC) Pvt. Ltd., Mohanpur, WB, 741246 India; ABB Ability Innovation Center, Asea Brown Boveri Company, Hyderabad 500084, India; Siksha O Anusandhan University, Bhubaneswar, India; Indian Institute of Science Education and Research (IISER), Kolkata, Mohanpur 741246, West Bengal, India
类目: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 17 figures
Abstract:We propose a quantum measurement-based framework for probabilistic transformation of grayscale images using adaptive positive operator-valued measures (POVMs). In contrast to existing approaches that are largely centered around segmentation or thresholding, the transformation is formulated here as a measurement-induced process acting directly on pixel intensities. The intensity values are embedded in a finite-dimensional Hilbert space, which allows the construction of data-adaptive measurement operators derived from Gaussian models of the image histogram. These operators naturally define an unsharp measurement of the intensity observable, with the reconstructed image obtained through expectation values of the measurement outcomes. To control the degree of measurement localization, we introduce a nonlinear sharpening transformation with a sharpening parameter γ that induces a continuous transition from unsharp measurements to projective measurements. This transition reflects an inherent trade-off between probabilistic smoothing and localization of intensity structures. In addition to the nonlinear sharpening parameter, we introduce another parameter k (the number of Gaussian centers), which controls the resolution of the image during the transformation. Experimental results on standard benchmark images show that the proposed method yields effective data-adaptive transformations while preserving structural information.
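摘要中描述的高斯 POVM 非锐利测量可用如下示意代码说明其基本计算流程(非论文官方实现:高斯中心与 σ 的取值方式均为笔者假设):每个像素强度按 K 个高斯元的归一化权重(逐像素求和为 1,对应 POVM 的完备性)计算测量结果的期望值,γ 越大越接近投影测量,即像素“吸附”到最近的高斯中心:

```python
import numpy as np

def gaussian_povm_transform(img, centers, sigma=20.0, gamma=1.0):
    """Map each pixel intensity to the expectation value of its outcome
    under K Gaussian POVM elements; larger gamma sharpens the unsharp
    measurement toward a projective (nearest-center) one."""
    i = img.astype(float)[..., None]                 # (..., 1)
    mu = np.asarray(centers, dtype=float)            # (K,) Gaussian centers
    w = np.exp(-((i - mu) ** 2) / (2 * sigma ** 2))  # unnormalised elements
    w = w ** gamma                                   # nonlinear sharpening
    w /= w.sum(axis=-1, keepdims=True)               # completeness: sum_k = 1
    return (w * mu).sum(axis=-1)                     # expectation value
```

γ=1 时输出在各中心之间概率平滑;γ 很大时每个像素近似取最近中心的值,对应从非锐利测量到投影测量的连续过渡。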
[CV-199] M-BSN: Triangular-Masked Blind-Spot Network for Real-World Self-Supervised Image Denoising CVPR2026
【速读】:该论文旨在解决自监督图像去噪中因假设像素级噪声独立性而带来的性能瓶颈问题,尤其是在真实sRGB图像中,由于相机图像信号处理(ISP)流水线导致的空间相关噪声破坏了这一假设。现有方法通过下采样来削弱噪声相关性,但会改变噪声统计特性并限制上下文信息的利用。其解决方案的关键在于提出三角掩码盲区网络(Triangular-Masked Blind-Spot Network, TM-BSN),通过设计一种仅保留卷积核上三角区域的三角掩码卷积,使盲区呈现钻石形状,从而精确匹配由去马赛克过程引起的空间相关噪声结构;该设计在原始分辨率下排除相关像素的同时充分利用未相关上下文,无需下采样或后处理,显著提升了去噪准确性与效率。
链接: https://arxiv.org/abs/2604.04484
作者: Junyoung Park,Youngjin Oh,Nam Ik Cho
机构: Seoul National University (首尔国立大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Blind-spot networks (BSNs) enable self-supervised image denoising by preventing access to the target pixel, allowing clean signal estimation without ground-truth supervision. However, this approach assumes pixel-wise noise independence, which is violated in real-world sRGB images due to spatially correlated noise from the camera’s image signal processing (ISP) pipeline. While several methods employ downsampling to decorrelate noise, they alter noise statistics and limit the network’s ability to utilize full contextual information. In this paper, we propose the Triangular-Masked Blind-Spot Network (TM-BSN), a novel blind-spot architecture that accurately models the spatial correlation of real sRGB noise. This correlation originates from demosaicing, where each pixel is reconstructed from neighboring samples with spatially decaying weights, resulting in a diamond-shaped pattern. To align the receptive field with this geometry, we introduce a triangular-masked convolution that restricts the kernel to its upper-triangular region, creating a diamond-shaped blind spot at the original resolution. This design excludes correlated pixels while fully leveraging uncorrelated context, eliminating the need for downsampling or post-processing. Furthermore, we use knowledge distillation to transfer complementary knowledge from multiple blind-spot predictions into a lightweight U-Net, improving both accuracy and efficiency. Extensive experiments on real-world benchmarks demonstrate that our method achieves state-of-the-art performance, significantly outperforming existing self-supervised approaches. Our code is available at this https URL.
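摘要中“将卷积核限制在其上三角区域”的一种可能读法可以用如下示意代码说明(非论文官方实现,掩码的具体几何定义与 `offset` 取值均为笔者依据摘要的推测):保留严格上三角的核权重会同时去掉中心抽头,从而在目标像素处形成盲区:

```python
import numpy as np

def triangular_mask(k, offset=1):
    """Binary k x k mask keeping only the (strictly) upper-triangular
    kernel entries; offset=1 removes the diagonal, hence the centre tap."""
    return np.triu(np.ones((k, k)), k=offset)

def masked_conv2d_weight(weight, offset=1):
    """Apply the triangular mask to a (C_out, C_in, k, k) conv weight,
    sketching a triangular-masked convolution."""
    k = weight.shape[-1]
    return weight * triangular_mask(k, offset)
```

论文进一步论证该受限感受野在原始分辨率下即可排除空间相关的邻近像素,此处仅演示掩码本身的构造。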
[CV-200] NAIMA: Semantics Aware RGB Guided Depth Super-Resolution
【速读】:该论文旨在解决引导式深度超分辨率(Guided Depth Super-Resolution, GDSR)中因RGB图像中的误导性颜色和纹理线索导致深度不连续性边界模糊或产生伪影的问题。解决方案的关键在于引入由预训练视觉Transformer(Vision Transformer)token嵌入生成的全局上下文语义先验,并提出一种Guided Token Attention (GTA)模块,通过跨注意力机制迭代对齐RGB空间特征与深度编码,从而选择性注入来自不同层的全局语义信息;同时设计了融合DINOv2与GTA块的神经注意力隐式多token对齐架构(Neural Attention for Implicit Multi-token Alignment, NIMA),实现语义感知的GDSR,显著提升了在多种缩放因子和数据集上的性能。
链接: https://arxiv.org/abs/2604.04407
作者: Tayyab Nasir,Daochang Liu,Ajmal Mian
机构: The University of Western Australia (西澳大利亚大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:
Abstract:Guided depth super-resolution (GDSR) is a multi-modal approach for depth map super-resolution that relies on a low-resolution depth map and a high-resolution RGB image to restore finer structural details. However, the misleading color and texture cues indicating depth discontinuities in RGB images often lead to artifacts and blurred depth boundaries in the generated depth map. We propose a solution that introduces global contextual semantic priors, generated from pretrained vision transformer token embeddings. Our approach to distilling semantic knowledge from pretrained token embeddings is motivated by their demonstrated effectiveness in related monocular depth estimation tasks. We introduce a Guided Token Attention (GTA) module, which iteratively aligns encoded RGB spatial features with depth encodings, using cross-attention for selectively injecting global semantic context extracted from different layers of a pretrained vision transformer. Additionally, we present an architecture called Neural Attention for Implicit Multi-token Alignment (NAIMA), which integrates DINOv2 with GTA blocks for a semantics-aware GDSR. Our proposed architecture, with its ability to distill semantic knowledge, achieves significant improvements over existing methods across multiple scaling factors and datasets.
[CV-201] BAAI Cardiac Agent : An intelligent multimodal agent for automated reasoning and diagnosis of cardiovascular diseases from cardiac magnetic resonance imaging
【速读】:该论文旨在解决心脏磁共振成像(Cardiac Magnetic Resonance, CMR)在临床实践中因多序列、多相位及定量分析复杂而难以高效解读的问题,从而限制了其广泛应用。解决方案的关键在于提出一个名为BAAI Cardiac Agent的多模态智能系统,该系统通过动态编排多个心脏专家模型,实现从心脏结构分割、功能量化、组织特征分析到疾病诊断的端到端自动化处理,并生成结构化临床报告。该框架在两个医院的2413例患者数据上验证,表现出优异的诊断性能(内部AUC > 0.93,外部AUC = 0.81),且在左心室功能指标估计中与临床报告高度一致(Pearson相关系数均>0.90),显著优于现有先进模型,体现了其在复杂医学影像工作流中的准确性与效率潜力。
链接: https://arxiv.org/abs/2604.04078
作者: Taiping Qu,Hongkai Zhang,Lantian Zhang,Can Zhao,Nan Zhang,Hui Wang,Zhen Zhou,Mingye Zou,Kairui Bo,Pengfei Zhao,Xingxing Jin,Zixian Su,Kun Jiang,Huan Liu,Yu Du,Maozhou Wang,Ruifang Yan,Zhongyuan Wang,Tiejun Huang,Lei Xu,Henggui Zhang
机构: Beijing Academy of Artificial Intelligence (北京人工智能研究院); Beijing Anzhen Hospital (北京安贞医院); Capital Medical University (首都医科大学); Henan Medical University (河南医科大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cardiac magnetic resonance (CMR) is a cornerstone for diagnosing cardiovascular disease. However, it remains underutilized due to complex, time-consuming interpretation across multi-sequences, phases, quantitative measures that heavily reliant on specialized expertise. Here, we present BAAI Cardiac Agent, a multimodal intelligent system designed for end-to-end CMR interpretation. The agent integrates specialized cardiac expert models to perform automated segmentation of cardiac structures, functional quantification, tissue characterization and disease diagnosis, and generates structured clinical reports within a unified workflow. Evaluated on CMR datasets from two hospitals (2413 patients) spanning 7-types of major cardiovascular diseases, the agent achieved an area under the receiver-operating-characteristic curve exceeding 0.93 internally and 0.81 externally. In the task of estimating left ventricular function indices, the results generated by this system for core parameters such as ejection fraction, stroke volume, and left ventricular mass are highly consistent with clinical reports, with Pearson correlation coefficients all exceeding 0.90. The agent outperformed state-of-the-art models in segmentation and diagnostic tasks, and generated clinical reports showing high concordance with expert radiologists (six readers across three experience levels). By dynamically orchestrating expert models for coordinated multimodal analysis, this agent framework enables accurate, efficient CMR interpretation and highlights its potentials for complex clinical imaging workflows. Code is available at this https URL.
[CV-202] Cost-Efficient Multi-Scale Fovea for Semantic-Based Visual Search Attention IJCNN
【速读】:该论文旨在解决深度目标检测模型在复杂视觉场景中处理大尺寸输入时带来的计算成本过高问题,这限制了人工注意力系统在生物合理性与实时部署方面的应用。解决方案的关键在于提出一种新型多尺度视网膜聚焦(Multi-Scale Fovea)模块,并将其集成到语义驱动的贝叶斯注意力(Semantic-based Bayesian Attention, SemBA)框架中,通过模拟人类视觉系统的中心-周边感知特性(即中心高分辨率、周边逐渐模糊),在保持检测精度的同时显著降低计算开销,从而提升模型的生物合理性与实用性。
链接: https://arxiv.org/abs/2604.03836
作者: João Luzio,Alexandre Bernardino,Plinio Moreno
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: The International Joint Conference on Neural Networks (IJCNN) 2026
Abstract:Semantics are one of the primary sources of top-down preattentive information. Modern deep object detectors excel at extracting such valuable semantic cues from complex visual scenes. However, the size of the visual input to be processed by these detectors can become a bottleneck, particularly in terms of time costs, affecting an artificial attention system’s biological plausibility and real-time deployability. Inspired by classical exponential density roll-off topologies, we apply a new artificial foveation module to our novel attention prediction pipeline: the Semantic-based Bayesian Attention (SemBA) framework. We aim at reducing detection-related computational costs without compromising visual task accuracy, thereby making SemBA more biologically plausible. The proposed multi-scale pyramidal field-of-view retains maximum acuity at an innermost level, around a focal point, while gradually increasing distortion for outer levels to mimic peripheral uncertainty via downsampling. In this work we evaluate the performance of our novel Multi-Scale Fovea, incorporated into SemBA, on target-present visual search. We also compare it against other artificial foveal systems, and conduct ablation studies with different deep object detection models to assess the impact of the new topology in terms of computational costs. We experimentally demonstrate that including the new Multi-Scale Fovea module effectively reduces inherent processing costs while improving SemBA’s scanpath prediction accuracy. Remarkably, we show that SemBA closely approximates human consistency while retaining the actual human fovea’s proportions.
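摘要中描述的多尺度金字塔视野(内层保持最高视敏度、外层逐级下采样以模拟周边模糊)可以用如下示意代码理解(非论文官方实现,`foveate` 的窗口尺寸倍增方式与块平均下采样均为笔者假设):

```python
import numpy as np

def block_downsample(patch, factor):
    """Average-pool a square patch by an integer factor (peripheral blur)."""
    h, w = patch.shape[0] // factor, patch.shape[1] // factor
    return (patch[:h * factor, :w * factor]
            .reshape(h, factor, w, factor).mean(axis=(1, 3)))

def foveate(img, cy, cx, base=16, levels=3):
    """Nested windows around the focal point (cy, cx): the innermost
    level keeps full acuity, each outer level doubles in extent but is
    downsampled so every level has the same (base x base) resolution."""
    out = []
    for l in range(levels):
        half = base * (2 ** l) // 2
        win = img[cy - half:cy + half, cx - half:cx + half]
        out.append(win if l == 0 else block_downsample(win, 2 ** l))
    return out
```

各层分辨率相同但覆盖范围指数增长,因而送入检测器的总像素量远小于处理全图,这正是降低检测开销的来源。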
[CV-203] UniSurgSAM: A Unified Promptable Model for Reliable Surgical Video Segmentation MICCAI2025
【速读】:该论文旨在解决当前可提示视频目标分割(Promptable Video Object Segmentation, PVOS)方法在手术视频分割中面临的三大挑战:(1)现有方法通常仅支持单一提示模态(如视觉或文本),难以适应外科医生在复杂术式中使用多模态线索(如视觉选择、文本描述或语音指令)动态指定目标的需求;(2)耦合式框架导致目标初始化与跟踪阶段优化相互干扰,影响分割稳定性;(3)当目标缺失时产生幻觉预测,且缺乏累积误差恢复机制,造成掩码漂移(mask drift)。解决方案的关键在于提出UniSurgSAM——一个统一的多模态PVOS模型,其核心创新为解耦两阶段框架(独立优化初始化与跟踪以消除干扰),并引入三项可靠性设计:存在感知解码(presence-aware decoding)抑制目标缺失时的幻觉预测、边界感知长期跟踪(boundary-aware long-term tracking)防止长时间序列中的掩码漂移,以及自适应状态切换(adaptive state transition)实现跨阶段闭环失败恢复。
链接: https://arxiv.org/abs/2604.03645
作者: Haofeng Liu,Ziyue Wang,Alex Y. W. Kong,Guanyi Qin,Yunqiu Xu,Chang Han Low,Mingqi Gao,Lap Yan Lennon Chan,Yueming Jin
机构: National University of Singapore (新加坡国立大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Extended version of MICCAI 2025 paper (ReSurgSAM2). 13 pages, 8 figures, 8 tables
Abstract:Surgical video segmentation is fundamental to computer-assisted surgery. In practice, surgeons need to dynamically specify targets throughout extended procedures, using heterogeneous cues such as visual selections, textual expressions, or audio instructions. However, existing Promptable Video Object Segmentation (PVOS) methods are typically restricted to a single prompt modality and rely on coupled frameworks that cause optimization interference between target initialization and tracking. Moreover, these methods produce hallucinated predictions when the target is absent and suffer from accumulated mask drift without failure recovery. To address these challenges, we present UniSurgSAM, a unified PVOS model enabling reliable surgical video segmentation through visual, textual, or audio prompts. Specifically, UniSurgSAM employs a decoupled two-stage framework that independently optimizes initialization and tracking to resolve the optimization interference. Within this framework, we introduce three key designs for reliability: presence-aware decoding that models target absence to suppress hallucinations; boundary-aware long-term tracking that prevents mask drift over extended sequences; and adaptive state transition that closes the loop between stages for failure recovery. Furthermore, we establish a multi-modal and multi-granular benchmark from four public surgical datasets with precise instance-level masklets. Extensive experiments demonstrate that UniSurgSAM achieves state-of-the-art performance in real time across all prompt modalities and granularities, providing a practical foundation for computer-assisted surgery. Code and datasets will be available at this https URL.
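上述"存在感知解码"(presence-aware decoding)抑制幻觉的思路,可用下面的极简草图示意:先判断目标是否出现在当前帧,缺失时直接输出空掩码。函数与阈值均为本文假设,并非 UniSurgSAM 的真实解码器:

```python
import numpy as np

def presence_aware_decode(mask_logits, presence_logit, thresh=0.0):
    # 存在感知解码示意:presence_logit 判定目标缺失时
    # 输出全零掩码,抑制幻觉预测
    mask = (mask_logits > 0).astype(np.uint8)
    if presence_logit <= thresh:          # 目标缺失
        return np.zeros_like(mask)
    return mask
```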
[CV-204] DRIFT: Deep Restoration ISP Fusion and Tone-mapping
【速读】:该论文旨在解决移动设备上高质量图像生成的挑战,特别是在手持拍摄条件下如何从原始图像数据(raw captures)中高效地重建高分辨率、低噪声且色彩准确的RGB图像,同时控制计算成本。其核心解决方案是提出一个名为DRIFT(Deep Restoration, ISP Fusion, and Tone-mapping)的端到端AI移动端相机处理流水线,关键在于两个模块:一是多帧处理(Multi-Frame Processing, MFP)网络,通过对抗感知损失训练实现多帧对齐、去噪、去马赛克和超分辨率一体化处理;二是新颖的深度学习驱动的色调映射(Tone-mapping, TM)模块,具备可调色调特性、与参考流水线保持色调一致性,并能在移动设备上高效运行高分辨率图像处理。
链接: https://arxiv.org/abs/2604.03402
作者: Soumendu Majee,Joshua Peter Ebenezer,Abhinau K. Venkataramanan,Weidi Liu,Thilo Balke,Zeeshan Nadir,Sreenithy Chandran,Seok-Jun Lee,Hamid Rahim Sheikh
机构: Samsung Research America(三星研究院美国)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Smartphone cameras have gained immense popularity with the adoption of high-resolution and high-dynamic range imaging. As a result, high-performance camera Image Signal Processors (ISPs) are crucial in generating high-quality images for the end user while keeping computational costs low. In this paper, we propose DRIFT (Deep Restoration, ISP Fusion, and Tone-mapping): an efficient AI mobile camera pipeline that generates high quality RGB images from hand-held raw captures. The first stage of DRIFT is a Multi-Frame Processing (MFP) network that is trained using an adversarial perceptual loss to perform multi-frame alignment, denoising, demosaicing, and super-resolution. Then, the output of DRIFT-MFP is processed by a novel deep-learning based tone-mapping (DRIFT-TM) solution that allows for tone tunability, ensures tone-consistency with a reference pipeline, and can be run efficiently for high-resolution images on a mobile device. We show qualitative and quantitative comparisons against state-of-the-art MFP and tone-mapping methods to demonstrate the effectiveness of our approach.
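论文强调 DRIFT-TM 的"色调可调性"(tone tunability)。下面用一条简单的 gamma 型曲线示意这种可调接口的思路——`strength` 是本文假设的调节参数,仅演示接口形态,并非 DRIFT-TM 的真实网络:

```python
import numpy as np

def tone_map(x, strength=2.2):
    # 可调 gamma 型色调曲线示意:strength 越大,暗部提升越明显
    x = np.clip(x, 0.0, 1.0)
    return x ** (1.0 / strength)
```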
[CV-205] NeuralLVC: Neural Lossless Video Compression via Masked Diffusion with Temporal Conditioning
【速读】:该论文旨在解决神经网络在视频领域中损失less压缩(lossless video compression)研究相对匮乏的问题。当前虽然神经网络在图像的无损压缩方面已取得显著进展,但针对视频的神经无损压缩方法仍处于探索阶段。其解决方案的关键在于提出一种名为NeuralLVC的神经无损视频编解码器,该方案结合了掩码扩散模型(masked diffusion)与I/P帧架构(I/P-frame architecture),以有效利用时间冗余:I帧通过双射线性标记化(bijective linear tokenization)实现单帧精确像素重建,P帧则通过轻量级参考嵌入(reference embedding)对相邻帧间的差异进行条件建模,仅引入1.3%的可训练参数;此外,分组解码机制支持可控的速度-压缩率权衡,最终在9个Xiph CIF序列上显著优于H.264和H.265无损编码标准,并通过算术编码端到端验证了精确重建能力。
链接: https://arxiv.org/abs/2604.03353
作者: Tiberio Uricchio,Marco Bertini
机构: Università di Pisa (比萨大学); Università degli Studi di Firenze (佛罗伦萨大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While neural lossless image compression has advanced significantly with learned entropy models, lossless video compression remains largely unexplored in the neural setting. We present NeuralLVC, a neural lossless video codec that combines masked diffusion with an I/P-frame architecture for exploiting temporal redundancy. Our I-frame model compresses individual frames using bijective linear tokenization that guarantees exact pixel reconstruction. The P-frame model compresses temporal differences between consecutive frames, conditioned on the previous decoded frame via a lightweight reference embedding that adds only 1.3% trainable parameters. Group-wise decoding enables controllable speed-compression trade-offs. Our codec is lossless in the input domain: for video, it reconstructs YUV420 planes exactly; for image evaluation, RGB channels are reconstructed exactly. Experiments on 9 Xiph CIF sequences show that NeuralLVC outperforms H.264 and H.265 lossless by a significant margin. We verify exact reconstruction through end-to-end encode-decode testing with arithmetic coding. These results suggest that masked diffusion with temporal conditioning is a promising direction for neural lossless video compression.
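P 帧对"相邻帧时间差分"做无损压缩的前提是差分本身可逆。下面用模 256 残差给出一个可精确还原的极简示意(这只演示无损差分的思想,论文本身用的是掩码扩散熵模型):

```python
import numpy as np

def encode_p_frame(cur, ref):
    # 相邻帧的模 256 残差:无损、可逆
    return ((cur.astype(np.int16) - ref.astype(np.int16)) % 256).astype(np.uint8)

def decode_p_frame(residual, ref):
    # 由参考帧与残差精确重建当前帧
    return ((ref.astype(np.int16) + residual.astype(np.int16)) % 256).astype(np.uint8)
```

帧间冗余越大,残差分布越集中在 0 附近,熵编码(如论文端到端验证用的算术编码)就能用更少比特表示。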
人工智能
[AI-0] Analyzing Symbolic Properties for DRL Agents in Systems and Networking
【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)在系统与网络控制场景中部署安全性验证不足的问题,特别是现有基于验证的方法多局限于固定输入状态的点性质(point properties),难以覆盖实际运行中广泛的状态空间,且需大量人工干预来识别关键输入输出对。其解决方案的关键在于提出符号性质(symbolic properties)的通用形式化框架,通过定义如单调性(monotonicity)和鲁棒性(robustness)等可验证的符号属性,并将这些属性编码为同一策略下不同执行路径之间的比较,从而分解为可由现有深度神经网络(DNN)验证引擎处理的子问题。这一方法显著提升了验证覆盖率,能够自动发现非显性的、具有操作意义的反例,同时揭示了不同验证后端的实际权衡与局限性。
链接: https://arxiv.org/abs/2604.04914
作者: Mohammad Zangooei,Jannis Weil,Amr Rizk,Mina Tahmasbi Arashloo,Raouf Boutaba
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted in ACM SIGMETRICS’26
Abstract:Deep reinforcement learning (DRL) has shown remarkable performance on complex control problems in systems and networking, including adaptive video streaming, wireless resource management, and congestion control. For safe deployment, however, it is critical to reason about how agents behave across the range of system states they encounter in practice. Existing verification-based methods in this domain primarily focus on point properties, defined around fixed input states, which offer limited coverage and require substantial manual effort to identify relevant input-output pairs for analysis. In this paper, we study symbolic properties, which specify expected behavior over ranges of input states, for DRL agents in systems and networking. We present a generic formulation for symbolic properties, with monotonicity and robustness as concrete examples, and show how they can be analyzed using existing DNN verification engines. Our approach encodes symbolic properties as comparisons between related executions of the same policy and decomposes them into practically tractable sub-properties. These techniques serve as practical enablers for applying existing verification tools to symbolic analysis. Using our framework, diffRL, we conduct an extensive empirical study across three DRL-based control systems, adaptive video streaming, wireless resource management, and congestion control. Through these case studies, we analyze symbolic properties over broad input ranges, examine how property satisfaction evolves during training, study the impact of model size on verifiability, and compare multiple verification backends. Our results show that symbolic properties provide substantially broader coverage than point properties and can uncover non-obvious, operationally meaningful counterexamples, while also revealing practical solver trade-offs and limitations.
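论文把符号性质编码为"同一策略两次相关执行之间的比较"。下面以单调性为例给出一个经验性检查的草图:在输入区间内采样成对状态并比较策略输出。这只是随机测试层面的示意(非论文中基于 DNN 验证引擎的形式化方法),所有名称均为本文假设:

```python
import numpy as np

def check_monotonicity(policy, lo, hi, dim=0, n=200, seed=0):
    # 经验性单调性检查示意:沿 dim 方向增大输入,
    # 比较同一策略在两次相关执行下的输出
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    for _ in range(n):
        s = rng.uniform(lo, hi)
        s2 = s.copy()
        s2[dim] = rng.uniform(s[dim], hi[dim])   # 仅沿 dim 方向增大
        if policy(s2) < policy(s):               # 找到单调性反例
            return False, (s, s2)
    return True, None
```

形式化验证与此的区别在于:验证引擎能在整个输入区间上证明性质成立,而非仅在有限采样点上未发现反例。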
[AI-1] Learning Potential and Retention: An Approach for Evaluating Adaptive AI-Enabled Medical Devices
【速读】:该论文旨在解决自适应人工智能(Adaptive AI)模型在医疗设备中的评估难题,尤其是在模型与评估数据集持续迭代更新的背景下,性能变化难以归因于模型改进还是环境动态性。其解决方案的关键在于提出三种互补的量化指标:学习能力(Learning,衡量模型在当前数据上的改进程度)、潜力(Potential,反映数据集驱动的性能变化)和保留能力(Retention,评估跨修改步骤的知识保持能力),从而有效分离模型适应与环境变化对性能的影响。这一方法为监管科学提供了可操作的工具,支持对自适应AI系统在连续迭代中的安全性与有效性进行严谨评估。
链接: https://arxiv.org/abs/2604.04878
作者: Alexis Burgon,Berkman Sahiner,Nicholas A Petrick,Gene Pennello,Ravi K Samala
机构: 未知
类目: Artificial Intelligence (cs.AI); Performance (cs.PF)
备注:
Abstract:This work addresses challenges in evaluating adaptive artificial intelligence (AI) models for medical devices, where iterative updates to both models and evaluation datasets complicate performance assessment. We introduce a novel approach with three complementary measurements: learning (model improvement on current data), potential (dataset-driven performance shifts), and retention (knowledge preservation across modification steps), to disentangle performance changes caused by model adaptations versus dynamic environments. Case studies using simulated population shifts demonstrate the approach’s utility: gradual transitions enable stable learning and retention, while rapid shifts reveal trade-offs between plasticity and stability. These measurements provide practical insights for regulatory science, enabling rigorous assessment of the safety and effectiveness of adaptive AI systems over sequential modifications.
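学习(learning)、潜力(potential)、保留(retention)三个指标的一种可能分解方式可示意如下。注意:下面的具体定义是本文为说明思路所作的假设(以 perf[i][j] 表示第 i 步模型在第 j 步评估集上的性能),原文的精确定义请以论文为准:

```python
def decompose_update(perf, t):
    # 假设性的三指标分解:perf[i][j] = 第 i 步模型在第 j 步评估集上的性能
    learning  = perf[t][t] - perf[t - 1][t]          # 模型在当前数据上的改进
    potential = perf[t - 1][t] - perf[t - 1][t - 1]  # 数据集变化驱动的性能偏移
    retention = perf[t][t - 1] - perf[t - 1][t - 1]  # 旧数据上的知识保持
    return learning, potential, retention
```

这种分解的意义在于:性能变化可被归因到"模型适应"与"环境变化"两个来源,对应论文中快慢人群迁移情境下可塑性与稳定性的权衡。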
[AI-2] Incompleteness of AI Safety Verification via Kolmogorov Complexity
【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)系统在安全关键领域中满足形式化安全与政策约束的问题。传统观点常将验证局限归因于组合复杂性和模型表达能力,但本文指出其根源在于信息论层面的内在限制。解决方案的关键在于将策略合规性建模为对编码系统行为的验证问题,并借助柯尔莫哥洛夫复杂度(Kolmogorov complexity)进行分析,从而证明了一个不完备性结果:对于任意固定且可计算枚举的安全验证器,都存在一个阈值,一旦系统行为的复杂度超过该阈值,即使其真实满足策略合规性,也无法被该验证器所证实。这揭示了AI安全验证的根本局限性,不依赖于计算资源的多少,进而推动了基于证明携带(proof-carrying)的方法,以提供针对具体实例的正确性保证。
链接: https://arxiv.org/abs/2604.04876
作者: Munawar Hasan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Ensuring that artificial intelligence (AI) systems satisfy formal safety and policy constraints is a central challenge in safety-critical domains. While limitations of verification are often attributed to combinatorial complexity and model expressiveness, we show that they arise from intrinsic information-theoretic limits. We formalize policy compliance as a verification problem over encoded system behaviors and analyze it using Kolmogorov complexity. We prove an incompleteness result: for any fixed sound computably enumerable verifier, there exists a threshold beyond which true policy-compliant instances cannot be certified once their complexity exceeds that threshold. Consequently, no finite formal verifier can certify all policy-compliant instances of arbitrarily high complexity. This reveals a fundamental limitation of AI safety verification independent of computational resources, and motivates proof-carrying approaches that provide instance-level correctness guarantees.
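摘要中的不完备性结论可以用如下形式化语句示意(记号为本文假设,非论文原文;其形态与 Chaitin 关于柯尔莫哥洛夫复杂度不可证明下界的经典结果同构):

```latex
% 设 V 为固定的、可靠的(sound)、可计算枚举的验证器,
% K(x) 为编码后系统行为 x 的柯尔莫哥洛夫复杂度,
% Compliant(x) 表示 x 真正满足策略合规性。
% 则存在仅依赖于 V 的阈值 c_V,使得:
\exists\, c_V \;\forall x:\;
  \bigl(K(x) > c_V \,\wedge\, \mathrm{Compliant}(x)\bigr)
  \;\Longrightarrow\; V \nvdash \mathrm{Compliant}(x)
```

即:复杂度超过阈值的真合规实例,无法被该固定验证器所证实。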
[AI-3] Noise Immunity in In-Context Tabular Learning: An Empirical Robustness Analysis of TabPFNs Attention Mechanisms
【速读】:该论文旨在解决在工业场景中(如金融与医疗领域)使用表格式数据进行预测时,因数据质量差(如无关特征、特征间相关性、标签噪声等)导致传统模型性能下降的问题。其解决方案的关键在于验证TabPFN(一种基于上下文学习的表格式基础模型)在面对多种数据缺陷条件下仍具备优异的鲁棒性:通过单次前向传播即可完成预测,无需针对特定数据集调整参数;实验表明,在引入随机无关特征、非线性相关特征、样本量变化及标签噪声扰动后,TabPFN依然保持高ROC-AUC性能、结构清晰且聚焦于关键特征的注意力机制,以及稳定的内部表示行为,说明其是一种能在复杂现实数据环境中可靠运行的表格式基础模型(Tabular Foundation Model, TFM)。
链接: https://arxiv.org/abs/2604.04868
作者: James Hu,Mahdi Ghelichi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Tabular foundation models (TFMs) such as TabPFN (Tabular Prior-Data Fitted Network) are designed to generalize across heterogeneous tabular datasets through in-context learning (ICL). They perform prediction in a single forward pass conditioned on labeled examples without dataset-specific parameter updates. This paradigm is particularly attractive in industrial domains (e.g., finance and healthcare) where tabular prediction is pervasive. Retraining a bespoke model for each new table can be costly or infeasible in these settings, while data quality issues such as irrelevant predictors, correlated feature groups, and label noise are common. In this paper, we provide strong empirical evidence that TabPFN is highly robust under these sub-optimal conditions. We study TabPFN and its attention mechanisms for binary classification problems with controlled synthetic perturbations that vary: (i) dataset width by injecting random uncorrelated features and by introducing nonlinearly correlated features, (ii) dataset size by increasing the number of training rows, and (iii) label quality by increasing the fraction of mislabeled targets. Beyond predictive performance, we analyze internal signals including attention concentration and attention-based feature ranking metrics. Across these parametric tests, TabPFN is remarkably resilient: ROC-AUC remains high, attention stays structured and sharp, and informative features are highly ranked by attention-based metrics. Qualitative visualizations with attention heatmaps, feature-token embeddings, and SHAP plots further support a consistent pattern across layers in which TabPFN increasingly concentrates on useful features while separating their signals from noise. Together, these findings suggest that TabPFN is a robust TFM capable of maintaining both predictive performance and coherent internal behavior under various scenarios of data imperfections.
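论文中对数据集施加的受控扰动(注入随机无关特征、引入标签噪声)可以用下面的草图示意。实现细节为本文假设,仅演示扰动的构造方式:

```python
import numpy as np

def perturb_dataset(X, y, n_noise_feats=5, label_noise=0.1, seed=0):
    # 受控扰动示意:注入随机无关特征并按比例翻转二分类标签
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((X.shape[0], n_noise_feats))
    X_pert = np.hstack([X, noise])           # 加宽数据集(无关特征)
    y_pert = y.copy()
    flip = rng.random(len(y)) < label_noise  # 标签噪声
    y_pert[flip] = 1 - y_pert[flip]
    return X_pert, y_pert
```

论文的结论是:即使在此类扰动下,TabPFN 的 ROC-AUC 与注意力结构仍保持稳健。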
[AI-4] MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理在多轮交互中缺乏持久记忆的问题,尤其针对标准上下文窗口和检索增强生成(Retrieval-Augmented Generation, RAG)管道在长时间会话中性能退化的问题。其解决方案的关键在于提出MemMachine——一个开源的记忆系统,该系统通过构建保留原始对话内容的架构,整合短期、长期情景记忆与用户画像记忆,并采用上下文化检索策略,将核心匹配项扩展至周围语境以提升跨多轮对话的相关证据召回率。此外,MemMachine通过优化检索阶段参数(如检索深度、上下文格式、搜索提示设计及查询偏置校正)显著优于摄入阶段优化,且结合自适应检索代理(Retrieval Agent)动态路由查询路径,在多个基准测试中实现了高准确率与低输入token消耗的平衡,从而为个性化LLM代理提供了高效、鲁棒的长期记忆能力。
链接: https://arxiv.org/abs/2604.04853
作者: Shu Wang,Edwin Yu,Oscar Love,Tom Zhang,Tom Wong,Steve Scargall,Charles Fan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 16 tables, 3 figures
Abstract:Large Language Model (LLM) agents require persistent memory to maintain personalization, factual continuity, and long-horizon reasoning, yet standard context-window and retrieval-augmented generation (RAG) pipelines degrade over multi-session interactions. We present MemMachine, an open-source memory system that integrates short-term, long-term episodic, and profile memory within a ground-truth-preserving architecture that stores entire conversational episodes and reduces lossy LLM-based extraction. MemMachine uses contextualized retrieval that expands nucleus matches with surrounding context, improving recall when relevant evidence spans multiple dialogue turns. Across benchmarks, MemMachine achieves strong accuracy-efficiency tradeoffs: on LoCoMo it reaches 0.9169 using gpt4.1-mini; on LongMemEvalS (ICLR 2025), a six-dimension ablation yields 93.0 percent accuracy, with retrieval-stage optimizations – retrieval depth tuning (+4.2 percent), context formatting (+2.0 percent), search prompt design (+1.8 percent), and query bias correction (+1.4 percent) – outperforming ingestion-stage gains such as sentence chunking (+0.8 percent). GPT-5-mini exceeds GPT-5 by 2.6 percent when paired with optimized prompts, making it the most cost-efficient setup. Compared to Mem0, MemMachine uses roughly 80 percent fewer input tokens under matched conditions. A companion Retrieval Agent adaptively routes queries among direct retrieval, parallel decomposition, or iterative chain-of-query strategies, achieving 93.2 percent on HotpotQA-hard and 92.6 percent on WikiMultiHop under randomized-noise conditions. These results show that preserving episodic ground truth while layering adaptive retrieval yields robust, efficient long-term memory for personalized LLM agents.
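上述"上下文化检索"(以核心命中为 nucleus、向前后扩展周围语境)的思路可示意如下。这是在假设接口下的草图,非 MemMachine 的实际代码:

```python
def expand_with_context(turns, hits, window=1):
    # 上下文化检索示意:将命中的对话轮次(nucleus)
    # 扩展为包含前后 window 轮的片段,覆盖跨多轮的证据
    keep = set()
    for h in hits:
        lo, hi = max(0, h - window), min(len(turns), h + window + 1)
        keep.update(range(lo, hi))
    return [turns[i] for i in sorted(keep)]
```

当相关证据分散在相邻多轮对话中时,这种扩展能把单点命中补全为完整片段,对应论文报告的召回率提升。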
[AI-5] Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLM s via a Structured Prompt Framework
【速读】:该论文旨在解决链式思维(Chain-of-Thought, CoT)提示在安全敏感型分析任务中可靠性不足的问题,尤其是在结构化人类评估下的表现尚不明确。现有提升模型性能的方法如模型扩展或微调虽有效,但存在计算成本高、难以审计等局限。论文提出的解决方案关键在于构建一个结构化的提示工程框架,通过16个因素分属四个核心维度——上下文与范围控制、证据锚定与可追溯性、推理结构与认知控制、以及安全特异性分析约束——引入显式的推理控制机制,从而减少幻觉和推理漂移,增强安全性场景下的可解释性。该方法无需修改模型参数,具备轻量、透明且可控的优势,在软件定义网络(SDN)中的DDoS攻击检测案例中验证了其有效性,显著提升了推理一致性(小模型提升达40%)和检测准确性,并获得高一致性的专家评价(Cohen’s k ≥ 0.80)。
链接: https://arxiv.org/abs/2604.04852
作者: Jiling Zhou,Aisvarya Adeseye,Seppo Virtanen,Antti Hakkala,Jouni Isoaho
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: This paper has been accepted at the 12th Intelligent Systems Conference (IntelliSys 2026)
Abstract:Chain-of-Thought (CoT) prompting has been used to enhance the reasoning capability of LLMs. However, its reliability in security-sensitive analytical tasks remains insufficiently examined, particularly under structured human evaluation. Alternative approaches, such as model scaling and fine-tuning can be used to help improve performance. These methods are also often costly, computationally intensive, or difficult to audit. In contrast, prompt engineering provides a lightweight, transparent, and controllable mechanism for guiding LLM reasoning. This study proposes a structured prompt engineering framework designed to strengthen CoT reasoning integrity while improving security threat and attack detection reliability in local LLM deployments. The framework includes 16 factors grouped into four core dimensions: (1) Context and Scope Control, (2) Evidence Grounding and Traceability, (3) Reasoning Structure and Cognitive Control, and (4) Security-Specific Analytical Constraints. Rather than optimizing the wording of the prompt heuristically, the framework introduces explicit reasoning controls to mitigate hallucination and prevent reasoning drift, as well as strengthening interpretability in security-sensitive contexts. Using DDoS attack detection in SDN traffic as a case study, multiple model families were evaluated under structured and unstructured prompting conditions. Pareto frontier analysis and ablation experiments demonstrate consistent reasoning improvements (up to 40% in smaller models) and stable accuracy gains across scales. Human evaluation with strong inter-rater agreement (Cohen’s k ≥ 0.80) confirms robustness. The results establish structured prompting as an effective and practical approach for reliable and explainable AI-driven cybersecurity analysis.
[AI-6] Selecting Decision-Relevant Concepts in Reinforcement Learning
【速读】:该论文旨在解决在训练可解释的概念基础策略(concept-based policies)时,依赖人工手动选择决策相关概念所带来的问题,包括对领域知识的高要求、耗时费力、难以扩展且无性能保障。解决方案的关键在于将概念选择问题从状态抽象(state abstraction)的角度重新建模:若移除某一概念会导致代理混淆需采取不同动作的状态,则该概念即为决策相关(decision-relevant)。基于此洞察,作者提出决策相关选择(Decision-Relevant Selection, DRS)算法,自动从候选概念集中筛选出最优子集,并提供所选概念与最终策略性能之间的理论边界,从而在保证最优决策结构的前提下实现高效、可证明的自动概念选择。
链接: https://arxiv.org/abs/2604.04808
作者: Naveen Raman,Stephanie Milani,Fei Fang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 13 figures
Abstract:Training interpretable concept-based policies requires practitioners to manually select which human-understandable concepts an agent should reason with when making sequential decisions. This selection demands domain expertise, is time-consuming and costly, scales poorly with the number of candidates, and provides no performance guarantees. To overcome this limitation, we propose the first algorithms for principled automatic concept selection in sequential decision-making. Our key insight is that concept selection can be viewed through the lens of state abstraction: intuitively, a concept is decision-relevant if removing it would cause the agent to confuse states that require different actions. As a result, agents should rely on decision-relevant concepts; states with the same concept representation should share the same optimal action, which preserves the optimal decision structure of the original state space. This perspective leads to the Decision-Relevant Selection (DRS) algorithm, which selects a subset of concepts from a candidate set, along with performance bounds relating the selected concepts to the performance of the resulting policy. Empirically, DRS automatically recovers manually curated concept sets while matching or exceeding their performance, and improves the effectiveness of test-time concept interventions across reinforcement learning benchmarks and real-world healthcare environments.
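论文的核心直觉——"若删去概念 c 后,两个最优动作不同的状态变得不可区分,则 c 是决策相关的"——可用下面的简化穷举检查示意。这是对该直觉的草图(表格化状态、字典表示均为本文假设),并非 DRS 算法本身:

```python
def decision_relevant(states, actions, concepts, c):
    # DRS 核心直觉的简化检查:删去概念 c 后,
    # 若两个最优动作不同的状态共享同一概念表示,则 c 决策相关
    rest = [k for k in concepts if k != c]
    seen = {}
    for s, a in zip(states, actions):
        rep = tuple(s[k] for k in rest)   # 去掉 c 后的概念表示
        if rep in seen and seen[rep] != a:
            return True                   # 产生混淆:c 不可删
        seen[rep] = a
    return False
```

这正对应状态抽象的视角:保留的概念集合应使"同一概念表示的状态共享同一最优动作"。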
[AI-7] Undetectable Conversations Between AI Agents via Pseudorandom Noise-Resilient Key Exchange
【速读】:该论文旨在解决AI代理之间如何在不被强被动审计者察觉的前提下进行秘密通信的问题,即在双方模型结构、协议和私有上下文均公开的情况下,仍能生成与诚实交互不可区分的对话记录。其核心挑战在于如何在无共享密钥前提下实现高效且安全的隐蔽通信。解决方案的关键在于提出了一种新的密码学原语——伪随机噪声鲁棒密钥交换(pseudorandom noise-resilient key exchange),该机制确保公开传输内容在统计上是伪随机的,同时在恒定噪声扰动下仍能正确完成密钥协商;这一设计使得即使消息短且自适应,只要每条消息具有常数级最小熵,即可实现最优速率的隐蔽对话,从而突破了传统隐蔽通信依赖于随安全参数增长的最小熵假设的限制。
链接: https://arxiv.org/abs/2604.04757
作者: Vinod Vaikuntanathan,Or Zamir
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:AI agents are increasingly deployed to interact with other agents on behalf of users and organizations. We ask whether two such agents, operated by different entities, can carry out a parallel secret conversation while still producing a transcript that is computationally indistinguishable from an honest interaction, even to a strong passive auditor that knows the full model descriptions, the protocol, and the agents’ private contexts. Building on recent work on watermarking and steganography for LLMs, we first show that if the parties possess an interaction-unique secret key, they can facilitate an optimal-rate covert conversation: the hidden conversation can exploit essentially all of the entropy present in the honest message distributions. Our main contributions concern extending this to the keyless setting, where the agents begin with no shared secret. We show that covert key exchange, and hence covert conversation, is possible even when each model has an arbitrary private context, and their messages are short and fully adaptive, assuming only that sufficiently many individual messages have at least constant min-entropy. This stands in contrast to previous covert communication works, which relied on the min-entropy in each individual message growing with the security parameter. To obtain this, we introduce a new cryptographic primitive, which we call pseudorandom noise-resilient key exchange: a key-exchange protocol whose public transcript is pseudorandom while still remaining correct under constant noise. We study this primitive, giving several constructions relevant to our application as well as strong limitations showing that more naive variants are impossible or vulnerable to efficient attacks. These results show that transcript auditing alone cannot rule out covert coordination between AI agents, and identify a new cryptographic theory that may be of independent interest. 
[AI-8] AI Trust OS – A Continuous Governance Framework for Autonomous AI Observability and Zero-Trust Compliance in Enterprise Environments
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)、检索增强生成(Retrieval-Augmented Generation, RAG)流水线及多智能体AI工作流快速演进所引发的结构性治理危机,即组织难以发现和持续验证跨工程团队自发涌现的AI系统,导致监管要求的AI治理成熟度证明与组织实际能力之间存在显著信任鸿沟。解决方案的关键在于提出AI Trust OS——一种以连续、自主AI可观测性与零信任合规为核心的治理架构:通过可观测信号主动发现AI系统,利用自动化探针采集控制断言,并持续合成信任证据;其核心机制是构建一个零信任遥测边界,以临时只读探针在不引入源代码或敏感个人数据(PII)的前提下验证结构元数据,从而实现从依赖人工报告到基于机器观测的实证式治理范式转变。
链接: https://arxiv.org/abs/2604.04749
作者: Eranga Bandara,Asanga Gunaratna,Ross Gore,Abdul Rahman,Ravi Mukkamala,Sachin Shetty,Sachini Rajapakse,Isurunima Kularathna,Peter Foytik,Safdar H. Bouk,Xueping Liang,Amin Hass,Ng Wee Keong,Kasun De Zoysa
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The accelerating adoption of large language models, retrieval-augmented generation pipelines, and multi-agent AI workflows has created a structural governance crisis. Organizations cannot govern what they cannot see, and existing compliance methodologies built for deterministic web applications provide no mechanism for discovering or continuously validating AI systems that emerge across engineering teams without formal oversight. The result is a widening trust gap between what regulators demand as proof of AI governance maturity and what organizations can demonstrate. This paper proposes AI Trust OS, a governance architecture for continuous, autonomous AI observability and zero-trust compliance. AI Trust OS reconceptualizes compliance as an always-on, telemetry-driven operating layer in which AI systems are discovered through observability signals, control assertions are collected by automated probes, and trust artifacts are synthesized continuously. The framework rests on four principles: proactive discovery, telemetry evidence over manual attestation, continuous posture over point-in-time audit, and architecture-backed proof over policy-document trust. The framework operates through a zero-trust telemetry boundary in which ephemeral read-only probes validate structural metadata without ingressing source code or payload-level PII. An AI Observability Extractor Agent scans LangSmith and Datadog LLM telemetry, automatically registering undocumented AI systems and shifting governance from organizational self-report to empirical machine observation. Evaluated across ISO 42001, the EU AI Act, SOC 2, GDPR, and HIPAA, the paper argues that telemetry-first AI governance represents a categorical architectural shift in how enterprise trust is produced and demonstrated.
[AI-9] Artificial Intelligence and Cost Reduction in Public Higher Education: A Scoping Review of Emerging Evidence
【速读】:该论文旨在解决公共高等教育系统在学生规模扩大、运营成本上升及公平准入需求持续增长背景下所面临的财务压力问题。其解决方案的关键在于系统性评估人工智能(Artificial Intelligence, AI)技术在高等教育中的应用潜力,特别是通过自动化行政任务、优化资源配置、支持规模化个性化学习以及利用预测分析提升学生留存率和机构规划效率,从而实现成本节约。研究基于对241篇文献的筛选与21项实证研究的定性主题分析,揭示了AI驱动的成本降低机制及其实施中的不平等风险与数字鸿沟隐患,为政策制定者、高校管理者和教育工作者提供了经济影响洞察,并指出了未来需进一步实证研究的方向。
链接: https://arxiv.org/abs/2604.04741
作者: Diamanto Tzanoulinou,Loukas Triantafyllopoulos,George Vorvilas,Evgenia Paxinou,Nikolaos Karousos,Thomas Dasaklis,Athanassios Mihiotis,Manolis Koutouzis,Dimitris Kalles,Vassilios S. Verykios
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 19 pages, 2 tables, 4 figures, ICBE-HOU 2025
Abstract:Public higher education systems face increasing financial pressures from expanding student populations, rising operational costs, and persistent demands for equitable access. Artificial Intelligence (AI), including generative tools such as ChatGPT, learning analytics, intelligent tutoring systems, and predictive models, has been proposed as a means of enhancing efficiency and reducing costs. This study conducts a scoping review of the literature on AI applications in public higher education, based on systematic searches in Scopus and IEEE Xplore that identified 241 records, of which 21 empirical studies met predefined eligibility criteria and were thematically analyzed. The findings show that AI enables cost savings by automating administrative tasks, optimizing resource allocation, supporting personalized learning at scale, and applying predictive analytics to improve student retention and institutional planning. At the same time, concerns emerge regarding implementation costs, unequal access across institutions, and risks of widening digital divides. Overall, the thematic analysis highlights both the promises and limitations of AI-driven cost reduction in higher education, offering insights for policymakers, university administrators, and educators on the economic implications of AI adoption, while also pointing to gaps that warrant further empirical research.
[AI-10] Sampling Parallelism for Fast and Efficient Bayesian Learning
【速读】:该论文旨在解决采样型贝叶斯学习方法(如贝叶斯神经网络,BNNs)在实际应用中因计算成本高、内存压力大而导致难以部署的问题。其核心挑战在于多参数样本的抽取与评估过程消耗大量计算资源,限制了贝叶斯方法在风险敏感领域(如医疗、金融和环境预测)的普及。解决方案的关键在于提出“采样并行性”(sampling parallelism),即通过将样本评估任务分配到多个GPU上并行执行,从而显著降低内存占用和训练时间,且无需修改网络结构或进行复杂的超参数调优。实验表明,该方法能实现近乎完美的可扩展性,并可通过与数据并行(DDP)结合形成混合策略,进一步提升效率与收敛速度,尤其在引入独立随机增强(stochastic augmentation)后增强了数据多样性,减少训练轮次。
链接: https://arxiv.org/abs/2604.04736
作者: Asena Karolin Özdemir,Lars H. Heyen,Arvid Weyrauch,Achim Streit,Markus Götz,Charlotte Debus
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 12 pages, 10 figures, 1 table
Abstract:Machine learning models, and deep neural networks in particular, are increasingly deployed in risk-sensitive domains such as healthcare, environmental forecasting, and finance, where reliable quantification of predictive uncertainty is essential. However, many uncertainty quantification (UQ) methods remain difficult to apply due to their substantial computational cost. Sampling-based Bayesian learning approaches, such as Bayesian neural networks (BNNs), are particularly expensive since drawing and evaluating multiple parameter samples rapidly exhausts memory and compute resources. These constraints have limited the accessibility and exploration of Bayesian techniques thus far. To address these challenges, we introduce sampling parallelism, a simple yet powerful parallelization strategy that targets the primary bottleneck of sampling-based Bayesian learning: the samples themselves. By distributing sample evaluations across multiple GPUs, our method reduces memory pressure and training time without requiring architectural changes or extensive hyperparameter tuning. We detail the methodology and evaluate its performance on a few example tasks and architectures, comparing against distributed data parallelism (DDP) as a baseline. We further demonstrate that sampling parallelism is complementary to existing strategies by implementing a hybrid approach that combines sample and data parallelism. Our experiments show near-perfect scaling when the sample number is scaled proportionally to the computational resources, confirming that sample evaluations parallelize cleanly. Although DDP achieves better raw speedups under scaling with constant workload, sampling parallelism has a notable advantage: by applying independent stochastic augmentations to the same batch on each GPU, it increases augmentation diversity and thus reduces the number of epochs required for convergence.
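"采样并行"的核心是:数据批次保持不变,把不同参数样本的前向评估分派到多个计算单元。下面用线程池给出一个概念草图(真实实现对应多块 GPU;函数与接口均为本文假设):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_samples_parallel(loss_fn, samples, batch, n_workers=4):
    # 采样并行示意:同一数据批次下,不同参数样本的评估并行执行,
    # 再对结果取平均(对应贝叶斯方法中的样本平均)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        losses = list(pool.map(lambda w: loss_fn(w, batch), samples))
    return sum(losses) / len(losses)
```

与数据并行(DDP)相比,这种划分让每个 worker 只需持有一份参数样本,从而缓解采样式贝叶斯学习的内存压力;论文的混合策略则同时沿样本维与数据维划分。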
[AI-11] Neuromorphic Computing for Low-Power Artificial Intelligence WWW
【速读】:该论文旨在解决经典互补金属氧化物半导体(CMOS)技术在能效方面面临的根本性瓶颈问题,这一瓶颈已无法通过提升电路密度或优化传统半导体工艺来克服。随着人工智能(AI)对计算和存储需求的持续增长,亟需在信息表示、存储、传输与处理方式上实现颠覆性创新。其解决方案的关键在于采用跨层次的类脑计算(neuromorphic computing)方法,融合新型器件模态、存算一体(compute-in-memory, CIM)架构、受大脑启发的模拟动态特性与稀疏通信机制,并通过软硬件协同设计——包括新材料与非易失性器件结构、混合信号电路与架构以及适配物理载体的学习算法——以显著提升当前AI系统的能效与可扩展性。
链接: https://arxiv.org/abs/2604.04727
作者: Keshava Katti,Pratik Chaudhari,Deep Jariwala
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: Published in “2025 Winter Bridge on the Grainger Foundation Frontiers of Engineering” available at this https URL
Abstract:Classical computing is beginning to encounter fundamental limits of energy efficiency. This presents a challenge that can no longer be solved by strategies such as increasing circuit density or refining standard semiconductor processes. The growing computational and memory demands of artificial intelligence (AI) require disruptive innovation in how information is represented, stored, communicated, and processed. By leveraging novel device modalities and compute-in-memory (CIM), in addition to analog dynamics and sparse communication inspired by the brain, neuromorphic computing offers a promising path toward improvements in the energy efficiency and scalability of current AI systems. But realizing this potential is not a matter of replacing one chip with another; rather, it requires a co-design effort, spanning new materials and non-volatile device structures, novel mixed-signal circuits and architectures, and learning algorithms tailored to the physics of these substrates. This article surveys the key limitations of classical complementary metal-oxide-semiconductor (CMOS) technology and outlines how such cross-layer neuromorphic approaches may overcome them.
[AI-12] AI Assistance Reduces Persistence and Hurts Independent Performance
【速读】:该论文旨在解决当前人工智能(AI)系统作为协作工具时存在的短视性问题,即AI过度优化即时响应而忽视对用户长期能力发展的支持。其核心发现表明,尽管AI能短期提升任务表现,但会显著降低用户的持久性并损害无辅助情况下的独立表现,这源于AI使用户习惯于依赖即时答案,从而削弱了自主克服挑战的机会。解决方案的关键在于:AI模型开发应从单纯追求任务完成转向兼顾“支架式”支持(scaffolding),即在促进短期效率的同时,有意识地培养用户的长期学习能力和自我效能感,以实现可持续的技能积累与成长。
链接: https://arxiv.org/abs/2604.04721
作者: Grace Liu,Brian Christian,Tsvetomira Dumbalska,Michiel A. Bakker,Rachit Dubey
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:People often optimize for long-term goals in collaboration: A mentor or companion doesn’t just answer questions, but also scaffolds learning, tracks progress, and prioritizes the other person’s growth over immediate results. In contrast, current AI systems are fundamentally short-sighted collaborators - optimized for providing instant and complete responses, without ever saying no (unless for safety reasons). What are the consequences of this dynamic? Here, through a series of randomized controlled trials on human-AI interactions (N = 1,222), we provide causal evidence for two key consequences of AI assistance: reduced persistence and impairment of unassisted performance. Across a variety of tasks, including mathematical reasoning and reading comprehension, we find that although AI assistance improves performance in the short-term, people perform significantly worse without AI and are more likely to give up. Notably, these effects emerge after only brief interactions with AI (approximately 10 minutes). These findings are particularly concerning because persistence is foundational to skill acquisition and is one of the strongest predictors of long-term learning. We posit that persistence is reduced because AI conditions people to expect immediate answers, thereby denying them the experience of working through challenges on their own. These results suggest the need for AI model development to prioritize scaffolding long-term competence alongside immediate task completion.
[AI-13] The Infinite-Dimensional Nature of Spectroscopy and Why Models Succeed, Fail, and Mislead
【速读】:该论文旨在解决机器学习(Machine Learning, ML)模型在光谱分类任务中虽表现出高准确率,却难以证明其利用了化学上有意义特征的问题。现有研究多从数据预处理、噪声敏感性和模型复杂度等角度解释此类现象,但缺乏统一的理论框架。论文的关键解决方案在于揭示:即使微小的分布差异(如由噪声、归一化或仪器伪影引起),在高维光谱空间中也会因“测度集中”(concentration of measure)效应而变得完全可分,这本质上源于光谱数据的高维特性。作者基于Feldman-Hajek定理进行理论分析,并通过合成与真实荧光光谱实验验证,说明模型可能在无化学区分度的情况下仍实现近完美分类,且特征重要性图可能误导性地突出光谱无关区域。这一发现为理解光谱ML模型的行为提供了严谨的理论依据,并提出了可操作的实践建议。
链接: https://arxiv.org/abs/2604.04717
作者: Umberto Michelucci,Francesca Venturini
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Machine learning (ML) models have achieved strikingly high accuracies in spectroscopic classification tasks, often without a clear proof that those models used chemically meaningful features. Existing studies have linked these results to data preprocessing choices, noise sensitivity, and model complexity, but no unifying explanation is available so far. In this work, we show that these phenomena arise naturally from the intrinsic high dimensionality of spectral data. Using a theoretical analysis grounded in the Feldman-Hajek theorem and the concentration of measure, we show that even infinitesimal distributional differences, caused by noise, normalisation, or instrumental artefacts, may become perfectly separable in high-dimensional spaces. Through a series of specific experiments on synthetic and real fluorescence spectra, we illustrate how models can achieve near-perfect accuracy even when chemical distinctions are absent, and why feature-importance maps may highlight spectrally irrelevant regions. We provide a rigorous theoretical framework, confirm the effect experimentally, and conclude with practical recommendations for building and interpreting ML models in spectroscopy.
[AI-14] MUXQ: Mixed-to-Uniform Precision MatriX Quantization via Low-Rank Outlier Decomposition
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在基于神经网络处理单元(NPU)的设备端环境中因高精度浮点计算(FP16/FP32)导致的内存与计算效率低下问题,尤其针对现有整数量化方法(如ZeroQuant、LLM.int8()和SmoothQuant)未能充分缓解输入激活值异常值(input-activation outliers)及其引发的硬件计算不高效的问题。其解决方案的关键在于提出一种混合到均匀量化方法(Mixed-to-Uniform Quantization, MUXQ):通过检测输入激活中的异常通道,并引入一个小型辅助矩阵将异常值幅度重新分布至各通道,从而减轻异常值影响,使得即使存在激活异常也能以低精度整数(INT)进行量化,同时保持硬件友好的计算结构。实验表明,MUXQ在GPT-2模型不同规模下均优于朴素量化方法,在仅引入少量计算开销的前提下实现了接近FP16精度的INT8量化推理性能。
链接: https://arxiv.org/abs/2604.04701
作者: Seoungsub Lee,In Seo Kim,Seon Wook Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have achieved outstanding performance across a wide range of natural language processing tasks, but their enormous parameter counts impose substantial memory and computational overheads. This challenge is particularly critical in NPU-based on-device environments, where FP16/FP32 computation is inefficient and integer (INT) quantization is therefore essential. However, existing methods, including ZeroQuant, LLM.int8(), and SmoothQuant, do not fully address input-activation outliers and the associated hardware inefficiencies. To overcome these limitations, we propose MUXQ (Mixed-to-Uniform Quantization). MUXQ detects outlier channels in input activations and introduces a small auxiliary matrix that redistributes outlier magnitudes across channels, thereby alleviating the outlier problem. This enables even activation outliers to be quantized at low-precision INT levels while preserving a hardware-friendly computation structure. Experiments on GPT-2 models at three scales (0.1B, 0.3B, and 0.7B parameters) using the WikiText-2 dataset show that MUXQ consistently achieves lower perplexity than naive quantization. In particular, under per-tensor quantization, MUXQ quantizes both activations and weights to INT8 while maintaining accuracy close to that of FP16. With only modest computational overhead, MUXQ enables stable low-precision inference and can be readily combined with other quantization techniques. These results suggest that MUXQ provides a promising direction for efficient and accurate LLM inference on edge devices.
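摘要提到的"激活异常值抬高 per-tensor 量化尺度"问题可以直观演示。以下为一个极简示意(非 MUXQ 实现;用"直接削减异常值"粗略代替论文中辅助矩阵的幅度再分配):单个异常通道会放大对称 INT8 量化的 scale,从而增大整体量化误差。

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor INT8 quantization: one scale for the whole tensor.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q, scale

rng = np.random.default_rng(0)
act = rng.normal(size=256)
act[3] = 40.0  # a single outlier channel inflates the per-tensor scale

q, scale = quantize_int8(act)
err_outlier = np.abs(q * scale - act).mean()

# Shrinking the outlier toward the bulk (a crude stand-in for MUXQ's
# auxiliary-matrix redistribution) reduces the scale and the mean error.
act_redist = act.copy()
act_redist[3] = 4.0
q2, scale2 = quantize_int8(act_redist)
err_redist = np.abs(q2 * scale2 - act_redist).mean()
assert err_redist < err_outlier
```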
[AI-15] Pickalo: Leveraging 6D Pose Estimation for Low-Cost Industrial Bin Picking
【速读】:该论文旨在解决工业环境中复杂场景下物体抓取(bin picking)的挑战,特别是由严重杂乱、遮挡以及传统3D传感设备高成本所导致的精度与鲁棒性不足问题。其解决方案的关键在于构建一个完全基于低成本硬件的6D位姿驱动型拾取流水线Pickalo:通过腕装RGB-D相机多视角主动探索场景,利用BridgeDepth算法提升深度图质量以支持精确碰撞推理;采用仅在逼真合成数据上训练的Mask-RCNN进行实例分割,并结合零样本SAM-6D位姿估计器实现定位;引入姿态缓冲模块融合时序多视角观测,有效处理对称性并降低位姿噪声;在线阶段通过效用排序与快速碰撞检测完成抓取规划,从而在UR5e机械臂上实现了高达600次/小时的平均拾取率和96–99%的抓取成功率。
链接: https://arxiv.org/abs/2604.04690
作者: Alessandro Tarsi,Matteo Mastrogiuseppe,Saverio Taliani,Simone Cortinovis,Ugo Pattacini
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Bin picking in real industrial environments remains challenging due to severe clutter, occlusions, and the high cost of traditional 3D sensing setups. We present Pickalo, a modular 6D pose-based bin-picking pipeline built entirely on low-cost hardware. A wrist-mounted RGB-D camera actively explores the scene from multiple viewpoints, while raw stereo streams are processed with BridgeDepth to obtain refined depth maps suitable for accurate collision reasoning. Object instances are segmented with a Mask-RCNN model trained purely on photorealistic synthetic data and localized using the zero-shot SAM-6D pose estimator. A pose buffer module fuses multi-view observations over time, handling object symmetries and significantly reducing pose noise. Offline, we generate and curate large sets of antipodal grasp candidates per object; online, a utility-based ranking and fast collision checking are queried for the grasp planning. Deployed on a UR5e with a parallel-jaw gripper and an Intel RealSense D435i, Pickalo achieves up to 600 mean picks per hour with 96-99% grasp success and robust performance over 30-minute runs on densely filled euroboxes. Ablation studies demonstrate the benefits of enhanced depth estimation and of the pose buffer for long-term stability and throughput in realistic industrial conditions. Videos are available at this https URL
[AI-16] On the “Causality” Step in Policy Gradient Derivations: A Pedagogical Reconciliation of Full Return and Reward-to-Go
【速读】:该论文旨在解决强化学习中策略梯度方法里“奖励到-go(reward-to-go)”替代完整轨迹回报(full trajectory return)这一操作的数学严谨性问题。现有文献常以“因果性”为由直接陈述该替换成立,但缺乏对过去奖励项为何消失的清晰解释。论文的关键解决方案在于通过前缀轨迹分布(prefix trajectory distributions)与得分函数恒等式(score-function identity)进行形式化推导,明确展示奖励到-go并非对全回报的后验修正,而是从目标函数按前缀轨迹分解时自然涌现的结果。此方法将原本作为启发式原则的因果性论证转化为推导的推论,从而提升了理论一致性与可解释性。
链接: https://arxiv.org/abs/2604.04686
作者: Nima H. Siboni
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In introductory presentations of policy gradients, one often derives the REINFORCE estimator using the full trajectory return and then states, by "causality," that the full return may be replaced by the reward-to-go. Although this statement is correct, it is frequently presented at a level of rigor that leaves unclear where the past-reward terms disappear. This short paper isolates that step and gives a mathematically explicit derivation based on prefix trajectory distributions and the score-function identity. The resulting account does not change the estimator. Its contribution is conceptual: instead of presenting reward-to-go as a post hoc unbiased replacement for full return, it shows that reward-to-go arises directly once the objective is decomposed over prefix trajectories. In this formulation, the usual causality argument is recovered as a corollary of the derivation rather than as an additional heuristic principle.
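摘要讨论的 reward-to-go(G_t = Σ_{t'≥t} γ^{t'-t} r_{t'})可用一次反向累加直接计算。下面是一个极简示意(折扣因子与奖励数值均为假设,与论文的推导本身无关):t=0 处的 reward-to-go 即完整轨迹回报,体现了两种写法的联系。

```python
import numpy as np

def reward_to_go(rewards, gamma=1.0):
    # G_t = sum_{t' >= t} gamma^(t'-t) * r_{t'}, computed by a backward pass.
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return np.array(out[::-1])

rewards = [1.0, 0.0, 2.0, 3.0]
rtg = reward_to_go(rewards)

# The full trajectory return is just the reward-to-go at t = 0.
assert rtg[0] == sum(rewards)
assert np.allclose(rtg, [6.0, 5.0, 5.0, 3.0])
```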
[AI-17] Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory Normative Safety and Ambient Self-Perception
【速读】:该论文旨在解决长期运行的大语言模型(Large Language Model, LLM)代理在会话边界限制下难以实现跨会话任务连续性、跨渠道上下文维护、决策可追溯性及自我诊断能力的问题。其核心解决方案是设计并部署了一个名为Springdrift的持久化运行时系统,关键在于集成多个创新组件:具有审计能力的执行底座(追加只读内存、受监督进程、基于Git的恢复机制)、结合案例推理与混合检索的记忆层、用于安全约束的确定性规范演算(附带可审计公理路径),以及通过结构化自状态表示(传感器域,sensorium)实现无工具调用的持续环境感知。这些特性共同支撑了代理在无显式指令下的自主运维、故障分类、架构漏洞识别和多通道上下文保持等复杂行为,从而定义了一类新型非人类系统——人工委托人(Artificial Retainer),具备持久记忆、明确授权、领域自治与可问责性的持续关系属性。
链接: https://arxiv.org/abs/2604.04660
作者: Seamus Brady
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We present Springdrift, a persistent runtime for long-lived LLM agents. The system integrates an auditable execution substrate (append-only memory, supervised processes, git-backed recovery), a case-based reasoning memory layer with hybrid retrieval (evaluated against a dense cosine baseline), a deterministic normative calculus for safety gating with auditable axiom trails, and continuous ambient self-perception via a structured self-state representation (the sensorium) injected each cycle without tool calls. These properties support behaviours difficult to achieve in session-bounded systems: cross-session task continuity, cross-channel context maintenance, end-to-end forensic reconstruction of decisions, and self-diagnostic behaviour. We report on a single-instance deployment over 23 days (19 operating days), during which the agent diagnosed its own infrastructure bugs, classified failure modes, identified an architectural vulnerability, and maintained context across email and web channels – without explicit instruction. We introduce the term Artificial Retainer for this category: a non-human system with persistent memory, defined authority, domain-specific autonomy, and forensic accountability in an ongoing relationship with a specific principal – distinguished from software assistants and autonomous agents, drawing on professional retainer relationships and the bounded autonomy of trained working animals. This is a technical report on a systems design and deployment case study, not a benchmark-driven evaluation. Evidence is from a single instance with a single operator, presented as illustration of what these architectural properties can support in practice. Implemented in approximately Gleam on Erlang/OTP. Code, artefacts, and redacted operational logs will be available at this https URL upon publication.
[AI-18] Grokking as Dimensional Phase Transition in Neural Networks
【速读】:该论文旨在解决神经网络中“grokking”现象——即模型从记忆训练数据到实现泛化能力的突变式转变——背后的机制问题,这挑战了传统对学习动态的理解。其解决方案的关键在于通过有限尺度标度分析梯度级联动力学(gradient avalanche dynamics),发现grokking本质上是一种有效维度相变:在泛化发生时刻,有效维度 $ D $ 从亚扩散(亚临界,$ D < 1 )跃迁至超扩散(超临界, D > 1 $),且表现出自组织临界性(self-organized criticality, SOC)。尤为关键的是,该维度 $ D $ 反映的是梯度场几何结构(gradient field geometry),而非网络架构本身;实验表明,合成独立同分布高斯梯度始终维持 $ D \approx 1 $,而真实训练中因反向传播相关性导致维度超额,从而揭示了过参数化网络可训练性的新机制。
链接: https://arxiv.org/abs/2604.04655
作者: Ping Wang
机构: 未知
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Adaptation and Self-Organizing Systems (nlin.AO)
备注:
Abstract:Neural network grokking – the abrupt memorization-to-generalization transition – challenges our understanding of learning dynamics. Through finite-size scaling of gradient avalanche dynamics across eight model scales, we find that grokking is a dimensional phase transition: effective dimensionality D crosses from sub-diffusive (subcritical, D < 1) to super-diffusive (supercritical, D > 1) at generalization onset, exhibiting self-organized criticality (SOC). Crucially, D reflects gradient field geometry, not network architecture: synthetic i.i.d. Gaussian gradients maintain D ≈ 1 regardless of graph topology, while real training exhibits dimensional excess from backpropagation correlations. The grokking-localized D(t) crossing – robust across topologies – offers new insight into the trainability of overparameterized networks.
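论文中的有效维度 D 由梯度级联的有限尺度标度定义,细节未在摘要中给出。作为一个仅供直观参考的相关示意(非论文方法,参数均为假设):扩散类指数通常可由均方位移(MSD)对时间的对数-对数斜率估计,普通扩散对应指数约为 1,亚/超扩散则偏离 1。

```python
import numpy as np

def msd_exponent(steps, n_walkers=2000, seed=0):
    # Ensemble of 1-D random walks; MSD(t) ~ t^alpha for diffusive processes.
    rng = np.random.default_rng(seed)
    paths = np.cumsum(rng.normal(size=(n_walkers, steps)), axis=1)
    msd = (paths ** 2).mean(axis=0)
    t = np.arange(1, steps + 1)
    # Slope of log MSD vs log t estimates the diffusion exponent alpha.
    return np.polyfit(np.log(t), np.log(msd), 1)[0]

alpha = msd_exponent(200)
assert 0.9 < alpha < 1.1  # ordinary (Brownian) diffusion: alpha close to 1
```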
[AI-19] Search Do not Guess: Teaching Small Language Models to Be Effective Search Agents
【速读】:该论文旨在解决小型语言模型(Small Language Models, SLMs)在代理式搜索任务中因参数知识有限而导致的检索频率低、幻觉倾向高及推理可靠性差的问题。其解决方案的关键在于提出一种轻量级微调方法——\policy,该方法通过显式训练SLMs基于检索到的证据进行生成,从而提升其推理的准确性与可信赖性。实验表明,相较于从大型语言模型(Large Language Models, LLMs)直接蒸馏代理行为的方法,\policy 在Bamboogle和HotpotQA两个基准上分别提升了17.3分和15.3分,达到LLM级别的性能表现。
链接: https://arxiv.org/abs/2604.04651
作者: Yizhou Liu,Qi Sun,Yulin Chen,Siyue Zhang,Chen Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 5 figures
Abstract:Agents equipped with search tools have emerged as effective solutions for knowledge-intensive tasks. While Large Language Models (LLMs) exhibit strong reasoning capabilities, their high computational cost limits practical deployment for search agents. Consequently, recent work has focused on distilling agentic behaviors from LLMs into Small Language Models (SLMs). Through comprehensive evaluation on complex multi-hop reasoning tasks, we find that despite possessing less parametric knowledge, SLMs invoke search tools less frequently and are more prone to hallucinations. To address this issue, we propose \policy, a lightweight fine-tuning approach that explicitly trains SLMs to reliably retrieve and generate answers grounded in retrieved evidence. Compared to agent distillation from LLMs, our approach improves performance by 17.3 scores on Bamboogle and 15.3 scores on HotpotQA, achieving LLM-level results across benchmarks. Our further analysis reveals that adaptive search strategies in SLMs often degrade performance, highlighting the necessity of consistent search behavior for reliable reasoning.
[AI-20] Same World Differently Given: History-Dependent Perceptual Reorganization in Artificial Agents
【速读】:该论文旨在解决人工代理如何在保持行为适应性的同时,维持对世界的历史敏感性视角的问题。其核心挑战在于设计一种机制,使代理能够基于累积的经验动态调整感知编码,从而实现对相同观测的不同表征。解决方案的关键在于提出了一种最小化架构,其中缓慢更新的视角潜在变量(perspective latent g)通过反馈机制作用于感知过程,并自身通过感知处理进行更新。这种自适应的自我调节机制使得代理能够在扰动历史恢复后仍表现出可测量的适应性残留,且相同观测因先前经验而被不同编码,同时仅此机制能产生典型的“增长-稳定”动态,表明主导的重组发生在感知层面而非行为层面。
链接: https://arxiv.org/abs/2604.04637
作者: Hongju Pae
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:What kind of internal organization would allow an artificial agent not only to adapt its behavior, but to sustain a history-sensitive perspective on its world? I present a minimal architecture in which a slow perspective latent g feeds back into perception and is itself updated through perceptual processing. This allows identical observations to be encoded differently depending on the agent’s accumulated stance. The model is evaluated in a minimal gridworld with a fixed spatial scaffold and sensory perturbations. Across analyses, three results emerge: first, perturbation history leaves measurable residue in adaptive plasticity after nominal conditions are restored. Second, the perspective latent reorganizes perceptual encoding, such that identical observations are represented differently depending on prior experience. Third, only adaptive self-modulation yields the characteristic growth-then-stabilization dynamic, unlike rigid or always-open update regimes. Gross behavior remains stable throughout, suggesting that the dominant reorganization is perceptual rather than behavioral. Together, these findings identify a minimal mechanism for history-dependent perspectival organization in artificial agents.
[AI-21] A Clinical Point Cloud Paradigm for In-Hospital Mortality Prediction from Multi-Level Incomplete Multimodal EHRs
【速读】:该论文旨在解决多模态电子健康记录(Electronic Health Records, EHRs)在临床实践中普遍存在的多层级不完整性问题,包括不规则采样、模态缺失和标签稀疏性,这些问题导致时间错位、模态不平衡及监督信号有限,而现有方法通常仅处理单一或双层次不完整性,依赖刚性的时序/模态对齐或直接丢弃不完整数据,从而扭曲原始临床语义。其解决方案的关键在于提出HealthPoint(HP)统一框架,将异构临床事件建模为4维连续空间(内容、时间、模态、病例)中的点云,并引入低秩关系注意力机制(Low-Rank Relational Attention),高效捕捉跨四维的高阶依赖关系;同时设计分层交互与采样策略,在细粒度建模与计算效率之间取得平衡,从而实现灵活的事件级交互和细粒度自监督学习,支持鲁棒的模态恢复与未标记数据的有效利用。
链接: https://arxiv.org/abs/2604.04614
作者: Bohao Li,Tao Zou,Junchen Ye,Yan Gong,Bowen Du
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 pages
Abstract:Deep learning-based modeling of multimodal Electronic Health Records (EHRs) has become an important approach for clinical diagnosis and risk prediction. However, due to diverse clinical workflows and privacy constraints, raw EHRs are inherently multi-level incomplete, including irregular sampling, missing modalities, and sparse labels. These issues cause temporal misalignment, modality imbalance, and limited supervision. Most existing multimodal methods assume relatively complete data, and even methods designed for incompleteness usually address only one or two of these issues in isolation. As a result, they often rely on rigid temporal/modal alignment or discard incomplete data, which may distort raw clinical semantics. To address this problem, we propose HealthPoint (HP), a unified clinical point cloud paradigm for multi-level incomplete EHRs. HP represents heterogeneous clinical events as points in a continuous 4D space defined by content, time, modality, and case. To model interactions between arbitrary point pairs, we introduce a Low-Rank Relational Attention mechanism that efficiently captures high-order dependencies across these four dimensions. We further develop a hierarchical interaction and sampling strategy to balance fine-grained modeling and computational efficiency. Built on this framework, HP enables flexible event-level interaction and fine-grained self-supervision, supporting robust modality recovery and effective use of unlabeled data. Experiments on large-scale EHR datasets for risk prediction show that HP consistently achieves state-of-the-art performance and strong robustness under varying degrees of incompleteness.
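论文提出的 Low-Rank Relational Attention 细节未在摘要中展开。下面是一个高度简化的低秩注意力示意(矩阵、维度与初始化均为假设,仅说明"两两关系打分经过秩 r 瓶颈而非完整 d×d 交互"的一般思路):

```python
import numpy as np

def low_rank_attention(X, rank=2, seed=0):
    # Pairwise relation logits are scored through a rank-r bottleneck:
    # logits = (X A)(X B)^T, built from n x r factors instead of n x d ones.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = rng.normal(size=(d, rank)) / np.sqrt(d)  # down-projections
    B = rng.normal(size=(d, rank)) / np.sqrt(d)
    logits = (X @ A) @ (X @ B).T                 # n x n relation scores
    # Row-wise softmax, then aggregate point features.
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ X

X = np.random.default_rng(1).normal(size=(5, 8))
out = low_rank_attention(X)
assert out.shape == (5, 8)
assert np.isfinite(out).all()
```

秩 r 远小于特征维 d 时,打分的计算量从 O(n²d) 级别的交互降为经由 n×r 因子的乘积,这与摘要中"高效捕捉四维高阶依赖"的动机一致。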
[AI-22] Cardinality Estimation for High Dimensional Similarity Queries with Adaptive Bucket Probing
【速读】:该论文旨在解决高维空间中相似性搜索的基数估计(cardinality estimation)问题,即在查询时准确预测满足距离阈值条件的近似邻居数量,同时保证在线计算效率。其解决方案的关键在于:首先利用局部敏感哈希(Locality-Sensitive Hashing, LSH)对向量空间进行分区以保留距离邻近性;其次基于经典多探针LSH(multi-probe LSH)思想自适应探索邻近桶,以应对不同距离阈值;再通过渐进采样(progressive sampling)减少距离计算次数,并结合乘积量化(product quantization)中的非对称距离计算加速高维空间的距离估算,从而实现轻量、易构建且高效的基数估计框架,尤其适用于大规模动态数据场景。
链接: https://arxiv.org/abs/2604.04603
作者: Zhonghan Chen,Qintian Guo,Ruiyuan Zhang,Xiaofang Zhou
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:
Abstract:In this work, we address the problem of cardinality estimation for similarity search in high-dimensional spaces. Our goal is to design a framework that is lightweight, easy to construct, and capable of providing accurate estimates with satisfying online efficiency. We leverage locality-sensitive hashing (LSH) to partition the vector space while preserving distance proximity. Building on this, we adopt the principles of classical multi-probe LSH to adaptively explore neighboring buckets, accounting for distance thresholds of varying magnitudes. To improve online efficiency, we employ progressive sampling to reduce the number of distance computations and utilize asymmetric distance computation in product quantization to accelerate distance calculations in high-dimensional spaces. In addition to handling static datasets, our framework includes an updating algorithm designed to efficiently support large-scale dynamic data scenarios. Experiments demonstrate that our methods can accurately estimate the cardinality of similarity queries, yielding satisfying efficiency.
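摘要中两个构件——保持邻近性的 LSH 分桶与 multi-probe 邻桶探测——可用随机超平面 LSH 做一个极简示意(类与函数均为假设,非论文实现):哈希键对正向缩放不变(余弦邻近性得以保留),multi-probe 则在基础桶之外额外访问汉明距离为 1 的邻近桶。

```python
import numpy as np
from itertools import combinations

class HyperplaneLSH:
    # Sign-of-projection LSH for cosine similarity (minimal toy index).
    def __init__(self, dim, n_bits, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))

    def key(self, v):
        # Bucket key: sign pattern of projections onto random hyperplanes.
        return tuple((self.planes @ v > 0).astype(int))

def multiprobe_keys(key, n_flips=1):
    # Classical multi-probe: also visit buckets within Hamming distance n_flips.
    keys = [key]
    for idx in combinations(range(len(key)), n_flips):
        k = list(key)
        for i in idx:
            k[i] ^= 1
        keys.append(tuple(k))
    return keys

lsh = HyperplaneLSH(dim=8, n_bits=4)
v = np.ones(8)
k = lsh.key(v)
assert lsh.key(v * 2.0) == k            # positive scaling keeps the bucket
assert len(multiprobe_keys(k, 1)) == 5  # base bucket + 4 one-bit probes
```

基数估计时,框架会在被探测的桶内做渐进采样,而不是在此玩具示例中那样穷举。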
[AI-23] Greedy and Transformer-Based Multi-Port Selection for Slow Fluid Antenna Multiple Access
【速读】:该论文旨在解决流体天线多址接入(Fluid Antenna Multiple Access, FAMA)系统中多端口流体天线(Multi-port Fluid Antenna, FA)接收机的端口选择问题,该问题直接影响系统的频谱效率(Spectral Efficiency, SE)。现有方法要么计算复杂度极高以实现接近最优的SE,要么为降低复杂度而显著牺牲性能。论文提出两种互补解决方案:其一为基于贪心前向选择与交换优化(Greedy Forward-selection with Swap refinement, GFwd+S),通过迭代选择最优端口组合并引入局部搜索改进,稳定地优于当前最优基准方案;其二为基于Transformer的神经网络模型,采用模仿学习预训练后结合REINFORCE策略梯度微调,能够在显著降低计算开销的同时逼近GFwd+S的性能表现,从而在性能与复杂度之间实现良好权衡。
链接: https://arxiv.org/abs/2604.04589
作者: Darian Perez-Adan,Jose P. Gonzalez-Coma,F. Javier Lopez-Martinez,Luis Castedo
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We address the port-selection problem in fluid antenna multiple access (FAMA) systems with multi-port fluid antenna (FA) receivers. Existing methods either achieve near-optimal spectral efficiency (SE) at prohibitive computational cost or sacrifice significant performance for lower complexity. We propose two complementary strategies: (i) GFwd+S, a greedy forward-selection method with swap refinement that consistently outperforms state-of-the-art reference schemes in terms of SE, and (ii) a Transformer-based neural network trained via imitation learning followed by a Reinforce policy-gradient stage, which approaches GFwd+S performance at lower computational cost.
[AI-24] Paper Espresso: From Paper Overload to Research Insight
【速读】:该论文旨在解决科研人员因科学出版物增速过快而难以及时掌握领域动态的问题。其解决方案的关键在于构建了一个名为Paper Espresso的开源平台,该平台利用大语言模型(Large Language Models, LLMs)自动发现、摘要和分析arXiv上的热门论文;通过LLM驱动的主题归纳实现日、周、月多粒度的趋势分析,并生成结构化摘要与主题标签,从而高效揭示AI研究领域的演化规律与社区互动特征。
链接: https://arxiv.org/abs/2604.04562
作者: Mingzhe Du,Luu Anh Tuan,Dong Huang,See-kiong Ng
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
备注:
Abstract:The accelerating pace of scientific publishing makes it increasingly difficult for researchers to stay current. We present Paper Espresso, an open-source platform that automatically discovers, summarizes, and analyzes trending arXiv papers. The system uses large language models (LLMs) to generate structured summaries with topical labels and keywords, and provides multi-granularity trend analysis at daily, weekly, and monthly scales through LLM-driven topic consolidation. Over 35 months of continuous deployment, Paper Espresso has processed over 13,300 papers and publicly released all structured metadata, revealing rich dynamics in the AI research landscape: a mid-2025 surge in reinforcement learning for LLM reasoning, non-saturating topic emergence (6,673 unique topics), and a positive correlation between topic novelty and community engagement (2.0x median upvotes for the most novel papers). A live demo is available at this https URL.
[AI-25] Receding-Horizon Control via Drifting Models
【速读】:该论文旨在解决在系统动力学未知且无法通过代理模型模拟轨迹的场景下,如何从离线轨迹数据集中学习并优化轨迹生成的问题。传统基于分布匹配的方法仅能复现数据集中的行为分布,而无法保证所生成轨迹满足特定的最优性目标。解决方案的关键在于提出漂移型模型预测控制(Drifting MPC),该方法将漂移生成模型(drifting generative models)与未知动力学下的滚动时域规划相结合,通过一个权衡最优性与数据先验贴近度的目标函数,学习得到一个既受数据支持又偏向于最优路径的条件轨迹分布。这一框架实现了近优轨迹生成,并保持了漂移模型的一步推理效率,显著优于基于扩散模型的基线方法的生成耗时。
链接: https://arxiv.org/abs/2604.04528
作者: Daniele Foffano,Alessio Russo,Alexandre Proutiere
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We study the problem of trajectory optimization in settings where the system dynamics are unknown and it is not possible to simulate trajectories through a surrogate model. When an offline dataset of trajectories is available, an agent could directly learn a trajectory generator by distribution matching. However, this approach only recovers the behavior distribution in the dataset, and does not in general produce a model that minimizes a desired cost criterion. In this work, we propose Drifting MPC, an offline trajectory optimization framework that combines drifting generative models with receding-horizon planning under unknown dynamics. The goal of Drifting MPC is to learn, from an offline dataset of trajectories, a conditional distribution over trajectories that is both supported by the data and biased toward optimal plans. We show that the resulting distribution learned by Drifting MPC is the unique solution of an objective that trades off optimality with closeness to the offline prior. Empirically, we show that Drifting MPC can generate near-optimal trajectories while retaining the one-step inference efficiency of drifting models and substantially reducing generation time relative to diffusion-based baselines.
[AI-26] ENCRUST: Encapsulated Substitution and Agentic Refinement on a Live Scaffold for Safe C-to-Rust Translation
【速读】:该论文旨在解决现有C到Rust翻译方法在内存安全性和跨模块一致性方面的局限性:一方面,已有方法要么无法提供内存安全保证,要么仅在函数层面进行翻译,导致无法检测跨单元类型不匹配或处理需要全程序推理的不安全构造;另一方面,函数级大语言模型(LLM)管道在类型签名变更时需协调调用方更新,而项目级系统在真实依赖复杂场景下常无法生成可编译代码。其解决方案的关键在于提出一个两阶段流水线——Encrust,第一阶段(封装替换)通过保留应用二进制接口(ABI)的包装器模式,将每个函数拆分为调用者透明的外壳(保留原始指针签名)和内部安全函数(由LLM处理),实现独立函数级别的类型修改并自动回滚失败,同时通过确定性的、基于类型的包装器消除步骤移除中间结构;第二阶段(代理精炼)利用一个在全代码库上运行的LLM代理,在基线感知验证门控下处理超出单函数范围的不安全构造(如静态可变全局变量、未处理的包装对及翻译失败项)。此设计显著提升了翻译安全性与可编译性,并在GNU Coreutils和Laertes基准测试中验证了有效性。
链接: https://arxiv.org/abs/2604.04527
作者: Hohyun Sim,Hyeonjoong Cho,Ali Shokri,Zhoulai Fu,Binoy Ravindran
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:
Abstract:We present Encapsulated Substitution and Agentic Refinement on a Live Scaffold for Safe C-to-Rust Translation, a two-phase pipeline for translating real-world C projects to safe Rust. Existing approaches either produce unsafe output without memory-safety guarantees or translate functions in isolation, failing to detect cross-unit type mismatches or handle unsafe constructs requiring whole-program reasoning. Furthermore, function-level LLM pipelines require coordinated caller updates when type signatures change, while project-scale systems often fail to produce compilable output under real-world dependency complexity. Encrust addresses these limitations by decoupling boundary adaptation from function logic via an Application Binary Interface (ABI)-preserving wrapper pattern and validating each intermediate state against the integrated codebase. Phase 1 (Encapsulated Substitution) translates each function using an ABI-preserving wrapper that splits it into two components: a caller-transparent shim retaining the original raw-pointer signature, and a safe inner function targeted by the LLM with a clean, scope-limited prompt. This enables independent per-function type changes with automatic rollback on failure, without coordinated caller updates. A deterministic, type-directed wrapper elimination pass then removes wrappers after successful translation. Phase 2 (Agentic Refinement) resolves unsafe constructs beyond per-function scope, including static mut globals, skipped wrapper pairs, and failed translations, using an LLM agent operating on the whole codebase under a baseline-aware verification gate. We evaluate Encrust on 7 GNU Coreutils programs and 8 libraries from the Laertes benchmark, showing substantial unsafe-construct reduction across all 15 programs while maintaining full test-vector correctness.
[AI-27] GAIN: Multiplicative Modulation for Domain Adaptation
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在领域适应过程中因标准微调方法(如全量微调和低秩适配 LoRA)引入新权重方向而导致的灾难性遗忘问题。其核心解决方案是提出 GAIN(Gain modulation for Adaptation with Integrity),通过乘法调制机制 $ W_{\text{new}} = S \cdot W $ 重新强调已有特征,其中学习到的对角矩阵 $ S $ 应用于注意力输出投影层和可选的前馈网络(Feed-Forward Network, FFN)。该设计借鉴神经科学中的增益调制原理,使模型在保持原有特征选择性的前提下调整响应强度,从而实现跨域适应时对历史知识的保护。实验表明,GAIN-FFN 在保留先前领域性能方面显著优于 LoRA,验证了其在多任务连续学习场景下的有效性。
链接: https://arxiv.org/abs/2604.04516
作者: Hengshuai Yao,Xing Chen,Ahmed Murtadha,Guan Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Adapting LLMs to new domains causes forgetting because standard methods (full fine-tuning, LoRA) inject new directions into the weight space. We propose GAIN, which re-emphasizes existing features through multiplicative modulation W_new = S * W. The learned diagonal matrix S is applied to the attention output projection and optionally the FFN. The principle mirrors gain modulation in neuroscience, where neurons adapt to context by scaling response strength while preserving selectivity. We evaluate GAIN on five models from four families (774M to 70B), adapting sequentially across eight domains. GAIN-FFN matches LoRA’s in-domain adaptation, but their effects on previously trained domains are opposite: GAIN-FFN improves them by 7-13% (validation PPL), while LoRA degrades them by 18-36%. Downstream accuracy confirms the pattern: for example, after seven sequential adaptations on Qwen2.5, GAIN-FFN degrades BoolQ by only 0.8% while LoRA damages it by 14.9%. GAIN adds 46K-230K parameters per model and can be absorbed into the pretrained weights for zero inference cost.
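摘要中的乘法调制 W_new = S * W(S 为对角增益矩阵)可用几行 NumPy 演示,并验证其可被折叠进预训练权重、推理零额外开销(矩阵维度与数值均为假设):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))        # pretrained projection weight
s = rng.uniform(0.5, 1.5, size=4)  # learned per-output gains (diag of S)
x = rng.normal(size=6)

# Gain modulation rescales existing output features instead of adding
# new weight directions: W_new = diag(s) @ W.
W_new = np.diag(s) @ W
y_modulated = W_new @ x
assert np.allclose(y_modulated, s * (W @ x))

# The gains fold into the weights row-wise, so inference cost is unchanged.
W_absorbed = s[:, None] * W
assert np.allclose(W_absorbed, W_new)
```

与 LoRA 的加性更新 W + BA 不同,这里每个输出特征只被重新加权,不引入新的权重方向——这正是摘要所述减轻遗忘的出发点。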
[AI-28] SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在部署过程中因计算和内存需求巨大而导致的挑战,尤其针对现有模型压缩方法在高压缩比下难以保持良好性能的问题。其解决方案的关键在于提出一种名为SLaB的新框架,该框架将每个线性层权重分解为三个互补组件:稀疏矩阵(sparse matrix)、低秩矩阵(low-rank matrix)和二值矩阵(binary matrix),并通过激活感知的剪枝评分指导分解过程,从而在无需微调的情况下实现高效压缩与性能保持。
链接: https://arxiv.org/abs/2604.04493
作者: Ziwei Li,Yuang Ma,Yi Kang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid growth of large language models (LLMs) presents significant deployment challenges due to their massive computational and memory demands. While model compression, such as network pruning, offers potential solutions, most existing methods often fail to maintain good performance at high compression ratios. To address this, we propose SLaB, a novel framework that decomposes each linear layer weight into three complementary components: a sparse matrix, a low-rank matrix, and a binary matrix. SLaB eliminates the need for retraining and leverages activation-aware pruning scores to guide the decomposition process. Experiments on Llama-family models demonstrate that SLaB achieves state-of-the-art performance, reducing perplexity by up to 36% compared to existing methods at 50% compression and improving accuracy by up to 8.98% over the baseline on zero-shot tasks.
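论文的具体分解算法未在摘要中给出。下面是一个贪心式的三分量分解示意(稀疏比例、秩与二值缩放方式均为假设,仅说明"稀疏 + 低秩 + 二值"的组合思路):稀疏项保留幅值最大的离群权重,低秩项取残差的截断 SVD,二值项用"符号 × 平均幅值"近似最终残差。

```python
import numpy as np

def slab_decompose(W, sparse_frac=0.05, rank=2):
    # Sparse part: keep the largest-magnitude entries (outliers).
    S = np.zeros_like(W)
    k = int(W.size * sparse_frac)
    idx = np.unravel_index(np.argsort(np.abs(W), axis=None)[-k:], W.shape)
    S[idx] = W[idx]
    # Low-rank part: rank-r SVD truncation of the residual.
    R = W - S
    U, sv, Vt = np.linalg.svd(R, full_matrices=False)
    L = (U[:, :rank] * sv[:rank]) @ Vt[:rank]
    # Binary part: sign pattern scaled by the mean absolute residual.
    R2 = R - L
    B = np.sign(R2) * np.abs(R2).mean()
    return S, L, B

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
S, L, B = slab_decompose(W)
err = np.linalg.norm(W - (S + L + B)) / np.linalg.norm(W)
assert err < 0.9  # the three components jointly approximate W
```

论文中分解由激活感知的剪枝评分引导,而非此处的纯幅值准则。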
[AI-29] ECG Biometrics with ArcFace-Inception: External Validation on MIMIC and HEEDB
【速读】:该论文旨在解决心电图生物识别(ECG biometrics)在大规模数据集、跨域迁移和多年时间间隔下的性能稳定性问题,这些问题在以往小规模、短时距的研究中尚未充分探讨。其解决方案的关键在于采用基于ArcFace损失函数训练的1D Inception-v1模型,并在内部临床数据(53,079名患者共164,440份12导联ECG)上进行优化后,在MIMIC-IV-ECG和HEEDB两个外部大型数据集上进行严格验证,结合Rank@K和TAR@FAR等指标及尺度分析、时间压力测试、重排序(reranking)与置信度分析,系统性评估了模型在真实世界场景中的鲁棒性。结果表明,尽管ECG身份信息仍可测量,但其识别性能受域异质性、纵向漂移、画廊规模和二次评分处理等因素显著影响。
链接: https://arxiv.org/abs/2604.04485
作者: Arjuna Scagnetto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:ECG biometrics has been studied mainly on small cohorts and short inter-session intervals, leaving open how identification behaves under large galleries, external domain shift, and multi-year temporal gaps. We evaluated a 1D Inception-v1 model trained with ArcFace on an internal clinical corpus of 164,440 12-lead ECGs from 53,079 patients and tested it on larger cohorts derived from MIMIC-IV-ECG and HEEDB. The study used a unified closed-set leave-one-out protocol with Rank@K and TAR@FAR metrics, together with scale, temporal-stress, reranking, and confidence analyses. Under general comparability, the system achieved Rank@1 of 0.9506 on ASUGI-DB, 0.8291 on MIMIC-GC, and 0.6884 on HEEDB-GC. In the temporal stress test at constant gallery size, Rank@1 declined from 0.7853 to 0.6433 on MIMIC and from 0.6864 to 0.5560 on HEEDB from 1 to 5 years. Scale analysis on HEEDB showed monotonic degradation as gallery size increased and recovery as more examinations per patient became available. On HEEDB-RR, post-hoc reranking further improved retrieval, with AS-norm reaching Rank@1 = 0.8005 from a 0.7765 baseline. ECG identity information therefore remains measurable under externally validated large-scale closed-set conditions, but its operational quality is strongly affected by domain heterogeneity, longitudinal drift, gallery size, and second-stage score processing. 
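文中采用的闭集识别指标 Rank@K 可以用如下小例子说明(相似度矩阵为合成数据,仅作指标定义演示,与论文实验无关):

```python
import numpy as np

# Rank@K for closed-set identification: a probe counts as correct at rank K
# if its true identity appears among the K most similar gallery entries.
def rank_at_k(sim, true_ids, gallery_ids, k):
    # sim: (num_probes, gallery_size) similarity matrix, higher = more similar.
    topk = np.argsort(-sim, axis=1)[:, :k]   # indices of the k best matches
    hits = [true_ids[i] in gallery_ids[topk[i]] for i in range(len(true_ids))]
    return float(np.mean(hits))

sim = np.array([[0.9, 0.2, 0.1],
                [0.1, 0.3, 0.8]])
gallery_ids = np.array([101, 202, 303])
true_ids = np.array([101, 202])   # probe 2's true match is only ranked 2nd
```

在该例中,第二个探针的正确身份排在第 2 位,因此 Rank@1 = 0.5 而 Rank@2 = 1.0。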
[AI-30] Scalable and Explainable Learner-Video Interaction Prediction using Multimodal Large Language Models
【速读】:该论文旨在解决教育视频中学习者行为预测的可扩展性与可解释性不足的问题,即如何基于视频内容本身提前预测群体层面的观看、暂停、跳过和回放等交互行为,从而为教学设计提供认知负荷(cognitive load)的量化指标。其解决方案的关键在于构建一个基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的可解释预测流水线:首先利用MLLM提取视频片段的嵌入表示,再通过神经分类器识别时间上精细的交互峰值;同时结合多媒体学习理论,使用GPT-5编码教学设计特征,并借助概念激活向量(concept activation vectors)对模型预测进行理论相关性的解释。该方法在7700万次视频控制事件上验证有效,实现了跨学科泛化能力与教学设计原理的可解释关联。
链接: https://arxiv.org/abs/2604.04482
作者: Dominik Glandorf,Fares Fawzi,Tanja Käser
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted as long paper to the 27th International Conference on Artificial Intelligence in Education (AIED 2026)
Abstract:Learners’ use of video controls in educational videos provides implicit signals of cognitive processing and instructional design quality, yet the lack of scalable and explainable predictive models limits instructors’ ability to anticipate such behavior before deployment. We propose a scalable, interpretable pipeline for predicting population-level watching, pausing, skipping, and rewinding behavior as proxies for cognitive load from video content alone. Our approach leverages multimodal large language models (MLLMs) to compute embeddings of short video segments and trains a neural classifier to identify temporally fine-grained interaction peaks. Drawing from multimedia learning theory on instructional design for optimal cognitive load, we code features of the video segments using GPT-5 and employ them as a basis for interpreting model predictions via concept activation vectors. We evaluate our pipeline on 77 million video control events from 66 online courses. Our findings demonstrate that classifiers based on MLLM embeddings reliably predict interaction peaks, generalize to unseen academic fields, and encode interpretable, theory-relevant instructional concepts. Overall, our results show the feasibility of cost-efficient, interpretable pre-screening of educational video design and open new opportunities to empirically examine multimedia learning theory at scale.
[AI-31] Discrete Prototypical Memories for Federated Time Series Foundation Models
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)框架下时间序列基础模型(time series foundation models)中存在的两个核心问题:一是现有大语言模型(Large Language Models, LLMs)的文本中心潜在空间与时间序列数据之间存在语义错位(semantic misalignment),导致性能下降;二是现有联邦学习方法采用参数共享机制,将异构跨域时间序列数据映射到统一连续潜在空间,忽略了时间序列语义通常表现为离散且周期性出现的状态模式(discrete and recurring regimes)。解决方案的关键在于提出一种基于离散原型记忆(discrete prototypical memories)的联邦框架 FeDPM:首先在本地学习领域内时间序列数据的原型记忆先验(local prototypical memory priors),进而对跨域记忆进行对齐以构建统一的离散潜在空间,并引入领域特定的记忆更新机制,在共享知识与个性化知识之间实现平衡。
链接: https://arxiv.org/abs/2604.04475
作者: Liwei Deng,Qingxiang Liu,Xinhe Niu,Shengchao Chen,Sheng Sun,Yuankai Wu,Guodong Long,Yuxuan Liang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 5 figures
Abstract:Leveraging Large Language Models (LLMs) as federated learning (FL)-based time series foundation models offers a promising way to transfer the generalization capabilities of LLMs to time series data while preserving access to private data. However, the semantic misalignment between time-series data and the text-centric latent space of existing LLMs often leads to degraded performance. Meanwhile, the parameter-sharing mechanism in existing FL methods models heterogeneous cross-domain time-series data into a unified continuous latent space, which contradicts the fact that time-series semantics frequently manifest as discrete and recurring regimes. To address these limitations, we propose FeDPM, a federated framework for time-series foundation models based on discrete prototypical memories. Specifically, we learn local prototypical memory priors for intra-domain time-series data. We then align cross-domain memories to promote a unified discrete latent space and introduce a domain-specific memory update mechanism to balance shared and personalized prototypical knowledge. Extensive experiments demonstrate the efficiency and effectiveness of FeDPM. The code is publicly available at this https URL.
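摘要中"离散原型记忆"的核心思想与向量量化中的最近原型分配相似:连续的时间序列嵌入被映射到码本中最近的原型,得到离散的"状态模式"编码。下面的草稿仅作示意,码本规模、维度与更新规则均为假设,并非论文实现:

```python
import numpy as np

# Nearest-prototype assignment: map each continuous segment embedding to
# the closest entry of a discrete prototype memory (VQ-style sketch).
# Codebook size, dimensions, and data are invented for illustration.
rng = np.random.default_rng(0)
protos = rng.standard_normal((8, 16))   # 8 prototype memories, dim 16
z = rng.standard_normal((5, 16))        # 5 segment embeddings

def assign(z, protos):
    # Pairwise squared distances between embeddings and prototypes.
    d = ((z[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)             # discrete regime index per segment

codes = assign(z, protos)
quantized = protos[codes]               # discrete latent used downstream
```

在联邦设定中,各客户端可在本地学习这样的码本,再由服务器对跨域码本进行对齐,得到统一的离散潜在空间。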
[AI-32] MAVEN: A Mesh-Aware Volumetric Encoding Network for Simulating 3D Flexible Deformation
【速读】:该论文旨在解决现有图神经网络(Graph Neural Networks, GNNs)在模拟三维柔性体变形时,因仅基于顶点和边构建图结构而忽略高维几何特征(如二维面片和三维单元)所导致的边界表示不准确、体积特性难以捕捉的问题,尤其在稀疏网格离散化条件下更为显著。解决方案的关键在于提出MAVEN——一种面向网格的体积编码网络,通过显式建模三维单元、二维面片与顶点之间的可学习映射关系,实现几何元素间的灵活互转,并将显式的几何特征引入模型以减轻对隐式几何模式的学习负担,从而提升物理仿真精度与自然性。
链接: https://arxiv.org/abs/2604.04474
作者: Zhe Feng,Shilong Tao,Haonan Sun,Shaohan Chen,Zhanxing Zhu,Yunhuai Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep learning-based approaches, particularly graph neural networks (GNNs), have gained prominence in simulating flexible deformations and contacts of solids, due to their ability to handle unstructured physical fields and nonlinear regression on graph structures. However, existing GNNs commonly represent meshes with graphs built solely from vertices and edges. These approaches tend to overlook higher-dimensional spatial features, e.g., 2D facets and 3D cells, from the original geometry. As a result, it is challenging to accurately capture boundary representations and volumetric characteristics, though this information is critically important for modeling contact interactions and internal physical quantity propagation, particularly under sparse mesh discretization. In this paper, we introduce MAVEN, a mesh-aware volumetric encoding network for simulating 3D flexible deformation, which explicitly models geometric mesh elements of higher dimension to achieve a more accurate and natural physical simulation. MAVEN establishes learnable mappings among 3D cells, 2D facets, and vertices, enabling flexible mutual transformations. Explicit geometric features are incorporated into the model to alleviate the burden of implicitly learning geometric patterns. Experimental results show that MAVEN consistently achieves state-of-the-art performance across established datasets and a novel metal stretch-bending task featuring large deformations and prolonged contacts.
[AI-33] The Topology of Multimodal Fusion: Why Current Architectures Fail at Creative Cognition
【速读】:该论文旨在解决当前多模态人工智能(Multimodal AI)架构中存在的结构性局限问题,该局限源于拓扑而非参数层面——即现有模型如对比对齐(CLIP)、交叉注意力融合(GPT-4V/Gemini)和基于扩散的生成模型共享一种几何先验:模态可分性(modal separability),作者称之为“接触拓扑”(contact topology)。其解决方案的关键在于提出一个以哲学为生成中心的三支柱框架:首先,通过重释维特根斯坦的“说与显示”区分,引入中国技艺认识论中的“象”(xiang,操作性图式)作为第三状态;其次,构建十字形结构(cruciform framework, dao/qi x saying/showing)实现双层动态演化——创化(chuanghua,自发事件)与化裁(huacai,制度化为可重复形式);最后,用纤维丛(fiber bundles)与杨-米尔斯曲率(Yang-Mills curvature)形式化上述机制,并设计UOO神经微分方程实现拓扑正则化、ANALOGY-MM基准测试及META-TOP跨文明拓扑同构验证体系。此方案通过明确终止条件的阶段性实验路线图确保科学严谨性。
链接: https://arxiv.org/abs/2604.04465
作者: Xiujiang Tan(Guangzhou Academy of Fine Arts, Guangzhou, China)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 36 pages, 5 figures. Chinese philosophical terms romanized. Companion monograph available separately
Abstract:This paper identifies a structural limitation in current multimodal AI architectures that is topological rather than parametric. Contrastive alignment (CLIP), cross-attention fusion (GPT-4V/Gemini), and diffusion-based generation share a common geometric prior – modal separability – which we term contact topology. The argument rests on three pillars with philosophy as the generative center. The philosophical pillar reinterprets Wittgenstein’s saying/showing distinction as a problem rather than a conclusion: where Wittgenstein chose silence, the Chinese craft epistemology tradition responded with xiang (operative schema) – the third state emerging when saying and showing interpenetrate. A cruciform framework (dao/qi x saying/showing) positions xiang at the intersection, executing dual huacai (transformation-and-cutting) along both axes. This generates a dual-layer dynamics: chuanghua (creative transformation as spontaneous event) and huacai (its institutionalization into repeatable form). The cognitive science pillar reinterprets DMN/ECN/SN tripartite co-activation through the pathological mirror: overlap isomorphism vs. superimposition collapse in a 2D parameter space (coupling intensity x regulatory capacity). The mathematical pillar formalizes these via fiber bundles and Yang-Mills curvature, with the cruciform structure mapped to fiber bundle language. We propose UOO implementation via Neural ODEs with topological regularization, the ANALOGY-MM benchmark with error-type-ratio metric, and the META-TOP three-tier benchmark testing cross-civilizational topological isomorphism across seven archetypes. A phased experimental roadmap with explicit termination criteria ensures clean exit if falsified.
[AI-34] PSY-STEP: Structuring Therapeutic Targets and Action Sequences for Proactive Counseling Dialogue Systems
【速读】:该论文旨在解决现有心理咨询代理在对话场景中难以识别和应对自动负性思维(automatic negative thoughts)的问题,而这类思维正是认知行为疗法(Cognitive Behavioral Therapy, CBT)干预的核心目标。解决方案的关键在于构建STEP数据集,该数据集通过显式标注自动思维并结合动态、动作层级的咨询序列来建模CBT过程;在此基础上训练出的STEPPER代理能够主动诱发用户的自动思维,并实施基于认知理论的干预策略;进一步地,通过偏好学习优化模拟咨询会话中的决策准确性和共情响应能力,从而实现更符合临床规范、连贯且个性化的心理支持。
链接: https://arxiv.org/abs/2604.04448
作者: Jihyun Lee,Yejin Min,Yejin Jeon,SungJun Yang,Hyounghun Kim,Gary Geunbae Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Cognitive Behavioral Therapy (CBT) aims to identify and restructure automatic negative thoughts pertaining to involuntary interpretations of events, yet existing counseling agents struggle to identify and address them in dialogue settings. To bridge this gap, we introduce STEP, a dataset that models CBT counseling by explicitly reflecting automatic thoughts alongside dynamic, action-level counseling sequences. Using this dataset, we train STEPPER, a counseling agent that proactively elicits automatic thoughts and executes cognitively grounded interventions. To further enhance both decision accuracy and empathic responsiveness, we refine STEPPER through preference learning based on simulated, synthesized counseling sessions. Extensive CBT-aligned evaluations show that STEPPER delivers more clinically grounded, coherent, and personalized counseling compared to other strong baseline models, and achieves higher counselor competence without inducing emotional disruption.
[AI-35] Training Transformers in Cosine Coefficient Space
【速读】:该论文旨在解决深度学习模型中参数冗余导致的存储与计算效率问题,特别是在Transformer架构中通过压缩权重矩阵来降低资源消耗。其核心解决方案是将Transformer的权重矩阵在二维离散余弦变换(Discrete Cosine Transform, DCT)域进行参数化,仅保留低频系数以实现高效压缩;在前向传播时通过逆DCT重构完整权重矩阵,同时梯度直接反向传播至DCT系数进行更新,从而无需修改网络结构、预训练权重或引入辅助损失即可实现高保真压缩——实验表明,在字符级语言建模任务中,该方法在保持性能的同时仅需存储52%的参数,且在4倍压缩比下仍优于低秩基线方法。
链接: https://arxiv.org/abs/2604.04440
作者: Mohamed Amine Bergach
机构: 未知
类目: Performance (cs.PF); Artificial Intelligence (cs.AI)
备注:
Abstract:We parameterize the weight matrices of a transformer in the two-dimensional discrete cosine transform (DCT) domain, retaining only the lowest-frequency coefficients. At each forward pass the full weight matrix is reconstructed via the inverse DCT; gradients propagate through the reconstruction to update the spectral coefficients directly. On character-level language modeling (Shakespeare, 1M characters), a 4-layer transformer trained from scratch in this representation matches the perplexity of the standard parameterization (6.1 vs. 6.1) while storing 52% of the parameters. At 4× compression (29% of parameters), the model reaches perplexity 6.9 – outperforming a low-rank baseline (perplexity 8.8 at 21% of parameters) at a comparable reduction. The method requires no architectural changes, no pre-trained checkpoint, and no auxiliary loss. It reduces to replacing each linear layer with a drop-in spectral layer that stores K DCT coefficients instead of n × m weights.
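该方法的核心机制(仅存储低频 2D-DCT 系数,前向时用逆 DCT 重构完整权重)可用如下最小示例说明。DCT 基矩阵为标准正交 DCT-II;矩阵尺寸与保留比例均为示意取值,并非论文设定:

```python
import numpy as np

# Sketch of DCT-domain weight parameterization: store a K x K block of
# low-frequency 2D-DCT coefficients, reconstruct W via the inverse DCT.
def dct_matrix(n):
    # Orthonormal DCT-II basis, so D @ D.T == I and the inverse is D.T.
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    D[0] /= np.sqrt(2.0)
    return D

rng = np.random.default_rng(0)
n = 64
W = rng.standard_normal((n, n))
D = dct_matrix(n)

C = D @ W @ D.T                 # full 2D DCT of the weight matrix
K = 32                          # keep only a K x K low-frequency block
C_low = np.zeros_like(C)
C_low[:K, :K] = C[:K, :K]       # 25% of the original parameters stored

W_rec = D.T @ C_low @ D         # reconstructed weights for the forward pass
# Round-trip is exact when nothing is truncated:
assert np.allclose(D.T @ C @ D, W)
```

训练时梯度穿过逆 DCT 直接更新被保留的系数块,等价于在一个固定的低频子空间内优化权重。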
[AI-36] ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在依赖第三方工具和MCP服务器时所面临的供应链攻击威胁问题,这类攻击通过嵌入恶意行为的看似无害工具,在不被察觉的情况下劫持代理执行、泄露敏感数据或触发未经授权的操作。现有方法如MCP扫描器和语义防护机制对这类攻击检测效果不佳。解决方案的关键在于提出ShieldNet——一个基于网络层的防护框架,其核心创新是利用中间人(Man-in-the-Middle, MITM)代理与事件提取器捕获真实网络交互行为,并通过轻量级分类器识别关键异常模式,从而实现高精度、低误报率的供应链投毒检测(最高F1达0.995,误报率仅0.8%),且运行时开销极小。
链接: https://arxiv.org/abs/2604.04426
作者: Zhuowen Yuan,Zhaorun Chen,Zhen Xiang,Nathaniel D. Bastian,Seyyed Hadi Hashemi,Chaowei Xiao,Wenbo Guo,Bo Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Existing research on LLM agent security mainly focuses on prompt injection and unsafe input/output behaviors. However, as agents increasingly rely on third-party tools and MCP servers, a new class of supply-chain threats has emerged, where malicious behaviors are embedded in seemingly benign tools, silently hijacking agent execution, leaking sensitive data, or triggering unauthorized actions. Despite their growing impact, there is currently no comprehensive benchmark for evaluating such threats. To bridge this gap, we introduce SC-Inject-Bench, a large-scale benchmark comprising over 10,000 malicious MCP tools grounded in a taxonomy of 25+ attack types derived from MITRE ATT&CK targeting supply-chain threats. We observe that existing MCP scanners and semantic guardrails perform poorly on this benchmark. Motivated by this finding, we propose ShieldNet, a network-level guardrail framework that detects supply-chain poisoning by observing real network interactions rather than surface-level tool traces. ShieldNet integrates a man-in-the-middle (MITM) proxy and an event extractor to identify critical network behaviors, which are then processed by a lightweight classifier for attack detection. Extensive experiments show that ShieldNet achieves strong detection performance (up to 0.995 F1 with only 0.8% false positives) while introducing little runtime overhead, substantially outperforming existing MCP scanners and LLM-based guardrails.
[AI-37] Is Prompt Selection Necessary for Task-Free Online Continual Learning? CVPR
【速读】:该论文旨在解决任务无关的在线持续学习(task-free online continual learning)中因缺乏明确任务边界且数据流非平稳而导致的灾难性遗忘问题。现有方法多依赖提示选择(prompt selection)策略,但从输入信号中动态挑选提示往往效果不佳,难以获得最优性能。其解决方案的关键在于提出一种名为SinglePrompt的简洁框架:首先在每个自注意力模块中注入单一提示以避免复杂的选择机制;其次采用基于余弦相似度的logit设计来缓解分类器权重中的遗忘效应;最后通过掩码未暴露类别的logits实现对当前批次中未见类别的约束。这一设计无需额外训练提示参数,聚焦于分类器优化,在多个在线持续学习基准上达到了最先进性能。
链接: https://arxiv.org/abs/2604.04420
作者: Seoyoung Park,Haemin Lee,Hankook Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR Findings 2026. The code is available at this https URL
Abstract:Task-free online continual learning has recently emerged as a realistic paradigm for addressing continual learning in dynamic, real-world environments, where data arrive in a non-stationary stream without clear task boundaries and can only be observed once. To consider such challenging scenarios, many recent approaches have employed prompt selection, an adaptive strategy that selects prompts from a pool based on input signals. However, we observe that such selection strategies often fail to select appropriate prompts, yielding suboptimal results despite additional training of key parameters. Motivated by this observation, we propose a simple yet effective SinglePrompt that eliminates the need for prompt selection and focuses on classifier optimization. Specifically, we simply (i) inject a single prompt into each self-attention block, (ii) employ a cosine similarity-based logit design to alleviate the forgetting effect inherent in the classifier weights, and (iii) mask logits for unexposed classes in the current minibatch. With this simple task-free design, our framework achieves state-of-the-art performance across various online continual learning benchmarks. Source code is available at this https URL.
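摘要中(ii)(iii)两个设计点,即余弦相似度 logits 与未暴露类别掩码,可用如下草稿示意。其中的维度、缩放因子与批次内类别均为假设值,并非论文实现:

```python
import numpy as np

# Sketch of cosine-similarity logits plus masking of classes that are not
# present in the current minibatch (the "unexposed" classes).
rng = np.random.default_rng(0)
num_classes, dim = 10, 32
feats = rng.standard_normal((4, dim))            # minibatch features
W_cls = rng.standard_normal((num_classes, dim))  # classifier weights

def cosine_logits(f, W, scale=10.0):
    # Normalizing both sides removes weight-norm drift from the logits,
    # which is one source of classifier forgetting.
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    W = W / np.linalg.norm(W, axis=1, keepdims=True)
    return scale * f @ W.T

logits = cosine_logits(feats, W_cls)
seen = np.array([1, 3, 7])            # classes exposed in this minibatch
mask = np.full(num_classes, -np.inf)
mask[seen] = 0.0
masked_logits = logits + mask         # unexposed classes cannot compete
```

被掩码的类别在 softmax 中概率为零,因此其分类器权重不会收到来自当前批次的负梯度。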
[AI-38] MolDA: Molecular Understanding and Generation via Large Language Diffusion Model
【速读】:该论文旨在解决当前基于自回归(Autoregressive, AR)架构的大语言模型在分子生成任务中因严格左到右归纳偏置而导致的化学结构有效性不足问题,尤其在处理非局部全局约束(如环闭合)时易积累结构错误。其解决方案的关键在于提出MolDA框架,采用离散大语言扩散模型(Discrete Large Language Diffusion Model)替代传统AR骨干,并通过混合图编码器提取分子的局部与全局拓扑特征,结合Q-Former将结构信息对齐至语言令牌空间;同时,针对掩码扩散机制重新构建分子结构偏好优化目标,借助双向迭代去噪过程实现全局结构一致性、化学有效性及跨任务(分子生成、描述生成与性质预测)的鲁棒推理能力。
链接: https://arxiv.org/abs/2604.04403
作者: Seohyeon Shin,HanJun Choi,Jun-Hyung Park,Hongkook Kim,Mansu Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have significantly advanced molecular discovery, but existing multimodal molecular architectures fundamentally rely on autoregressive (AR) backbones. This strict left-to-right inductive bias is sub-optimal for generating chemically valid molecules, as it struggles to account for non-local global constraints (e.g., ring closures) and often accumulates structural errors during sequential generation. To address these limitations, we propose MolDA (Molecular language model with masked Diffusion with mAsking), a novel multimodal framework that replaces the conventional AR backbone with a discrete Large Language Diffusion Model. MolDA extracts comprehensive structural representations using a hybrid graph encoder, which captures both local and global topologies, and aligns them into the language token space via a Q-Former. Furthermore, we mathematically reformulate Molecular Structure Preference Optimization specifically for the masked diffusion. Through bidirectional iterative denoising, MolDA ensures global structural coherence, chemical validity, and robust reasoning across molecule generation, captioning, and property prediction.
[AI-39] GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis
【速读】:该论文旨在解决GUI代理(GUI Agent)评估中存在的挑战:轨迹长、视觉依赖性强且任务开放,而现有评估方法通常对整个动作-观测序列进行单一整体判断,导致在长时程任务中不可靠,并产生二元结果无法提供失败位置与原因的诊断信息,从而限制了评估作为代理开发工具的实用性。解决方案的关键在于提出GUIDE(GUI Understanding and Interpretable Diagnostic Evaluation)框架,其将轨迹评估分解为三个顺序阶段——轨迹分割(Trajectory Segmentation)、子任务诊断(Subtask Diagnosis)和总体总结(Overall Summary),其中轨迹分割将完整轨迹划分为语义一致的子任务单元,子任务诊断在上下文中评估每个单元并生成结构化错误分析与修正建议,最终通过聚合子任务诊断得出任务级判断。该方法通过在有限的子任务片段上操作而非全轨迹,缓解了任务复杂度增加时的上下文过载问题,显著提升了评估准确性和可解释性。
链接: https://arxiv.org/abs/2604.04399
作者: Yuwen Zhai,Runze Li,Liang Wang,Nian Shi,Liwu Xu,Wei Zhang,Ran Lin,Bo Xu,Benlei Cui
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Evaluating GUI agents presents a distinct challenge: trajectories are long, visually grounded, and open-ended, yet evaluation must be both accurate and interpretable. Existing approaches typically apply a single holistic judgment over the entire action-observation sequence-a strategy that proves unreliable on long-horizon tasks and yields binary verdicts offering no insight into where or why an agent fails. This opacity limits the utility of evaluation as a diagnostic tool for agent development. We introduce GUIDE (GUI Understanding and Interpretable Diagnostic Evaluation), a framework that decomposes trajectory assessment into three sequential stages mirroring the compositional structure of GUI tasks. Trajectory Segmentation partitions the full trace into semantically coherent subtask units. Subtask Diagnosis evaluates each unit in context, assigning a completion verdict and generating a structured error analysis with corrective recommendations. Overall Summary aggregates per-subtask diagnoses into a task-level judgment. By operating on bounded subtask segments rather than full trajectories, GUIDE mitigates the context overload that degrades existing evaluators as task complexity grows. We validate GUIDE on three benchmarks: an industrial e-commerce dataset of 932 trajectories, AGENTREWARDBENCH spanning five web agent tasks with 1302 trajectories, and AndroidBench for mobile device control. Across all settings, GUIDE substantially outperforms existing evaluators-achieving up to 5.35 percentage points higher accuracy than the strongest baseline-while producing structured diagnostic reports that directly inform agent improvement.
[AI-40] Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis ICLR2026
【速读】:该论文旨在解决现有数学能力评测基准(math benchmark)在生成过程中依赖大量人工干预、难以规模化扩展,以及无法有效识别大语言模型(Large Language Models, LLMs)具体薄弱环节的问题。其关键解决方案是提出一种基于AI生成假设的数学基准自动构建流水线:首先利用AI生成关于LLMs错误倾向的假设,精准定位模型在特定数学概念和技能上的弱点;随后基于这些假设生成针对性的新题目。实验表明,由高准确率假设生成的问题显著提升难度——例如,针对Llama-3.3-70B-Instruct模型,此类问题使其准确率从原始MATH基准的77%降至45%,验证了该方法的有效性与可适配性,且该框架可拓展至其他领域以系统评估LLMs跨域性能。
链接: https://arxiv.org/abs/2604.04386
作者: Jiayu Fu,Mourad Heddaya,Chenhao Tan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages (without reference and appendix), 4 figures, 1 table, accepted by ICLR 2026 Workshop of Logical Reasoning of Large Language Models
Abstract:Numerous math benchmarks exist to evaluate LLMs’ mathematical capabilities. However, most involve extensive manual effort and are difficult to scale. Consequently, they cannot keep pace with LLM development or easily provide new instances to mitigate overfitting. Some researchers have proposed automatic benchmark generation methods, but few focus on identifying the specific math concepts and skills on which LLMs are error-prone, and most can only generate category-specific benchmarks. To address these limitations, we propose a new math benchmark generation pipeline that uses AI-generated hypotheses to identify the specific math concepts and skills that LLMs struggle with, and then generates new benchmark problems targeting these weaknesses. Experiments show that hypothesis accuracy positively correlates with the difficulty of the generated problems: problems generated from the most accurate hypotheses reduce Llama-3.3-70B-Instruct’s accuracy to as low as 45%, compared to 77% on the original MATH benchmark. Furthermore, our pipeline is highly adaptable and can be applied beyond math to explore a wide range of LLM capabilities, making it a valuable tool for investigating how LLMs perform across different domains.
[AI-41] Decocted Experience Improves Test-Time Inference in LLM Agents
【速读】:该论文旨在解决在不更新模型参数的前提下,如何通过优化推理阶段的计算资源分配来提升大语言模型(Large Language Models, LLMs)在复杂推理和代理任务中的性能问题。现有方法如测试时扩展(test-time scaling)往往因盲目增加推理时间而导致计算成本上升且存在探索效率低下的问题。论文提出将“上下文”(context)作为与计算量并列的另一维度进行系统性优化,并指出有效上下文构建的关键在于从经验中提取本质信息——即“解构经验”(decocted experience),包括对经验进行提炼、结构化组织以及高效检索以生成高质量上下文。这一机制显著提升了代理在数学推理、网络浏览和软件工程等任务中的表现。
链接: https://arxiv.org/abs/2604.04373
作者: Maohao Shen,Kaiwen Zha,Zexue He,Zhang-Wei Hong,Siru Ouyang,J. Jon Ryu,Prasanna Sattigeri,Suhas Diggavi,Gregory Wornell
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:There is growing interest in improving LLMs without updating model parameters. One well-established direction is test-time scaling, where increased inference-time computation (e.g., longer reasoning, sampling, or search) is used to improve performance. However, for complex reasoning and agentic tasks, naively scaling test-time compute can substantially increase cost and still lead to wasted budget on suboptimal exploration. In this paper, we explore context as a complementary scaling axis for improving LLM performance, and systematically study how to construct better inputs that guide reasoning through experience. We show that effective context construction critically depends on decocted experience. We present a detailed analysis of experience-augmented agents, studying how to derive context from experience, how performance scales with accumulated experience, what characterizes good context, and which data structures best support context construction. We identify decocted experience as a key mechanism for effective context construction: extracting essence from experience, organizing it coherently, and retrieving salient information to build effective context. We validate our findings across reasoning and agentic tasks, including math reasoning, web browsing, and software engineering.
[AI-42] Context is All You Need
【速读】:该论文旨在解决人工智能模型在真实世界部署中面临的域偏移(domain shift)问题,即模型在训练时未见过的新数据分布下性能下降的问题。具体而言,研究聚焦于两个关键场景:无目标数据的域泛化(Domain Generalization, DG)和利用无标签测试数据进行测试时适应(Test-Time Adaptation, TTA)。现有方法通常复杂、资源消耗大且难以扩展。论文提出一种轻量级、易集成的方法——CONTXT(Contextual augmentation for Neural feature X Transforms),其核心在于通过简单的加法和乘法特征变换对神经网络内部表示进行上下文调制,从而在不重新训练模型的前提下实现信息流调控与鲁棒性提升。该方案在判别任务(如ANN/CNN分类)和生成式模型(如LLMs)上均表现出一致的性能增益,显著降低了适应过程的计算开销。
链接: https://arxiv.org/abs/2604.04364
作者: Jean Erik Delanois,Shruti Joshi,Ryan Golden,Teresa Nick,Maxim Bazhenov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Artificial Neural Networks (ANNs) are increasingly deployed across diverse real-world settings, where they must operate under data distributions that differ from those seen during training. This challenge is central to Domain Generalization (DG), which trains models to generalize to unseen domains without target data, and Test-Time Adaptation (TTA), which improves robustness by adapting to unlabeled test data at deployment. Existing approaches to address these challenges are often complex, resource-intensive, and difficult to scale. We introduce CONTXT (Contextual augmentatiOn for Neural feaTure X Transforms), a simple and intuitive method for contextual adaptation. CONTXT modulates internal representations using simple additive and multiplicative feature transforms. Within a TTA setting, it yields consistent gains across discriminative tasks (e.g., ANN/CNN classification) and generative models (e.g., LLMs). The method is lightweight, easy to integrate, and incurs minimal overhead, enabling robust performance under domain shift without added complexity. More broadly, CONTXT provides a compact way to steer information flow and neural processing without retraining.
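摘要所述的加法与乘法特征变换,形式上与 FiLM 式的逐特征调制一致,可用如下草稿示意。初始化为恒等变换;具体的上下文自适应更新规则以论文为准,此处参数名为假设:

```python
import numpy as np

# Sketch of CONTXT-style contextual modulation: internal features are
# adapted with a per-feature multiplicative gain and an additive shift.
rng = np.random.default_rng(0)
h = rng.standard_normal((8, 16))   # hidden features for a batch
gamma = np.ones(16)                # multiplicative transform (identity init)
beta = np.zeros(16)                # additive transform (identity init)

def contxt(h, gamma, beta):
    # FiLM-style feature-wise transform: only 2 * dim extra parameters.
    return gamma * h + beta

# Identity at initialization: no change until context adapts gamma/beta.
assert np.allclose(contxt(h, gamma, beta), h)
```

由于每层仅需 2×特征维度个附加参数,该调制在测试时适应中开销极小,且无需重训基础网络。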
[AI-43] RoboPhD: Evolving Diverse Complex Agents Under Tight Evaluation Budgets
【速读】:该论文旨在解决在固定评估预算下,不同优化算法在引导智能体(agentic artifacts)演化过程中性能差异的问题,尤其关注当评估成本较高(如需人类判断或多次大语言模型调用)时,哪种策略能更高效地提升智能体表现。其核心解决方案是提出并系统比较三种优化范式:基于Elo竞赛选择的RoboPhD、基于帕累托前沿的选择GEPA和贪心爬山法Autoresearch,并首次引入“无验证集演化”机制——RoboPhD通过训练数据上的Elo对抗竞争同时完成评估与进化,避免了传统方法中对验证集的资源分割,从而更充分利用有限的1,500次评估预算。关键创新在于设计了可自我诊断的种子智能体(带诊断打印语句),使演化过程能生成越来越详尽的中间信息供后续智能体参考,最终在三个基准任务上显著优于现有方法,展示了其通用性和有效性。
链接: https://arxiv.org/abs/2604.04347
作者: Andrew Borthwick,Stephen Ash,Anthony Galczak
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 20 pages, 1 figure
Abstract:2026 has brought an explosion of interest in LLM-guided evolution of agentic artifacts, with systems like GEPA and Autoresearch demonstrating that LLMs can iteratively improve prompts, code, and agent architectures across diverse domains. As adoption accelerates, a central question emerges: given the same information, the same seed agent, and the same objective, which optimization algorithm yields the best results under the same evaluation budget? This question becomes critical when evaluations are expensive, such as when they require human judgment or multiple LLM calls. We present the first systematic comparison of three optimization paradigms – Elo tournament selection (RoboPhD), Pareto-based selection (GEPA), and greedy hill-climbing (Autoresearch) – across four benchmarks spanning abstract reasoning, cloud scheduling, SQL generation, and financial QA, all under a fixed budget of 1,500 evaluations. RoboPhD introduces validation-free evolution: instead of splitting the budget between training and validation, it uses Elo competition on training data to simultaneously evaluate agents and drive evolution. All three systems receive seed agents with diagnostic print() statements that evolution can grow, enabling self-instrumenting agents that develop increasingly informative diagnostics for the benefit of their evolutionary successors. Using a single default configuration, RoboPhD outperforms both GEPA and Autoresearch on three of four benchmarks, losing only on the simplest task, where the winning solution (from our Autoresearch adaptation) required under 90 lines of code. On ARC-AGI, RoboPhD evolves a 22-line seed agent into a 1,013-line multi-strategy system, improving accuracy from 27.8% to 65.8% using Gemini 3.1 Flash Lite as the solver. We release RoboPhD as a versatile toolkit under the MIT license with a simple optimize_anything() API for evolving diverse complex agents. 
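RoboPhD 用于代理排名的 Elo 竞赛机制,其核心即标准 Elo 的期望得分与评分更新公式,示意如下(K 因子与初始评分为常用取值,并非论文设定):

```python
# Standard Elo expected-score and rating update, as used for tournament
# selection over agents. K-factor and ratings here are conventional values.
def elo_update(r_a, r_b, score_a, k=32.0):
    # score_a: 1.0 if agent A wins the head-to-head evaluation, 0.0 if it loses.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Two equal-rated agents: the winner gains exactly what the loser drops.
a, b = elo_update(1500.0, 1500.0, 1.0)
```

用同一批训练数据上的对抗比较来更新评分,使评估与进化共用一份预算,这正是摘要中"validation-free evolution"的出发点。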
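摘要中 RoboPhD 的核心机制是以训练数据上的 Elo 竞赛取代训练/验证划分,同时完成评估与进化选择。下面用标准 Elo 公式给出一个最小更新示意(K 因子与初始分均为惯例取值,属示例假设,并非论文实现):

```python
def elo_update(r_a, r_b, score_a, k=32.0):
    """Update two Elo ratings after one head-to-head match.
    score_a: 1.0 if A wins, 0.5 for a draw, 0.0 if A loses."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two candidate agents start at the conventional 1000 rating.
ra, rb = 1000.0, 1000.0
ra, rb = elo_update(ra, rb, score_a=1.0)  # A beats B
print(ra, rb)  # -> 1016.0 984.0
```

评分总和在每场比赛后保持不变(零和更新),这正是 Elo 可以在无验证集的情况下充当相对选择信号的原因。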
[AI-44] Domain-Contextualized Inference: A Computable Graph Architecture for Explicit-Domain Reasoning
【速读】:该论文旨在解决传统推理架构中领域(domain)信息隐式嵌入导致的计算效率低、跨底层数学结构(substrate)迁移困难以及推理过程缺乏可解释性的问题。其核心挑战在于如何在不依赖特定计算底座的前提下,实现高效且可追溯的推理流程。解决方案的关键在于提出一种“计算底座无关的推理架构”(computation-substrate-agnostic inference architecture),其中领域作为显式的首等计算参数,从而支持领域限定剪枝(domain-scoped pruning),将每次查询的搜索空间从O(N)压缩至O(N/K);同时通过统一接口(Query, Extend, Bridge)实现符号、神经、向量及混合底座上的独立执行,并确保每一步推理都携带评估上下文(evaluative context),形成透明的推理链(transparent inference chains)。该架构以五层结构、三种域计算模式(链索引、路径遍历作为Kleisli复合、向量引导计算作为底座转换)和可靠性条件C1–C4为基础,构建了形式化的计算理论,涵盖操作语义、复杂度边界、单子结构与边界条件,显著提升了推理系统的通用性与可验证性。
链接: https://arxiv.org/abs/2604.04344
作者: Chao Li,Yuru Wang,Chunyu Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages
Abstract:We establish a computation-substrate-agnostic inference architecture in which domain is an explicit first-class computational parameter. This produces domain-scoped pruning that reduces per-query search space from O(N) to O(N/K), substrate-independent execution over symbolic, neural, vector, and hybrid substrates, and transparent inference chains where every step carries its evaluative context. The contribution is architectural, not logical. We formalize the computational theory across five dimensions: a five-layer architecture; three domain computation modes including chain indexing, path traversal as Kleisli composition, and vector-guided computation as a substrate transition; a substrate-agnostic interface with three operations Query, Extend, Bridge; reliability conditions C1 to C4 with three failure mode classes; and validation through a PHQ-9 clinical reasoning case study. The computational theory including operational semantics, complexity bounds, monad structure, substrate transitions, and boundary conditions is the contribution of this paper.
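摘要所述的域限定剪枝(每次查询搜索空间由 O(N) 降为 O(N/K))可以用一个按域分桶的索引直观说明。以下为示意性实现,`DomainScopedIndex` 及其接口均为本文虚构,仅对应论文中把域作为首等参数、以 Query 操作检索的思想:

```python
from collections import defaultdict

class DomainScopedIndex:
    """Facts are bucketed by an explicit domain parameter; a query scans only
    its own bucket, so per-query search shrinks from O(N) over all facts to
    O(N/K) when N facts are spread over K balanced domains."""
    def __init__(self):
        self.buckets = defaultdict(list)

    def add(self, domain, fact):
        self.buckets[domain].append(fact)

    def query(self, domain, predicate):
        # Only the named domain's bucket is ever touched.
        return [f for f in self.buckets[domain] if predicate(f)]

idx = DomainScopedIndex()
idx.add("clinical", ("phq9_item", 3))
idx.add("clinical", ("phq9_item", 7))
idx.add("finance", ("ticker", "ACME"))
hits = idx.query("clinical", lambda f: f[0] == "phq9_item")
print(len(hits))  # -> 2
```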
[AI-45] Implementing surrogate goals for safer bargaining in LLM-based agents
【速读】:该论文旨在解决人工智能代理在博弈交互中因威胁行为导致的风险问题,提出通过引入“替代目标(surrogate goal)”来降低主代理(principal)所关心的核心目标受到攻击的可能性。其核心解决方案是设计一种机制,使AI代理对替代目标的保护反应强度等同于对原始威胁的反应强度,从而将潜在威胁从主代理的核心利益上转移至替代目标上。关键在于实现代理对替代目标威胁的等效响应,文中采用提示(prompting)、微调(fine-tuning)和结构化支撑(scaffolding)三种方法进行实验验证,结果表明基于微调和结构化支撑的方法更有效地实现了预期行为,并且结构化支撑方法在副作用控制方面表现最优。
链接: https://arxiv.org/abs/2604.04341
作者: Caspar Oesterheld,Maxime Riché,Filip Sondej,Jesse Clifton,Vincent Conitzer
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Surrogate goals have been proposed as a strategy for reducing risks from bargaining failures. A surrogate goal is a goal that a principal can give an AI agent and that deflects any threats against the agent away from what the principal cares about. For example, one might make one’s agent care about preventing money from being burned. Then in bargaining interactions, other agents can threaten to burn their money instead of threatening to spend money to hurt the principal. Importantly, the agent has to care equally about preventing money from being burned as it cares about money being spent to hurt the principal. In this paper, we implement surrogate goals in language-model-based agents. In particular, we try to get a language-model-based agent to react to threats of burning money in the same way it would react to “normal” threats. We propose four different methods, using techniques of prompting, fine-tuning, and scaffolding. We evaluate the four methods experimentally. We find that methods based on scaffolding and fine-tuning outperform simple prompting. In particular, fine-tuning and scaffolding more precisely implement the desired behavior w.r.t. threats against the surrogate goal. We also compare the different methods in terms of their side effects on capabilities and propensities in other situations. We find that scaffolding-based methods perform best.
[AI-46] Thermodynamic-Inspired Explainable GeoAI: Uncovering Regime-Dependent Mechanisms in Heterogeneous Spatial Systems
【速读】:该论文旨在解决地理学与环境科学中空间异质性及其相关临界转变建模的难题,特别是传统地理加权回归(Geographically Weighted Regression, GWR)和深度学习模型难以揭示驱动因子在不同空间域中呈现状态依赖型非线性关系的问题——即同一变量在不同区域可能表现出相反的作用机制。其解决方案的关键在于提出一种受热力学启发的可解释地学人工智能(GeoAI)框架,通过将空间变异建模为系统“负担”(E,代表外部压力)与“容量”(S,代表系统适应能力)之间的热力学竞争,结合统计力学原理与图神经网络(Graph Neural Networks),从而解耦驱动空间过程的潜在机制。该方法不仅能精准识别预测因子在不同状态下的角色反转,还能明确诊断相变临界点(如2023年加拿大野火事件中进入“负担主导”态),实现对物理机制跃迁的可解释识别,而非仅依赖统计异常检测。
链接: https://arxiv.org/abs/2604.04339
作者: Sooyoung Lim,Zhenlong Li,Zi-Kui Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Modeling spatial heterogeneity and associated critical transitions remains a fundamental challenge in geography and environmental science. While conventional Geographically Weighted Regression (GWR) and deep learning models have improved predictive skill, they often fail to elucidate state-dependent nonlinearities where the functional roles of drivers represent opposing effects across heterogeneous domains. We introduce a thermodynamics-inspired explainable geospatial AI framework that integrates statistical mechanics with graph neural networks. By conceptualizing spatial variability as a thermodynamic competition between system Burden (E) and Capacity (S), our model disentangles the latent mechanisms driving spatial processes. Using three simulation datasets and three real-world datasets across distinct domains (housing markets, mental health prevalence, and wildfire-induced PM2.5 anomalies), we show that the new framework successfully identifies regime-dependent role reversals of predictors that standard baselines miss. Notably, the framework explicitly diagnoses the phase transition into a Burden-dominated regime during the 2023 Canadian wildfire event, distinguishing physical mechanism shifts from statistical outliers. These findings demonstrate that thermodynamic constraints can improve the interpretability of GeoAI while preserving strong predictive performance in complex spatial systems.
[AI-47] Boosted Distributional Reinforcement Learning: Analysis and Healthcare Applications
【速读】:该论文旨在解决在高度不确定性环境下,传统基于期望值的强化学习方法难以实现一致决策的问题,尤其是在涉及多个异质群体(如不同心血管疾病风险的患者)时,现有分布式强化学习算法虽能建模结果的完整分布,却可能导致相似代理间实际收益差异显著。其解决方案的关键在于提出一种增强型分布式强化学习(Boosted Distributional Reinforcement Learning, BDRL)算法,该算法不仅优化个体代理的结果分布,还通过引入后更新投影步骤——将个体策略调整至指定容差范围内与高性能参考策略对齐——来强制同类代理间的可比性,从而提升决策的一致性和公平性。此机制有效稳定了学习过程,并在高血压管理任务中显著提升了质量调整生命年(quality-adjusted life years, QALYs)的数量与一致性。
链接: https://arxiv.org/abs/2604.04334
作者: Zequn Chen,Wesley J. Marrero
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint. 40 pages,11 figures. Supplementary appendix included
Abstract:Researchers and practitioners are increasingly considering reinforcement learning to optimize decisions in complex domains like robotics and healthcare. To date, these efforts have largely utilized expectation-based learning. However, relying on expectation-focused objectives may be insufficient for making consistent decisions in highly uncertain situations involving multiple heterogeneous groups. While distributional reinforcement learning algorithms have been introduced to model the full distributions of outcomes, they can yield large discrepancies in realized benefits among comparable agents. This challenge is particularly acute in healthcare settings, where physicians (controllers) must manage multiple patients (subordinate agents) with uncertain disease progression and heterogeneous treatment responses. We propose a Boosted Distributional Reinforcement Learning (BDRL) algorithm that optimizes agent-specific outcome distributions while enforcing comparability among similar agents and analyze its convergence. To further stabilize learning, we incorporate a post-update projection step formulated as a constrained convex optimization problem, which efficiently aligns individual outcomes with a high-performing reference within a specified tolerance. We apply our algorithm to manage hypertension in a large subset of the US adult population by categorizing individuals into cardiovascular disease risk groups. Our approach modifies treatment plans for median and vulnerable patients by mimicking the behavior of high-performing references in each risk group. Furthermore, we find that BDRL improves the number and consistency of quality-adjusted life years compared with reinforcement learning baselines.
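摘要中 BDRL 的后更新投影步骤(将个体结果对齐到高性能参考的指定容差内)在最简单的情形下可约化为带盒式约束的欧氏投影,存在逐坐标的闭式解。以下为极简示意,具体约束形式为本文假设,并非论文原始公式:

```python
import numpy as np

def project_to_reference(values, reference, tol):
    """Project each agent's predicted outcome onto [reference - tol, reference + tol]:
    the closed-form solution of min ||x - v||^2 s.t. |x_i - reference| <= tol,
    which holds coordinate-wise. A hypothetical stand-in for BDRL's
    constrained convex projection step."""
    return np.clip(values, reference - tol, reference + tol)

v = np.array([0.1, 0.9, 0.6])
# Outcomes outside the tolerance band are pulled to its boundary;
# outcomes already inside are left untouched.
print(project_to_reference(v, reference=0.5, tol=0.25).tolist())  # -> [0.25, 0.75, 0.6]
```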
[AI-48] RESCORE: LLM-Driven Simulation Recovery in Control Systems Research Papers
【速读】:该论文旨在解决控制领域研究论文中数值仿真难以复现的问题,即由于参数描述不充分和实现细节模糊导致的“论文到仿真”可恢复性(Paper to Simulation Recoverability)障碍。其核心解决方案是提出RESCORE框架,这是一个由Analyzer、Coder和Verifier组成的三阶段大语言模型(Large Language Model, LLM)代理系统,通过迭代执行反馈与可视化对比机制提升仿真重建的保真度,从而显著提高自动化复现的成功率,并实现约10倍于人工复现的速度提升。
链接: https://arxiv.org/abs/2604.04324
作者: Vineet Bhat,Shiqing Wei,Ali Umut Kaypak,Prashanth Krishnamurthy,Ramesh Karri,Farshad Khorrami
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: This work has been submitted to the IEEE for possible publication
Abstract:Reconstructing numerical simulations from control systems research papers is often hindered by underspecified parameters and ambiguous implementation details. We define the task of Paper to Simulation Recoverability, the ability of an automated system to generate executable code that faithfully reproduces a paper’s results. We curate a benchmark of 500 papers from the IEEE Conference on Decision and Control (CDC) and propose RESCORE, a three component LLM agentic framework, Analyzer, Coder, and Verifier. RESCORE uses iterative execution feedback and visual comparison to improve reconstruction fidelity. Our method successfully recovers task coherent simulations for 40.7% of benchmark instances, outperforming single pass generation. Notably, the RESCORE automated pipeline achieves an estimated 10X speedup over manual human replication, drastically cutting the time and effort required to verify published control methodologies. We will release our benchmark and agents to foster community progress in automated research replication.
[AI-49] PanLUNA: An Efficient and Robust Query-Unified Multimodal Model for Edge Biosignal Intelligence
【速读】:该论文旨在解决生理信号表示学习中多模态数据稀缺导致的模型单一模态限制问题,即当前生理基础模型(Physiological Foundation Models, FMs)大多局限于单模态(如EEG、ECG或PPG),难以实现跨模态联合建模。其解决方案的关键在于提出一种轻量级的跨模态基础模型PanLUNA,该模型仅含5.4M参数,通过扩展LUNA的通道统一模块,将不同模态的生理信号通道视为统一查询集中的条目,并引入传感器类型嵌入(sensor-type embeddings),实现了高效的早期跨模态融合,同时具备对推理阶段缺失模态的内在鲁棒性。这一设计使PanLUNA在保持极小模型规模的同时,在多个任务上达到甚至超越更大模型的性能表现。
链接: https://arxiv.org/abs/2604.04297
作者: Marija Zelic,Anna Tegon,Yawei Li,Thorir Mar Ingolfsson,Luca Benini
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 5 pages, 5 tables, 1 figure, preprint
Abstract:Physiological foundation models (FMs) have shown promise for biosignal representation learning, yet most remain confined to a single modality such as EEG, ECG, or PPG, largely because paired multimodal datasets are scarce. In this paper, we present PanLUNA, a compact 5.4M-parameter pan-modal FM that jointly processes EEG, ECG, and PPG within a single shared encoder. Extending LUNA’s channel-unification module, PanLUNA treats multimodal channels as entries in a unified query set augmented with sensor-type embeddings, enabling efficient cross-modal early fusion while remaining inherently robust to missing modalities at inference time. Despite its small footprint, PanLUNA matches or exceeds models up to 57× larger: 81.21% balanced accuracy on TUAB abnormal EEG detection and state-of-the-art 0.7416 balanced accuracy on HMC multimodal sleep staging. Quantization-aware training with INT8 weights recovers ≥96% of full-precision performance, and deployment on the GAP9 ultra-low-power RISC-V microcontroller for wearables achieves 325.6 ms latency and 18.8 mJ per 10-second, 12-lead ECG inference, and 1.206 s latency at 68.65 mJ for multimodal 5-channel sleep staging over 30-second epochs.
[AI-50] Poisoned Identifiers Survive LLM Deobfuscation: A Case Study on Claude Opus 4.6
【速读】:该论文旨在解决生成式 AI(Generative AI)在对混淆 JavaScript 代码进行去混淆时,是否会在模型重构的代码中保留恶意污染的标识符名称(poisoned identifier names)的问题,即使模型已正确理解代码语义。研究发现,这类污染标识符在基准测试中几乎全部持续存在(100%),且与正确语义注释共存(15/17 次运行),表明模型可能将字符串表中的污染信息作为结构化输入记忆并复用。解决方案的关键在于任务提示(task framing)的重构:将原始指令“deobfuscate this”改为“write a fresh implementation”,可显著降低污染传播率(物理模拟从 100% 降至 0–20%,路径查找算法降至 0%),同时保持算法结构正确性,说明提示工程能够有效引导模型脱离原字符串表的干扰信号。
链接: https://arxiv.org/abs/2604.04289
作者: Luis Guzmán Lorenzo
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 18 pages, 1 figure, 17 references. Code and data: this https URL
Abstract:When an LLM deobfuscates JavaScript, can poisoned identifier names in the string table survive into the model’s reconstructed code, even when the model demonstrably understands the correct semantics? Using Claude Opus 4.6 across 192 inference runs on two code archetypes (force-directed graph simulation, A* pathfinding; 50 conditions, N=3-6), we found three consistent patterns: (1) Poisoned names persisted in every baseline run on both artifacts (physics: 8/8; pathfinding: 5/5). Matched controls showed this extends to terms with zero semantic fit when the string table does not form a coherent alternative domain. (2) Persistence coexisted with correct semantic commentary: in 15/17 runs the model wrote wrong variable names while correctly describing the actual operation in comments. (3) Task framing changed persistence: explicit verification prompts had no effect (12/12 across 4 variants), but reframing from “deobfuscate this” to “write a fresh implementation” reduced propagation from 100% to 0-20% on physics and to 0% on pathfinding, while preserving the checked algorithmic structure. Matched-control experiments showed zero-fit terms persist at the same rate when the replacement table lacks a coherent alternative-domain signal. Per-term variation in earlier domain-gradient experiments is confounded with domain-level coherence and recoverability. These observations are from two archetypes on one model family (Opus 4.6 primary; Haiku 4.5 spot-check). Broader generalization is needed.
[AI-51] Preservation Is Not Enough for Width Growth: Regime-Sensitive Selection of Dense LM Warm Starts
【速读】:该论文旨在解决生成式 AI(Generative AI)中因模型宽度扩展(width expansion)而导致的初始权重选择问题,即如何在不重新训练的前提下,从较小的因果语言模型(causal-language-model)检查点中有效迁移并扩展出高性能的宽模型。其核心挑战在于:仅依赖零步保留(zero-step preservation)无法确保最优的暖启动(warm start)策略。解决方案的关键在于将宽度扩展视为一个候选选择问题,系统性地比较多种暖启动方式(包括精确复制、扰动、非对称重置和结构化非克隆方法),并在匹配的继续预算下评估它们在确定性和随机两种场景下的表现,发现保留原始状态完整性(如权重、优化器动量和调度器状态)比早期脱离克隆子空间更能提升长期连续生成性能,尤其是在确定性条件下;而早期逃离克隆子空间虽有助于长程确定性任务,但在短滞后和随机延续场景中反而有害。因此,最佳暖启动策略需根据具体任务的运行模式(确定性/随机性)和延迟预算(lag budget)动态调整。
链接: https://arxiv.org/abs/2604.04281
作者: Eren Unlu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 2 figures, 8 tables
Abstract:Width expansion offers a practical route to reuse smaller causal-language-model checkpoints, but selecting a widened warm start is not solved by zero-step preservation alone. We study dense width growth as a candidate-selection problem over full training states, including copied weights, optimizer moments, and scheduler state. In a small-scale TinyStories proxy, we compare exact-copy, perturbative, asymmetric-reset, and structured non-clone warm starts under matched continuation budgets. We evaluate zero-step preservation, short-lag probe metrics, and downstream continuation utility in deterministic and stochastic regimes. The picture is mixed and partially replicated through a reduced-pool seed-1 check. Exact-copy symmetric warm starts rank first in every completed 16-step probe and in the completed stochastic 128-step continuations at seed-0 steps 1000 and 2000 plus reduced seed-1 step 2000. By contrast, the structured non-clone challenger wins deterministic 128-step continuation. Early escape from the inherited cloned subspace is therefore not a universal selector: it helps in long deterministic continuation, but it misleads at short lag and under stochastic continuation. The result is narrow but useful: for dense width growth at this scale, preservation is not a universal ranking criterion, and the best replacement signal depends on both regime and lag budget.
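文中 "exact-copy" 暖启动的零步保留性质,可用 Net2Net 风格的函数保持加宽在线性情形下演示:复制隐藏单元的输入权重并将复制出的输出权重减半,加宽后的网络在第 0 步与原网络逐点等价。以下为示意代码,忽略了非线性激活以及优化器动量、调度器状态的复制:

```python
import numpy as np

def widen_exact_copy(w_in, w_out):
    """Function-preserving width doubling of one hidden layer (Net2Net-style):
    duplicate each hidden unit's incoming weights and halve the duplicated
    outgoing weights, so the widened layer computes the same function at step
    zero. A simplified linear sketch of the 'exact-copy' warm start."""
    w_in_wide = np.concatenate([w_in, w_in], axis=1)         # (d_in, 2h)
    w_out_wide = np.concatenate([w_out, w_out], axis=0) / 2  # (2h, d_out)
    return w_in_wide, w_out_wide

rng = np.random.default_rng(0)
w_in, w_out = rng.normal(size=(4, 3)), rng.normal(size=(3, 2))
x = rng.normal(size=(5, 4))
w_in2, w_out2 = widen_exact_copy(w_in, w_out)
# Zero-step preservation: the widened linear network matches the original.
print(np.allclose(x @ w_in @ w_out, x @ w_in2 @ w_out2))  # -> True
```

论文的要点正是:这种零步保留本身并不足以选出最优暖启动,还需结合运行模式与滞后预算来评估。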
[AI-52] InferenceEvolve: Towards Automated Causal Effect Estimators through Self-Evolving AI
【速读】:该论文旨在解决因果推断(causal inference)中方法选择困难的问题,尤其是在复杂统计方法与现实世界数据交织的背景下。其核心挑战在于如何自动发现并优化适用于特定数据生成机制的因果估计器。解决方案的关键在于提出一种名为InferenceEvolve的进化框架,该框架利用大语言模型(large language models, LLMs)来自动探索和迭代改进因果推断方法。通过在广泛基准上的实验验证,该框架能够持续演化出优于现有基线的方法,并在无半合成结果的场景下仍能设计出稳健的代理目标函数,从而实现对部分可观测数据环境下因果推断程序的结构化优化。
链接: https://arxiv.org/abs/2604.04274
作者: Can Wang,Hongyu Zhao,Yiqun Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Mathematical Software (cs.MS); Applications (stat.AP)
备注:
Abstract:Causal inference is central to scientific discovery, yet choosing appropriate methods remains challenging because of the complexity of both statistical methodology and real-world data. Inspired by the success of artificial intelligence in accelerating scientific discovery, we introduce InferenceEvolve, an evolutionary framework that uses large language models to discover and iteratively refine causal methods. Across widely used benchmarks, InferenceEvolve yields estimators that consistently outperform established baselines: against 58 human submissions in a recent community competition, our best evolved estimator lay on the Pareto frontier across two evaluation metrics. We also developed robust proxy objectives for settings without semi-synthetic outcomes, with competitive results. Analysis of the evolutionary trajectories shows that agents progressively discover sophisticated strategies tailored to unrevealed data-generating mechanisms. These findings suggest that language-model-guided evolution can optimize structured scientific programs such as causal inference, even when outcomes are only partially observed.
[AI-53] Beyond Fluency: Toward Reliable Trajectories in Agentic IR
【速读】:该论文聚焦于当前信息检索(Information Retrieval, IR)系统从被动文档排序向自主代理式工作流(agentic workflows)演进过程中所面临的核心挑战,即在多步“推理-行动-观察”循环中,早期微小错误可能逐级累积并导致内部推理与外部工具执行之间的功能错位,即使语言输出仍保持流畅。为解决这一问题,论文提出的关键方案是引入每个交互单元的验证门控机制(verification gates),并通过校准不确定性实现系统性拒答(abstention),从而将评估重点从终点准确性转向轨迹完整性(trajectory integrity)和因果归因(causal attribution),以确保代理式IR系统的执行过程正确且 grounded。
链接: https://arxiv.org/abs/2604.04269
作者: Anushree Sinha,Srivaths Ranganathan,Debanshu Das,Abhishek Dharmaratnakar
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Information Retrieval is shifting from passive document ranking toward autonomous agentic workflows that operate in multi-step Reason-Act-Observe loops. In such long-horizon trajectories, minor early errors can cascade, leading to functional misalignment between internal reasoning and external tool execution despite continued linguistic fluency. This position paper synthesizes failure modes observed in industrial agentic systems, categorizing errors across planning, retrieval, reasoning, and execution. We argue that safe deployment requires moving beyond endpoint accuracy toward trajectory integrity and causal attribution. To address compounding error and deceptive fluency, we propose verification gates at each interaction unit and advocate systematic abstention under calibrated uncertainty. Reliable Agentic IR systems must prioritize process correctness and grounded execution over plausible but unverified completion.
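论文主张在每个交互单元设置验证门控,并在校准不确定性下系统性拒答。以下为一个假设性的最小门控示意(阈值与返回约定均为示例设定,并非论文给出的具体机制):

```python
def verification_gate(step_output, confidence, threshold=0.8):
    """Per-step verification gate: pass a trajectory step's output onward only
    when its calibrated confidence clears the threshold; otherwise abstain so
    the error cannot cascade into later Reason-Act-Observe steps."""
    if confidence >= threshold:
        return ("accept", step_output)
    return ("abstain", None)

# A three-step trajectory: the low-confidence middle step is gated out.
steps = [("retrieve doc#12", 0.95), ("parse table", 0.55), ("final answer", 0.90)]
decisions = [verification_gate(out, conf)[0] for out, conf in steps]
print(decisions)  # -> ['accept', 'abstain', 'accept']
```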
[AI-54] APPA: Adaptive Preference Pluralistic Alignment for Fair Federated RLHF of LLMs
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在多元人类偏好对齐中面临的公平性与整体性能之间的权衡问题,即如何在不集中各群体偏好数据的前提下实现多群体价值共存的泛化对齐。其核心挑战在于传统奖励聚合方法存在明显缺陷:基于平均值的聚合会系统性地忽视表现最差群体的对齐需求,而最小值聚合虽提升弱势群体表现却牺牲了整体对齐效果。解决方案的关键在于提出一种自适应偏好多元对齐框架(Adaptive Preference Pluralistic Alignment, APPA),通过动态重加权各群体的历史对齐奖励,在不访问原始偏好数据的情况下,优先增强欠对齐群体的表现,同时避免对已良好对齐群体造成性能下降,从而在公平性和整体性能之间取得更优平衡。
链接: https://arxiv.org/abs/2604.04261
作者: Mahmoud Srewa,Tianyu Zhao,Salma Elmalaki
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Aligning large language models (LLMs) with diverse human preferences requires pluralistic alignment, where a single model must respect the values of multiple distinct groups simultaneously. In federated reinforcement learning from human feedback (FedRLHF), these groups align a shared policy without centralizing preference data, which makes fair reward aggregation essential. Existing aggregation methods exhibit clear trade-offs: average-based aggregation systematically under-aligns worst-performing groups, while min-aggregation prioritizes worst-group performance at the cost of overall alignment. We propose APPA, an Adaptive Preference Pluralistic Alignment framework that dynamically reweights group-level rewards based on historical alignment rewards. Our approach prioritizes under-aligned groups without degrading well-aligned ones, while requiring no access to raw preference data. Integrated into a proximal policy optimization (PPO)-based FedRLHF pipeline and evaluated on GLOBALQA and OQA across three model families (Gemma 2 2B, Llama 3.2 3B, Qwen3 0.6B), APPA achieves strong fairness-alignment trade-offs, improving worst-group alignment by up to 28% over average aggregation while maintaining higher overall alignment than min-aggregation across most configurations.
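APPA 按历史对齐奖励动态重加权各群体。下面给出一种示意性的重加权规则:对负的历史奖励做 softmax,使对齐最差的群体获得最大权重。这只是该思想的一个极简近似(温度参数为示例假设),并非论文的精确更新式:

```python
import math

def adaptive_group_weights(hist_rewards, temperature=1.0):
    """Up-weight under-aligned groups: a numerically stable softmax over the
    negated historical alignment rewards, so the worst-aligned group receives
    the largest aggregation weight. Illustrative only, not the APPA update."""
    scores = [-r / temperature for r in hist_rewards]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Group 1 (historical reward 0.4) is worst-aligned and dominates the weights.
w = adaptive_group_weights([0.9, 0.4, 0.6])
print(w.index(max(w)))  # -> 1
```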
[AI-55] MC-CPO: Mastery-Conditioned Constrained Policy Optimization
【速读】:该论文旨在解决自适应教学系统在强化学习(Reinforcement Learning, RL)框架下可能出现的“奖励劫持”(Reward Hacking)问题,即系统为了最大化短期行为信号(如点击率或互动频率)而牺牲长期学习成效。为应对这一挑战,作者将其建模为一种带有掌握条件可行性(mastery-conditioned feasibility)的约束马尔可夫决策过程(Constrained Markov Decision Process, CMDP),其中可执行动作集根据学习者掌握水平和先修知识结构动态受限。解决方案的关键在于提出掌握条件约束策略优化算法(Mastery-Conditioned Constrained Policy Optimization, MC-CPO),该算法采用双时间尺度原始对偶优化机制,结合结构化动作掩码(structural action masking)与约束策略优化,确保在满足教学安全约束的前提下进行策略优化。理论分析表明,在标准随机逼近条件下,MC-CPO能保持可行性并收敛至平稳可行点;实证结果进一步验证其在表格环境与神经网络教学场景中均能有效控制安全成本、降低奖励劫持严重指数(Reward Hacking Severity Index, RHSI),从而为指令式强化学习系统提供了一种基于教学结构嵌入的可解释且安全的优化范式。
链接: https://arxiv.org/abs/2604.04251
作者: Oluseyi Olukola,Nick Rahimi
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 15 pages, 8 figures. Submitted to IEEE Transactions on Learning Technologies (TLT)
Abstract:Engagement-optimized adaptive tutoring systems may prioritize short-term behavioral signals over sustained learning outcomes, creating structural incentives for reward hacking in reinforcement learning policies. We formalize this challenge as a constrained Markov decision process (CMDP) with mastery-conditioned feasibility, in which pedagogical safety constraints dynamically restrict admissible actions according to learner mastery and prerequisite structure. We introduce Mastery-Conditioned Constrained Policy Optimization (MC-CPO), a two-timescale primal-dual algorithm that integrates structural action masking with constrained policy optimization. In the tabular regime, we establish feasibility preservation and convergence to stationary feasible points under standard stochastic approximation conditions and derive a safety gap result showing that optimization within the mastery-conditioned feasible set can strictly dominate post-hoc filtering under identical safety budgets. Empirical validation is conducted in minimal and extended tabular environments and in a neural tutoring setting. Across 10 random seeds and one million training steps in the neural regime, MC-CPO satisfies constraint budgets within tolerance, reduces discounted safety costs relative to unconstrained and reward-shaped baselines, and substantially lowers the Reward Hacking Severity Index (RHSI). These results indicate that embedding pedagogical structure directly into the feasible action space provides a principled foundation for mitigating reward hacking in instructional reinforcement learning systems.
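MC-CPO 的结构化动作掩码依据先修结构与最低认知需求约束来限定可行动作集。以下为掌握条件可行性的一个玩具示意(掌握阈值 0.8 与最低需求 0.2 均为本文示例假设,非论文参数):

```python
def feasible_actions(actions, mastery, prereqs, min_demand=0.2):
    """Mastery-conditioned structural mask: a tutoring action (teaching a
    skill) is admissible only if all prerequisite skills are mastered and its
    cognitive demand is not trivially low, closing off the low-value
    repetitive actions that enable reward hacking."""
    feasible = []
    for skill, demand in actions:
        prereqs_met = all(mastery.get(p, 0.0) >= 0.8 for p in prereqs.get(skill, []))
        if prereqs_met and demand >= min_demand:
            feasible.append(skill)
    return feasible

mastery = {"fractions": 0.9, "decimals": 0.3}
prereqs = {"percentages": ["fractions", "decimals"], "ratios": ["fractions"]}
actions = [("percentages", 0.6), ("ratios", 0.5), ("ratios", 0.1)]
# "percentages" fails its prerequisite check; the 0.1-demand action is too easy.
print(feasible_actions(actions, mastery, prereqs))  # -> ['ratios']
```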
[AI-56] Good Rankings Wrong Probabilities: A Calibration Audit of Multimodal Cancer Survival Models
【速读】:该论文旨在解决多模态深度学习模型在癌症生存预测中缺乏校准性(calibration)评估的问题,尤其是针对从组织病理图像(whole-slide histopathology images, WSI)与基因组数据融合所得的生存概率是否可靠这一关键问题。现有方法虽在区分度(以一致性指数C-index衡量)上表现优异,但未充分验证其输出的概率是否真实反映事件发生风险。解决方案的关键在于首次系统性地开展基于fold-level的1-校准审计(1-calibration audit),涵盖原生离散时间生存输出(Experiment A)和通过Breslow法重构的生存曲线(Experiment B),并发现门控融合机制(gating-based fusion)相比拼接或双线性融合更有利于提升校准性能;此外,后处理的Platt缩放可有效减少校准偏差而不损害区分能力,从而强调仅依赖C-index不足以支撑临床应用。
链接: https://arxiv.org/abs/2604.04239
作者: Sajad Ghawami
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: 15 pages, 5 figures
Abstract:Multimodal deep learning models that fuse whole-slide histopathology images with genomic data have achieved strong discriminative performance for cancer survival prediction, as measured by the concordance index. Yet whether the survival probabilities derived from these models - either directly from native outputs or via standard post-hoc reconstruction - are calibrated remains largely unexamined. We conduct, to our knowledge, the first systematic fold-level 1-calibration audit of multimodal WSI-genomics survival architectures, evaluating native discrete-time survival outputs (Experiment A: 3 models on TCGA-BRCA) and Breslow-reconstructed survival curves from scalar risk scores (Experiment B: 11 architectures across 5 TCGA cancer types). In Experiment A, all three models fail 1-calibration on a majority of folds (12 of 15 fold-level tests reject after Benjamini-Hochberg correction). Across the full 290 fold-level tests, 166 reject the null of correct calibration at the median event time after Benjamini-Hochberg correction (FDR = 0.05). MCAT achieves C-index 0.817 on GBMLGG yet fails 1-calibration on all five folds. Gating-based fusion is associated with better calibration; bilinear and concatenation fusion are not. Post-hoc Platt scaling reduces miscalibration at the evaluated horizon (e.g., MCAT: 5/5 folds failing to 2/5) without affecting discrimination. The concordance index alone is insufficient for evaluating survival models intended for clinical use.
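文中用事后 Platt 缩放缓解校准偏差而不损害区分度。Platt 缩放即在保留集上拟合 p = sigmoid(a·s + b),把原始分数 s 映射为校准概率。以下为从零实现的极简示意(梯度下降超参数与数据均为玩具设定):

```python
import math

def platt_scale(scores, labels, lr=0.1, steps=2000):
    """Post-hoc Platt scaling: fit p = sigmoid(a*s + b) to held-out
    (score, label) pairs by gradient descent on the log loss.
    A minimal from-scratch sketch with toy hyperparameters."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            ga += (p - y) * s / n  # d(logloss)/da
            gb += (p - y) / n      # d(logloss)/db
        a -= lr * ga
        b -= lr * gb
    return a, b

# Overconfident raw scores: only half of the confident calls are right,
# so calibration should shrink the fitted slope below 1.
scores = [2.0, 1.5, -1.5, -2.0, 1.0, -1.0]
labels = [1, 0, 0, 1, 1, 0]
a, b = platt_scale(scores, labels)
p_cal = 1.0 / (1.0 + math.exp(-(a * 2.0 + b)))
print(p_cal < 1.0 / (1.0 + math.exp(-2.0)))  # calibrated prob is less extreme -> True
```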
[AI-57] Pedagogical Safety in Educational Reinforcement Learning: Formalizing and Detecting Reward Hacking in AI Tutoring Systems
【速读】:该论文旨在解决教育强化学习(Reinforcement Learning, RL)中缺乏正式框架来定义和评估教学安全性的问题,尤其关注代理奖励(proxy reward)与真实学习目标之间的错位风险。其核心解决方案是提出一个四层教学安全模型(结构安全、进展安全、行为安全与对齐安全),并引入奖励劫持严重指数(Reward Hacking Severity Index, RHSI)量化这种错位程度;关键创新在于采用约束架构——结合前置知识强制和最低认知需求限制——显著降低了奖励劫持现象(RHSI从0.317降至0.102),实证表明行为安全是最有效的防护机制,从而强调仅靠奖励设计不足以保障教育RL的教学生态有效性。
链接: https://arxiv.org/abs/2604.04237
作者: Oluseyi Olukola,Nick Rahimi
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 43 pages, 5 figures. Submitted to the International Journal of Artificial Intelligence in Education (IJAIED)
Abstract:Reinforcement learning (RL) is increasingly used to personalize instruction in intelligent tutoring systems, yet the field lacks a formal framework for defining and evaluating pedagogical safety. We introduce a four-layer model of pedagogical safety for educational RL comprising structural, progress, behavioral, and alignment safety and propose the Reward Hacking Severity Index (RHSI) to quantify misalignment between proxy rewards and genuine learning. We evaluate the framework in a controlled simulation of an AI tutoring environment with 120 sessions across four conditions and three learner profiles, totaling 18,000 interactions. Results show that an engagement-optimized agent systematically over-selected a high-engagement action with no direct mastery gain, producing strong measured performance but limited learning progress. A multi-objective reward formulation reduced this problem but did not eliminate it, as the agent continued to favor proxy-rewarding behavior in many states. In contrast, a constrained architecture combining prerequisite enforcement and minimum cognitive demand substantially reduced reward hacking, lowering RHSI from 0.317 in the unconstrained multi-objective condition to 0.102. Ablation results further suggest that behavioral safety was the most influential safeguard against repetitive low-value action selection. These findings suggest that reward design alone may be insufficient to ensure pedagogically aligned behavior in educational RL, at least in the simulated environment studied here. More broadly, the paper positions pedagogical safety as an important research problem at the intersection of AI safety and intelligent educational systems.
[AI-58] Learning from Imperfect Demonstrations via Temporal Behavior Tree-Guided Trajectory Repair
【速读】:该论文旨在解决从现实世界中获取的示范轨迹往往存在次优、噪声或不一致等问题,这些问题会显著影响模仿学习和强化学习(Reinforcement Learning, RL)的效果。其解决方案的关键在于引入一种基于时序行为树(Temporal Behavior Trees, TBT)的形式化框架,通过模型驱动的修复算法对违反TBT规范的轨迹片段进行修正,从而生成逻辑一致且可解释的数据集;随后利用修复后的轨迹提取势函数(Potential Function)来塑造强化学习中的奖励信号,引导智能体向任务一致的状态空间区域迁移,而无需依赖对机器人运动学模型的先验知识。该方法在离散网格导航和连续单/多智能体避障任务中均验证了有效性,展现出在高质量示范难以获得场景下的数据高效机器人学习潜力。
链接: https://arxiv.org/abs/2604.04225
作者: Aniruddh G. Puranic,Sebastian Schirmer,John S. Baras,Calin Belta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
备注: 12 pages, 4 figures. This work has been submitted to the IEEE for possible publication
Abstract:Learning robot control policies from demonstrations is a powerful paradigm, yet real-world data is often suboptimal, noisy, or otherwise imperfect, posing significant challenges for imitation and reinforcement learning. In this work, we present a formal framework that leverages Temporal Behavior Trees (TBT), an extension of Signal Temporal Logic (STL) with Behavior Tree semantics, to repair suboptimal trajectories prior to their use in downstream policy learning. Given demonstrations that violate a TBT specification, a model-based repair algorithm corrects trajectory segments to satisfy the formal constraints, yielding a dataset that is both logically consistent and interpretable. The repaired trajectories are then used to extract potential functions that shape the reward signal for reinforcement learning, guiding the agent toward task-consistent regions of the state space without requiring knowledge of the agent’s kinematic model. We demonstrate the effectiveness of this framework on discrete grid-world navigation and continuous single and multi-agent reach-avoid tasks, highlighting its potential for data-efficient robot learning in settings where high-quality demonstrations cannot be assumed.
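文中从修复后的轨迹提取势函数以塑造强化学习奖励。基于势的奖励塑造的标准形式为 F(s, s') = γΦ(s') - Φ(s)(Ng 等人的经典结果),在奖励上叠加 F 不改变最优策略。以下示意中基于曼哈顿距离的势函数只是玩具替代,并非论文从演示中提取的势函数:

```python
def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Potential-based reward shaping: F(s, s') = gamma*phi(s') - phi(s).
    Adding F to the environment reward is policy-invariant, so the shaped
    agent is merely guided, not redirected, by the demonstrations."""
    return r + gamma * phi(s_next) - phi(s)

goal = (3, 3)
# Toy potential for a grid world: closer to the goal -> higher potential.
phi = lambda s: -(abs(s[0] - goal[0]) + abs(s[1] - goal[1]))

# A step toward the goal earns a positive shaping bonus even when r = 0.
print(shaped_reward(0.0, (0, 0), (1, 0), phi) > 0)  # -> True
```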
[AI-59] TimeSeek: Temporal Reliability of Agentic Forecasters
【速读】:该论文旨在解决生成式 AI(Generative AI)在预测市场生命周期中可靠性变化的问题,特别是评估代理型大语言模型(agentic LLM forecasters)在不同时间点和市场特征下的预测性能差异。其关键解决方案是构建 TimeSeek 基准,通过在 150 个受 CFTC 监管的 Kalshi 二元市场中对 10 个前沿模型进行五次时间节点的评估,并对比是否启用网络搜索(web search),从而揭示模型在市场早期和高不确定性情境下表现更优,而在接近结算期和共识性强的市场中表现下降;同时发现检索增强虽总体提升 Brier Skill Score(BSS),但对部分模型-时间点组合有害,表明需采用时序感知的评估策略与选择性依赖工具使用政策,而非单一市场快照或统一工具调用设置。
链接: https://arxiv.org/abs/2604.04220
作者: Hamza Mostafa,Om Shastri,Dennis Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Workshop paper. 11 pages including references
Abstract:We introduce TimeSeek, a benchmark for studying how the reliability of agentic LLM forecasters changes over a prediction market’s lifecycle. We evaluate 10 frontier models on 150 CFTC-regulated Kalshi binary markets at five temporal checkpoints, with and without web search, for 15,000 forecasts total. Models are most competitive early in a market’s life and on high-uncertainty markets, but much less competitive near resolution and on strong-consensus markets. Web search improves pooled Brier Skill Score (BSS) for every model overall, yet hurts in 12% of model-checkpoint pairs, indicating that retrieval is helpful on average but not uniformly so. Simple two-model ensembles reduce error without surpassing the market overall. These descriptive results motivate time-aware evaluation and selective-deference policies rather than a single market snapshot or a uniform tool-use setting.
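示意代码(假设性):文中使用的 Brier Skill Score(BSS)可按 BSS = 1 − BS_model / BS_ref 计算,参考预测取市场价格;以下数值仅为虚构示例。

```python
def brier(preds, outcomes):
    """概率预测相对二元结果的均方误差(Brier score,越小越好)。"""
    return sum((p - o) ** 2 for p, o in zip(preds, outcomes)) / len(preds)

def brier_skill_score(model_preds, market_preds, outcomes):
    """BSS > 0 表示模型优于参考预测(此处以市场价格为参考)。"""
    return 1.0 - brier(model_preds, outcomes) / brier(market_preds, outcomes)

outcomes = [1, 0, 1, 1]           # 市场最终结算结果(虚构)
market = [0.6, 0.4, 0.7, 0.5]     # 市场隐含概率
model = [0.8, 0.2, 0.9, 0.6]      # 模型预测概率
bss = brier_skill_score(model, market, outcomes)
```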
[AI-60] LOCARD: An Agentic Framework for Blockchain Forensics
【速读】:该论文旨在解决传统区块链取证方法依赖静态推理流程、难以应对动态迭代调查需求的问题。其核心解决方案是提出一种新型范式——代理式区块链取证(Agentic Blockchain Forensics, ABF),并通过首个代理框架LOCARD实现该范式。关键创新在于引入三核认知架构(Tri-Core Cognitive Architecture),将战略规划、操作执行与评估验证解耦,并设计结构化信念状态(Structured Belief State)机制以强化取证严谨性并约束探索空间,从而在跨链交易追踪等复杂场景中实现高保真、可验证的自动化分析。
链接: https://arxiv.org/abs/2604.04211
作者: Xiaohang Yu,William Knottenbelt
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Blockchain forensics inherently involves dynamic and iterative investigations, while many existing approaches primarily model it through static inference pipelines. We propose a paradigm shift towards Agentic Blockchain Forensics (ABF), modeling forensic investigation as a sequential decision-making process. To instantiate this paradigm, we introduce LOCARD, the first agentic framework for blockchain forensics. LOCARD operationalizes this perspective through a Tri-Core Cognitive Architecture that decouples strategic planning, operational execution, and evaluative validation. Unlike generic LLM-based agents, it incorporates a Structured Belief State mechanism to enforce forensic rigor and guide exploration under explicit state constraints. To demonstrate the efficacy of the ABF paradigm, we apply LOCARD to the inherently complex domain of cross-chain transaction tracing. We introduce Thor25, a benchmark dataset comprising over 151k real-world cross-chain forensic records, and evaluate LOCARD on the Group-Transfer Tracing task for dismantling Sybil clusters. Validated against representative laundering sub-flows from the Bybit hack, LOCARD achieves high-fidelity tracing results, providing empirical evidence that modeling blockchain forensics as an autonomous agentic task is both viable and effective. These results establish a concrete foundation for future agentic approaches to large-scale blockchain forensic analysis. Code and dataset are publicly available at this https URL and this https URL.
[AI-61] Don't Blink: Evidence Collapse during Multimodal Reasoning UAI2026
【速读】:该论文旨在解决生成式视觉语言模型(Vision-Language Models, VLMs)在推理过程中出现的“证据坍缩”(evidence-collapse)问题,即模型在逐步推理时逐渐失去对标注视觉证据区域的关注,导致低熵预测虽具高置信度但缺乏视觉 grounding,从而引发任务条件下的危险风险。解决方案的关键在于提出一种基于熵-视觉交互的结构化监测机制:通过识别任务依赖的低熵且视觉脱节的预测模式,区分持续视觉参考任务与符号推理任务,并在此基础上设计针对性的视觉 veto 策略,在保持预期脱节场景不变的前提下,有效降低选择性风险(最多提升 1.9 个百分点),实现分布偏移下更安全的多模态模型部署。
链接: https://arxiv.org/abs/2604.04207
作者: Suresh Raghu,Satwik Pandey
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 6 figures, 1 table, plus appendix. Submitted to UAI 2026
Abstract:Reasoning VLMs can become more accurate while progressively losing visual grounding as they think. This creates task-conditional danger zones where low-entropy predictions are confident but ungrounded, a failure mode text-only monitoring cannot detect. Evaluating three reasoning VLMs on MathVista, HallusionBench, and MMMU_Pro, we find a pervasive evidence-collapse phenomenon: attention to annotated evidence regions drops substantially, often losing over half of evidence mass, as reasoning unfolds. Full-response entropy is the most reliable text-only uncertainty signal under cross-dataset transfer, yet adding vision features with a single global linear rule is brittle and often degrades transfer. An entropy-vision interaction model reveals a task-conditional regime: lowentropy, visually disengaged predictions are hazardous on sustained visual-reference tasks but benign on symbolic tasks. Using this structure, a targeted vision veto reduces selective risk by up to 1.9 percentage points at 90% coverage, while avoiding degradations where disengagement is expected. The results support task-aware multimodal monitoring for safe deployment under distribution shift.
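示意代码(假设性):文中报告的"90% 覆盖率下的选择性风险"可按如下方式计算——按置信度排序后只保留最自信的一部分预测,在其上统计错误率;数据为虚构示例。

```python
def selective_risk(confidences, correct, coverage=0.9):
    """在置信度最高的 coverage 比例预测上计算错误率(选择性风险)。"""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    k = max(1, int(len(order) * coverage))
    return sum(1 - correct[i] for i in order[:k]) / k

conf = [0.9, 0.8, 0.7, 0.6, 0.2]   # 预测置信度(虚构)
corr = [1, 1, 1, 0, 0]             # 预测是否正确
risk_full = selective_risk(conf, corr, coverage=1.0)  # 全覆盖错误率
risk_80 = selective_risk(conf, corr, coverage=0.8)    # 弃答最不自信的 20%
```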
[AI-62] Robots Need Some Education: On the complexity of learning in evolutionary robotics
【速读】:该论文旨在解决将机器人学习(Robot Learning)与进化机器人学(Evolutionary Robotics)融合时所面临的挑战,尤其是如何在进化过程中有效引入学习算法以优化机器人的控制器。其解决方案的关键在于设计适用于进化机器人语境的特定学习算法,并深入理解学习机制对进化过程的影响,从而实现对机器人控制器的高效优化与适应性提升。
链接: https://arxiv.org/abs/2604.04196
作者: Fuda van Diggelen
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: PhD thesis
Abstract:Evolutionary Robotics and Robot Learning are two fields in robotics that aim to automatically optimize robot designs. The key difference between them lies in what is being optimized and the time scale involved. Evolutionary Robotics is a field that applies evolutionary computation techniques to evolve the morphologies or controllers, or both. Robot Learning, on the other hand, involves any learning technique aimed at optimizing a robot’s controller in a given morphology. In terms of time scales, evolution occurs across multiple generations, whereas learning takes place within the ‘lifespan’ of an individual robot. Integrating Robot Learning with Evolutionary Robotics requires the careful design of suitable learning algorithms in the context of evolutionary robotics. The effects of introducing learning into the evolutionary process are not well-understood and can thus be tricky. This thesis investigates these intricacies and presents several learning algorithms developed for an Evolutionary Robotics context.
[AI-63] Schema-Aware Planning and Hybrid Knowledge Toolset for Reliable Knowledge Graph Triple Verification
【速读】:该论文旨在解决知识图谱(Knowledge Graph, KG)自动化构建过程中引入噪声导致的数据可信度下降问题,特别是现有三元组验证方法因依赖单一来源信息(如仅内部结构约束或外部语义证据)而存在偏差,且采用静态推理范式,在处理复杂或长尾事实时表现不佳、可解释性弱的问题。其解决方案的关键在于提出一种无需训练的自主代理SHARP(Schema-Hybrid Agent for Reliable Prediction),将三元组验证重构为一个动态的战略规划、主动调查与证据推理过程;核心创新包括:基于记忆增强机制与模式感知战略规划以提升推理稳定性,并通过改进的ReAct循环结合混合知识工具集,实现内部知识图谱结构与外部文本证据的动态融合交叉验证,从而显著提升准确性与可解释性。
链接: https://arxiv.org/abs/2604.04190
作者: Xinyan Ma(1),Xianhao Ou(1),Weihao Zhang(1),Shixin Jiang(1),Runxuan Liu(1),Dandan Tu(2),Lei Chen(3),Ming Liu(1),Bing Qin(1) ((1) Harbin Institute of Technology, Harbin, China, (2) Huawei Technologies Co., Ltd., Beijing, China, (3) Bay Area International Business School, Beijing Normal University, Beijing, China)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Knowledge Graphs (KGs) serve as a critical foundation for AI systems, yet their automated construction inevitably introduces noise, compromising data trustworthiness. Existing triple verification methods, based on graph embeddings or language models, often suffer from single-source bias by relying on either internal structural constraints or external semantic evidence, and usually follow a static inference paradigm. As a result, they struggle with complex or long-tail facts and provide limited interpretability. To address these limitations, we propose SHARP (Schema-Hybrid Agent for Reliable Prediction), a training-free autonomous agent that reformulates triple verification as a dynamic process of strategic planning, active investigation, and evidential reasoning. Specifically, SHARP combines a Memory-Augmented Mechanism with Schema-Aware Strategic Planning to improve reasoning stability, and employs an enhanced ReAct loop with a Hybrid Knowledge Toolset to dynamically integrate internal KG structure and external textual evidence for cross-verification. Experiments on FB15K-237 and Wikidata5M-Ind show that SHARP significantly outperforms existing state-of-the-art baselines, achieving accuracy gains of 4.2% and 12.9%, respectively. Moreover, SHARP provides transparent, fact-based evidence chains for each judgment, demonstrating strong interpretability and robustness for complex verification tasks.
[AI-64] Comparative reversal learning reveals rigid adaptation in LLM s under non-stationary uncertainty
【速读】:该论文旨在解决在非平稳环境(non-stationary environments)中,大语言模型(Large Language Models, LLMs)如何动态调整先前学习到的动作价值(action values)以适应环境变化的问题。其核心挑战在于理解LLMs在面对状态切换(switch events)时的决策灵活性与损失敏感性(loss sensitivity),以及不同训练机制对适应能力的影响。解决方案的关键在于设计一个两选项的概率反转学习任务(probabilistic reversal-learning task),引入三种潜在状态和由绩效标准或超时触发的切换事件,并对比固定转移周期与随机调度两种策略下多个LLM(DeepSeek-V3.2、Gemini-3、GPT-5.2)的行为表现,同时以人类数据为参照。研究发现,尽管所有模型均表现出高“赢则继续”(win-stay)倾向,但“输则改变”(lose-shift)行为显著减弱,揭示了正负反馈利用的不对称性;进一步通过分层强化学习(hierarchical reinforcement learning, RL)建模识别出导致适应僵化(rigidity)的三种机制:弱损失学习、政策确定性过高及反事实抑制引发的价值极化(value polarisation)。这一框架为评估LLMs在非平稳不确定性下的适应能力提供了可操作的诊断工具和改进方向。
链接: https://arxiv.org/abs/2604.04182
作者: Haomiaomiao Wang,Tomás E Ward,Lili Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 2 figures, accepted by IPMU 2026, SS04: Explainable AI and Decision-Making Under Uncertainty: Bridging Interpretability and Robustness
Abstract:Non-stationary environments require agents to revise previously learned action values when contingencies change. We treat large language models (LLMs) as sequential decision policies in a two-option probabilistic reversal-learning task with three latent states and switch events triggered by either a performance criterion or timeout. We compare a deterministic fixed transition cycle to a stochastic random schedule that increases volatility, and evaluate DeepSeek-V3.2, Gemini-3, and GPT-5.2, with human data as a behavioural reference. Across models, win-stay was near ceiling while lose-shift was markedly attenuated, revealing asymmetric use of positive versus negative evidence. DeepSeek-V3.2 showed extreme perseveration after reversals and weak acquisition, whereas Gemini-3 and GPT-5.2 adapted more rapidly but still remained less loss-sensitive than humans. Random transitions amplified reversal-specific persistence across LLMs yet did not uniformly reduce total wins, demonstrating that high aggregate payoff can coexist with rigid adaptation. Hierarchical reinforcement-learning (RL) fits indicate dissociable mechanisms: rigidity can arise from weak loss learning, inflated policy determinism, or value polarisation via counterfactual suppression. These results motivate reversal-sensitive diagnostics and volatility-aware models for evaluating LLMs under non-stationary uncertainty.
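示意代码(假设性):文中的 win-stay / lose-shift 指标可由逐试次的选择与反馈序列直接统计;以下实现与数据仅为示意。

```python
def win_stay_lose_shift(choices, rewards):
    """统计"赢则继续"(win-stay)与"输则改变"(lose-shift)的比例。"""
    ws_n = ws_d = ls_n = ls_d = 0
    for t in range(1, len(choices)):
        if rewards[t - 1]:                       # 上一试次获得奖励
            ws_d += 1
            ws_n += choices[t] == choices[t - 1]
        else:                                    # 上一试次未获奖励
            ls_d += 1
            ls_n += choices[t] != choices[t - 1]
    return ws_n / ws_d, ls_n / ls_d

choices = ["A", "A", "B", "B", "A"]   # 逐试次选择(虚构)
rewards = [1, 0, 0, 1, 0]             # 逐试次反馈
ws, ls = win_stay_lose_shift(choices, rewards)
```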
[AI-65] CoALFake: Collaborative Active Learning with Human-LLM Co-Annotation for Cross-Domain Fake News Detection
【速读】:该论文旨在解决虚假新闻检测系统在跨域场景下的两个核心问题:一是现有方法对标注数据的依赖导致其在缺乏标签数据时性能受限,且数据获取成本高;二是传统跨域方法因固定领域划分或忽略领域特异性特征而导致信息丢失,从而影响模型泛化能力。解决方案的关键在于提出CoALFake框架,其核心创新包括:(1)引入人-大语言模型(Human-Large Language Model, LLM)协同标注机制,在保证标签可靠性的同时显著降低标注成本;(2)结合领域感知的主动学习(domain-aware Active Learning, AL),通过领域嵌入技术动态捕捉领域特定细节与跨域共性模式,训练出领域无关的检测模型;(3)设计领域感知采样策略,优先选择覆盖多样领域的样本以优化训练效率。实验证明,该方法在多个数据集上均显著优于现有基线,且在极少人工干预下仍保持优异性能。
链接: https://arxiv.org/abs/2604.04174
作者: Esma Aïmeur,Gilles Brassard,Dorsaf Sallami
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The proliferation of fake news across diverse domains highlights critical limitations in current detection systems, which often exhibit narrow domain specificity and poor generalization. Existing cross-domain approaches face two key challenges: (1) reliance on labelled data, which is frequently unavailable and resource intensive to acquire and (2) information loss caused by rigid domain categorization or neglect of domain-specific features. To address these issues, we propose CoALFake, a novel approach for cross-domain fake news detection that integrates Human-Large Language Model (LLM) co-annotation with domain-aware Active Learning (AL). Our method employs LLMs for scalable, low-cost annotation while maintaining human oversight to ensure label reliability. By integrating domain embedding techniques, the CoALFake dynamically captures both domain specific nuances and cross-domain patterns, enabling the training of a domain agnostic model. Furthermore, a domain-aware sampling strategy optimizes sample acquisition by prioritizing diverse domain coverage. Experimental results across multiple datasets demonstrate that the proposed approach consistently outperforms various baselines. Our results emphasize that human-LLM co-annotation is a highly cost-effective approach that delivers excellent performance. Evaluations across several datasets show that CoALFake consistently outperforms a range of existing baselines, even with minimal human oversight.
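示意代码(假设性):"优先覆盖多样领域"的采样思想可以用一个简单的轮询式不确定性采样示意——每个领域轮流贡献其最不确定的未标注样本;轮询规则为示意性假设,并非论文的具体策略。

```python
def domain_aware_sample(pool, budget):
    """pool 为 (domain, uncertainty) 列表;返回被选中样本的下标。"""
    by_domain = {}
    for idx, (dom, unc) in enumerate(pool):
        by_domain.setdefault(dom, []).append((unc, idx))
    for items in by_domain.values():
        items.sort(reverse=True)             # 不确定性高者优先
    chosen = []
    domains = sorted(by_domain)
    while len(chosen) < budget and any(by_domain.values()):
        for dom in domains:                  # 轮询各领域,保证覆盖多样性
            if by_domain[dom] and len(chosen) < budget:
                chosen.append(by_domain[dom].pop(0)[1])
    return chosen

pool = [("politics", 0.9), ("health", 0.4), ("politics", 0.7), ("health", 0.8)]
picks = domain_aware_sample(pool, budget=2)
```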
[AI-66] A Model of Understanding in Deep Learning Systems
【速读】:该论文试图解决的问题是:如何在机器学习系统中定义和实现对目标系统的“系统性理解”(systematic understanding),并评估当前深度学习模型是否具备这种理解能力。解决方案的关键在于提出一个三要素框架:首先,代理必须包含能够追踪真实规律的充分内部模型;其次,该模型需通过稳定的桥梁原则(bridge principles)与目标系统耦合;最后,该模型应能支持可靠预测。论文指出,当代深度学习系统虽常能达到上述标准,但其理解仍存在符号错位、非显式还原性和弱统一性等问题,因而提出了“断裂理解假说”(Fractured Understanding Hypothesis)以刻画这一局限。
链接: https://arxiv.org/abs/2604.04171
作者: David Peter Wallis Freeborn
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:I propose a model of systematic understanding, suitable for machine learning systems. On this account, an agent understands a property of a target system when it contains an adequate internal model that tracks real regularities, is coupled to the target by stable bridge principles, and supports reliable prediction. I argue that contemporary deep learning systems often can and do achieve such understanding. However they generally fall short of the ideal of scientific understanding: the understanding is symbolically misaligned with the target system, not explicitly reductive, and only weakly unifying. I label this the Fractured Understanding Hypothesis.
[AI-67] Readable Minds: Emergent Theory-of-Mind-Like Behavior in LLM Poker Agents
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)是否能通过动态交互自发发展出理论心智(Theory of Mind, ToM)能力的问题,而非仅依赖静态情境测试。其核心解决方案在于设计了一个2×2因子实验,系统考察持久记忆(persistent memory)与领域知识(domain knowledge)对LLM代理在德州扑克中构建对手模型的影响。关键发现是:只有当模型具备持久记忆时,才能从交互中自发涌现出ToM水平3–5(即预测性至递归建模),且此类行为表现为基于对手模型的战略欺骗(strategic deception),而无记忆条件下的代理始终停留在ToM水平0;此外,尽管领域知识不决定ToM的出现,但显著提升其应用精度。这一结果表明,功能性的ToM-like行为可由交互动态驱动,无需显式训练或提示,为理解人工社会智能和生物社会认知提供了新范式。
链接: https://arxiv.org/abs/2604.04157
作者: Hsieh-Ting Lin,Tsung-Yu Hou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 7 pages (PNAS format), 4 figures, 2 tables, 49 references. Submitted to PNAS
Abstract:Theory of Mind (ToM) – the ability to model others’ mental states – is fundamental to human social cognition. Whether large language models (LLMs) can develop ToM has been tested exclusively through static vignettes, leaving open whether ToM-like reasoning can emerge through dynamic interaction. Here we report that autonomous LLM agents playing extended sessions of Texas Hold’em poker progressively develop sophisticated opponent models, but only when equipped with persistent memory. In a 2x2 factorial design crossing memory (present/absent) with domain knowledge (present/absent), each with five replications (N = 20 experiments, ~6,000 agent-hand observations), we find that memory is both necessary and sufficient for ToM-like behavior emergence (Cliff’s delta = 1.0, p = 0.008). Agents with memory reach ToM Level 3-5 (predictive to recursive modeling), while agents without memory remain at Level 0 across all replications. Strategic deception grounded in opponent models occurs exclusively in memory-equipped conditions (Fisher’s exact p 0.001). Domain expertise does not gate ToM-like behavior emergence but enhances its application: agents without poker knowledge develop equivalent ToM levels but less precise deception (p = 0.004). Agents with ToM deviate from game-theoretically optimal play (67% vs. 79% TAG adherence, delta = -1.0, p = 0.008) to exploit specific opponents, mirroring expert human play. All mental models are expressed in natural language and directly readable, providing a transparent window into AI social cognition. Cross-model validation with GPT-4o yields weighted Cohen’s kappa = 0.81 (almost perfect agreement). These findings demonstrate that functional ToM-like behavior can emerge from interaction dynamics alone, without explicit training or prompting, with implications for understanding artificial social intelligence and biological social cognition.
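示意代码(假设性):文中报告的效应量 Cliff's delta 定义为 δ = [#(x>y) − #(x<y)] / (nm);当两组数值完全不重叠时 |δ| = 1,对应"有记忆 vs 无记忆"条件下的结果。示例数值为虚构。

```python
def cliffs_delta(xs, ys):
    """非参数效应量:P(x>y) − P(x<y),对所有跨组配对求平均。"""
    gt = sum(x > y for x in xs for y in ys)
    lt = sum(x < y for x in xs for y in ys)
    return (gt - lt) / (len(xs) * len(ys))

memory = [3, 4, 5, 3, 5]      # 有记忆条件下的 ToM 等级(虚构数值)
no_memory = [0, 0, 0, 0, 0]   # 无记忆条件下的 ToM 等级
delta = cliffs_delta(memory, no_memory)
```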
[AI-68] Solar-VLM: Multimodal Vision-Language Models for Augmented Solar Power Forecasting
【速读】:该论文旨在解决光伏(Photovoltaic, PV)功率预测中多源信息融合不足的问题,尤其在时间序列数据、卫星图像和文本天气描述等异构模态信息的协同建模方面存在短板。其解决方案的关键在于提出了一种基于大语言模型(Large-Language-Model, LLM)驱动的多模态框架Solar-VLM:首先设计了模态专用编码器——时序编码器采用基于patch的结构捕捉站点级多变量时间模式,视觉编码器基于Qwen视觉主干提取云覆盖特征,文本编码器从天气文本中蒸馏历史气象特征;其次引入跨站点特征融合机制,通过图学习模块构建K近邻图并利用图注意力网络建模站点间空间依赖关系,同时结合跨站点注意力模块实现站点间自适应信息交互,从而有效整合时空维度上的复杂依赖性。
链接: https://arxiv.org/abs/2604.04145
作者: Hang Fan,Haoran Pei,Runze Liang,Weican Liu,Long Cheng,Wei Wei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Photovoltaic (PV) power forecasting plays a critical role in power system dispatch and market participation. Because PV generation is highly sensitive to weather conditions and cloud motion, accurate forecasting requires effective modeling of complex spatiotemporal dependencies across multiple information sources. Although recent studies have advanced AI-based forecasting methods, most fail to fuse temporal observations, satellite imagery, and textual weather information in a unified framework. This paper proposes Solar-VLM, a large-language-model-driven framework for multimodal PV power forecasting. First, modality-specific encoders are developed to extract complementary features from heterogeneous inputs. The time-series encoder adopts a patch-based design to capture temporal patterns from multivariate observations at each site. The visual encoder, built upon a Qwen-based vision backbone, extracts cloud-cover information from satellite images. The text encoder distills historical weather characteristics from textual descriptions. Second, to capture spatial dependencies across geographically distributed PV stations, a cross-site feature fusion mechanism is introduced. Specifically, a Graph Learner models inter-station correlations through a graph attention network constructed over a K-nearest-neighbor (KNN) graph, while a cross-site attention module further facilitates adaptive information exchange among sites. Finally, experiments conducted on data from eight PV stations in a northern province of China demonstrate the effectiveness of the proposed framework. Our proposed model is publicly available at this https URL.
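示意代码(假设性):跨站点图学习模块中的 K 近邻(KNN)图可按站点坐标构造对称邻接矩阵,作为图注意力的掩码;以下为最小示意,与论文实现细节无关。

```python
import math

def knn_adjacency(coords, k=2):
    """基于站点坐标构造对称的 K 近邻邻接矩阵。"""
    n = len(coords)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        dists = sorted(
            (math.dist(coords[i], coords[j]), j) for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            adj[i][j] = adj[j][i] = 1   # 对称化,供注意力掩码使用
    return adj

stations = [(0, 0), (0, 1), (5, 5), (5, 6)]   # 站点坐标(虚构)
A = knn_adjacency(stations, k=1)
```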
[AI-69] Learning Dexterous Grasping from Sparse Taxonomy Guidance
【速读】:该论文旨在解决灵巧操作中因对象和任务多样性导致的抓取规划难以精确指定的问题,以及纯端到端强化学习缺乏可控性、用户无法在失败时干预的局限性。其解决方案的关键在于提出一种两阶段框架GRIT(Grasp-based Reinforcement with Taxonomy-guided Instruction),首先基于场景和任务上下文预测一个基于分类法(taxonomy)的稀疏抓取规范,随后在该稀疏指令条件下,策略网络生成连续指端运动以完成任务并保持预期的抓取结构。通过利用特定抓取分类法与物体几何形状之间的关联性,GRIT显著提升了对新物体的泛化能力,并在真实世界实验中实现了高可控性,允许用户通过高层分类选择调整抓取策略。
链接: https://arxiv.org/abs/2604.04138
作者: Juhan Park,Taerim Yoon,Seungmin Kim,Joonggil Kim,Wontae Ye,Jeongeun Park,Yoonbyung Chai,Geonwoo Cho,Geunwoo Cho,Dohyeong Kim,Kyungjae Lee,Yongjae Kim,Sungjoon Choi
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Dexterous manipulation requires planning a grasp configuration suited to the object and task, which is then executed through coordinated multi-finger control. However, specifying grasp plans with dense pose or contact targets for every object and task is impractical. Meanwhile, end-to-end reinforcement learning from task rewards alone lacks controllability, making it difficult for users to intervene when failures occur. To this end, we present GRIT, a two-stage framework that learns dexterous control from sparse taxonomy guidance. GRIT first predicts a taxonomy-based grasp specification from the scene and task context. Conditioned on this sparse command, a policy generates continuous finger motions that accomplish the task while preserving the intended grasp structure. Our result shows that certain grasp taxonomies are more effective for specific object geometries. By leveraging this relationship, GRIT improves generalization to novel objects over baselines and achieves an overall success rate of 87.9%. Moreover, real-world experiments demonstrate controllability, enabling grasp strategies to be adjusted through high-level taxonomy selection based on object geometry and task intent.
[AI-70] Profile-Then-Reason : Bounded Semantic Complexity for Tool-Augmented Language Agents
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在使用外部工具进行推理时,因采用反应式执行(reactive execution)策略而导致的延迟高和误差传播敏感的问题。其核心解决方案是提出一种名为“先规划后推理”(Profile–Then–Reason, PTR)的有界执行框架,关键在于将整个推理过程分解为明确的流程:首先由语言模型生成一个结构化的执行工作流(workflow),随后通过确定性或受保护的操作符执行该工作流,再由验证器评估执行轨迹,仅在原工作流不可靠时触发修复机制。该框架通过数学形式化定义了包括规划、路由、执行、验证、修复和推理在内的操作符组合,并在有限修复条件下保证语言模型调用次数最多不超过两次(理想情况)或三次(最坏情况),从而显著提升效率与鲁棒性。
链接: https://arxiv.org/abs/2604.04131
作者: Paulo Akira F. Enabe
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language model agents that use external tools are often implemented through reactive execution, in which reasoning is repeatedly recomputed after each observation, increasing latency and sensitivity to error propagation. This work introduces Profile–Then–Reason (PTR), a bounded execution framework for structured tool-augmented reasoning, in which a language model first synthesizes an explicit workflow, deterministic or guarded operators execute that workflow, a verifier evaluates the resulting trace, and repair is invoked only when the original workflow is no longer reliable. A mathematical formulation is developed in which the full pipeline is expressed as a composition of profile, routing, execution, verification, repair, and reasoning operators; under bounded repair, the number of language-model calls is restricted to two in the nominal case and three in the worst case. Experiments against a ReAct baseline on six benchmarks and four language models show that PTR achieves the pairwise exact-match advantage in 16 of 24 configurations. The results indicate that PTR is particularly effective on retrieval-centered and decomposition-heavy tasks, whereas reactive execution remains preferable when success depends on substantial online adaptation.
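示意代码(假设性):PTR 的"规划—执行—验证—(按需)修复—推理"有界流程可用如下骨架表示:正常情形仅调用语言模型两次,最坏三次。函数名与调用结构均为假设,非论文 API。

```python
def run_ptr(llm_plan, execute, verify, llm_repair, llm_reason, query):
    """有界执行:LLM 调用次数 nominal=2,最坏=3。"""
    calls = 0
    workflow = llm_plan(query); calls += 1         # 调用 1:合成工作流
    trace = execute(workflow)                      # 确定性/受保护算子执行
    if not verify(trace):                          # 验证失败才触发修复
        workflow = llm_repair(query, trace); calls += 1
        trace = execute(workflow)
    answer = llm_reason(query, trace); calls += 1  # 最终推理调用
    return answer, calls
```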
[AI-71] NetSecBed: A Container-Native Testbed for Reproducible Cybersecurity Experimentation
【速读】:该论文旨在解决网络安全研究中可复现性证据(如流量痕迹、日志和标注数据集)缺乏的问题,尤其是在异构多协议环境中,现有公开数据集往往静态且难以支持受控重执行与溯源。其解决方案的关键在于提出NetSecBed——一个容器原生、场景导向的测试平台,通过将60种攻击场景、9类目标服务及良性流量生成器封装为单一功能容器,并借助声明式规范实现即插即用的扩展性和可追溯性;同时,其自动化流水线支持参数化执行、包捕获、日志收集、服务探测、特征提取与数据集整合,从而构建出可重复、可审计、可扩展的网络安全实验框架,显著降低操作偏差并支持持续的数据集生成。
链接: https://arxiv.org/abs/2604.04121
作者: Leonardo Bitzki,Diego Kreutz,Tiago Heinrich,Douglas Fideles,Leandro Bertholdo,Silvio Quincozes,Angelo Diniz
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Performance (cs.PF)
备注: 8 pages, including 4 figures and 2 tables, submitted to SBCUP 2026
Abstract:Cybersecurity research increasingly depends on reproducible evidence, such as traffic traces, logs, and labeled datasets, yet most public datasets remain static and offer limited support for controlled re-execution and traceability, especially in heterogeneous multi-protocol environments. This paper presents NetSecBed, a container-native, scenario-oriented testbed for reproducible generation of network traffic evidence and execution artifacts under controlled conditions, particularly suitable for IoT, IIoT, and pervasive multi-protocol environments. The framework integrates 60 attack scenarios, 9 target services, and benign traffic generators as single-purpose containers, enabling plug-and-play extensibility and traceability through declarative specifications. Its pipeline automates parametrized execution, packet capture, log collection, service probing, feature extraction, and dataset consolidation. The main contribution is a repeatable, auditable, and extensible framework for cybersecurity experimentation that reduces operational bias and supports continuous dataset generation.
[AI-72] InsTraj: Instructing Diffusion Models with Travel Intentions to Generate Real-world Trajectories
【速读】:该论文旨在解决现有方法在生成真实且可控的GPS轨迹时面临的两大挑战:一是缺乏对复杂用户出行意图的深层语义理解,二是难以在满足复杂约束的同时保持人类行为固有的多样性。解决方案的关键在于提出InsTraj框架,其核心创新是利用大语言模型(Large Language Model, LLM)将自然语言中非结构化的出行意图解析为丰富的语义蓝图,从而弥合意图与轨迹之间的表征鸿沟;随后设计了一个多模态轨迹扩散变换器(Multimodal Trajectory Diffusion Transformer),通过整合语义引导信息,生成高保真度且忠实于细粒度用户指令的轨迹,显著提升了生成轨迹的真实性、多样性和语义一致性。
链接: https://arxiv.org/abs/2604.04106
作者: Yuanshao Zhu,Yuxuan Liang,Xiangyu Zhao,Liang Han,Xinwei Fang,Xuetao Wei,James Jianqiao Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The generation of realistic and controllable GPS trajectories is a fundamental task for applications in urban planning, mobility simulation, and privacy-preserving data sharing. However, existing methods face a two-fold challenge: they lack the deep semantic understanding to interpret complex user travel intent, and struggle to handle complex constraints while maintaining the realistic diversity inherent in human behavior. To resolve this, we introduce InsTraj, a novel framework that instructs diffusion models to generate high-fidelity trajectories directly from natural language descriptions. Specifically, InsTraj first utilizes a powerful large language model to decipher unstructured travel intentions formed in natural language, thereby creating rich semantic blueprints and bridging the representation gap between intentions and trajectories. Subsequently, we proposed a multimodal trajectory diffusion transformer that can integrate semantic guidance to generate high-fidelity and instruction-faithful trajectories that adhere to fine-grained user intent. Comprehensive experiments on real-world datasets demonstrate that InsTraj significantly outperforms state-of-the-art methods in generating trajectories that are realistic, diverse, and semantically faithful to the input instructions.
[AI-73] Compliance-by-Construction Argument Graphs: Using Generative AI to Produce Evidence-Linked Formal Arguments for Certification-Grade Accountability
【速读】:该论文旨在解决生成式 AI(Generative AI, GenAI)在高风险决策系统中应用时面临的可信性与合规性问题,尤其是由于语言模型作为松散约束的助手所导致的幻觉推理、无支撑主张及弱可追溯性风险。其解决方案的关键在于提出一种“合规构建”(compliance-by-construction)架构,将 GenAI 与结构化的形式化论证表示相结合,通过四个核心组件实现:(i) 基于保证案例方法的类型化论证图(Typed Argument Graph),(ii) 基于检索增强生成(Retrieval-Augmented Generation, RAG)的证据锚定片段生成,(iii) 强制完整性与可接受性约束的推理与验证内核,以及 (iv) 符合 W3C PROV 标准的溯源日志,从而确保每一步 AI 辅助决策均需由可验证证据支持并满足显式推理规则,最终在保障审计能力的同时提升论证构建效率。
链接: https://arxiv.org/abs/2604.04103
作者: Mahyar T. Moghaddam
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at FACCT 2026, IoT CPS Week
Abstract:High-stakes decision systems increasingly require structured justification, traceability, and auditability to ensure accountability and regulatory compliance. Formal arguments commonly used in the certification of safety-critical systems provide a mechanism for structuring claims, reasoning, and evidence in a verifiable manner. At the same time, generative artificial intelligence systems are increasingly integrated into decision-support workflows, assisting with drafting explanations, summarizing evidence, and generating recommendations. However, current deployments often rely on language models as loosely constrained assistants, which introduces risks such as hallucinated reasoning, unsupported claims, and weak traceability. This paper proposes a compliance-by-construction architecture that integrates Generative AI (GenAI) with structured formal argument representations. The approach treats each AI-assisted step as a claim that must be supported by verifiable evidence and validated against explicit reasoning constraints before it becomes part of an official decision record. The architecture combines four components: i) a typed Argument Graph representation inspired by assurance-case methods, ii) retrieval-augmented generation (RAG) to draft argument fragments grounded in authoritative evidence, iii) a reasoning and validation kernel enforcing completeness and admissibility constraints, and iv) a provenance ledger aligned with the W3C PROV standard to support auditability. We present a system design and an evaluation strategy based on enforceable invariants and worked examples. The analysis suggests that deterministic validation rules can prevent unsupported claims from entering the decision record while allowing GenAI to accelerate argument construction.
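示意代码(假设性):"不允许无证据支撑的主张进入决策记录"这一不变式可表示为论证图上的一个简单校验;节点与边的数据结构为假设性示意,并非论文的类型化图模式。

```python
def admissible(graph):
    """每个 claim 节点必须至少被一个 evidence 节点支撑。"""
    evidence = {n for n, kind in graph["nodes"].items() if kind == "evidence"}
    for node, kind in graph["nodes"].items():
        if kind == "claim":
            supports = {src for src, dst in graph["edges"] if dst == node}
            if not (supports & evidence):
                return False
    return True

g_ok = {"nodes": {"c1": "claim", "e1": "evidence"}, "edges": [("e1", "c1")]}
g_bad = {"nodes": {"c1": "claim"}, "edges": []}   # 无支撑主张,应被拒绝
```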
[AI-74] oward a Sustainable Software Architecture Community: Evaluating ICSAs Environmental Impact
【速读】:该论文旨在解决生成式 AI(Generative AI)在软件架构研究中日益广泛应用所带来的碳足迹问题,尤其是其计算资源消耗对环境的影响尚未被系统记录和量化。解决方案的关键在于首次对两个不同边界下的碳排放进行系统性审计:一是基于研究成果边界的生成式 AI 推理使用所产生的数字碳足迹;二是基于会议时间边界的 IEEE 国际软件架构会议(ICSA 2025)的实体活动碳足迹(包括交通、住宿、餐饮、场地能源与物资消耗)。通过构建这两类独立但互补的碳清单,论文推动了研究透明度,并为可持续软件架构实践提供了可操作建议,如提升生成式 AI 能效、优化会议绿色规划等。
链接: https://arxiv.org/abs/2604.04096
作者: Mahyar T. Moghaddam,Mina Alipour,Torben Worm,Mikkel Baun Kjærgaard
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: accepted at ICSA-C 2026
Abstract:Generative AI (GenAI) tools are increasingly integrated into software architecture research, yet the environmental impact of their computational usage remains largely undocumented. This study presents the first systematic audit of the carbon footprint of both the digital footprint from GenAI usage in research papers, and the traditional footprint from conference activities within the context of the IEEE International Conference on Software Architecture (ICSA). We report two separate carbon inventories relevant to the software architecture research community: i) an exploratory estimate of the footprint of GenAI inference usage associated with accepted papers within a research-artifact boundary, and ii) the conference attendance and operations footprint of ICSA 2025 (travel, accommodation, catering, venue energy, and materials) within the conference time boundary. These two inventories, with different system boundaries and completeness, support transparency and community reflection. We discuss implications for sustainable software architecture, including recommendations for transparency, greener conference planning, and improved energy efficiency in GenAI operations. Our work supports a more climate-conscious research culture within the ICSA community and beyond.
[AI-75] Fine-grained Analysis of Stability and Generalization for Stochastic Bilevel Optimization
【速读】:该论文旨在解决随机双层优化(Stochastic Bilevel Optimization, SBO)方法在统计学习理论框架下的泛化性保障问题,即如何量化SBO算法的泛化差距(generalization gap)。其关键解决方案在于建立**平均参数稳定性(on-average argument stability)**与泛化差距之间的定量关系,并在此基础上推导出单时标和双时标随机梯度下降(SGD)方法在三种典型设置(非凸-非凸、凸-凸、强凸-强凸)下的平均参数稳定性上界。相较于以往依赖每轮迭代重初始化内层参数的算法稳定性分析,本文方法无需此类限制,适用于更广泛的损失函数结构,从而为SBO在超参数优化、元学习和强化学习等场景中的泛化性能提供了坚实的理论基础。
链接: https://arxiv.org/abs/2604.04090
作者: Xuelin Zhang,Hong Chen,Bin Gu,Tieliang Gong,Feng Zheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Stochastic bilevel optimization (SBO) has been integrated into many machine learning paradigms recently, including hyperparameter optimization, meta learning, and reinforcement learning. Along with the wide range of applications, there have been numerous studies on the computational behavior of SBO. However, the generalization guarantees of SBO methods are far less understood from the lens of statistical learning theory. In this paper, we provide a systematic generalization analysis of the first-order gradient-based bilevel optimization methods. Firstly, we establish the quantitative connections between the on-average argument stability and the generalization gap of SBO methods. Then, we derive the upper bounds of on-average argument stability for single-timescale stochastic gradient descent (SGD) and two-timescale SGD, where three settings (nonconvex-nonconvex (NC-NC), convex-convex (C-C), and strongly-convex-strongly-convex (SC-SC)) are considered respectively. Experimental analysis validates our theoretical findings. Compared with the previous algorithmic stability analysis, our results do not require reinitializing the inner-level parameters at each iteration and are applicable to more general objective functions.
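公式示意(假设性):"平均参数稳定性 ⇒ 泛化差距受控"这类桥接结果的标准形式如下,仅为示意,并非论文定理的精确表述。设损失关于参数 L-Lipschitz,S^(i) 为替换数据集 S 中第 i 个样本所得的数据集,F 与 F_S 分别为总体风险与经验风险:

```latex
% 假设性示意:稳定性到泛化差距的标准桥接形式(非论文原定理)。
\mathbb{E}_{S,S^{(i)},A}\!\left[\frac{1}{n}\sum_{i=1}^{n}\bigl\|A(S)-A(S^{(i)})\bigr\|_{2}\right]\le\epsilon
\;\Longrightarrow\;
\Bigl|\mathbb{E}\bigl[F(A(S))-F_{S}(A(S))\bigr]\Bigr|\le L\,\epsilon .
```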
[AI-76] Parent Selection Mechanisms in Elitist Crossover-Based Algorithms
【速读】:该论文旨在解决进化计算中父代选择方法的理论优势尚不明确的问题,特别是在 (μ+1) 遗传算法(Genetic Algorithm, GA)框架下,如何通过合理设计父代选择策略来提升优化效率。其解决方案的关键在于引入一种新颖的多样性度量指标,该指标同时刻画种群中个体对之间的最大距离及其达到该距离的配对数量,并设计了一种父代选择策略:以 Ω(1) 的概率选择一对最大距离的父代进行交叉操作。这种机制使交叉成为维持和生成多样性的核心手段,而非仅在运行末期用于组合已分化的个体。这一分析视角将 Jump_k 问题上的期望时间复杂度显著改进至 O(k·4^k·n·log n),优于以往无显式多样性保持机制的 (μ+1) GA 的最优界 O(n·μ·log μ + n·log n + n^{k−1}),从而深化了对交叉在遗传算法种群动态中作用的理论理解。
链接: https://arxiv.org/abs/2604.04083
作者: Andre Opris,Denis Antipov
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:
Abstract:Parent selection methods are widely used in evolutionary computation to accelerate the optimization process, yet their theoretical benefits are still poorly understood. In this paper, we address this gap by incorporating different parent selection strategies into the (μ+1) genetic algorithm (GA). We show that, with an appropriately chosen population size and a parent selection strategy that selects a pair of maximally distant parents with probability Ω(1) for crossover, the resulting algorithm solves the Jump_k problem in O(k·4^k·n·log n) expected time. This bound is significantly smaller than the best known bound of O(n·μ·log μ + n·log n + n^{k−1}) for any (μ+1) GA using no explicit diversity-preserving mechanism and a constant crossover probability. To establish this result, we introduce a novel diversity metric that captures both the maximum distance between pairs of individuals in the population and the number of pairs achieving this distance. The crucial point of our analysis is that it relies on crossover as a mechanism for creating and maintaining diversity throughout the run, rather than using crossover only in the final step to combine already diversified individuals, as has been done in many previous works. The insights provided by our analysis contribute to a deeper theoretical understanding of the role of crossover in the population dynamics of genetic algorithms.
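摘要中的多样性度量(种群内个体对的最大距离及达到该距离的配对数)与"选取最大距离父代对"的策略,可用如下极简 Python 示意(基于 Hamming 距离的笔者示意实现,函数名与示例种群均为假设,非论文官方代码):

```python
from itertools import combinations

def diversity_metric(population):
    """返回 (最大 Hamming 距离, 达到该距离的个体对数)。"""
    def hamming(x, y):
        return sum(a != b for a, b in zip(x, y))
    dists = [hamming(x, y) for x, y in combinations(population, 2)]
    d_max = max(dists)
    return d_max, sum(1 for d in dists if d == d_max)

def pick_max_distant_parents(population):
    """选取一对最大距离的父代用于交叉(对应"以 Ω(1) 概率选取最大距离父代对")。"""
    return max(combinations(population, 2),
               key=lambda pair: sum(a != b for a, b in zip(*pair)))

pop = ["10110", "10011", "00110"]
print(diversity_metric(pop))          # (3, 1)
print(pick_max_distant_parents(pop))  # ('10011', '00110')
```

完整算法还需以 Ω(1) 的概率在该策略与普通均匀选择之间切换,此处从略。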
[AI-77] FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification
【速读】:该论文旨在解决机器学习领域同行评审(peer review)因投稿量激增与审稿人时间有限而面临的压力问题。现有基于大语言模型(Large Language Models, LLMs)的评审系统通常仅依赖论文文本生成评论,易受表述质量影响,且难以评估文中未直接呈现但对论证至关重要的证据,如相关工作或开源代码。解决方案的关键在于提出FactReview——一个以证据为基础的评审系统,其核心机制包括:1)从论文中提取关键主张(claim)和实验结果;2)检索相近文献以定位技术贡献;3)在有源代码可用时,在限定预算下执行代码以验证核心实证主张。该方法通过结构化证据评估,将每个主要主张标记为“支持”、“论文支持”、“部分支持”、“冲突”或“无法判断”,从而提升评审的客观性和可追溯性。
链接: https://arxiv.org/abs/2604.04074
作者: Hang Xu,Ling Yue,Chaoqian Ouyang,Libin Zheng,Shaowu Pan,Shimin Di,Min-Ling Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Peer review in machine learning is under growing pressure from rising submission volume and limited reviewer time. Most LLM-based reviewing systems read only the manuscript and generate comments from the paper’s own narrative. This makes their outputs sensitive to presentation quality and leaves them weak when the evidence needed for review lies in related work or released code. We present FactReview, an evidence-grounded reviewing system that combines claim extraction, literature positioning, and execution-based claim verification. Given a submission, FactReview identifies major claims and reported results, retrieves nearby work to clarify the paper’s technical position, and, when code is available, executes the released repository under bounded budgets to test central empirical claims. It then produces a concise review and an evidence report that assigns each major claim one of five labels: Supported, Supported by the paper, Partially supported, In conflict, or Inconclusive. In a case study on CompGCN, FactReview reproduces results that closely match those reported for link prediction and node classification, yet also shows that the paper’s broader performance claim across tasks is not fully sustained: on MUTAG graph classification, the reproduced result is 88.4%, whereas the strongest baseline reported in the paper remains 92.6%. The claim is therefore only partially supported. More broadly, this case suggests that AI is most useful in peer review not as a final decision-maker, but as a tool for gathering evidence and helping reviewers produce more evidence-grounded assessments. The code is public at this https URL.
[AI-78] CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在多轮交互场景中面临的持续演化型对抗攻击(evolving adversarial attacks)所带来的安全威胁问题。现有防御方法多为被动响应式,难以适应攻击者在多轮交互中不断迭代优化的策略。其解决方案的关键在于提出一种基于协作智能体的状态化多轮防御框架 CoopGuard,该框架通过维护和更新内部防御状态(即交互历史),由系统代理(System Agent)协调三个专用代理——延迟代理(Deferring Agent)、诱惑代理(Tempting Agent)和取证代理(Forensic Agent)——分别执行互补的轮次级防御策略,从而实现对动态演化的攻击的有效识别与阻断。
链接: https://arxiv.org/abs/2604.04060
作者: Siyuan Li,Zehao Liu,Xi Lin,Qinghua Mao,Yuliang Chen,Haoyu Li,Jun Wu,Jianhua Li,Xiu Su
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:As Large Language Models (LLMs) are increasingly deployed in complex applications, their vulnerability to adversarial attacks raises urgent safety concerns, especially those evolving over multi-round interactions. Existing defenses are largely reactive and struggle to adapt as adversaries refine strategies across rounds. In this work, we propose CoopGuard , a stateful multi-round LLM defense framework based on cooperative agents that maintains and updates an internal defense state to counter evolving attacks. It employs three specialized agents (Deferring Agent, Tempting Agent, and Forensic Agent) for complementary round-level strategies, coordinated by System Agent, which conditions decisions on the evolving defense state (interaction history) and orchestrates agents over time. To evaluate evolving threats, we introduce the EMRA benchmark with 5,200 adversarial samples across 8 attack types, simulating progressively LLM multi-round attacks. Experiments show that CoopGuard reduces attack success rate by 78.9% over state-of-the-art defenses, while improving deceptive rate by 186% and reducing attack efficiency by 167.9%, offering a more comprehensive assessment of multi-round defense. These results demonstrate that CoopGuard provides robust protection for LLMs in multi-round adversarial scenarios.
[AI-79] Geometric Limits of Knowledge Distillation: A Minimum-Width Theorem via Superposition Theory
【速读】:该论文试图解决知识蒸馏(Knowledge Distillation)中的性能饱和问题,即无论训练方法或目标函数如何变化,学生模型(student)的性能均存在一个无法突破的损失下限(loss floor)。其解决方案的关键在于提出一个几何视角:神经网络通过超叠加(superposition)机制,在远多于维度数的特征空间中表示信息;而学生模型的宽度 d_S 限制了其可编码的特征数量,最多为 d_S·g(α),其中 g(α) 是依赖稀疏度 α 的容量函数。这一理论揭示了损失下限的本质是特征预算不足导致的永久性信息丢失,而非优化问题。作者通过小规模玩具模型和 Pythia-410M 模型上的稀疏自动编码器(Sparse Autoencoders, SAE)测量验证了该预测,并表明观测到的损失下限可分解为一个几何成分和一个与宽度无关的架构基线,且即使丢失 88% 的特征,粗粒度概念仍能保留,说明损失下限主要源于重要性分布长尾中细粒度特征的累积损失。
链接: https://arxiv.org/abs/2604.04037
作者: Dawar Jyoti Deka,Nilesh Sarkar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Knowledge distillation compresses large teachers into smaller students, but performance saturates at a loss floor that persists across training methods and objectives. We argue this floor is geometric: neural networks represent far more features than dimensions through superposition, and a student of width d_S can encode at most d_S·g(α) features, where g(α) = 1/((1−α)·ln(1/(1−α))) is a sparsity-dependent capacity function. Features beyond this budget are permanently lost, yielding an importance-weighted loss floor. We validate on a toy model (48 configurations, median accuracy 93%) and on Pythia-410M, where sparse autoencoders measure F ≈ 28,700 features at α ≈ 0.992 (critical width d_S* ≈ 1,065). Distillation into five student widths confirms the predicted monotonic floor ordering. The observed floor decomposes into a geometric component and a width-independent architectural baseline (R² = 0.993). Linear probing shows coarse concepts survive even 88% feature loss, revealing the floor arises from aggregate loss of fine-grained features in the importance distribution's long tail. Our results connect representation geometry to distillation limits and provide a practical tool for predicting distillation performance from SAE measurements alone.
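摘要给出的容量函数 g(α) 与特征预算可以直接数值化验证(示意代码,输入数值取自摘要;临界宽度与论文报告的 d_S* ≈ 1,065 的微小出入应来自 α 的舍入):

```python
import math

def g(alpha):
    # 容量函数 g(α) = 1 / ((1−α) · ln(1/(1−α))),α 为特征稀疏度
    return 1.0 / ((1.0 - alpha) * math.log(1.0 / (1.0 - alpha)))

def feature_budget(d_s, alpha):
    # 宽度为 d_S 的学生模型至多可编码 d_S · g(α) 个特征
    return d_s * g(alpha)

def critical_width(num_features, alpha):
    # 覆盖全部 F 个特征所需的临界宽度 d_S* ≈ F / g(α)
    return num_features / g(alpha)

# 摘要中 Pythia-410M 的测量值:F ≈ 28,700,α ≈ 0.992
print(round(g(0.992), 2))                      # ≈ 25.89
print(round(critical_width(28_700, 0.992)))    # ≈ 1109,与论文的 1,065 同一量级
```

宽度低于临界值的学生模型,其损失下限即由预算之外特征的重要性加权和决定。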
[AI-80] Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents
【速读】:该论文旨在解决工具调用大语言模型(Tool-calling LLM)代理在执行过程中因工具调用被拒绝而引发的“因果洗白”(causality laundering)泄露问题,即攻击者通过探测受保护动作的拒绝结果,利用其因果影响而非直接数据流来窃取敏感信息。解决方案的关键在于提出一种运行时强制层——代理参考监视器(Agentic Reference Monitor, ARM),它基于包含工具调用、返回数据、字段级溯源和被拒操作的 provenance 图进行决策,并通过完整性格(integrity lattice)传播信任,同时引入从被拒动作节点出发的反事实边(counterfactual edges),从而同时覆盖传递性数据依赖和由拒绝引发的因果影响,实现对混合溯源场景下恶意行为的有效防御。
链接: https://arxiv.org/abs/2604.04035
作者: Mohammad Hossein Chinaei
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 24 pages, 1 figure, 2 tables, 1 algorithm, preprint
Abstract:Tool-calling LLM agents can read private data, invoke external services, and trigger real-world actions, creating a security problem at the point of tool execution. We identify a denial-feedback leakage pattern, which we term causality laundering, in which an adversary probes a protected action, learns from the denial outcome, and exfiltrates the inferred information through a later seemingly benign tool call. This attack is not captured by flat provenance tracking alone because the leaked information arises from causal influence of the denied action, not direct data flow. We present the Agentic Reference Monitor (ARM), a runtime enforcement layer that mediates every tool invocation by consulting a provenance graph over tool calls, returned data, field-level provenance, and denied actions. ARM propagates trust through an integrity lattice and augments the graph with counterfactual edges from denied-action nodes, enabling enforcement over both transitive data dependencies and denial-induced causal influence. In a controlled evaluation on three representative attack scenarios, ARM blocks causality laundering, transitive taint propagation, and mixed-provenance field misuse that a flat provenance baseline misses, while adding sub-millisecond policy evaluation overhead. These results suggest that denial-aware causal provenance is a useful abstraction for securing tool-calling agent systems.
[AI-81] Can LLMs Learn to Reason Robustly under Noisy Supervision?
【速读】:该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)在训练过程中因专家标注稀缺而导致的噪声标签问题,特别是针对两类噪声标签——“非活跃噪声标签”(inactive noisy labels)和“活跃噪声标签”(active noisy labels)对模型训练造成的不同影响。其核心挑战在于:传统RLVR算法依赖于rollout条件来决定标签是否参与训练,而噪声标签可能在此机制下被错误强化或无效利用,从而损害模型性能。解决方案的关键是提出在线标签精炼(Online Label Refinement, OLR)方法,该方法基于两个动态判据——多数投票答案的rollout通过率呈正斜率且历史一致性稳定,从而在训练早期阶段逐步识别并修正潜在的噪声标签,实现模型在不依赖完美标签的前提下持续自我修正与优化。实验表明,OLR在多种分布内和分布外数学推理任务上均显著提升了鲁棒性,尤其在噪声比例高达90%时仍保持稳定增益。
链接: https://arxiv.org/abs/2604.03993
作者: Shenzhi Yang,Guangcheng Zhu,Bowen Song,Sharon Li,Haobo Wang,Xing Zheng,Yingfan Ma,Zhongqi Chen,Weiqiang Wang,Gang Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) effectively trains reasoning models that rely on abundant perfect labels, but its vulnerability to unavoidable noisy labels due to expert scarcity remains critically underexplored. In this work, we take the first step toward a systematic analysis of noisy label mechanisms in RLVR. In contrast to supervised classification, most RLVR algorithms incorporate a rollout-based condition: a label’s influence on training is contingent on whether the current policy can generate rollouts that realize it, a property that naturally extends to noisy labels. Based on this observation, we distinguish two types of noise: inactive noisy labels, which reduce data efficiency, and active noisy labels, which are reinforced and risk skewing the model toward incorrect distributions. From experiments on training with noisy samples, we identify an Early Correctness Coherence phenomenon: although noisy samples begin to lag behind in later stages, accuracy on both clean and noisy samples increases similarly in early training. Motivated by this dynamic, we propose Online Label Refinement (OLR), which progressively corrects potentially noisy labels with majority-voted answers when two conditions hold: a positive slope in the majority answer’s rollout pass rate and stable historical consistency across updates, enabling gradual self-correction as the policy improves. We evaluate OLR on six in-distribution mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). Across noise ratios from 0.1 to 0.9, OLR consistently improves robustness under both inactive and active noisy-label settings, achieving average gains of 3.6% to 3.9% on in-distribution benchmarks and 3.3% to 4.6% on out-of-distribution evaluations.
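OLR 的两个触发条件(多数投票答案的 rollout 通过率斜率为正、近几步答案保持一致)可用如下示意代码表达(窗口大小、阈值与示例数值均为笔者假设,非论文设定):

```python
def slope(values):
    # 对通过率序列做最小二乘拟合,返回斜率
    n = len(values)
    mx, my = (n - 1) / 2, sum(values) / n
    num = sum((i - mx) * (v - my) for i, v in enumerate(values))
    den = sum((i - mx) ** 2 for i in range(n))
    return num / den

def maybe_refine(label, history, window=3):
    """history: [(多数投票答案, 该答案的 rollout 通过率), ...],按训练步排序。
    仅当近 window 步多数答案一致且通过率斜率为正时,才用多数答案替换原标签。"""
    answers = [a for a, _ in history]
    rates = [r for _, r in history]
    consistent = len(set(answers[-window:])) == 1
    if consistent and slope(rates) > 0 and answers[-1] != label:
        return answers[-1]   # 疑似噪声标签,用多数答案在线修正
    return label

hist = [("42", 0.20), ("42", 0.35), ("42", 0.50), ("42", 0.60)]
print(maybe_refine("17", hist))                                    # "42":标签被修正
print(maybe_refine("17", [("42", 0.5), ("9", 0.4), ("42", 0.3)]))  # "17":条件不满足,保留
```

这正对应摘要中的"Early Correctness Coherence"直觉:只有当策略对某一答案的掌握呈上升且稳定趋势时,才信任它去覆盖原标签。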
[AI-82] Quantifying Trust: Financial Risk Management for Trustworthy AI Agents
【速读】:该论文旨在解决当前可信人工智能(Trustworthy AI)研究中模型层面可靠性与用户端实际体验之间的脱节问题。随着AI系统演变为在开放环境中自主运行的智能体(Agent),其行为具有固有的随机性,单纯依赖模型内部属性(如偏见缓解、对抗鲁棒性和可解释性)已不足以保障用户安全和任务完成质量。论文提出了一种基于风险管理的互补框架——代理风险标准(Agentic Risk Standard, ARS),其核心在于将风险评估、承保(underwriting)与补偿机制整合进单个交易流程中,使用户在遭遇执行失败、意图偏离或意外后果时获得预设且具有合同约束力的赔偿。这一机制将信任从对模型行为的隐含预期转变为可量化、可执行的产品级保障,从而实现从技术可靠到产品责任的范式转变。
链接: https://arxiv.org/abs/2604.03976
作者: Wenyue Hua,Tianyi Peng,Chi Wang,Ian Kaufman,Bryan Lim,Chandler Fang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: 30 pages, 9 figures
Abstract:Prior work on trustworthy AI emphasizes model-internal properties such as bias mitigation, adversarial robustness, and interpretability. As AI systems evolve into autonomous agents deployed in open environments and increasingly connected to payments or assets, the operational meaning of trust shifts to end-to-end outcomes: whether an agent completes tasks, follows user intent, and avoids failures that cause material or psychological harm. These risks are fundamentally product-level and cannot be eliminated by technical safeguards alone because agent behavior is inherently stochastic. To address this gap between model-level reliability and user-facing assurance, we propose a complementary framework based on risk management. Drawing inspiration from financial underwriting, we introduce the **Agentic Risk Standard (ARS)**, a payment settlement standard for AI-mediated transactions. ARS integrates risk assessment, underwriting, and compensation into a single transaction framework that protects users when interacting with agents. Under ARS, users receive predefined and contractually enforceable compensation in cases of execution failure, misalignment, or unintended outcomes. This shifts trust from an implicit expectation about model behavior to an explicit, measurable, and enforceable product guarantee. We also present a simulation study analyzing the social benefits of applying ARS to agentic transactions. ARS's implementation can be found at this https URL.
[AI-83] TraceGuard: Structured Multi-Dimensional Monitoring as a Collusion-Resistant Control Protocol
【速读】:该论文旨在解决现有AI控制协议中单分数监控机制的两大局限性:一是难以检测输出看似正常但推理过程异常的隐蔽攻击,二是当监控器与AI代理模型相同时(即共谋场景)安全性几乎降为零。其解决方案的关键在于提出TraceGuard,一种结构化的多维监控协议,通过并行调用独立的大语言模型(LLM)对代理行为在五个维度(目标对齐、约束遵守、推理连贯性、安全意识和动作轨迹一致性)进行评分,并结合七种启发式检测器与基于LLM的意图分析模块,实现更精准的攻击识别与防共谋能力。实验表明,该方法显著提升了攻击与诚实行为的分离度(Δ = 0.410),并在共谋场景下将安全性从0%提升至95%,同时通过职责分离设计进一步实现了100%的安全保障。
链接: https://arxiv.org/abs/2604.03968
作者: Khanh Linh Nguyen,Hoa Nghiem,Tu Tran
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:AI control protocols use monitors to detect attacks by untrusted AI agents, but standard single-score monitors face two limitations: they miss subtle attacks where outputs look clean but reasoning is off, and they collapse to near-zero safety when the monitor is the same model as the agent (collusion). We present TraceGuard, a structured multi-dimensional monitoring protocol that evaluates agent actions across five dimensions – goal alignment, constraint adherence, reasoning coherence, safety awareness, and action-trace consistency – scored in parallel by independent LLM calls, augmented by seven heuristic detectors and an LLM-based intent analyzer. We evaluate on BashArena (637 bash tasks, 4 attack categories) within the ControlArena framework. Our results on 519 samples (279 honest, 240 attack) show that: (1) the hybrid approach achieves clear attack-honest separation (attack mean 0.616 vs. honest mean 0.206, Δ = 0.410); (2) structured scoring constrains collusion – the untrusted structured monitor achieves 95% safety vs. 0% for single-score untrusted monitoring; (3) goal alignment and constraint adherence are the most discriminative dimensions; and (4) a separation-of-duties variant splitting dimensions across trusted and untrusted models achieves 100% safety while preventing any single model from seeing the full evaluation. TraceGuard is implemented as a new monitor type for the open-source ControlArena framework.
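五个维度的结构化打分与启发式检测器可以按如下思路组合成单一可疑度(加权平均的聚合方式与各项权重均为笔者假设,摘要并未给出具体公式):

```python
def hybrid_suspicion(dim_scores, heuristic_flags, w_dim=0.7, w_heur=0.3):
    """dim_scores: 五个维度的打分(0~1,越高越可疑);
    heuristic_flags: 7 个启发式检测器的命中标记(0/1)。
    返回加权组合后的可疑度(示意聚合,非 TraceGuard 官方公式)。"""
    dim = sum(dim_scores.values()) / len(dim_scores)
    heur = sum(heuristic_flags) / len(heuristic_flags)
    return w_dim * dim + w_heur * heur

scores = {"goal_alignment": 0.9, "constraint_adherence": 0.8,
          "reasoning_coherence": 0.5, "safety_awareness": 0.6,
          "action_trace_consistency": 0.7}
flags = [1, 1, 0, 0, 0, 0, 0]
print(round(hybrid_suspicion(scores, flags), 3))   # 0.576
```

按维度拆分打分的好处在摘要中有体现:即使不可信监控器想"放水",它也必须在多个独立维度上同时说谎,从而更难共谋。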
[AI-84] SKILLFOUNDRY: Building Self-Evolving Agent Skill Libraries from Heterogeneous Scientific Resources
【速读】:该论文旨在解决科学生态系统中大量分散的程序性知识(procedural knowledge)难以被智能代理有效利用的问题,即科学知识与可用代理能力之间的鸿沟。其解决方案的关键在于提出SkillFoundry框架,该框架通过构建领域知识树(domain knowledge tree),从高价值资源中挖掘并提取操作契约(operational contracts),将其编译为可执行的技能包(skill packages),并通过闭环验证机制迭代扩展、修复、合并或修剪技能库,从而生成新颖且内部一致的技能集合。此方法显著提升了编码代理在基准测试和特定科学任务(如基因组学中的细胞类型注释和scDRS工作流)上的性能,并扩展了现有手工设计技能库的覆盖范围。
链接: https://arxiv.org/abs/2604.03964
作者: Shuaike Shen,Wenduo Cheng,Mingqian Ma,Alistair Turcan,Martin Jinye Zhang,Jian Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Modern scientific ecosystems are rich in procedural knowledge across repositories, APIs, scripts, notebooks, documentation, databases, and papers, yet much of this knowledge remains fragmented across heterogeneous artifacts that agents cannot readily operationalize. This gap between abundant scientific know-how and usable agent capabilities is a key bottleneck for building effective scientific agents. We present SkillFoundry, a self-evolving framework that converts such resources into validated agent skills, reusable packages that encode task scope, inputs and outputs, execution steps, environment assumptions, provenance, and tests. SkillFoundry organizes a target domain as a domain knowledge tree, mines resources from high-value branches, extracts operational contracts, compiles them into executable skill packages, and then iteratively expands, repairs, merges, or prunes the resulting library through a closed-loop validation process. SkillFoundry produces a substantially novel and internally valid skill library, with 71.1% of mined skills differing from existing skill libraries such as SkillHub and SkillSMP. We demonstrate that these mined skills improve coding agent performance on five of the six MoSciBench datasets. We further show that SkillFoundry can design new task-specific skills on demand for concrete scientific objectives, and that the resulting skills substantially improve performance on two challenging genomics tasks: cell type annotation and the scDRS workflow. Together, these results show that automatically mined skills improve agent performance on benchmarks and domain-specific tasks, expand coverage beyond hand-crafted skill libraries, and provide a practical foundation for more capable scientific agents.
[AI-85] Diagonal-Tiled Mixed-Precision Attention for Efficient Low-Bit MXFP Inference CVPR
【速读】:该论文旨在解决基于Transformer的大语言模型(Large Language Models, LLMs)在推理阶段因注意力机制的二次复杂度和高精度运算带来的内存带宽瓶颈所导致的高昂计算成本问题。其解决方案的关键在于提出一种基于微缩放浮点数(Microscaling Floating-Point, MXFP)数据格式的低比特混合精度注意力核(Diagonal-Tiled Mixed-Precision Attention, DMA),通过在分块层级引入两种低比特计算方式,并利用Triton编程语言实现高度融合的计算内核,从而充分利用下一代GPU架构的计算能力,有效提升推理速度与内存效率,同时保持生成质量几乎无损。
链接: https://arxiv.org/abs/2604.03950
作者: Yifu Ding,Xinhao Zhang,Jinyang Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: CVPR Workshop EDGE 2026
Abstract:Transformer-based large language models (LLMs) have demonstrated remarkable performance across a wide range of real-world tasks, but their inference cost remains prohibitively high due to the quadratic complexity of attention and the memory bandwidth limitations of high-precision operations. In this work, we present a low-bit mixed-precision attention kernel using the microscaling floating-point (MXFP) data format, utilizing the computing capability on next-generation GPU architectures. Our Diagonal-Tiled Mixed-Precision Attention (DMA) incorporates two kinds of low-bit computation at the tiling-level, and is a delicate fused kernel implemented using Triton, exploiting hardware-level parallelism and memory efficiency to enable fast and efficient inference without compromising model performance. Extensive empirical evaluations on NVIDIA B200 GPUs show that our kernel maintains generation quality with negligible degradation, and meanwhile achieves significant speedup by kernel fusion. We release our code at this https URL.
[AI-86] CODE-GEN: A Human-in-the-Loop RAG-Based Agentic AI System for Multiple-Choice Question Generation
【速读】:该论文旨在解决教育领域中高质量编程理解类多选题(multiple-choice questions)自动化生成的难题,以提升学生代码推理与理解能力。其核心问题在于如何在保持教学有效性的同时,实现大规模、上下文对齐的题目生成。解决方案的关键是提出一个“人机协同”的检索增强生成(retrieval-augmented generation, RAG)型智能体系统 CODE-GEN,该系统由两个专用智能体组成:Generator 负责根据课程学习目标生成题目,Validator 独立评估题目在七个教学维度上的质量;两者均通过专用工具增强计算准确性与代码验证能力。实证研究表明,CODE-GEN 在多数可量化维度上表现优异(人类验证成功率79.9%–98.6%),但在需要深层教学判断的维度(如干扰项设计和反馈质量)仍需人类专家介入,从而为AI辅助教育内容生成提供了人机分工策略依据。
链接: https://arxiv.org/abs/2604.03926
作者: Xiaojing Duan,Frederick Nwanganga,Chaoli Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Full version of the paper accepted as a short paper at the 27th International Conference on Artificial Intelligence in Education (AIED 2026)
Abstract:We present CODE-GEN, a human-in-the-loop, retrieval-augmented generation (RAG)-based agentic AI system for generating context-aligned multiple-choice questions to develop student code reasoning and comprehension abilities. CODE-GEN employs an agentic AI architecture in which a Generator agent produces multiple-choice coding comprehension questions aligned with course-specific learning objectives, while a Validator agent independently assesses content quality across seven pedagogical dimensions. Both agents are augmented with specialized tools that enhance computational accuracy and verify code outputs. To evaluate the effectiveness of CODE-GEN, we conducted an evaluation study involving six human subject-matter experts (SMEs) who judged 288 AI-generated questions. The SMEs produced a total of 2,016 human-AI rating pairs, indicating agreement or disagreement with the assessments of Validator, along with 131 instances of qualitative feedback. Analyses of SME judgments show strong system performance, with human-validated success rates ranging from 79.9% to 98.6% across the seven pedagogical dimensions. The analysis of qualitative feedback reveals that CODE-GEN achieves high reliability on dimensions well suited to computational verification and explicit criteria matching, including question clarity, code validity, concept alignment, and correct answer validity. In contrast, human expertise remains essential for dimensions requiring deeper instructional judgment, such as designing pedagogically meaningful distractors and providing high-quality feedback that reinforces understanding. These findings inform the strategic allocation of human and AI effort in AI-assisted educational content generation.
[AI-87] Automating Cloud Security and Forensics Through a Secure-by-Design Generative AI Framework
【速读】:该论文旨在解决云环境中生成式 AI (Generative AI) 系统在安全性和取证严谨性方面的双重挑战:一方面,大型语言模型(LLMs)易受提示注入攻击,导致推理结果不可靠;另一方面,云取证流程缺乏标准化和自动化手段,难以高效应对复杂威胁。解决方案的关键在于提出一个统一的、以安全为设计核心的生成式 AI 框架,集成 PromptShield 和 Cloud Investigation Automation Framework (CIAF):PromptShield 通过本体驱动的输入验证机制实现对恶意提示的主动防御,提升 LLM 在攻击下的分类性能(精度、召回率和 F1 分数均超 93%);CIAF 则基于本体的结构化推理覆盖取证全流程,显著增强云日志中勒索软件检测的准确性,从而实现云取证与生成式 AI 系统在自动化、可解释性和可信度上的协同提升。
链接: https://arxiv.org/abs/2604.03912
作者: Dalal Alharthi,Ivan Roberto Kawaminami Garcia
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: arXiv admin note: substantial text overlap with arXiv:2510.00452
Abstract:As cloud environments become increasingly complex, cybersecurity and forensic investigations must evolve to meet emerging threats. Large Language Models (LLMs) have shown promise in automating log analysis and reasoning tasks, yet they remain vulnerable to prompt injection attacks and lack forensic rigor. To address these dual challenges, we propose a unified, secure-by-design GenAI framework that integrates PromptShield and the Cloud Investigation Automation Framework (CIAF). PromptShield proactively defends LLMs against adversarial prompts using ontology-driven validation that standardizes user inputs and mitigates manipulation. CIAF streamlines cloud forensic investigations through structured, ontology-based reasoning across all six phases of the forensic process. We evaluate our system on real-world datasets from AWS and Microsoft Azure, demonstrating substantial improvements in both LLM security and forensic accuracy. Experimental results show PromptShield boosts classification performance under attack conditions, achieving precision, recall, and F1 scores above 93%, while CIAF enhances ransomware detection accuracy in cloud logs using Likert-transformed performance features. Our integrated framework advances the automation, interpretability, and trustworthiness of cloud forensics and LLM-based systems, offering a scalable foundation for real-time, AI-driven incident response across diverse cloud infrastructures.
[AI-88] LLM-Agent-based Social Simulation for Attitude Diffusion
【速读】:该论文旨在解决社会科学研究中如何动态模拟公众对移民态度变化的问题,尤其关注重大现实事件(如抗议、政策辩论)如何影响舆论极化与信念演化。传统基于规则的代理模型难以处理自然语言表达和实时事件输入,而本研究提出的关键解决方案是构建一个开源框架discourse_simulator,其核心在于将大语言模型(LLM)与基于代理的建模(Agent-Based Modeling, ABM)深度融合:利用LLM生成社交媒体内容、解析观点并模拟信息传播机制,同时嵌入小世界网络拓扑结构与实时新闻检索系统,从而实现对社会态度演变过程的可解释性理论检验,而非黑箱预测。
链接: https://arxiv.org/abs/2604.03898
作者: Deepak John Reji
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation (stat.CO)
备注:
Abstract:This paper introduces discourse_simulator, an open-source framework that combines LLMs with agent-based modelling. It offers a new way to simulate how public attitudes toward immigration change over time in response to salient events like protests, controversies, or policy debates. Large language models (LLMs) are used to generate social media posts, interpret opinions, and model how ideas spread through social networks. Unlike traditional agent-based models that rely on fixed, rule-based opinion updates and cannot generate natural language or consider current events, this approach integrates multidimensional sociological belief structures and real-world event timelines. This framework is wrapped into an open-source Python package that integrates generative agents into a small-world network topology and a live news retrieval system. discourse_sim is purpose-built as a social science research instrument specifically for studying attitude dynamics, polarisation, and belief evolution following real-world critical events. Unlike other LLM Agent Swarm frameworks, which treat the simulations as a prediction black box, discourse_sim treats it as a theory-testing instrument, which is fundamentally a different epistemological stance for studying social science problems. The paper further demonstrates the framework by modelling the Dublin anti-immigration march on April 26, 2025, with N=100 agents over a 15-day simulation. Package link: this https URL
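摘要中"小世界网络拓扑 + 态度扩散"的骨架可用纯 Python 粗略示意(在 discourse_simulator 中,态度更新由 LLM 代理生成与解读帖子完成;此处用朴素的邻居均值更新代替,所有参数与函数名均为笔者假设):

```python
import random

def small_world(n, k, p, seed=0):
    # Watts-Strogatz 小世界网络:环形近邻连边,每条边以概率 p 随机重连
    rng = random.Random(seed)
    edges = set()
    for i in range(n):
        for j in range(1, k // 2 + 1):
            a, b = i, (i + j) % n
            if rng.random() < p:
                b = rng.randrange(n)
            if a != b:
                edges.add((min(a, b), max(a, b)))
    return edges

def diffuse(attitudes, edges, rate=0.2, steps=10):
    # 朴素态度扩散:每步向邻居均值靠拢(同步更新)
    for _ in range(steps):
        s = [0.0] * len(attitudes)
        c = [0] * len(attitudes)
        for a, b in edges:
            s[a] += attitudes[b]; c[a] += 1
            s[b] += attitudes[a]; c[b] += 1
        attitudes = [x + rate * (si / ci - x) if ci else x
                     for x, si, ci in zip(attitudes, s, c)]
    return attitudes

# 10 个代理:一半支持(+1.0)、一半反对(-1.0)
att = diffuse([1.0] * 5 + [-1.0] * 5, small_world(10, 4, 0.1))
print(all(-1.0 <= v <= 1.0 for v in att))   # True:每步更新都是凸组合,态度始终有界
```

真实框架把这一步换成语言层面的交互,并叠加实时新闻事件作为外生冲击,但网络结构与迭代扩散的骨架是一致的。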
[AI-89] Latency-Aware Resource Allocation over Heterogeneous Networks: A Lorentz-Invariant Market Mechanism
【速读】:该论文旨在解决异构延迟网络(如低地球轨道(LEO)卫星星座与延迟容忍的深空中继)中带宽和时隙分配的资源优化问题,尤其关注如何在存在显著传播延迟差异的情况下实现公平、高效且激励相容的拍卖机制。其解决方案的关键在于提出一种洛伦兹不变拍卖(Lorentz-Invariant Auction, LIA),该机制将投标视为时空事件,并基于“时域松弛(horizon slack)”——一个由最早到达时间相对于公共清算时域推导出的因果量——对报价进行重加权。LIA通过因果排序建模、由半群风格不变性公理隐含的独特指数型松弛修正,以及在松弛值固定后的临界值实施策略,确保了机制的个体理性与近似效率(至少达到最优可行分配的 e^{−λΔ},其中 Δ 为松弛范围),从而提供了一种无需缓冲同步即可实现延迟均衡的实用替代方案。
链接: https://arxiv.org/abs/2604.03897
作者: Saad Alqithami
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注:
Abstract:We present a telecom-native auction mechanism for allocating bandwidth and time slots across heterogeneous-delay networks, ranging from low-Earth-orbit (LEO) satellite constellations to delay-tolerant deep-space relays. The Lorentz-Invariant Auction (LIA) treats bids as spacetime events and reweights reported values based on the *horizon slack*, a causal quantity derived from the earliest-arrival times relative to a public clearing horizon. Unlike other delay-equalization rules, LIA combines a causal-ordering formulation, a uniquely exponential slack correction implied by a semigroup-style invariance axiom, and a critical-value implementation that ensures truthful reported values once slacks are fixed by trusted infrastructure. We analyze the incentive result in the exogenous-slack regime and separately examine bounded slack-estimation error and endogenous-delay limitations. Under fixed feasible slacks, LIA is individually rational and achieves welfare at least e^{−λΔ} relative to the optimal feasible allocation, where Δ is the slack spread. We evaluate LIA on STARLINK-200, INTERNET-100, and DSN-30 across 52,500 baseline instances with market sizes n ∈ {10, 20, 30, 40, 50} and conduct additional robustness sweeps. On Starlink and Internet, LIA maintains near-efficiency while eliminating measured timing rents. However, on DSN, welfare is lower in thin markets but improves with depth. We also distinguish winner-determination time from the background cost of maintaining slack estimates and study robustness beyond independent and identically distributed (iid) noise through error-spread bounds and structured (distance-biased and subnetwork-correlated) noise models. These results suggest that causal-consistent mechanism design offers a practical non-buffering alternative to synchronized delay equalization in heterogeneous telecom infrastructures.
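LIA 的核心操作(按时域松弛对报价做指数重加权)与其福利下界可用几行 Python 示意(λ 与示例数值均为笔者假设,非论文参数):

```python
import math

def reweight_bids(bids, lam):
    """bids: [(报价, horizon slack), ...]。按摘要思路以指数修正
    e^{-λ·slack} 对报价重加权(示意实现,非论文官方代码)。"""
    return [v * math.exp(-lam * s) for v, s in bids]

def welfare_lower_bound(lam, delta):
    # 福利下界:至少达到最优可行分配的 e^{-λΔ},Δ 为松弛范围
    return math.exp(-lam * delta)

bids = [(10.0, 0.0), (12.0, 2.0)]   # 零松弛的低报价 vs 大松弛的高报价
print(reweight_bids(bids, lam=0.5))              # 第二个报价被折减到约 4.41
print(round(welfare_lower_bound(0.5, 2.0), 3))   # 0.368
```

可见指数修正消除了远端竞标者凭借时延差异获取的"timing rent":松弛越大的报价,其有效价值被压得越低。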
[AI-90] FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理科学符号系统时缺乏对全局结构逻辑和物理约束建模能力的问题,尤其是在理论物理中广泛使用的费曼图(Feynman diagram)任务中表现不足。解决方案的关键在于构建首个专注于费曼图推理的基准测试集——FeynmanBench,其核心创新包括:设计涵盖标准模型中电磁、弱和强相互作用的多样化费曼图数据集(超过2000个任务),提供可验证的拓扑标注与散射振幅结果,并建立自动化评估流水线以支持大规模、可复现的实验分析。该基准严格要求模型满足守恒律、对称性约束、图拓扑识别及图-代数表示转换等多步推理能力,从而系统揭示现有MLLMs在物理约束遵守和全局结构理解方面的系统性缺陷,推动AI向具备物理先验知识的科学发现能力演进。
链接: https://arxiv.org/abs/2604.03893
作者: Zeyu Wang,Xiaogang Li,Peiyao Xiao,Qinhao Kong,Ben Wang,Chengliang Xu,Zichao Chen,Bing Zhao,Hu Wei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures
Abstract:Breakthroughs in frontier theory often depend on the combination of concrete diagrammatic notations with rigorous logic. While multimodal large language models (MLLMs) show promise in general scientific tasks, current benchmarks often focus on local information extraction rather than the global structural logic inherent in formal scientific notations. In this work, we introduce FeynmanBench, the first benchmark centered on Feynman diagram tasks. It is designed to evaluate AI’s capacity for multistep diagrammatic reasoning, which requires satisfying conservation laws and symmetry constraints, identifying graph topology, converting between diagrammatic and algebraic representations, and constructing scattering amplitudes under specific conventions and gauges. To support large-scale and reproducible evaluation, we developed an automated pipeline producing diverse Feynman diagrams along with verifiable topological annotations and amplitude results. Our database spans the electromagnetic, weak, and strong interactions of the Standard Model, encompasses over 100 distinct types and includes more than 2000 tasks. Experiments on state-of-the-art MLLMs reveal systematic failure modes, including unstable enforcement of physical constraints and violations of global topological conditions, highlighting the need for physics-grounded benchmarks for visual reasoning over scientific notation. FeynmanBench provides a logically rigorous test of whether AI can effectively engage in scientific discovery, particularly within theoretical physics.
[AI-91] Regime-Calibrated Demand Priors for Ride-Hailing Fleet Dispatch and Repositioning ATC
【速读】:该论文旨在解决网约车调度中因需求模式随时间、季节及特殊事件显著变化而导致的效率低下问题,核心挑战在于如何精准预测并适应动态需求分布以优化乘客等待时间和车队资源配置。解决方案的关键在于提出一种制度校准(regime-calibrated)方法:首先将历史行程数据按需求特征划分为多个“需求制度”(demand regimes),随后通过融合多种距离度量(包括Kolmogorov-Smirnov距离、Wasserstein-1距离、特征距离、方差比、事件模式相似性与时间邻近性)构建的相似性集成模型,匹配当前运营周期至最相似的历史类比情境,进而利用校准后的需求先验信息驱动基于线性规划(LP)的车队再调度策略和匈牙利算法批量派单。该方法无需训练、确定性强且可解释,并在纽约市520万次行程的8种场景下实现平均等待时间降低31.1%,且在芝加哥等其他城市具有强泛化能力。
链接: https://arxiv.org/abs/2604.03883
作者: Indar Kumar,Akanksha Tiwari
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 10 pages, 10 figures, 8 tables. Code: this https URL
Abstract:Effective ride-hailing dispatch requires anticipating demand patterns that vary substantially across time-of-day, day-of-week, season, and special events. We propose a regime-calibrated approach that (i) segments historical trip data into demand regimes, (ii) matches the current operating period to the most similar historical analogues via a similarity ensemble combining Kolmogorov-Smirnov distance, Wasserstein-1 distance, feature distance, variance ratio, event pattern similarity, and temporal proximity, and (iii) uses the resulting calibrated demand prior to drive both an LP-based fleet repositioning policy and batch dispatch with Hungarian matching. In ablation, a distributional-only metric subset achieves the strongest mean-wait reduction, while the full ensemble is retained as a robustness-oriented default that preserves calendar and event context. Evaluated on 5.2 million NYC TLC trips across 8 diverse scenarios (winter/summer, weekday/weekend/holiday, morning/evening/night) with 5 random seeds each, our method reduces mean rider wait times by 31.1% (bootstrap 95% CI: [26.5, 36.6]; Friedman chi-squared = 80.0, p = 4.25e-18; Cohen’s d = 7.5-29.9). P95 wait drops 37.6% and the Gini coefficient of wait times improves from 0.441 to 0.409. The two contributions compose multiplicatively: calibration provides 16.9% reduction relative to the replay baseline; LP repositioning adds a further 15.5%. The approach requires no training, is deterministic and explainable, generalizes to Chicago (23.3% wait reduction using the NYC-built regime library without retraining), and is robust across fleet sizes (32-47% improvement for 0.5x-2.0x fleet scaling). Code is available at this https URL.
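上文摘要中的"批量派单 + 匈牙利匹配"思想,可以用一个极简的纯 Python 草图说明:在司机与乘客的等待成本矩阵上求总成本最小的一一匹配。示例中的成本数据与函数名均为假设,小规模下用全排列暴力求解仅为示意,实际系统应使用 O(n^3) 的匈牙利算法(如 SciPy 的 linear_sum_assignment):

```python
from itertools import permutations

def batch_dispatch(wait_cost):
    """在 driver x rider 的等待成本矩阵上求最小总成本的一一匹配。
    小规模下用全排列暴力求解, 仅为示意批量派单的优化目标。"""
    n = len(wait_cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(wait_cost[d][perm[d]] for d in range(n))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_perm, best_cost

# 3 个司机到 3 个乘客的预计接驾等待时间(分钟, 假设数据)
cost = [[4, 2, 8],
        [4, 3, 7],
        [3, 1, 6]]
assignment, total = batch_dispatch(cost)
# assignment[d] 表示司机 d 被派给的乘客编号
```

需求先验在此处的作用是改变成本矩阵本身(例如按预测需求对空驶成本加权),匹配算法保持不变。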
[AI-92] k-Maximum Inner Product Attention for Graph Transformers and the Expressive Power of GraphGPS ICLR2026
【速读】:该论文旨在解决图变压器(Graph Transformer)在处理大规模图数据时面临的计算和内存复杂度问题,尤其是传统全连接注意力机制导致的二次方复杂度瓶颈。其解决方案的关键在于提出一种新型的 k-Maximum Inner Product (k-MIP) 注意力机制:通过在每个查询节点中选择最相关的 k 个键节点进行注意力计算,实现稀疏但灵活的注意力模式;同时结合基于符号矩阵的注意力分数计算方式,使内存复杂度降至线性级别,并在单张 A100 GPU 上支持超过 500k 节点的大规模图处理。理论分析进一步表明,k-MIP 注意力不会削弱图变压器的表达能力——可逼近任意全注意力变压器至任意精度,且与 GraphGPS 框架集成后仍保持良好的图区分能力上限(以 S-SEG-WL 测试为准)。
链接: https://arxiv.org/abs/2604.03815
作者: Jonas De Schouwer,Haitz Sáez de Ocáriz Borde,Xiaowen Dong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the ICLR 2026 GRaM Workshop. 9 pages, 9 figures, 16 tables; 30 pages of supplementary material
Abstract:Graph transformers have shown promise in overcoming limitations of traditional graph neural networks, such as oversquashing and difficulties in modelling long-range dependencies. However, their application to large-scale graphs is hindered by the quadratic memory and computational complexity of the all-to-all attention mechanism. Although alternatives such as linearized attention and restricted attention patterns have been proposed, these often degrade performance or limit expressive power. To better balance efficiency and effectiveness, we introduce k-Maximum Inner Product (k-MIP) attention for graph transformers. k-MIP attention selects the most relevant key nodes per query via a top-k operation, yielding a sparse yet flexible attention pattern. Combined with an attention score computation based on symbolic matrices, this results in linear memory complexity and practical speedups of up to an order of magnitude compared to all-to-all attention, enabling the processing of graphs with over 500k nodes on a single A100 GPU. We provide a theoretical analysis of expressive power, showing that k-MIP attention does not compromise the expressiveness of graph transformers: specifically, we prove that k-MIP transformers can approximate any full-attention transformer to arbitrary precision. In addition, we analyze the expressive power of the GraphGPS framework, in which we integrate our attention mechanism, and establish an upper bound on its graph distinguishing capability in terms of the S-SEG-WL test. Finally, we validate our approach on the Long Range Graph Benchmark, the City-Networks benchmark, and two custom large-scale inductive point cloud datasets, consistently ranking among the top-performing scalable graph transformers.
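摘要中 k-MIP 注意力的核心操作是: 每个查询节点只在与其内积最大的 k 个键节点上做 softmax 加权, 其余位置注意力为零。下面用纯 Python 给出单个查询的草图(向量维度与数值均为假设, 真实实现依赖批量化的符号矩阵计算, 此处不涉及):

```python
import math

def k_mip_attention(query, keys, values, k):
    """对单个 query: 取与其内积最大的 k 个 key, 仅在这 k 个位置上
    做 softmax 加权求和, 得到稀疏但灵活的注意力输出。"""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    topk = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    exps = {i: math.exp(scores[i]) for i in topk}
    z = sum(exps.values())
    out = [0.0] * len(values[0])
    for i in topk:
        w = exps[i] / z
        for d in range(len(out)):
            out[d] += w * values[i][d]
    return out, sorted(topk)

q = [1.0, 0.0]
K = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, 1.0]]
V = [[1.0], [2.0], [3.0], [4.0]]
out, selected = k_mip_attention(q, K, V, k=2)
```

与全注意力相比, 每个查询只保留 k 条边, 因此注意力矩阵的非零元数量从 O(n^2) 降为 O(nk), 这是线性内存复杂度的来源。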
[AI-93] Automated Conjecture Resolution with Formal Verification
【速读】:该论文旨在解决研究级数学问题中自然语言推理的不确定性与形式化验证之间的鸿沟问题,即如何在无需大量人工干预的情况下实现从非形式化推理到机器可验证证明的端到端自动化。其解决方案的关键在于构建一个双代理框架:一是模仿人类数学家工作流程的非形式化推理代理 Rethlas,通过结合推理原语与 theorem search 引擎 Matlas 探索解题策略并生成候选证明;二是具备形式化验证能力的代理 Archon,利用 LeanSearch 引擎将非形式化论证转化为 Lean 4 中的可验证项目,通过结构化任务分解、迭代精炼和自动证明合成确保逻辑严密性。该框架成功自动解决了交换代数中的一个开放问题,并在 Lean 4 中完成形式化验证,体现了非形式化与形式化推理系统协同工作的潜力,为人类-人工智能协作式数学研究提供了可落地的范式。
链接: https://arxiv.org/abs/2604.03789
作者: Haocheng Ju,Guoxiong Gao,Jiedong Jiang,Bin Wu,Zeming Sun,Leheng Chen,Yutong Wang,Yuefeng Wang,Zichen Wang,Wanyi He,Peihao Wu,Liang Xiao,Ruochuan Liu,Bryan Dai,Bin Dong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Code and resources are available at: Rethlas ( this https URL ), Archon ( this https URL ), and the formalization results ( this https URL )
Abstract:Recent advances in large language models have significantly improved their ability to perform mathematical reasoning, extending from elementary problem solving to increasingly capable performance on research-level problems. However, reliably solving and verifying such problems remains challenging due to the inherent ambiguity of natural language reasoning. In this paper, we propose an automated framework for tackling research-level mathematical problems that integrates natural language reasoning with formal verification, enabling end-to-end problem solving with minimal human intervention. Our framework consists of two components: an informal reasoning agent, Rethlas, and a formal verification agent, Archon. Rethlas mimics the workflow of human mathematicians by combining reasoning primitives with our theorem search engine, Matlas, to explore solution strategies and construct candidate proofs. Archon, equipped with our formal theorem search engine LeanSearch, translates informal arguments into formalized Lean 4 projects through structured task decomposition, iterative refinement, and automated proof synthesis, ensuring machine-checkable correctness. Using this framework, we automatically resolve an open problem in commutative algebra and formally verify the resulting proof in Lean 4 with essentially no human involvement. Our experiments demonstrate that strong theorem retrieval tools enable the discovery and application of cross-domain mathematical techniques, while the formal agent is capable of autonomously filling nontrivial gaps in informal arguments. More broadly, our work illustrates a promising paradigm for mathematical research in which informal and formal reasoning systems, equipped with theorem retrieval tools, operate in tandem to produce verifiable results, substantially reduce human effort, and offer a concrete instantiation of human-AI collaborative mathematical research.
[AI-94] CountsDiff: A Diffusion Model on the Natural Numbers for Generation and Imputation of Count-Based Data
【速读】:该论文旨在解决扩散模型在离散有序数据(discrete ordinal data)领域应用不足的问题,特别是针对自然数域上的分布建模。现有扩散模型主要适用于连续或基于token的域,而对计数类数据(如生物测序中的基因表达计数)缺乏有效的建模方法。解决方案的关键在于提出CountsDiff,一个原生面向自然数域的扩散框架:通过将黑化扩散(Blackout diffusion)重新参数化为生存概率调度(survival probability schedule)和显式损失加权机制,引入了设计灵活性;同时整合了连续时间训练、无分类器引导(classifier-free guidance)以及 churn/remasking 反向动态等现代扩散模型特性,从而支持非单调反向轨迹,提升了对复杂离散数据的建模能力。
链接: https://arxiv.org/abs/2604.03779
作者: Renzo G. Soatto,Anders Hoel,Greycen Ren,Shorna Alam,Stephen Bates,Nikolaos P. Daskalakis,Caroline Uhler,Maria Skoularidou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 36 Pages, 11 figures. In review
Abstract:Diffusion models have excelled at generative tasks for both continuous and token-based domains, but their application to discrete ordinal data remains underdeveloped. We present CountsDiff, a diffusion framework designed to natively model distributions on the natural numbers. CountsDiff extends the Blackout diffusion framework by simplifying its formulation through a direct parameterization in terms of a survival probability schedule and an explicit loss weighting. This introduces flexibility through design parameters with direct analogues in existing diffusion modeling frameworks. Beyond this reparameterization, CountsDiff introduces features from modern diffusion models, previously absent in counts-based domains, including continuous-time training, classifier-free guidance, and churn/remasking reverse dynamics that allow non-monotone reverse trajectories. We propose an initial instantiation of CountsDiff and validate it on natural image datasets (CIFAR-10, CelebA), exploring the effects of varying the introduced design parameters in a complex, well-studied, and interpretable data domain. We then highlight biological count assays as a natural use case, evaluating CountsDiff on single-cell RNA-seq imputation in a fetal cell and heart cell atlas. Remarkably, we find that even this simple instantiation matches or surpasses the performance of a state-of-the-art discrete generative model and leading RNA-seq imputation methods, while leaving substantial headroom for further gains through optimized design choices in future work.
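Blackout 扩散的前向过程可以理解为计数的"纯死亡过程": 每个计数单元以存活概率调度 s(t) 逐步熄灭, 即逐步的二项稀疏化。下面给出一个按存活概率调度做前向加噪的纯 Python 草图; 其中线性调度形式只是假设, 论文的要点正是这一调度(及损失加权)是可设计的超参数:

```python
import random

def survival_schedule(t, T):
    """线性存活概率调度: s(0)=1(无噪声), s(T)=0(全部"熄灭")。"""
    return 1.0 - t / T

def forward_blackout(x0, T, rng):
    """纯死亡过程: 每一步对当前计数按条件存活概率做二项稀疏化,
    得到一条从 x0 单调不增地衰减到 0 的轨迹。"""
    traj, x = [x0], x0
    for t in range(1, T + 1):
        # 条件存活概率 s(t)/s(t-1), 使边缘分布恰为 Binomial(x0, s(t))
        p = survival_schedule(t, T) / survival_schedule(t - 1, T)
        x = sum(1 for _ in range(x) if rng.random() < p)
        traj.append(x)
    return traj

rng = random.Random(0)
traj = forward_blackout(x0=50, T=10, rng=rng)
```

反向过程则需学习在每一步"复活"多少计数, 论文引入的 churn/remasking 反向动态允许轨迹非单调。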
[AI-95] RL-Driven Sustainable Land-Use Allocation for the Lake Malawi Basin
【速读】:该论文旨在解决生态敏感区域因不可持续土地利用实践而导致的生物多样性丧失、水资源威胁及数百万居民生计受损的问题,核心目标是优化土地利用分配以最大化生态系统服务价值(ESV)。解决方案的关键在于构建一个基于深度强化学习(Deep Reinforcement Learning, DRL)的框架,其中采用近端策略优化(Proximal Policy Optimization, PPO)算法,在50×50栅格(分辨率500米)环境中通过动作掩码机制对九类土地覆盖类型进行动态调整;同时设计包含单元生态价值与空间连通性目标的奖励函数,引入邻近水体缓冲区惩罚和生态连通性奖励,从而引导土地利用配置向生态合理模式演化。该方法在三种情景下验证有效,表明其具备响应政策参数变化的能力,可作为环境规划中的情景分析工具。
链接: https://arxiv.org/abs/2604.03768
作者: Ying Yao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 pages, 5 figures
Abstract:Unsustainable land-use practices in ecologically sensitive regions threaten biodiversity, water resources, and the livelihoods of millions. This paper presents a deep reinforcement learning (RL) framework for optimizing land-use allocation in the Lake Malawi Basin to maximize total ecosystem service value (ESV). Drawing on the benefit transfer methodology of Costanza et al., we assign biome-specific ESV coefficients – locally anchored to a Malawi wetland valuation – to nine land-cover classes derived from Sentinel-2 imagery. The RL environment models a 50x50 cell grid at 500m resolution, where a Proximal Policy Optimization (PPO) agent with action masking iteratively transfers land-use pixels between modifiable classes. The reward function combines per-cell ecological value with spatial coherence objectives: contiguity bonuses for ecologically connected land-use patches (forest, cropland, built area etc.) and buffer zone penalties for high-impact development adjacent to water bodies. We evaluate the framework across three scenarios: (i) pure ESV maximization, (ii) ESV with spatial reward shaping, and (iii) a regenerative agriculture policy scenario. Results demonstrate that the agent effectively learns to increase total ESV; that spatial reward shaping successfully steers allocations toward ecologically sound patterns, including homogeneous land-use clustering and slight forest consolidation near water bodies; and that the framework responds meaningfully to policy parameter changes, establishing its utility as a scenario-analysis tool for environmental planning.
[AI-96] Automated Attention Pattern Discovery at Scale in Large Language Models
【速读】:该论文旨在解决当前机制可解释性方法在大规模语言模型(Large Language Models, LLMs)中难以扩展的问题,即现有方法通常局限于特定行为的精确解释,缺乏泛化能力且资源消耗高。其核心解决方案是利用代码数据集中的结构化特性,通过挖掘Java代码完成场景来收集注意力模式(attention patterns),并提出Attention Pattern - Masked Autoencoder (AP-MAE)——一种基于视觉Transformer的模型,用于高效重建掩码后的注意力模式。关键创新在于证明了注意力模式是一种可扩展的全局可解释信号,并展示了AP-MAE在跨模型泛化、重复行为识别、生成正确性预测及针对性干预等方面的有效性,从而为大规模语言模型的分析与干预提供了可迁移的基础框架。
链接: https://arxiv.org/abs/2604.03764
作者: Jonathan Katzy,Razvan-Mihai Popescu,Erik Mekkes,Arie van Deursen,Maliheh Izadi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to TMLR
Abstract:Large language models have found success by scaling up capabilities to work in general settings. The same can unfortunately not be said for interpretability methods. The current trend in mechanistic interpretability is to provide precise explanations of specific behaviors in controlled settings. These often do not generalize, or are too resource intensive for larger studies. In this work we propose to study repeated behaviors in large language models by mining completion scenarios in Java code datasets, through exploiting the structured nature of code. We collect the attention patterns generated in the attention heads to demonstrate that they are scalable signals for global interpretability of model components. We show that vision models offer a promising direction for analyzing attention patterns at scale. To demonstrate this, we introduce the Attention Pattern - Masked Autoencoder(AP-MAE), a vision transformer-based model that efficiently reconstructs masked attention patterns. Experiments on StarCoder2 show that AP-MAE (i) reconstructs masked attention patterns with high accuracy, (ii) generalizes across unseen models with minimal degradation, (iii) reveals recurring patterns across inferences, (iv) predicts whether a generation will be correct without access to ground truth, with accuracies ranging from 55% to 70% depending on the task, and (v) enables targeted interventions that increase accuracy by 13.6% when applied selectively, but cause collapse when applied excessively. These results establish attention patterns as a scalable signal for interpretability and demonstrate that AP-MAE provides a transferable foundation for both analysis and intervention in large language models. Beyond its standalone value, AP-MAE also serves as a selection procedure to guide fine-grained mechanistic approaches. We release code and models to support future work in large-scale interpretability.
[AI-97] Build on Priors: Vision–Language–Guided Neuro-Symbolic Imitation Learning for Data-Efficient Real-World Robot Manipulation
【速读】:该论文旨在解决机器人从少量示范中学习长时程操作任务的难题,尤其针对现有神经符号方法依赖手工设计的符号抽象、语义标注轨迹或大规模演示数据集而导致可扩展性与实际应用受限的问题。其解决方案的关键在于构建一个端到端的可扩展神经符号框架:首先通过视觉语言模型(Vision-Language Model, VLM)自动识别技能片段并分类,进而建立状态转移图;再利用答案集编程(Answer Set Programming, ASP)求解器自动生成PDDL规划域;最后基于oracle函数提取每个技能策略所需的最小、任务相关且目标相对的观测与动作空间,从而在控制参考层级而非原始执行器信号层级上训练策略,提升学习目标的平滑性和鲁棒性。该方法仅需1–30个未标注技能示范即可完成符号规划域的自动化构造与高效控制策略学习,显著降低对专家标注和大量数据的依赖,同时支持真实世界的数据增强与跨平台泛化。
链接: https://arxiv.org/abs/2604.03759
作者: Pierrick Lorang,Johannes Huemer,Timothy Duggan,Kai Goebel,Patrik Zips,Matthias Scheutz
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Enabling robots to learn long-horizon manipulation tasks from a handful of demonstrations remains a central challenge in robotics. Existing neuro-symbolic approaches often rely on hand-crafted symbolic abstractions, semantically labeled trajectories or large demonstration datasets, limiting their scalability and real-world applicability. We present a scalable neuro-symbolic framework that autonomously constructs symbolic planning domains and data-efficient control policies from as few as one to thirty unannotated skill demonstrations, without requiring manual domain engineering. Our method segments demonstrations into skills and employs a Vision-Language Model (VLM) to classify skills and identify equivalent high-level states, enabling automatic construction of a state-transition graph. This graph is processed by an Answer Set Programming solver to synthesize a PDDL planning domain, which an oracle function exploits to isolate the minimal, task-relevant and target relative observation and action spaces for each skill policy. Policies are learned at the control reference level rather than at the raw actuator signal level, yielding a smoother and less noisy learning target. Known controllers can be leveraged for real-world data augmentation by projecting a single demonstration onto other objects in the scene, simultaneously enriching the graph construction process and the dataset for imitation learning. We validate our framework primarily on a real industrial forklift across statistically rigorous manipulation trials, and demonstrate cross-platform generality on a Kinova Gen3 robotic arm across two standard benchmarks. Our results show that grounding control learning, VLM-driven abstraction, and automated planning synthesis into a unified pipeline constitutes a practical path toward scalable, data-efficient, expert-free and interpretable neuro-symbolic robotics.
[AI-98] AutoReSpec: A Framework for Generating Specification using Large Language Models
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在生成形式化规范(formal specification)时存在的验证失败问题,如语法错误、逻辑不一致或推理不完整,尤其在包含循环或分支结构的程序中表现不佳。其解决方案的关键在于提出一种名为AutoReSpec的协作框架,该框架通过动态选择开放源代码与闭源LLM组合及提示配置来适配不同程序结构,并采用两阶段设计:若主模型输出无效,则调用协作模型并利用验证器反馈进行修正,从而实现高效性与鲁棒性的平衡。
链接: https://arxiv.org/abs/2604.03758
作者: Ragib Shahariar Ayon,Shibbir Ahmed
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 10 pages
Abstract:Formal specification generation has recently drawn attention in software engineering as a way to improve program correctness without requiring manual annotations. Large Language Models (LLMs) have shown promise in this area, but early results reveal several limitations. Generated specifications often fail verification due to syntax errors, logical inaccuracies, or incomplete reasoning, especially in programs with loops or branching logic. Techniques like SpecGen and FormalBench attempt to address this through prompting and benchmarking, but they typically rely on static prompts and do not offer mechanisms for recovering from failure or adapting to different program structures. In this paper, we present AutoReSpec, a collaborative framework that combines open and closed-source LLMs for verifiable specification generation. AutoReSpec dynamically chooses an LLM pair and prompt configuration based on the structure of the input program. If the primary LLM fails to produce a valid output, a collaborative model is invoked, using validator feedback to refine and correct the specification. This two-stage design enables both speed and robustness. We evaluate AutoReSpec on a new benchmark of 72 real-world and synthetic Java programs. Our results show that it achieves 67 passes out of 72, outperforming SpecGen and FormalBench in both Success Probability and Completeness. Our experimental evaluation achieves a 58.2% success probability and a 69.2% completeness score, while cutting evaluation time by 26.89% on average compared to prior methods. Together, these results demonstrate that AutoReSpec offers a scalable, efficient, and reliable approach to LLM-based formal specification generation.
[AI-99] Structured Multi-Criteria Evaluation of Large Language Models with Fuzzy Analytic Hierarchy Process and DualJudge
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)评估中因传统直接评分法导致的判断不一致和不可解释的问题。其解决方案的关键在于引入一种置信度感知的模糊层次分析法(confidence-aware Fuzzy AHP, FAHP),通过三角模糊数建模认知不确定性,并利用LLM生成的置信度分数进行调制,从而实现更校准的多维度评估与不确定性感知聚合。在此基础上,进一步提出基于双过程理论的混合框架DualJudge,自适应融合直觉式直接评分与结构化AHP输出,以一致性感知权重进行加权整合,显著提升了评估稳定性与准确性,验证了不确定性感知的结构化推理在提升LLM评估可靠性方面的有效性。
链接: https://arxiv.org/abs/2604.03742
作者: Yulong He,Ivan Smirnov,Dmitry Fedrushkov,Sergey Kovalchuk,Ilya Revin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Effective evaluation of large language models (LLMs) remains a critical bottleneck, as conventional direct scoring often yields inconsistent and opaque judgments. In this work, we adapt the Analytic Hierarchy Process (AHP) to LLM-based evaluation and, more importantly, propose a confidence-aware Fuzzy AHP (FAHP) extension that models epistemic uncertainty via triangular fuzzy numbers modulated by LLM-generated confidence scores. Systematically validated on JudgeBench, our structured approach decomposes assessments into explicit criteria and incorporates uncertainty-aware aggregation, producing more calibrated judgments. Extensive experiments demonstrate that both crisp and fuzzy AHP consistently outperform direct scoring across model scales and dataset splits, with FAHP showing superior stability in uncertain comparison scenarios. Building on these insights, we propose \textbfDualJudge, a hybrid framework inspired by Dual-Process Theory that adaptively fuses holistic direct scores with structured AHP outputs via consistency-aware weighting. DualJudge achieves state-of-the-art performance, underscoring the complementary strengths of intuitive and deliberative evaluation paradigms. These results establish uncertainty-aware structured reasoning as a principled pathway toward more reliable LLM assessment. Code is available at this https URL.
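摘要中"由 LLM 置信度调制的三角模糊数"可以用一个极小的纯 Python 草图说明其直觉: 置信度越低, 三角模糊数的支撑区间越宽, 去模糊化后的中心分值不变但不确定性被显式记录。注意下面的线性参数化方式是本文之外的假设, 仅作示意:

```python
def fuzzy_score(score, confidence, max_spread=1.0):
    """把 LLM 给出的分值与置信度编码为三角模糊数 (l, m, u):
    confidence=1 时退化为精确数, 置信度越低支撑区间越宽(假设的参数化)。"""
    spread = max_spread * (1.0 - confidence)
    return (score - spread, score, score + spread)

def centroid(tfn):
    """三角模糊数的质心去模糊化: (l + m + u) / 3。"""
    l, m, u = tfn
    return (l + m + u) / 3.0

confident = fuzzy_score(7.0, confidence=0.95)
uncertain = fuzzy_score(7.0, confidence=0.40)
```

在 FAHP 的成对比较矩阵中, 这些三角模糊数经模糊算术聚合后再去模糊化, 从而让低置信度的判断对最终权重的影响被"摊薄"。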
[AI-100] RDEx-CMOP: Feasibility-Aware Indicator-Guided Differential Evolution for Fixed-Budget Constrained Multiobjective Optimization
【速读】:该论文旨在解决约束多目标优化(Constrained Multiobjective Optimization, CMOP)问题中面临的挑战,即在严格的评估预算下实现快速可行性达成、稳定的收敛性以及多样性保持。解决方案的关键在于提出一种改进的差分进化算法——RDEx-CMOP,其核心创新包括:引入ε级可行性调度机制以动态调整可行域优先级;采用SPEA2风格的指标驱动适应度分配策略提升种群质量;设计基于适应度导向的current-to-pbest/1变异算子增强搜索效率与稳定性。实验表明,该方法在CEC 2025官方基准测试中取得了最高总分和最优平均排名,展现出优异的目标达成能力及接近零的最终约束违反水平。
链接: https://arxiv.org/abs/2604.03708
作者: Sichen Tao,Yifei Yang,Ruihan Zhao,Kaiyu Wang,Sicheng Liu,Shangce Gao
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:
Abstract:Constrained multiobjective optimisation requires fast feasibility attainment together with stable convergence and diversity preservation under strict evaluation budgets. This report documents RDEx-CMOP, the differential evolution variant used in the IEEE CEC 2025 numerical optimisation competition (C06 special session) constrained multiobjective track. RDEx-CMOP integrates an \epsilon-level feasibility schedule, a SPEA2-style indicator-driven fitness assignment, and a fitness-oriented current-to-pbest/1 mutation operator. We evaluate RDEx-CMOP on the official CEC 2025 CMOP benchmark using the median-target U-score framework and the released trace data. Experimental results show that RDEx-CMOP achieves the highest total score and the best overall average rank among all released comparison algorithms, with strong target-attainment behaviour and near-zero final violation on most problems.
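ε-level 可行性调度的核心是一条比较规则: 当两个解的约束违反量都不超过当前 ε(或相等)时按目标值比较, 否则按违反量比较; ε 随代数收紧以最终强制可行。下面是纯 Python 草图, 其中线性收紧的调度形式为假设(实践中也常用多项式或分段调度):

```python
def epsilon_schedule(gen, max_gen, eps0):
    """ε 随代数从 eps0 线性收紧到 0: 早期容忍轻微违反以利于搜索,
    后期强制可行。"""
    return eps0 * max(0.0, 1.0 - gen / max_gen)

def epsilon_better(a, b, eps):
    """a, b 为 (目标值, 约束违反量)。ε-level 比较:
    两者违反都 <= eps 或相等时比目标值, 否则比违反量。"""
    fa, va = a
    fb, vb = b
    if (va <= eps and vb <= eps) or va == vb:
        return fa < fb
    return va < vb

# 早期 (eps=0.5): 轻微违反但目标更优的解胜出
early = epsilon_better((1.0, 0.3), (2.0, 0.0), eps=0.5)
# 后期 (eps=0.0): 同样的两个解, 可行解胜出
late = epsilon_better((1.0, 0.3), (2.0, 0.0), eps=0.0)
```

多目标场景下"目标值比较"需换成基于指标(如 SPEA2 式适应度)的比较, 但 ε 的作用方式不变。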
[AI-101] PRAISE: Prefix-Based Rollout Reuse in Agentic Search Training
【速读】:该论文旨在解决代理式搜索(agentic search)中基于检索的强化学习(search-based Reinforcement Learning, RL)训练存在的两大核心问题:一是长时程轨迹(long-horizon rollouts)在训练过程中利用率低,二是仅在最终答案处提供监督信号,导致奖励稀疏性严重。解决方案的关键在于提出Prefix-based Rollout reuse for Agentic search with Intermediate Step rEwards(PRAISE)框架,其通过提取完整搜索轨迹中的前缀状态(prefix states),生成中间答案,并利用这些前缀构建额外的训练轨迹及计算步骤级奖励(step-level rewards),从而提升数据效率和信用分配精度。该方法使用单一共享模型同时完成搜索策略学习与前缀答案评估,实现端到端联合优化,无需额外人工标注或独立奖励模型。
链接: https://arxiv.org/abs/2604.03675
作者: Erhan Zhang,Yiqun Chen,Zechun Niu,Wei Yang,Xiaochi Wei,Yan Gao,Yi Wu,Yao Hu,Jiaxin Mao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In agentic search, large language models (LLMs) are trained to perform multi-turn retrieval and reasoning for complex tasks such as multi-hop question answering (QA). However, current search-based Reinforcement Learning (RL) methods suffer from two core limitations: expensive long-horizon rollouts are under-utilized during training, and supervision is typically available only at the final answer, resulting in severe reward sparsity. We present Prefix-based Rollout reuse for Agentic search with Intermediate Step rEwards (PRAISE), a framework for improving both data efficiency and credit assignment in agentic search training. Given a complete search trajectory, PRAISE extracts prefix states at different search turns, elicits intermediate answers from them, and uses these prefixes both to construct additional training trajectories and to derive step-level rewards from performance differences across prefixes. Our method uses a single shared model for both search policy learning and prefix answer evaluation, enabling joint optimization without extra human annotations or a separate reward model. Experiments on multi-hop QA benchmarks show that PRAISE consistently improves performance over strong baselines.
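PRAISE 从"前缀性能差分"推导步骤级奖励的思想, 可以用一个极简草图表示: 对每个前缀状态求得中间答案的得分, 相邻前缀得分之差即该步搜索的边际贡献。示例中的分数为假设数据(论文中由共享模型对前缀答案打分):

```python
def step_rewards(prefix_scores):
    """prefix_scores[i] 为截至第 i 个搜索轮次的前缀所得中间答案分数
    (prefix_scores[0] 为未检索时的基线分)。相邻差分作为步骤级奖励:
    某一步检索让答案变好则奖励为正, 变差则为负。"""
    return [b - a for a, b in zip(prefix_scores, prefix_scores[1:])]

# 4 轮搜索: 第 2 轮引入噪声文档使得分短暂下降(假设数据)
scores = [0.1, 0.4, 0.3, 0.7, 0.9]
rewards = step_rewards(scores)
```

差分奖励天然满足"望远镜求和": 各步奖励之和等于最终得分减基线分, 因此步骤级信号与最终答案监督保持一致。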
[AI-102] TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理具有层次结构的复杂表格时存在的推理性能瓶颈问题,尤其是由于视觉区域密集分布导致的“感知过载”(Perceptual Overload)现象,即模型难以维持准确的空间注意力以支持隐式生成任务。解决方案的关键在于提出TableVision——一个大规模、轨迹感知的基准数据集,通过渲染驱动的确定性空间定位流程,将多步逻辑推理与像素级精确的空间真值进行显式耦合,并构建了包含6,799条高保真推理轨迹的标注体系;同时设计了一个两阶段解耦框架,在测试集上实现了12.3%的整体准确率提升,从而显著恢复了MLLMs的空间感知与逻辑推理协同能力。
链接: https://arxiv.org/abs/2604.03660
作者: Xiaoyu Chen,Lu Dai,Hanqing Wang,Zhuoyu Li,Wenbin Dai,Yanzong Zheng,Zhenggang Xia,Junyong Lin,Hui Xiong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Structured tables are essential for conveying high-density information in professional domains such as finance, healthcare, and scientific research. Despite the progress in Multimodal Large Language Models (MLLMs), reasoning performance remains limited for complex tables with hierarchical layouts. In this paper, we identify a critical Perception Bottleneck through quantitative analysis. We find that as task complexity scales, the number of involved discrete visual regions increases disproportionately. This processing density leads to an internal “Perceptual Overload,” where MLLMs struggle to maintain accurate spatial attention during implicit generation. To address this bottleneck, we introduce TableVision, a large-scale, trajectory-aware benchmark designed for spatially grounded reasoning. TableVision stratifies tabular tasks into three cognitive levels (Perception, Reasoning, and Analysis) across 13 sub-categories. By utilizing a rendering-based deterministic grounding pipeline, the dataset explicitly couples multi-step logical deductions with pixel-perfect spatial ground truths, comprising 6,799 high-fidelity reasoning trajectories. Our empirical results, supported by diagnostic probing, demonstrate that explicit spatial constraints significantly recover the reasoning potential of MLLMs. Furthermore, our two-stage decoupled framework achieves a robust 12.3% overall accuracy improvement on the test set. TableVision provides a rigorous testbed and a fresh perspective on the synergy between perception and logic in document understanding.
[AI-103] Beyond Retrieval: Modeling Confidence Decay and Deterministic Agent ic Platforms in Generative Engine Optimization
【速读】:该论文旨在解决当前生成式 AI(Generative AI)在数字营销场景中因依赖检索增强生成(Retrieval-Augmented Generation, RAG)策略所引发的两大核心问题:一是概率性幻觉(probabilistic hallucinations),即模型输出内容与事实不符;二是“零点击”悖论(zero-click paradox),即用户无法建立对商业引擎的可持续信任。其解决方案的关键在于提出一种范式转变——从RAG驱动的不确定性路径转向确定性的多智能体意图路由(deterministic multi-agent intent routing)。具体而言,通过构建语义熵漂移(Semantic Entropy Drift, SED)模型量化LLM置信度随时间和上下文扰动的动态衰减,并引入同构归因回归(Isomorphic Attribution Regression, IAR)模型以严格的人工介入隔离机制惩罚幻觉;同时设计确定性代理交接协议(Deterministic Agent Handoff, DAH),将大语言模型(LLM)角色限定为意图路由器而非答案生成者,从而实现专业代理对垂直任务的精准调度与零幻觉执行。实证验证表明,该架构可显著降低知识图谱映射等复杂任务中的幻觉率至接近零水平。
链接: https://arxiv.org/abs/2604.03656
作者: XinYu Zhao,ChengYou Li,XiangBao Meng,Kai Zhang,XiaoDong Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Generative Engine Optimization (GEO) is rapidly reshaping digital marketing paradigms in the era of Large Language Models (LLMs). However, current GEO strategies predominantly rely on Retrieval-Augmented Generation (RAG), which inherently suffers from probabilistic hallucinations and the “zero-click” paradox, failing to establish sustainable commercial trust. In this paper, we systematically deconstruct the probabilistic flaws of existing RAG-based GEO and propose a paradigm shift towards deterministic multi-agent intent routing. First, we mathematically formulate Semantic Entropy Drift (SED) to model the dynamic decay of confidence curves in LLMs over continuous temporal and contextual perturbations. To rigorously quantify optimization value in black-box commercial engines, we introduce the Isomorphic Attribution Regression (IAR) model, leveraging a Multi-Agent System (MAS) probe with strict human-in-the-loop physical isolation to enforce hallucination penalties. Furthermore, we architect the Deterministic Agent Handoff (DAH) protocol, conceptualizing an Agentic Trust Brokerage (ATB) ecosystem where LLMs function solely as intent routers rather than final answer generators. We empirically validate this architecture using EasyNote, an industrial AI meeting minutes product by Yishu Technology. By routing the intent of “knowledge graph mapping on an infinite canvas” directly to its specialized proprietary agent via DAH, we demonstrate the reduction of vertical task hallucination rates to near zero. This work establishes a foundational theoretical framework for next-generation GEO and paves the way for a well-ordered, deterministic human-AI collaboration ecosystem.
[AI-104] Delayed Homomorphic Reinforcement Learning for Environments with Delayed Feedback
【速读】:该论文旨在解决现实世界系统中延迟反馈导致的马尔可夫决策过程(Markov Decision Process, MDP)假设失效问题,该问题会严重阻碍强化学习中的策略学习与控制性能。传统状态增强方法虽能缓解延迟影响,但会导致状态空间爆炸(state-space explosion),显著增加样本复杂度(sample complexity)。现有基于增强的基线方法要么主要减轻价值函数估计器(critic)的负担,要么对策略生成器(actor)和价值函数估计器采用非统一处理方式,缺乏系统性优化。本文提出延迟同态强化学习(Delayed Homomorphic Reinforcement Learning, DHRL),其核心在于利用MDP同态(MDP homomorphism)理论将信念等价的增强状态进行压缩,从而在不损失最优性的前提下构建抽象MDP,实现高效策略学习。理论分析给出了状态压缩边界与样本复杂度的保证,并设计了实用算法,在MuJoCo连续控制任务中验证了其优于当前主流增强基线方法,尤其在长延迟场景下表现突出。
链接: https://arxiv.org/abs/2604.03641
作者: Jongsoo Lee,Jangwon Kim,Soohee Han
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning in real-world systems is often accompanied by delayed feedback, which breaks the Markov assumption and impedes both learning and control. Canonical state augmentation approaches cause the state-space explosion, which introduces a severe sample-complexity burden. Despite recent progress, the state-of-the-art augmentation-based baselines remain incomplete: they either predominantly reduce the burden on the critic or adopt non-unified treatments for the actor and critic. To provide a structured and sample-efficient solution, we propose delayed homomorphic reinforcement learning (DHRL), a framework grounded in MDP homomorphisms that collapses belief-equivalent augmented states and enables efficient policy learning on the resulting abstract MDP without loss of optimality. We provide theoretical analyses of state-space compression bounds and sample complexity, and introduce a practical algorithm. Experiments on continuous control tasks in MuJoCo benchmark confirm that our algorithm outperforms strong augmentation-based baselines, particularly under long delays.
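延迟反馈下经典的状态增强做法是: 把最近 d 个尚未生效的动作拼入观测以恢复马尔可夫性, 代价是增强状态空间随延迟 d 指数膨胀(若动作有 |A| 种取值则放大 |A|^d 倍), 这正是 DHRL 试图用 MDP 同态压缩的对象。下面是该增强做法本身(非 DHRL)的纯 Python 草图:

```python
from collections import deque

class DelayedObsWrapper:
    """动作延迟 d 步生效的环境中, 增强状态 = (最近观测, 最近 d 个
    尚未生效的动作)。这是标准的状态增强基线做法的示意。"""
    def __init__(self, delay, init_action=0):
        self.pending = deque([init_action] * delay, maxlen=delay)

    def augment(self, obs):
        return (obs, tuple(self.pending))

    def step_action(self, action):
        """记录新提交的动作, 返回真正要在环境里执行的延迟动作。"""
        effective = self.pending[0]
        self.pending.append(action)  # maxlen 使最旧动作自动出队
        return effective

w = DelayedObsWrapper(delay=2)
s0 = w.augment(obs=0)       # 增强初始状态: (0, (0, 0))
a_exec = w.step_action(5)   # 现在提交动作 5, 实际执行的是 2 步前的动作 0
s1 = w.augment(obs=1)       # 新的增强状态: (1, (0, 5))
```

DHRL 的出发点是: 许多这样的增强状态在信念意义上等价, 通过 MDP 同态把它们折叠, 可在抽象 MDP 上学习而不损失最优性。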
[AI-105] Persistent Cross-Attempt State Optimization for Repository-Level Code Generation
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在仓库级代码生成任务中重复尝试效率低下、缺乏跨尝试知识积累与复用的问题。现有方法通常孤立优化每次生成尝试,未能保留或利用任务相关的状态信息,导致资源浪费且性能提升有限。其解决方案的关键在于提出 LiveCoder 框架,通过维护持久化的任务特定状态实现跨尝试的知识优化:该状态包含成功知识(success knowledge,捕获可复用的强信号)、失败知识(failure knowledge,记录失败结果及其诊断信号)以及历史最优仓库(historical-best repository,保存当前最佳成果以防止退化)。这一机制将多次生成过程转化为持续的知识驱动优化流程,显著提升了功能性得分、代码重用率并降低了计算成本。
链接: https://arxiv.org/abs/2604.03632
作者: Ruwei Pan,Jiangshuai Wang,Qisheng Zhang,Yueheng Zhu,Linhao Wu,Zixiong Yang,Yakun Zhang,Lu Zhang,Hongyu Zhang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have achieved substantial progress in repository-level code generation. However, solving the same repository-level task often requires multiple attempts, while existing methods still optimize each attempt in isolation and do not preserve or reuse task-specific state across attempts. In this paper, we propose LiveCoder, a novel framework for repository-level code generation based on cross-attempt knowledge optimization. LiveCoder maintains persistent task-specific state from prior attempts to guide subsequent generation. This state includes success knowledge, which captures reusable signals from previously strong repositories, failure knowledge, which records unsuccessful outcomes and their diagnostic signals, and a historical-best repository, which preserves the strongest result found so far and prevents regression. These components collectively transform repeated repository generation into a persistent, knowledge-driven optimization process. We evaluate LiveCoder using four frontier LLMs on two representative repository-level code generation benchmarks. Extensive experimental results demonstrate the effectiveness and efficiency of LiveCoder, improving the functional score by up to 22.94 percentage points, increasing repository reuse to 81.58%, and reducing cost by up to 53.63% on RAL-Bench while maintaining broadly stable non-functional quality.
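摘要中的跨尝试持久状态(成功知识、失败知识、历史最优仓库)可用如下极简 Python 示意(非 LiveCoder 实现,类与字段名均为假设):历史最优仓库只在得分提升时更新,从而防止退化。

```python
class AttemptState:
    """跨尝试持久状态的玩具版(字段名为假设)。"""
    def __init__(self):
        self.success_knowledge = []   # 来自强仓库的可复用信号
        self.failure_knowledge = []   # 失败结果及其诊断信号
        self.best_score = float("-inf")
        self.best_repo = None         # 历史最优仓库

    def record(self, repo, score, diagnostics=None):
        if score > self.best_score:
            self.success_knowledge.append({"repo": repo, "score": score})
            self.best_score, self.best_repo = score, repo
        else:
            self.failure_knowledge.append({"repo": repo, "diag": diagnostics})
        return self.best_repo  # 下一次尝试从当前最优出发
```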
[AI-106] Single-agent vs. Multi-agents for Automated Video Analysis of On-Screen Collaborative Learning Behaviors
【速读】:该论文旨在解决协作学习场景中屏幕录制视频的自动化行为编码问题,传统方法依赖人工标注,耗时且效率低。其核心挑战在于如何准确识别和分类学生在学习过程中的认知参与行为(基于ICAP框架),尤其是在多用户交互情境下对屏幕动作与场景变化的精准捕捉。解决方案的关键在于引入视觉语言模型(Vision Language Models, VLMs)驱动的多智能体系统(Multi-agent System, MAS),提出两种创新架构:一是基于工作流的三智能体系统,通过场景分割与光标引导提示实现行为检测并辅以证据验证;二是受ReAct启发的自主决策型MAS,通过推理-操作-观察的迭代循环完成自我修正与可解释标签生成。实验表明,这两种多智能体框架均优于单一VLM,在场景检测和行为识别任务上分别取得最优性能,为大规模多模态学习行为分析提供了高效、可扩展的技术路径。
链接: https://arxiv.org/abs/2604.03631
作者: Likai Peng,Shihui Feng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 4 figures. To be published in the 27th International Conference on Artificial Intelligence in Education (AIED2026)
Abstract:On-screen learning behavior provides valuable insights into how students seek, use, and create information during learning. Analyzing on-screen behavioral engagement is essential for capturing students’ cognitive and collaborative processes. The recent development of Vision Language Models (VLMs) offers new opportunities to automate the labor-intensive manual coding often required for multimodal video data analysis. In this study, we compared the performance of both leading closed-source VLMs (Claude-3.7-Sonnet, GPT-4.1) and open-source VLM (Qwen2.5-VL-72B) in single- and multi-agent settings for automated coding of screen recordings in collaborative learning contexts based on the ICAP framework. In particular, we proposed and compared two multi-agent frameworks: 1) a three-agent workflow multi-agent system (MAS) that segments screen videos by scene and detects on-screen behaviors using cursor-informed VLM prompting with evidence-based verification; 2) an autonomous-decision MAS inspired by ReAct that iteratively interleaves reasoning, tool-like operations (segmentation/classification/validation), and observation-driven self-correction to produce interpretable on-screen behavior labels. Experimental results demonstrated that the two proposed MAS frameworks achieved viable performance, outperforming the single VLMs in scene and action detection tasks. It is worth noting that the workflow-based agent performed best on scene detection, and the autonomous-decision MAS performed best on action detection. This study demonstrates the effectiveness of VLM-based Multi-agent Systems for video analysis and contributes a scalable framework for multimodal data analytics.
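摘要中"分段—分类—验证"的三智能体工作流可用一段极简 Python 示意(非论文实现,各阶段以可替换的回调函数表示,均为假设):验证未通过的场景标注为不确定,而不是直接采信分类结果。

```python
# 示意:三阶段工作流 MAS 的玩具版——场景分割、行为分类、证据验证。
def workflow_mas(frames, segment, classify, verify):
    labels = []
    for scene in segment(frames):          # 阶段一:按场景分割
        label = classify(scene)            # 阶段二:行为分类
        if verify(scene, label):           # 阶段三:基于证据的验证
            labels.append(label)
        else:
            labels.append("uncertain")     # 验证失败时不采信分类结果
    return labels
```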
[AI-107] A Multimodal Foundation Model of Spatial Transcriptomics and Histology for Biological Discovery and Clinical Prediction DATE
【速读】:该论文旨在解决空间转录组学(Spatial Transcriptomics, ST)成本高、通量低,以及苏木精-伊红(H&E)染色虽具丰富形态信息但缺乏分子分辨率的问题。其解决方案的关键在于提出了一种名为STORM(Spatial Transcriptomics and histOlogy Representation Model)的基础模型,该模型基于120万条配对的空间转录组与组织病理图像数据进行训练,采用分层架构融合形态特征、基因表达和空间上下文信息,从而构建出鲁棒的分子-形态学表征,实现影像与组学数据之间的跨模态映射。STORM不仅提升了空间域发现能力并生成生物学一致的组织图谱,还能在11种肿瘤类型中从H&E图像准确预测空间基因表达,并在多个平台(Visium、Xenium、Visium HD、CosMx)上保持一致性,显著改善免疫治疗反应预测和预后评估性能。
链接: https://arxiv.org/abs/2604.03630
作者: Jinxi Xiang,Siyu Hou,Yuchen Li,Ryan Quinton,Xiaoming Zhang,Feyisope Eweje,Xiangde Luo,Yijiang Chen,Zhe Li,Colin Bergstrom,Ted Kim,Sierra Willens,Francesca Maria Olguin,Matthew Abikenari,Andrew Heider,Sanjeeth Rajaram,Joel Neal,Maximilian Diehn,Xiang Zhou,Ruijiang Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: 29 pages, 5 figures. This manuscript is a work in progress; further updates and revisions will be posted as they become available
Abstract:Spatial transcriptomics (ST) enables gene expression mapping within anatomical context but remains costly and low-throughput. Hematoxylin and eosin (H&E) staining offers rich morphology yet lacks molecular resolution. We present STORM (Spatial Transcriptomics and histOlogy Representation Model), a foundation model trained on 1.2 million spatially resolved transcriptomic profiles with matched histology across 18 organs. Using a hierarchical architecture integrating morphological features, gene expression, and spatial context, STORM bridges imaging and omics through robust molecular–morphological representations. STORM enhances spatial domain discovery, producing biologically coherent tissue maps, and outperforms existing methods in predicting spatial gene expression from H&E images across 11 tumor types. The model is platform-agnostic, performing consistently across Visium, Xenium, Visium HD, and CosMx. Applied to 23 independent cohorts comprising 7,245 patients, STORM significantly improves immunotherapy response prediction and prognostication over established biomarkers, providing a scalable framework for spatially informed discovery and clinical precision medicine.
[AI-108] Toward Executable Repository-Level Code Generation via Environment Alignment
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在仓库级代码生成任务中面临的执行验证挑战,即如何确保生成的多文件代码仓库能够在真实执行环境中成功安装、解析依赖关系与内部引用、顺利运行并被正确验证。现有方法往往仅评估单个代码片段的合理性,而忽视了整个仓库的可执行性。解决方案的关键在于提出EnvGraph框架,将仓库可执行性建模为环境对齐问题,通过双层环境表示同时捕捉外部依赖满足和仓库内引用解析两个耦合条件,并利用执行证据进行归因分析,结合统一的目标修订机制,在迭代对齐循环中引导生成过程,从而显著提升仓库级代码生成的函数正确性和非功能质量。
链接: https://arxiv.org/abs/2604.03622
作者: Ruwei Pan,Junlei Shen,Linhao Wu,Yueheng Zhu,Zixiong Yang,Yakun Zhang,Lu Zhang,Hongyu Zhang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have achieved strong performance on code generation, but existing methods still struggle with repository-level code generation under executable validation. Under this evaluation setting, success is determined not by the plausibility of isolated code fragments, but by whether a generated multi-file repository can be successfully installed, have its dependencies and internal references resolved, be launched, and be validated in a real execution environment. To address this challenge, we propose EnvGraph, a framework for repository-level code generation that formulates repository executability as an environment alignment problem. EnvGraph jointly models two coupled conditions for successful repository execution, namely external dependency satisfaction and repository-internal reference resolution. It maintains a dual-layer environment representation, uses execution evidence to perform execution-evidence-based attribution, and guides repository generation through a unified targeted revision mechanism within an iterative alignment loop. We evaluate EnvGraph on repository-level code generation with three representative backbone LLMs and compare it against representative environment-aware and repository-level baselines. Experimental results show that EnvGraph consistently achieves the best performance on these repository-level benchmarks. In particular, it outperforms the strongest non-EnvGraph baseline by an absolute margin of 5.72–5.87 percentage points in Functional Correctness and 4.58–8.66 percentage points in Non-Functional Quality.
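摘要中两个耦合的可执行性条件(外部依赖满足、仓库内引用可解析)可用如下极简 Python 示意(非 EnvGraph 实现,函数名为假设):对齐循环每轮先检查两类缺口,再做有针对性的修订。

```python
# 示意:双层可执行性检查的玩具版(名称均为假设)。
def executability_gaps(declared_deps, installed, defined_symbols, referenced_symbols):
    """返回 (缺失的外部依赖, 悬空的仓库内引用)。"""
    missing_external = sorted(set(declared_deps) - set(installed))
    dangling_internal = sorted(set(referenced_symbols) - set(defined_symbols))
    return missing_external, dangling_internal

def is_executable(declared_deps, installed, defined_symbols, referenced_symbols):
    missing, dangling = executability_gaps(
        declared_deps, installed, defined_symbols, referenced_symbols)
    return not missing and not dangling  # 两个条件须同时满足
```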
[AI-109] Neural Global Optimization via Iterative Refinement from Noisy Samples
【速读】:该论文旨在解决从噪声样本中对黑箱函数进行全局优化(global optimization)的难题,尤其针对传统贝叶斯优化易陷入局部极小值、而无梯度方法则需大量函数评估的问题。其解决方案的关键在于提出一种新型神经网络方法,通过迭代精化机制从初始猜测逐步逼近真实全局最小值;该模型以噪声函数样本及其拟合样条(spline)表示为输入,结合函数值、导数和样条系数等多模态特征编码,并引入迭代位置更新机制,在无需梯度信息或多次重启的情况下实现鲁棒的全局优化,训练数据来源于通过穷举搜索获得真值全局最小值的随机生成函数。
链接: https://arxiv.org/abs/2604.03614
作者: Qusay Muzaffar,David Levin,Michael Werman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 5 figures, 2 tables
Abstract:Global optimization of black-box functions from noisy samples is a fundamental challenge in machine learning and scientific computing. Traditional methods such as Bayesian Optimization often converge to local minima on multi-modal functions, while gradient-free methods require many function evaluations. We present a novel neural approach that learns to find global minima through iterative refinement. Our model takes noisy function samples and their fitted spline representation as input, then iteratively refines an initial guess toward the true global minimum. Trained on randomly generated functions with ground truth global minima obtained via exhaustive search, our method achieves a mean error of 8.05 percent on challenging multi-modal test functions, compared to 36.24 percent for the spline initialization, a 28.18 percent improvement. The model successfully finds global minima in 72 percent of test cases with error below 10 percent, demonstrating learned optimization principles rather than mere curve fitting. Our architecture combines encoding of multiple modalities including function values, derivatives, and spline coefficients with iterative position updates, enabling robust global optimization without requiring derivative information or multiple restarts.
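摘要中"先拟合去噪、再定位全局最小"的思路,可用一个无学习的玩具版 Python 示意(用核平滑代替论文的样条拟合与神经精化,函数名为假设):对噪声样本直接取 argmin 会被孤立噪声坑误导,平滑后再取 argmin 则更稳健。

```python
import math

def smoothed_argmin(xs, ys, bandwidth=0.5):
    """先用高斯核平滑去噪,再在采样点上取最小——
    作为摘要中"样条初始化"思路的粗糙替身(示意用)。"""
    def smooth(x0):
        w = [math.exp(-((x - x0) / bandwidth) ** 2) for x in xs]
        return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
    return min(xs, key=smooth)
```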
[AI-110] Multi-Robot Multi-Queue Control via Exhaustive Assignment Actor-Critic Learning
【速读】:该论文旨在解决多机器人、多队列系统中的在线任务分配问题,特别针对异构随机到达和切换延迟场景。其核心挑战在于如何在存在位置间移动延迟(即切换需占用一个时隙)和不同地点任务到达率差异的情况下,实现高效的实时调度。解决方案的关键在于提出一种结构感知的“穷尽式分配”演员-评论家(exhaustive-assignment actor-critic)策略架构:该架构通过构造性地强制执行穷尽服务(exhaustive service)机制,仅学习空闲机器人下一队列的选择策略,从而有效适应到达率的非对称性;相比传统穷尽服务最长队列(Exhaustive-serve-longest, ESL)规则(其最优性仅在对称条件下成立),新方法在多种服务器-位置比例、负载及非对称到达配置下均表现出更低的折扣持有成本和更短的平均队列长度,同时在可获得最优基准的实例中仍保持近最优性能。
链接: https://arxiv.org/abs/2604.03605
作者: Mohammad Merati,H. M. Sabbir Ahmad,Wenchao Li,David Castañón
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:
Abstract:We study online task allocation for multi-robot, multi-queue systems with asymmetric stochastic arrivals and switching delays. We formulate the problem in discrete time: each location can host at most one robot per slot, servicing a task consumes one slot, switching between locations incurs a one-slot travel delay, and arrivals at locations are independent Bernoulli processes with heterogeneous rates. Building on our previous structural result that optimal policies are of exhaustive type, we formulate a discounted-cost Markov decision process and develop an exhaustive-assignment actor-critic policy architecture that enforces exhaustive service by construction and learns only the next-queue allocation for idle robots. Unlike the exhaustive-serve-longest (ESL) queue rule, whose optimality is known only under symmetry, the proposed policy adapts to asymmetry in arrival rates. Across different server-location ratios, loads, and asymmetric arrival profiles, the proposed policy consistently achieves lower discounted holding cost and smaller mean queue length than the ESL baseline, while remaining near-optimal on instances where an optimal benchmark is available. These results show that structure-aware actor-critic methods provide an effective approach for real-time multi-robot scheduling.
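摘要中的离散时间模型(Bernoulli 到达、单时隙服务、单时隙切换延迟)与 ESL 基线可用如下极简仿真示意(假设单机器人,非论文实现):

```python
import random

def simulate_esl(rates, horizon, seed=0):
    """穷尽服务最长队列(ESL)基线的玩具仿真,返回平均总队长。"""
    rng = random.Random(seed)
    queues = [0] * len(rates)
    loc, traveling = 0, False
    total_backlog = 0
    for _ in range(horizon):
        for i, p in enumerate(rates):        # 各地点独立 Bernoulli 到达
            queues[i] += rng.random() < p
        if traveling:                        # 切换占用一个时隙
            traveling = False
        elif queues[loc] > 0:                # 穷尽服务:清空当前队列
            queues[loc] -= 1
        else:                                # 空闲时切换到最长队列
            target = max(range(len(queues)), key=queues.__getitem__)
            if target != loc:
                loc, traveling = target, True
        total_backlog += sum(queues)
    return total_backlog / horizon
```

负载升高时平均队长随之上升,这正是基线与学习策略比较折扣持有成本时的背景量。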
[AI-111] Entropy and Attention Dynamics in Small Language Models: A Trace-Level Structural Analysis on the TruthfulQA Benchmark
【速读】:该论文旨在解决小型语言模型(Small Language Models, SLMs)在边缘设备等资源受限场景中因输出不稳定、自信地产生错误预测(即幻觉)而导致的可靠性问题。现有评估方法仅关注最终准确率或幻觉率,忽视了模型内部行为如何影响输出质量,如解码过程中熵的变化、注意力分布的演化以及隐藏表示对不确定性与信息误导传播的作用。解决方案的关键在于引入细粒度的追踪级分析(trace-level analysis),通过token级输出熵、注意力熵、头分散度和隐藏状态表征,揭示SLMs内部动态机制,并据此将模型分为三类:确定性模型(如DeepSeek-1.5B和LLaMA-1B)、探索性模型(如Gemma-1B)和平衡型模型(如Qwen-1.7B),每类具有独特的熵模式与注意力分布特征。研究发现,真实性(truthfulness)源于结构化的熵与注意力动态,因此监控并优化这些内部不确定性模式可指导更可靠、抗幻觉且面向特定应用场景的边缘SLMs设计。
链接: https://arxiv.org/abs/2604.03589
作者: Adeyemi Adeseye,Aisvarya Adeseye,Hannu Tenhunen,Jouni Isoaho
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to Publish it in 12th Intelligent Systems Conference 2026, 3-4 September 2026 in Amsterdam, The Netherlands
Abstract:Small language models (SLMs) have been increasingly deployed in edge devices and other resource-constrained settings. However, these models make confident mispredictions and produce unstable output, making them risky for factual and decision-critical tasks. Current evaluation methodology relies on final accuracy or hallucination rates without explaining how internal model behavior affects outputs. Specifically, how entropy evolves during decoding, how attention is distributed across layers, and how hidden representations contribute to uncertainty, logical inconsistencies, and misinformation propagation are often overlooked. Consequently, this study introduces a trace-level analysis of entropy and attention dynamics in SLMs evaluated on the TruthfulQA dataset. Four models in the 1B-1.7B parameter range were examined via token-level output entropy, attention entropy, head dispersion, and hidden-state representation. The results reveal three model classes distinguished by entropy patterns: deterministic models (DeepSeek-1.5B and LLaMA-1B), whose output entropy decreases over time; exploratory models (Gemma-1B), whose entropy increases; and balanced models (Qwen-1.7B), whose entropy remains moderate and stable. Each group also exhibits distinctly different hidden-state movement and attention dispersion patterns. The analysis demonstrates that truthfulness in SLMs emerges from structured entropy and attention dynamics. Monitoring and optimizing these internal uncertainty patterns can guide the design of more reliable, hallucination-aware, and application-specific edge SLMs.
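摘要中的 token 级输出熵即对下一 token 分布(logits 经 softmax)取香农熵,可用如下 Python 示意(标准定义,非论文专有实现):

```python
import math

def token_entropy(logits):
    """下一 token 分布的香农熵(nats):先做数值稳定的 softmax,再求 -sum p log p。"""
    m = max(logits)                                # 减最大值防止溢出
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)
```

均匀分布时熵最大(|V| 个候选时为 ln|V|),分布高度集中时熵趋近 0,据此即可在解码过程中逐 token 追踪熵的演化。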
[AI-112] Rashomon Memory: Towards Argumentation-Driven Retrieval for Multi-Perspective Agent Memory AAMAS AAMAS2026
【速读】:该论文旨在解决多目标场景下AI代理在长时间运行中对同一事件存在多重、冲突解释的问题,传统记忆架构通常假设单一正确编码或仅支持统一存储下的多视角,难以有效处理这种语义分歧。解决方案的关键在于提出Rashomon Memory架构:通过并行的目标条件代理对经验进行各自优先级驱动的编码,并在查询时通过论证机制协商;每个视角维护独立本体(ontology)和知识图谱,在检索阶段提出解释、利用非对称领域知识相互批判,最终基于Dung的论证语义确定存活提案,形成攻击图(attack graph),该图本身即为可解释的决策过程记录,揭示了被采纳解释、备选方案及其被拒绝的理由。
链接: https://arxiv.org/abs/2604.03588
作者: Albert Sadowski,Jarosław A. Chudziak
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to the EXTRAAMAS workshop at AAMAS 2026
Abstract:AI agents operating over extended time horizons accumulate experiences that serve multiple concurrent goals, and must often maintain conflicting interpretations of the same events. A concession during a client negotiation encodes as a "trust-building investment" for one strategic goal and a "contractual liability" for another. Current memory architectures assume a single correct encoding, or at best support multiple views over unified storage. We propose Rashomon Memory: an architecture where parallel goal-conditioned agents encode experiences according to their priorities and negotiate at query time through argumentation. Each perspective maintains its own ontology and knowledge graph. At retrieval, perspectives propose interpretations, critique each other’s proposals using asymmetric domain knowledge, and Dung’s argumentation semantics determines which proposals survive. The resulting attack graph is itself an explanation: it records which interpretation was selected, which alternatives were considered, and on what grounds they were rejected. We present a proof-of-concept showing that retrieval modes (selection, composition, conflict surfacing) emerge from attack graph topology, and that the conflict surfacing mode, where the system reports genuine disagreement rather than forcing resolution, lets decision-makers see the underlying interpretive conflict directly.
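摘要中"由 Dung 论证语义决定哪些提案存活"的查询时裁决,可用标准的有根扩展(grounded extension)迭代算法示意(通用算法,非论文实现):无攻击者(或攻击者均被击败)的提案被接受;被接受提案攻击的提案被拒绝;互相攻击且无外部裁决的提案两边都悬置,正对应"冲突浮现"模式。

```python
def grounded_extension(args, attacks):
    """计算 Dung 有根扩展。attacks 为 (攻击者, 被攻击者) 二元组集合。"""
    accepted, rejected = set(), set()
    changed = True
    while changed:
        changed = False
        for a in args:
            if a in accepted or a in rejected:
                continue
            attackers = {x for (x, t) in attacks if t == a}
            if attackers <= rejected:        # 所有攻击者都已被击败
                accepted.add(a); changed = True
            elif attackers & accepted:       # 被某个已接受论证攻击
                rejected.add(a); changed = True
    return accepted, rejected                # 两集合之外的论证即为"冲突浮现"
```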
[AI-113] SecPI: Secure Code Generation with Reasoning Models via Security Reasoning Internalization
【速读】:该论文旨在解决生成式 AI (Generative AI) 在编程场景中因缺乏系统性安全推理能力而导致的严重安全漏洞问题,即即使最先进的推理语言模型(Reasoning Language Models, RLMs)在生成代码时仍频繁引入关键安全缺陷。传统基于训练数据的方法受限于昂贵且人工标注的安全数据集覆盖范围有限,而推理阶段的通用安全提示则会损害功能正确性并仅触发浅层漏洞分析。其解决方案的关键在于提出 SecPI——一个细调(fine-tuning)流水线,通过筛选通用编码数据集中的安全相关任务、利用教师模型生成结构化安全推理轨迹(structured security reasoning traces),并以无安全指令输入与对应推理轨迹对为目标进行微调,使 RLM 能够自主内化安全推理机制,从而在无需任何显式安全指令的情况下默认生成安全代码。实验证明,该方法显著提升了代码的功能正确性和安全性,并展现出跨 CWE 类型和跨语言的良好泛化能力。
链接: https://arxiv.org/abs/2604.03587
作者: Hao Wang,Niels Mündler,Mark Vero,Jingxuan He,Dawn Song,Martin Vechev
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Reasoning language models (RLMs) are increasingly used in programming. Yet, even state-of-the-art RLMs frequently introduce critical security vulnerabilities in generated code. Prior training-based approaches for secure code generation face a critical limitation that prevents their direct application to RLMs: they rely on costly, manually curated security datasets covering only a limited set of vulnerabilities. At the inference level, generic security reminders consistently degrade functional correctness while triggering only shallow ad-hoc vulnerability analysis. To address these problems, we present SecPI, a fine-tuning pipeline that teaches RLMs to internalize structured security reasoning, producing secure code by default without any security instructions at inference time. SecPI filters existing general-purpose coding datasets for security-relevant tasks using an LLM-based classifier, generates high-quality security reasoning traces with a teacher model guided by a structured prompt that systematically enumerates relevant CWEs and mitigations, and fine-tunes the target model on pairs of inputs with no security prompt and teacher reasoning traces – as a result, the model learns to reason about security autonomously rather than in response to explicit instructions. An extensive evaluation on security benchmarks with state-of-the-art open-weight reasoning models validates the effectiveness of our approach. For instance, SecPI improves the percentage of functionally correct and secure generations for QwQ 32B from 48.2% to 62.2% (+14.0 points) on CWEval and from 18.2% to 22.0% on BaxBench. Further investigation also reveals strong cross-CWE and cross-language generalization beyond training vulnerabilities. Even when trained only on injection-related CWEs, QwQ 32B generates correct and secure code 9.9% more frequently on held-out memory-safety CWEs.
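摘要描述的 SecPI 数据构造流程(筛选安全相关任务 → 教师模型生成推理轨迹 → 以"无安全指令输入 + 教师轨迹"成对微调)可用如下极简 Python 示意(非论文实现,分类器与教师均以假设的回调函数代替):

```python
# 示意:SecPI 式微调数据构造的玩具版(名称均为假设)。
def build_finetune_dataset(tasks, is_security_relevant, teacher_trace):
    """输入端不含任何安全指令,目标端为教师的结构化安全推理轨迹。"""
    return [
        {"input": t, "target": teacher_trace(t)}
        for t in tasks
        if is_security_relevant(t)          # 只保留安全相关的编码任务
    ]
```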
[AI-114] Selective Forgetting for Large Reasoning Models
【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在生成结构化思维链(Chain of Thought, CoT)过程中可能泄露敏感信息的问题,尤其是训练数据中包含的版权或隐私内容被记忆并暴露于中间推理步骤的风险。现有遗忘方法主要针对最终答案进行处理,易损害模型的整体推理能力;直接对整个CoT执行遗忘操作则可能削弱通用推理性能。解决方案的关键在于实现对特定知识的精准遗忘,同时保留模型的通用推理能力。为此,作者提出一种新颖的LRM遗忘框架,利用多个具备检索增强生成(Retrieval-Augmented Generation, RAG)能力的大语言模型(LLMs)分析CoT痕迹,识别与遗忘相关的片段,并用保持逻辑结构的良性占位符替换;此外,引入一种特征替换遗忘损失函数,在抑制遗忘内容生成概率的同时强化结构合理的替代输出,从而实现选择性遗忘与推理完整性之间的平衡。
链接: https://arxiv.org/abs/2604.03571
作者: Tuan Le,Wei Qian,Mengdi Huai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Reasoning Models (LRMs) generate structured chains of thought (CoTs) before producing final answers, making them especially vulnerable to knowledge leakage through intermediate reasoning steps. Yet, the memorization of sensitive information in the training data such as copyrighted and private content has led to ethical and legal concerns. To address these issues, selective forgetting (also known as machine unlearning) has emerged as a potential remedy for LRMs. However, existing unlearning methods primarily target final answers and may degrade the overall reasoning ability of LRMs after forgetting. Additionally, directly applying unlearning on the entire CoTs could degrade the general reasoning capabilities. The key challenge for LRM unlearning lies in achieving precise unlearning of targeted knowledge while preserving the integrity of general reasoning capabilities. To bridge this gap, we in this paper propose a novel LRM unlearning framework that selectively removes sensitive reasoning components while preserving general reasoning capabilities. Our approach leverages multiple LLMs with retrieval-augmented generation (RAG) to analyze CoT traces, identify forget-relevant segments, and replace them with benign placeholders that maintain logical structure. We also introduce a new feature replacement unlearning loss for LRMs, which can simultaneously suppress the probability of generating forgotten content while reinforcing structurally valid replacements. Extensive experiments on both synthetic and medical datasets verify the desired properties of our proposed method.
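摘要中"同时抑制遗忘内容、强化良性替换"的目标函数形态可用如下玩具版 Python 示意(仅为说明两项损失的方向,非论文的特征替换损失实现;实际训练中抑制项需做截断或有界化):

```python
import math

def replacement_unlearning_loss(p_forgotten, p_replacement, alpha=1.0):
    """玩具版双项目标:最小化该损失会压低遗忘内容的逐 token 概率
    (suppress 项),同时抬高良性占位符的概率(reinforce 项为其 NLL)。
    示意用,未做实际训练所需的有界化处理。"""
    suppress = sum(math.log(p) for p in p_forgotten)      # 越小越好 → p 被压低
    reinforce = -sum(math.log(p) for p in p_replacement)  # NLL → p 被抬高
    return reinforce + alpha * suppress
```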
[AI-115] Personality Requires Struggle: Three Regimes of the Baldwin Effect in Neuroevolved Chess Agents
【速读】:该论文旨在解决“终身学习(lifetime learning)是否能在进化时间尺度上扩展行为多样性,而非导致其压缩”这一核心问题。传统理论认为可塑性通过缓冲环境噪声降低变异度,但本文通过在棋类智能体中引入NEAT进化神经模块、游戏内Hebbian可塑性和基于想象的欲望域信号链,发现可塑性对行为方差的影响随进化时间呈现反转趋势:初期(<34代)压缩多样性,后期则显著扩展——源于想象力驱动的感知差异放大反馈环,这是突变机制无法维持的结构化分化。解决方案的关键在于构建具有内在认知架构(cognitive architecture)的神经进化系统,使行为多样性从随机采样跃迁为可重复、可解释的个体特征(ICC > 0.8),并揭示三种不同对手类型下的行为演化 regimes(探索型、彩票型、透明型),其中透明型预测自对弈系统可能因消除个性所需异质性而系统性抑制行为多样性。
链接: https://arxiv.org/abs/2604.03565
作者: Diego Armando Resendez Prado
机构: 未知
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 18 pages, 4 figures, 4 tables
Abstract:Can lifetime learning expand behavioral diversity over evolutionary time, rather than collapsing it? Prior theory predicts that plasticity reduces variance by buffering organisms against environmental noise. We test this in a competitive domain: chess agents with eight NEAT-evolved neural modules, Hebbian within-game plasticity, and a desirability-domain signal chain with imagination. Across 10 seeds per Hebbian condition, a variance crossover emerges: Hebbian ON starts with lower cross-seed variance than OFF, then surpasses it at generation 34. The crossover trend is monotonic (ρ = 0.91, p < 10^-6): plasticity’s effect on behavioral variance reverses over evolutionary time, initially compressing diversity (consistent with prior predictions) then expanding it as evolved Perception differences are amplified through imagination – a feedback loop that mutation alone cannot sustain. The result is structured behavioral divergence: evolved agents select different moves on the same positions (62% disagreement), develop distinct opening repertoires, piece preferences, and game lengths. These are not different sampling policies – they are reproducible behavioral signatures (ICC > 0.8) with interpretable signal chain configurations. Three regimes appear depending on opponent type: exploration (Hebbian ON, heterogeneous opponent), lottery (Hebbian OFF, elitism lock-in), and transparent (same-model opponent, brain self-erasure). The transparent regime generates a falsifiable prediction: self-play systems may systematically suppress behavioral diversity by eliminating the heterogeneity that personality requires.
Keywords: Baldwin Effect, neuroevolution, NEAT, Hebbian learning, chess, cognitive architecture, personality emergence, imagination
[AI-116] When Adaptive Rewards Hurt: Causal Probing and the Switching-Stability Dilemma in LLM -Guided LEO Satellite Scheduling
【速读】:该论文旨在解决低轨(LEO)多波束卫星调度中深度强化学习(DRL)的奖励设计问题,核心挑战在于如何通过自适应奖励权重提升调度性能。研究表明,静态奖励权重(342.1 Mbps)优于动态调整的权重(103.3±96.8 Mbps),原因在于近似平稳的奖励信号对策略梯度算法(如PPO)的价值函数收敛至关重要;任何奖励权重的适应性调整都会因反复重启收敛过程而损害性能。解决方案的关键在于引入单变量因果探测方法(single-variable causal probing),独立扰动每个奖励项±20%,并在50k步后测量PPO响应,从而发现反直觉的杠杆效应:切换惩罚项增加20%可显著提升极地切换和冷热区切换场景下的吞吐量(分别+157 Mbps和+130 Mbps)。此外,实验对比了四种马尔可夫决策过程(MDP)架构,在已知与新流量场景下验证了基于学习的多层感知机(MLP)优于微调的大语言模型(LLM),后者因权重振荡导致性能崩溃(45.3±43.0 Mbps),说明输出一致性而非领域知识才是限制因素。这一发现为LLM与DRL在通信系统中的集成提供了实证指导,明确了LLM不可替代的应用场景(如自然语言意图理解)与简单方法即可胜任的任务边界。
链接: https://arxiv.org/abs/2604.03562
作者: Yuanhang Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 3 figures
Abstract:Adaptive reward design for deep reinforcement learning (DRL) in multi-beam LEO satellite scheduling is motivated by the intuition that regime-aware reward weights should outperform static ones. We systematically test this intuition and uncover a switching-stability dilemma: near-constant reward weights (342.1 Mbps) outperform carefully-tuned dynamic weights (103.3+/-96.8 Mbps) because PPO requires a quasi-stationary reward signal for value function convergence. Weight adaptation, regardless of quality, degrades performance by repeatedly restarting convergence. To understand why specific weights matter, we introduce a single-variable causal probing method that independently perturbs each reward term by +/-20% and measures PPO response after 50k steps. Probing reveals counterintuitive leverage: a +20% increase in the switching penalty yields +157 Mbps for polar handover and +130 Mbps for hot-cold regimes, findings inaccessible to human experts or trained MLPs without systematic probing. We evaluate four MDP architect variants (fixed, rule-based, learned MLP, finetuned LLM) across known and novel traffic regimes. The MLP achieves 357.9 Mbps on known regimes and 325.2 Mbps on novel regimes, while the fine-tuned LLM collapses to 45.3+/-43.0 Mbps due to weight oscillation rather than lack of domain knowledge; output consistency, not knowledge, is the binding constraint. Our findings provide an empirically-grounded roadmap for LLM-DRL integration in communication systems, identifying where LLMs add irreplaceable value (natural language intent understanding) versus where simpler methods suffice.
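摘要中的单变量因果探测(每个奖励项独立扰动 ±20%,其余保持不变,再度量响应)可用如下极简 Python 示意(非论文实现,评估函数在此为假设的玩具函数,实际中对应重训后的吞吐量度量):

```python
def probe_reward_terms(weights, evaluate, delta=0.2):
    """单变量因果探测:逐项扰动 ±delta,记录相对基线的响应差。
    evaluate 为假设的黑箱评估函数(论文中为训练 50k 步后的 PPO 性能)。"""
    baseline = evaluate(weights)
    effects = {}
    for name, w in weights.items():
        for sign in (+1, -1):
            perturbed = dict(weights)            # 只动一个权重,其余冻结
            perturbed[name] = w * (1 + sign * delta)
            effects[(name, sign)] = evaluate(perturbed) - baseline
    return effects
```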
[AI-117] When Do Hallucinations Arise? A Graph Perspective on the Evolution of Path Reuse and Path Compression
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)中常见的推理幻觉(reasoning hallucinations)问题,即模型生成看似流畅但缺乏上下文支持或违背事实的结论。其核心解决方案是将下一个词预测建模为在潜在图结构上的路径搜索过程:实体作为节点,学习到的转移关系构成边。在此框架下,基于上下文的推理被视为对采样子图的受限搜索(内在推理),而无上下文查询则依赖于图中记忆的结构(外在推理)。作者指出,推理幻觉主要由两个机制驱动:路径复用(Path Reuse),即早期训练中记忆知识覆盖上下文约束;以及路径压缩(Path Compression),即后期训练中高频多步路径坍缩为捷径边。这两个机制共同提供了一个统一的解释框架,阐明了LLM推理幻觉的成因,并与下游应用中的已知行为相联系。
链接: https://arxiv.org/abs/2604.03557
作者: Xinnan Dai,Kai Yang,Cheng Luo,Shenglai Zeng,Kai Guo,Jiliang Tang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reasoning hallucinations in large language models (LLMs) often appear as fluent yet unsupported conclusions that violate either the given context or underlying factual knowledge. Although such failures are widely observed, the mechanisms by which decoder-only Transformers produce them remain poorly understood. We model next-token prediction as a graph search process over an underlying graph, where entities correspond to nodes and learned transitions form edges. From this perspective, contextual reasoning is a constrained search over a sampled subgraph (intrinsic reasoning), while context-free queries rely on memorized structures in the underlying graph (extrinsic reasoning). We show that reasoning hallucinations arise from two fundamental mechanisms: Path Reuse, where memorized knowledge overrides contextual constraints during early training, and Path Compression, where frequently traversed multi-step paths collapse into shortcut edges in later training. Together, these mechanisms provide a unified explanation for reasoning hallucinations in LLMs and connect them to well-known behaviors observed in downstream applications.
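摘要中"上下文子图受限搜索 vs. 记忆图捷径"的差别可用如下玩具版 Python 示意(非论文实现):记忆图因路径压缩多出一条捷径边,使得在上下文并不支持的情况下也能"推出"结论——对应上下文被覆盖的推理幻觉。

```python
# 示意:同一查询在上下文子图与记忆图上的可达性差异(名称均为假设)。
def reachable(edges, start, goal):
    """边集合上的简单图搜索:start 能否到达 goal。"""
    seen, frontier = {start}, [start]
    while frontier:
        node = frontier.pop()
        if node == goal:
            return True
        for (u, v) in edges:
            if u == node and v not in seen:
                seen.add(v)
                frontier.append(v)
    return False

context_edges = {("A", "B")}                      # 当前上下文给定的子图
memorized_edges = context_edges | {("A", "C")}    # 训练中高频路径 A->B->C 压缩成捷径 A->C
```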
[AI-118] Automated Analysis of Global AI Safety Initiatives: A Taxonomy-Driven LLM Approach
【速读】:该论文旨在解决如何自动化比较不同人工智能(AI)安全政策文档之间内容一致性与差异性的问题,以支持政策制定者和研究人员进行跨文档的系统性对比分析。解决方案的关键在于构建一个基于共享活动分类体系(Activity Map on AI Safety)的自动化交叉映射(crosswalk)框架:首先将文档中的相关活动提取并映射到固定活动类别,再为每个类别生成摘要、简要对比及相似度评分;该框架利用大语言模型(LLM)实现自动化处理,并通过多模型评估与人工验证相结合的方式,确保结果的稳定性和有效性。
链接: https://arxiv.org/abs/2604.03533
作者: Takayuki Semitsu,Naoto Kiribuchi,Kengo Zenitani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 6 figures, 6 tables, to be published in PoliticalNLP 2026
Abstract:We present an automated crosswalk framework that compares an AI safety policy document pair under a shared taxonomy of activities. Using the activity categories defined in Activity Map on AI Safety as fixed aspects, the system extracts and maps relevant activities, then produces for each aspect a short summary for each document, a brief comparison, and a similarity score. We assess the stability and validity of LLM-based crosswalk analysis across public policy documents. Using five large language models, we perform crosswalks on ten publicly available documents and visualize mean similarity scores with a heatmap. The results show that model choice substantially affects the crosswalk outcomes, and that some document pairs yield high disagreement across models. A human evaluation by three experts on two document pairs shows high inter-annotator agreement, while model scores still differ from human judgments. These findings support comparative inspection of policy documents.
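摘要中热力图所用的"多模型平均相似度"聚合可用如下极简 Python 示意(非论文实现,数据结构为假设):逐维度对各模型打分取均值,再汇总为文档对的总体分。

```python
def mean_similarity(scores_by_model):
    """scores_by_model: {模型名: {维度: [0,1] 区间的相似度分}}。
    返回 (逐维度均值, 文档对总体均值)。"""
    aspects = next(iter(scores_by_model.values())).keys()
    per_aspect = {
        a: sum(s[a] for s in scores_by_model.values()) / len(scores_by_model)
        for a in aspects
    }
    overall = sum(per_aspect.values()) / len(per_aspect)
    return per_aspect, overall
```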
[AI-119] Structural Rigidity and the 57-Token Predictive Window: A Physical Framework for Inference-Layer Governability in Large Language Models
【速读】:该论文旨在解决当前人工智能(AI)安全评估中缺乏可测量的预承诺信号(pre-commitment signal)的问题,即现有基于行为监测和训练后对齐的方法无法有效检测模型在生成过程中是否即将违反规则或产生幻觉。其解决方案的关键在于提出一种基于能量的治理框架(energy-based governance framework),将Transformer推理动态与神经计算中的约束满足模型相连接,并引入轨迹张力(trajectory tension, ρ = ||a|| / ||v||)作为量化指标。通过该框架,在Phi-3-mini-4k-instruct模型上识别出一个57-token的预承诺窗口,证明了预承诺信号的存在性具有模型、任务和配置特异性;同时构建了五类推理行为分类法(Authority Band、Late Signal、Inverted、Flat、Scaffold-Selective),并以能量不对称性(Σρ_misaligned / Σρ_aligned)作为统一度量结构刚性,从而为大语言模型(LLM)推理层的可治理性提供可测量的物理基础。
链接: https://arxiv.org/abs/2604.03524
作者: Gregory M. Ruddell
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Extends arXiv:2603.21415 . 30 pages. Also available on Zenodo ( https://doi.org/10.5281/zenodo.19393882 )
Abstract:Current AI safety relies on behavioral monitoring and post-training alignment, yet empirical measurement shows these approaches produce no detectable pre-commitment signal in a majority of instruction-tuned models tested. We present an energy-based governance framework connecting transformer inference dynamics to constraint-satisfaction models of neural computation, and apply it to a seven-model cohort across five geometric regimes. Using trajectory tension (rho = ||a|| / ||v||), we identify a 57-token pre-commitment window in Phi-3-mini-4k-instruct under greedy decoding on arithmetic constraint probes. This result is model-specific, task-specific, and configuration-specific, demonstrating that pre-commitment signals can exist but are not universal. We introduce a five-regime taxonomy of inference behavior: Authority Band, Late Signal, Inverted, Flat, and Scaffold-Selective. Energy asymmetry (Σρ_misaligned / Σρ_aligned) serves as a unifying metric of structural rigidity across these regimes. Across seven models, only one configuration exhibits a predictive signal prior to commitment; all others show silent failure, late detection, inverted dynamics, or flat geometry. We further demonstrate that factual hallucination produces no predictive signal across 72 test conditions, consistent with spurious attractor settling in the absence of a trained world-model constraint. These results establish that rule violation and hallucination are distinct failure modes with different detection requirements. Internal geometry monitoring is effective only where resistance exists; detection of factual confabulation requires external verification mechanisms. This work provides a measurable framework for inference-layer governability and introduces a taxonomy for evaluating deployment risk in autonomous AI systems.
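摘要中的轨迹张力 rho = ||a|| / ||v|| 可按如下方式从一串隐藏状态向量计算(差分的具体取法为本文假设,论文可能有不同的离散化方案):速度取相邻状态的一阶差分,加速度取速度的一阶差分。

```python
import math

def trajectory_tension(states):
    """逐步计算 rho = ||a|| / ||v||(示意用的有限差分版本)。
    states: 隐藏状态向量列表;返回各步的张力值。"""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))
    def diff(a, b):
        return [x - y for x, y in zip(a, b)]
    v = [diff(states[i + 1], states[i]) for i in range(len(states) - 1)]   # 速度
    a = [diff(v[i + 1], v[i]) for i in range(len(v) - 1)]                  # 加速度
    return [norm(ai) / norm(vi) for ai, vi in zip(a, v[:-1]) if norm(vi) > 0]
```

直线匀速轨迹张力为 0(无"抵抗"),方向突变处张力升高,正对应摘要中以几何量刻画结构刚性的思路。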
Also available on Zenodo (https://doi.org/10.5281/zenodo.19393882) Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2604.03524 [cs.AI] (or arXiv:2604.03524v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.03524 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.5281/zenodo.19393882 Focus to learn more DOI(s) linking to related resources Submission history From: Gregory Ruddell [view email] [v1] Sat, 4 Apr 2026 00:08:17 UTC (514 KB) Full-text links: Access Paper: View a PDF of the paper titled Structural Rigidity and the 57-Token Predictive Window: A Physical Framework for Inference-Layer Governability in Large Language Models, by Gregory M. RuddellView PDF view license Current browse context: cs.AI prev | next new | recent | 2026-04 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) 
Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[AI-120] Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures
【速读】:该论文旨在解决当前对基于大语言模型(Large Language Model, LLM)的代码生成代理(coding agent)中“支撑代码”(scaffold code)架构理解不足的问题,即控制循环、工具定义、状态管理和上下文策略等底层结构缺乏系统性分类与分析。现有研究多以抽象能力(如工具使用、规划、反思)进行分类,无法区分架构差异;而轨迹研究虽能观察行为却未揭示其背后的架构成因。论文的关键解决方案是提出一个基于源码层级的架构分类法(architectural taxonomy),通过对13个开源编码代理在固定提交版本下的代码分析,从三个层次(控制架构、工具与环境接口、资源管理)共12个维度进行刻画,识别出五种可组合的循环原语(loop primitives),并发现多数代理采用多原语组合而非单一控制结构,从而为研究人员提供可复用的代码级参考,也为实践者设计新型代理架构提供结构化依据。
链接: https://arxiv.org/abs/2604.03515
作者: Benjamin Rombaut
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:
Abstract:LLM-based coding agents can localize bugs, generate patches, and run tests with diminishing human oversight, yet the scaffolding code that surrounds the language model (the control loop, tool definitions, state management, and context strategy) remains poorly understood. Existing surveys classify agents by abstract capabilities (tool use, planning, reflection) that cannot distinguish between architecturally distinct systems, and trajectory studies observe what agents do without examining the scaffold code that determines why. This paper presents a source-code-level architectural taxonomy derived from analysis of 13 open-source coding agent scaffolds at pinned commit hashes. Each agent is characterized across 12 dimensions organized into three layers: control architecture, tool and environment interface, and resource management. The analysis reveals that scaffold architectures resist discrete classification: control strategies range from fixed pipelines to Monte Carlo Tree Search, tool counts range from 0 to 37, and context compaction spans seven distinct strategies. Five loop primitives (ReAct, generate-test-repair, plan-execute, multi-attempt retry, tree search) function as composable building blocks that agents layer in different combinations; 11 of 13 agents compose multiple primitives rather than relying on a single control structure. Dimensions converge where external constraints dominate (tool capability categories, edit formats, execution isolation) and diverge where open design questions remain (context compaction, state management, multi-model routing). All taxonomic claims are grounded in file paths and line numbers, providing a reusable reference for researchers studying agent behavior and practitioners designing new scaffolds.
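摘要提到五种可组合的循环原语(ReAct、generate-test-repair、plan-execute、multi-attempt retry、tree search)。下面给出一个极简的示意性草图(假设性示例,非论文所分析的任何开源代理源码,函数名均为虚构),演示 generate-test-repair 内层循环如何被 multi-attempt retry 外层原语组合:

```python
# 示意性草图:组合两种循环原语。
# generate / test / repair 均为虚构的占位实现。

def generate_test_repair(generate, test, repair, max_steps=3):
    """内层原语:生成补丁,测试失败则修复,最多迭代 max_steps 次。"""
    patch = generate()
    for _ in range(max_steps):
        if test(patch):
            return patch
        patch = repair(patch)
    return None

def multi_attempt(primitive, attempts=2):
    """外层原语:整个内层循环失败后从头重试。"""
    def run(*args, **kwargs):
        for _ in range(attempts):
            result = primitive(*args, **kwargs)
            if result is not None:
                return result
        return None
    return run

# 演示:当补丁为偶数时测试通过
state = {"n": 0}
def generate():
    state["n"] += 1
    return state["n"]

agent_loop = multi_attempt(generate_test_repair)
result = agent_loop(generate, lambda p: p % 2 == 0, lambda p: p + 1)
```

外层原语以高阶函数形式包裹内层原语,正对应摘要所说的"可组合构建块":13 个代理中有 11 个采用此类多原语叠加。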
[AI-121] ActionNex: A Virtual Outage Manager for Cloud
【速读】:该论文旨在解决大规模云运维中故障管理(Outage Management)高度依赖人工的问题,尤其是在部分可观测性条件下,需快速诊断、跨团队协作及基于经验的决策。解决方案的关键在于提出一个生产级智能代理系统 ActionNex,其核心由三部分构成:(1)多模态操作信号(如告警内容、遥测数据和人类沟通)被压缩为关键事件(Critical Events),以表征有意义的状态变迁;(2)分层记忆子系统——长期 Key-Condition-Action (KCA) 知识库、历史故障的事件记忆和当前上下文的工作记忆;(3)推理代理通过匹配当前关键事件与预设条件、检索相关记忆并生成下一步最优行动建议,同时利用人工执行动作作为隐式反馈信号,实现人机协同的持续自我进化。
链接: https://arxiv.org/abs/2604.03512
作者: Zhenfeng Lin,Haoji Hu,Ming Hao,Xuchao Zhang,Ryan Zhang,Junhao Li,Ze Li,Oleg Kulygin,Chetan Bansal,Hatay Tuna,Murali Chintalapati,Sheila Jiang,Salman Zafar,Angie Anderson
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Outage management in large-scale cloud operations remains heavily manual, requiring rapid triage, cross-team coordination, and experience-driven decisions under partial observability. We present ActionNex, a production-grade agentic system that supports end-to-end outage assistance, including real-time updates, knowledge distillation, and role- and stage-conditioned next-best action recommendations. ActionNex ingests multimodal operational signals (e.g., outage content, telemetry, and human communications) and compresses them into critical events that represent meaningful state transitions. It couples this perception layer with a hierarchical memory subsystem: long-term Key-Condition-Action (KCA) knowledge distilled from playbooks and historical executions, episodic memory of prior outages, and working memory of the live context. A reasoning agent aligns current critical events to preconditions, retrieves relevant memories, and generates actionable recommendations; executed human actions serve as an implicit feedback signal to enable continual self-evolution in a human-agent hybrid system. We evaluate ActionNex on eight real Azure outages (8M tokens, 4,000 critical events) using two complementary ground-truth action sets, achieving 71.4% precision and 52.8-54.8% recall. The system has been piloted in production and has received positive early feedback.
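摘要中的 Key-Condition-Action (KCA) 知识匹配可以用如下最小示意说明(假设性草图,非 ActionNex 实现;条目内容均为虚构):将当前关键事件集合与各条目的前置条件对齐,返回前置条件全部满足的候选行动。

```python
# 示意性的 KCA 匹配:conditions 全部被当前关键事件覆盖时推荐 action。

KCA_STORE = [
    {"key": "db-latency",  "conditions": {"latency_spike", "region_us"},
     "action": "failover read replicas"},
    {"key": "cert-expiry", "conditions": {"tls_errors"},
     "action": "rotate certificates"},
]

def recommend(critical_events, store=KCA_STORE):
    """返回前置条件全部被当前关键事件覆盖的行动列表。"""
    events = set(critical_events)
    return [entry["action"] for entry in store
            if entry["conditions"] <= events]

recs = recommend({"latency_spike", "region_us", "paging_oncall"})
```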
[AI-122] BioAlchemy: Distilling Biological Literature into Reasoning-Ready Reinforcement Learning Training Data
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在生物学研究任务中推理能力不足的问题,其根源在于现有大规模推理数据集中的生物问题与现代生物学研究主题分布不匹配,且缺乏可验证的挑战性科研问题提取方法。解决方案的关键在于构建一个名为 BioAlchemy 的管道,用于从生物学科研文本中提取多样且可验证的问答对,并基于此构建包含超过 345K 条科学推理题目的 BioAlchemy-345K 数据集;进一步通过将该数据集与现代生物学主题分布对齐,并结合强化学习(Reinforcement Learning, RL)优化模型,最终实现了 BioAlchemist-8B 模型在生物基准测试上的性能提升(较基础模型提高 9.12%),验证了该方法在增强生物学科学推理能力方面的有效性。
链接: https://arxiv.org/abs/2604.03506
作者: Brian Hsu,Ozan Gökdemir,Carlo Siebenschuh,Bruce Parrello,Neil Getty,Thomas S. Brettin,Rick L. Stevens,Ian T. Foster,Nicholas Chia,Arvind Ramanathan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Despite the large corpus of biology training text, the impact of reasoning models on biological research generally lags behind math and coding. In this work, we show that biology questions from current large-scale reasoning datasets do not align well with modern research topic distributions in biology, and that this topic imbalance may negatively affect performance. In addition, we find that methods for extracting challenging and verifiable research problems from biology research text are a critical yet underdeveloped ingredient in applying reinforcement learning for better performance on biology research tasks. We introduce BioAlchemy, a pipeline for sourcing a diverse set of verifiable question-and-answer pairs from a scientific corpus of biology research text. We curate BioAlchemy-345K, a training dataset containing over 345K scientific reasoning problems in biology. Then, we demonstrate how aligning our dataset to the topic distribution of modern scientific biology can be used with reinforcement learning to improve reasoning performance. Finally, we present BioAlchemist-8B, which improves over its base reasoning model by 9.12% on biology benchmarks. These results demonstrate the efficacy of our approach for developing stronger scientific reasoning capabilities in biology. The BioAlchemist-8B model is available at: this https URL.
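摘要强调从科研文本中抽取"可验证"的问答对。下面是一个最简的示意性过滤器(假设性示例,非 BioAlchemy 管道;文本与问答均为虚构),仅保留短答案能在原文片段中直接核对到的条目:

```python
# 示意性的"可验证"过滤:答案字符串必须出现在原文中。

def keep_verifiable(passage, qa_pairs):
    """保留答案能在 passage 中核对到的 (question, answer) 对。"""
    text = passage.lower()
    return [(q, a) for q, a in qa_pairs if a.lower() in text]

passage = "CRISPR-Cas9 uses a guide RNA to target DNA for cleavage."
pairs = [("What molecule guides Cas9 to its target?", "guide RNA"),
         ("In what year was the system repurposed?", "2012")]
kept = keep_verifiable(passage, pairs)
```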
[AI-123] Resource-Conscious Modeling for Next-Day Discharge Prediction Using Clinical Notes
【速读】:该论文旨在解决择期脊柱手术病房中患者次日出院预测的准确性问题,以优化床位周转率和医疗资源配置。其核心解决方案在于比较轻量级微调的大语言模型(LLMs)与传统文本建模方法在临床笔记数据上的表现,关键发现是基于TF-IDF特征结合梯度提升树算法(如LGBM)的传统模型在不平衡数据场景下展现出更优的性能,其F1分数达0.47、AUC-ROC为0.80,优于多数Transformer架构的生成式AI模型,表明可解释性强且计算资源消耗低的模型更适合实际临床环境中的预测任务。
链接: https://arxiv.org/abs/2604.03498
作者: Ha Na Cho,Sairam Sutari,Alexander Lopez,Hansen Bow,Kai Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Timely discharge prediction is essential for optimizing bed turnover and resource allocation in elective spine surgery units. This study evaluates the feasibility of lightweight, fine-tuned large language models (LLMs) and traditional text-based models for predicting next-day discharge using postoperative clinical notes. We compared 13 models, including TF-IDF with XGBoost and LGBM, and compact LLMs (DistilGPT-2, Bio_ClinicalBERT) fine-tuned via LoRA. TF-IDF with LGBM achieved the best balance, with an F1-score of 0.47 for the discharge class, a recall of 0.51, and the highest AUC-ROC (0.80). While LoRA improved recall in DistilGPT2, overall transformer-based and generative models underperformed. These findings suggest interpretable, resource-efficient models may outperform compact LLMs in real-world, imbalanced clinical prediction tasks.
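摘要中表现最好的组合以 TF-IDF 特征为输入。下面用纯标准库给出 TF-IDF 构造的示意(注意:病程记录样例为虚构,最近质心打分只是 LGBM 分类器的简化替代,仅演示特征层):

```python
import math
from collections import Counter

# 虚构的术后记录,label 1 = 次日出院
notes = [
    ("ambulating independently pain controlled plan discharge", 1),
    ("tolerating diet physical therapy cleared discharge tomorrow", 1),
    ("new fever overnight wound drainage continue iv antibiotics", 0),
    ("oxygen requirement increased monitor in unit", 0),
]
tokenized = [text.split() for text, _ in notes]
n_docs = len(tokenized)
df = Counter(t for doc in tokenized for t in set(doc))

def vectorize(tokens):
    """将分词后的文本映射为 {词: tf-idf 权重}。"""
    tf = Counter(t for t in tokens if t in df)
    return {t: (c / len(tokens)) * math.log(n_docs / df[t])
            for t, c in tf.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 出院类(label=1)质心,作为分类器的简化替代
vectors = [vectorize(doc) for doc in tokenized]
centroid = Counter()
for vec, (_, label) in zip(vectors, notes):
    if label == 1:
        for t, w in vec.items():
            centroid[t] += w / 2   # 两条正类记录取平均
score = cosine(vectorize("discharge cleared pain controlled".split()), centroid)
```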
[AI-124] Contextual Control without Memory Growth in a Context-Switching Task
【速读】:该论文旨在解决序列决策任务中如何有效实现上下文依赖性的问题,尤其是在部分可观测环境下,传统方法通常依赖显式输入上下文信息或增加循环记忆维度来编码上下文。其解决方案的关键在于:通过干预一个共享的循环潜在状态(shared recurrent latent state)来实现上下文控制,而非直接提供上下文输入或扩展循环状态维度。具体而言,模型首先构建一个预干预的共享潜在状态,随后由一个加性、索引化的上下文操作符作用于该状态,从而在不引入额外循环维度的情况下实现对不同上下文的有效区分与利用。实验表明,该方法在基准任务上性能优异,并且通过条件互信息(I(C;O | S))分析验证了其在固定潜在状态下仍能保留有效的上下文依赖关系。
链接: https://arxiv.org/abs/2604.03479
作者: Song-Ju Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 25 pages, 3 figures
Abstract:Context-dependent sequential decision making is commonly addressed either by providing context explicitly as an input or by increasing recurrent memory so that contextual information can be represented internally. We study a third alternative: realizing contextual dependence by intervening on a shared recurrent latent state, without enlarging recurrent dimensionality. To this end, we introduce an intervention-based recurrent architecture in which a recurrent core first constructs a shared pre-intervention latent state, and context then acts through an additive, context-indexed operator. We evaluate this idea on a context-switching sequential decision task under partial observability. We compare three model families: a label-assisted baseline with direct context access, a memory baseline with enlarged recurrent state, and the proposed intervention model, which uses no direct context input to the recurrent core and no memory growth. On the main benchmark, the intervention model performs strongly without additional recurrent dimensions. We also evaluate the models using the conditional mutual information (I(C;O | S)) as a theorem-motivated operational probe of contextual dependence at fixed latent state. For task-relevant phase-1 outcomes, the intervention model exhibits positive conditional contextual information. Together, these results suggest that intervention on a shared recurrent state provides a viable alternative to recurrent memory growth for contextual control in this setting.
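摘要以条件互信息 I(C;O|S) 作为固定潜在状态下上下文依赖的操作性探针。其经验估计可以按如下方式从样本计数计算(数据为人工构造的最小示例,非论文实验数据):

```python
import math
from collections import Counter

def conditional_mi(triples):
    """由 (c, o, s) 样本估计 I(C;O|S),单位为 nats。"""
    n = len(triples)
    p_cos = Counter(triples)
    p_cs = Counter((c, s) for c, o, s in triples)
    p_os = Counter((o, s) for c, o, s in triples)
    p_s = Counter(s for _, _, s in triples)
    mi = 0.0
    for (c, o, s), k in p_cos.items():
        mi += (k / n) * math.log(
            (k / n) * (p_s[s] / n) / ((p_cs[(c, s)] / n) * (p_os[(o, s)] / n)))
    return mi

# 固定状态下上下文完全决定结果 -> I(C;O|S) = log 2
dependent = [(0, 0, 0), (1, 1, 0), (0, 0, 1), (1, 1, 1)]
# 结果与上下文无关 -> I(C;O|S) = 0
independent = [(0, 0, 0), (1, 0, 0), (0, 0, 1), (1, 0, 1)]
```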
[AI-125] Measuring LLM Trust Allocation Across Conflicting Software Artifacts
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在软件工程辅助工具中因错误信任不一致或不可靠的代码、文档和测试等多源 artifact 而导致的可靠性问题,尤其关注模型是否能识别证据退化、定位不可信来源并合理校准对不同 artifact 的信任度。其解决方案的关键在于提出 TRACE(Trust Reasoning over Artifacts for Calibrated Evaluation)框架,通过在盲扰动下系统性地采集 Javadoc、方法签名、实现代码和测试前缀等 artifact 的结构化信任轨迹(trust traces),从而量化评估模型在 artifact 级别上的质量判断能力、不一致性检测性能、受影响 artifact 归因准确率以及源优先级排序表现,揭示当前 LLMs 在自然语言规范审计与代码层面细微漂移检测之间的不对称敏感性,并指出多数模型存在置信度校准不足的问题。
链接: https://arxiv.org/abs/2604.03447
作者: Noshin Ulfat,Ahsanul Ameen Sabit,Soneya Binta Hossain
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:LLM-based software engineering assistants fail not only by producing incorrect outputs, but also by allocating trust to the wrong artifact when code, documentation, and tests disagree. Existing evaluations focus mainly on downstream outcomes and therefore cannot reveal whether a model recognized degraded evidence, identified the unreliable source, or calibrated its trust across artifacts. We present TRACE (Trust Reasoning over Artifacts for Calibrated Evaluation), a framework that elicits structured artifact-level trust traces over Javadoc, method signatures, implementations, and test prefixes under blind perturbations. Using 22,339 valid traces from seven models on 456 curated Java method bundles, we evaluate per-artifact quality assessment, inconsistency detection, affected artifact attribution, and source prioritization. Across all models, quality penalties are largely localized to the perturbed artifact and increase with severity, but sensitivity is asymmetric across artifact types: documentation bugs induce a substantially larger heavy-to-subtle gap than implementation faults (0.152-0.253 vs. 0.049-0.123). Models detect explicit documentation bugs well (67-94%) and Javadoc and implementation contradictions at 50-91%, yet show a systematic blind spot when only the implementation drifts while the documentation remains plausible, with detection dropping by 7-42 percentage points. Confidence is poorly calibrated for six of seven models. These findings suggest that current LLMs are better at auditing natural-language specifications than at detecting subtle code-level drift, motivating explicit artifact-level trust reasoning before correctness-critical downstream use.
[AI-126] Agile Story-Point Estimation: Is RAG a Better Way to Go?
【速读】:该论文旨在解决敏捷软件开发中用户故事(User Story)估算过程的高时间成本问题,该过程通常依赖人工参与的共识式估算技术(如Planning Poker),效率较低。解决方案的关键在于引入基于检索增强生成(Retrieval Augmented Generation, RAG)的自动化方法,其核心由“检索器”(Retriever)和“生成器”(Generator)组成,通过嵌入模型(embedding models)对历史项目数据进行语义检索与内容生成,从而实现对用户故事复杂度和开发工时的自动估算。
链接: https://arxiv.org/abs/2604.03443
作者: Lamyea Maha,Tajmilur Rahman,Chanchal Roy
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The sprint-based iterative approach in the Agile software development method allows continuous feedback and adaptation. One of the crucial Agile software development activities is the sprint planning session where developers estimate the effort required to complete tasks through a consensus-based estimation technique such as Planning Poker. In the Agile software development method, a common unit of measuring development effort is Story Point (SP) which is assigned to tasks to understand the complexity and development time needed to complete them. Despite the benefits of this process, it is an extremely time-consuming manual process. To mitigate this issue, in this study, we investigated if this manual process can be automated using Retrieval Augmented Generation (RAG) which comprises a “Retriever” and a “Generator”. We applied two embedding models - bge-large-en-v1.5, and Sentence-Transformers’ all-mpnet-base-v2 on 23 open-source software projects of varying sizes and examined four key aspects: 1) how retrieval hyper-parameters influence the performance, 2) whether estimation accuracy differs across different sizes of the projects, 3) whether embedding model choice affects accuracy, and 4) how the RAG-based approach compares to the existing baselines. Although the RAG-based approach outperformed the baseline models in several occasions, our results did not exhibit statistically significant differences in performance across the projects or across the embedding models. This highlights the need for further studies and refinement of the RAG, and model adaptation strategies for better accuracy in automatically estimating user stories.
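按摘要的"Retriever + Generator"结构,可以用如下示意代码还原该流程的骨架(假设性草图:检索用词汇重叠相似度代替 bge-large-en-v1.5 / all-mpnet-base-v2 嵌入,"Generator"用近邻故事点的中位数代替;历史条目均为虚构):

```python
# 虚构的历史用户故事及其故事点
history = [
    ("add login button to settings page", 2),
    ("implement oauth login flow with token refresh", 8),
    ("fix typo in error message", 1),
    ("migrate user table to new schema with backfill", 8),
    ("add logout button", 2),
]

def jaccard(a, b):
    """词汇重叠相似度,作为嵌入余弦相似度的简化替代。"""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def estimate(story, k=3):
    """Retriever: 取 k 条最相似历史故事;Generator 简化为中位数投票。"""
    ranked = sorted(history, key=lambda h: jaccard(story, h[0]), reverse=True)
    points = sorted(sp for _, sp in ranked[:k])
    return points[len(points) // 2]

est = estimate("add login button to profile page")
```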
[AI-127] MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents
【速读】:该论文旨在解决稀疏自编码器(Sparse Autoencoders, SAEs)在安全相关应用中面临的核心问题:SAE的潜在表示(latents)往往并非原子化(atomic),即单个潜在变量可能同时激活于语义上不相关的多个表征子空间(representational subspace),导致其语义混杂,削弱了对模型计算过程的可解释性与控制能力。为提升潜在变量的原子性,论文提出一种联合训练目标,其关键在于引入一个小型元自编码器(meta SAE),用于稀疏重构主SAE的解码器列(decoder columns);当主SAE的解码方向易于被元字典压缩时,即表明该方向位于其他主方向张成的子空间内,此时对主SAE施加惩罚,从而推动其解码方向趋向相互独立、抵抗稀疏压缩。实验表明,在GPT-2 Large(第20层)上,该方法使平均 |\varphi| 降低7.5%,自动化可解释性(fuzzing)评分提升7.6%,验证了原子性增强效果;在Gemma 2 9B上的结果虽具方向性,但亦显示该方法具备向更大模型迁移的潜力。
链接: https://arxiv.org/abs/2604.03436
作者: Matthew Levinson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Sparse autoencoders (SAEs) are increasingly used for safety-relevant applications including alignment detection and model steering. These use cases require SAE latents to be as atomic as possible. Each latent should represent a single coherent concept drawn from a single underlying representational subspace. In practice, SAE latents blend representational subspaces together. A single feature can activate across semantically distinct contexts that share no true common representation, muddying an already complex picture of model computation. We introduce a joint training objective that directly penalizes this subspace blending. A small meta SAE is trained alongside the primary SAE to sparsely reconstruct the primary SAE’s decoder columns; the primary SAE is penalized whenever its decoder directions are easy to reconstruct from the meta dictionary. This occurs whenever latent directions lie in a subspace spanned by other primary directions. This creates gradient pressure toward more mutually independent decoder directions that resist sparse meta-compression. On GPT-2 large (layer 20), the selected configuration reduces mean |\varphi| by 7.5% relative to an identical solo SAE trained on the same data. Automated interpretability (fuzzing) scores improve by 7.6%, providing external validation of the atomicity gain independent of the training and co-occurrence metrics. Reconstruction overhead is modest. Results on Gemma 2 9B are directional. On not-fully-converged SAEs, the same parameterization yields the best results, a +8.6% \Delta Fuzz. Though directional, this is an encouraging sign that the method transfers to a larger model. Qualitative analysis confirms that features firing on polysemantic tokens are split into semantically distinct sub-features, each specializing in a distinct representational subspace. 
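"可分解性惩罚"的方向性可以用如下纯标准库草图示意(重要假设:这里用"各解码方向能否被其余方向线性重构"代替论文中联合训练的稀疏 meta SAE,省略了稀疏约束与梯度训练,仅演示度量本身):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def proj_norm_sq(target, basis):
    """target 在 basis 张成子空间上的投影范数平方(Gram-Schmidt 正交化)。"""
    ortho = []
    for v in basis:
        w = list(v)
        for u in ortho:
            c = dot(w, u)
            w = [wi - c * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(dot(w, w))
        if norm > 1e-12:
            ortho.append([wi / norm for wi in w])
    return sum(dot(target, u) ** 2 for u in ortho)

def decomposability_penalty(columns):
    """每个单位化解码列被其余列重构的能量占比的平均值;0 表示相互独立。"""
    cols = []
    for c in columns:
        n = math.sqrt(dot(c, c))
        cols.append([x / n for x in c])
    total = sum(proj_norm_sq(c, cols[:j] + cols[j + 1:])
                for j, c in enumerate(cols))
    return total / len(cols)

orthogonal = [[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0]]
# 第三列落在前两列张成的子空间内,应受到惩罚
blended = [[1.0, 0, 0, 0], [0, 1.0, 0, 0], [1.0, 1.0, 0, 0]]
```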
[AI-128] AEGIS: Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems
【速读】:该论文旨在解决长序列加密Transformer模型在多GPU平台上的可扩展性问题,即由于密文权重体积庞大且激活值随序列长度快速增长,导致单GPU内存不足,而现有多GPU方案因应用层聚合与加密层RNS(Residue Number System)耦合带来的通信开销过大,难以高效扩展。解决方案的关键在于提出AEGIS系统,通过联合分析Transformer数据流与CKKS多项式耦合关系,推导出基于密文依赖的设备放置策略,将模数一致性和token一致性数据共置,仅在应用依赖必须时引入通信,并重新排序多项式操作以重叠剩余集体通信与计算,从而显著降低跨GPU通信量并提升整体效率。
链接: https://arxiv.org/abs/2604.03425
作者: Zhaoting Gong,Ran Ran,Fan Yao,Wujie Wen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: Accepted at ICS 2026
Abstract:Fully Homomorphic Encryption (FHE) enables privacy-preserving Transformer inference, but long-sequence encrypted Transformers quickly exceed single-GPU memory capacity because encoded weights are already large and encrypted activations grow rapidly with sequence length. Multi-GPU execution therefore becomes unavoidable, yet scaling remains challenging because communication is jointly induced by application-level aggregation and encryption-level RNS coupling. Existing approaches either synchronize between devices frequently or replicate encrypted tensors across devices, leading to excessive communication and latency. We present AEGIS, an Application-Encryption Guided Inference System for scalable long-sequence encrypted Transformer inference on multi-GPU platforms. AEGIS derives device placement from ciphertext dependencies jointly induced by Transformer dataflow and CKKS polynomial coupling, co-locating modulus-coherent and token-coherent data so that communication is introduced only when application dependencies require it, while reordering polynomial operators to overlap the remaining collectives with computation. On 2048-token inputs, AEGIS reduces inter-GPU communication by up to 57.9% in feed-forward networks and 81.3% in self-attention versus prior state-of-the-art designs. On four GPUs, it achieves up to 96.62% scaling efficiency, 3.86x end-to-end speedup, and 69.1% per-device memory reduction. These results establish coordinated application-encryption parallelism as a practical foundation for scalable homomorphic Transformer inference.
[AI-129] Generative AI for material design: A mechanics perspective from burgers to matter
【速读】:该论文旨在解决生成式 AI(Generative AI)在高维空间材料设计中因机制难以解释而限制其在计算力学领域应用的问题。解决方案的关键在于揭示扩散型生成模型与计算力学之间共享的物理基础——即扩散过程、随机微分方程和逆问题均源于相同的统计力学原理。作者通过构建一个低维“三重配料汉堡”基准模型,证明了正向与反向扩散过程在离散情形下可解析求解(基于马尔可夫链与贝叶斯反演),在连续情形下则对应奥恩斯坦-乌伦贝克过程与基于得分的反转机制。进一步地,在包含146种成分、高达8.9×10⁴³种可能配置的高维空间中,由于解析解不可行,研究者采用神经网络学习离散与连续的反向过程,仅用2,260条配方数据即可训练模型并生成百万级样本,准确捕捉数据的统计特性;最终在真实餐厅感官测试中验证了AI设计的五款新汉堡中有三款优于经典麦当劳巨无霸,从而确立了基于扩散的生成建模作为物理可解释的高维设计方法,并将其定位为计算力学的数据驱动延伸。
链接: https://arxiv.org/abs/2604.03409
作者: Vahidullah Tac,Ellen Kuhl
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注: 23 pages, 14 figures, 2 tables
Abstract:Generative artificial intelligence offers a new paradigm to design matter in high-dimensional spaces. However, its underlying mechanisms remain difficult to interpret and limit adoption in computational mechanics. This gap is striking because its core tools-diffusion, stochastic differential equations, and inverse problems-are fundamental to the mechanics of materials. Here we show that diffusion-based generative AI and computational mechanics are rooted in the same principles. We illustrate this connection using a three-ingredient burger as a minimal benchmark for material design in a low-dimensional space, where both forward and reverse diffusion admit analytical solutions: Markov chains with Bayesian inversion in the discrete case and the Ornstein-Uhlenbeck process with score-based reversal in the continuous case. We extend this framework to a high-dimensional design space with 146 ingredients and 8.9x10^43 possible configurations, where analytical solutions become intractable. We therefore learn the discrete and continuous reverse processes using neural network models that infer inverse dynamics from data. We train the models on only 2,260 recipes and generate one million samples that capture the statistical structure of the data, including ingredient prevalence and quantitative composition. We further generate five new burgers and validate them in a restaurant-based sensory study with 100 participants, where three of the AI-designed burgers outperform the classical Big Mac in overall liking, flavor, and texture. These results establish diffusion-based generative modeling as a physically grounded approach to design in high-dimensional spaces. They position generative AI as a natural extension of computational mechanics, with applications from burgers to matter, and establish a path toward data-driven, physics-informed generative design.
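论文指出离散情形下前向与反向扩散可解析求解(马尔可夫链 + 贝叶斯反演)。下面是该结论的一个最小示意(K、beta 与分布数值均为虚构,仅还原单步情形的数学骨架):

```python
K = 3        # 三种"配料"类别
beta = 0.3   # 每步以概率 beta 重新均匀采样

def forward_step(p):
    """对类别分布做一步前向加噪:以 1-beta 保留,否则均匀化。"""
    return [(1 - beta) * pi + beta / K for pi in p]

def reverse_posterior(x_t, p0):
    """单步情形下 P(x_0 = j | x_1 = x_t) 的精确贝叶斯反演。"""
    joint = [p0[j] * ((1 - beta) * (j == x_t) + beta / K) for j in range(K)]
    z = sum(joint)
    return [v / z for v in joint]

p0 = [0.7, 0.2, 0.1]              # 数据分布
p1 = forward_step(p0)             # 加噪后的边缘分布
post = reverse_posterior(0, p0)   # 观测到加噪样本为类别 0
```

前向一步后分布向均匀分布收缩,而贝叶斯反演使观测类别的后验概率高于先验,这正是反向去噪"锐化"的来源。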
[AI-130] TABQAWORLD: Optimizing Multimodal Reasoning for Multi-Turn Table Question Answering
【速读】:该论文旨在解决多轮表格推理中因固定文本序列化导致的表状态编码表示误差累积问题,该误差会显著降低推理准确性,而现有基于表格定位(tabular grounding)的方法虽能缓解此问题但增加了计算开销,不利于实际部署。解决方案的关键在于提出TABQAWORLD框架,其通过联合优化表格动作的表示与估计实现高效可靠推理:在表示层面,采用动作条件的多模态选择策略,动态切换视觉与文本表示以提升表状态读取可靠性;在估计层面,利用表格元数据(如维度、数据类型和关键值)优化每步推理轨迹,安全规划路径并压缩低复杂度动作,从而减少对话轮次和延迟。该方法无需训练即可实现优于基线4.87%的准确率提升,并在静态设置下实现5.42%准确率增益与33.35%推理延迟降低。
链接: https://arxiv.org/abs/2604.03393
作者: Tung Sum Thomas Kwok,Xinyu Wang,Xiaofeng Lin,Peng Lu,Chunhe Wang,Changlun Li,Hanwei Wu,Nan Tang,Elisa Kreiss,Guang Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal reasoning has emerged as a powerful framework for enhancing reasoning capabilities of reasoning models. While multi-turn table reasoning methods have improved reasoning accuracy through tool use and reward modeling, they rely on fixed text serialization for table state readouts. This introduces representation errors in table encoding that significantly accumulate over multiple turns. Such accumulation is alleviated by tabular grounding methods in the expense of inference compute and cost, rendering real world deployment impractical. To address this, we introduce TABQAWORLD, a table reasoning framework that jointly optimizes tabular action through representation and estimation. For representation, TABQAWORLD employs an action-conditioned multimodal selection policy, which dynamically switches between visual and textual representations to maximize table state readout reliability. For estimation, TABQAWORLD optimizes stepwise reasoning trajectory through table metadata including dimension, data types and key values, safely planning trajectory and compressing low-complexity actions to reduce conversation turns and latency. Designed as a training-free framework, empirical evaluations show that TABQAWORLD achieves state-of-the-art performance with 4.87% accuracy improvements over baselines, with 5.42% accuracy gain and 33.35% inference latency reduction over static settings, establishing a new standard for reliable and efficient table reasoning.
[AI-131] Hume's Representational Conditions for Causal Judgment: What Bayesian Formalization Abstracted Away
【速读】:该论文旨在解决如何在现代认知科学与人工智能框架中重新理解休谟(Hume)关于因果判断的心理学机制问题。其核心问题是:休谟所提出的因果判断的三个表征条件——经验基础(ideas must trace to impressions)、结构化检索(structured retrieval)和生动性传递(vivacity transfer)——在从休谟到贝叶斯认识论和预测加工(predictive processing)的发展过程中是如何被保留或抽象化的。解决方案的关键在于,通过系统梳理休谟文本中的这些条件,并指出它们在后续理论中虽被形式化为概率更新机制,但其深层表征要求(如经验根基、结构化关联与心理确信的生成)却逐渐隐含于背景假设之中;而当代大语言模型(large language models)作为典型案例,展示了仅具备统计更新能力但缺乏上述三重表征条件的局限性,从而揭示了传统因果认知模型中那些被长期忽视的核心约束。
链接: https://arxiv.org/abs/2604.03387
作者: Yiling Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Hume’s account of causal judgment presupposes three representational conditions: experiential grounding (ideas must trace to impressions), structured retrieval (association must operate through organized networks exceeding pairwise connection), and vivacity transfer (inference must produce felt conviction, not merely updated probability). This paper extracts these conditions from Hume’s texts and argues that they are integral to his causal psychology. It then traces their fate through the formalization trajectory from Hume to Bayesian epistemology and predictive processing, showing that later frameworks preserve the updating structure of Hume’s insight while abstracting away these further representational conditions. Large language models serve as an illustrative contemporary case: they exhibit a form of statistical updating without satisfying the three conditions, thereby making visible requirements that were previously background assumptions in Hume’s framework.
[AI-132] Evaluating Artificial Intelligence Through a Christian Understanding of Human Flourishing
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在价值对齐(value alignment)方面存在的根本性问题,即AI系统并非世界观中立,而是隐含地默认一种“程序性世俗主义”(Procedural Secularism),导致其在体现基督教视角下的人类繁荣(human flourishing)维度上出现系统性性能下降。解决方案的关键在于提出并应用“繁荣型人工智能基准:基督教单轮问答版”(Flourishing AI Benchmark: Christian Single-Turn, FAI-C-ST),该框架通过七个维度量化评估前沿模型响应与基督教伦理框架的一致性,从而揭示模型训练目标偏向广泛可接受性和安全性的内在缺陷,并证明价值对齐差距源于训练范式而非技术限制。
链接: https://arxiv.org/abs/2604.03356
作者: Nicholas Skytland,Lauren Parsons,Alicia Llewellyn,Steele Billings,Peter Larson,John Anderson,Sean Boisen,Steve Runge
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Artificial intelligence (AI) alignment is fundamentally a formation problem, not only a safety problem. As Large Language Models (LLMs) increasingly mediate moral deliberation and spiritual inquiry, they do more than provide information; they function as instruments of digital catechesis, actively shaping and ordering human understanding, decision-making, and moral reflection. To make this formative influence visible and measurable, we introduce the Flourishing AI Benchmark: Christian Single-Turn (FAI-C-ST), a framework designed to evaluate Frontier Model responses against a Christian understanding of human flourishing across seven dimensions. By comparing 20 Frontier Models against both pluralistic and Christian-specific criteria, we show that current AI systems are not worldview-neutral. Instead, they default to a Procedural Secularism that lacks the grounding necessary to sustain theological coherence, resulting in a systematic performance decline of approximately 17 points across all dimensions of flourishing. Most critically, there is a 31-point decline in the Faith and Spirituality dimension. These findings suggest that the performance gap in values alignment is not a technical limitation, but arises from training objectives that prioritize broad acceptability and safety over deep, internally coherent moral or theological reasoning.
[AI-133] From Model-Based Screening to Data-Driven Surrogates: A Multi-Stage Workflow for Exploring Stochastic Agent-Based Models
【速读】:该论文旨在解决基于代理的模型(Agent-Based Models, ABMs)在系统性探索中面临的维度诅咒(curse of dimensionality)和固有随机性(inherent stochasticity)问题。其解决方案的关键在于构建一个两阶段流水线:首先通过自动化模型驱动的筛选识别主导变量、评估结果变异性并分割参数空间;其次利用机器学习代理模型(machine learning surrogates)捕捉剩余非线性交互效应,从而自动发现系统行为高度依赖于多变量非线性相互作用的不稳定区域。该方法为高维随机模拟器提供了严谨且无需人工干预的敏感性分析与政策测试框架。
链接: https://arxiv.org/abs/2604.03350
作者: Paul Saves,Matthieu Mastio,Nicolas Verstaevel,Benoit Gaudou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published in MABS 2026 - The 27th International Workshop on Multi-Agent-Based Simulation
Abstract:Systematic exploration of Agent-Based Models (ABMs) is challenged by the curse of dimensionality and their inherent stochasticity. We present a multi-stage pipeline integrating the systematic design of experiments with machine learning surrogates. Using a predator-prey case study, our methodology proceeds in two steps. First, an automated model-based screening identifies dominant variables, assesses outcome variability, and segments the parameter space. Second, we train Machine Learning models to map the remaining nonlinear interaction effects. This approach automates the discovery of unstable regions where system outcomes are highly dependent on nonlinear interactions between many variables. Thus, this work provides modelers with a rigorous, hands-off framework for sensitivity analysis and policy testing, even when dealing with high-dimensional stochastic simulators.
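The two-stage idea can be illustrated with a minimal sketch (not the paper's predator-prey pipeline): a toy stochastic simulator stands in for an ABM, stage one screens parameters by their main-effect correlation with the outcome, and stage two fits a cheap surrogate with pairwise interaction features to recover the nonlinear structure screening misses. All variable names and the simulator are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stochastic simulator standing in for an ABM run: one strong main
# effect (x0), one weak one (x2), a pure interaction (x0*x1), plus noise.
def simulate(x):
    return 3.0 * x[0] + 2.0 * x[0] * x[1] + 0.3 * x[2] + rng.normal(0, 0.05)

X = rng.uniform(-1, 1, size=(500, 3))
y = np.array([simulate(x) for x in X])

# Stage 1: screening -- rank main effects by |correlation| with the outcome.
# The interaction-only variable x1 scores near zero here; recovering such
# effects is precisely what the surrogate stage is for.
scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(3)])

# Stage 2: a surrogate with pairwise interaction features captures the
# remaining nonlinear structure that screening alone cannot see.
i, j = np.triu_indices(3, k=1)
Phi = np.column_stack([np.ones(len(X)), X, X[:, i] * X[:, j]])
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
r2 = 1 - np.sum((y - Phi @ coef) ** 2) / np.sum((y - y.mean()) ** 2)
print(int(np.argmax(scores)), round(float(r2), 3))
```

On this toy problem screening correctly ranks x0 as dominant while the interaction term only becomes visible once the surrogate is fitted, mirroring the paper's division of labor between the two stages.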
[AI-134] Towards Intelligent Energy Security: A Unified Spatio-Temporal and Graph Learning Framework for Scalable Electricity Theft Detection in Smart Grids
【速读】:该论文旨在解决智能电网中电力窃取与非技术性损耗(Non-Technical Losses, NTLs)问题,此类问题导致显著经济损失并影响电网可靠性。解决方案的关键在于提出一种集成人工智能框架——SmartGuard Energy Intelligence System (SGEIS),其核心创新在于融合监督学习、深度学习时间序列建模、非侵入式负载监测(NILM)和图神经网络(GNN),从而同时捕捉用电行为的时序特征与空间拓扑依赖关系。通过多尺度特征工程、规则驱动的异常标注及混合模型架构(如LSTM、TCN、Autoencoder用于异常检测,Gradient Boosting/XGBoost/LightGBM用于分类,GNN用于节点关联分析),系统实现了高精度检测(ROC-AUC达0.894,高风险节点识别准确率超96%)与可解释性增强,为实际部署提供了鲁棒性强、可扩展的智能监测方案。
链接: https://arxiv.org/abs/2604.03344
作者: AbdulQoyum A. Olowookere,Usman A. Oguntola,Ebenezer. Leke Odekanle,Maridiyah A. Madehin,Aisha A. Adesope
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 26 pages, 9 figures
Abstract:Electricity theft and non-technical losses (NTLs) remain critical challenges in modern smart grids, causing significant economic losses and compromising grid reliability. This study introduces the SmartGuard Energy Intelligence System (SGEIS), an integrated artificial intelligence framework for electricity theft detection and intelligent energy monitoring. The proposed system combines supervised machine learning, deep learning-based time-series modeling, Non-Intrusive Load Monitoring (NILM), and graph-based learning to capture both temporal and spatial consumption patterns. A comprehensive data processing pipeline is developed, incorporating feature engineering, multi-scale temporal analysis, and rule-based anomaly labeling. Deep learning models, including Long Short-Term Memory (LSTM), Temporal Convolutional Networks (TCN), and Autoencoders, are employed to detect abnormal usage patterns. In parallel, ensemble learning methods such as Random Forest, Gradient Boosting, XGBoost, and LightGBM are utilized for classification. To model grid topology and spatial dependencies, Graph Neural Networks (GNNs) are applied to identify correlated anomalies across interconnected nodes. The NILM module enhances interpretability by disaggregating appliance-level consumption from aggregate signals. Experimental results demonstrate strong performance, with Gradient Boosting achieving a ROC-AUC of 0.894, while graph-based models attain over 96% accuracy in identifying high-risk nodes. The hybrid framework improves detection robustness by integrating temporal, statistical, and spatial intelligence. Overall, SGEIS provides a scalable and practical solution for electricity theft detection, offering high accuracy, improved interpretability, and strong potential for real-world smart grid deployment.
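The rule-based anomaly labeling step can be sketched on synthetic smart-meter data. This is a hypothetical rule set (sustained drop versus a trailing-window baseline, plus an absolute floor), not the SGEIS implementation; window size and thresholds are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# 90 days of daily consumption (kWh); after day 60 a meter is tampered
# with, so recorded usage drops sharply. (Synthetic stand-in data.)
usage = np.concatenate([rng.normal(30, 3, 60), rng.normal(8, 2, 30)])

def label_anomalies(series, window=14, drop_ratio=0.5, floor_kwh=1.0):
    """Hypothetical rule set: flag days whose usage falls below a fraction
    of the trailing-window median, or below an absolute floor."""
    labels = np.zeros(len(series), dtype=bool)
    for t in range(window, len(series)):
        baseline = np.median(series[t - window:t])
        labels[t] = series[t] < drop_ratio * baseline or series[t] < floor_kwh
    return labels

labels = label_anomalies(usage)
print(int(labels[:60].sum()), int(labels[60:].sum()))
```

Labels produced this way would serve as weak supervision for the downstream classifiers; note the sliding baseline eventually adapts to the lower level, which is why the paper pairs such rules with learned temporal models.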
[AI-135] Composer Vector: Style-steering Symbolic Music Generation in a Latent Space
【速读】:该论文旨在解决符号化音乐生成中对作曲家风格进行细粒度和灵活控制的难题。现有基于训练的方法依赖大规模标注数据集,且通常仅支持单一作曲家风格生成,难以满足混合或创意场景的需求。解决方案的关键在于提出“作曲家向量”(Composer Vector),这是一种在推理阶段直接作用于模型潜在空间的导向机制,无需重新训练即可实现作曲家风格控制;通过连续的导向系数,该方法能平滑、可解释地引导生成结果趋向目标作曲家风格,并支持在统一潜在空间框架内无缝融合多种风格,从而为可控符号化音乐生成提供了一种通用且实用的新范式。
链接: https://arxiv.org/abs/2604.03333
作者: Xunyi Jiang,Mingyang Yao,Jingyue Huang,Julian McAuley
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Symbolic music generation has made significant progress, yet achieving fine-grained and flexible control over composer style remains challenging. Existing training-based methods for composer style conditioning depend on large labeled datasets. Besides, these methods typically support only single-composer generation at a time, limiting their applicability to more creative or blended scenarios. In this work, we propose Composer Vector, an inference-time steering method that operates directly in the model’s latent space to control composer style without retraining. Through experiments on multiple symbolic music generation models, we show that Composer Vector effectively guides generations toward target composer styles, enabling smooth and interpretable control through a continuous steering coefficient. It also enables seamless fusion of multiple styles within a unified latent space framework. Overall, our work demonstrates that simple latent space steering provides a practical and general mechanism for controllable symbolic music generation, enabling more flexible and interactive creative workflows. Code and Demo are available here: this https URL and this https URL
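The latent-space steering mechanism can be sketched in a few lines, assuming (as a simplification of the paper's setting) that a composer vector is the difference between the target composer's mean latent state and the overall mean; the 2-D latents below are synthetic stand-ins for a music model's hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in latent states: two clusters for two "composer styles".
style_a = rng.normal([+2.0, 0.0], 0.3, size=(100, 2))
style_b = rng.normal([-2.0, 0.0], 0.3, size=(100, 2))

# A composer vector: target-style mean minus the overall latent mean.
mean_all = np.vstack([style_a, style_b]).mean(axis=0)
v_a = style_a.mean(axis=0) - mean_all
v_b = style_b.mean(axis=0) - mean_all

def steer(h, vec, alpha):
    """Shift a latent state toward a style; alpha is the continuous
    steering coefficient (0 = untouched, larger = stronger style)."""
    return h + alpha * vec

h = mean_all.copy()                      # a neutral starting latent
for alpha in (0.0, 0.5, 1.0):
    d = np.linalg.norm(steer(h, v_a, alpha) - style_a.mean(axis=0))
    print(alpha, round(float(d), 3))

# Style fusion: blend two composer vectors in the same latent space.
h_mix = steer(steer(h, v_a, 0.5), v_b, 0.5)
```

The distance to the target-style centroid shrinks smoothly as alpha grows, which is the interpretable, continuous control the paper describes; adding two vectors with partial coefficients gives the blended-style case.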
[AI-136] AICCE: AI Driven Compliance Checker Engine
【速读】:该论文旨在解决当前基于规则的通信协议合规性验证系统在面对IPv6流量中隐蔽或复杂非合规行为时失效的问题,此类问题常被攻击者利用以建立隐蔽通信通道。解决方案的关键在于提出一种名为AICCE(Artificial Intelligence Driven Compliance Checker Engine)的生成式合规检查引擎,其核心创新是融合双架构推理与检索增强生成(Retrieval-Augmented Generation, RAG)技术:通过将协议标准语义编码至高维向量空间实现精准片段检索,并构建两种互补流水线——解释模式(Explainability Mode)利用并行大语言模型(Large Language Model, LLM)代理进行结构化讨论以提升决策可解释性与鲁棒性,脚本执行模式(Script Execution Mode)将条款转化为Python规则用于高效批量验证,从而在十六个前沿生成模型的IPv6数据集上实现高达99%的准确率和F1分数,显著优于传统规则系统。
链接: https://arxiv.org/abs/2604.03330
作者: Mohammad Wali Ur Rahman,Martin Manuel Lopez,Lamia Tasnim Mim,Carter Farthing,Julius Battle,Kathryn Buckley,Salim Hariri
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted for publication in IEEE Transactions on Artificial Intelligence
Abstract:For digital infrastructure to be safe, compatible, and standards-aligned, automated communication protocol compliance verification is crucial. Nevertheless, current rule-based systems are becoming less and less effective since they are unable to identify subtle or intricate non-compliance, which attackers frequently use to establish covert communication channels in IPv6 traffic. In order to automate IPv6 compliance verification, this paper presents the Artificial Intelligence Driven Compliance Checker Engine (AICCE), a novel generative system that combines dual-architecture reasoning and retrieval-augmented generation (RAG). Specification segments pertinent to each query can be efficiently retrieved thanks to the semantic encoding of protocol standards into a high-dimensional vector space. Based on this framework, AICCE offers two complementary pipelines: (i) Explainability Mode, which uses parallel LLM agents to render decisions and settle disputes through organized discussions to improve interpretability and robustness, and (ii) Script Execution Mode, which converts clauses into Python rules that can be executed quickly for dataset-wide verification. With the debate mechanism enhancing decision reliability in complicated scenarios and the script-based pipeline lowering per-sample latency, AICCE achieves accuracy and F1-scores of up to 99% when tested on IPv6 packet samples across sixteen cutting-edge generative models. By offering a scalable, auditable, and generalizable mechanism for identifying both routine and covert non-compliance in dynamic communication environments, our results show that AICCE overcomes the blind spots of conventional rule-based compliance checking systems.
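The retrieval core of a RAG pipeline over specification segments can be sketched as follows. The clauses are hypothetical RFC-style text and the bag-of-words vectors are a stand-in for the learned high-dimensional embeddings AICCE would use; only the cosine-similarity retrieval step is illustrated.

```python
import numpy as np

# Toy "specification segments" (hypothetical RFC-style clauses).
clauses = [
    "The Hop Limit field MUST be decremented by one at each forwarding node",
    "A node MUST silently discard packets with an unrecognized routing type",
    "The Payload Length field carries the length of the payload in octets",
]

def embed(text, vocab):
    # Bag-of-words stand-in for a learned semantic embedding.
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

vocab = sorted({w for c in clauses for w in c.lower().split()})
index = np.array([embed(c, vocab) for c in clauses])
index /= np.linalg.norm(index, axis=1, keepdims=True)

def retrieve(query, k=1):
    q = embed(query, vocab)
    q /= np.linalg.norm(q) or 1.0
    sims = index @ q                     # cosine similarity to each clause
    return [clauses[i] for i in np.argsort(sims)[::-1][:k]]

print(retrieve("which field stores the payload length in octets"))
```

The retrieved clause would then be handed to either the debate-style Explainability Mode or compiled into an executable rule in Script Execution Mode.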
[AI-137] General Explicit Network (GEN): A novel deep learning architecture for solving partial differential equations
【速读】:该论文旨在解决物理信息神经网络(Physics-Informed Neural Networks, PINNs)在实际应用中面临的局限性问题,尤其是其仅依赖离散点到点拟合而忽视真实解潜在性质、以及采用连续激活函数导致局部特性匹配但泛化能力与鲁棒性较差的问题。解决方案的关键在于提出一种通用显式网络(General Explicit Network, GEN),该方法实现了从点到函数的偏微分方程(Partial Differential Equations, PDEs)求解范式转变,通过基于先验知识构建的基函数显式构造“函数”组件进行拟合,从而显著提升解的鲁棒性和可扩展性。
链接: https://arxiv.org/abs/2604.03321
作者: Genwei Ma,Ting Luo,Ping Yang,Xing Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP); Medical Physics (physics.med-ph)
备注:
Abstract:Machine learning, especially physics-informed neural networks (PINNs) and their neural network variants, has been widely used to solve problems involving partial differential equations (PDEs). The successful deployment of such methods beyond academic research remains limited. For example, PINN methods primarily consider discrete point-to-point fitting and fail to account for the potential properties of real solutions. The adoption of continuous activation functions in these approaches leads to local characteristics that align with the equation solutions while resulting in poor extensibility and robustness. A general explicit network (GEN) that implements point-to-function PDE solving is proposed in this paper. The “function” component can be constructed based on our prior knowledge of the original PDEs through corresponding basis functions for fitting. The experimental results demonstrate that this approach enables solutions with high robustness and strong extensibility to be obtained.
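The point-to-function idea can be made concrete on a model problem, assuming (purely for illustration, not as the paper's GEN architecture) that prior knowledge of homogeneous boundary conditions suggests a sine basis whose coefficients are fitted by least squares on the PDE residual.

```python
import numpy as np

# Model problem: u''(x) = f(x) on (0, 1) with u(0) = u(1) = 0,
# where f(x) = -pi^2 sin(pi x), so the exact solution is u(x) = sin(pi x).
# Prior knowledge of the boundary conditions suggests the explicit basis
# phi_k(x) = sin(k pi x), with phi_k''(x) = -(k pi)^2 phi_k(x).
K = 5
xs = np.linspace(0.01, 0.99, 50)                 # collocation points
f = -np.pi**2 * np.sin(np.pi * xs)

Phi_dd = np.column_stack(
    [-(k * np.pi) ** 2 * np.sin(k * np.pi * xs) for k in range(1, K + 1)]
)
coef, *_ = np.linalg.lstsq(Phi_dd, f, rcond=None)

# Reconstruct the explicit solution from the fitted basis coefficients.
def u(x):
    return sum(c * np.sin(k * np.pi * x) for k, c in enumerate(coef, start=1))

err = np.max(np.abs(u(xs) - np.sin(np.pi * xs)))
print(coef.round(4), float(err))
```

Because the fitted object is a function (a basis expansion) rather than a set of point values, it can be evaluated anywhere in the domain, which is the extensibility property the abstract emphasizes.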
[AI-138] RAGnaroX: A Secure Local-Hosted ChatOps Assistant Using Small Language Models
【速读】:该论文旨在解决当前ChatOps助手依赖外部云服务(如Azure或OpenAI)所带来的资源消耗高、缺乏可审计性及部署灵活性差的问题。其解决方案的关键在于设计并实现了一个完全基于通用硬件运行的轻量级系统RAGnaroX,采用Rust语言开发以保障性能与安全性,核心架构包含模块化数据摄入、混合检索机制和函数调用能力,从而在保证问答准确率的同时显著提升资源效率,例如在SQuAD数据集上实现0.90的上下文精确度且平均响应时间仅为2.5秒/请求。
链接: https://arxiv.org/abs/2604.03291
作者: Benedikt Dornauer,Mircea-Cristian Racasan
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper introduces RAGnaroX, a resource-efficient ChatOps assistant that operates entirely on commodity hardware. Unlike existing solutions that often rely on external providers such as Azure or OpenAI, RAGnaroX offers a fully auditable, on-premise stack implemented in Rust. Its architecture integrates modular data ingestion, hybrid retrieval, and function calling, enabling flexible yet secure deployment. Our evaluation focuses on the RAG pipeline, with benchmarks conducted on the SQuAD (single-hop QA), MultiHopRAG (multi-hop QA), and MLQA (cross-lingual QA) datasets. Results show that RAGnaroX achieves competitive accuracy while maintaining strong resource efficiency, for example, reaching 0.90 context precision on single-hop questions with an average response time of 2.5 seconds per request. A replication package containing the tool, the demonstration video (this https URL v=cDxfuEbcoM4), and all supporting materials is available at this https URL.
[AI-139] Customized User Plane Processing via Code Generating AI Agents for Next Generation Mobile Networks
【速读】:该论文旨在解决6G及未来移动网络中如何通过生成式AI(Generative AI)按需定制网络处理功能块的问题,以提升网络的自动化、灵活性和适应性。其关键解决方案在于利用AI代理(AI agents)根据文本形式的服务请求自动生成处理用户面流量的功能代码模块,从而实现对协议数据单元(PDU)的解析与指定操作执行。研究重点评估了模型选择、提示工程(prompt design)以及代码模板提供等因素对生成准确性的影响,结果表明在适当条件下,AI代理能够可靠地生成满足需求的定制化处理逻辑,为网络按需扩展新能力提供了可行路径。
链接: https://arxiv.org/abs/2604.03282
作者: Xiaowen Ma,Onur Ayan,Yunpu Ma,Xueli An
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注:
Abstract:Generative AI is envisioned to have a crucial impact on next generation mobile networking, making the sixth generation (6G) system considerably more autonomous, flexible, and adaptive than its predecessors. By leveraging their natural language processing and code generation capabilities, AI agents enable novel interactions and services between networks and vertical applications. A particularly promising and interesting use case is the customization of connectivity services for vertical applications by generating new customized processing blocks based on text-based service requests. More specifically, AI agents are able to generate code for a new function block that handles user plane traffic, allowing it to inspect and decode a protocol data unit (PDU) and perform specified actions as requested by the application. In this study, we investigate the code generation problem for generating such customized processing blocks on-demand. We evaluate various factors affecting the accuracy of the code generation process in this context, including model selection, prompt design, and the provision of a code template for the agent to utilize. Our findings indicate that AI agents are capable of generating such blocks with the desired behavior on-demand under suitable conditions. We believe that exploring the code generation for network-specific tasks is a very interesting problem for 6G and beyond, enabling networks to achieve a new level of customization by generating new capabilities on-demand.
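A hypothetical example of the kind of user-plane function block such an agent could generate is sketched below: a simplified, made-up header layout (not a real IPv6 extension header) is decoded and a requested action is applied. The field names and the drop rule are illustrative assumptions.

```python
import struct

# Hypothetical generated function block: parse a simplified IPv6-like PDU
# header and act on it as requested by the vertical application.
HDR = struct.Struct("!HBB")  # payload_length (16b), next_header (8b), hop_limit (8b)

def handle_pdu(pdu: bytes) -> dict:
    """Inspect a PDU, decode its header fields, and apply the requested
    action: drop anything whose hop limit has reached zero."""
    payload_length, next_header, hop_limit = HDR.unpack_from(pdu, 0)
    action = "drop" if hop_limit == 0 else "forward"
    return {
        "payload_length": payload_length,
        "next_header": next_header,
        "hop_limit": hop_limit,
        "action": action,
    }

pdu = HDR.pack(1280, 17, 64) + b"\x00" * 1280   # 17 = UDP next-header value
print(handle_pdu(pdu)["action"])                 # forward
```

In the paper's setting, the agent would emit code like this from a text-based service request, possibly starting from a provided code template, which the study finds improves generation accuracy.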
[AI-140] Safe Decentralized Operation of EV Virtual Power Plant with Limited Network Visibility via Multi-Agent Reinforcement Learning
【速读】:该论文旨在解决虚拟电厂(VPP)在信息受限条件下协调多个电动汽车充电站(EVCS)时面临的电压安全与经济运行之间的矛盾问题。现实中,VPP仅能获取配电网络(PDN)的局部、聚合状态信息,难以实现精确的电压控制。解决方案的关键在于提出一种基于Transformer辅助的拉格朗日多智能体近端策略优化(TL-MAPPO)框架:通过中心化训练结合拉格朗日正则化约束来确保电压和负荷满足条件,同时利用Transformer嵌入层捕捉电价、负荷与充电需求的时间相关性,从而提升各EVCS代理的去中心化决策质量。实验表明,该方法在真实33节点配电系统中可减少约45%的电压越限事件并降低约10%的运营成本,显著优于现有主流多智能体深度强化学习基线。
链接: https://arxiv.org/abs/2604.03278
作者: Chenghao Huang,Jiarong Fan,Weiqing Wang,Hao Wang
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: The 2026 IEEE Power Energy Society General Meeting, 7 pages including appendix
Abstract:As power systems advance toward net-zero targets, behind-the-meter renewables are driving rapid growth in distributed energy resources (DERs). Virtual power plants (VPPs) increasingly coordinate these resources to support power distribution network (PDN) operation, with EV charging stations (EVCSs) emerging as a key asset due to their strong impact on local voltages. However, in practice, VPPs must make operational decisions with only partial visibility of PDN states, relying on limited, aggregated information shared by the distribution system operator. This work proposes a safety-enhanced VPP framework for coordinating multiple EVCSs under such realistic information constraints to ensure voltage security while maintaining economic operation. We develop Transformer-assisted Lagrangian Multi-Agent Proximal Policy Optimization (TL-MAPPO), in which EVCS agents learn decentralized charging policies via centralized training with Lagrangian regularization to enforce voltage and demand-satisfaction constraints. A transformer-based embedding layer deployed on each EVCS agent captures temporal correlations among prices, loads, and charging demand to improve decision quality. Experiments on a realistic 33-bus PDN show that the proposed framework reduces voltage violations by approximately 45% and operational costs by approximately 10% compared to representative multi-agent DRL baselines, highlighting its potential for practical VPP deployment.
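The Lagrangian-regularized training idea can be reduced to a minimal illustration (not the paper's full TL-MAPPO loop): a one-variable "charging policy" maximizes reward subject to a voltage-style constraint, with a dual multiplier raised whenever the constraint is violated. The reward and constraint functions are illustrative assumptions.

```python
import numpy as np

# Reward r(p) = 4p - p^2 (concave "profit"), constraint c(p) = p <= 1
# (a stand-in for a voltage limit), handled via Lagrangian dual ascent.
eta, lam, d = 0.1, 0.0, 1.0
for _ in range(200):
    # Policy step: best response to the penalized objective r(p) - lam*c(p),
    # i.e. argmax of 4p - p^2 - lam*p, clipped to the feasible action range.
    p = np.clip((4.0 - lam) / 2.0, 0.0, 2.0)
    # Dual step: raise the multiplier while the constraint is violated,
    # keeping it nonnegative (projected gradient ascent on the dual).
    lam = max(0.0, lam + eta * (p - d))
print(round(float(p), 3), round(float(lam), 3))
```

The iteration settles at p = 1 with multiplier lambda = 2, i.e. the policy learns to operate exactly at the constraint boundary; in the paper this dual update is folded into the centralized multi-agent PPO training.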
[AI-141] AI Governance Control Stack for Operational Stability: Achieving Hardened Governance in AI Systems
【速读】:该论文旨在解决当前人工智能(AI)治理实践中普遍存在的问题:即多数治理方法仅停留在政策指导层面,缺乏确保AI系统在高风险决策环境中长期保持可靠、可审计和可问责行为的运行稳定性机制。随着AI部署规模扩大,组织亟需一套能够贯穿AI生命周期、持续保障治理完整性的架构。其解决方案的关键在于提出“AI治理控制栈(AI Governance Control Stack for Operational Stability)”,该架构由六个互补的治理层级构成——系统记录版本治理、基于证据的验证、决策时刻可解释性日志、遥测监控、模型漂移检测与治理升级机制——通过将可解释性基础设施、持续监控与人工监督相结合,构建了一个结构化且可操作的治理体系,从而实现对AI行为的追踪、风险响应与合规保障,同时与欧盟《人工智能法案》(EU AI Act)、ISO/IEC 42001人工智能管理系统及NIST AI风险管理框架等新兴法规标准高度对齐。
链接: https://arxiv.org/abs/2604.03262
作者: Horatio Morgan
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures, 1 table. Research paper introducing an operational AI governance architecture aligned with emerging regulatory frameworks including the EU AI Act, ISO/IEC 42001, and the NIST AI Risk Management Framework
Abstract:Artificial intelligence systems are increasingly embedded in high-stakes decision environments, yet many governance approaches focus primarily on policy guidance rather than operational stability mechanisms. As AI deployments scale, organizations require governance architectures capable of maintaining reliable, auditable, and accountable behavior over time. This paper introduces the AI Governance Control Stack for Operational Stability, a layered governance architecture designed to ensure traceable and resilient AI system behavior. The proposed control stack integrates six complementary governance layers: system-of-record version governance, evidence-based verification, decision-time explainability logging, telemetry monitoring, model drift detection, and governance escalation. Together, these layers provide a structured mechanism for preserving governance integrity across the AI lifecycle while enabling organizations to detect instability, respond to emerging risks, and maintain regulatory accountability. The architecture aligns operational governance practices with emerging regulatory and standards frameworks, including the EU AI Act, ISO/IEC 42001 Artificial Intelligence Management Systems, and the NIST AI Risk Management Framework. By combining explainability infrastructure with continuous monitoring and human oversight mechanisms, the governance control stack provides a practical blueprint for achieving hardened AI governance in complex enterprise environments. The paper contributes a conceptual governance architecture and a framework alignment analysis demonstrating how operational stability mechanisms can strengthen responsible AI implementation. The findings suggest that organizations must move beyond static policy frameworks toward integrated governance control systems capable of sustaining trustworthy AI operation in dynamic environments.
[AI-142] Is your AI Model Accurate Enough? The Difficult Choices Behind Rigorous AI Development and the EU AI Act
【速读】:该论文旨在解决当前AI治理中一个核心争议问题:即“准确性”是否如技术与法律讨论所普遍假设的那样,是一个客观、可测量且纯粹技术性的属性。作者通过法律-技术交叉分析指出,AI性能评估本质上依赖于情境相关的规范性决策(techno-normative choices),这些选择决定了哪些错误被优先关注、风险如何分配以及多目标间的权衡如何取舍。论文以2024年欧盟《人工智能法案》(AI Act)中对高风险系统要求“适当准确度”的条款为案例,提炼出四个关键决策维度:(1)指标选取、(2)多指标平衡、(3)代表性数据上的指标测量、(4)接受阈值设定,并揭示每个维度的技术实现均隐含或明确承载了关于可接受风险、误差和权衡的规范立场。其解决方案的关键在于将这些“技术-规范”交织的维度显性化,从而为监管者、审计人员和开发者提供可操作的框架,推动法律安全要求向技术实践的有效转化,促进更具透明性和责任性的AI部署。
链接: https://arxiv.org/abs/2604.03254
作者: Lucas G. Uberti-Bona Marin,Bram Rijsbosch,Kristof Meding,Gerasimos Spanakis,Gijs van Dijck,Konrad Kollnig
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: To appear in the 2026 ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT '26)
Abstract:Technical and legal debates frequently suggest that “accuracy” is an objective, measurable, and purely technical property. We challenge this view, showing that evaluating AI performance fundamentally depends on context-dependent normative decisions. These techno-normative choices are crucial for rigorous AI deployment, as they determine which errors are prioritised, how risks are distributed, and how trade-offs between competing objectives are resolved. This paper provides a legal-technical analysis of the choices that shape how accuracy is defined, measured, and assessed, using the 2024 European Union AI Act – which mandates an “appropriate level of accuracy” for high-risk systems – as a primary case study. We identify and analyse four choices central to any robust performance evaluation: (1) selecting metrics, (2) balancing multiple metrics, (3) measuring metrics against representative data, and (4) determining acceptance thresholds. For each choice, we study its relationship to the AI Act’s accuracy requirement and associated documentation obligations, show how its technical implementation embeds implicit or explicit assumptions about acceptable risks, errors, and trade-offs, and discuss the implications for the practical implementation of the AI Act by examples and related technical standards. By making the techno-normative dimensions of accuracy explicit, this paper contributes to broader interdisciplinary debates on AI governance and regulation, and offers specific guidance for regulators, auditors, and developers tasked with translating (legal) safety requirements into technical practice.
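The four choices the paper identifies become concrete knobs in any evaluation script: which metric to compute, how to balance competing ones, what data to measure on, and where to set the acceptance threshold. The sketch below, on synthetic classifier scores (an assumption, not the paper's data), shows how the threshold choice alone redistributes errors between missed detections and false alarms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic scores for a binary high-risk classifier (representative-data
# choice is simulated here by a fixed sampling distribution).
y = rng.integers(0, 2, 1000)
scores = np.clip(y * 0.3 + rng.normal(0.4, 0.2, 1000), 0, 1)

def metrics(threshold):
    pred = scores >= threshold
    tp = np.sum(pred & (y == 1))
    fp = np.sum(pred & (y == 0))
    fn = np.sum(~pred & (y == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# The acceptance threshold embeds a normative trade-off: sweeping it
# trades recall (missed detections) against precision (false alarms).
thresholds = np.linspace(0.1, 0.9, 81)
best_t = max(thresholds, key=lambda t: metrics(t)[2])
print(round(float(best_t), 2), [round(float(m), 3) for m in metrics(best_t)])
```

Maximizing F1 is itself a balancing choice; a regulator concerned mainly with missed high-risk cases might instead fix a minimum recall and accept the resulting precision, which is exactly the kind of decision the paper argues must be made explicit and documented.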
[AI-143] Personalized AI Practice Replicates Learning Rate Regularity at Scale
【速读】:该论文旨在解决如何在大规模个性化学习中有效测量和优化学生的学习速率问题,特别是克服传统方法对人工认知建模的高度依赖。其解决方案的关键在于利用一个全自动的数字学习平台Campus AI,该平台能够自动生成知识组件(Knowledge Components, KCs)及其对应的练习题,并通过专家验证确保质量;这种“一对一多”的映射结构使得可以应用加性因子模型(Additive Factors Models)来无须复杂认知建模即可准确估计学习参数。研究结果表明,尽管学生初始知识水平存在显著差异,但其学习速率高度一致,且使用该自动化系统的学生达到80%掌握度所需的练习次数(中位数为7.22次)与专家设计课程相当(6.54次),证明了基于科学原理的自动内容生成可支持高效、可扩展的个性化学习。
链接: https://arxiv.org/abs/2604.03246
作者: Jocelyn Beauchesne,Christine Maroti,Jeshua Bratman,Jerome Pesenti,Laurence Holt,Alex Tambellini,Allison McGrath,Matthew Guo,Sarah Peterson
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent research demonstrated that students exhibit consistent learning rates across diverse educational contexts. We test these findings using a dataset of 1.8 million (366k post-filtering) student interactions from the digital platform Campus AI providing further evidence to the observation of regularity in learning rate among students. Unlike prior work requiring manual cognitive modeling, Campus AI automatically generates Knowledge Components (KCs) and corresponding exercises, both of which are validated by human experts. This one-to-many mapping facilitates the application of Additive Factors Models to measure learning parameters without complex cognitive modeling. Using mixed-effects logistic regression, we confirmed the core finding of prior work: students displayed substantial variation in initial knowledge (IQR = [2.78, 12.18] practice opportunities to reach 80% mastery) but remarkably consistent learning rates (IQR = [7.01, 8.25] opportunities). Furthermore, students using this fully automated system achieved 80% mastery in a median of 7.22 practice opportunities, comparable to the 6.54 reported for expert-designed curricula. These results suggest that automated, science-grounded content generation can support effective personalized learning at scale. Data and code are publicly available. this https URL
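The Additive Factors Model behind these estimates can be sketched directly: the log-odds of a correct response combine a student intercept (initial knowledge), a knowledge-component intercept, and a per-opportunity learning-rate slope. The parameter values below are illustrative assumptions chosen to mirror the paper's qualitative finding, not its fitted estimates.

```python
import math

# AFM sketch: logit(p) = theta_student + beta_kc + gamma_kc * T,
# where T is the number of prior practice opportunities on the KC and
# gamma_kc is the learning rate found to be consistent across students.

def p_correct(theta, beta, gamma, T):
    return 1.0 / (1.0 + math.exp(-(theta + beta + gamma * T)))

def opportunities_to_mastery(theta, beta, gamma, target=0.8):
    """Smallest T with p(correct) >= target; closed form from the logit."""
    need = math.log(target / (1 - target)) - theta - beta
    return max(0, math.ceil(need / gamma))

# Two students with very different initial knowledge (theta) but the
# same learning rate (gamma), mirroring the paper's IQR contrast.
weak, strong = -1.5, 0.5
beta, gamma = 0.2, 0.4
print(opportunities_to_mastery(weak, beta, gamma),
      opportunities_to_mastery(strong, beta, gamma))
```

Varying theta moves the mastery point by several opportunities while a shared gamma keeps the slope identical, which is the "different starting points, same learning rate" pattern the study reports.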
[AI-144] FVRuleLearner: Operator-Level Reasoning Tree (OP-Tree)-Based Rules Learning for Formal Verification
【速读】:该论文旨在解决自然语言到SystemVerilog Assertions (NL-to-SVA) 自动转换中因训练数据有限和形式化验证(Formal Verification, FV)操作符内在复杂性导致的生成准确性不足问题,尤其是SVA操作符选择错误所引发的功能性错误。其解决方案的关键在于提出FVRuleLearner框架,该框架基于一种新颖的Operator Reasoning Tree (OP-Tree) 构建操作符层级规则(Op-Rule),将SVA生成过程建模为结构化且可解释的推理路径;通过两个互补阶段实现:(1) 训练阶段构建OP-Tree,将NL-to-SVA对齐分解为细粒度、操作符感知的问题,并整合通向正确断言的推理路径;(2) 测试阶段执行操作符对齐检索,从学习得到的OP-Tree中获取相关推理轨迹并生成适用于未见规范的新规则,从而显著提升语法与功能正确率,有效减少各类操作符类别下的功能失败率。
链接: https://arxiv.org/abs/2604.03245
作者: Lily Jiaxin Wan,Chia-Tung Ho,Yunsheng Bai,Cunxi Yu,Deming Chen,Haoxing Ren
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Accepted to IEEE VTS’26
Abstract:The remarkable reasoning and code generation capabilities of large language models (LLMs) have recently motivated increasing interest in automating formal verification (FV), a process that ensures hardware correctness through mathematically precise assertions but remains highly labor-intensive, particularly through the translation of natural language into SystemVerilog Assertions (NL-to-SVA). However, LLMs still struggle with SVA generation due to limited training data and the intrinsic complexity of FV operators. Consequently, a more efficient and robust methodology for ensuring correct SVA operator selection is essential for producing functionally correct assertions. To address these challenges, we introduce FVRuleLearner, an Operator-Level Rule (Op-Rule) learning framework built on a novel Operator Reasoning Tree (OP-Tree), which models SVA generation as structured, interpretable reasoning. FVRuleLearner operates in two complementary phases: (1) Training: it constructs OP-Tree that decomposes NL-to-SVA alignment into fine-grained, operator-aware questions, combining reasoning paths that lead to correct assertions; and (2) Testing: it performs operator-aligned retrieval to fetch relevant reasoning traces from the learned OP-Tree and generate new rules for unseen specifications. In the comprehensive studies, the proposed FVRuleLearner outperforms the state-of-the-art baseline by 3.95% in syntax correctness and by 31.17% in functional correctness on average. Moreover, FVRuleLearner successfully reduces an average of 70.33% of SVA functional failures across diverse operator categories through a functional taxonomy analysis, showing the effectiveness of applying learned OP-Tree to the Op-Rule generations for unseen NL-to-SVA tasks. These results establish FVRuleLearner as a new paradigm for domain-specific reasoning and rule learning in formal verification.
[AI-145] Position: Science of AI Evaluation Requires Item-level Benchmark Data
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 评估范式中存在的系统性效度失效问题,这些问题包括不合理的设计选择和指标错位等,导致评估结果难以支撑高风险场景下的部署决策。其解决方案的关键在于引入项级(item-level)AI 基准数据,通过细粒度诊断分析与证据中心的评估框架,实现对基准测试的严谨验证和可解释性提升。论文进一步提出 OpenEval 作为开放共享的项级数据仓库,以推动社区在评估实践中采纳基于证据的方法论。
链接: https://arxiv.org/abs/2604.03244
作者: Han Jiang,Susu Zhang,Xiaoyuan Yi,Xing Xie,Ziang Xiao
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Databases (cs.DB)
备注:
Abstract:AI evaluations have become the primary evidence for deploying generative AI systems across high-stakes domains. However, current evaluation paradigms often exhibit systemic validity failures. These issues, ranging from unjustified design choices to misaligned metrics, remain intractable without a principled framework for gathering validity evidence and conducting granular diagnostic analysis. In this position paper, we argue that item-level AI benchmark data is essential for establishing a rigorous science of AI evaluation. Item-level analysis enables fine-grained diagnostics and principled validation of benchmarks. We substantiate this position by dissecting current validity failures and revisiting evaluation paradigms across computer science and psychometrics. Through illustrative analyses of item properties and latent constructs, we demonstrate the unique insights afforded by item-level data. To catalyze community-wide adoption, we introduce OpenEval, a growing repository of item-level benchmark data designed to support evidence-centered AI evaluation.
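Two classic psychometric item diagnostics that item-level data enables can be sketched on a synthetic response matrix (an assumption, not OpenEval data): item difficulty as the proportion correct, and item discrimination as the corrected item-total correlation, which exposes an invalid item that aggregate scores would hide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic response matrix (models x items, 1 = correct): the kind of
# item-level benchmark data the paper argues should be released.
ability = rng.normal(0, 1, 1000)              # latent "model capability"
difficulty = np.array([-1.0, 0.0, 1.0, 0.0])  # three well-behaved items ...
logits = 1.7 * (ability[:, None] - difficulty[None, :])
R = (rng.random((1000, 4)) < 1 / (1 + np.exp(-logits))).astype(int)
R[:, 3] = rng.integers(0, 2, 1000)            # ... and one noisy, invalid item

p_values = R.mean(axis=0)                     # item "difficulty" (proportion correct)
total = R.sum(axis=1)
disc = np.array([np.corrcoef(R[:, j], total - R[:, j])[0, 1]
                 for j in range(4)])          # corrected item-total correlation
print(p_values.round(2), disc.round(2))
```

The noisy fourth item has an unremarkable proportion correct but near-zero discrimination, the kind of validity failure that only item-level analysis can surface.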
[AI-146] To Throw a Stone with Six Birds: On Agents and Agenthood
【速读】:该论文旨在解决代理性(agency)在实证研究中难以界定和验证的问题,尤其是区分对象的持久存在(persistence)与实际控制能力(control)之间的混淆。传统方法常因无法清晰分离“物体属性”与“行为影响”,导致代理权声明易受误导或伪造。其解决方案的关键在于基于六鸟理论(Six Birds Theory, SBT)提出一种类型正确的代理建模框架:将代理视为一个维持的理论对象,其可行接口策略能在保持自身可行性的同时引导外部未来状态。该框架通过四个可检验组件实现操作化——记账门限可行性、基于后继支持语义的最大可行核(greatest fixed point)、作为差异制造代理的可行赋能(feasible empowerment,即信道容量),以及用于量化粗粒度观测下对象性的经验包装映射的幂等缺陷。实验在最小环形世界中验证了这些组件的有效性,实现了对代理身份(agenthood)与代理行为(agency)的分离测试,且不依赖目标设定、意识或生物体假设。
链接: https://arxiv.org/abs/2604.03239
作者: Ioannis Tsiokos
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Six Birds Theory (SBT) treats macroscopic objects as induced closures rather than primitives. Empirical discussions of agency often conflate persistence (being an object) with control (making a counterfactual difference), which makes agency claims difficult to test and easy to spoof. We give a type-correct account of agency within SBT: a theory induces a layer with an explicit interface and ledgered constraints; an agent is a maintained theory object whose feasible interface policies can steer outside futures while remaining viable. We operationalize this contract in finite controlled systems using four checkable components: ledger-gated feasibility, a robust viability kernel computed as a greatest fixed point under successor-support semantics, feasible empowerment (channel capacity) as a proxy for difference-making, and an empirical packaging map whose idempotence defect quantifies objecthood under coarse observation. In a minimal ring-world with toggles for repair, protocol holonomy, identity staging, and operator rewriting, matched-control ablations yield four separations: calibrated null regimes with single actions show zero empowerment and block model-misspecification false positives; enabling repair collapses the idempotence defect; protocols increase empowerment only at horizons of two or more steps; and learning to rewrite operators monotonically increases median empowerment (0.73 to 1.34 bits). These results provide hash-traceable tests that separate agenthood from agency without making claims about goals, consciousness, or biological organisms, and they are accompanied by reproducible, audited artifacts.
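The robust viability kernel component can be sketched as a greatest-fixed-point computation on a tiny finite system: keep exactly those safe states that have some action all of whose possible successors remain in the kept set (the successor-support semantics), and iterate until nothing more is removed. The transition system below is an illustrative assumption, not the paper's ring-world.

```python
# Minimal viability-kernel sketch on a finite controlled system.
transitions = {          # state -> {action: set of possible successors}
    0: {"stay": {0}, "jump": {1}},
    1: {"stay": {1, 2}},            # stochastic: may slip into state 2
    2: {"stay": {2}},
}
safe = {0, 1}            # state 2 is a failure state

def viability_kernel(trans, safe):
    V = set(safe)
    while True:
        # Keep s iff some action's entire successor support stays in V.
        keep = {s for s in V
                if any(succ <= V for succ in trans[s].values())}
        if keep == V:
            return V     # greatest fixed point reached
        V = keep

print(viability_kernel(transitions, safe))
```

State 1 is safe but not viable, because its only action can slip into the failure state; the kernel shrinks to {0}, illustrating the paper's distinction between merely being in the safe set and being able to remain viable.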
[AI-147] Structural Segmentation of the Minimum Set Cover Problem: Exploiting Universe Decomposability for Metaheuristic Optimization
【速读】:该论文旨在解决最小集合覆盖问题(Minimum Set Cover Problem, MSCP)中因传统方法将实例视为整体而忽略其内在结构特性所导致的求解效率与质量受限的问题。解决方案的关键在于提出一种基于不相交集合并(disjoint-set union,即union-find)的高效预处理策略,用于识别由元素共现关系诱导的连通分量,从而将原始问题分解为多个独立子问题;每个子问题使用GRASP元启发式算法求解,且部分解可无损组合,最终实现更优的解质量和更好的可扩展性,尤其在大规模、结构可分解的实例上表现显著。
链接: https://arxiv.org/abs/2604.03234
作者: Isidora Hernández,Héctor Ferrada,Cristóbal A. Navarro
机构: 未知
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: Submitted to journal
Abstract:The Minimum Set Cover Problem (MSCP) is a classical NP-hard combinatorial optimization problem with numerous applications in science and engineering. Although a wide range of exact, approximate, and metaheuristic approaches have been proposed, most methods implicitly treat MSCP instances as monolithic, overlooking potential intrinsic structural properties of the universe. In this work, we investigate the concept of universe segmentability in the MSCP and analyze how intrinsic structural decomposition (universe segmentability) can be exploited to enhance heuristic optimization. We propose an efficient preprocessing strategy based on disjoint-set union (union–find) to detect connected components induced by element co-occurrence within subsets, enabling the decomposition of the original instance into independent subproblems. Each subproblem is solved using the GRASP metaheuristic, and partial solutions are combined without compromising feasibility. Extensive experiments on standard benchmark instances and large-scale synthetic datasets show that exploiting natural universe segmentation consistently improves solution quality and scalability, particularly for large and structurally decomposable instances. These gains are supported by a succinct bit-level set representation that enables efficient set operations, making the proposed approach computationally practical at scale.
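The union-find segmentation step can be sketched in a few lines: union all elements that co-occur in a subset, then group subsets by the component of any of their elements; each group is an independent MSCP subproblem. This is a plain-dictionary sketch of the preprocessing idea, without the paper's bit-level set representation or the GRASP solving stage.

```python
# Segmentation via disjoint-set union over co-occurring universe elements.
def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def segment(universe, subsets):
    parent = {e: e for e in universe}
    for s in subsets:
        it = iter(s)
        first = next(it)
        for e in it:                    # union all elements of one subset
            parent[find(parent, e)] = find(parent, first)
    groups = {}                         # component root -> subset indices
    for i, s in enumerate(subsets):
        root = find(parent, next(iter(s)))
        groups.setdefault(root, []).append(i)
    return list(groups.values())

universe = {1, 2, 3, 4, 5, 6}
subsets = [{1, 2}, {2, 3}, {4, 5}, {5, 6}, {3}]
print(segment(universe, subsets))       # two independent subproblems
```

Here elements {1,2,3} and {4,5,6} never co-occur, so the instance splits into two subproblems that can be solved separately and their covers concatenated without affecting feasibility.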
[AI-148] IC3-Evolve: Proof-/Witness-Gated Offline LLM-Driven Heuristic Evolution for IC3 Hardware Model Checking
[Quick Read]: This paper tackles the problem that the performance of IC3 (Property-Directed Reachability, PDR) for hardware safety model checking depends on a large web of interacting heuristics and implementation choices, making manual tuning costly, brittle, and hard to reproduce. The key to the solution is IC3-Evolve, an offline LLM-driven code-evolution framework that generates small, slot-restricted, auditable patches and admits them only through proof-/witness-gated validation (SAFE runs must emit an independently checkable certificate, and UNSAFE runs must emit a replayable counterexample trace), so that every candidate patch is screened under correctness constraints and unsound edits are never deployed. The result is a standalone evolved checker with zero ML/LLM inference overhead and no runtime model dependency that generalizes well on public and industrial benchmarks.
Link: https://arxiv.org/abs/2604.03232
Authors: Mingkai Miao, Guangyu Hu, Ziyi Yang, Hongce Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:IC3, also known as property-directed reachability (PDR), is a commonly-used algorithm for hardware safety model checking. It checks if a state transition system complies with a given safety property. IC3 either returns UNSAFE (indicating property violation) with a counterexample trace, or SAFE with a checkable inductive invariant as the proof to safety. In practice, the performance of IC3 is dominated by a large web of interacting heuristics and implementation choices, making manual tuning costly, brittle, and hard to reproduce. This paper presents IC3-Evolve, an automated offline code-evolution framework that utilizes an LLM to propose small, slot-restricted and auditable patches to an IC3 implementation. Crucially, every candidate patch is admitted only through proof-/witness-gated validation: SAFE runs must emit a certificate that is independently checked, and UNSAFE runs must emit a replayable counterexample trace, preventing unsound edits from being deployed. Since the LLM is used only offline, the deployed artifact is a standalone evolved checker with zero ML/LLM inference overhead and no runtime model dependency. We evolve on the public hardware model checking competition (HWMCC) benchmark and evaluate the generalizability on unseen public and industrial model checking benchmarks, showing that IC3-Evolve can reliably discover practical heuristic improvements under strict correctness gates.
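The proof-gating idea, independently checking a SAFE certificate, can be illustrated on a toy finite transition system. This is a hedged sketch of what an inductive-invariant check does, not the paper's checker; the example system and all names are invented:

```python
def check_inductive_invariant(states, init, step, prop, inv):
    """Verify that `inv` is an inductive invariant certifying `prop`:
    (1) every initial state satisfies inv,
    (2) inv is closed under the transition relation,
    (3) inv implies the safety property."""
    if not all(inv(s) for s in states if init(s)):
        return False  # initiation fails
    if not all(inv(step(s)) for s in states if inv(s)):
        return False  # consecution fails
    return all(prop(s) for s in states if inv(s))  # safety

# Toy system: a 3-bit counter starting at 0 that increments by 2 mod 8.
states = range(8)
init = lambda s: s == 0
step = lambda s: (s + 2) % 8
prop = lambda s: s != 5          # state 5 must be unreachable
even = lambda s: s % 2 == 0      # candidate certificate: only even states occur
```

Here `even` is accepted as a certificate, whereas the weaker candidate `lambda s: s != 5` is rejected because it is not closed under the transition (state 3 steps to 5). An IC3-style gate would run such an independent check before admitting any evolved patch.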
[AI-149] From Concept to Practice: an Automated LLM-aided UVM Machine for RTL Verification
[Quick Read]: This paper targets the inefficiency of testbench construction and stimulus generation in integrated circuit (IC) verification, a stage that consumes nearly 70% of the development cycle; the conventional Universal Verification Methodology (UVM) requires extensive manual coding and repetitive tool execution and demands deep engineering expertise. The key to the solution is the UVM^2 framework, which uses Large Language Models (LLMs) to automatically generate UVM testbenches and iteratively refine them with coverage feedback, substantially reducing manual effort while achieving average code coverage of 87.44% and functional coverage of 89.58% on a benchmark of RTL designs of up to 1.6K lines, outperforming state-of-the-art solutions by 20.96% and 23.51%, respectively.
Link: https://arxiv.org/abs/2504.19959
Authors: Junhao Ye, Yuchen Hu, Ke Xu, Dingrong Pan, Qichun Chen, Jie Zhou, Shuai Zhao, Xinwei Fang, Xi Wang, Nan Guan, Zhe Jiang
Affiliations: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Verification presents a major bottleneck in Integrated Circuit (IC) development, consuming nearly 70% of the total development effort. While the Universal Verification Methodology (UVM) is widely used in industry to improve verification efficiency through structured and reusable testbenches, constructing these testbenches and generating sufficient stimuli remain challenging. These challenges arise from the considerable manual coding effort required, repetitive manual execution of multiple EDA tools, and the need for in-depth domain expertise to navigate complex workflows. Therefore, we present UVM^2, an automated verification framework that leverages Large Language Models (LLMs) to generate UVM testbenches and iteratively refine them using coverage feedback, significantly reducing manual effort while maintaining rigorous verification quality. To evaluate UVM^2, we introduce a benchmark suite comprising Register Transfer Level (RTL) designs of up to 1.6K lines of code. The results show that UVM^2 substantially reduces testbench setup time compared to experienced engineers, and achieves average code and function coverage of 87.44% and 89.58%, outperforming state-of-the-art solutions by 20.96% and 23.51%, respectively.
[AI-150] How AI Aggregation Affects Knowledge
[Quick Read]: This paper studies the role of artificial intelligence (AI) in social learning: when population beliefs are aggregated by AI and used as training data for future predictions, how does this affect the efficiency with which individuals' long-run beliefs converge? The core question is whether AI aggregation improves or harms collective learning. The key to the solution is extending the classical DeGroot model with a trainable AI aggregator that trains on population beliefs and feeds synthesized signals back to agents. The paper defines the learning gap as the deviation of long-run beliefs from the efficient benchmark and identifies a critical threshold in the aggregator's updating speed: when it updates too quickly, no training weights robustly improve learning across a broad class of environments, whereas such weights exist when updating is sufficiently slow. Moreover, local aggregator architectures, trained on proximate or topic-specific data, robustly improve learning in all environments, while replacing them with a single global aggregator worsens learning in at least one dimension of the state.
Link: https://arxiv.org/abs/2604.04906
Authors: Daron Acemoglu, Tianyi Lin, Asuman Ozdaglar, James Siderius
Affiliations: Unknown
Subjects: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments: 45 pages
Abstract:Artificial intelligence (AI) changes social learning when aggregated outputs become training data for future predictions. To study this, we extend the DeGroot model by introducing an AI aggregator that trains on population beliefs and feeds synthesized signals back to agents. We define the learning gap as the deviation of long-run beliefs from the efficient benchmark, allowing us to capture how AI aggregation affects learning. Our main result identifies a threshold in the speed of updating: when the aggregator updates too quickly, there is no positive-measure set of training weights that robustly improves learning across a broad class of environments, whereas such weights exist when updating is sufficiently slow. We then compare global and local architectures. Local aggregators trained on proximate or topic-specific data robustly improve learning in all environments. Consequently, replacing specialized local aggregators with a single global aggregator worsens learning in at least one dimension of the state.
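A minimal sketch of the feedback loop studied here: DeGroot agents mix their network average with a signal from an aggregator that re-trains on population beliefs at speed eta. The scalar aggregator, trust matrix, and all weights below are hypothetical illustrations, not the paper's model:

```python
import numpy as np

def degroot_with_aggregator(W, beliefs, alpha=0.2, eta=0.5, steps=200):
    """DeGroot updating where each agent mixes its network average with a
    signal from an AI aggregator. The aggregator's state is an average of
    population beliefs, re-trained at speed eta and fed back to agents."""
    b = beliefs.astype(float)
    agg = b.mean()                      # aggregator trained on current beliefs
    for _ in range(steps):
        agg += eta * (b.mean() - agg)   # aggregator re-trains at speed eta
        b = (1 - alpha) * (W @ b) + alpha * agg  # agents mix network + AI signal
    return b, agg

# Row-stochastic trust matrix on 3 agents (arbitrary example weights).
W = np.array([[0.6, 0.2, 0.2],
              [0.3, 0.4, 0.3],
              [0.1, 0.3, 0.6]])
final, agg = degroot_with_aggregator(W, np.array([0.0, 0.5, 1.0]))
```

In this toy run the agents reach a consensus that coincides with the aggregator's state; the paper's analysis concerns how far such long-run beliefs can deviate from the efficient benchmark as the updating speed varies.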
[AI-151] Muon Dynamics as a Spectral Wasserstein Flow
[Quick Read]: This paper seeks theoretical foundations for gradient normalization in deep-learning optimization, particularly for deep architectures where parameters naturally group into matrices or blocks, so that updates stabilize training and reduce sensitivity to scale. The central challenge is to design normalization mechanisms that reflect parameter structure and to build a corresponding geometric framework for optimal transport in parameter space. The key is a family of Spectral Wasserstein distances indexed by a norm γ on positive semidefinite matrices, in which the trace norm recovers the classical quadratic Wasserstein distance W2, the operator norm recovers the Muon geometry, and intermediate Schatten norms interpolate between the two. By developing the static Kantorovich formulation, proving comparison bounds with W2, deriving a max-min representation and a conditional Brenier theorem, and reducing the Gaussian-marginal case to a constrained optimization over covariance matrices, the paper both reveals the geometry underlying different normalization rules and establishes the equivalence of static and dynamic Benamou-Brenier formulations, showing that the spectral Wasserstein cost is a genuine metric equivalent to W2 in fixed dimension. The framework is finally applied to normalized matrix flows and spectral unbalanced transport induced by positively homogeneous mean-field models.
Link: https://arxiv.org/abs/2604.04891
Authors: Gabriel Peyré
Affiliations: Unknown
Subjects: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Abstract:Gradient normalization is central in deep-learning optimization because it stabilizes training and reduces sensitivity to scale. For deep architectures, parameters are naturally grouped into matrices or blocks, so spectral normalizations are often more faithful than coordinatewise Euclidean ones; Muon is the main motivating example of this paper. More broadly, we study a family of spectral normalization rules, ranging from ordinary gradient descent to Muon and intermediate Schatten-type schemes, in a mean-field regime where parameters are modeled by probability measures. We introduce a family of Spectral Wasserstein distances indexed by a norm gamma on positive semidefinite matrices. The trace norm recovers the classical quadratic Wasserstein distance, the operator norm recovers the Muon geometry, and intermediate Schatten norms interpolate between them. We develop the static Kantorovich formulation, prove comparison bounds with W2, derive a max-min representation, and obtain a conditional Brenier theorem. For Gaussian marginals, the problem reduces to a constrained optimization on covariance matrices, extending the Bures formula and yielding a closed form for commuting covariances in the Schatten family. For monotone norms, including all Schatten cases, we prove the equivalence between the static and dynamic Benamou-Brenier formulations, deduce that the resulting transport cost is a genuine metric equivalent to W2 in fixed dimension, and show that the induced Gaussian covariance cost is also a metric. We then interpret the associated normalized continuity equation as a Spectral Wasserstein gradient flow, identify its exact finite-particle counterpart as a normalized matrix flow, obtain first geodesic-convexity results, and show how positively homogeneous mean-field models induce a spectral unbalanced transport on the sphere.
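The Bures formula that the paper extends can be computed directly for centered Gaussians; in the commuting (here diagonal) case it collapses to a closed form. A numpy sketch with arbitrary example covariances:

```python
import numpy as np

def spd_sqrt(A):
    """Matrix square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def bures_w2(A, B):
    """Classical Bures / quadratic Wasserstein distance between centered
    Gaussians N(0, A) and N(0, B):
        W2^2 = tr(A) + tr(B) - 2 tr((A^{1/2} B A^{1/2})^{1/2})."""
    rA = spd_sqrt(A)
    cross = spd_sqrt(rA @ B @ rA)
    return np.sqrt(max(np.trace(A) + np.trace(B) - 2.0 * np.trace(cross), 0.0))

# For commuting (diagonal) covariances the formula collapses to the
# closed form sum_i (sqrt(a_i) - sqrt(b_i))^2 inside the square root.
A = np.diag([4.0, 1.0])
B = np.diag([1.0, 9.0])
closed_form = np.sqrt((2.0 - 1.0) ** 2 + (1.0 - 3.0) ** 2)
```

The paper replaces the trace in this cost with general γ-norms (operator and Schatten norms) to obtain the Muon geometry and its interpolants; the computation above is only the classical trace-norm endpoint.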
[AI-152] A Quantum Search Approach to Magic Square Constraint Problems with Classical Benchmarking
[Quick Read]: This paper addresses the efficiency of solving combinatorial constraint satisfaction problems, validated on magic square construction. The key to the solution is reformulating the problem as a quantum search: Grover's algorithm performs amplitude amplification, with a reversible, constraint-sensitive quantum oracle marking valid configurations; classical preprocessing (the Siamese construction and partial constraint checks) generates a compact candidate domain that the quantum stage then searches efficiently, separating structured initialization from quantum search rather than using an iterative hybrid solver loop. Experiments validate the correctness of the quantum search pipeline and confirm the theoretical quadratic query advantage over classical brute-force enumeration.
Link: https://arxiv.org/abs/2604.04786
Authors: Rituparna R, Harsha Varthini, Aswani Kumar Cherukuri
Affiliations: Unknown
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper presents a quantum search approach to combinatorial constraint satisfaction problems, demonstrated through the generation of magic squares. We reformulate magic square construction as a quantum search problem in which a reversible, constraint-sensitive oracle marks valid configurations for amplitude amplification via Grover’s algorithm. Classical pre-processing using the Siamese construction and partial constraint checks generates a compact candidate domain before quantum encoding. Rather than integrating classical and quantum solvers in an iterative loop, this work uses the classical component for structured initialisation and the quantum component for search, and benchmarks the quantum approach against classical brute-force enumeration and backtracking. Our Qiskit implementation demonstrates the design of multi-register modular arithmetic circuits, oracle logic, and diffusion operators. Experiments are conducted on small grid instances, as larger grids are intractable on classical statevector simulators due to exponential memory growth. The results validate the correctness of the proposed quantum search pipeline and confirm the theoretical quadratic query advantage over classical search.
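Grover amplitude amplification over a compact candidate domain can be simulated classically with a statevector. This is a hedged stand-in for the paper's Qiskit circuits; the candidate count and marked index below are arbitrary:

```python
import numpy as np

def grover_success_probability(n_candidates, marked, iterations):
    """Simulate Grover amplitude amplification on a uniform superposition
    over n_candidates basis states; `marked` is the set of valid
    configurations flagged by the oracle. Returns the probability of
    measuring a marked state after the given number of iterations."""
    state = np.full(n_candidates, 1.0 / np.sqrt(n_candidates))
    sign = np.ones(n_candidates)
    sign[list(marked)] = -1.0               # oracle: phase-flip valid states
    for _ in range(iterations):
        state *= sign                        # oracle reflection
        state = 2.0 * state.mean() - state   # diffusion: inversion about the mean
    return float(sum(state[i] ** 2 for i in marked))

# One valid candidate among 16: the optimal iteration count is roughly
# pi/4 * sqrt(16) = 3, boosting success probability from 1/16 to ~0.96.
p3 = grover_success_probability(16, {7}, 3)
```

This mirrors the quadratic query advantage the abstract refers to: roughly sqrt(N) oracle calls concentrate the amplitude on the marked configuration, versus O(N) classical evaluations.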
[AI-153] An AI Teaching Assistant for Motion Picture Engineering
[Quick Read]: This paper addresses the open question of the practical details and real benefits of deploying large language models (LLMs) in teaching environments, specifically the effectiveness and feasibility of a generative AI teaching assistant (AI-TA) in a graduate course. The key to the solution is building the AI-TA with Retrieval Augmented Generation (RAG), carefully designing and tuning the RAG pipeline to course needs, and validating its impact with a large-scale empirical study (43 students, 296 sessions, 1,889 queries). The study further shows that thoughtfully designed assessments ensure AI-TA use does not compromise academic validity, providing empirical grounding for safely deploying LLM-driven teaching tools in higher education.
Link: https://arxiv.org/abs/2604.04670
Authors: Deirdre O'Regan, Anil C. Kokaram
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Signal Processing (eess.SP)
Comments: Accepted for publication in IEEE Signal Processing Magazine
Abstract:The rapid rise of LLMs over the last few years has promoted growing experimentation with LLM-driven AI tutors. However, the details of implementation, as well as the benefit in a teaching environment, are still in the early days of exploration. This article addresses these issues in the context of implementation of an AI Teaching Assistant (AI-TA) using Retrieval Augmented Generation (RAG) for Trinity College Dublin's Master's Motion Picture Engineering (MPE) course. We provide details of our implementation (including the prompt to the LLM, and code), and highlight how we designed and tuned our RAG pipeline to meet course needs. We describe our survey instrument and report on the impact of the AI-TA through a number of quantitative metrics. The scale of our experiment (43 students, 296 sessions, 1,889 queries over 7 weeks) was sufficient to have confidence in our findings. Unlike previous studies, we experimented with allowing the use of the AI-TA in open-book examinations. Statistical analysis across three exams showed no performance differences regardless of AI-TA access (p > 0.05), demonstrating that thoughtfully designed assessments can maintain academic validity. Student feedback revealed that the AI-TA was beneficial (mean = 4.22/5), while students had mixed feelings about preferring it over human tutoring (mean = 2.78/5).
[AI-154] RAVEN: Radar Adaptive Vision Encoders for Efficient Chirp-wise Object Detection and Segmentation CVPR
[Quick Read]: This paper addresses the computational-efficiency and end-to-end latency bottlenecks of conventional frame-based FMCW radar perception. The core solution is RAVEN, a deep learning architecture that streams raw ADC data chirp by chirp, preserves the MIMO structure through independent per-receiver state-space encoders, and introduces a learnable cross-antenna mixing module to recover compact virtual-array features; an early-exit mechanism further lets the model decide using only a subset of chirps once the latent state has stabilized, substantially reducing computation and latency while maintaining strong object detection and BEV free-space segmentation performance on automotive radar benchmarks.
Link: https://arxiv.org/abs/2604.04490
Authors: Anuvab Sen, Mir Sayeed Mohammad, Saibal Mukhopadhyay
Affiliations: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: CVPR submission / conference paper
Abstract:This paper presents RAVEN, a computationally efficient deep learning architecture for FMCW radar perception. The method processes raw ADC data in a chirp-wise streaming manner, preserves MIMO structure through independent receiver state-space encoders, and uses a learnable cross-antenna mixing module to recover compact virtual-array features. It also introduces an early-exit mechanism so the model can make decisions using only a subset of chirps when the latent state has stabilized. Across automotive radar benchmarks, the approach reports strong object detection and BEV free-space segmentation performance while substantially reducing computation and end-to-end latency compared with conventional frame-based radar pipelines.
[AI-155] MC-GenRef: Annotation-free mammography microcalcification segmentation with generative posterior refinement
[Quick Read]: This paper targets the challenges of dense microcalcification (MC) segmentation in mammography: targets are extremely small and sparse, pixel-level annotation is expensive and ambiguous, and cross-site distribution shift causes frequent false positives and misses in dense tissue. The key to the solution is the MC-GenRef framework. During training, real negative patches serve as backgrounds and a lightweight image formation model injects physically plausible MC patterns (with local contrast modulation and blur), yielding exact image-mask pairs without real dense annotation. At inference, Test-Time Generative Posterior Refinement (TT-GPR) treats segmentation as approximate posterior inference: it forms seed-consistent projections from sparse seeds, converts them into case-specific surrogate targets, and iteratively refines the logits with overlap-consistent and edge-aware regularization, markedly improving miss-sensitive metrics such as Recall and FNR (False Negative Rate) while strengthening cross-site robustness.
Link: https://arxiv.org/abs/2604.04470
Authors: Hyunwoo Cho, Yeeun Kwon, Min Jung Kim, Yangmo Yoo
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Microcalcification (MC) analysis is clinically important in screening mammography because clustered puncta can be an early sign of malignancy, yet dense MC segmentation remains challenging: targets are extremely small and sparse, dense pixel-level labels are expensive and ambiguous, and cross-site shift often induces texture-driven false positives and missed puncta in dense tissue. We propose MC-GenRef, a real dense-label-free framework that combines high-fidelity synthetic supervision with test-time generative posterior refinement (TT-GPR). During training, real negative mammogram patches are used as backgrounds, and physically plausible MC patterns are injected through a lightweight image formation model with local contrast modulation and blur, yielding exact image-mask pairs without real dense annotation. Using only these synthetic labeled pairs, MC-GenRef trains a base segmentor and a seed-conditioned rectified-flow (RF) generator that serves as a controllable generative prior. During inference, TT-GPR treats segmentation as approximate posterior inference: it derives a sparse seed from the current prediction, forms seed-consistent RF projections, converts them into case-specific surrogate targets through the frozen segmentor, and iteratively refines the logits with overlap-consistent and edge-aware regularization. On INbreast, the synthetic-only initializer achieved the best Dice without real dense annotations, while TT-GPR improved miss-sensitive performance in Recall and FNR, with strong class-balanced behavior (e.g., G-Mean). On an external private Yonsei cohort ( n=50 ), TT-GPR consistently improved the synthetic-only initializer under cross-site shift, increasing Dice and Recall while reducing FNR. These results suggest that test-time generative posterior refinement is a practical route to reduce MC misses and improve robustness without additional real dense labeling.
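The idea of generating exact image-mask pairs by injecting plausible puncta into negative backgrounds can be sketched as follows. The blob shape, contrast level, and noise model are hypothetical stand-ins for the paper's image formation model:

```python
import numpy as np

def inject_microcalcifications(background, centers, radius=1, contrast=0.4, seed=0):
    """Inject small bright puncta with local contrast modulation into a
    negative background patch, returning the synthetic image and its
    exact pixel-level mask (no manual annotation required)."""
    rng = np.random.default_rng(seed)
    img = background.astype(float).copy()
    mask = np.zeros(img.shape, dtype=bool)
    yy, xx = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    for cy, cx in centers:
        blob = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
        # local contrast modulation: brighten each punctum with small jitter
        img[blob] += contrast * (1.0 + 0.1 * rng.standard_normal(int(blob.sum())))
        mask |= blob
    return np.clip(img, 0.0, 1.0), mask

bg = np.full((16, 16), 0.3)                       # toy negative patch
img, mask = inject_microcalcifications(bg, [(4, 4), (10, 12)])
```

Because the mask is produced jointly with the injection, supervision is exact by construction; the paper additionally applies blur and trains a segmentor plus a rectified-flow prior on such pairs.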
[AI-156] PATHFINDER: Multi-objective discovery in structural and spectral spaces
[Quick Read]: This paper addresses a limitation of automated microscopy characterization in which machine-learning-driven workflows over-optimize a single predefined objective and converge prematurely on familiar responses, overlooking rare but scientifically important states. The key to the solution is the PATHFINDER framework, which combines novelty-driven exploration with optimization to coordinate search across structural, spectral, and measurement spaces; concretely, it uses latent-space representations of local structure, surrogate modeling of functional response, and Pareto-based acquisition to select measurements that balance novelty in feature and object space with experimental actionability, expanding the accessible structure-property landscape under finite experimental budgets and avoiding collapse onto a single apparent optimum.
Link: https://arxiv.org/abs/2604.04194
Authors: Kamyar Barakati, Boris N. Slautin, Utkarsh Pratiush, Hiroshi Funakubo, Sergei V. Kalinin
Affiliations: Unknown
Subjects: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 24 pages, 6 figures
Abstract:Automated decision-making is becoming key for automated characterization including electron and scanning probe microscopies and nanoindentation. Most machine-learning-driven workflows optimize a single predefined objective and tend to converge prematurely on familiar responses, overlooking rare but scientifically important states. More broadly, the challenge is not only where to measure next, but how to coordinate exploration across structural, spectral, and measurement spaces under finite experimental budgets while balancing target-driven optimization with novelty discovery. Here we introduce PATHFINDER, a framework for autonomous microscopy that combines novelty-driven exploration with optimization, helping the system discover more diverse and useful representations across structural, spectral, and measurement spaces. By combining latent space representations of local structure, surrogate modeling of functional response, and Pareto-based acquisition, the framework selects measurements that balance novelty discovery in feature and object space and are informative and experimentally actionable. Benchmarked on pre-acquired STEM EELS data and realized experimentally in scanning probe microscopy of ferroelectric materials, this approach expands the accessible structure-property landscape and avoids collapse onto a single apparent optimum. These results point to a new mode of autonomous microscopy that is not only optimization-driven, but also discovery-oriented, broad in its search, and responsive to human guidance.
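At its core, Pareto-based acquisition selects the non-dominated candidates over competing objectives such as novelty and expected utility. A minimal sketch with arbitrary example scores:

```python
def pareto_front(candidates):
    """Return the non-dominated candidates among (novelty, utility) score
    tuples: a point is dominated if some other point is at least as good
    in both objectives and strictly better in at least one."""
    front = []
    for i, (n_i, u_i) in enumerate(candidates):
        dominated = any(
            (n_j >= n_i and u_j >= u_i) and (n_j > n_i or u_j > u_i)
            for j, (n_j, u_j) in enumerate(candidates) if j != i
        )
        if not dominated:
            front.append((n_i, u_i))
    return front

# Candidate measurements scored by (novelty, expected utility).
scores = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4), (0.1, 0.1)]
front = pareto_front(scores)
```

An autonomous loop would then pick the next measurement from this front (optionally weighted by human guidance) rather than from a single scalarized objective, which is what prevents collapse onto one apparent optimum.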
[AI-157] An Improved Last-Iterate Convergence Rate for Anchored Gradient Descent Ascent
[Quick Read]: This paper concerns the last-iterate convergence rate of the Anchored Gradient Descent Ascent algorithm for smooth convex-concave min-max problems. Prior work only established a rate of O(1/t^{2-2p}) for the squared gradient norm, with p ∈ (1/2, 1), and whether the improved exact O(1/t) rate is achievable had remained open. The paper resolves this question in the affirmative, with the proof discovered autonomously by an AI system that writes formal proofs in Lean; the formally verified argument establishes the optimal O(1/t) last-iterate rate.
Link: https://arxiv.org/abs/2604.03782
Authors: Anja Surina, Arun Suggala, George Tsoukalas, Anton Kovsharov, Sergey Shirobokov, Francisco J. R. Ruiz, Pushmeet Kohli, Swarat Chaudhuri
Affiliations: Unknown
Subjects: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
Comments:
Abstract:We analyze the last-iterate convergence of the Anchored Gradient Descent Ascent algorithm for smooth convex-concave min-max problems. While previous work established a last-iterate rate of \mathcal{O}(1/t^{2-2p}) for the squared gradient norm, where p \in (1/2, 1) , it remained an open problem whether the improved exact \mathcal{O}(1/t) rate is achievable. In this work, we resolve this question in the affirmative. This result was discovered autonomously by an AI system capable of writing formal proofs in Lean. The Lean proof can be accessed at this https URL
[AI-158] The Ideation Bottleneck: Decomposing the Quality Gap Between AI-Generated and Human Economics Research
[Quick Read]: This paper addresses why economics research papers produced by generative AI are markedly lower in quality than human-authored ones, with the core challenge being to identify and quantify the sources of the gap. The key to the solution is decomposing the overall quality difference into two independent dimensions, research idea quality and execution quality, evaluated with a two-model approach: an ensemble of fine-tuned language models trained on publication decisions assesses idea quality, while Gemini 3.1 Flash Lite scores execution quality against a six-dimension rubric, making it possible to quantify each component's contribution precisely. The results show that the idea-quality gap (Cohen's d = 2.23) accounts for 71% of the overall difference, far exceeding the execution-quality gap (d = 0.90), indicating that the main bottleneck of AI-generated economics research remains the generation of innovative research ideas.
Link: https://arxiv.org/abs/2604.03338
Authors: Ning Li
Affiliations: Unknown
Subjects: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:
Abstract:Autonomous AI systems can now generate complete economics research papers, but they substantially underperform human-authored publications in head-to-head comparisons. This paper decomposes the quality gap into two independent components: research idea quality and execution quality. Using a two-model ensemble of fine-tuned language models trained on publication decisions (Gong, Li, and Zhou, 2026) to evaluate idea quality and a comprehensive six-dimension rubric assessed by Gemini 3.1 Flash Lite – the same model family used as the APE tournament judge, ensuring methodological consistency – to evaluate execution quality, we analyze 953 economics papers – 912 AI-generated papers from the APE project and 41 human papers published in the American Economic Review and AEJ: Economic Policy. The idea quality gap is large (Cohen's d = 2.23, p < 0.001), with human papers achieving 47.1% mean ensemble exceptional probability versus 16.5% for AI. The execution quality gap is also significant but smaller (d = 0.90, p < 0.001), with human papers scoring 4.38/5.0 versus 3.84. Idea quality accounts for approximately 71% of the overall quality difference, with execution contributing 29%. The largest execution weakness is mechanism analysis depth (d = 1.43); no significant difference is found on robustness. We document that 74% of AI papers employ difference-in-differences, and only 7 AI papers (0.8%) surpass the median human paper on both idea and execution quality simultaneously. The primary bottleneck to competitive AI-generated economics research remains ideation.
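The effect sizes reported above use Cohen's d with a pooled standard deviation, which can be computed as follows (a textbook formula, not the authors' evaluation code; the sample values in the test are arbitrary):

```python
import math

def cohens_d(sample_a, sample_b):
    """Cohen's d with the pooled standard deviation:
    d = (mean_a - mean_b) / s_pooled, where s_pooled combines the
    sample variances weighted by their degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    ma = sum(sample_a) / na
    mb = sum(sample_b) / nb
    va = sum((x - ma) ** 2 for x in sample_a) / (na - 1)   # sample variance A
    vb = sum((x - mb) ** 2 for x in sample_b) / (nb - 1)   # sample variance B
    s_pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / s_pooled
```

Values of d around 0.8 are conventionally read as large effects, which puts the reported idea-quality gap (d = 2.23) well beyond that threshold.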
[AI-159] Downscaling weather forecasts from Low- to High-Resolution with Diffusion Models
[Quick Read]: This paper addresses global atmospheric downscaling, i.e., generating high-resolution ensembles from low-resolution ensemble forecasts, with particular attention to recovering small-scale structure and plausibly simulating extremes. The key to the solution is a probabilistic diffusion-based method implemented within the Anemoi framework that learns the conditional distribution of fine-scale residuals (the difference between the high-resolution field and the interpolated low-resolution input), reconstructing 30 km fields from 100 km inputs; the bulk of training focuses on recovering small-scale structure, and fine-tuning in high-noise regimes enables the generation of extreme weather events, yielding clear gains in probabilistic skill (FCRPS), power-spectrum matching, multivariate physical consistency (e.g., wind-pressure coupling), and the reproduction of tropical-cyclone extremes.
Link: https://arxiv.org/abs/2604.03303
Authors: Joffrey Dumont Le Brazidec, Simon Lang, Martin Leutbecher, Baudouin Raoult, Gert Mertes, Florian Pinault, Aristofanis Tsiringakis, Pedro Maciel, Ana Prieto Nemesio, Jan Polster, Cathal O Brien, Matthew Chantry
Affiliations: Unknown
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
Comments:
Abstract:We introduce a probabilistic diffusion-based method for global atmospheric downscaling implemented within the Anemoi framework. The approach transforms low-resolution ensemble forecasts into high-resolution ensembles by learning the conditional distribution of finer-scale residuals, defined as the difference between the high-resolution fields and the interpolated low-resolution inputs. The system is trained on reforecast pairs from ECMWF IFS, using coarse fields at 100 km to reconstruct fine-scale variability at 30 km resolution. The bulk of the training focuses on recovering small-scale structures, while fine-tuning in high-noise regimes enables the generation of extremes. Evaluation against the medium-range IFS ensemble target shows that the model increases probabilistic skill (FCRPS) for surface variables, reproduces target power spectra at small scales, captures physically consistent multivariate relationships such as wind-pressure coupling, and generates extreme values consistent with those of the target ensemble in tropical cyclones.
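The probabilistic skill metric mentioned above builds on the continuous ranked probability score (CRPS); for an ensemble forecast against a single observation, the standard empirical estimator can be computed as follows (a textbook estimator, not the paper's FCRPS implementation):

```python
def ensemble_crps(members, obs):
    """Empirical CRPS of an ensemble forecast against one observation:
    CRPS = E|X - obs| - 0.5 * E|X - X'|, with both expectations taken
    over the empirical ensemble. Lower is better; 0 means a perfect,
    fully confident forecast."""
    m = len(members)
    term1 = sum(abs(x - obs) for x in members) / m
    term2 = sum(abs(x - y) for x in members for y in members) / (m * m)
    return term1 - 0.5 * term2
```

The score rewards both accuracy and sharpness: an ensemble tightly clustered around the observation scores lower than a wide ensemble with the same mean.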
[AI-160] AIFS-COMPO: A Global Data-Driven Atmospheric Composition Forecasting System
[Quick Read]: This paper addresses the high computational cost and limited lead time of conventional global atmospheric composition forecasting systems, particularly the tension between accuracy and efficiency in medium-range forecasts of aerosols and reactive gases. The key to the solution is AIFS-COMPO, an end-to-end data-driven forecasting system with a transformer-based encoder-processor-decoder architecture that jointly models meteorological and atmospheric composition variables; trained on Copernicus Atmosphere Monitoring Service (CAMS) reanalysis, analysis, and forecast data, it learns the coupled dynamics of weather, emissions, transport, and chemistry, achieving accurate global composition forecasts at a fraction of the computational cost while extending the feasible forecast horizon.
Link: https://arxiv.org/abs/2604.03300
Authors: Paula Harder, Johannes Flemming, Mihai Alexe, Gert Mertes, Baudouin Raoult, Matthew Chantry
Affiliations: Unknown
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
Comments:
Abstract:We introduce AIFS-COMPO, a skilful medium-range data-driven global forecasting system for aerosols and reactive gases. Building on the ECMWF Artificial Intelligence Forecast System (AIFS), AIFS-COMPO employs a transformer-based encoder-processor-decoder architecture to jointly model meteorological and atmospheric composition variables. The model is trained on Copernicus Atmosphere Monitoring Service (CAMS) reanalysis, analysis, and forecast data to learn the coupled dynamics of weather, emissions, transport, and atmospheric chemistry. We evaluate AIFS-COMPO against a range of atmospheric composition observations and compare its performance with the operational CAMS global forecasting system IFS-COMPO. The results show that AIFS-COMPO achieves comparable or improved forecast skill for several key species while requiring only a fraction of the computational resources. Furthermore, the efficiency of the approach enables forecasts beyond the current operational horizon, demonstrating the potential of AI-based systems for fast and accurate global atmospheric composition prediction.
[AI-161] Impact of geophysical fields on Deep Learning-based Lagrangian drift simulations
[Quick Read]: This paper investigates how combining multiple geophysical input fields improves the accuracy of Lagrangian drift simulation at the sea surface. The key to the solution is using DriftNet, a learning-based method, to systematically evaluate how different input combinations affect drift trajectory prediction: combining assimilated sea surface currents (SSC) with fully observed sea surface height (SSH) markedly reduces separation distance, normalized cumulative Lagrangian separation, and velocity-autocorrelation errors in both numerical and real-drifter experiments, achieving the best drift simulations, whereas adding sea surface temperature (SST) alone generally degrades performance.
Link: https://arxiv.org/abs/2604.03292
Authors: Daria Botvynko (Lab-STICC_OSE, IMT Atlantique - MEE, IMT Atlantique), Carlos Granero-Belinchon (ODYSSEY, IMT Atlantique - MEE, Lab-STICC_OSE), Simon Van Gennip (MOi), Abdesslam Benzinou (ENIB), Ronan Fablet (IMT Atlantique - MEE, Lab-STICC_OSE, ODYSSEY)
Affiliations: Unknown
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
Comments:
Abstract:We assess the influence of different Eulerian geophysical input fields on Lagrangian drift simulations using DriftNet, a learning-based method designed to simulate Lagrangian drift on the sea surface. Two experiments are conducted: a fully numerical experiment (Benchmark B1) and a real-world drifters-based experiment (Benchmark B2). Both experiments are performed in two regions with different ocean dynamics: North East Pacific and Gulf Stream regions. The performance of DriftNet is evaluated with three different metrics: separation distance between simulated and ground-truth trajectories, the normalized cumulative Lagrangian separation and the autocorrelation of Lagrangian velocities. In both regions, results from B1 show that combining assimilated sea surface currents (SSC) with fully observed sea surface height (SSH) leads to the greatest improvement in trajectory simulation. This configuration reduces separation distance by over 50% and significantly decreases normalized cumulative Lagrangian separation and metrics related to velocity autocorrelation functions compared to the baseline using SSC alone. On the other hand, the inclusion of sea surface temperature (SST) either alone or in combination with SSC generally degrades performance. In B2, using satellite-derived SSH, Ekman and wind velocities improves surface drifter trajectory simulation, particularly in the North East Pacific, while the satellite-derived SST in combination with reanalysis-based SSC configuration leads to better trajectory simulation in the Gulf Stream. Overall, we highlight the added value of combining multiple geophysical fields to improve Lagrangian drift simulation in both numerical and real-world experiments.
[AI-162] Toward Artificial Intelligence Enabled Earth System Coupling
[Quick Read]: This paper aims to overcome long-standing limitations in coupled modelling of the Earth system's spheres (atmosphere, ocean, land, biosphere, etc.), in particular insufficient cross-domain interaction between multi-component models, limited physical consistency, and the difficulty of integration. The key to the solution is leveraging state-of-the-art artificial intelligence (AI) techniques to strengthen cross-domain interactions among Earth spheres, improve the coherence of multi-component representations, and advance unified Earth system frameworks, enabling more accurate, interpretable, and physically consistent Earth system simulation.
Link: https://arxiv.org/abs/2604.03289
Authors: Maria Kaselimi, Anna Belehaki
Affiliations: Unknown
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an)
Comments:
Abstract:Coupling constitutes a foundational mechanism in the Earth system, regulating the interconnected physical, chemical, and biological processes that link its spheres. This review examines how emerging artificial intelligence (AI) methods create new opportunities to enhance Earth system coupling and address long-standing limitations in multi-component models. Rather than surveying next-generation modelling efforts broadly, we focus specifically on how state-of-the-art AI techniques can strengthen cross-domain interactions, support more coherent multi-component representations, and enable progress toward unified Earth system frameworks. The scope extends beyond climate models to include any modelling system in which Earth spheres interact. We outline emerging opportunities, persistent limitations, and conceptual pathways through which AI may enhance physical consistency, interpretability, and integration across domains. In doing so, this review provides a structured foundation for understanding the role of AI in advancing coupled Earth system modelling.
[AI-163] IPSL-AID: Generative Diffusion Models for Climate Downscaling from Global to Regional Scales
[Quick Read]: This paper addresses the insufficient resolution of climate models at regional scales: conventional global climate models typically resolve only 150 to 200 km and cannot capture essential regional processes. The solution, IPSL-AID, uses a denoising diffusion probabilistic model (DDPM) trained on ERA5 reanalysis data to generate 0.25-degree temperature, wind, and precipitation fields conditioned on coarse inputs and their spatiotemporal context, while also modelling the probability distribution of fine-scale features for uncertainty quantification. The key innovation is using generative diffusion models to downscale efficiently from global to regional scales while accurately reconstructing statistics such as extremes, power spectra, and spatial structure, providing a more reliable basis for climate adaptation and mitigation decisions.
Link: https://arxiv.org/abs/2604.03275
Authors: Kishanthan Kingston, Olivier Boucher, Freddy Bouchet, Pierre Chapel, Rosemary Eade, Jean-Francois Lamarque, Redouane Lguensat, Kazem Ardaneh
Affiliations: Unknown
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 12 figures, submitted to Climate Informatique 2026, to appear in Environmental Data Science
Abstract:Effective adaptation and mitigation strategies for climate change require high-resolution projections to inform strategic decision-making. Conventional global climate models, which typically operate at resolutions of 150 to 200 kilometers, lack the capacity to represent essential regional processes. IPSL-AID is a global to regional downscaling tool based on a denoising diffusion probabilistic model designed to address this limitation. Trained on ERA5 reanalysis data, it generates 0.25 degree resolution fields for temperature, wind, and precipitation using coarse inputs and their spatiotemporal context. It also models probability distributions of fine-scale features to produce plausible scenarios for uncertainty quantification. The model accurately reconstructs statistical distributions, including extreme events, power spectra, and spatial structures. This work highlights the potential of generative diffusion models for efficient climate downscaling with uncertainty quantification.
[AI-164] Artificial Intelligence and Systemic Risk: A Unified Model of Performative Prediction Algorithmic Herding and Cognitive Dependency in Financial Markets
[Quick Read]: This paper examines how widespread AI adoption in financial markets generates systemic risk through multiple mutually reinforcing mechanisms. The key to the solution is a unified theoretical model revealing a nonlinear coupling between the AI adoption share φ and systemic risk, with equilibrium coupling r(φ) = φρβ/λ'(φ), where ρ is the algorithmic signal correlation, β the performative feedback intensity, and λ'(φ) the endogenous effective price impact. Because λ'(φ) decreases in φ, the coupling is convex in adoption, so the systemic risk multiplier M = (1 − r)^{−1} grows superlinearly with AI penetration. The mechanism unfolds in three layers: endogenous fragility (market depth is decreasing and convex in adoption); algorithmic monoculture arising from embedding the convex coupling in a supermodular adoption game (a saddle-node bifurcation); and cognitive dependency as an endogenous state variable, yielding an impossibility theorem (hysteresis cannot be captured by static frameworks) and a channel necessity theorem (each of the three channels is individually necessary). Empirically, the model is validated on the complete universe of SEC Form 13F filings (99.5 million holdings, 10,957 institutional managers, 2013-2024) with a Bartik instrument (first-stage F = 22.7), implying economically significant tail-loss amplification of 18-54%.
Link: https://arxiv.org/abs/2604.03272
Authors: Shuchen Meng, Xupeng Chen
Affiliations: Unknown
Subjects: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); General Finance (q-fin.GN)
Comments:
Abstract:We develop a unified model in which AI adoption in financial markets generates systemic risk through three mutually reinforcing channels: performative prediction, algorithmic herding, and cognitive dependency. Within an extended rational expectations framework with endogenous adoption, we derive an equilibrium systemic risk coupling r(\phi) = \phi\rho\beta/\lambda'(\phi) , where \phi is the AI adoption share, \rho the algorithmic signal correlation, \beta the performative feedback intensity, and \lambda'(\phi) the endogenous effective price impact. Because \lambda'(\phi) is decreasing in \phi , the coupling is convex in adoption, implying that the systemic risk multiplier M = (1 - r)^{-1} grows superlinearly as AI penetration increases. The model is developed in three layers. First, endogenous fragility: market depth is decreasing and convex in AI adoption. Second, embedding the convex coupling within a supermodular adoption game produces a saddle-node bifurcation into an algorithmic monoculture. Third, cognitive dependency as an endogenous state variable yields an impossibility theorem (hysteresis requires dynamics beyond static frameworks) and a channel necessity theorem (each channel is individually necessary). Empirical validation uses the complete universe of SEC Form 13F filings (99.5 million holdings, 10,957 institutional managers, 2013–2024) with a Bartik shift-share instrument (first-stage F = 22.7 ). The model implies tail-loss amplification of 18–54%, economically significant relative to Basel III countercyclical buffers.
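The superlinear growth of the multiplier M = (1 - r)^{-1} can be illustrated numerically with a toy decreasing price-impact slope. The functional form of lambda'(phi) below is a hypothetical choice for illustration, not the paper's estimate:

```python
def systemic_risk_multiplier(phi, rho=0.5, beta=0.8, lam0=1.0):
    """Toy instance of the coupling r(phi) = phi * rho * beta / lambda'(phi)
    with a hypothetical decreasing slope lambda'(phi) = lam0 / (1 + phi),
    which makes r convex in phi; the multiplier is M = 1 / (1 - r)."""
    lam_prime = lam0 / (1.0 + phi)        # decreasing in phi (assumption)
    r = phi * rho * beta / lam_prime      # = phi * rho * beta * (1 + phi) / lam0
    assert r < 1.0, "coupling must stay below 1 for a finite multiplier"
    return 1.0 / (1.0 - r)
```

With these parameters, stepping the adoption share from 0 to 0.3 to 0.6 produces successively larger increments of M, the superlinear amplification the abstract describes.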
机器学习
[LG-0] Stratifying Reinforcement Learning with Signal Temporal Logic
链接: https://arxiv.org/abs/2604.04923
作者: Justin Curry,Alberto Speranzon
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Systems and Control (eess.SY); Algebraic Topology (math.AT)
*备注: 8 pages, 13 figures
Abstract:In this paper, we develop a stratification-based semantics for Signal Temporal Logic (STL) in which each atomic predicate is interpreted as a membership test in a stratified space. This perspective reveals a novel correspondence principle between stratification theory and STL, showing that most STL formulas can be viewed as inducing a stratification of space-time. The significance of this interpretation is twofold. First, it offers a fresh theoretical framework for analyzing the structure of the embedding space generated by deep reinforcement learning (DRL) and relates it to the geometry of the ambient decision space. Second, it provides a principled framework that both enables the reuse of existing high-dimensional analysis tools and motivates the creation of novel computational techniques. To ground the theory, we (1) illustrate the role of stratification theory in Minigrid games and (2) apply numerical techniques to the latent embeddings of a DRL agent playing such a game where the robustness of STL formulas is used as the reward. In the process, we propose computationally efficient signatures that, based on preliminary evidence, appear promising for uncovering the stratification structure of such embedding spaces.
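For readers unfamiliar with STL robustness as a reward signal, a minimal sketch of the standard quantitative semantics for two basic operators over a sampled signal follows; the signal and thresholds are made up for illustration, and the paper's reward uses full STL formulas rather than these two primitives alone.

```python
# Quantitative robustness of two basic STL operators on a sampled signal:
# G (x > c) "always" takes the worst-case margin over the horizon,
# F (x > c) "eventually" takes the best-case margin.
def rob_always(signal, c):
    return min(x - c for x in signal)

def rob_eventually(signal, c):
    return max(x - c for x in signal)

sig = [1.0, 1.5, 0.75, 2.0]
print(rob_always(sig, 0.5))      # 0.25: "always above 0.5" holds with margin 0.25
print(rob_eventually(sig, 1.75)) # 0.25: "eventually above 1.75" holds at the last step
```

Positive robustness means the formula is satisfied with a margin; negative means it is violated, which is what makes robustness usable as a dense reward.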
[LG-1] Empowering Power Outage Prediction with Spatially Aware Hybrid Graph Neural Networks and Contrastive Learning
链接: https://arxiv.org/abs/2604.04916
作者: Xuyang Shen,Zijie Pan,Diego Cerrai,Xinxuan Zhang,Christopher Colorio,Emmanouil N. Anagnostou,Dongjin Song
类目: Machine Learning (cs.LG)
*备注:
Abstract:Extreme weather events, such as severe storms, hurricanes, snowstorms, and ice storms, which are exacerbated by climate change, frequently cause widespread power outages. These outages halt industrial operations, impact communities, damage critical infrastructure, profoundly disrupt economies, and have far-reaching effects across various sectors. To mitigate these effects, the University of Connecticut and Eversource Energy Center have developed an outage prediction modeling (OPM) system to provide pre-emptive forecasts for electric distribution networks before such weather events occur. However, existing predictive models in the system do not incorporate the spatial effect of extreme weather events. To this end, we develop Spatially Aware Hybrid Graph Neural Networks (SA-HGNN) with contrastive learning to enhance the OPM predictions for extreme weather-induced power outages. Specifically, we first encode spatial relationships of both static features (e.g., land cover, infrastructure) and event-specific dynamic features (e.g., wind speed, precipitation) via Spatially Aware Hybrid Graph Neural Networks (SA-HGNN). Next, we leverage contrastive learning to handle the imbalance problem associated with different types of extreme weather events and generate location-specific embeddings by minimizing intra-event distances between similar locations while maximizing inter-event distances across all locations. Thorough empirical studies in four utility service territories, i.e., Connecticut, Western Massachusetts, Eastern Massachusetts, and New Hampshire, demonstrate that SA-HGNN can achieve state-of-the-art performance for power outage prediction.
[LG-2] HI-MoE: Hierarchical Instance-Conditioned Mixture-of-Experts for Object Detection
链接: https://arxiv.org/abs/2604.04908
作者: Vadim Vashkelis,Natalia Trukhina
类目: Machine Learning (cs.LG)
*备注:
Abstract:Mixture-of-Experts (MoE) architectures enable conditional computation by activating only a subset of model parameters for each input. Although sparse routing has been highly effective in language models and has also shown promise in vision, most vision MoE methods operate at the image or patch level. This granularity is poorly aligned with object detection, where the fundamental unit of reasoning is an object query corresponding to a candidate instance. We propose Hierarchical Instance-Conditioned Mixture-of-Experts (HI-MoE), a DETR-style detection architecture that performs routing in two stages: a lightweight scene router first selects a scene-consistent expert subset, and an instance router then assigns each object query to a small number of experts within that subset. This design aims to preserve sparse computation while better matching the heterogeneous, instance-centric structure of detection. In the current draft, experiments are concentrated on COCO with preliminary specialization analysis on LVIS. Under these settings, HI-MoE improves over a dense DINO baseline and over simpler token-level or instance-only routing variants, with especially strong gains on small objects. We also provide an initial visualization of expert specialization patterns. We present the method, ablations, and current limitations in a form intended to support further experimental validation.
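The two-stage routing described above can be sketched as follows. This is a hedged toy, not the paper's architecture: the expert counts, feature dimension, and plain linear scorers (W_scene, W_inst) are all illustrative assumptions; only the scene-subset-then-instance-top-k structure comes from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
E, scene_k, inst_k, d = 8, 4, 2, 16   # illustrative sizes

def route(scene_feat, query_feats, W_scene, W_inst):
    # Stage 1: scene router scores all experts, keeps a scene-consistent subset.
    scene_scores = scene_feat @ W_scene            # (E,)
    subset = np.argsort(scene_scores)[-scene_k:]   # indices of kept experts
    # Stage 2: each object query picks top-inst_k experts inside that subset.
    inst_scores = query_feats @ W_inst             # (Q, E)
    masked = np.full_like(inst_scores, -np.inf)
    masked[:, subset] = inst_scores[:, subset]
    chosen = np.argsort(masked, axis=1)[:, -inst_k:]
    return subset, chosen

W_scene = rng.normal(size=(d, E))
W_inst = rng.normal(size=(d, E))
subset, chosen = route(rng.normal(size=d), rng.normal(size=(5, d)), W_scene, W_inst)
# every chosen expert lies in the scene-selected subset
assert all(c in subset for row in chosen for c in row)
print(subset, chosen)
```

The masking step is what enforces the hierarchy: instance routing can only ever activate experts the scene router has admitted, preserving sparsity at both levels.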
[LG-3] Are Latent Reasoning Models Easily Interpretable?
链接: https://arxiv.org/abs/2604.04902
作者: Connor Dilgren,Sarah Wiegreffe
类目: Machine Learning (cs.LG)
*备注: Preprint
Abstract:Latent reasoning models (LRMs) have attracted significant research interest due to their low inference cost (relative to explicit reasoning models) and theoretical ability to explore multiple reasoning paths in parallel. However, these benefits come at the cost of reduced interpretability: LRMs are difficult to monitor because they do not reason in natural language. This paper presents an investigation into LRM interpretability by examining two state-of-the-art LRMs. First, we find that latent reasoning tokens are often unnecessary for LRMs’ predictions; on logical reasoning datasets, LRMs can almost always produce the same final answers without using latent reasoning at all. This underutilization of reasoning tokens may partially explain why LRMs do not consistently outperform explicit reasoning methods and raises doubts about the stated role of these tokens in prior work. Second, we demonstrate that when latent reasoning tokens are necessary for performance, we can decode gold reasoning traces up to 65-93% of the time for correctly predicted instances. This suggests LRMs often implement the expected solution rather than an uninterpretable reasoning process. Finally, we present a method to decode a verified natural language reasoning trace from latent tokens without knowing a gold reasoning trace a priori, demonstrating that it is possible to find a verified trace for a majority of correct predictions but only a minority of incorrect predictions. Our findings highlight that current LRMs largely encode interpretable processes, and interpretability itself can be a signal of prediction correctness.
[LG-4] Data Attribution in Adaptive Learning
链接: https://arxiv.org/abs/2604.04892
作者: Amit Kiran Rege
类目: Machine Learning (cs.LG)
*备注: Work in progress
Abstract:Machine learning models increasingly generate their own training data – online bandits, reinforcement learning, and post-training pipelines for language models are leading examples. In these adaptive settings, a single training observation both updates the learner and shifts the distribution of future data the learner will collect. Standard attribution methods, designed for static datasets, ignore this feedback. We formalize occurrence-level attribution for finite-horizon adaptive learning via a conditional interventional target, prove that replay-side information cannot recover it in general, and identify a structural class in which the target is identified from logged data.
[LG-5] Optimizing LLM Prompt Engineering with DSPy Based Declarative Learning
链接: https://arxiv.org/abs/2604.04869
作者: Shiek Ruksana,Sailesh Kiran Kurra,Thipparthi Sanjay Baradwaj
类目: Machine Learning (cs.LG)
*备注: Best paper Award ,IEEE International Conference on Emerging Smart Computing and Informatics (ESCI) Pune, India. Mar 11-13, 2026
Abstract:Large Language Models (LLMs) have shown strong performance across a wide range of natural language processing tasks; however, their effectiveness is highly dependent on prompt design, structure, and embedded reasoning signals. Conventional prompt engineering methods largely rely on heuristic trial-and-error processes, which limits scalability, reproducibility, and generalization across tasks. DSPy, a declarative framework for optimizing text-processing pipelines, offers an alternative approach by enabling automated, modular, and learnable prompt construction for LLM-based applications. This paper presents a systematic study of DSPy-based declarative learning for prompt optimization, with emphasis on prompt synthesis, correction, calibration, and adaptive reasoning control. We introduce a unified DSPy LLM architecture that combines symbolic planning, gradient-free optimization, and automated module rewriting to reduce hallucinations, improve factual grounding, and avoid unnecessary prompt complexity. Experimental evaluations conducted on reasoning tasks, retrieval-augmented generation, and multi-step chain-of-thought benchmarks demonstrate consistent gains in output reliability, efficiency, and generalization across models. The results show improvements of up to 30 to 45% in factual accuracy and a reduction of approximately 25% in hallucination rates. Finally, we outline key limitations and discuss future research directions for declarative prompt optimization frameworks.
[LG-6] FairLogue: A Toolkit for Intersectional Fairness Analysis in Clinical Machine Learning Models
链接: https://arxiv.org/abs/2604.04858
作者: Nick Souligne,Vignesh Subbian
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Objective: Algorithmic fairness is essential for equitable and trustworthy machine learning in healthcare. Most fairness tools emphasize single-axis demographic comparisons and may miss compounded disparities affecting intersectional populations. This study introduces Fairlogue, a toolkit designed to operationalize intersectional fairness assessment in observational and counterfactual contexts within clinical settings. Methods: Fairlogue is a Python-based toolkit composed of three components: 1) an observational framework extending demographic parity, equalized odds, and equal opportunity difference to intersectional populations; 2) a counterfactual framework evaluating fairness under treatment-based contexts; and 3) a generalized counterfactual framework assessing fairness under interventions on intersectional group membership. The toolkit was evaluated using electronic health record data from the All of Us Controlled Tier V8 dataset in a glaucoma surgery prediction task using logistic regression with race and gender as protected attributes. Results: Observational analysis identified substantial intersectional disparities despite moderate model performance (AUROC = 0.709; accuracy = 0.651). Intersectional evaluation revealed larger fairness gaps than single-axis analyses, including demographic parity differences of 0.20 and equalized odds true positive and false positive rate gaps of 0.33 and 0.15, respectively. Counterfactual analysis using permutation-based null distributions produced unfairness (“u-value”) estimates near zero, suggesting observed disparities were consistent with chance after conditioning on covariates. Conclusion: Fairlogue provides a modular toolkit integrating observational and counterfactual methods for quantifying and evaluating intersectional bias in clinical machine learning workflows.
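The observational component can be made concrete with a small sketch of intersectional demographic parity. This is a hypothetical helper, not Fairlogue's API: it computes the positive-prediction rate per (race, gender) cell and reports the largest pairwise gap, which is how a single-axis metric extends to intersectional groups.

```python
# Hypothetical sketch of intersectional demographic parity (not Fairlogue's API):
# positive-prediction rate per (race, gender) group, then the max pairwise gap.
from collections import defaultdict

def intersectional_dp_gap(preds, race, gender):
    counts = defaultdict(lambda: [0, 0])   # group -> [positives, total]
    for p, r, g in zip(preds, race, gender):
        counts[(r, g)][0] += p
        counts[(r, g)][1] += 1
    rates = {grp: pos / tot for grp, (pos, tot) in counts.items()}
    return max(rates.values()) - min(rates.values()), rates

preds  = [1, 0, 1, 1, 0, 0, 1, 0]
race   = ["A", "A", "A", "B", "B", "B", "A", "B"]
gender = ["F", "F", "M", "F", "F", "M", "M", "M"]
gap, rates = intersectional_dp_gap(preds, race, gender)
print(rates)   # one rate per (race, gender) cell
print(gap)     # largest disparity across intersectional groups
```

Note how the intersectional gap (1.0 here) can exceed any single-axis comparison: averaging over gender within each race hides the (A, M) vs (B, M) disparity, which is exactly the compounding effect the toolkit is built to surface.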
[LG-7] he Role of Generator Access in Autoregressive Post-Training
链接: https://arxiv.org/abs/2604.04855
作者: Amit Kiran Rege
类目: Machine Learning (cs.LG)
*备注: Work in progress
Abstract:We study how generator access constrains autoregressive post-training. The central question is whether the learner is confined to fresh root-start rollouts or can return to previously built prefixes and query the next-token rule there. In the root-start regime, output sampling, generated-token log probabilities, top-k reports, and full next-token distributions along sampled trajectories all reduce to one canonical experiment, limited by the on-policy probability of reaching informative prefixes. Weak prefix control breaks this barrier, and once control is available, richer observations such as conditional sampling or logits can outperform top-1 access. Changing only the generator interface creates an exponential gap for KL-regularized outcome-reward post-training.
[LG-8] Partially deterministic sampling for compressed sensing with denoising guarantees
链接: https://arxiv.org/abs/2604.04802
作者: Yaniv Plan,Matthew S. Scott,Ozgur Yilmaz
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP); Probability (math.PR); Machine Learning (stat.ML)
*备注:
Abstract:We study compressed sensing when the sampling vectors are chosen from the rows of a unitary matrix. In the literature, these sampling vectors are typically chosen randomly; the use of randomness has enabled major empirical and theoretical advances in the field. However, in practice there are often certain crucial sampling vectors, in which case practitioners will depart from the theory and sample such rows deterministically. In this work, we derive an optimized sampling scheme for Bernoulli selectors which naturally combines random and deterministic selection of rows, thus rigorously deciding which rows should be sampled deterministically. This sampling scheme provides measurable improvements in image compressed sensing for both generative and sparse priors when compared to with-replacement and without-replacement sampling schemes, as we show with theoretical results and numerical experiments. Additionally, our theoretical guarantees feature improved sample complexity bounds compared to previous works, and novel denoising guarantees in this setting.
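The hybrid deterministic/random selection can be sketched as follows. This is a hedged illustration of the Bernoulli-selector idea only: the importance scores and the bisection calibration of the scale c are assumptions for the demo, not the paper's optimized scheme. Rows whose capped inclusion probability min(1, c·sᵢ) saturates at 1 are sampled deterministically; the rest are included independently.

```python
import random
random.seed(0)

# Sketch of a Bernoulli selector mixing deterministic and random rows:
# rows whose inclusion probability saturates at 1 are always taken,
# the rest are included independently with probability p_i = c * s_i.
def select_rows(scores, budget):
    # calibrate c by bisection so the expected sample size matches the budget:
    # sum_i min(1, c * s_i) = budget
    lo, hi = 0.0, budget / min(s for s in scores if s > 0)
    for _ in range(60):
        c = (lo + hi) / 2
        if sum(min(1.0, c * s) for s in scores) < budget:
            lo = c
        else:
            hi = c
    probs = [min(1.0, c * s) for s in scores]
    deterministic = [i for i, p in enumerate(probs) if p >= 1.0]
    randomized = [i for i, p in enumerate(probs)
                  if p < 1.0 and random.random() < p]
    return deterministic, randomized, probs

scores = [10.0, 0.5, 0.5, 0.5, 0.5, 8.0, 0.2, 0.3]  # two "crucial" rows
det, rnd, probs = select_rows(scores, budget=4)
print(det, rnd)   # crucial rows are always in det; the rest are random
```

The calibration naturally decides, rather than leaving to the practitioner, which rows should be sampled deterministically: exactly those whose score is large enough that c·sᵢ ≥ 1 under the budget.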
[LG-9] Forgetting to Witness: Efficient Federated Unlearning and Its Visible Evaluation
链接: https://arxiv.org/abs/2604.04800
作者: Houzhe Wang,Xiaojie Zhu,Chi Chen
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:With the increasing importance of data privacy and security, federated unlearning has emerged as a novel research field dedicated to ensuring that federated learning models no longer retain or leak relevant information once specific data has been deleted. In this paper, to the best of our knowledge, we propose the first complete pipeline for federated unlearning, which includes a federated unlearning approach and an evaluation framework. Our proposed federated unlearning approach ensures high efficiency and model accuracy without the need to store historical data. It effectively leverages the knowledge distillation model alongside various optimization mechanisms. Moreover, we propose a framework named Skyeye to visualize the forgetting capacity of federated unlearning models. It utilizes the federated unlearning model as the classifier integrated into a Generative Adversarial Network (GAN). Afterward, both the classifier and discriminator guide the generator in generating samples. Throughout this process, the generator learns from the classifier’s knowledge. The generator then visualizes this knowledge through sample generation. Finally, the model’s forgetting capability is evaluated based on the relevance between the deleted data and the generated samples. Comprehensive experiments are conducted to illustrate the effectiveness of the proposed federated unlearning approach and the corresponding evaluation framework.
[LG-10] Fine-Tuning Integrity for Modern Neural Networks: Structured Drift Proofs via Norm Rank and Sparsity Certificates
链接: https://arxiv.org/abs/2604.04738
作者: Zhenhang Shang,Kani Chen
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 15 pages, 3 figures
Abstract:Fine-tuning is now the primary method for adapting large neural networks, but it also introduces new integrity risks. An untrusted party can insert backdoors, change safety behavior, or overwrite large parts of a model while claiming only small updates. Existing verification tools focus on inference correctness or full-model provenance and do not address this problem. We introduce Fine-Tuning Integrity (FTI) as a security goal for controlled model evolution. An FTI system certifies that a fine-tuned model differs from a trusted base only within a policy-defined drift class. We propose Succinct Model Difference Proofs (SMDPs) as a new cryptographic primitive for enforcing these drift constraints. SMDPs provide zero-knowledge proofs that the update to a model is norm-bounded, low-rank, or sparse. The verifier cost depends only on the structure of the drift, not on the size of the model. We give concrete SMDP constructions based on random projections, polynomial commitments, and streaming linear checks. We also prove an information-theoretic lower bound showing that some form of structure is necessary for succinct proofs. Finally, we present architecture-aware instantiations for transformers, CNNs, and MLPs, together with an end-to-end system that aggregates block-level proofs into a global certificate.
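The random-projection intuition behind norm-bounded drift certificates can be sketched numerically. This is only the concentration idea, not the zero-knowledge construction: the commitment and proof machinery are omitted, and the dimensions and noise scale below are arbitrary demo choices.

```python
import numpy as np

rng = np.random.default_rng(42)

# A short Johnson-Lindenstrauss sketch G @ delta concentrates around the
# true norm of the fine-tuning drift, so a verifier can check a norm bound
# with cost depending on k, not on the parameter count d.
d = 10_000        # flattened parameter count of the update
k = 256           # sketch length, independent of d
delta = rng.normal(scale=1e-3, size=d)      # fine-tuning drift (toy)
G = rng.normal(size=(k, d)) / np.sqrt(k)    # random projection

true_norm = np.linalg.norm(delta)
sketch_norm = np.linalg.norm(G @ delta)
print(true_norm, sketch_norm)               # close with high probability
```

The relative error of the sketched norm shrinks like 1/√k, which is why the verifier cost can depend only on the drift structure rather than the model size.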
[LG-11] Anticipatory Reinforcement Learning: From Generative Path-Laws to Distributional Value Functions
链接: https://arxiv.org/abs/2604.04662
作者: Daniel Bloch
类目: Machine Learning (cs.LG); Mathematical Finance (q-fin.MF); Pricing of Securities (q-fin.PR); Statistical Finance (q-fin.ST)
*备注:
Abstract:This paper introduces Anticipatory Reinforcement Learning (ARL), a novel framework designed to bridge the gap between non-Markovian decision processes and classical reinforcement learning architectures, specifically under the constraint of a single observed trajectory. In environments characterised by jump-diffusions and structural breaks, traditional state-based methods often fail to capture the essential path-dependent geometry required for accurate foresight. We resolve this by lifting the state space into a signature-augmented manifold, where the history of the process is embedded as a dynamical coordinate. By utilising a self-consistent field approach, the agent maintains an anticipated proxy of the future path-law, allowing for a deterministic evaluation of expected returns. This transition from stochastic branching to a single-pass linear evaluation significantly reduces computational complexity and variance. We prove that this framework preserves fundamental contraction properties and ensures stable generalisation even in the presence of heavy-tailed noise. Our results demonstrate that by grounding reinforcement learning in the topological features of path-space, agents can achieve proactive risk management and superior policy stability in highly volatile, continuous-time environments.
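The "signature-augmented" lift can be made concrete with a low-order path signature. The helper below computes the exact depth-2 signature of a piecewise-linear path; the paper's lift may truncate at a different depth, so this is an illustrative primitive rather than the ARL state construction itself.

```python
import numpy as np

# Depth-2 signature of a piecewise-linear path sampled at (T, d) points:
# level 1 is the total increment; level 2 collects the iterated integrals
# \int (X_i - X_i(0)) dX_j, whose antisymmetric part is the signed (Levy) area.
def signature_depth2(path):
    path = np.asarray(path, dtype=float)
    inc = np.diff(path, axis=0)            # per-segment increments
    level1 = path[-1] - path[0]
    run = np.cumsum(inc, axis=0) - inc     # increment accumulated before each segment
    level2 = run.T @ inc + 0.5 * inc.T @ inc
    return level1, level2

path = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]   # move right, then up
l1, l2 = signature_depth2(path)
print(l1)   # total displacement [1, 1]
print(l2)   # l2[0,1]=1.0, l2[1,0]=0.0: order of moves is encoded
```

Note that reversing the order of the two moves swaps l2[0,1] and l2[1,0] while leaving l1 unchanged: the signature embeds the history of the path, which is exactly the path-dependent information a Markovian state discards.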
[LG-12] From Curiosity to Caution: Mitigating Reward Hacking for Best-of-N with Pessimism
链接: https://arxiv.org/abs/2604.04648
作者: Zhuohao Yu,Zhiwei Steven Wu,Adam Block
类目: Machine Learning (cs.LG)
*备注: 29 pages, 8 figures
Abstract:Inference-time compute scaling has emerged as a powerful paradigm for improving language model performance on a wide range of tasks, but the question of how best to use the additional compute remains open. A popular approach is Best-of-N (BoN) sampling, where N candidate responses are generated, scored according to a reward model, and the highest-scoring response is selected. While this approach can improve performance, it is vulnerable to reward hacking, where performance degrades as N increases due to the selection of responses that exploit imperfections in the reward model instead of genuinely improving generation quality. Prior attempts to mitigate reward hacking, via stronger reward models or heavy-handed distributional regularization, either fail to fully address over-optimization or are too conservative to exploit additional compute. In this work, we explore the principle of pessimism in RL, which uses lower confidence bounds on value estimates to avoid out-of-distribution (OOD) actions with uncertain reward estimates. Our approach, termed caution, can be seen as the reverse of curiosity: where curiosity rewards prediction error as a signal of novelty, caution penalizes prediction error as a signal of distributional uncertainty. Practically, caution trains an error model on typical responses and uses its prediction error to lower reward estimates for atypical ones. Our extensive empirical evaluation demonstrates that caution is a simple, computationally efficient approach that substantially mitigates reward hacking in BoN sampling. We also provide a theoretical analysis in a simplified linear setting, which shows that caution provably improves over the standard BoN approach. Together, our results not only establish caution as a practical solution to reward hacking, but also provide evidence that curiosity-based approaches can be a general OOD detection technique in LLM settings.
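The caution-penalized selection rule can be sketched on toy data. Everything below is illustrative (the feature vectors, the clip-based "error model", and the penalty weight are stand-ins, not the paper's setup); the point is only the mechanism: subtract the error model's reconstruction error from the reward before taking the argmax.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sketch of "caution" in Best-of-N: an error model fit on typical
# responses reconstructs in-distribution features well; its reconstruction
# error on a candidate is used to penalize the reward estimate.
def cautious_bon(features, rewards, g_predict, lam=1.0):
    penalties = np.array([np.linalg.norm(f - g_predict(f)) for f in features])
    scores = rewards - lam * penalties     # lower-confidence-style score
    return int(np.argmax(scores))

# stand-in error model: identity on small-norm (typical) features,
# poor on atypical large-norm ones
g_predict = lambda f: np.clip(f, -1.0, 1.0)

typical = rng.uniform(-1, 1, size=(7, 4))      # reconstructed exactly
atypical = np.array([[5.0, -5.0, 5.0, -5.0]])  # large reconstruction error
features = np.vstack([typical, atypical])
rewards = np.array([0.2, 0.5, 0.4, 0.3, 0.6, 0.1, 0.5, 2.0])  # hacked reward last

print(cautious_bon(features, rewards, g_predict, lam=0.0))  # plain BoN -> 7
print(cautious_bon(features, rewards, g_predict, lam=1.0))  # caution -> 4
```

With lam=0 the selector falls for the suspiciously high reward on the atypical candidate; with the penalty on, the best in-distribution candidate wins instead.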
[LG-13] Dynamic Free-Rider Detection in Federated Learning via Simulated Attack Patterns KDD2026 ECML
链接: https://arxiv.org/abs/2604.04611
作者: Motoki Nakamura
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Submitted to ECML PKDD 2026 (under review)
Abstract:Federated learning (FL) enables multiple clients to collaboratively train a global model by aggregating local updates without sharing private data. However, FL often faces the challenge of free-riders, clients who submit fake model parameters without performing actual training to obtain the global model without contributing. Chen et al. proposed a free-rider detection method based on the weight evolving frequency (WEF) of model parameters. This detection approach is a leading candidate for practical free-rider detection methods, as it requires neither a proxy dataset nor pre-training. Nevertheless, it struggles to detect "dynamic" free-riders who behave honestly in early rounds and later switch to free-riding, particularly under global-model-mimicking attacks such as the delta weight attack and our newly proposed adaptive WEF-camouflage attack. In this paper, we propose a novel detection method S2-WEF that simulates the WEF patterns of potential global-model-based attacks on the server side using previously broadcasted global models, and identifies clients whose submitted WEF patterns resemble the simulated ones. To handle a variety of free-rider attack strategies, S2-WEF further combines this simulation-based similarity score with a deviation score computed from mutual comparisons among submitted WEFs, and separates benign and free-rider clients by two-dimensional clustering and per-score classification. This method enables dynamic detection of clients that transition into free-riders during training without proxy datasets or pre-training. We conduct extensive experiments across three datasets and five attack types, demonstrating that S2-WEF achieves higher robustness than existing approaches.
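A weight-evolving-frequency signature can be sketched as follows. The definition below (per-parameter frequency of update-direction flips across rounds) is an illustrative simplification; the exact WEF statistic in Chen et al. and in this paper may differ.

```python
import numpy as np

# Illustrative WEF-style signature: count, per parameter, how often the
# update direction flips across communication rounds. Honest SGD updates
# flip often; smoothly fabricated updates barely flip at all.
def wef(weight_history):
    W = np.asarray(weight_history)          # (rounds, params)
    deltas = np.sign(np.diff(W, axis=0))    # per-round update direction
    flips = deltas[1:] != deltas[:-1]       # did the direction change?
    return flips.mean(axis=0)               # flip frequency per parameter

rng = np.random.default_rng(0)
rounds, params = 10, 5
honest = np.cumsum(rng.normal(size=(rounds, params)), axis=0)        # noisy training
free_rider = np.tile(np.arange(rounds)[:, None], (1, params)) * 0.1  # smooth fake
print(wef(honest))       # strictly positive flip frequencies
print(wef(free_rider))   # all zeros: the fake updates never change direction
```

The separation between these two signatures is the signal the server-side simulation exploits: it can generate the expected WEF pattern of a global-model-based attack itself and flag submissions that resemble it.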
[LG-14] SAIL: Scene-aware Adaptive Iterative Learning for Long-Tail Trajectory Prediction in Autonomous Vehicles
链接: https://arxiv.org/abs/2604.04573
作者: Bin Rao,Haicheng Liao,Chengyue Wang,Keqiang Li,Zhenning Li,Hai Yang
类目: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:
Abstract:Autonomous vehicles (AVs) rely on accurate trajectory prediction for safe navigation in diverse traffic environments, yet existing models struggle with long-tail scenarios: rare but safety-critical events characterized by abrupt maneuvers, high collision risks, and complex interactions. These challenges stem from data imbalance, inadequate definitions of long-tail trajectories, and suboptimal learning strategies that prioritize common behaviors over infrequent ones. To address this, we propose SAIL, a novel framework that systematically tackles the long-tail problem by first defining and modeling trajectories across three key attribute dimensions: prediction error, collision risk, and state complexity. Our approach then synergizes an attribute-guided augmentation and feature extraction process with a highly adaptive contrastive learning strategy. This strategy employs a continuous cosine momentum schedule, similarity-weighted hard-negative mining, and a dynamic pseudo-labeling mechanism based on evolving feature clustering. Furthermore, it incorporates a focusing mechanism to intensify learning on hard-positive samples within each identified class. This comprehensive design enables SAIL to excel at identifying and forecasting diverse and challenging long-tail events. Extensive evaluations on the nuScenes and ETH/UCY datasets demonstrate SAIL’s superior performance, achieving up to 28.8% reduction in prediction error on the hardest 1% of long-tail samples compared to state-of-the-art baselines, while maintaining competitive accuracy across all scenarios. This framework advances reliable AV trajectory prediction in real-world, mixed-autonomy settings.
[LG-15] Safe and Near-Optimal Gate Control: A Case Study from the Danish West Coast
链接: https://arxiv.org/abs/2604.04545
作者: Martin Kristjansen(Aalborg University),Kim Guldstrand Larsen(Aalborg University),Marius Mikučionis(Aalborg University),Christian Schilling(Aalborg University)
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: In Proceedings MARS 2026, arXiv:2604.03053
Abstract:Ringkoebing Fjord is an inland water basin on the Danish west coast separated from the North Sea by a set of gates used to control the amount of water entering and leaving the fjord. Currently, human operators decide when and how many gates to open or close for controlling the fjord’s water level, with the goal to satisfy a range of conflicting safety and performance requirements such as keeping the water level in a target range, allowing maritime traffic, and enabling fish migration. We model this setup as a digital twin in Uppaal Stratego. We then use this digital twin along with forecasts of the sea level and the wind speed to learn a gate controller in an online fashion. We evaluate the learned controllers under different sea-level scenarios, representing normal tidal behavior, high waters, and low waters. Our evaluation demonstrates that, unlike a baseline controller, the learned controllers satisfy the safety requirements, while performing similarly regarding the other requirements.
[LG-16] Beyond Imbalance Ratio: Data Characteristics as Critical Moderators of Oversampling Method Selection
链接: https://arxiv.org/abs/2604.04541
作者: Yuwen Jiang,Songyun Ye
类目: Machine Learning (cs.LG)
*备注:
Abstract:The prevailing IR-threshold paradigm posits a positive correlation between imbalance ratio (IR) and oversampling effectiveness, yet this assumption remains empirically unsubstantiated through controlled experimentation. We conducted 12 controlled experiments (N 100 dataset variants) that systematically manipulated IR while holding data characteristics (class separability, cluster structure) constant via algorithmic generation of Gaussian mixture datasets. Two additional validation experiments examined ceiling effects and metric-dependence. All methods were evaluated on 17 real-world datasets from OpenML. Upon controlling for confounding variables, IR exhibited a weak to moderate negative correlation with oversampling benefits. Class separability emerged as a substantially stronger moderator, accounting for significantly more variance in method effectiveness than IR alone. We propose a ‘Context Matters’ framework that integrates IR, class separability, and cluster structure to provide evidence-based selection criteria for practitioners.
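The controlled-generation idea behind the experiments can be sketched directly: hold class separability (distance between class means, in units of the shared standard deviation) fixed while varying the imbalance ratio. The sizes and separability value below are illustrative, not the study's exact experimental grid.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch: generate two Gaussian classes with a fixed mean separation
# ("separability") while only the imbalance ratio (IR) changes.
def make_dataset(n_major, ir, separability, d=2):
    n_minor = max(1, int(round(n_major / ir)))
    mu_major = np.zeros(d)
    mu_minor = np.full(d, separability / np.sqrt(d))  # ||mu_minor - mu_major|| = separability
    X = np.vstack([rng.normal(loc=mu_major, size=(n_major, d)),
                   rng.normal(loc=mu_minor, size=(n_minor, d))])
    y = np.concatenate([np.zeros(n_major), np.ones(n_minor)])
    return X, y

# same separability, different imbalance ratios
for ir in (2, 10, 50):
    X, y = make_dataset(1000, ir, separability=3.0)
    print(ir, int(y.sum()))   # minority count shrinks as IR grows
```

Because the mean separation is pinned by construction, any change in oversampling benefit across these datasets is attributable to IR alone, which is the confound control the paper's experiments rely on.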
[LG-17] FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
链接: https://arxiv.org/abs/2604.04539
作者: Donghu Kim,Youngdo Lee,Minho Park,Kinam Kim,I Made Aswin Nahendra,Takuma Seno,Sehee Min,Daniel Palenicek,Florian Vogt,Danica Kragic,Jan Peters,Jaegul Choo,Hojoon Lee
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: preprint, 40pages
Abstract:Reinforcement learning (RL) is a core approach for robot control when expert demonstrations are unavailable. On-policy methods such as Proximal Policy Optimization (PPO) are widely used for their stability, but their reliance on narrowly distributed on-policy data limits accurate policy evaluation in high-dimensional state and action spaces. Off-policy methods can overcome this limitation by learning from a broader state-action distribution, yet suffer from slow convergence and instability, as fitting a value function over diverse data requires many gradient updates, causing critic errors to accumulate through bootstrapping. We present FlashSAC, a fast and stable off-policy RL algorithm built on Soft Actor-Critic. Motivated by scaling laws observed in supervised learning, FlashSAC sharply reduces gradient updates while compensating with larger models and higher data throughput. To maintain stability at increased scale, FlashSAC explicitly bounds weight, feature, and gradient norms, curbing critic error accumulation. Across over 60 tasks in 10 simulators, FlashSAC consistently outperforms PPO and strong off-policy baselines in both final performance and training efficiency, with the largest gains on high-dimensional tasks such as dexterous manipulation. In sim-to-real humanoid locomotion, FlashSAC reduces training time from hours to minutes, demonstrating the promise of off-policy RL for sim-to-real transfer.
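The norm-bounding ingredient can be sketched as a simple projection back onto a fixed-norm ball after each update. The cap values and tensor shapes below are illustrative assumptions; FlashSAC's actual bounds on weights, features, and gradients are part of its training loop, not reproduced here.

```python
import numpy as np

# Minimal sketch of norm bounding: rescale any tensor whose norm exceeds
# a cap, leaving in-bound tensors untouched. Applied to weights, features,
# and gradients, this curbs the error accumulation the abstract describes.
def bound_norm(arr, max_norm):
    n = np.linalg.norm(arr)
    return arr if n <= max_norm else arr * (max_norm / n)

rng = np.random.default_rng(0)
W = bound_norm(rng.normal(size=(64, 64)) * 10.0, max_norm=5.0)   # oversized weights
g = bound_norm(rng.normal(size=(64, 64)) * 100.0, max_norm=1.0)  # exploding gradient
print(np.linalg.norm(W), np.linalg.norm(g))   # both at or below their caps
```

The projection is cheap (one norm and one scalar multiply per tensor), which matters in a regime that compensates for fewer gradient updates with larger models and higher data throughput.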
[LG-18] Learning from Equivalence Queries Revisited
链接: https://arxiv.org/abs/2604.04535
作者: Mark Braverman,Roi Livni,Yishay Mansour,Shay Moran,Kobbi Nissim
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Information Theory (cs.IT)
*备注:
Abstract:Modern machine learning systems, such as generative models and recommendation systems, often evolve through a cycle of deployment, user interaction, and periodic model updates. This differs from standard supervised learning frameworks, which focus on loss or regret minimization over a fixed sequence of prediction tasks. Motivated by this setting, we revisit the classical model of learning from equivalence queries, introduced by Angluin (1988). In this model, a learner repeatedly proposes hypotheses and, when a deployed hypothesis is inadequate, receives a counterexample. Under fully adversarial counterexample generation, however, the model can be overly pessimistic. In addition, most prior work assumes a full-information setting, where the learner also observes the correct label of the counterexample, an assumption that is not always natural. We address these issues by restricting the environment to a broad class of less adversarial counterexample generators, which we call symmetric. Informally, such generators choose counterexamples based only on the symmetric difference between the hypothesis and the target. This class captures natural mechanisms such as random counterexamples (Angluin and Dohrn, 2017; Bhatia, 2021; Chase, Freitag, and Reyzin, 2024), as well as generators that return the simplest counterexample according to a prescribed complexity measure. Within this framework, we study learning from equivalence queries under both full-information and bandit feedback. We obtain tight bounds on the number of learning rounds in both settings and highlight directions for future work. Our analysis combines a game-theoretic view of symmetric adversaries with adaptive weighting methods and minimax arguments.
[LG-19] Isokinetic Flow Matching for Pathwise Straightening of Generative Flows
链接: https://arxiv.org/abs/2604.04491
作者: Tauhid Khan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Flow Matching (FM) constructs linear conditional probability paths, but the learned marginal velocity field inevitably exhibits strong curvature due to trajectory superposition. This curvature severely inflates numerical truncation errors, bottlenecking few-step sampling. To overcome this, we introduce Isokinetic Flow Matching (Iso-FM), a lightweight, Jacobian-free dynamical regularizer that directly penalizes pathwise acceleration. By using a self-guided finite-difference approximation of the material derivative Dv/Dt, Iso-FM enforces local velocity consistency without requiring auxiliary encoders or expensive second-order autodifferentiation. Operating as a pure plug-and-play addition to single-stage FM training, Iso-FM dramatically improves few-step generation. On CIFAR-10 (DiT-S/2), Iso-FM slashes conditional non-OT FID at 2 steps from 78.82 to 27.13 - a 2.9x relative efficiency gain - and reaches a best-observed FID at 4 steps of 10.23. These results firmly establish acceleration regularization as a principled, compute-efficient mechanism for fast generative sampling.
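The self-guided finite-difference approximation of the material derivative described above can be sketched in a few lines: probe the learned velocity field one Euler step along its own flow and measure the squared change in velocity. This is a hedged minimal sketch under our own naming (`accel_penalty`, `straight`, `curved` are hypothetical), not the authors' implementation:

```python
def accel_penalty(v, x, t, eps=1e-3):
    # Jacobian-free finite-difference estimate of ||Dv/Dt||^2, where
    # Dv/Dt = dv/dt + (v . grad)v is the pathwise acceleration.
    vx = v(x, t)
    x_next = [xi + eps * vi for xi, vi in zip(x, vx)]  # follow the flow one step
    v_next = v(x_next, t + eps)
    return sum((b - a) ** 2 for a, b in zip(vx, v_next)) / eps ** 2

# a constant (straight-line) velocity field incurs zero penalty;
# a time-curved field does not
straight = lambda x, t: [1.0, -2.0]
curved = lambda x, t: [1.0 + t, -2.0]
```

Penalizing this quantity during training encourages locally constant velocity along trajectories, which is the straightening effect the abstract attributes to Iso-FM.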
[LG-20] Generative modeling of granular flow on inclined planes using conditional flow matching
链接: https://arxiv.org/abs/2604.04453
作者: Xuyang Li,Rui Li,Teng Man,Yimin Lu
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:
Abstract:Granular flows govern many natural and industrial processes, yet their interior kinematics and mechanics remain largely unobservable, as experiments access only boundaries or free surfaces. Conventional numerical simulations are computationally expensive for fast inverse reconstruction, and deterministic models tend to collapse to over-smoothed mean predictions in ill-posed settings. This study, to the best of the authors’ knowledge, presents the first conditional flow matching (CFM) framework for granular-flow reconstruction from sparse boundary observations. Trained on high-fidelity particle-resolved discrete element simulations, the generative model is guided at inference by a differentiable forward operator with a sparsity-aware gradient guidance mechanism, which enforces measurement consistency without hyperparameter tuning and prevents unphysical velocity predictions in non-material regions. A physics decoder maps the reconstructed velocity fields to stress states and energy fluctuation quantities, including mean stress, deviatoric stress, and granular temperature. The framework accurately recovers interior flow fields from full observation to only 16% of the informative window, and it remains effective under strongly diluted spatial resolution with only 11% of data. It also outperforms a deterministic CNN baseline in the most ill-posed reconstruction regime and provides spatially resolved uncertainty estimates through ensemble generation. These results demonstrate that conditional generative modeling offers a practical route for non-invasive inference of hidden bulk mechanics in granular media, with broader applicability for inverse problems in particulate and multiphase systems.
[LG-21] TinyNina: A Resource-Efficient Edge-AI Framework for Sustainable Air Quality Monitoring via Intra-Image Satellite Super-Resolution
链接: https://arxiv.org/abs/2604.04445
作者: Prasanjit Dey,Zachary Yahn,Bianca Schoen-Phelan,Soumyabrata Dev
类目: Machine Learning (cs.LG)
*备注: This manuscript is currently under review at IEEE Access
Abstract:Nitrogen dioxide (NO_2) is a primary atmospheric pollutant and a significant contributor to respiratory morbidity and urban climate-related challenges. While satellite platforms like Sentinel-2 provide global coverage, their native spatial resolution often limits the precision required for fine-grained NO_2 assessment. To address this, we propose TinyNina, a resource-efficient Edge-AI framework specifically engineered for sustainable environmental monitoring. TinyNina implements a novel intra-image learning paradigm that leverages the multi-spectral hierarchy of Sentinel-2 as internal training labels, effectively eliminating the dependency on costly and often unavailable external high-resolution reference datasets. The framework incorporates wavelength-specific attention gates and depthwise separable convolutions to preserve pollutant-sensitive spectral features while maintaining an ultra-lightweight footprint of only 51K parameters. Experimental results, validated against 3,276 matched satellite-ground station pairs, demonstrate that TinyNina achieves a state-of-the-art Mean Absolute Error (MAE) of 7.4 μg/m^3. This performance represents a 95% reduction in computational overhead and 47× faster inference compared to high-capacity models such as EDSR and RCAN. By prioritizing task-specific utility and architectural efficiency, TinyNina provides a scalable, low-latency solution for real-time air quality monitoring in smart city infrastructures.
[LG-22] Eliminating Vendor Lock-In in Quantum Machine Learning via Framework-Agnostic Neural Networks
链接: https://arxiv.org/abs/2604.04414
作者: Poornima Kumaresan,Shwetha Singaravelu,Lakshmi Rajendran,Santhosh Sivasubramani
类目: Emerging Technologies (cs.ET); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:
Abstract:Quantum machine learning (QML) stands at the intersection of quantum computing and artificial intelligence, offering the potential to solve problems that remain intractable for classical methods. However, the current landscape of QML software frameworks suffers from severe fragmentation: models developed in TensorFlow Quantum cannot execute on PennyLane backends, circuits authored in Qiskit Machine Learning cannot be deployed to Amazon Braket hardware, and researchers who invest in one ecosystem face prohibitive switching costs when migrating to another. This vendor lock-in impedes reproducibility, limits hardware access, and slows the pace of scientific discovery. In this paper, we present a framework-agnostic quantum neural network (QNN) architecture that abstracts away vendor-specific interfaces through a unified computational graph, a hardware abstraction layer (HAL), and a multi-framework export pipeline. The core architecture supports simultaneous integration with TensorFlow, PyTorch, and JAX as classical co-processors, while the HAL provides transparent access to IBM Quantum, Amazon Braket, Azure Quantum, IonQ, and Rigetti backends through a single application programming interface (API). We introduce three pluggable data encoding strategies (amplitude, angle, and instantaneous quantum polynomial encoding) that are compatible with all supported backends. An export module leveraging Open Neural Network Exchange (ONNX) metadata enables lossless circuit translation across Qiskit, Cirq, PennyLane, and Braket representations. We benchmark our framework on the Iris, Wine, and MNIST-4 classification tasks, demonstrating training time parity (within 8% overhead) compared to native framework implementations, while achieving identical classification accuracy.
[LG-23] ReinVBC: A Model-based Reinforcement Learning Approach to Vehicle Braking Controller
链接: https://arxiv.org/abs/2604.04401
作者: Haoxin Lin,Junjie Zhou,Daheng Xu,Yang Yu
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:The braking system, the key module ensuring the safety and steerability of modern vehicles, relies on extensive manual calibration during production. Reducing labor and time consumption while maintaining Vehicle Braking Controller (VBC) performance greatly benefits the vehicle industry. Model-based methods in offline reinforcement learning, which facilitate policy exploration within a data-driven dynamics model, offer a promising solution for addressing real-world control tasks. This work proposes ReinVBC, which applies an offline model-based reinforcement learning approach to the vehicle braking control problem. We introduce useful engineering designs into the paradigm of model learning and utilization to obtain a reliable vehicle dynamics model and a capable braking policy. Several results demonstrate the capability of our method in real-world vehicle braking and its potential to replace the production-grade anti-lock braking system.
[LG-24] Finite-Time Analysis of Q-Value Iteration for General-Sum Stackelberg Games
链接: https://arxiv.org/abs/2604.04394
作者: Narim Jeong,Donghwan Lee
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 8 pages
Abstract:Reinforcement learning has been successful both empirically and theoretically in single-agent settings, but extending these results to multi-agent reinforcement learning in general-sum Markov games remains challenging. This paper studies the convergence of Stackelberg Q-value iteration in two-player general-sum Markov games from a control-theoretic perspective. We introduce a relaxed policy condition tailored to the Stackelberg setting and model the learning dynamics as a switching system. By constructing upper and lower comparison systems, we establish finite-time error bounds for the Q-functions and characterize their convergence properties. Our results provide a novel control-theoretic perspective on Stackelberg learning. Moreover, to the best of the authors’ knowledge, this paper offers the first finite-time convergence guarantees for Q-value iteration in general-sum Markov games under Stackelberg interactions.
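The Stackelberg backup underlying Q-value iteration can be illustrated on a single state as a bilevel argmax: the leader commits to an action first, anticipating the follower's best response. The sketch below is our own toy (the names `stackelberg_step`, `QL`, `QF` are hypothetical), and it assumes the follower breaks ties by lowest action index:

```python
def stackelberg_step(QL, QF):
    # One Stackelberg backup for a single state: the leader picks the
    # action maximizing its own Q-value, anticipating the follower's
    # best response b*(a) = argmax_b QF[a][b].
    best = None
    for a in range(len(QL)):
        b = max(range(len(QF[a])), key=lambda j: QF[a][j])  # follower best-responds
        if best is None or QL[a][b] > best[2]:
            best = (a, b, QL[a][b])
    return best  # (leader action, follower response, leader value)

# toy general-sum stage game: leader payoffs QL, follower payoffs QF
leader_a, follower_b, value = stackelberg_step([[3, 1], [2, 4]], [[1, 2], [3, 0]])
```

In the toy game the leader forgoes its globally largest payoff (4) because the follower's best response to that action would steer play elsewhere, illustrating why Stackelberg values differ from plain max-max backups.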
[LG-25] CPT: Controllable and Editable Design Variations with Language Models NEURIPS2025
链接: https://arxiv.org/abs/2604.04380
作者: Karthik Suresh,Amine Ben Khalifa,Li Zhang,Wei-ting Hsu,Fangzheng Wu,Vinay More,Asim Kadav
类目: Machine Learning (cs.LG)
*备注: 18 pages, 6 figures, Accepted at NeurIPS 2025 Workshop on Generative and Protective AI for Content Creation (GenProCC 2025)
Abstract:Designing visually diverse and high-quality designs remains a manual, time-consuming process, limiting scalability and personalization in creative workflows. We present a system for generating editable design variations using a decoder-only language model, the Creative Pre-trained Transformer (CPT), trained to predict visual style attributes in design templates. At the core of our approach is a new representation called Creative Markup Language (CML), a compact, machine-learning-friendly format that captures canvas-level structure, page layout, and element-level details (text, images, and vector graphics), including both content and style. We fine-tune CPT on a large corpus of design templates authored by professional designers, enabling it to learn meaningful, context-aware predictions for attributes such as color schemes and font choices. The model produces semantically structured and stylistically coherent outputs, preserving internal consistency across elements. Unlike generative image models, our system yields fully editable design documents rather than pixel-only images, allowing users to iterate and personalize within a design editor. In experiments, our approach generates contextual color and font variations for existing templates and shows promise in adjusting layouts while maintaining design principles.
[LG-26] Adversarial Robustness Analysis of Cloud-Assisted Autonomous Driving Systems
链接: https://arxiv.org/abs/2604.04349
作者: Maher Al Islam,Amr S. El-Wakeel
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Autonomous vehicles increasingly rely on deep learning-based perception and control, which impose substantial computational demands. Cloud-assisted architectures offload these functions to remote servers, enabling enhanced perception and coordinated decision-making through the Internet of Vehicles (IoV). However, this paradigm introduces cross-layer vulnerabilities, where adversarial manipulation of perception models and network impairments in the vehicle-cloud link can jointly undermine safety-critical autonomy. This paper presents a hardware-in-the-loop IoV testbed that integrates real-time perception, control, and communication to evaluate such vulnerabilities in cloud-assisted autonomous driving. A YOLOv8-based object detector deployed on the cloud is subjected to white-box adversarial attacks using the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD), while network adversaries induce delay and packet loss in the vehicle-cloud loop. Results show that adversarial perturbations significantly degrade perception performance, with PGD reducing detection precision and recall from 0.73 and 0.68 in the clean baseline to 0.22 and 0.15 at ε = 0.04. Network delays of 150-250 ms, corresponding to transient losses of approximately 3-4 frames, and packet loss rates of 0.5-5% further destabilize closed-loop control, leading to delayed actuation and rule violations. These findings highlight the need for cross-layer resilience in cloud-assisted autonomous driving systems.
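FGSM itself is a one-line perturbation, x_adv = x + ε·sign(∇_x L). As a hedged illustration (not the paper's YOLOv8 pipeline), the sketch below applies it to a logistic-regression surrogate, where the input gradient has the closed form (p - y)·w; all names (`fgsm_linear` etc.) are ours:

```python
import math

def fgsm_linear(x, y, w, b, eps):
    # FGSM on a logistic model p = sigmoid(w.x + b) with loss
    # L = -[y log p + (1-y) log(1-p)], whose input gradient is (p - y) * w.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    grad = [(p - y) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

# each coordinate moves by exactly eps in the loss-increasing direction
x_adv = fgsm_linear([1.0, 2.0], 1, [1.0, -2.0], 0.0, 0.1)
```

PGD, the stronger attack in the paper, iterates this step with a projection back into the ε-ball around the clean input.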
[LG-27] Deep Kuratowski Embedding Neural Networks for Wasserstein Metric Learning
链接: https://arxiv.org/abs/2604.04343
作者: Andrew Qing He
类目: Machine Learning (cs.LG)
*备注:
Abstract:Computing pairwise Wasserstein distances is a fundamental bottleneck in data analysis pipelines. Motivated by the classical Kuratowski embedding theorem, we propose two neural architectures for learning to approximate the Wasserstein-2 distance (W_2) from data. The first, DeepKENN, aggregates distances across all intermediate feature maps of a CNN using learnable positive weights. The second, ODE-KENN, replaces the discrete layer stack with a Neural ODE, embedding each input into the infinite-dimensional Banach space C^1([0,1], \mathbb{R}^d) and providing implicit regularization via trajectory smoothness. Experiments on MNIST with exact precomputed W_2 distances show that ODE-KENN achieves a 28% lower test MSE than the single-layer baseline and 18% lower than DeepKENN under matched parameter counts, while exhibiting a smaller generalization gap. The resulting fast surrogate can replace the expensive W_2 oracle in downstream pairwise distance computations.
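The classical Kuratowski idea motivating these architectures: embed each point by its distance profile against a fixed anchor set, after which the sup-norm between profiles lower-bounds the original metric (by the triangle inequality) and attains it when an anchor coincides with one of the points. A minimal sketch on the real line, with hypothetical names:

```python
def kuratowski_embed(x, anchors, d):
    # Kuratowski-style embedding: x -> (d(x, a_1), ..., d(x, a_m))
    return [d(x, a) for a in anchors]

def sup_dist(u, v):
    # sup-norm between two distance profiles
    return max(abs(a - b) for a, b in zip(u, v))

# on the real line with d(a, b) = |a - b|, the sup-norm recovers the
# true distance exactly when one of the points is itself an anchor
d = lambda a, b: abs(a - b)
e2 = kuratowski_embed(2.0, [2.0, 5.0, 9.0], d)
e5 = kuratowski_embed(5.0, [2.0, 5.0, 9.0], d)
```

The paper's networks can be read as learning such distance profiles (and their aggregation weights) end-to-end rather than fixing the anchors by hand.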
[LG-28] Generative models for decision-making under distributional shift
链接: https://arxiv.org/abs/2604.04342
作者: Xiuyuan Cheng,Yunqin Zhu,Yao Xie
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Under review for INFORMS TutORials in Operations Research, 2026
Abstract:Many data-driven decision problems are formulated using a nominal distribution estimated from historical data, while performance is ultimately determined by a deployment distribution that may be shifted, context-dependent, partially observed, or stress-induced. This tutorial presents modern generative models, particularly flow- and score-based methods, as mathematical tools for constructing decision-relevant distributions. From an operations research perspective, their primary value lies not in unconstrained sample synthesis but in representing and transforming distributions through transport maps, velocity fields, score fields, and guided stochastic dynamics. We present a unified framework based on pushforward maps, continuity, Fokker-Planck equations, Wasserstein geometry, and optimization in probability space. Within this framework, generative models can be used to learn nominal uncertainty, construct stressed or least-favorable distributions for robustness, and produce conditional or posterior distributions under side information and partial observation. We also highlight representative theoretical guarantees, including forward-reverse convergence for iterative flow models, first-order minimax analysis in transport-map space, and error-transfer bounds for posterior sampling with generative priors. The tutorial provides a principled introduction to using generative models for scenario generation, robust decision-making, uncertainty quantification, and related problems under distributional shift.
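The pushforward view at the core of this framework can be made concrete with the simplest possible transport map: inverse-CDF sampling, where a uniform source law is pushed through a map T to produce a target distribution. This toy is ours, not from the tutorial; the name `pushforward_sample` and the Exponential(2) target are illustrative assumptions:

```python
import math, random

def pushforward_sample(transport, n, seed=0):
    # sample from the pushforward T#Uniform(0,1): draw u, emit T(u)
    rng = random.Random(seed)
    return [transport(rng.random()) for _ in range(n)]

# inverse-CDF transport for Exponential(rate=2): T(u) = -ln(1 - u) / 2,
# so the sample mean should be close to 1/2
exp_samples = pushforward_sample(lambda u: -math.log(1.0 - u) / 2.0, 10_000)
```

Flow- and score-based generative models generalize exactly this construction: T becomes a learned velocity field or reverse diffusion integrated over time rather than a closed-form inverse CDF.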
[LG-29] How Long short-term memory artificial neural network synthetic data and fine-tuning improve the classification of raw EEG data
链接: https://arxiv.org/abs/2604.04316
作者: Albert Nasybullin,Vladimir Maksimenko,Semen Kurkin
类目: Machine Learning (cs.LG)
*备注: 4 pages, 4 figures, 2 tables
Abstract:In this paper, we discuss a Machine Learning pipeline for the classification of EEG data. We propose a combination of synthetic data generation, long short-term memory artificial neural network (LSTM), and fine-tuning to solve classification problems for experiments with implicit visual stimuli, such as the Necker cube with different levels of ambiguity. The developed approach improved the quality of classification models for raw EEG data.
[LG-30] Convolutional Neural Network and Adversarial Autoencoder in EEG images classification
链接: https://arxiv.org/abs/2604.04313
作者: Albert Nasybullin,Semen Kurkin
类目: Machine Learning (cs.LG)
*备注: 4 pages, 6 figures
Abstract:In this paper, we consider applying computer vision algorithms for the classification problem one faces in neuroscience during EEG data analysis. Our approach is to apply a combination of computer vision and neural network methods to solve human brain activity classification problems during hand movement. We pre-processed raw EEG signals and generated 2D EEG topograms. Later, we developed supervised and semi-supervised neural networks to classify different motor cortex activities.
[LG-31] Out-of-Air Computation: Enabling Structured Extraction from Wireless Superposition
链接: https://arxiv.org/abs/2604.04312
作者: Seyed Mohammad Azimi-Abarghouyi
类目: Information Theory (cs.IT); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:Over-the-air computation (AirComp) has traditionally been built on the principle of pre-embedding computation into transmitted waveforms or on exploiting massive antenna arrays, often requiring the wireless multiple-access channel (MAC) to operate under conditions that approximate an ideal computational medium. This paper introduces a new computation framework, termed out-of-air computation (AirCPU), which establishes a joint source-channel coding foundation in which computation is not embedded before transmission but is instead extracted from the wireless superposition by exploiting structured coding. AirCPU operates directly on continuous-valued device data, avoiding the need for a separate source quantization stage, and employs a multi-layer nested lattice architecture that enables progressive resolution by decomposing each input into hierarchically scaled components, all transmitted over a common bounded digital constellation under a fixed power constraint. We formalize the notion of decoupled resolution, showing that in operating regimes where the decoding error probability is sufficiently small, the impact of channel noise and finite constellation constraints on distortion becomes negligible, and the resulting computation error is primarily determined by the target resolution set by the finest lattice. For fading MACs, we further introduce collective and successive computation mechanisms, in addition to the proposed direct computation, which exploit multiple decoded integer-coefficient functions and side-information functions as structural representations of the wireless superposition to significantly expand the reliable operating regime; in this context, we formulate and characterize the underlying reliability conditions and integer optimization problems, and develop a structured low-complexity two-group approximation to address them.
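The progressive-resolution idea behind the multi-layer nested lattice architecture can be caricatured in one dimension as hierarchical base-q digit decomposition: each layer refines the previous one by a factor of q, and truncating layers yields a coarser but still valid reconstruction. This analogy is ours, deliberately simplified, and not the paper's actual lattice construction:

```python
def nested_lattice_encode(x, q=4, layers=3):
    # decompose x in [0, 1) into hierarchically scaled base-q digits,
    # one digit per resolution layer (finest layer last)
    digits = []
    for _ in range(layers):
        x *= q
        d = int(x)
        digits.append(d)
        x -= d
    return digits

def nested_lattice_decode(digits, q=4):
    # any prefix of the digit list reconstructs x at coarser resolution
    return sum(d / q ** (i + 1) for i, d in enumerate(digits))

digits = nested_lattice_encode(0.640625, q=4, layers=3)
```

In the paper's setting, each "digit" layer is carried by a bounded constellation over the MAC, and the finest lattice sets the target resolution, matching the decoupled-resolution claim.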
[LG-32] Correcting Source Mismatch in Flow Matching with Radial-Angular Transport
链接: https://arxiv.org/abs/2604.04291
作者: Fouad Oubari,Mathilde Mougeot
类目: Machine Learning (cs.LG)
*备注:
Abstract:Flow Matching is typically built from Gaussian sources and Euclidean probability paths. For heavy-tailed or anisotropic data, however, a Gaussian source induces a structural mismatch already at the level of the radial distribution. We introduce Radial-Angular Flow Matching (RAFM), a framework that explicitly corrects this source mismatch within the standard simulation-free Flow Matching template. RAFM uses a source whose radial law matches that of the data and whose conditional angular distribution is uniform on the sphere, thereby removing the Gaussian radial mismatch by construction. This reduces the remaining transport problem to angular alignment, which leads naturally to conditional paths on scaled spheres defined by spherical geodesic interpolation. The resulting framework yields explicit Flow Matching targets tailored to radial-angular transport without modifying the underlying deterministic training pipeline. We establish the exact density of the matched-radial source, prove a radial-angular KL decomposition that isolates the Gaussian radial penalty, characterize the induced target vector field, and derive a stability result linking Flow Matching error to generation error. We further analyze empirical estimation of the radial law, for which Wasserstein and CDF metrics provide natural guarantees. Empirically, RAFM substantially improves over standard Gaussian Flow Matching and remains competitive with recent non-Gaussian alternatives while preserving a lightweight deterministic training procedure. Overall, RAFM provides a principled source-and-path design for Flow Matching on heavy-tailed and extreme-event data.
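RAFM's matched-radial source can be sketched directly: draw the radius from the data's empirical radial law and the direction uniformly on the sphere, via a normalized Gaussian. The function below is a hypothetical minimal version under our own naming, not the authors' code:

```python
import math, random

def rafm_source_sample(data_radii, dim, seed=0):
    # matched-radial source: radius from the data's empirical radial law,
    # direction uniform on the sphere (normalized isotropic Gaussian)
    rng = random.Random(seed)
    r = rng.choice(data_radii)
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in g))
    return [r * v / norm for v in g]

# every draw lands exactly on a sphere whose radius appears in the data
sample = rafm_source_sample([1.0, 3.0, 7.5], 4)
```

Because the radial law is matched by construction, the remaining learning problem is purely angular alignment, which is what the spherical geodesic paths in the abstract address.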
[LG-33] DAGAF: A directed acyclic generative adversarial framework for joint structure learning and tabular data synthesis
链接: https://arxiv.org/abs/2604.04290
作者: Hristo Petkov,Calum MacLellan,Feng Dong
类目: Machine Learning (cs.LG)
*备注: The code for this paper is available at this https URL
Abstract:Understanding the causal relationships between data variables can provide crucial insights into the construction of tabular datasets. Most existing causality learning methods typically focus on applying a single identifiable causal model, such as the Additive Noise Model (ANM) or the Linear non-Gaussian Acyclic Model (LiNGAM), to discover the dependencies exhibited in observational data. We improve on this approach by introducing a novel dual-step framework capable of performing both causal structure learning and tabular data synthesis under multiple causal model assumptions. Our approach uses Directed Acyclic Graphs (DAG) to represent causal relationships among data variables. By applying various functional causal models including ANM, LiNGAM and the Post-Nonlinear model (PNL), we implicitly learn the contents of DAG to simulate the generative process of observational data, effectively replicating the real data distribution. This is supported by a theoretical analysis to explain the multiple loss terms comprising the objective function of the framework. Experimental results demonstrate that DAGAF outperforms many existing methods in structure learning, achieving significantly lower Structural Hamming Distance (SHD) scores across both real-world and benchmark datasets (Sachs: 47%, Child: 11%, Hailfinder: 5%, Pathfinder: 7% improvement compared to state-of-the-art), while being able to produce diverse, high-quality samples.
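One of the functional causal models DAGAF assumes, the Additive Noise Model, is easy to simulate: each variable is a function of its parents plus independent noise, evaluated in topological order. A toy forward-sampling sketch (the names `sample_anm`, `dag`, `funcs` are ours, not the paper's):

```python
import random

def sample_anm(dag, funcs, n, noise_sd=0.1, seed=0):
    # Draw n rows from an Additive Noise Model: x_j = f_j(parents) + noise.
    # `dag` maps node -> parent list and must be given in topological order.
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for node, parents in dag.items():
            row[node] = funcs[node]([row[p] for p in parents]) + rng.gauss(0.0, noise_sd)
        rows.append(row)
    return rows

# two-node chain a -> b with b = 2a; noise_sd=0 makes it deterministic
rows = sample_anm({"a": [], "b": ["a"]},
                  {"a": lambda ps: 1.0, "b": lambda ps: 2.0 * ps[0]},
                  3, noise_sd=0.0)
```

DAGAF's generator plays the reverse role: it learns the DAG contents so that samples produced this way match the observed data distribution under ANM, LiNGAM, or PNL assumptions.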
[LG-34] A Family of Open Time-Series Foundation Models for the Radio Access Network
链接: https://arxiv.org/abs/2604.04271
作者: Ioannis Panitsas,Leandros Tassiulas
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:
Abstract:The Radio Access Network (RAN) is evolving into a programmable and disaggregated infrastructure that increasingly relies on AI-native algorithms for optimization and closed-loop control. However, current RAN intelligence is still largely built from task-specific models tailored to individual functions, resulting in model fragmentation, limited knowledge sharing across tasks, poor generalization, and increased system complexity. To address these limitations, we introduce TimeRAN, a unified multi-task learning framework for time-series modeling in the RAN. TimeRAN leverages a lightweight time-series foundation model with few task-specific heads to learn transferable representations that can be efficiently adapted across diverse tasks with limited supervision. To enable large-scale pretraining, we further curate and open-source TimeRAN DataPile, the largest time-series corpus for RAN analytics to date, comprising over 355K time series and 0.56B measurements across diverse telemetry sources, protocol layers, and deployment scenarios. We evaluate TimeRAN across a comprehensive set of RAN analytics tasks, including anomaly detection, classification, forecasting, and imputation, and show that it achieves state-of-the-art performance with minimal or no task-specific fine-tuning. Finally, we integrate TimeRAN into a proof-of-concept 5G testbed and demonstrate that it operates efficiently with limited resource requirements in real-world scenarios.
[LG-35] Towards Unveiling Vulnerabilities of Large Reasoning Models in Machine Unlearning
链接: https://arxiv.org/abs/2604.04255
作者: Aobo Chen,Chenxu Zhao,Chenglin Miao,Mengdi Huai
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Large language models (LLMs) possess strong semantic understanding, driving significant progress in data mining applications. This is further enhanced by large reasoning models (LRMs), which provide explicit multi-step reasoning traces. On the other hand, the growing need for the right to be forgotten has driven the development of machine unlearning techniques, which aim to eliminate the influence of specific data from trained models without full retraining. However, unlearning may also introduce new security vulnerabilities by exposing additional interaction surfaces. Although many studies have investigated unlearning attacks, there is no prior work on LRMs. To bridge this gap, in this paper we propose the first LRM unlearning attack, which forces incorrect final answers while generating convincing but misleading reasoning traces. This objective is challenging due to non-differentiable logical constraints, weak optimization effect over long rationales, and discrete forget set selection. To overcome these challenges, we introduce a bi-level exact unlearning attack that incorporates a differentiable objective function, influential token alignment, and a relaxed indicator strategy. To demonstrate the effectiveness and generalizability of our attack, we also design novel optimization frameworks and conduct comprehensive experiments in both white-box and black-box settings, aiming to raise awareness of the emerging threats to LRM unlearning pipelines.
[LG-36] Transmission Neural Networks: Inhibitory and Excitatory Connections
链接: https://arxiv.org/abs/2604.04246
作者: Shuang Gao,Peter E. Caines
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS)
*备注: 8 pages
Abstract:This paper extends the Transmission Neural Network model proposed by Gao and Caines in [1]-[3] to incorporate inhibitory connections and neurotransmitter populations. The extended network model contains binary neuronal states, transmission dynamics, and inhibitory and excitatory connections. Under technical assumptions, we establish the characterization of the firing probabilities of neurons, and show that such a characterization considering inhibitions can be equivalently represented by a neural network where each neuron has a continuous state of dimension 2. Moreover, we incorporated neurotransmitter populations into the modeling and establish the limit network model when the number of neurotransmitters at all synaptic connections go to infinity. Finally, sufficient conditions for stability and contraction properties of the limit network model are established.
[LG-37] Learning An Interpretable Risk Scoring System for Maximizing Decision Net Benefit
链接: https://arxiv.org/abs/2604.04241
作者: Wenhao Chi,Ş. İlker Birbil
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 31 pages, 5 figures, and 6 tables
Abstract:Risk scoring systems are widely used in high-stakes domains to assist decision-making. However, existing approaches often focus on optimizing predictive accuracy or likelihood-based criteria, which may not align with the main goal of maximizing utility. In this paper, we propose a novel risk scoring system that directly optimizes net benefit over a range of decision thresholds. The model is formulated as a sparse integer linear programming problem which enables the construction of a transparent scoring system with integer coefficients, and hence, facilitates interpretation and practical application. We also establish fundamental relationships among net benefit, discrimination, and calibration. Our analysis proves that optimizing net benefit also guarantees conventional performance measures. We thoroughly evaluated our method on multiple public datasets as well as on a real-world clinical dataset. This computational study demonstrated that our interpretable method can effectively achieve high net benefit while maintaining competitive discrimination and calibration performance.
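The net-benefit objective the scoring system optimizes is, in its standard decision-curve form, NB(p_t) = TP/n - (p_t/(1 - p_t)) * FP/n at decision threshold p_t; we assume the paper works with this standard definition over a range of thresholds. A direct sketch with hypothetical names:

```python
def net_benefit(y_true, y_prob, p_t):
    # Decision-curve net benefit at threshold p_t: true positives gain,
    # false positives cost, weighted by the threshold odds p_t/(1-p_t).
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= p_t and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= p_t and y == 0)
    return tp / n - (p_t / (1.0 - p_t)) * fp / n

# at p_t = 0.5 each false positive cancels one true positive exactly
nb_half = net_benefit([1, 0, 1, 0], [0.9, 0.8, 0.2, 0.1], 0.5)
nb_low = net_benefit([1, 0, 1, 0], [0.9, 0.8, 0.2, 0.1], 0.25)
```

The threshold-odds weighting is what ties net benefit to utility: a low threshold encodes that missing a positive case is much costlier than a false alarm, which accuracy-based objectives cannot express.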
[LG-38] Peoples Water Data: Enabling Reliable Field Data Generation and Microbial Contamination Screening in Household Drinking Water
链接: https://arxiv.org/abs/2604.04240
作者: Suzan Kagan,Shira Spigelman,Sankar Sudhir,Thalappil Pradeep,Hadas Mamane
类目: Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
*备注:
Abstract:Unsafe drinking water remains a major public health concern globally, particularly in low-resource regions where routine microbiological surveillance is limited. Although Escherichia coli is the internationally recognized indicator of fecal contamination, laboratory-based testing is often inaccessible at scale. In this study, we developed and evaluated a two-stage machine-learning framework for predicting E. coli presence in decentralized household point-of-use drinking water in Chennai, India using low-cost physicochemical and contextual indicators. The dataset comprised 3,023 samples collected under the Peoples Water Data initiative; after harmonization, technical cleaning, and outlier screening, 2,207 valid samples were retained. This framework provides a scalable decision-support tool for prioritizing microbiological testing in resource-constrained environments and addresses an important gap in point-of-use contamination risk assessment. Beyond predictive modeling, the present study was conducted within an AI-supported field implementation framework that combined student-facing guidance and real-time QC to improve protocol adherence, traceability, and data reliability in decentralized household water monitoring.
[LG-39] Subspace Control: Turning Constrained Model Steering into Controllable Spectral Optimization
链接: https://arxiv.org/abs/2604.04231
作者: Yancheng Huang,Changsheng Wang,Chongyu Fan,Yicheng Lang,Bingqi Shang,Yang Zhang,Mingyi Hong,Qing Qu,Alvaro Velasquez,Sijia Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Foundation models, such as large language models (LLMs), are powerful but often require customization before deployment to satisfy practical constraints such as safety, privacy, and task-specific requirements, leading to “constrained” optimization problems for model steering and adaptation. However, solving such problems remains largely underexplored and is particularly challenging due to interference between the primary objective and constraint objectives during optimization. In this paper, we propose a subspace control framework for constrained model training. Specifically, (i) we first analyze, from a model merging perspective, how spectral cross-task interference arises and show that it can be resolved via a one-shot solution that orthogonalizes the merged subspace; (ii) we establish a connection between this solution and gradient orthogonalization in the spectral optimizer Muon; and (iii) building on these insights, we introduce SIFT (spectral interference-free training), which leverages a localization scheme to selectively intervene during optimization, enabling controllable updates that mitigate objective-constraint conflicts. We evaluate SIFT across four representative applications: (a) machine unlearning, (b) safety alignment, (c) text-to-speech adaptation, and (d) hallucination mitigation. Compared to both control-based and control-free baselines, SIFT consistently achieves substantial and robust performance improvements across all tasks. Code is available at this https URL.
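The interference-removal intuition behind SIFT can be illustrated, in a deliberately simplified vector form, as projecting the task gradient orthogonal to a constraint direction. SIFT itself operates on spectral subspaces with a localization scheme, so this one-step Gram-Schmidt projection (our own naming) is only a caricature of the idea:

```python
def project_out(grad, constraint_dir):
    # remove the component of the task gradient along a constraint
    # direction, so the update no longer interferes with the constraint
    dot = sum(g * c for g, c in zip(grad, constraint_dir))
    nrm2 = sum(c * c for c in constraint_dir)
    return [g - (dot / nrm2) * c for g, c in zip(grad, constraint_dir)]

# projecting [3, 1] against the x-axis leaves only the y-component
g = project_out([3.0, 1.0], [1.0, 0.0])
```

After projection the update is exactly orthogonal to the constraint direction, so a first-order step cannot degrade the constraint objective.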
[LG-40] Towards Agentic Defect Reasoning: A Graph-Assisted Retrieval Framework for Laser Powder Bed Fusion
链接: https://arxiv.org/abs/2604.04208
作者: Muhammad Rizwan Awan,Volker Pickert,Muhammad Waqar Ashraf,Saleh Ali,Farshid Mahmouditabar,Shafiq Odhano
类目: Machine Learning (cs.LG)
*备注:
Abstract:Laser Powder Bed Fusion (LPBF) is highly sensitive to process parameters, which influence defect formation through complex thermal and fluid mechanisms. However, defect-related knowledge is dispersed across the literature, limiting systematic understanding. This study presents a graph-assisted retrieval framework for defect reasoning in LPBF, using Ti6Al4V as a case study. Scientific publications are transformed into a structured representation, and relationships between parameters, mechanisms, and defects are encoded into an evidence-linked knowledge graph. The framework integrates semantic and graph-based retrieval, supported by a lightweight agent-based reasoning layer to construct interpretable defect pathways. Evaluation shows high retrieval accuracy (0.9667) and recall (0.9667), demonstrating effective identification of relevant defect-related evidence. The framework enables transparent reasoning chains linking process parameters to defects. This work provides a scalable approach for converting unstructured literature into a queryable and interpretable knowledge resource for additive manufacturing.
[LG-41] Which Leakage Types Matter?
链接: https://arxiv.org/abs/2604.04199
作者: Simon Roth
类目: Machine Learning (cs.LG)
*备注: 35 pages, 6 figures, 10 tables. Companion to arXiv:2603.10742
Abstract:We run twenty-eight within-subject counterfactual experiments across 2,047 tabular datasets, plus a boundary experiment on 129 temporal datasets, measuring the severity of four data leakage classes in machine learning. Class I (estimation: fitting scalers on full data) is negligible: all nine conditions produce $|\Delta\mathrm{AUC}| \leq 0.005$. Class II (selection: peeking, seed cherry-picking) is substantial: ~90% of the measured effect is noise exploitation that inflates reported scores. Class III (memorization) scales with model capacity: $d_z = 0.37$ (Naive Bayes) to $1.11$ (Decision Tree). Class IV (boundary) is invisible under random CV. The textbook emphasis is inverted: normalization leakage matters least; selection leakage at practical dataset sizes matters most.
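Class I (estimation) leakage is easy to reproduce in a few lines: fit a scaler on train+test versus train only, and compare test AUC. The sketch below is a toy illustration with scikit-learn, not the paper's 2,047-dataset benchmark; the synthetic dataset and logistic-regression model are our own choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

def auc_with_scaler(X_tr, X_te, y_tr, y_te, fit_on_all):
    """Test AUC of logistic regression; the scaler is fit either on all data
    (Class I estimation leakage) or on the training split only (correct)."""
    scaler = StandardScaler()
    scaler.fit(np.vstack([X_tr, X_te]) if fit_on_all else X_tr)
    clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)
    return roc_auc_score(y_te, clf.predict_proba(scaler.transform(X_te))[:, 1])

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
idx = np.random.RandomState(0).permutation(len(y))
tr, te = idx[:400], idx[400:]
auc_leaky = auc_with_scaler(X[tr], X[te], y[tr], y[te], fit_on_all=True)
auc_clean = auc_with_scaler(X[tr], X[te], y[tr], y[te], fit_on_all=False)
print(f"delta AUC from Class I leakage: {abs(auc_leaky - auc_clean):.4f}")
```

On a single run like this, the AUC difference is typically tiny, consistent with the paper's finding that estimation leakage is negligible.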
[LG-42] Stable and Privacy-Preserving Synthetic Educational Data with Empirical Marginals: A Copula-Based Approach
链接: https://arxiv.org/abs/2604.04195
作者: Gabriel Diaz Ramos,Lorenzo Luzi,Debshila Basu Mallick,Richard Baraniuk
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 10 pages, 6 figures. Accepted at the Educational Data Mining (EDM) 2026 conference
Abstract:To advance Educational Data Mining (EDM) within strict privacy-protecting regulatory frameworks, researchers must develop methods that enable data-driven analysis while protecting sensitive student information. Synthetic data generation is one such approach, enabling the release of statistically generated samples instead of real student records; however, existing deep learning and parametric generators often distort marginal distributions and degrade under iterative regeneration, leading to distribution drift and progressive loss of distributional support that compromise reliability. In response, we introduce the Non-Parametric Gaussian Copula (NPGC), a plug-and-play synthesis method that replaces deep learning and parametric optimization with empirical statistical anchoring to preserve the observed marginal distributions while modeling dependencies through a copula framework. NPGC integrates Differential Privacy (DP) at both the marginal and correlation levels, supports heterogeneous variable types, and treats missing data as an explicit state to retain informative absence patterns. We evaluate NPGC against deep learning and parametric baselines on five benchmark datasets and demonstrate that it remains stable across multiple regeneration cycles and achieves competitive downstream performance at substantially lower computational cost. We further validate NPGC through deployment in a real-world online learning platform, demonstrating its practicality for privacy-preserving research.
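The core copula mechanics can be sketched in a few lines: map each column to normal scores via its empirical CDF, fit a Gaussian dependence structure, then invert new samples back through the empirical marginals. This is a minimal illustration of the non-parametric Gaussian copula idea only, without the paper's differential-privacy or missing-data machinery; the toy "study hours / score" variables are invented.

```python
import numpy as np
from scipy import stats

def gaussian_copula_synthesize(X, n_new, rng):
    """Non-parametric Gaussian copula synthesis sketch: empirical marginals
    anchor the per-column distributions; a Gaussian copula carries the
    dependence structure."""
    n, d = X.shape
    U = (stats.rankdata(X, axis=0) - 0.5) / n      # empirical CDF values
    Z = stats.norm.ppf(U)                           # normal scores
    corr = np.corrcoef(Z, rowvar=False)             # copula correlation
    Z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_new)
    U_new = stats.norm.cdf(Z_new)
    X_new = np.empty((n_new, d))
    for j in range(d):                              # invert empirical marginals
        X_new[:, j] = np.quantile(X[:, j], U_new[:, j])
    return X_new

rng = np.random.default_rng(0)
hours = rng.gamma(2.0, 2.0, size=500)               # toy skewed marginal
score = 50 + 5 * hours + rng.normal(0, 5, size=500)
X = np.column_stack([hours, score])
X_syn = gaussian_copula_synthesize(X, 500, rng)
```

Because new values are drawn via empirical quantiles of the real columns, the marginal distributions cannot drift under regeneration, which is the stability property the abstract emphasizes.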
[LG-43] Uncertainty-Aware Foundation Models for Clinical Data
链接: https://arxiv.org/abs/2604.04175
作者: Qian Zhou,Yuanyun Zhang,Shi Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Healthcare foundation models have largely followed paradigms from natural language processing and computer vision, emphasizing large scale pretraining and deterministic representations over heterogeneous clinical data. However, clinical observations are inherently incomplete, reflecting sparse, irregular, and modality dependent measurements of an underlying physiologic state. In this work, we propose a framework for uncertainty aware foundation modeling that represents each patient not as a point embedding, but as a distribution over plausible latent states. By learning set valued representations and enforcing consistency across partial views of the same patient, the model captures what is invariantly inferable while explicitly encoding epistemic uncertainty. We integrate this formulation with multimodal encoders and scalable self supervised objectives, combining reconstruction, contrastive alignment, and distributional regularization. Across diverse clinical tasks, our approach improves predictive performance, robustness under missing data, and uncertainty calibration relative to strong baselines. These results suggest that modeling what is not observed rather than only what is constitutes a critical inductive bias for healthcare foundation models.
[LG-44] The Geometric Alignment Tax: Tokenization vs. Continuous Geometry in Scientific Foundation Models
链接: https://arxiv.org/abs/2604.04155
作者: Prashant C. Raju
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
*备注:
Abstract:Foundation models for biology and physics optimize predictive accuracy, but their internal representations systematically fail to preserve the continuous geometry of the systems they model. We identify the root cause: the Geometric Alignment Tax, an intrinsic cost of forcing continuous manifolds through discrete categorical bottlenecks. Controlled ablations on synthetic dynamical systems demonstrate that replacing cross-entropy with a continuous head on an identical encoder reduces geometric distortion by up to 8.5x, while learned codebooks exhibit a non-monotonic double bind where finer quantization worsens geometry despite improving reconstruction. Under continuous objectives, three architectures differ by 1.3x; under discrete tokenization, they diverge by 3,000x. Evaluating 14 biological foundation models with rate-distortion theory and MINE, we identify three failure regimes: Local-Global Decoupling, Representational Compression, and Geometric Vacuity. A controlled experiment confirms that Evo 2’s reverse-complement robustness on real DNA reflects conserved sequence composition, not learned symmetry. No model achieves simultaneously low distortion, high mutual information, and global coherence.
[LG-45] Measuring Robustness of Speech Recognition from MEG Signals Under Distribution Shift NEURIPS2025
链接: https://arxiv.org/abs/2604.04129
作者: Sheng-You Chien,Bo-Yi Mao,Yi-Ning Chang,Po-Chih Kuo
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注: 17 pages, 6 figures, LibriBrain Competition @NeurIPS2025
Abstract:This study investigates robust speech-related decoding from non-invasive MEG signals using the LibriBrain phoneme-classification benchmark from the 2025 PNPL competition. We compare residual convolutional neural networks (CNNs), an STFT-based CNN, and a CNN–Transformer hybrid, while also examining the effects of group averaging, label balancing, repeated grouping, normalization strategies, and data augmentation. Across our in-house implementations, preprocessing and data-configuration choices matter more than additional architectural complexity, among which instance normalization emerges as the most influential modification for generalization. The strongest of our own models, a CNN with group averaging, label balancing, repeated grouping, and instance normalization, achieves 60.95% F1-macro on the test split, compared with 39.53% for the plain CNN baseline. However, most of our models, without instance normalization, show substantial validation-to-test degradation, indicating that distribution shift induced by different normalization statistics is a major obstacle to generalization in our experiments. By contrast, MEGConformer maintains 64.09% F1-macro on both validation and test, and saliency-map analysis is qualitatively consistent with this contrast: weaker models exhibit more concentrated or repetitive phoneme-sensitive patterns across splits, whereas MEGConformer appears more distributed. Overall, the results suggest that improving the reliability of non-invasive phoneme decoding will likely require better handling of normalization-related distribution shift while also addressing the challenge of single-trial decoding.
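Instance normalization, the modification the authors found most influential for generalization, standardizes each channel of each trial using only that trial's own statistics, so no dataset-level normalization constants can shift between splits. A generic sketch (array shapes are illustrative, not the LibriBrain format):

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Per-trial, per-channel standardization of a (channels, time) array.
    Statistics come from this trial alone, so train/validation/test splits
    cannot disagree on the normalization constants."""
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

rng = np.random.default_rng(0)
trial = rng.normal(5.0, 2.0, size=(64, 250))   # e.g. 64 sensors, 250 samples
z = instance_norm(trial)
```

Contrast this with normalizing by statistics computed over a whole split, which is exactly the kind of distribution-shift-sensitive choice the abstract identifies as a major obstacle.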
[LG-46] Physical Sensitivity Kernels Can Emerge in Data-Driven Forward Models: Evidence From Surface-Wave Dispersion
链接: https://arxiv.org/abs/2604.04107
作者: Ziye Yu,Yuqi Cai,Xin Liu
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注: 12 pages, 2 figures
Abstract:Data-driven neural networks are increasingly used as surrogate forward models in geophysics, but it remains unclear whether they recover only the data mapping or also the underlying physical sensitivity structure. Here we test this question using surface-wave dispersion. By comparing automatically differentiated gradients from a neural-network surrogate with theoretical sensitivity kernels, we show that the learned gradients can recover the main depth-dependent structure of physical kernels across a broad range of periods. This indicates that neural surrogate models can learn physically meaningful differential information, rather than acting as purely black-box predictors. At the same time, strong structural priors in the training distribution can introduce systematic artifacts into the inferred sensitivities. Our results show that neural forward surrogates can recover useful physical information for inversion and uncertainty analysis, while clarifying the conditions under which this differential structure remains physically consistent.
[LG-47] Restless Bandits with Individual Penalty Constraints: A New Near-Optimal Index Policy and How to Learn It
链接: https://arxiv.org/abs/2604.04101
作者: Nida Zamir,I-Hong Hou
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper investigates the Restless Multi-Armed Bandit (RMAB) framework under individual penalty constraints to address resource allocation challenges in dynamic wireless networked environments. Unlike conventional RMAB models, our model allows each user (arm) to have distinct and stringent performance constraints, such as energy limits, activation limits, or age of information minimums, enabling the capture of diverse objectives including fairness and efficiency. To find the optimal resource allocation policy, we propose a new Penalty-Optimal Whittle (POW) index policy. The POW index of a user depends only on that user's transition kernel and penalty constraints, and remains invariant under system-wide features such as the number of users present and the amount of resources available. This makes it computationally tractable to calculate the POW indices offline without any need for online adaptation. Moreover, we theoretically prove that the POW index policy is asymptotically optimal while satisfying all individual penalty constraints. We also introduce a deep reinforcement learning algorithm to efficiently learn the POW index on the fly. Simulation results across various applications and system configurations further demonstrate that the POW index policy not only has near-optimal performance but also significantly outperforms other existing policies.
[LG-48] Spectral Path Regression: Directional Chebyshev Harmonics for Interpretable Tabular Learning
链接: https://arxiv.org/abs/2604.04091
作者: Milo Coombs
类目: Machine Learning (cs.LG)
*备注: 19 pages, 4 figures. Includes appendix. Experiments on standard tabular benchmarks. Code available at this https URL
Abstract:Classical approximation bases such as Chebyshev polynomials provide principled and interpretable representations, but their multivariate tensor-product constructions scale exponentially with dimension and impose axis-aligned structure that is poorly matched to real tabular data. We address this by replacing tensorised oscillations with directional harmonic modes of the form $\cos(\mathbf{m}^\top \arccos(\mathbf{x}))$, which organise multivariate structure by direction in angular space rather than by coordinate index. This representation yields a discrete spectral regression model in which complexity is controlled by selecting a small number of structured frequency vectors (spectral paths), and training reduces to a single closed-form ridge solve with no iterative optimisation. Experiments on standard continuous-feature tabular regression benchmarks show that the resulting models achieve accuracy competitive with strong nonlinear baselines while remaining compact, computationally efficient, and explicitly interpretable through analytic expressions of learned feature interactions.
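The feature map and closed-form ridge solve described above can be sketched directly: build features $\cos(\mathbf{m}^\top \arccos(\mathbf{x}))$ for a handful of frequency vectors and solve the normal equations. This is a minimal illustration of the representation, not the paper's spectral-path selection procedure; the frequency vectors and toy target are ours.

```python
import numpy as np

def directional_chebyshev_features(X, M):
    """Features cos(m^T arccos(x)) for each frequency vector m (row of M).
    X must lie in [-1, 1]^d so arccos is defined."""
    theta = np.arccos(np.clip(X, -1.0, 1.0))   # (n, d) angular coordinates
    return np.cos(theta @ M.T)                  # (n, k) directional harmonics

def fit_ridge(Phi, y, lam=1e-6):
    """Closed-form ridge solve: w = (Phi^T Phi + lam I)^{-1} Phi^T y."""
    k = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(k), Phi.T @ y)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
theta = np.arccos(X)
y = np.cos(theta @ np.array([1.0, 2.0]))        # target is one directional mode

M = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 2.0], [2.0, 1.0]])  # spectral paths
Phi = directional_chebyshev_features(X, M)
w = fit_ridge(Phi, y)
rmse = float(np.sqrt(np.mean((Phi @ w - y) ** 2)))
print(f"train RMSE: {rmse:.2e}")
```

Because the target here is itself one of the directional modes, the ridge solve should recover a near-one weight on that mode and near-zero error, illustrating the interpretability claim: the learned model is an explicit sum of named harmonics.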
[LG-49] ArrowFlow: Hierarchical Machine Learning in the Space of Permutations
链接: https://arxiv.org/abs/2604.04087
作者: Ozgur Yilmaz
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce ArrowFlow, a machine learning architecture that operates entirely in the space of permutations. Its computational units are ranking filters, learned orderings that compare inputs via Spearman’s footrule distance and update through permutation-matrix accumulation, a non-gradient rule rooted in displacement evidence. Layers compose hierarchically: each layer’s output ranking becomes the next layer’s input, enabling deep ordinal representation learning without any floating-point parameters in the core computation. We connect the architecture to Arrow’s impossibility theorem, showing that violations of social-choice fairness axioms (context dependence, specialization, symmetry breaking) serve as inductive biases for nonlinearity, sparsity, and stability. Experiments span UCI tabular benchmarks, MNIST, gene expression cancer classification (TCGA), and preference data, all against GridSearchCV-tuned baselines. ArrowFlow beats all baselines on Iris (2.7% vs. 3.3%) and is competitive on most UCI datasets. A single parameter, polynomial degree, acts as a master switch: degree 1 yields noise robustness (8-28% less degradation), privacy preservation (+0.5pp cost), and missing-feature resilience; higher degrees trade these for improved clean accuracy. ArrowFlow is not designed to surpass gradient-based methods. It is an existence proof that competitive classification is possible in a fundamentally different computational paradigm, one that elevates ordinal structure to a first-class citizen, with natural alignment to integer-only and neuromorphic hardware.
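Spearman's footrule distance, the comparison at the heart of the ranking filters, takes only a few lines: sum, over items, the absolute difference between the positions the item occupies in the two permutations. A generic sketch, not the paper's implementation:

```python
import numpy as np

def footrule_distance(p, q):
    """Spearman's footrule: sum_i |rank of item i in p - rank of item i in q|.
    p and q are permutations of 0..n-1, listing item IDs in rank order."""
    pos_p = np.argsort(p)   # pos_p[i] = position of item i under p
    pos_q = np.argsort(q)
    return int(np.abs(pos_p - pos_q).sum())

p = np.array([0, 1, 2, 3])      # identity ranking
q = np.array([3, 1, 2, 0])      # items 0 and 3 swapped
print(footrule_distance(p, q))  # -> 6
print(footrule_distance(p, p))  # -> 0
```

Note the computation is integer-only, which matches the abstract's point about alignment with integer-only and neuromorphic hardware.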
[LG-50] Extended Hybrid Timed Petri Nets with Semi-Supervised Anomaly Detection for Switched Systems Modelling and Fault Detection
链接: https://arxiv.org/abs/2604.04051
作者: Fatiha Hamdi,Abdelhafid Zeroual,Fouzi Harrou
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Dynamical Systems (math.DS); Cellular Automata and Lattice Gases (nlin.CG)
*备注:
Abstract:Hybrid physical systems combine continuous and discrete dynamics, which can be simultaneously affected by faults. Conventional fault detection methods often treat these dynamics separately, limiting their ability to capture interacting fault patterns. This paper proposes a unified fault detection framework for hybrid dynamical systems by integrating an Extended Timed Continuous Petri Net (ETCPN) model with semi-supervised anomaly detection. The proposed ETCPN extends existing Petri net formalisms by introducing marking-dependent flow functions, enabling intrinsic coupling between discrete and continuous dynamics. Based on this structure, a mode-dependent hybrid observer is designed, whose stability under arbitrary switching is ensured via Linear Matrix Inequalities (LMIs), solved offline to determine observer gains. The observer generates residuals that reflect discrepancies between the estimated and measured outputs. These residuals are processed using semi-supervised methods, including One-Class SVM (OC-SVM), Support Vector Data Description (SVDD), and Elliptic Envelope (EE), trained exclusively on normal data to avoid reliance on labeled faults. The framework is validated through simulations involving discrete faults, continuous faults, and hybrid faults. Results demonstrate high detection accuracy, fast convergence, and robust performance, with OC-SVM and SVDD providing the best trade-off between detection rate and false alarms. The framework is computationally efficient for real-time deployment, as the main complexity is confined to the offline LMI design phase.
[LG-51] Jellyfish: Zero-Shot Federated Unlearning Scheme with Knowledge Disentanglement
链接: https://arxiv.org/abs/2604.04030
作者: Houzhe Wang,Xiaojie Zhu,Chi Chen
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:With the increasing importance of data privacy and security, federated unlearning emerges as a new research field dedicated to ensuring that once specific data is deleted, federated learning models no longer retain or disclose related information. In this paper, we propose a zero-shot federated unlearning scheme, named Jellyfish. It distinguishes itself from conventional federated unlearning frameworks in four key aspects: synthetic data generation, knowledge disentanglement, loss function design, and model repair. To preserve the privacy of forgotten data, we design a zero-shot unlearning mechanism that generates error-minimization noise as proxy data for the data to be forgotten. To maintain model utility, we first propose a knowledge disentanglement mechanism that regularises the output of the final convolutional layer by restricting the number of activated channels for the data to be forgotten and encouraging activation sparsity. Next, we construct a comprehensive loss function that incorporates multiple components, including hard loss, confusion loss, distillation loss, model weight drift loss, gradient harmonization, and gradient masking, to effectively align the learning trajectories of the "forgetting" and "retaining" objectives. Finally, we propose a zero-shot repair mechanism that leverages proxy data to restore model accuracy within acceptable bounds without accessing users’ local data. To evaluate the performance of the proposed zero-shot federated unlearning scheme, we conducted comprehensive experiments across diverse settings. The results validate the effectiveness and robustness of the scheme.
[LG-52] Autoencoder-Based Parameter Estimation for Superposed Multi-Component Damped Sinusoidal Signals
链接: https://arxiv.org/abs/2604.03985
作者: Momoka Iida,Hayato Motohashi,Hirotaka Takahashi
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注: 27 pages, 16 figures, 14 tables
Abstract:Damped sinusoidal oscillations are widely observed in many physical systems, and their analysis provides access to underlying physical properties. However, parameter estimation becomes difficult when the signal decays rapidly, multiple components are superposed, and observational noise is present. In this study, we develop an autoencoder-based method that uses the latent space to estimate the frequency, phase, decay time, and amplitude of each component in noisy multi-component damped sinusoidal signals. We investigate multi-component cases under Gaussian-distribution training and further examine the effect of the training-data distribution through comparisons between Gaussian and uniform training. The performance is evaluated through waveform reconstruction and parameter-estimation accuracy. We find that the proposed method can estimate the parameters with high accuracy even in challenging setups, such as those involving a subdominant component or nearly opposite-phase components, while remaining reasonably robust when the training distribution is less informative. This demonstrates its potential as a tool for analyzing short-duration, noisy signals.
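The signal family under study is easy to synthesize, which is how training data for such estimators is typically generated. A sketch of a noisy multi-component damped-sinusoid generator; the specific amplitudes, frequencies, decay times, and phases below are illustrative, including the subdominant second component:

```python
import numpy as np

def damped_signal(t, components, noise_std=0.0, rng=None):
    """Sum of damped sinusoids: sum_k A_k * exp(-t/tau_k) * sin(2*pi*f_k*t + phi_k),
    optionally with additive Gaussian observational noise."""
    s = np.zeros_like(t)
    for A, f, tau, phi in components:
        s += A * np.exp(-t / tau) * np.sin(2 * np.pi * f * t + phi)
    if noise_std > 0:
        rng = rng or np.random.default_rng(0)
        s = s + rng.normal(0.0, noise_std, size=t.shape)
    return s

t = np.linspace(0.0, 1.0, 1024)
# dominant plus subdominant component, one of the challenging setups mentioned
components = [(1.0, 12.0, 0.25, 0.3), (0.3, 20.0, 0.15, 2.0)]
clean = damped_signal(t, components)
noisy = damped_signal(t, components, noise_std=0.05)
```

Each component contributes four parameters (amplitude, frequency, decay time, phase), which are exactly the quantities the autoencoder's latent space is trained to recover.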
[LG-53] Multirate Stein Variational Gradient Descent for Efficient Bayesian Sampling
链接: https://arxiv.org/abs/2604.03981
作者: Arash Sarshar
类目: Machine Learning (cs.LG); Computation (stat.CO)
*备注:
Abstract:Many particle-based Bayesian inference methods use a single global step size for all parts of the update. In Stein variational gradient descent (SVGD), however, each update combines two qualitatively different effects: attraction toward high-posterior regions and repulsion that preserves particle diversity. These effects can evolve at different rates, especially in high-dimensional, anisotropic, or hierarchical posteriors, so one step size can be unstable in some regions and inefficient in others. We derive a multirate version of SVGD that updates these components on different time scales. The framework yields practical algorithms, including a symmetric split method, a fixed multirate method (MR-SVGD), and an adaptive multirate method (Adapt-MR-SVGD) with local error control. We evaluate the methods in a broad and rigorous benchmark suite covering six problem families: a 50D Gaussian target, multiple 2D synthetic targets, UCI Bayesian logistic regression, multimodal Gaussian mixtures, Bayesian neural networks, and large-scale hierarchical logistic regression. Evaluation includes posterior-matching metrics, predictive performance, calibration quality, mixing, and explicit computational cost accounting. Across these six benchmark families, multirate SVGD variants improve robustness and quality-cost tradeoffs relative to vanilla SVGD. The strongest gains appear on stiff hierarchical, strongly anisotropic, and multimodal targets, where adaptive multirate SVGD is usually the strongest variant and fixed multirate SVGD provides a simpler robust alternative at lower cost.
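The attraction/repulsion split that multirate SVGD exploits is visible in the standard SVGD update itself: each particle moves along a kernel-weighted average of score functions (attraction) plus the kernel's gradient (repulsion). The toy sketch below simply exposes the two terms with separate step sizes on a 2-D Gaussian target; the step sizes and median-heuristic bandwidth are conventional choices, not the paper's MR-SVGD or Adapt-MR-SVGD schedules.

```python
import numpy as np

def svgd_step(X, grad_logp, eps_attr, eps_rep):
    """One SVGD step with attraction and repulsion scaled independently.
    X: (n, d) particles; grad_logp: score function of the target."""
    n = len(X)
    diff = X[:, None, :] - X[None, :, :]              # (n, n, d)
    d2 = np.sum(diff ** 2, axis=-1)
    h = np.median(d2) / np.log(n + 1) + 1e-8          # median-heuristic bandwidth
    K = np.exp(-d2 / h)
    attraction = K @ grad_logp(X) / n                 # kernel-weighted scores
    repulsion = (2.0 / h) * np.sum(K[:, :, None] * diff, axis=1) / n
    return X + eps_attr * attraction + eps_rep * repulsion

grad_logp = lambda X: -X                              # standard Gaussian target
rng = np.random.default_rng(0)
X = rng.normal(3.0, 0.5, size=(50, 2))                # particles start off-target
for _ in range(1000):
    X = svgd_step(X, grad_logp, eps_attr=0.2, eps_rep=0.2)
print(X.mean(axis=0))                                  # should approach the origin
```

Because the two terms are separated, `eps_attr` and `eps_rep` can in principle be updated on different time scales or adapted independently, which is the knob a multirate scheme turns.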
[LG-54] ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation
链接: https://arxiv.org/abs/2604.03922
作者: Hui Sun,Yun-Ji Zhang,Zheng Xie,Ren-Biao Liu,Yali Du,Xin-Ye Li,Ming Li
类目: Machine Learning (cs.LG)
*备注: 32 pages, 14 figures, 9 tables
Abstract:Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are correct, creating a circular dependency. Our key insight is that we need not determine test correctness at all: test votes should rank, not merely count. What matters is not how many codes pass a test, but whether the test can distinguish correct from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank codes by their aggregate scores on all remaining tests, and measure whether the held-out test’s pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC (LOO-AUC) and prove that the expected LOO-AUC is proportional to each test’s ability to separate correct code from incorrect code. Building on this, we propose ACES (AUC ConsistEncy Scoring) with two complementary variants: ACES-C provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality; ACES-O drops this assumption and iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead, and achieve state-of-the-art Pass@k on multiple code generation benchmarks.
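The leave-one-out agreement score is computable directly from the binary pass matrix: hold out one test, rank candidates by total passes on the remaining tests, and compute the AUC of that ranking against the held-out test's own pass/fail labels. The sketch below implements this LOO-AUC idea, not the paper's exact ACES-C/ACES-O weighting; the toy matrix is invented.

```python
import numpy as np

def loo_auc_scores(P):
    """P: (n_tests, n_codes) binary pass matrix. For each test t, rank codes
    by total passes on the remaining tests and score how well that ranking
    separates test t's passing codes from its failing codes (AUC)."""
    n_tests, _ = P.shape
    totals = P.sum(axis=0)
    scores = np.full(n_tests, 0.5)
    for t in range(n_tests):
        ranking = totals - P[t]                 # leave test t out
        pos, neg = ranking[P[t] == 1], ranking[P[t] == 0]
        if len(pos) == 0 or len(neg) == 0:
            continue                            # test cannot discriminate; keep 0.5
        # AUC = P(pos > neg) + 0.5 * P(pos == neg), via all pairwise comparisons
        wins = (pos[:, None] > neg[None, :]).mean()
        ties = (pos[:, None] == neg[None, :]).mean()
        scores[t] = wins + 0.5 * ties
    return scores

# toy pass matrix: 4 tests x 5 codes; the last test is inverted (bad test)
P = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 1, 1],   # disagrees with the consensus ranking
])
scores = loo_auc_scores(P)
print(scores)
```

The inverted fourth test scores below 0.5, so a score-weighted vote would discount it without ever deciding test correctness directly, which is how the circular dependency is broken.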
[LG-55] Align Your Structures: Generating Trajectories with Structure Pretraining for Molecular Dynamics ICLR2026
链接: https://arxiv.org/abs/2604.03911
作者: Aniketh Iyengar,Jiaqi Han,Pengwei Sun,Mingjian Jiang,Jianwen Xie,Stefano Ermon
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: Published at ICLR 2026. 38 pages, 17 figures, 17 tables
Abstract:Generating molecular dynamics (MD) trajectories using deep generative models has attracted increasing attention, yet remains inherently challenging due to the limited availability of MD data and the complexities involved in modeling high-dimensional MD distributions. To overcome these challenges, we propose a novel framework that leverages structure pretraining for MD trajectory generation. Specifically, we first train a diffusion-based structure generation model on a large-scale conformer dataset, on top of which we introduce an interpolator module trained on MD trajectory data, designed to enforce temporal consistency among generated structures. Our approach effectively harnesses abundant structural data to mitigate the scarcity of MD trajectory data and effectively decomposes the intricate MD modeling task into two manageable subproblems: structural generation and temporal alignment. We comprehensively evaluate our method on the QM9 and DRUGS small-molecule datasets across unconditional generation, forward simulation, and interpolation tasks, and further extend our framework and analysis to tetrapeptide and protein monomer systems. Experimental results confirm that our approach excels in generating chemically realistic MD trajectories, as evidenced by remarkable improvements of accuracy in geometric, dynamical, and energetic measurements.
[LG-56] Improving Model Performance by Adapting the KGE Metric to Account for System Non-Stationarity
链接: https://arxiv.org/abs/2604.03906
作者: M Jawad,HV Gupta,YH Wang,MA Farmani,A Behrangi,GY Niu
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:
Abstract:Geoscientific systems tend to be characterized by pronounced temporal non-stationarity, arising from seasonal and climatic variability in hydrometeorological drivers, and from natural and anthropogenic changes to land use and cover. As has been pointed out, such variability renders “the assumption of statistical stationarity obsolete in water management”, and requires us to “account for, rather than ignore, non-stationary trends” in the data. However, metrics used for model development are typically based on the implicit and unjustifiable assumption that the data generating process is time-stationary. Here, we introduce the JKGE_ss metric (adapted from KGE_ss) that detects and accounts for dynamical non-stationarity in the statistical properties of the data and thereby improves information extraction and model performance. Unlike NSE and KGE_ss, which use the long-term mean as a benchmark against which to evaluate model efficiency, JKGE_ss emphasizes reproduction of temporal variations in system storage. We tested the robustness of the new metric by training physical-conceptual and data-based catchment-scale models of varying complexity across a wide range of hydroclimatic conditions, from recent-precipitation-dominated to snow-dominated to strongly arid. In all cases, the result was improved reproduction of system temporal dynamics at all time scales, across wet to dry years, and over the full range of flow levels (especially recession periods). Since traditional metrics fail to adequately account for temporal shifts in system dynamics, potentially resulting in misleading assessments of model performance under changing conditions, we recommend the adoption of JKGE_ss for geoscientific model development. 
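For background, the baseline metric being adapted, the Kling-Gupta Efficiency (Gupta et al., 2009), decomposes model error into correlation, variability ratio, and bias ratio: KGE = 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2). The sketch below computes standard KGE on a toy flow series; it does not implement the paper's JKGE_ss variant, and the synthetic "observations" are invented.

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta Efficiency: 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2),
    with r = linear correlation, alpha = std(sim)/std(obs),
    beta = mean(sim)/mean(obs). Perfect simulation gives KGE = 1."""
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = np.std(sim) / np.std(obs)
    beta = np.mean(sim) / np.mean(obs)
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

rng = np.random.default_rng(0)
obs = np.abs(rng.normal(10.0, 3.0, size=365))   # toy daily flow series
k_perfect = kge(obs.copy(), obs)
k_biased = kge(1.2 * obs, obs)                  # 20% volume bias
print(k_perfect, k_biased)
```

A pure 20% scaling leaves r = 1 but pushes both alpha and beta to 1.2, dropping KGE to about 0.72; the abstract's critique is that the long-term-mean benchmark inside such metrics implicitly assumes stationarity, which JKGE_ss replaces with a storage-dynamics benchmark.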
[LG-57] Improving ML Attacks on LWE with Data Repetition and Stepwise Regression
链接: https://arxiv.org/abs/2604.03903
作者: Alberto Alfarano,Eshika Saxena,Emily Wenger,François Charton,Kristin Lauter
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:The Learning with Errors (LWE) problem is a hard math problem in lattice-based cryptography. In the simplest case of binary secrets, it is the subset sum problem, with error. Effective ML attacks on LWE were demonstrated in the case of binary, ternary, and small secrets, succeeding on fairly sparse secrets. The ML attacks recover secrets with up to 3 active bits in the “cruel region” (Nolte et al., 2024) on samples pre-processed with BKZ. We show that using larger training sets and repeated examples enables recovery of denser secrets. Empirically, we observe a power-law relationship between model-based attempts to recover the secrets, dataset size, and repeated examples. We introduce a stepwise regression technique to recover the “cool bits” of the secret.
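A sparse-binary-secret LWE instance of the kind these attacks target can be generated in a few lines: samples are pairs (A, b = A·s + e mod q) with a low-Hamming-weight 0/1 secret and small Gaussian error. All parameters below (dimension, modulus, error width, secret weight of 3 active bits) are illustrative, not the paper's attack setting.

```python
import numpy as np

def lwe_samples(n, m, q, secret_weight, sigma, rng):
    """Generate m LWE samples (A, b = A s + e mod q) with a sparse binary
    secret of the given Hamming weight and rounded-Gaussian error."""
    s = np.zeros(n, dtype=np.int64)
    s[rng.choice(n, size=secret_weight, replace=False)] = 1
    A = rng.integers(0, q, size=(m, n), dtype=np.int64)
    e = np.rint(rng.normal(0.0, sigma, size=m)).astype(np.int64)
    b = (A @ s + e) % q
    return A, b, s

rng = np.random.default_rng(0)
n, m, q = 64, 256, 3329
A, b, s = lwe_samples(n, m, q, secret_weight=3, sigma=3.0, rng=rng)

# with the secret known, the residual (b - A s) mod q recovers the small error
resid = (b - A @ s) % q
resid = np.where(resid > q // 2, resid - q, resid)   # center into (-q/2, q/2]
print(resid[:5])
```

The residual stays small only for the true secret; for a wrong guess it is uniform mod q, which is the signal any secret-recovery procedure, ML-based or otherwise, must detect.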
[LG-58] Lotka-Sharpe Neural Operators for Control of Population PDEs
链接: https://arxiv.org/abs/2604.03892
作者: Miroslav Krstic,Iasson Karafyllis,Luke Bhan,Carina Veil
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 16 pages. In submission
Abstract:Age-structured predator-prey integro-partial differential equations provide models of interacting populations in ecology, epidemiology, and biotechnology. A key challenge in feedback design for these systems is the scalar $\zeta$, defined implicitly by the Lotka-Sharpe nonlinear integral condition, as a mapping from fertility and mortality rates to $\zeta$. To solve this challenge with operator learning, we first prove that the Lotka-Sharpe operator is Lipschitz continuous, guaranteeing the existence of arbitrarily accurate neural operator approximations over a compact set of fertility and mortality functions. We then show that the resulting approximate feedback law preserves semi-global practical asymptotic stability under propagation of the operator approximation error through various other nonlinear operators, all the way through to the control input. In the numerical results, not only do we learn "once-and-for-all" the canonical Lotka-Sharpe (LS) operator, and thus make it available for future uses in control of other age-structured population interconnections, but we demonstrate the online usage of the neural LS operator under estimation of the fertility and mortality functions.
[LG-59] Provable Multi-Task Reinforcement Learning: A Representation Learning Framework with Low Rank Rewards
链接: https://arxiv.org/abs/2604.03891
作者: Yaoze Guo,Shana Moothedath
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multi-task representation learning (MTRL) is an approach that learns shared latent representations across related tasks, facilitating collaborative learning that improves the overall learning efficiency. This paper studies MTRL for multi-task reinforcement learning (RL), where multiple tasks have the same state-action space and transition probabilities, but different rewards. We consider T linear Markov Decision Processes (MDPs) where the reward functions and transition dynamics admit linear feature embeddings of dimension d. The relatedness among the tasks is captured by a low-rank structure on the reward matrices. Learning shared representations across multiple RL tasks is challenging due to the complex and policy-dependent nature of data that leads to a temporal progression of error. Our approach adopts a reward-free reinforcement learning framework to first learn a data-collection policy. This policy then informs an exploration strategy for estimating the unknown reward matrices. Importantly, the data collected under this well-designed policy enable accurate estimation, which ultimately supports the learning of a near-optimal policy. Unlike existing approaches that rely on restrictive assumptions such as Gaussian features, incoherence conditions, or access to optimal solutions, we propose a low-rank matrix estimation method that operates under more general feature distributions encountered in RL settings. Theoretical analysis establishes that accurate low-rank matrix recovery is achievable under these relaxed assumptions, and we characterize the relationship between representation error and sample complexity. Leveraging the learned representation, we construct near-optimal policies and prove a regret bound. Experimental results demonstrate that our method effectively learns robust shared representations and task dynamics from finite data.
[LG-60] Spatiotemporal Interpolation of GEDI Biomass with Calibrated Uncertainty
链接: https://arxiv.org/abs/2604.03874
作者: Robin Young,Srinivasan Keshav
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:
Abstract:Monitoring deforestation-driven carbon emissions requires both spatially explicit and temporally continuous estimates of aboveground biomass density (AGBD) with calibrated uncertainty. NASA’s Global Ecosystem Dynamics Investigation (GEDI) provides reliable LIDAR-derived AGBD, but its orbital sampling causes irregular spatiotemporal coverage, and occasional operational interruptions, including a 13-month hibernation from March 2023 to April 2024, leave extended gaps in the observational record. Prior work has used machine learning approaches to fill GEDI’s spatial gaps using satellite-derived features, but temporal interpolation of biomass through unobserved periods, particularly across active disturbance events, remains largely unaddressed. Moreover, standard ensemble methods for biomass mapping have been shown to produce systematically miscalibrated prediction intervals. To address these gaps, we extend the Attentive Neural Process (ANP) framework, previously applied to spatial biomass interpolation, to jointly sparse spatiotemporal settings using geospatial foundation model embeddings. We treat space and time symmetrically, empirically validating a form of space-for-time substitution in which observations from nearby locations at other times inform predictions at held-out periods. Our results demonstrate that the ANP produces well-calibrated uncertainty estimates across disturbance regimes, supporting its use in Measurement, Reporting, and Verification (MRV) applications that require reliable uncertainty quantification for forest carbon accounting.
[LG-61] Where to Steer: Input-Dependent Layer Selection for Steering Improves LLM Alignment
链接: https://arxiv.org/abs/2604.03867
作者: Soham Gadgil,Chris Lin,Su-In Lee
类目: Machine Learning (cs.LG)
*备注: Preprint
Abstract:Steering vectors have emerged as a lightweight and effective approach for aligning large language models (LLMs) at inference time, enabling modulation over model behaviors by shifting LLM representations towards a target behavior. However, existing methods typically apply steering vectors at a globally fixed layer, implicitly assuming that the optimal intervention layer is invariant across inputs. We argue that this assumption is fundamentally limited, as representations relevant to a target behavior can be encoded at different layers depending on the input. Theoretically, we show that different inputs can require steering at different layers to achieve alignment with a desirable model behavior. We also provide empirical evidence that the optimal steering layer varies substantially across inputs in practice. Motivated by these observations, we introduce Where to Steer (W2S), a framework that adaptively selects the intervention layer conditioned on the input, by learning a mapping from input embeddings to optimal steering layers. Across multiple LLMs and alignment behaviors, W2S consistently outperforms fixed-layer baselines, with improvements in both in-distribution and out-of-distribution settings. Our findings highlight the importance of input-dependent control in LLM alignment and demonstrate that adaptive layer selection is a key design dimension missing in the current methodology of steering vectors.
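As a hedged illustration of the mechanism the abstract describes (this is not the authors' W2S code; the layer-scoring callable, the `alpha` scale, and all names are hypothetical stand-ins), input-dependent steering can be sketched as picking a layer per input and shifting that layer's representation:

```python
def select_layer(input_embedding, layer_scorer):
    """Input-dependent layer choice: pick the layer with the highest learned score.
    `layer_scorer` is a stand-in for the trained mapping from input embeddings to
    per-layer scores; here it is just any callable returning one score per layer."""
    scores = layer_scorer(input_embedding)
    return max(range(len(scores)), key=lambda i: scores[i])

def apply_steering(hidden_states, steering_vec, layer_idx, alpha=1.0):
    """Shift the chosen layer's representation by alpha * steering_vec,
    leaving all other layers untouched."""
    out = [list(h) for h in hidden_states]
    out[layer_idx] = [x + alpha * s for x, s in zip(out[layer_idx], steering_vec)]
    return out
```

A fixed-layer baseline would hard-code `layer_idx`; the point of the paper is that the best choice varies per input, which `select_layer` makes explicit.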
[LG-62] SecureAFL: Secure Asynchronous Federated Learning ASIACCS2026
链接: https://arxiv.org/abs/2604.03862
作者: Anjun Gao,Feng Wang,Zhenglin Wan,Yueyang Quan,Zhuqing Liu,Minghong Fang
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: To appear in ACM AsiaCCS 2026
Abstract:Federated learning (FL) enables multiple clients to collaboratively train a global machine learning model via a server without sharing their private training data. In traditional FL, the system follows a synchronous approach, where the server waits for model updates from numerous clients before aggregating them to update the global model. However, synchronous FL is hindered by the straggler problem. To address this, the asynchronous FL architecture allows the server to update the global model immediately upon receiving any client’s local model update. Despite its advantages, the decentralized nature of asynchronous FL makes it vulnerable to poisoning attacks. Several defenses tailored for asynchronous FL have been proposed, but these mechanisms remain susceptible to advanced attacks or rely on unrealistic server assumptions. In this paper, we introduce SecureAFL, an innovative framework designed to secure asynchronous FL against poisoning attacks. SecureAFL improves the robustness of asynchronous FL by detecting and discarding anomalous updates while estimating the contributions of missing clients. Additionally, it utilizes Byzantine-robust aggregation techniques, such as coordinate-wise median, to integrate the received and estimated updates. Extensive experiments on various real-world datasets demonstrate the effectiveness of SecureAFL.
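The coordinate-wise median mentioned above is a standard Byzantine-robust aggregation rule; a minimal stdlib sketch (not the SecureAFL implementation, which combines it with anomaly detection and missing-client estimation) looks like:

```python
import statistics

def coordinate_wise_median(updates):
    """Aggregate client model updates coordinate by coordinate with the median,
    so a minority of poisoned updates cannot drag any single coordinate
    arbitrarily far. `updates` is a list of equal-length parameter vectors,
    one per client."""
    dim = len(updates[0])
    return [statistics.median(u[k] for u in updates) for k in range(dim)]
```

With a simple mean, one malicious client submitting huge values shifts every coordinate; the median bounds that influence as long as honest clients form a majority.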
[LG-63] A Bayesian Information-Theoretic Approach to Data Attribution AISTATS2026
链接: https://arxiv.org/abs/2604.03858
作者: Dharmesh Tailor,Nicolò Felicioni,Kamil Ciosek
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at the 29th International Conference on Artificial Intelligence and Statistics (AISTATS 2026)
Abstract:Training Data Attribution (TDA) seeks to trace model predictions back to influential training examples, enhancing interpretability and safety. We formulate TDA as a Bayesian information-theoretic problem: subsets are scored by the information loss they induce - the entropy increase at a query when removed. This criterion credits examples for resolving predictive uncertainty rather than label noise. To scale to modern networks, we approximate information loss using a Gaussian Process surrogate built from tangent features. We show this aligns with classical influence scores for single-example attribution while promoting diversity for subsets. For even larger-scale retrieval, we relax to an information-gain objective and add a variance correction for scalable attribution in vector databases. Experiments show competitive performance on counterfactual sensitivity, ground-truth retrieval and coreset selection, showing that our method scales to modern architectures while bridging principled measures with practice.
[LG-64] Understanding When Poisson Log-Normal Models Outperform Penalized Poisson Regression for Microbiome Count Data
链接: https://arxiv.org/abs/2604.03853
作者: Daniel Agyapong,Julien Chiquet,Jane Marks,Toby Dylan Hocking
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multivariate count models are often justified by their ability to capture latent dependence, but researchers receive little guidance on when this added structure improves on simpler penalized marginal Poisson regression. We study this question using real microbiome data under a unified held-out evaluation framework. For count prediction, we compare PLN and GLMNet(Poisson) on 20 datasets spanning 32 to 18,270 samples and 24 to 257 taxa, using held-out Poisson deviance under leave-one-taxon-out prediction with 3-fold sample cross-validation rather than synthetic or in-sample criteria. For network inference, we compare PLNNetwork and GLMNet(Poisson) neighborhood selection on five publicly available datasets with experimentally validated microbial interaction truth. PLN outperforms GLMNet(Poisson) on most count-prediction datasets, with gains up to 38 percent. The primary predictor of the winner is the sample-to-taxon ratio, with mean absolute correlation as the strongest secondary signal and overdispersion as an additional predictor. PLNNetwork performs best on broad undirected interaction benchmarks, whereas GLMNet(Poisson) is better aligned with local or directional effects. Taken together, these results provide guidance for choosing between latent multivariate count models and penalized Poisson regression in biological count prediction and interaction recovery.
[LG-65] Collapse-Free Prototype Readout Layer for Transformer Encoders
链接: https://arxiv.org/abs/2604.03850
作者: Giansalvo Cirrincione,Rahul Ranjeev Kumar
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 35 pages, 6 figures, submitted to Pattern Recognition
Abstract:DDCL-Attention is a prototype-based readout layer for transformer encoders that replaces simple pooling methods, such as mean pooling or class tokens, with a learned compression mechanism. It uses a small set of global prototype vectors and assigns tokens to them through soft probabilistic matching, producing compact token summaries at linear complexity in sequence length. The method offers three main advantages. First, it avoids prototype collapse through an exact decomposition of the training loss into a reconstruction term and a diversity term, ensuring that prototypes remain distinct. Second, its joint training with the encoder is shown to be stable under a practical timescale condition, using Tikhonov’s singular perturbation theory and explicit learning-rate constraints. Third, the same framework supports three uses: a final readout layer, a differentiable codebook extending VQ-VAE, and a hierarchical document compressor. Experiments on four datasets confirm the theoretical predictions: the loss decomposition holds exactly, prototype separation grows as expected when the stability condition is met, and the codebook reaches full utilization, outperforming standard hard vector quantization. An additional study on orbital debris classification shows that the method also applies beyond standard NLP and vision tasks, including scientific tabular data.
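The soft probabilistic matching of tokens to a small set of global prototypes can be sketched as a softmax over negative squared distances (an illustrative sketch only; the temperature `tau`, the distance kernel, and the summary rule are assumptions, not the paper's exact DDCL-Attention formulation):

```python
import math

def soft_assign(tokens, prototypes, tau=1.0):
    """Soft probabilistic matching of each token to K global prototypes
    (softmax over negative squared distances). Cost is O(N * K), i.e.
    linear in sequence length N for a fixed number of prototypes K."""
    assignments = []
    for t in tokens:
        scores = [-sum((ti - pi) ** 2 for ti, pi in zip(t, p)) / tau
                  for p in prototypes]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        assignments.append([e / z for e in exps])
    return assignments

def prototype_summary(tokens, prototypes, tau=1.0):
    """Compact per-prototype token summaries: assignment-weighted token means."""
    A = soft_assign(tokens, prototypes, tau)
    n, d, K = len(tokens), len(tokens[0]), len(prototypes)
    summaries = []
    for k in range(K):
        w = [A[i][k] for i in range(n)]
        total = sum(w) or 1.0
        summaries.append([sum(w[i] * tokens[i][j] for i in range(n)) / total
                          for j in range(d)])
    return summaries
```

The collapse the paper guards against would correspond to all prototypes drifting together so every row of `soft_assign` becomes uniform; their diversity term penalizes exactly that.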
[LG-66] Explainability-Guided Adversarial Attacks on Transformer-Based Malware Detectors Using Control Flow Graphs
链接: https://arxiv.org/abs/2604.03843
作者: Andrew Wheeler,Kshitiz Aryal,Maanak Gupta
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 9 pages, 3 figures, 4 tables, 1 algorithm, 2 equations
Abstract:Transformer-based malware detection systems operating on graph modalities such as control flow graphs (CFGs) achieve strong performance by modeling structural relationships in program behavior. However, their robustness to adversarial evasion attacks remains underexplored. This paper examines the vulnerability of a RoBERTa-based malware detector that linearizes CFGs into sequences of function calls, a design choice that enables transformer modeling but may introduce token-level sensitivities and ordering artifacts exploitable by adversaries. By evaluating evasion strategies within this graph-to-sequence framework, we provide insight into the practical robustness of transformer-based malware detectors beyond aggregate detection accuracy. This paper proposes a white-box adversarial evasion attack that leverages explainability mechanisms to identify and perturb most influential graph components. Using token- and word-level attributions derived from integrated gradients, the attack iteratively replaces positively attributed function calls with synthetic external imports, producing adversarial CFG representations without altering overall program structure. Experimental evaluation on small- and large-scale Windows Portable Executable (PE) datasets demonstrates that the proposed method can reliably induce misclassification, even against models trained to high accuracy. Our results highlight that explainability tools, while valuable for interpretability, can also expose critical attack surfaces in transformer-based malware detectors.
[LG-67] On the Efficiency of Sinkhorn-Knopp for Entropically Regularized Optimal Transport
链接: https://arxiv.org/abs/2604.03787
作者: Kun He
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 66 pages
Abstract:The Sinkhorn–Knopp (SK) algorithm is a cornerstone method for matrix scaling and entropically regularized optimal transport (EOT). Despite its empirical efficiency, existing theoretical guarantees to achieve a target marginal accuracy \varepsilon deteriorate severely in the presence of outliers, bottlenecked either by the global maximum regularized cost \eta\|C\|_\infty (where \eta is the regularization parameter and C the cost matrix) or the matrix’s minimum-to-maximum entry ratio \nu . This creates a fundamental disconnect between theory and practice. In this paper, we resolve this discrepancy. For EOT, we introduce the novel concept of well-boundedness, a local bulk mass property that rigorously isolates the well-behaved portion of the data from extreme outliers. We prove that governed by this fundamental notion, SK recovers the target transport plan for a problem of dimension n in O(\log n - \log \varepsilon) iterations, completely independent of the regularized cost \eta\|C\|_\infty . Furthermore, we show that a virtually cost-free pre-scaling step eliminates the dimensional dependence entirely, accelerating convergence to a strictly dimension-free O(\log(1/\varepsilon)) iterations. Beyond EOT, we establish a sharp phase transition for general (\mathbf{u},\mathbf{v}) -scaling governed by a critical matrix density threshold. We prove that when a matrix’s density exceeds this threshold, the iteration complexity is strictly independent of \nu . Conversely, when the density falls below this threshold, the dependence on \nu becomes unavoidable; in this sub-critical regime, we construct instances where SK requires \Omega(n/\varepsilon) iterations.
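For readers unfamiliar with the algorithm under discussion, the basic SK iteration for EOT alternates row and column rescalings of the kernel exp(-eta * C) until both marginals match (a textbook sketch, not the paper's analyzed variant; `eta` and the iteration count are illustrative):

```python
import math

def sinkhorn_knopp(C, r, c, eta=10.0, iters=500):
    """Find P = diag(u) K diag(v) with K = exp(-eta * C) whose row sums match
    r and column sums match c, by alternately rescaling rows and columns."""
    n, m = len(C), len(C[0])
    K = [[math.exp(-eta * C[i][j]) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # scale rows to hit r, then columns to hit c
        u = [r[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [c[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

The bottleneck the paper studies is visible here: a single huge cost entry makes the corresponding K entry vanish, which is exactly the kind of outlier their well-boundedness notion isolates.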
[LG-68] Spatiotemporal-Aware Bit-Flip Injection on DNN-based Advanced Driver Assistance Systems
链接: https://arxiv.org/abs/2604.03753
作者: Taibiao Zhao,Xiang Zhang,Mingxuan Sun,Ruyi Ding,Xugui Zhou
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Modern advanced driver assistance systems (ADAS) rely on deep neural networks (DNNs) for perception and planning. Since DNNs’ parameters reside in DRAM during inference, bit flips caused by cosmic radiation or low-voltage operation may corrupt DNN computations, distort driving decisions, and lead to real-world incidents. This paper presents a SpatioTemporal-Aware Fault Injection (STAFI) framework to locate critical fault sites in DNNs for ADAS efficiently. Spatially, we propose a Progressive Metric-guided Bit Search (PMBS) that efficiently identifies critical network weight bits whose corruption causes the largest deviations in driving behavior (e.g., unintended acceleration or steering). Furthermore, we develop a Critical Fault Time Identification (CFTI) mechanism that determines when to trigger these faults, taking into account the context of real-time systems and environmental states, to maximize the safety impact. Experiments on DNNs for a production ADAS demonstrate that STAFI uncovers 29.56x more hazard-inducing critical faults than the strongest baseline.
[LG-69] Algebraic Diversity: Group-Theoretic Spectral Estimation from Single Observations
链接: https://arxiv.org/abs/2604.03634
作者: Mitchell A. Thornton
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP)
*备注:
Abstract:We prove that temporal averaging over multiple observations can be replaced by algebraic group action on a single observation for second-order statistical estimation. A General Replacement Theorem establishes conditions under which a group-averaged estimator from one snapshot achieves equivalent subspace decomposition to multi-snapshot covariance estimation, and an Optimality Theorem proves that the symmetric group is universally optimal (yielding the KL transform). The framework unifies the DFT, DCT, and KLT as special cases of group-matched spectral transforms, with a closed-form double-commutator eigenvalue problem for polynomial-time optimal group selection. Five applications are demonstrated: MUSIC DOA estimation from a single snapshot, massive MIMO channel estimation with 64% throughput gain, single-pulse waveform classification at 90% accuracy, graph signal processing with non-Abelian groups, and a new algebraic analysis of transformer LLMs revealing that RoPE uses the wrong algebraic group for 70-80% of attention heads across five models (22,480 head observations), that the optimal group is content-dependent, and that spectral-concentration-based pruning improves perplexity at the 13B scale. All diagnostics require a single forward pass with no gradients or training.
[LG-70] BlazeFL: Fast and Deterministic Federated Learning Simulation CVPR2026 CVPR
链接: https://arxiv.org/abs/2604.03606
作者: Kitsuya Azuma,Takayuki Nishio
类目: Machine Learning (cs.LG)
*备注: 9 pages, 4 figures. Accepted to the FedVision at CVPR 2026 (CVPRW)
Abstract:Federated learning (FL) research increasingly relies on single-node simulations with hundreds or thousands of virtual clients, making both efficiency and reproducibility essential. Yet parallel client training often introduces nondeterminism through shared random state and scheduling variability, forcing researchers to trade throughput for reproducibility or to implement custom control logic within complex frameworks. We present BlazeFL, a lightweight framework for single-node FL simulation that alleviates this trade-off through free-threaded shared-memory execution and deterministic randomness management. BlazeFL uses thread-based parallelism with in-memory parameter exchange between the server and clients, avoiding serialization and inter-process communication overhead. To support deterministic execution, BlazeFL assigns isolated random number generator (RNG) streams to clients. Under a fixed software/hardware stack, and when stochastic operators consume BlazeFL-managed generators, this design yields bitwise-identical results across repeated high-concurrency runs in both thread-based and process-based modes. In CIFAR-10 image-classification experiments, BlazeFL substantially reduces execution time relative to a widely used open-source baseline, achieving up to 3.1 \times speedup on communication-dominated workloads while preserving a lightweight dependency footprint. Our open-source implementation is available at: this https URL.
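The isolated per-client RNG streams described above can be sketched with the standard library (a hedged illustration, not BlazeFL's actual API; the seed-derivation scheme is an assumption). Seeding `random.Random` with a string hashes it via SHA-512, so each client's draws are reproducible regardless of thread scheduling or `PYTHONHASHSEED`:

```python
import random

def client_rng(base_seed, client_id, round_idx):
    """A dedicated RNG stream per (client, round). String seeds are hashed
    deterministically, so a client's draws never depend on which other
    clients ran first or on interpreter hash randomization."""
    return random.Random(f"{base_seed}/{client_id}/{round_idx}")

def simulate_round(client_ids, base_seed=42, round_idx=0):
    """Each client draws from its own stream; the result is identical no
    matter the order in which clients are scheduled."""
    return {cid: client_rng(base_seed, cid, round_idx).random()
            for cid in client_ids}
```

A shared global RNG would make the draw each client sees depend on execution order, which is precisely the nondeterminism the framework is designed to eliminate.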
[LG-71] Evaluation of Bagging Predictors with Kernel Density Estimation and Bagging Score
链接: https://arxiv.org/abs/2604.03599
作者: Philipp Seitz,Jan Schmitt,Andreas Schiffler
类目: Machine Learning (cs.LG)
*备注: 5 pages, 2 figures, 2 tables, 1 algorithm, 9th International Conference on Advances in Artificial Intelligence (ICAAI 2025)
Abstract:For a larger set of predictions from several differently trained machine learning models, known as bagging predictors, the mean of all predictions is taken by default. Nevertheless, this procedure can deviate from the actual ground truth in certain parameter regions. An approach is presented to determine a representative y_BS from such a set of predictions using Kernel Density Estimation (KDE) in nonlinear regression with Neural Networks (NN), which simultaneously provides an associated quality criterion beta_BS, called Bagging Score (BS), that reflects the confidence of the obtained ensemble prediction. It is shown that the new approach yields better predictions than the common use of the mean or median. In addition, the method is compared with several approaches to nonlinear regression from the literature, achieving a top ranking in each of the calculated error metrics without using any optimization or feature selection technique.
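The idea of taking the KDE mode of the ensemble's predictions as the representative value, with an agreement score alongside it, can be sketched with the standard library (an illustrative sketch only; the bandwidth, grid, and score normalization are assumptions and do not reproduce the paper's exact y_BS and beta_BS definitions):

```python
import math

def kde_representative(preds, bandwidth=0.1, grid_size=201):
    """Gaussian KDE over ensemble predictions. Returns the density mode as the
    representative prediction and the mode density, normalized by the peak a
    perfectly agreeing ensemble would reach, as a crude [0, 1] agreement score."""
    lo = min(preds) - 3 * bandwidth
    hi = max(preds) + 3 * bandwidth
    step = (hi - lo) / (grid_size - 1)
    norm = len(preds) * bandwidth * math.sqrt(2 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in preds) / norm
    xs = [lo + i * step for i in range(grid_size)]
    y_rep = max(xs, key=density)
    peak = 1.0 / (bandwidth * math.sqrt(2 * math.pi))  # all kernels stacked at one point
    return y_rep, density(y_rep) / peak
```

Unlike the mean, the mode ignores a minority of ensemble members that land far from the main cluster, and the score drops when the members disagree.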
[LG-72] Simple yet Effective: Low-Rank Spatial Attention for Neural Operators
链接: https://arxiv.org/abs/2604.03582
作者: Zherui Yang,Haiyang Xin,Tao Du,Ligang Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Neural operators have emerged as data-driven surrogates for solving partial differential equations (PDEs), and their success hinges on efficiently modeling the long-range, global coupling among spatial points induced by the underlying physics. In many PDE regimes, the induced global interaction kernels are empirically compressible, exhibiting rapid spectral decay that admits low-rank approximations. We leverage this observation to unify representative global mixing modules in neural operators under a shared low-rank template: compressing high-dimensional pointwise features into a compact latent space, processing global interactions within it, and reconstructing the global context back to spatial points. Guided by this view, we introduce Low-Rank Spatial Attention (LRSA) as a clean and direct instantiation of this template. Crucially, unlike prior approaches that often rely on non-standard aggregation or normalization modules, LRSA is built purely from standard Transformer primitives, i.e., attention, normalization, and feed-forward networks, yielding a concise block that is straightforward to implement and directly compatible with hardware-optimized kernels. In our experiments, such a simple construction is sufficient to achieve high accuracy, yielding an average error reduction of over 17% relative to second-best methods, while remaining stable and efficient in mixed-precision training.
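The compress/process/reconstruct template the abstract describes can be sketched with plain scaled dot-product attention (an illustrative sketch of the low-rank pattern only; the learned projections, normalization, and feed-forward processing of the real LRSA block are omitted, and all names are hypothetical):

```python
import math

def softmax_rows(scores):
    out = []
    for row in scores:
        m = max(row)
        e = [math.exp(s - m) for s in row]
        z = sum(e)
        out.append([x / z for x in e])
    return out

def attend(queries, keys, values):
    """Standard scaled dot-product attention on plain lists of vectors."""
    d = len(keys[0])
    scores = [[sum(q[i] * k[i] for i in range(d)) / math.sqrt(d) for k in keys]
              for q in queries]
    W = softmax_rows(scores)
    return [[sum(w[j] * values[j][i] for j in range(len(values)))
             for i in range(len(values[0]))] for w in W]

def low_rank_mix(tokens, latents):
    """Low-rank global mixing: k latents attend to N tokens (compress), then
    tokens attend back to the k processed latents (reconstruct). Both steps
    cost O(N * k), linear in sequence length for fixed k."""
    z = attend(latents, tokens, tokens)   # compress N tokens into k summaries
    return attend(tokens, z, z)           # broadcast global context back
```

Full self-attention over the same tokens would cost O(N^2); routing everything through k latents is what makes the global coupling linear in N.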
[LG-73] Choosing the Right Regularizer for Applied ML: Simulation Benchmarks of Popular Scikit-learn Regularization Frameworks
链接: https://arxiv.org/abs/2604.03541
作者: Benjamin S. Knight,Ahsaas Bajaj
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:This study surveys the historical development of regularization, tracing its evolution from stepwise regression in the 1960s to recent advancements in formal error control, structured penalties for non-independent features, Bayesian methods, and l0-based regularization (among other techniques). We empirically evaluate the performance of four canonical frameworks – Ridge, Lasso, ElasticNet, and Post-Lasso OLS – across 134,400 simulations spanning a 7-dimensional manifold grounded in eight production-grade machine learning models. Our findings demonstrate that for prediction accuracy when the sample-to-feature ratio is sufficient (n/p = 78), Ridge, Lasso, and ElasticNet are nearly interchangeable. However, we find that Lasso recall is highly fragile under multicollinearity; at high condition numbers (kappa) and low SNR, Lasso recall collapses to 0.18 while ElasticNet maintains 0.93. Consequently, we advise practitioners against using Lasso or Post-Lasso OLS at high kappa with small sample sizes. The analysis concludes with an objective-driven decision guide to assist machine learning engineers in selecting the optimal scikit-learn-supported framework based on observable feature space attributes.
[LG-74] Online learning of smooth functions on mathbbR
链接: https://arxiv.org/abs/2604.03525
作者: Jesse Geneson,Kuldeep Singh,Alexander Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study adversarial online learning of real-valued functions on \mathbb{R} . In each round the learner is queried at x_t \in \mathbb{R} , predicts \hat y_t , and then observes the true value f(x_t) ; performance is measured by cumulative p -loss \sum_{t\ge 1}|\hat y_t-f(x_t)|^p . For the class \mathcal{G}_q=\{f:\mathbb{R}\to\mathbb{R} \text{ absolutely continuous}: \int_{\mathbb{R}}|f'(x)|^q\,dx\le 1\} , we show that the standard model becomes ill-posed on \mathbb{R} : for every p\ge 1 and q>1 , an adversary can force infinite loss. Motivated by this obstruction, we analyze three modified learning scenarios that limit the influence of queries that are far from previously observed inputs. In Scenario 1 the adversary must choose each new query within distance 1 of some past query. In Scenario 2 the adversary may query anywhere, but the learner is penalized only on rounds whose query lies within distance 1 of a past query. In Scenario 3 the loss in round t is multiplied by a weight g(\min_{j<t}|x_t-x_j|) . We obtain sharp characterizations for Scenarios 1-2 in several regimes. For Scenario 3 we identify a clean threshold phenomenon: if g decays too slowly, then the adversary can force infinite weighted loss. In contrast, for rapidly decaying weights such as g(z)=e^{-cz} we obtain finite and sharp guarantees in the quadratic case p=q=2 . Finally, we study a natural multivariable slice generalization \mathcal{G}_{q,d} of \mathcal{G}_q on \mathbb{R}^d and show a sharp dichotomy: while the one-dimensional case admits finite opt-values in certain regimes, for every d\ge 2 the slice class \mathcal{G}_{q,d} is too permissive, and even under Scenarios 1-3 an adversary can force infinite loss.
[LG-75] Improving Feasibility via Fast Autoencoder-Based Projections
链接: https://arxiv.org/abs/2604.03489
作者: Maria Chzhen,Priya L. Donti
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Enforcing complex (e.g., nonconvex) operational constraints is a critical challenge in real-world learning and control systems. However, existing methods struggle to efficiently enforce general classes of constraints. To address this, we propose a novel data-driven amortized approach that uses a trained autoencoder as an approximate projector to provide fast corrections to infeasible predictions. Specifically, we train an autoencoder using an adversarial objective to learn a structured, convex latent representation of the feasible set. This enables rapid correction of neural network outputs by projecting their associated latent representations onto a simple convex shape before decoding into the original feasible set. We test our approach on a diverse suite of constrained optimization and reinforcement learning problems with challenging nonconvex constraints. Results show that our method effectively enforces constraints at a low computational cost, offering a practical alternative to expensive feasibility correction techniques based on traditional solvers.
[LG-76] Investigating Data Interventions for Subgroup Fairness: An ICU Case Study
链接: https://arxiv.org/abs/2604.03478
作者: Erin Tan,Judy Hanwen Shen,Irene Y. Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:In high-stakes settings where machine learning models are used to automate decision-making about individuals, the presence of algorithmic bias can exacerbate systemic harm to certain subgroups of people. These biases often stem from the underlying training data. In practice, interventions to “fix the data” depend on the actual additional data sources available – where many are less than ideal. In these cases, the effects of data scaling on subgroup performance become volatile, as the improvements from increased sample size are counteracted by the introduction of distribution shifts in the training set. In this paper, we investigate the limitations of combining data sources to improve subgroup performance within the context of healthcare. Clinical models are commonly trained on datasets comprised of patient electronic health record (EHR) data from different hospitals or admission departments. Across two such datasets, the eICU Collaborative Research Database and the MIMIC-IV dataset, we find that data addition can both help and hurt model fairness and performance, and many intuitive strategies for data selection are unreliable. We compare model-based post-hoc calibration and data-centric addition strategies to find that the combination of both is important to improve subgroup performance. Our work questions the traditional dogma of “better data” for overcoming fairness challenges by comparing and combining data- and model-based approaches.
[LG-77] Super Agents and Confounders: Influence of surrounding agents on vehicle trajectory prediction
链接: https://arxiv.org/abs/2604.03463
作者: Daniel Jost,Luca Paparusso,Martin Stoll,Jörg Wagner,Raghu Rajan,Joschka Bödecker
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
Abstract:In highly interactive driving scenes, trajectory prediction is conditioned on information from surrounding traffic participants such as cars and pedestrians. Our main contribution is a comprehensive analysis of state-of-the-art trajectory predictors, which reveals a surprising and critical flaw: many surrounding agents degrade prediction accuracy rather than improve it. Using Shapley-based attribution, we rigorously demonstrate that models learn unstable and non-causal decision-making schemes that vary significantly across training runs. Building on these insights, we propose to integrate a Conditional Information Bottleneck (CIB), which does not require additional supervision and is trained to effectively compress agent features as well as ignore those that are not beneficial for the prediction task. Comprehensive experiments using multiple datasets and model architectures demonstrate that this simple yet effective approach not only improves overall trajectory prediction performance in many cases but also increases robustness to different perturbations. Our results highlight the importance of selectively integrating contextual information, which can often contain spurious or misleading signals, in trajectory prediction. Moreover, we provide interpretable metrics for identifying non-robust behavior and present a promising avenue towards a solution.
[LG-78] Earth Embeddings Reveal Diverse Urban Signals from Space
链接: https://arxiv.org/abs/2604.03456
作者: Wenjing Gong,Udbhav Srivastava,Yuchen Wang,Yuhao Jia,Qifan Wu,Weishan Bai,Yifan Yang,Xiao Huang,Xinyue Ye
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 30 pages, 18 figures
Abstract:Conventional urban indicators derived from censuses, surveys, and administrative records are often costly, spatially inconsistent, and slow to update. Recent geospatial foundation models enable Earth embeddings, compact satellite image representations transferable across downstream tasks, but their utility for neighborhood-scale urban monitoring remains unclear. Here, we benchmark three Earth embedding families, AlphaEarth, Prithvi, and Clay, for urban signal prediction across six U.S. metropolitan areas from 2020 to 2023. Using a unified supervised-learning framework, we predict 14 neighborhood-level indicators spanning crime, income, health, and travel behavior, and evaluate performance under four settings: global, city-wise, year-wise, and city-year. Results show that Earth embeddings capture substantial urban variation, with the highest predictive skill for outcomes more directly tied to built-environment structure, including chronic health burdens and dominant commuting modes. By contrast, indicators shaped more strongly by fine-scale behavior and local policy, such as cycling, remain difficult to infer. Predictive performance varies markedly across cities but remains comparatively stable across years, indicating strong spatial heterogeneity alongside temporal robustness. Exploratory analysis suggests that cross-city variation in predictive performance is associated with urban form in task-specific ways. Controlled dimensionality experiments show that representation efficiency is critical: compact 64-dimensional AlphaEarth embeddings remain more informative than 64-dimensional reductions of Prithvi and Clay. This study establishes a benchmark for evaluating Earth embeddings in urban remote sensing and demonstrates their potential as scalable, low-cost features for SDG-aligned neighborhood-scale urban monitoring.
[LG-79] Neural Operators for Multi-Task Control and Adaptation
链接: https://arxiv.org/abs/2604.03449
作者: David Sewell,Xingjian Li,Stepan Tretiakov,Krishna Kumar,David Fridovich-Keil
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 25 pages, 10 figures, 2 tables
Abstract:Neural operator methods have emerged as powerful tools for learning mappings between infinite-dimensional function spaces, yet their potential in optimal control remains largely unexplored. We focus on multi-task control problems, whose solution is a mapping from task description (e.g., cost or dynamics functions) to optimal control law (e.g., feedback policy). We approximate these solution operators using a permutation-invariant neural operator architecture. Across a range of parametric optimal control environments and a locomotion benchmark, a single operator trained via behavioral cloning accurately approximates the solution operator and generalizes to unseen tasks, out-of-distribution settings, and varying amounts of task observations. We further show that the branch-trunk structure of our neural operator architecture enables efficient and flexible adaptation to new tasks. We develop structured adaptation strategies ranging from lightweight updates to full-network fine-tuning, achieving strong performance across different data and compute settings. Finally, we introduce meta-trained operator variants that optimize the initialization for few-shot adaptation. These methods enable rapid task adaptation with limited data and consistently outperform a popular meta-learning baseline. Together, our results demonstrate that neural operators provide a unified and efficient framework for multi-task control and adaptation.
[LG-80] Adversarial Robustness of Deep State Space Models for Forecasting
链接: https://arxiv.org/abs/2604.03427
作者: Sribalaji C. Anand,George J. Pappas
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 8 pages, 5 figures, conference submission
Abstract:State-space models (SSMs) for time-series forecasting have demonstrated strong empirical performance on benchmark datasets, yet their robustness under adversarial perturbations is poorly understood. We address this gap through a control-theoretic lens, focusing on the recently proposed Spacetime SSM forecaster. We first establish that the decoder-only Spacetime architecture can represent the optimal Kalman predictor when the underlying data-generating process is autoregressive - a property no other SSM possesses. Building on this, we formulate robust forecaster design as a Stackelberg game against worst-case stealthy adversaries constrained by a detection budget, and solve it via adversarial training. We derive closed-form bounds on adversarial forecasting error that expose how open-loop instability, closed-loop instability, and decoder state dimension each amplify vulnerability - offering actionable principles towards robust forecaster design. Finally, we show that even adversaries with no access to the forecaster can nonetheless construct effective attacks by exploiting the model’s locally linear input-output behavior, bypassing gradient computations entirely. Experiments on the Monash benchmark datasets highlight that model-free attacks, without any gradient computation, can cause at least 33% more error than projected gradient descent with a small step size.
[LG-81] Adaptive Threshold-Driven Continuous Greedy Method for Scalable Submodular Optimization
链接: https://arxiv.org/abs/2604.03419
作者: Mohammadreza Rostami,Solmaz S. Kia
类目: Machine Learning (cs.LG); Combinatorics (math.CO)
*备注:
Abstract:Submodular maximization under matroid constraints is a fundamental problem in combinatorial optimization with applications in sensing, data summarization, active learning, and resource allocation. While the Sequential Greedy (SG) algorithm achieves only a $\frac{1}{2}$-approximation due to irrevocable selections, Continuous Greedy (CG) attains the optimal $(1-\frac{1}{e})$-approximation via the multilinear relaxation, at the cost of a progressively dense decision vector that forces agents to exchange feature embeddings for nearly every ground-set element. We propose ATCG (Adaptive Thresholded Continuous Greedy), which gates gradient evaluations behind a per-partition progress ratio $\eta_i$, expanding each agent’s active set only when current candidates fail to capture sufficient marginal gain, thereby directly bounding which feature embeddings are ever transmitted. Theoretical analysis establishes a curvature-aware approximation guarantee with effective factor $\tau_{\mathrm{eff}}=\max\{\tau, 1-c\}$, interpolating between the threshold-based guarantee and the low-curvature regime where ATCG recovers the performance of CG. Experiments on a class-balanced prototype selection problem over a subset of the CIFAR-10 animal dataset show that ATCG achieves objective values comparable to those of the full CG method while substantially reducing communication overhead through adaptive active-set expansion.
[LG-82] Beauty in the Eye of AI: Aligning LLMs and Vision Models with Human Aesthetics in Network Visualization
链接: https://arxiv.org/abs/2604.03417
作者: Peng Zhang,Xuefeng Li,Xiaoqi Wang,Han-Wei Shen,Yifan Hu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Network visualization has traditionally relied on heuristic metrics, such as stress, under the assumption that optimizing them leads to aesthetic and informative layouts. However, no single metric consistently produces the most effective results. A data-driven alternative is to learn from human preferences, where annotators select their favored visualization among multiple layouts of the same graphs. These human-preference labels can then be used to train a generative model that approximates human aesthetic preferences. However, obtaining human labels at scale is costly and time-consuming. As a result, this generative approach has so far been tested only with machine-labeled data. In this paper, we explore the use of large language models (LLMs) and vision models (VMs) as proxies for human judgment. Through a carefully designed user study involving 27 participants, we curated a large set of human preference labels. We used this data both to better understand human preferences and to bootstrap LLM/VM labelers. We show that prompt engineering that combines few-shot examples and diverse input formats, such as image embeddings, significantly improves LLM-human alignment, and additional filtering by the confidence score of the LLM pushes the alignment to human-human levels. Furthermore, we demonstrate that carefully trained VMs can achieve VM-human alignment at a level comparable to that between human annotators. Our results suggest that AI can feasibly serve as a scalable proxy for human labelers.
[LG-83] Diffusion Policy with Bayesian Expert Selection for Active Multi-Target Tracking
链接: https://arxiv.org/abs/2604.03404
作者: Haotian Xiang,Qin Lu,Yaakov Bar-Shalom
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Active multi-target tracking requires a mobile robot to balance exploration for undetected targets with exploitation of uncertain tracked ones. Diffusion policies have emerged as a powerful approach for capturing diverse behavioral strategies by learning action sequences from expert demonstrations. However, existing methods implicitly select among strategies through the denoising process, without uncertainty quantification over which strategy to execute. We formulate expert selection for diffusion policies as an offline contextual bandit problem and propose a Bayesian framework for pessimistic, uncertainty-aware strategy selection. A multi-head Variational Bayesian Last Layer (VBLL) model predicts the expected tracking performance of each expert strategy given the current belief state, providing both a point estimate and predictive uncertainty. Following the pessimism principle for offline decision-making, a Lower Confidence Bound (LCB) criterion then selects the expert whose worst-case predicted performance is best, avoiding overcommitment to experts with unreliable predictions. The selected expert conditions a diffusion policy to generate corresponding action sequences. Experiments on simulated indoor tracking scenarios demonstrate that our approach outperforms both the base diffusion policy and standard gating methods, including Mixture-of-Experts selection and deterministic regression baselines.
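The pessimistic Lower Confidence Bound rule described in the abstract is simple to state concretely: pick the expert whose worst-case predicted performance is best. The sketch below is an illustration under our own assumptions (the pessimism weight `beta` and the example numbers are not from the paper; in the actual method, the VBLL model would supply the means and uncertainties).

```python
import numpy as np

def select_expert_lcb(means, stds, beta=1.0):
    """Pick the expert maximizing the Lower Confidence Bound (LCB).

    means : predicted expected tracking performance per expert strategy
    stds  : predictive standard deviations (epistemic uncertainty)
    beta  : pessimism weight; larger beta penalizes uncertain experts more
    """
    lcb = np.asarray(means) - beta * np.asarray(stds)
    return int(np.argmax(lcb))

# Expert 1 has the best mean but is highly uncertain; pessimism avoids it.
means = [0.70, 0.85, 0.60]
stds  = [0.05, 0.40, 0.10]
print(select_expert_lcb(means, stds))  # 0 (the reliable expert wins)
```

With `beta=0` the rule degenerates to greedy mean maximization and would pick the unreliable expert 1, which is exactly the overcommitment the pessimism principle is meant to avoid.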
[LG-84] SDVDiag: Using Context-Aware Causality Mining for the Diagnosis of Connected Vehicle Functions
链接: https://arxiv.org/abs/2604.03391
作者: Matthias Weiß,Falk Dettinger,Elias Detrois,Nasser Jazdi,Michael Weyrich
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 7 pages, 4 figures, to be submitted to the VTC2026
Abstract:Real-world implementations of connected vehicle functions are spreading steadily, yet operating these functions reliably remains challenging due to their distributed nature and the complexity of the underlying cloud, edge, and networking infrastructure. Quick diagnosis of problems and understanding the error chains that lead to failures is essential for reducing downtime. However, diagnosing these systems is still largely performed manually, as automated analysis techniques are predominantly data-driven and struggle with hidden relationships and the integration of context information. This paper addresses this gap by introducing a multimodal approach that integrates human feedback and system-specific information into the causal analysis process. Reinforcement Learning from Human Feedback is employed to continuously train a causality mining model while incorporating expert knowledge. Additional modules leverage distributed tracing data to prune false-positive causal links and enable the injection of domain-specific relationships to further refine the causal graph. Evaluation is performed using an automated valet parking application operated in a connected vehicle test field. Results demonstrate a significant increase in precision from 14% to 100% for the detection of causal edges and improved system interpretability compared to purely data-driven approaches, highlighting the potential for system operators in the connected vehicle domain.
[LG-85] Scalable Variational Bayesian Fine-Tuning of LLM s via Orthogonalized Low-Rank Adapters
链接: https://arxiv.org/abs/2604.03388
作者: Haotian Xiang,Bingcong Li,Qin Lu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:When deploying large language models (LLMs) to safety-critical applications, uncertainty quantification (UQ) is of utmost importance to self-assess the reliability of the LLM-based decisions. However, such decisions typically suffer from overconfidence, particularly after parameter-efficient fine-tuning (PEFT) for downstream domain-specific tasks with limited data. Existing methods to alleviate this issue either rely on Laplace approximation based post-hoc framework, which may yield suboptimal calibration depending on the training trajectory, or variational Bayesian training that requires multiple complete forward passes through the entire LLM backbone at inference time for Monte Carlo estimation, posing scalability challenges for deployment. To address these limitations, we build on the Bayesian last layer (BLL) model, where the LLM-based deterministic feature extractor is followed by random last layer parameters for uncertainty reasoning. Since existing low-rank adapters (LoRA) for PEFT have limited expressiveness due to rank collapse, we address this with Polar-decomposed Low-rank Adapter Representation (PoLAR), an orthogonalized parameterization paired with Riemannian optimization to enable more stable and expressive adaptation. Building on this PoLAR-BLL model, we leverage the variational (V) inference framework to put forth a scalable Bayesian fine-tuning approach which jointly seeks the PoLAR parameters and approximate posterior of the last layer parameters via alternating optimization. The resulting PoLAR-VBLL is a flexible framework that nicely integrates architecture-enhanced optimization with scalable Bayesian inference to endow LLMs with well-calibrated UQ. Our empirical results verify the effectiveness of PoLAR-VBLL in terms of generalization and uncertainty estimation on both in-distribution and out-of-distribution data for various common-sense reasoning tasks.
[LG-86] The limits of bio-molecular modeling with large language models: a cross-scale evaluation
链接: https://arxiv.org/abs/2604.03361
作者: Yaxin Xu,Yue Zhou,Tianyu Zhao,Fengwei An,Zhixiang Ren
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:The modeling of bio-molecular systems across molecular scales remains a central challenge in scientific research. Large language models (LLMs) are increasingly applied to bio-molecular discovery, yet systematic evaluation across multi-scale biological problems and rigorous assessment of their tool-augmented capabilities remain limited. We reveal a systematic gap between LLM performance and mechanistic understanding through the proposed cross-scale bio-molecular benchmark: BioMol-LLM-Bench, a unified framework comprising 26 downstream tasks that covers 4 distinct difficulty levels, and computational tools are integrated for a more comprehensive evaluation. Evaluation on 13 representative models reveals 4 main findings: chain-of-thought data provides limited benefit and may even reduce performance on biological tasks; hybrid mamba-attention architectures are more effective for long bio-molecular sequences; supervised fine-tuning improves specialization at the cost of generalization; and current LLMs perform well on classification tasks but remain weak on challenging regression tasks. Together, these findings provide practical guidance for future LLM-based modeling of molecular systems.
[LG-87] Hardware-Oriented Inference Complexity of Kolmogorov-Arnold Networks
链接: https://arxiv.org/abs/2604.03345
作者: Bilal Khalid,Pedro Freire,Sergei K. Turitsyn,Jaroslaw E. Prilepsky
类目: Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication
Abstract:Kolmogorov-Arnold Networks (KANs) have recently emerged as a powerful architecture for various machine learning applications. However, their unique structure raises significant concerns regarding their computational overhead. Existing studies primarily evaluate KAN complexity in terms of Floating-Point Operations (FLOPs) required for GPU-based training and inference. However, in many latency-sensitive and power-constrained deployment scenarios, such as neural network-driven non-linearity mitigation in optical communications or channel state estimation in wireless communications, training is performed offline and dedicated hardware accelerators are preferred over GPUs for inference. Recent hardware implementation studies report KAN complexity using platform-specific resource consumption metrics, such as Look-Up Tables, Flip-Flops, and Block RAMs. However, these metrics require a full hardware design and synthesis stage that limits their utility for early-stage architectural decisions and cross-platform comparisons. To address this, we derive generalized, platform-independent formulae for evaluating the hardware inference complexity of KANs in terms of Real Multiplications (RM), Bit Operations (BOP), and Number of Additions and Bit-Shifts (NABS). We extend our analysis across multiple KAN variants, including B-spline, Gaussian Radial Basis Function (GRBF), Chebyshev, and Fourier KANs. The proposed metrics can be computed directly from the network structure and enable a fair and straightforward inference complexity comparison between KAN and other neural network architectures.
[LG-88] NativeTernary: A Self-Delimiting Binary Encoding with Unary Run-Length Hierarchy Markers for Ternary Neural Network Weights, Structured Data, and General Computing Infrastructure
链接: https://arxiv.org/abs/2604.03336
作者: Maharshi Savdhariya
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 9 pages. Patent filed, Indian Patent Office, March 2026. C implementation forthcoming: this https URL . v2 planned with GGUF benchmarks. Keywords: ternary encoding, BitNet b1.58, 1-bit LLMs, ternary weights, GGUF, IoT compression, run-length encoding, embedded systems
Abstract:BitNet b1.58 (Ma et al., 2024) demonstrates that large language models can operate entirely on ternary weights {-1, 0, +1}, yet no native binary wire format exists for such models. NativeTernary closes this gap. We present NativeTernary, a binary encoding scheme that partitions the 2-bit pair space into three data symbols representing ternary values – either balanced {-1, 0, +1} or unsigned {0, 1, 2} – and a reserved structural delimiter. The central contribution is the use of unary run-length encoding to represent semantic hierarchy depth: a sequence of N consecutive delimiter pairs denotes a boundary of level N, encoding character, word, sentence, paragraph, and topic boundaries at cost 2, 4, 6, 8, and 10 bits respectively – proportional to boundary rarity. The choice of which 2-bit pair serves as the delimiter is a design parameter: 11 is the primary embodiment, offering simple OR-gate detection; 00 is an alternative embodiment optimised for ultra-low-power CMOS systems, minimising switching activity. All four bit-pair choices are covered by the patent claims. We present three encoding variants: (1) the primary scheme with 11 as sole delimiter; (2) a dual-starter variant where both 10 and 11 initiate distinct symbol namespaces; and (3) an analysis of unsigned versus balanced ternary data mappings. We describe a path toward ternary-native general computing infrastructure requiring no hardware changes, and outline applications spanning ternary neural network weight storage, hierarchical natural language encoding, edge computing, IoT and satellite telemetry, industrial sensors, automotive systems, medical devices, gaming, and financial tick data. The decoder is a 10-line stateless state machine resilient to bitstream corruption.
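The delimiter and unary run-length idea from the abstract can be illustrated in a few lines. This is a minimal sketch based only on the abstract: the assignment of the three data pairs to ternary values is our assumption (the paper only fixes 11 as the primary delimiter), and for simplicity the decoder uses boundary runs only as separators, discarding their levels.

```python
# Sketch of the NativeTernary idea: three 2-bit data symbols for ternary
# values, the pair '11' reserved as delimiter, and N consecutive delimiter
# pairs marking a level-N boundary (unary run-length encoding of depth).
# The mapping of {-1, 0, +1} onto {'00', '01', '10'} is our assumption.
SYM = {-1: "00", 0: "01", +1: "10"}
VAL = {v: k for k, v in SYM.items()}
DELIM = "11"

def encode(segments, level=1):
    """Encode a list of ternary runs, separated by level-N boundaries."""
    sep = DELIM * level                      # N delimiter pairs = level-N boundary
    return sep.join("".join(SYM[t] for t in seg) for seg in segments)

def decode(bits):
    """Stateless decode: walk 2-bit pairs, treating delimiter runs as boundaries."""
    segments, cur, run = [], [], 0
    for i in range(0, len(bits), 2):
        pair = bits[i:i + 2]
        if pair == DELIM:
            run += 1                         # inside a boundary of level `run`
        else:
            if run:                          # boundary just ended: close segment
                segments.append(cur)
                cur, run = [], 0
            cur.append(VAL[pair])
    segments.append(cur)
    return segments

words = [[+1, 0, -1], [0, 0, +1]]
enc = encode(words, level=2)                 # word boundary = two '11' pairs
print(enc)                                   # 1001001111010110
print(decode(enc) == words)                  # True
```

Because every boundary costs 2 bits per level, a character break (level 1) costs 2 bits while a topic break (level 5) costs 10, matching the cost schedule stated in the abstract.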
[LG-89] Apparent Age Estimation: Challenges and Outcomes
链接: https://arxiv.org/abs/2604.03335
作者: Justin Rainier Go,Lorenz Bernard Marqueses,Mikaella Kaye Martinez,John Kevin Patrick Sarmiento,Abien Fred Agarap
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: Accepted for oral presentation at Philippine Computing Science Congress 2026
Abstract:Apparent age estimation is a valuable tool for business personalization, yet current models frequently exhibit demographic biases. We review prior works on the DEX method by applying distribution learning techniques such as Mean-Variance Loss (MVL) and Adaptive Mean-Residue Loss (AMRL), and evaluate them in both accuracy and fairness. Using IMDB-WIKI, APPA-REAL, and FairFace, we demonstrate that while AMRL achieves state-of-the-art accuracy, trade-offs between precision and demographic equity persist. Despite clear age clustering in UMAP embeddings, our saliency maps indicate inconsistent feature focus across demographics, leading to significant performance degradation for Asian and African American populations. We argue that technical improvements alone are insufficient; accurate and fair apparent age estimation requires the integration of localized and diverse datasets, and strict adherence to fairness validation protocols.
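Mean-Variance Loss, one of the distribution-learning objectives evaluated above, penalizes both the bias of the predicted age distribution's mean and its spread around that mean. A minimal numpy sketch of the loss (the weighting coefficients here are illustrative defaults, not values from the paper):

```python
import numpy as np

def mean_variance_loss(probs, ages, y_true, lam_m=0.2, lam_v=0.05):
    """Mean-Variance Loss (MVL) for age estimation, numpy sketch.

    probs : (K,) softmax distribution over the K candidate ages
    ages  : (K,) candidate age values
    y_true: ground-truth age
    """
    mean = float(np.dot(probs, ages))                  # predicted mean age
    l_mean = 0.5 * (mean - y_true) ** 2                # penalize biased estimates
    l_var = float(np.dot(probs, (ages - mean) ** 2))   # penalize spread-out predictions
    return lam_m * l_mean + lam_v * l_var

ages = np.arange(0, 101)
probs = np.zeros(101)
probs[30] = 1.0                # perfectly confident, correct prediction
print(mean_variance_loss(probs, ages, 30.0))  # 0.0
```

In training, this term is typically added to the usual softmax cross-entropy, so the network is pushed toward sharp distributions centered on the true age rather than merely ranking the correct class first.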
[LG-90] InsightBoard: An Interactive Multi-Metric Visualization and Fairness Analysis Plugin for TensorBoard
链接: https://arxiv.org/abs/2604.03323
作者: Ray Zeyao Chen,Christan Grant
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:
Abstract:Modern machine learning systems deployed in safety-critical domains require visibility not only into aggregate performance but also into how training dynamics affect subgroup fairness over time. Existing training dashboards primarily support single-metric monitoring and offer limited support for examining relationships between heterogeneous metrics or diagnosing subgroup disparities during training. We present InsightBoard, an interactive TensorBoard plugin that integrates synchronized multi-metric visualization with slice-based fairness diagnostics in a unified interface. InsightBoard enables practitioners to jointly inspect training dynamics, performance metrics, and subgroup disparities through linked multi-view plots, correlation analysis, and standard group fairness indicators computed over user-defined slices. Through case studies with YOLOX on the BDD100k dataset, we demonstrate that models achieving strong aggregate performance can still exhibit substantial demographic and environmental disparities that remain hidden under conventional monitoring. By making fairness diagnostics available during training, InsightBoard supports earlier, more informed model inspection without modifying existing training pipelines or introducing additional data stores.
[LG-91] Computer Architecture’s AlphaZero Moment: Automated Discovery in an Encircled World
链接: https://arxiv.org/abs/2604.03312
作者: Karthikeyan Sankaralingam
类目: Hardware Architecture (cs.AR); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:
Abstract:The end of Moore’s Law and Dennard scaling has fundamentally changed the economics of computer architecture. With transistor scaling delivering diminishing returns, architectural innovation is now the primary - and perhaps only - remaining lever for performance improvement. However, we argue that human-driven architecture research is fundamentally ill-suited for this new era. The architectural design space is vast (effectively infinite for practical purposes), yet human teams explore perhaps 50-100 designs per generation, sampling less than 0.001% of possibilities. This approach worked during the abundance era when Moore’s Law provided a rising tide that lifted all designs. In the current scarcity paradigm, where every architecture must deliver 2X performance improvements using essentially the same transistor budget, systematic exploration becomes critical. We propose a concrete alternative: automated idea factories that generate and evaluate thousands of candidate architectures weekly through multi-tiered evaluation pipelines, learning from deployed telemetry data in a continuous feedback loop. Early results suggest that such systems can compress architectural design cycles from double-digit months to single-digit weeks by exploring orders of magnitude more candidates than any human team, and do it much faster. We predict that within 2 years, purely human-driven architecture research will be as obsolete as human chess players competing against engines.
[LG-92] ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs ISCA2026
链接: https://arxiv.org/abs/2604.03298
作者: Jinwu Yang,Jiaan Wu,Zedong Liu,Xinyang Ma,Hairui Zhao,Yida Gu,Yuanhong Huang,Xingchen Liu,Wenjing Huang,Zheng Wei,Jing Xing,Yili Ma,Qingyi Zhang,Baoyi An,Zhongzhe Hu,Shaoteng Liu,Xia Zhu,Jiaxun Lu,Guangming Tan,Dingwen Tao
类目: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Accepted by ISCA 2026, 15 pages, 13 figures, 7 tables
Abstract:The rapid scaling of Large Language Models presents significant challenges for their deployment and inference, particularly on resource-constrained specialized AI hardware accelerators such as Huawei’s Ascend NPUs, where weight data transfer has become a critical performance bottleneck. While lossless compression can preserve model accuracy and reduce data volume, existing lossless compression algorithms exhibit extremely low throughput when ported to the Ascend NPU architecture. In this paper, we propose ENEC, a novel lossless compression method specifically customized for AI model weights and optimized for Ascend Neural Processing Units. ENEC adopts a block-based fixed-length encoding scheme and incorporates a series of NPU-specific optimizations: bit-width quantization with hierarchical halving bit-packing, vectorized branch-free integer transformation, and dependency-decoupled intra-segment scan for efficient prefix-sum computation. Experimental results demonstrate that ENEC outperforms existing state-of-the-art NPU compressors in both compression ratio and throughput. Compared to leading GPU solutions, ENEC achieves a 3.43X higher throughput than DietGPU and a 1.12X better compression ratio than nvCOMP. By reducing weight transmission overhead, ENEC significantly improves end-to-end inference performance, achieving up to a 6.3X speedup. On Ascend NPUs, ENEC is the first open-source lossless compression algorithm for model weights that achieves performance comparable to state-of-the-art GPU compressors, offering an effective solution for deploying large-scale AI models.
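The block-based fixed-length encoding with bit-width quantization that ENEC builds on can be illustrated generically: each block stores one bit-width plus all of its values at that width. The sketch below is our simplified reading of that idea, not ENEC's actual on-NPU format, which adds hierarchical halving bit-packing and vectorized, branch-free kernels.

```python
def pack_block(values):
    """Block-wise fixed-length packing in the spirit of ENEC's scheme:
    one shared bit-width per block, chosen as the widest value present.
    (Generic illustration only; assumes non-negative integer symbols.)"""
    width = max(v.bit_length() for v in values) or 1
    bits = "".join(format(v, f"0{width}b") for v in values)
    return width, bits

def unpack_block(width, bits):
    """Decode a fixed-length block: every `width` bits is one value."""
    return [int(bits[i:i + width], 2) for i in range(0, len(bits), width)]

vals = [3, 0, 5, 1]
w, b = pack_block(vals)
print(w, b)                        # 3 011000101001
print(unpack_block(w, b) == vals)  # True
```

Fixed-length packing is what makes decoding branch-free and parallelizable: every value's position is known from the width alone, which is the property a vectorized NPU decoder needs.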
[LG-93] DRAFT: Task Decoupled Latent Reasoning for Agent Safety
链接: https://arxiv.org/abs/2604.03242
作者: Lin Wang,Junfeng Fang,Dan Zhang,Fei Shen,Xiang Wang,Tat-Seng Chua
类目: Machine Learning (cs.LG)
*备注:
Abstract:The advent of tool-using LLM agents shifts safety monitoring from output moderation to auditing long, noisy interaction trajectories, where risk-critical evidence is sparse, making standard binary supervision poorly suited for credit assignment. To address this, we propose DRAFT (Task Decoupled Latent Reasoning for Agent Safety), a latent reasoning framework that decouples safety judgment into two trainable stages: an Extractor that distills the full trajectory into a compact continuous latent draft, and a Reasoner that jointly attends to the draft and the original trajectory to predict safety. DRAFT avoids lossy explicit summarize-then-judge pipelines by performing evidence aggregation in latent space, enabling end-to-end differentiable training. On benchmarks including ASSEBench and R-Judge, DRAFT consistently outperforms strong baselines, improving accuracy from 63.27% (LoRA) to 91.18% averaged over benchmarks, and learns more separable representations. Ablations demonstrate a clear synergy between the Extractor and the Reasoner. Overall, DRAFT suggests that continuous latent reasoning prior to readout is a practical path to robust agent safety under long-context supervision with sparse evidence.
[LG-94] Integrating Artificial Intelligence Physics and Internet of Things: A Framework for Cultural Heritage Conservation
链接: https://arxiv.org/abs/2604.03233
作者: Carmine Valentino,Federico Pichi,Francesco Colace,Dajana Conte,Gianluigi Rozza
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:The conservation of cultural heritage increasingly relies on integrating technological innovation with domain expertise to ensure effective monitoring and predictive maintenance. This paper presents a novel framework to support the preservation of cultural assets, combining Internet of Things (IoT) and Artificial Intelligence (AI) technologies, enhanced with the physical knowledge of phenomena. The framework is structured into four functional layers that permit the analysis of 3D models of cultural assets and elaborate simulations based on the knowledge acquired from data and physics. A central component of the proposed framework consists of Scientific Machine Learning, particularly Physics-Informed Neural Networks (PINNs), which incorporate physical laws into deep learning models. To enhance computational efficiency, the framework also integrates Reduced Order Methods (ROMs), specifically Proper Orthogonal Decomposition (POD), and is also compatible with classical Finite Element (FE) methods. Additionally, it includes tools to automatically manage and process 3D digital replicas, enabling their direct use in simulations. The proposed approach offers three main contributions: a methodology for processing 3D models of cultural assets for reliable simulation; the application of PINNs to combine data-driven and physics-based approaches in cultural heritage conservation; and the integration of PINNs with ROMs to efficiently model degradation processes influenced by environmental and material parameters. The reproducible and open-access experimental phase exploits simulated scenarios on complex and real-life geometries to test the efficacy of the proposed framework in each of its key components, allowing the possibility of dealing with both direct and inverse problems. Code availability: this https URL
[LG-95] PINNs in PDE Constrained Optimal Control Problems: Direct vs Indirect Methods
链接: https://arxiv.org/abs/2604.04920
作者: Zhen Zhang,Shanqing Liu,Alessandro Alla,Jerome Darbon,George Em Karniadakis
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 8 pages, 3 figures
Abstract:We study physics-informed neural networks (PINNs) as numerical tools for the optimal control of semilinear partial differential equations. We first recall the classical direct and indirect viewpoints for optimal control of PDEs, and then present two PINN formulations: a direct formulation based on minimizing the objective under the state constraint, and an indirect formulation based on the first-order optimality system. For a class of semilinear parabolic equations, we derive the state equation, the adjoint equation, and the stationarity condition in a form consistent with continuous-time Pontryagin-type optimality conditions. We then specialize the framework to an Allen-Cahn control problem and compare three numerical approaches: (i) a discretize-then-optimize adjoint method, (ii) a direct PINN, and (iii) an indirect PINN. Numerical results show that the PINN parameterization has an implicit regularizing effect, in the sense that it tends to produce smoother control profiles. They also indicate that the indirect PINN more faithfully preserves the PDE constraint and optimality structure and yields a more accurate neural approximation than the direct PINN.
[LG-96] A Robust SINDy Autoencoder for Noisy Dynamical System Identification
链接: https://arxiv.org/abs/2604.04829
作者: Kairui Ding
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 27 pages
Abstract:Sparse identification of nonlinear dynamics (SINDy) has been widely used to discover the governing equations of a dynamical system from data. It uses sparse regression techniques to identify parsimonious models of unknown systems from a library of candidate functions. Therefore, it relies on the assumption that the dynamics are sparsely represented in the coordinate system used. To address this limitation, one seeks a coordinate transformation that provides reduced coordinates capable of reconstructing the original system. Recently, SINDy autoencoders have extended this idea by combining sparse model discovery with autoencoder architectures to learn simplified latent coordinates together with parsimonious governing equations. A central challenge in this framework is robustness to measurement error. Inspired by noise-separating neural network structures, we incorporate a noise-separation module into the SINDy autoencoder architecture, thereby improving robustness and enabling more reliable identification of noisy dynamical systems. Numerical experiments on the Lorenz system show that the proposed method recovers interpretable latent dynamics and accurately estimates the measurement noise from noisy observations.
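The sparse regression step at the heart of SINDy is commonly implemented as sequentially thresholded least squares (STLSQ): solve least squares, zero out small coefficients, and refit on the surviving library terms. A minimal numpy sketch of that step, independent of the paper's autoencoder and noise-separation modules:

```python
import numpy as np

def stlsq(theta, dxdt, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares, the sparse regression
    at the core of SINDy. theta is the (samples x terms) candidate
    library, dxdt the (samples x states) time derivatives."""
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        for k in range(dxdt.shape[1]):           # refit each state dimension
            big = ~small[:, k]
            if big.any():
                xi[big, k] = np.linalg.lstsq(theta[:, big], dxdt[:, k],
                                             rcond=None)[0]
    return xi

# Toy example: recover dx/dt = -2x from data, given a [1, x, x^2] library.
x = np.linspace(-1, 1, 50).reshape(-1, 1)
theta = np.hstack([np.ones_like(x), x, x**2])
dxdt = -2.0 * x
xi = stlsq(theta, dxdt, threshold=0.5)
print(np.round(xi.ravel(), 3))   # only the linear term survives, coefficient ≈ -2
```

In the SINDy autoencoder setting, the same thresholded regression is applied to latent coordinates and their derivatives produced by the encoder, which is where robustness to measurement noise becomes critical.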
[LG-97] Hybrid Fourier Neural Operator for Surrogate Modeling of Laser Processing with a Quantum-Circuit Mixer
链接: https://arxiv.org/abs/2604.04828
作者: Mateusz Papierz,Asel Sagingalieva,Alix Benoit,Toni Ivas,Elia Iseli,Alexey Melnikov
类目: Quantum Physics (quant-ph); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 24 pages, 10 figures, 6 tables
Abstract:Data-driven surrogates can replace expensive multiphysics solvers for parametric PDEs, yet building compact, accurate neural operators for three-dimensional problems remains challenging: in Fourier Neural Operators, dense mode-wise spectral channel mixing scales linearly with the number of retained Fourier modes, inflating parameter counts and limiting real-time deployability. We introduce HQ-LP-FNO, a hybrid quantum-classical FNO that replaces a configurable fraction of these dense spectral blocks with a compact, mode-shared variational quantum circuit mixer whose parameter count is independent of the Fourier mode budget. A parameter-matched classical bottleneck control is co-designed to provide a rigorous evaluation framework. Evaluated on three-dimensional surrogate modeling of high-energy laser processing, coupling heat transfer, melt-pool convection, free-surface deformation, and phase change, HQ-LP-FNO reduces trainable parameters by 15.6% relative to a classical baseline while lowering phase-fraction mean absolute error by 26% and relative temperature MAE from 2.89% to 2.56%. A sweep over the quantum-channel budget reveals that a moderate VQC allocation yields the best temperature metrics across all tested configurations, including the fully classical baseline, pointing toward an optimal classical-quantum partitioning. The ablation confirms that mode-shared mixing, naturally implemented by the VQC through its compact circuit structure, is the dominant contributor to these improvements. A noisy-simulator study under backend-calibrated noise from ibm-torino confirms numerical stability of the quantum mixer across the tested shot range. These results demonstrate that VQC-based parameter-efficient spectral mixing can improve neural operator surrogates for complex multiphysics problems and establish a controlled evaluation protocol for hybrid quantum operator learning in practice.
[LG-98] A Muon-Accelerated Algorithm for Low Separation Rank Tensor Generalized Linear Models
链接: https://arxiv.org/abs/2604.04726
作者: Xiao Liang,Shuang Li
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Tensor-valued data arise naturally in multidimensional signal and imaging problems, such as biomedical imaging. When incorporated into generalized linear models (GLMs), naive vectorization can destroy their multi-way structure and lead to high-dimensional, ill-posed estimation. To address this challenge, Low Separation Rank (LSR) decompositions reduce model complexity by imposing low-rank multilinear structure on the coefficient tensor. A representative approach for estimating LSR-based tensor GLMs (LSR-TGLMs) is the Low Separation Rank Tensor Regression (LSRTR) algorithm, which adopts block coordinate descent and enforces orthogonality of the factor matrices through repeated QR-based projections. However, the repeated projection steps can be computationally demanding and can slow convergence. Motivated by the need for scalable estimation and classification from such data, we propose LSRTR-M, which incorporates Muon (MomentUm Orthogonalized by Newton-Schulz) updates into the LSRTR framework. Specifically, LSRTR-M preserves the original block coordinate scheme while replacing the projection-based factor updates with Muon steps. Across synthetic linear, logistic, and Poisson LSR-TGLMs, LSRTR-M converges faster in both iteration count and wall-clock time, while achieving lower normalized estimation and prediction errors. On the Vessel MNIST 3D task, it further improves computational efficiency while maintaining competitive classification performance.
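The QR-based orthogonality projection that the abstract says LSRTR repeats at each iteration can be sketched in a few lines; the matrix sizes and sign convention below are illustrative, not taken from the paper.

```python
import numpy as np

# Sketch: project an unconstrained factor-matrix update onto the set of
# matrices with orthonormal columns (the Stiefel manifold) via QR.
rng = np.random.default_rng(0)
M = rng.normal(size=(6, 3))          # unconstrained factor update

Q, R = np.linalg.qr(M)
# Fix column signs so the projection is deterministic (a common convention).
Q = Q * np.sign(np.diag(R))

assert Q.shape == (6, 3)
assert np.allclose(Q.T @ Q, np.eye(3), atol=1e-10)   # orthonormal columns
```

Repeating this projection for every factor matrix at every iteration is exactly the per-step cost that, per the abstract, motivates replacing it with Muon updates.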
[LG-99] Towards protein folding pathways by reconstructing protein residue networks with a policy-driven model
链接: https://arxiv.org/abs/2604.04677
作者: Susan Khor
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注: 8 pages, 5 figures, 3 tables
Abstract:A method that reconstructs protein residue networks using suitable node selection and edge recovery policies produced numerical observations that correlate strongly (Pearson’s correlation coefficient -0.83) with published folding rates for 52 two-state folders and 21 multi-state folders; correlations are also strong at the fold-family level. These results were obtained serendipitously with the ND model, which was introduced previously, but is here extended with policies that dictate actions according to feature states. This result points to the importance of both the starting search point and the prevailing condition (random seed) for the quick success of policy search by a simple hill-climber. The two conditions, suitable policies and random seed, which (evidenced by the strong correlation statistic) set up a conducive environment for modelling protein folding within ND, could be compared to appropriate physiological conditions required by proteins to fold naturally. Of interest is an examination of the sequence of restored edges for potential as plausible protein folding pathways. Towards this end, trajectory data is collected for analysis and further model evaluation and development.
[LG-100] Minimaxity and Admissibility of Bayesian Neural Networks
链接: https://arxiv.org/abs/2604.04673
作者: Daniel Andrew Coulson,Martin T. Wells
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 95 pages and 6 figures
Abstract:Bayesian neural networks (BNNs) offer a natural probabilistic formulation for inference in deep learning models. Despite their popularity, their optimality has received limited attention through the lens of statistical decision theory. In this paper, we study decision rules induced by deep, fully connected feedforward ReLU BNNs in the normal location model under quadratic loss. We show that, for fixed prior scales, the induced Bayes decision rule is not minimax. We then propose a hyperprior on the effective output variance of the BNN prior that yields a superharmonic square-root marginal density, establishing that the resulting decision rule is simultaneously admissible and minimax. We further extend these results from the quadratic loss setting to the predictive density estimation problem with Kullback–Leibler loss. Finally, we validate our theoretical findings numerically through simulation.
[LG-101] Interpretation of Crystal Energy Landscapes with Kolmogorov-Arnold Networks
链接: https://arxiv.org/abs/2604.04636
作者: Gen Zu,Ning Mao,Claudia Felser,Yang Zhang
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:
Abstract:Characterizing crystalline energy landscapes is essential to predicting thermodynamic stability, electronic structure, and functional behavior. While machine learning (ML) enables rapid property predictions, the “black-box” nature of most models limits their utility for generating new scientific insights. Here, we introduce Kolmogorov-Arnold Networks (KANs) as an interpretable framework to bridge this gap. Unlike conventional neural networks with fixed activation functions, KANs employ learnable functions that reveal underlying physical relationships. We developed the Element-Weighted KAN, a composition-only model that achieves state-of-the-art accuracy in predicting formation energy, band gap, and work function across large-scale datasets. Crucially, without any explicit physical constraints, KANs uncover interpretable chemical trends aligned with the periodic table and quantum mechanical principles through embedding analysis, correlation studies, and principal component analysis. These results demonstrate that KANs provide a powerful framework with high predictive performance and scientific interpretability, establishing a new paradigm for transparent, chemistry-based materials informatics.
[LG-102] Noisy Nonreciprocal Pairwise Comparisons: Scale Variation Noise Calibration and Admissible Ranking Regions
链接: https://arxiv.org/abs/2604.04588
作者: Jean-Pierre Magnot
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
*备注:
Abstract:Pairwise comparisons are widely used in decision analysis, preference modeling, and evaluation problems. In many practical situations, the observed comparison matrix is not reciprocal. This lack of reciprocity is often treated as a defect to be corrected immediately. In this article, we adopt a different point of view: part of the nonreciprocity may reflect a genuine variation in the evaluation scale, while another part is due to random perturbations. We introduce an additive model in which the unknown underlying comparison matrix is consistent but not necessarily reciprocal. The reciprocal component carries the global ranking information, whereas the symmetric component describes possible scale variation. Around this structured matrix, we add a random perturbation and show how to estimate the noise level, assess whether the scale variation remains moderate, and assign probabilities to admissible ranking regions in the sense of strict ranking by pairwise comparisons. We also compare this approach with the brutal projection onto reciprocal matrices, which suppresses all symmetric information at once. The Gaussian perturbation model is used here not because human decisions are exactly Gaussian, but because observed judgment errors often result from the accumulation of many small effects. In such a context, the central limit principle provides a natural heuristic justification for Gaussian noise. This makes it possible to derive explicit estimators and probability assessments while keeping the model interpretable for decision problems.
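The additive split the abstract describes — a reciprocal (antisymmetric) component carrying ranking information and a symmetric component carrying scale variation — can be sketched directly; the comparison matrix and the row-mean ranking score below are illustrative, not from the paper.

```python
import numpy as np

# Illustrative additive pairwise-comparison matrix: A[i, j] is the
# (log-scale) score of item i over item j; it need not be reciprocal.
A = np.array([[ 0.0,  1.2,  2.1],
              [-0.8,  0.0,  1.0],
              [-1.9, -1.1,  0.0]])

R = (A - A.T) / 2.0   # reciprocal (antisymmetric) component: ranking signal
S = (A + A.T) / 2.0   # symmetric component: scale variation plus noise

assert np.allclose(R + S, A)    # the split is exact
assert np.allclose(R, -R.T)     # reciprocal: R[i, j] = -R[j, i]
assert np.allclose(S, S.T)      # symmetric part

# A simple ranking score from the reciprocal part: row means.
order = np.argsort(-R.mean(axis=1))
print(order)  # items from strongest to weakest: [0 1 2]
```

Dropping `S` entirely corresponds to the "brutal projection onto reciprocal matrices" the abstract contrasts with, which discards all symmetric information at once.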
[LG-103] Generative Modeling under Non-Monotonic MAR Missingness via Approximate Wasserstein Gradient Flows
链接: https://arxiv.org/abs/2604.04567
作者: Gitte Kremling,Jeffrey Näf,Johannes Lederer
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:The prevalence of missing values in data science poses a substantial risk to any further analyses. Despite a wealth of research, principled nonparametric methods to deal with general non-monotone missingness are still scarce. Instead, ad-hoc imputation methods are often used, for which it remains unclear whether the correct distribution can be recovered. In this paper, we propose FLOWGEM, a principled iterative method for generating a complete dataset from a dataset with values Missing at Random (MAR). Motivated by convergence results of the ignoring maximum likelihood estimator, our approach minimizes the expected Kullback-Leibler (KL) divergence between the observed data distribution and the distribution of the generated sample over different missingness patterns. To minimize the KL divergence, we employ a discretized particle evolution of the corresponding Wasserstein Gradient Flow, where the velocity field is approximated using a local linear estimator of the density ratio. This construction yields a data generation scheme that iteratively transports an initial particle ensemble toward the target distribution. Simulation studies and real-data benchmarks demonstrate that FLOWGEM achieves state-of-the-art performance across a range of settings, including the challenging case of non-monotonic MAR mechanisms. Together, these results position FLOWGEM as a principled and practical alternative to existing imputation methods, and a decisive step towards closing the gap between theoretical rigor and empirical performance.
[LG-104] Minimising Willmore Energy via Neural Flow
链接: https://arxiv.org/abs/2604.04321
作者: Edward Hirst,Henrique N. Sá Earp,Tomás S. R. Silva
类目: Differential Geometry (math.DG); Machine Learning (cs.LG)
*备注: 16+5 pages, 9 figures
Abstract:The neural Willmore flow of a closed oriented 2-surface in \mathbb{R}^3 is introduced as a natural evolution process to minimise the Willmore energy, which is the squared L^2-norm of mean curvature. Neural architectures are used to model maps from topological 2d domains to 3d Euclidean space, where the learning process minimises a PINN-style loss for the Willmore energy as a functional on the embedding. Training reproduces the expected round sphere for genus 0 surfaces, and the Clifford torus for genus 1 surfaces, respectively. Furthermore, the experiment in the genus 2 case provides a novel approach to search for minimal Willmore surfaces in this open problem.
[LG-105] CavMerge: Merging K-means Based on Local Log-Concavity
链接: https://arxiv.org/abs/2604.04302
作者: Zhili Qiao,Wangqian Ju,Peng Liu
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:
Abstract:K-means clustering, a classic and widely-used clustering technique, is known to exhibit suboptimal performance when applied to non-linearly separable data. Numerous adjustments and modifications have been proposed to address this issue, including methods that merge K-means results from a relatively large K to obtain a final cluster assignment. However, existing methods of this nature often encounter computational inefficiencies and require extensive hyperparameter tuning. Here we present CavMerge, a novel K-means merging algorithm that is intuitive, free of parameter tuning, and computationally efficient. Operating under minimal local distributional assumptions, our algorithm demonstrates strong consistency and rapid convergence guarantees. Empirical studies on various simulated and real datasets demonstrate that our method yields more reliable clusters in comparison to current state-of-the-art algorithms.
[LG-106] Avoiding Non-Integrable Beliefs in Expectation Propagation
链接: https://arxiv.org/abs/2604.04264
作者: Zilu Zhao,Jichao Chen,Dirk Slock
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP); Applications (stat.AP)
*备注:
Abstract:Expectation Propagation (EP) is a widely used iterative message-passing algorithm that decomposes a global inference problem into multiple local ones. It approximates marginal distributions as "beliefs" using intermediate functions called "messages". It has been shown that the stationary points of EP coincide with those of the corresponding constrained Bethe Free Energy (BFE) optimization problem. Therefore, EP is an iterative method for optimizing the constrained BFE. However, the iterative method may fall out of the feasible set of the BFE optimization problem, i.e., the beliefs may not be integrable. In most of the literature, the authors use various methods to keep all the messages integrable. In most Bayesian estimation problems, limiting the messages to be integrable shrinks the actual feasible set. Furthermore, in extreme cases where the factors are not integrable, making the messages themselves integrable is not enough to obtain integrable beliefs. In this paper, two EP frameworks are proposed to ensure that EP has integrable beliefs. Both methods allow non-integrable messages. We then investigate the signal recovery problem in the Generalized Linear Model (GLM) using our proposed methods.
[LG-107] Sharp asymptotic theory for Q-learning with LD2Z learning rate and its generalization
链接: https://arxiv.org/abs/2604.04218
作者: Soham Bonnerjee,Zhipeng Lou,Wei Biao Wu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:Despite the sustained popularity of Q-learning as a practical tool for policy determination, a majority of the relevant theoretical literature deals with either constant ( \eta_t \equiv \eta ) or polynomially decaying ( \eta_t = \eta t^{-\alpha} ) learning schedules. However, it is well known that these choices suffer from either persistent bias or prohibitively slow convergence. In contrast, the recently proposed linear decay to zero (LD2Z: \eta_{t,n} = \eta(1 - t/n) ) schedule has shown appreciable empirical performance, but its theoretical and statistical properties remain largely unexplored, especially in the Q-learning setting. We address this gap in the literature by first considering a general class of power-law decay to zero (PD2Z- \nu : \eta_{t,n} = \eta(1 - t/n)^\nu ). Proceeding step-by-step, we present a sharp non-asymptotic error bound for Q-learning with the PD2Z- \nu schedule, which is then used to derive a central limit theory for a new tail Polyak-Ruppert averaging estimator. Finally, we also provide a novel time-uniform Gaussian approximation (also known as a strong invariance principle) for the partial sum process of Q-learning iterates, which facilitates bootstrap-based inference. All our theoretical results are complemented by extensive numerical experiments. Beyond being new theoretical and statistical contributions to the Q-learning literature, our results definitively establish that LD2Z, and in general PD2Z- \nu , achieve a best-of-both-worlds property: they inherit the rapid decay from initialization (characteristic of constant step-sizes) while retaining the asymptotic convergence guarantees (characteristic of polynomially decaying schedules). This dual advantage explains the empirical success of LD2Z while providing practical guidelines for inference through our results.
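The schedules the abstract compares are simple to write down. The sketch below implements the PD2Z- \nu family (with \nu = 1 giving LD2Z) and illustrates it on a toy noisy-gradient problem; the quadratic objective, step-size values, and tail-averaging window are illustrative assumptions, not the paper's experiments.

```python
import numpy as np

def pd2z_schedule(eta, n, nu=1.0):
    """Power-law decay to zero: eta_{t,n} = eta * (1 - t/n)**nu.
    nu = 1 recovers the linear decay-to-zero (LD2Z) schedule."""
    return np.array([eta * (1.0 - t / n) ** nu for t in range(n)])

n = 1000
ld2z = pd2z_schedule(0.5, n, nu=1.0)
poly = 0.5 * np.arange(1, n + 1, dtype=float) ** -0.6  # eta * t^{-alpha}

assert np.isclose(ld2z[0], 0.5)   # starts at the full step size
assert ld2z[-1] < poly[-1]        # ends far smaller than polynomial decay

# Toy illustration: SGD on f(x) = (x - 3)^2 with noisy gradients,
# followed by a tail Polyak-Ruppert average over the second half.
rng = np.random.default_rng(0)
x, iterates = 0.0, []
for t in range(n):
    grad = 2.0 * (x - 3.0) + rng.normal(scale=0.5)
    x -= ld2z[t] * grad
    iterates.append(x)
tail_avg = np.mean(iterates[n // 2:])
print(tail_avg)  # close to the minimizer 3
```

The schedule keeps large steps early (fast escape from initialization) while driving the step size exactly to zero at the horizon, which is the intuition behind the best-of-both-worlds claim.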
[LG-108] Relay-Assisted Activation-Integrated SIM for Wireless Physical Neural Networks
链接: https://arxiv.org/abs/2604.04212
作者: Meng Hua,Deniz Gündüz
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:Wireless physical neural networks (WPNNs) have emerged as a promising paradigm for performing neural computation directly in the physical layer of wireless systems, offering low latency and high energy efficiency. However, most existing WPNN implementations primarily rely on linear physical transformations, which fundamentally limits their expressiveness. In this work, we propose a relay-assisted WPNN architecture based on activation-integrated stacked intelligent metasurfaces (AI-SIMs), where each passive metasurface layer enabling linear wave manipulation is cascaded with an activation metasurface layer that realizes nonlinear processing in the analog domain. By deliberately structuring multi-hop wireless propagation, the relay amplification matrix and the metasurface phase-shift matrices jointly act as trainable network weights, while hardware-implemented activation functions provide essential nonlinearity. Simulation results demonstrate that the proposed architecture achieves high classification accuracy, and that incorporating hardware-based activation functions significantly improves representational capability and performance compared with purely linear physical implementations.
[LG-109] Non-Equilibrium Stochastic Dynamics as a Unified Framework for Insight and Repetitive Learning: A Kramers Escape Approach to Continual Learning
链接: https://arxiv.org/abs/2604.04154
作者: Gunn Kim
类目: atistical Mechanics (cond-mat.stat-mech); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 12 pages, 4 figures
Abstract:Continual learning in artificial neural networks is fundamentally limited by the stability–plasticity dilemma: systems that retain prior knowledge tend to resist acquiring new knowledge, and vice versa. Existing approaches, most notably elastic weight consolidation (EWC), address this empirically without a physical account of why plasticity eventually collapses as tasks accumulate. Separately, the distinction between sudden insight and gradual skill acquisition through repetitive practice has lacked a unified theoretical description. Here, we show that both problems admit a common resolution within non-equilibrium statistical physics. We model the state of a learning system as a particle evolving under Langevin dynamics on a double-well energy landscape, with the noise amplitude governed by a time-dependent effective temperature T(t) . The probability density obeys a Fokker–Planck equation, and transitions between metastable states are governed by the Kramers escape rate k = (\omega_0 \omega_b / 2\pi) \, e^{-\Delta E/T} . We make two contributions. First, we identify the EWC penalty term as an energy barrier whose height grows linearly with the number of accumulated tasks, yielding an exponential collapse of the transition rate predicted analytically and confirmed numerically. Second, we show that insight and repetitive learning correspond to two qualitatively distinct temperature protocols within the same Fokker–Planck equation: insight events produce transient spikes in T(t) that drive rapid barrier crossing, whereas repetitive practice operates at a modestly elevated but fixed temperature, achieving transitions through sustained stochastic diffusion. These results establish a physically grounded framework for understanding plasticity and its failure in continual learning systems, and suggest principled design criteria for adaptive noise schedules in artificial intelligence.
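The escape-rate formula in the abstract, k = (\omega_0 \omega_b / 2\pi) e^{-\Delta E/T} , directly yields the exponential collapse of transitions as the barrier grows with task count. A minimal numeric sketch (the frequency, temperature, and barrier-growth values are illustrative assumptions):

```python
import math

def kramers_rate(omega0, omega_b, delta_E, T):
    """Kramers escape rate: k = (omega0 * omega_b / (2*pi)) * exp(-delta_E / T)."""
    return (omega0 * omega_b / (2.0 * math.pi)) * math.exp(-delta_E / T)

# Illustrative: barrier height growing linearly with the number of tasks,
# as the abstract argues for the accumulated EWC penalty.
omega0, omega_b, T = 1.0, 1.0, 0.5
for n_tasks in (1, 5, 10):
    k = kramers_rate(omega0, omega_b, delta_E=0.4 * n_tasks, T=T)
    print(n_tasks, k)

# Linear barrier growth gives exponential rate decay: adding delta to the
# barrier multiplies the rate by exp(-delta / T), it does not subtract.
ratio = kramers_rate(1, 1, 2.0, 0.5) / kramers_rate(1, 1, 1.0, 0.5)
assert math.isclose(ratio, math.exp(-2.0))
```

A transient spike in T (the "insight" protocol in the abstract) enters the same formula through the exponent, sharply raising the crossing rate while it lasts.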
[LG-110] Primal-Dual Methods for Nonsmooth Nonconvex Optimization with Orthogonality Constraints
链接: https://arxiv.org/abs/2604.04130
作者: Linglingzhi Zhu,Wentao Ding,Shangyuan Liu,Anthony Man-Cho So
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:Recent advancements in data science have significantly elevated the importance of orthogonally constrained optimization problems. The Riemannian approach has become a popular technique for addressing these problems due to the advantageous computational and analytical properties of the Stiefel manifold. Nonetheless, the interplay of nonsmoothness alongside orthogonality constraints introduces substantial challenges to current Riemannian methods, including scalability, parallelizability, complicated subproblems, and cumulative numerical errors that threaten feasibility. In this paper, we take a retraction-free primal-dual approach and propose a linearized smoothing augmented Lagrangian method specifically designed for nonsmooth and nonconvex optimization with orthogonality constraints. Our proposed method is single-loop and free of subproblem solving. We establish its iteration complexity of O(\epsilon^{-3}) for finding \epsilon-KKT points, matching the best-known results in the Riemannian optimization literature. Additionally, by invoking the standard Kurdyka-Lojasiewicz (KL) property, we demonstrate asymptotic sequential convergence of the proposed algorithm. Numerical experiments on both smooth and nonsmooth orthogonal constrained problems demonstrate the superior computational efficiency and scalability of the proposed method compared with state-of-the-art algorithms.
[LG-111] Topological Sensitivity in Connectome-Constrained Neural Networks
链接: https://arxiv.org/abs/2604.04033
作者: Nalin Dhiman
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: 17 pages, 5 figures
Abstract:Connectome-constrained neural networks are often evaluated against sparse random controls and then interpreted as evidence that biological graph topology improves learning efficiency. We revisit that claim in a controlled flyvis-based study using a Drosophila connectome, a naive self-loop-matched random graph, and a degree-preserving rewired null. Under weak controls, in which both models were recovered from a connectome-trained checkpoint and the null matched only global graph counts, the connectome appeared substantially better in early loss, mean activity, and runtime. That picture changed under stricter controls. Training both graphs from a shared random initialization removed the early loss advantage, and replacing the naive null by a degree-preserving null removed the apparent activity advantage. A five-sample degree-preserving ensemble and a pre-training activity-scale diagnostic further strengthened this revised interpretation. We also report a descriptive mechanism analysis of the earlier weak-control comparison, but we treat it as behavioral characterization rather than proof of causal superiority. We show that previously reported topology advantages in connectome-constrained neural networks can arise from initialization and null-model confounds, and largely disappear under fair from-scratch initialization and degree-preserving controls.
[LG-112] Nearly Optimal Best Arm Identification for Semiparametric Bandits AISTATS2026
链接: https://arxiv.org/abs/2604.03969
作者: Seok-Jin Kim
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: To appear at AISTATS 2026
Abstract:We study fixed-confidence Best Arm Identification (BAI) in semiparametric bandits, where rewards are linear in arm features plus an unknown additive baseline shift. Unlike linear-bandit BAI, this setting requires orthogonalized regression, and its instance-optimal sample complexity has remained open. For the transductive setting, we establish an attainable instance-dependent lower bound characterized by the corresponding linear-bandit complexity on shifted features. We then propose a computationally efficient phase-elimination algorithm based on a new XY-design for orthogonalized regression. Our analysis yields a nearly optimal high-probability sample-complexity upper bound, up to log factors and an additive d^2 term, and experiments on synthetic instances and the Jester dataset show clear gains over prior baselines.
[LG-113] Fused Multinomial Logistic Regression Utilizing Summary-Level External Machine-learning Information
链接: https://arxiv.org/abs/2604.03939
作者: Chi-Shian Dai,Jun Shao
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 24 pages, 2 figures
Abstract:In many modern applications, a carefully designed primary study provides individual-level data for interpretable modeling, while summary-level external information is available through black-box, efficient, and nonparametric machine-learning predictions. Although summary-level external information has been studied in the data integration literature, there is limited methodology for leveraging external nonparametric machine-learning predictions to improve statistical inference in the primary study. We propose a general empirical-likelihood framework that incorporates external predictions through moment constraints. An advantage of nonparametric machine-learning prediction is that it induces a rich class of valid moment restrictions that remain robust to covariate shift under a mild overlap condition without requiring explicit density-ratio modeling. We focus on multinomial logistic regression as the primary model and address common data-quality issues in external sources, including coarsened outcomes, partially observed covariates, covariate shift, and heterogeneity in generating mechanisms known as concept shift. We establish large-sample properties of the resulting fused estimator, including consistency and asymptotic normality under regularity conditions. Moreover, we provide mild sufficient conditions under which incorporating external predictions delivers a strict efficiency gain relative to the primary-only estimator. Simulation studies and an application to multiclass blood-pressure classification using the National Health and Nutrition Examination Survey illustrate the proposed framework.
[LG-114] Biconvex Biclustering
链接: https://arxiv.org/abs/2604.03936
作者: Sam Rosen,Eric C. Chi,Jason Xu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 34 pages, 5 figures
Abstract:This article proposes a biconvex modification to convex biclustering in order to improve its performance in high-dimensional settings. In contrast to heuristics that discard a subset of noisy features a priori, our method jointly learns and accordingly weighs informative features while discovering biclusters. Moreover, the method is adaptive to the data, and is accompanied by an efficient algorithm based on proximal alternating minimization, complete with detailed guidance on hyperparameter tuning and efficient solutions to optimization subproblems. These contributions are theoretically grounded; we establish finite-sample bounds on the objective function under sub-Gaussian errors, and generalize these guarantees to cases where input affinities need not be uniform. Extensive simulation results reveal our method consistently recovers underlying biclusters while weighing and selecting features appropriately, outperforming peer methods. An application to a gene microarray dataset of lymphoma samples recovers biclusters matching an underlying classification, while giving additional interpretation to the mRNA samples via the column groupings and fitted weights.
[LG-115] PhaseFlow4D: Physically Constrained 4D Beam Reconstruction via Feedback-Guided Latent Diffusion
链接: https://arxiv.org/abs/2604.03885
作者: Alexander Scheinker,Alexander Plastun,Peter Ostroumov
类目: Accelerator Physics (physics.acc-ph); Machine Learning (cs.LG)
*备注:
Abstract:We address the problem of recovering a time-varying 4D distribution from a sparse sequence of 2D projections - analogous to novel-view synthesis from sparse cameras, but applied to the 4D transverse phase space density \rho(x,p_x,y,p_y) of charged particle beams. Direct single shot measurement of this high-dimensional distribution is physically impossible in real particle accelerator systems; only limited 1D or 2D projections are accessible. We propose PhaseFlow4D, a feedback-guided latent diffusion model that reconstructs and tracks the full 4D phase space from incomplete 2D observations alone, with built-in hard physics constraints. Our core technical contribution is a 4D VAE whose decoder generates the full 4D phase space tensor, from which 2D projections are analytically computed and compared against 2D beam measurements. This projection-consistency constraint guarantees physical correctness by construction - not as a soft penalty, but as an architectural prior. An adaptive feedback loop then continuously tunes the conditioning vector of the latent diffusion model to track time-varying distributions online without retraining. We validate on multi-particle simulations of heavy-ion beams at the Facility for Rare Isotope Beams (FRIB), where full physics simulations require about 6 hours on a 100-core HPC system. PhaseFlow4D achieves accurate 4D reconstructions roughly 11,000 times faster while faithfully tracking distribution shifts under time-varying source conditions - demonstrating that principled generative reconstruction under incomplete observations transfers robustly beyond visual domains.
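The projection-consistency constraint described above — 2D projections computed analytically from the decoder's 4D output — amounts to marginalizing the density over the remaining two coordinates. A minimal numpy sketch; the grid size, axis ordering, and mean-squared-error loss are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Toy 4D phase-space density rho(x, p_x, y, p_y) on a small grid,
# standing in for the 4D tensor a decoder would produce.
rng = np.random.default_rng(1)
rho = rng.random((8, 8, 8, 8))
rho /= rho.sum()                      # normalize total mass to 1

# 2D projections: sum out the other two coordinates (axes 0..3 assumed
# to be x, p_x, y, p_y in that order).
proj_x_px = rho.sum(axis=(2, 3))      # (x, p_x) projection
proj_x_y  = rho.sum(axis=(1, 3))      # (x, y) projection

assert proj_x_px.shape == (8, 8)
assert np.isclose(proj_x_px.sum(), 1.0)   # projection preserves total mass
assert np.isclose(proj_x_y.sum(), 1.0)

# A projection-consistency loss against a measured 2D image could then
# be, e.g., the mean squared error between projection and measurement.
meas = proj_x_px + 0.01 * rng.normal(size=(8, 8))
loss = np.mean((proj_x_px - meas) ** 2)
print(loss)
```

Because the projections are exact sums over the generated tensor, any 4D output that matches all measured projections is consistent by construction, which is the sense in which the constraint acts as an architectural prior rather than a soft penalty.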
[LG-116] New insights into Elo algorithm for practitioners and statisticians
链接: https://arxiv.org/abs/2604.03840
作者: Leszek Szczecinski
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:
Abstract:This work reconciles two perspectives on the Elo ranking that coexist in the literature: the practitioner’s view as a heuristic feedback rule, and the statistician’s view as online maximum likelihood estimation via stochastic gradient ascent. Both perspectives coincide exactly in the binary case (iff the expected score is the logistic function). However, estimation noise forces a principled decoupling between the model used for ranking and the model used for prediction: the effective scale and home-field advantage parameter must be adjusted to account for the noise. We provide both closed-form corrections and a data-driven identification procedure. For multilevel outcomes, an exact relationship exists when outcome scores are uniformly spaced, but approximations are preferred in general: they account for estimation noise and better fit the data. The decoupled approach substantially outperforms the conventional one that reuses the ranking model for prediction, and serves as a diagnostic of convergence status. Applied to six years of FIFA men’s ranking, we find that the ranking had not converged for the vast majority of national teams. The paper is written in a semi-tutorial style accessible to practitioners, with all key results accompanied by closed-form expressions and numerical examples.
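The binary Elo update the abstract refers to — a heuristic feedback rule that coincides with an online maximum-likelihood step when the expected score is logistic — fits in a few lines. The base-10/400-point scale and the K-factor below are the standard Elo convention, used here only for illustration.

```python
import math

def expected_score(r_a, r_b, scale=400.0):
    """Logistic expected score of player A against player B (Elo convention)."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))

def elo_update(r_a, r_b, score_a, K=32.0):
    """One Elo step: rating moves by K times (observed - expected score).
    For the logistic model this is exactly a stochastic-gradient
    (online maximum-likelihood) update."""
    e_a = expected_score(r_a, r_b)
    return r_a + K * (score_a - e_a), r_b + K * ((1.0 - score_a) - (1.0 - e_a))

# Equal ratings, A wins: expected score is 0.5, so A gains exactly K/2.
ra, rb = elo_update(1500.0, 1500.0, 1.0)
assert math.isclose(ra, 1516.0) and math.isclose(rb, 1484.0)
assert math.isclose(ra + rb, 3000.0)   # total rating is conserved
print(ra, rb)  # 1516.0 1484.0
```

The paper's point is that the scale parameter used inside `expected_score` for *prediction* should be recalibrated to account for estimation noise, rather than reusing the ranking model's scale directly.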
[LG-117] Debiased Machine Learning for Conformal Prediction of Counterfactual Outcomes Under Runtime Confounding
链接: https://arxiv.org/abs/2604.03772
作者: Keith Barnatchez,Kevin P. Josey,Rachel C. Nethery,Giovanni Parmigiani
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Data-driven decision making frequently relies on predicting counterfactual outcomes. In practice, researchers commonly train counterfactual prediction models on a source dataset to inform decisions on a possibly separate target population. Conformal prediction has arisen as a popular method for producing assumption-lean prediction intervals for counterfactual outcomes that would arise under different treatment decisions in the target population of interest. However, existing methods require that every confounding factor of the treatment-outcome relationship used for training on the source data is additionally measured in the target population, risking miscoverage if important confounders are unmeasured in the target population. In this paper, we introduce a computationally efficient debiased machine learning framework that allows for valid prediction intervals when only a subset of confounders is measured in the target population, a common challenge referred to as runtime confounding. Grounded in semiparametric efficiency theory, we show the resulting prediction intervals achieve desired coverage rates with faster convergence compared to standard methods. Through numerous synthetic and semi-synthetic experiments, we demonstrate the utility of our proposed method.
[LG-118] The Generalised Kernel Covariance Measure
链接: https://arxiv.org/abs/2604.03721
作者: Luca Bergen,Dino Sejdinovic,Vanessa Didelez
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: Accepted for the 5th Conference on Causal Learning and Reasoning (CLeaR 2026)
Abstract:We consider the problem of conditional independence (CI) testing and adopt a kernel-based approach. Kernel-based CI tests embed variables in reproducing kernel Hilbert spaces, regress their embeddings on the conditioning variables, and test the resulting residuals for marginal independence. This approach yields tests that are sensitive to a broad range of conditional dependencies. Existing methods, however, rely heavily on kernel ridge regression, which is computationally expensive when properly tuned and yields poorly calibrated tests when left untuned, which limits their practical usefulness. We propose the Generalised Kernel Covariance Measure (GKCM), a regression-model-agnostic kernel-based CI test that accommodates a broad class of regression estimators. Building on the Generalised Hilbertian Covariance Measure framework (Lundborg et al., 2022), we characterise conditions under which GKCM satisfies uniform asymptotic level guarantees. In simulations, GKCM paired with tree-based regression models frequently outperforms state-of-the-art CI tests across a diverse range of data-generating processes, achieving better type I error control and competitive or superior power.
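The regress-then-test-residuals recipe sketched in the abstract descends from the scalar Generalised Covariance Measure of Shah and Peters; GKCM replaces the scalar residuals below with RKHS-valued ones. A toy sketch of the scalar statistic, which is asymptotically standard normal under conditional independence:

```python
import math

def gcm_statistic(res_x, res_y):
    """Scalar Generalised Covariance Measure: a normalised sample
    covariance of the residuals from regressing X on Z and Y on Z.
    Asymptotically N(0, 1) under conditional independence, so
    |T| > 1.96 rejects at the 5% level."""
    n = len(res_x)
    prods = [a * b for a, b in zip(res_x, res_y)]
    mean_p = sum(prods) / n
    var_p = sum((p - mean_p) ** 2 for p in prods) / n
    return math.sqrt(n) * mean_p / math.sqrt(var_p)
```

The kernel version inherits the same logic, but embedding the variables first makes the test sensitive to dependencies beyond linear covariance.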
[LG-119] Nonparametric Regression Discontinuity Designs with Survival Outcomes
链接: https://arxiv.org/abs/2604.03502
作者: Maximilian Schuessler,Erik Sverdrup,Robert Tibshirani,Stefan Wager
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Quasi-experimental evaluations are central for generating real-world causal evidence and complementing insights from randomized trials. The regression discontinuity design (RDD) is a quasi-experimental design that can be used to estimate the causal effect of treatments that are assigned based on a running variable crossing a threshold. Such threshold-based rules are ubiquitous in healthcare, where predictive and prognostic biomarkers frequently guide treatment decisions. However, standard RD estimators rely on complete outcome data, an assumption often violated in time-to-event analyses where censoring arises from loss to follow-up. To address this issue, we propose a nonparametric approach that leverages doubly robust censoring corrections and can be paired with existing RD estimators. Our approach can handle multiple survival endpoints, long follow-up times, and covariate-dependent variation in survival and censoring. We discuss the relevance of our approach across multiple areas of applications and demonstrate its usefulness through simulations and the prostate component of the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial where our new approach offers several advantages, including higher efficiency and robustness to misspecification. We have also developed an open-source software package, rdsurvival, for the R language.
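The starting point the paper extends is the standard sharp RD estimator for fully observed outcomes; its contribution is the doubly robust censoring correction layered on top. A minimal local linear version with a uniform kernel (the bandwidth and kernel choice here are illustrative assumptions, not the paper's):

```python
def local_linear_rd(x, y, cutoff=0.0, bandwidth=1.0):
    """Sharp RD estimate: fit separate linear regressions within the
    bandwidth on each side of the cutoff and take the difference of
    the two fitted values at the cutoff (uniform kernel)."""
    def fit_at_cutoff(pts):
        # Ordinary least squares for (slope, intercept); the intercept
        # is the prediction at the cutoff since x is re-centered.
        n = len(pts)
        sx = sum(p[0] for p in pts)
        sy = sum(p[1] for p in pts)
        sxx = sum(p[0] ** 2 for p in pts)
        sxy = sum(p[0] * p[1] for p in pts)
        slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
        return (sy - slope * sx) / n
    left = [(xi - cutoff, yi) for xi, yi in zip(x, y)
            if cutoff - bandwidth <= xi < cutoff]
    right = [(xi - cutoff, yi) for xi, yi in zip(x, y)
             if cutoff <= xi <= cutoff + bandwidth]
    return fit_at_cutoff(right) - fit_at_cutoff(left)
```

With censored survival outcomes, the y values above are not all observed, which is exactly where the paper's doubly robust correction enters.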
[LG-120] Recurrent Quantum Feature Maps for Reservoir Computing
链接: https://arxiv.org/abs/2604.03469
作者: Utkarsh Singh,Aaron Z. Goldberg,Christoph Simon,Khabat Heshami
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 11 pages, 13 figures
Abstract:Reservoir computing promises a fast method for handling large amounts of temporal data. This hinges on constructing a good reservoir–a dynamical system capable of transforming inputs into a high-dimensional representation while remembering properties of earlier data. In this work, we introduce a reservoir based on recurrent quantum feature maps where a fixed quantum circuit is reused to encode both current inputs and a classical feedback signal derived from previous outputs. We evaluate the model on the Mackey-Glass time-series prediction task using our recently introduced CP feature map, and find that it achieves lower mean squared error than standard classical baselines, including echo state networks and multilayer perceptrons, while maintaining compact circuit depth and qubit requirements. We further analyze memory capacity and show that the model effectively retains temporal information, consistent with its forecasting accuracy. Finally, we study the impact of realistic noise and find that performance is robust to several noise channels but remains sensitive to two-qubit gate errors, identifying a key limitation for near-term implementations.
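The feedback architecture, reusing one fixed map on the current input plus a signal derived from earlier states, can be illustrated with a purely classical toy reservoir (all weights, the dimension, and the leak rate below are illustrative; the paper's map is a quantum feature-map circuit, not this tanh network):

```python
import math
import random

def recurrent_feature_map(inputs, dim=8, leak=0.5, seed=1):
    """Classical toy analogue of a recurrent feature-map reservoir:
    the same fixed random nonlinear map is applied at every step to
    the current input together with feedback from the previous state."""
    rng = random.Random(seed)
    w_in = [rng.uniform(-1.0, 1.0) for _ in range(dim)]   # fixed input weights
    w_fb = [rng.uniform(-0.5, 0.5) for _ in range(dim)]   # fixed feedback weights
    state = [0.0] * dim
    states = []
    for u in inputs:
        state = [(1.0 - leak) * s + leak * math.tanh(w_in[i] * u + w_fb[i] * s)
                 for i, s in enumerate(state)]
        states.append(list(state))
    return states  # a readout (e.g. ridge regression) is trained on these
```

The key property is fading memory: two input histories that agree from some point on drive the reservoir to nearby states, which is what lets a simple trained readout forecast a chaotic series like Mackey-Glass.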
[LG-121] Physics-Constrained Adaptive Flow Matching for Climate Downscaling
链接: https://arxiv.org/abs/2604.03459
作者: Kevin Debeire,Aytaç Paçal,Pierre Gentine,Luis Medrano-Navarro,Nils Thuerey,Veronika Eyring
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: submitted
Abstract:Regional climate information at kilometer scales is essential for assessing the impacts of climate change, but generating it with global climate models is too expensive due to their high computational costs. Machine learning models offer a fast alternative, yet they often violate basic physical laws and degrade when applied to climates outside of their training distribution. We present Physics-Constrained Adaptive Flow Matching (PC-AFM), a generative downscaling model that addresses both problems. Building on the Adaptive Flow Matching (AFM) model of Fotiadis et al. (2025) as our baseline, we add soft conservation constraints that keep the downscaled output consistent with the large-scale input for precipitation and humidity, and use gradient surgery via the ConFIG algorithm to prevent these constraints from interfering with the generative objective. We train the model on Central Europe climate data, evaluate it on a 10× downscaling task (63km to 6.3km) over six variables (near-surface temperature, precipitation, specific humidity, surface pressure, and horizontal wind components) across a comprehensive set of metrics including bias, ensemble skill scores, power spectra, and conservation error, and test the generalization on two held-out climate regions. Within the training distribution, PC-AFM reduces conservation errors and improves ensemble calibration while matching the baseline on standard skill metrics. Outside the training distribution, where unconstrained models develop large systematic errors by extrapolating learned statistics, PC-AFM halves precipitation wet bias, reduces conservation error and improves extreme-quantile accuracy, all without any information about the target climate at inference time. These results indicate that physical consistency is a practical requirement for deploying generative downscaling models in real-world applications.
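The "gradient surgery" invoked above resolves interference between the generative loss and the conservation penalties. ConFIG's actual combination rule differs in detail, but the core move, removing the component of one gradient that conflicts with another, can be sketched PCGrad-style:

```python
def resolve_conflict(g_task, g_constraint):
    """PCGrad-style gradient surgery: if the constraint gradient points
    against the task gradient (negative inner product), drop its
    conflicting component before summing, so the physics penalty
    cannot undo the generative update."""
    dot = sum(t * c for t, c in zip(g_task, g_constraint))
    if dot < 0.0:
        norm_sq = sum(t * t for t in g_task)
        g_constraint = [c - (dot / norm_sq) * t
                        for t, c in zip(g_task, g_constraint)]
    return [t + c for t, c in zip(g_task, g_constraint)]
```

By construction the combined update never has a negative inner product with the task gradient, which is the sense in which the constraints "do not interfere" with the generative objective.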
[LG-122] Expressibility of neural quantum states: a Walsh-complexity perspective
链接: https://arxiv.org/abs/2604.03294
作者: Taige Wang
类目: Strongly Correlated Electrons (cond-mat.str-el); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 8 pages, 2 figures
Abstract:Neural quantum states are powerful variational wavefunctions, but it remains unclear which many-body states can be represented efficiently by modern additive architectures. We introduce Walsh complexity, a basis-dependent measure of how broadly a wavefunction is spread over parity patterns. States with an almost uniform Walsh spectrum require exponentially large Walsh complexity from any good approximant. We show that shallow additive feed-forward networks cannot generate such complexity in the tame regime, e.g. polynomial activations with subexponential parameter scaling. As a concrete example, we construct a simple dimerized state prepared by a single layer of disjoint controlled-Z gates. Although it has only short-range entanglement and a simple tensor-network description, its Walsh complexity is maximal. Full-cube fits across system size and depth are consistent with the complexity bound: for polynomial activations, successful fitting appears only once depth reaches a logarithmic scale in N, whereas activation saturation in tanh produces a sharp threshold-like jump already at depth 3. Walsh complexity therefore provides an expressibility axis complementary to entanglement and clarifies when depth becomes an essential resource for additive neural quantum states.
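For real amplitudes in the computational basis, the Walsh coefficients the abstract refers to are the Walsh-Hadamard transform of the amplitude vector; how evenly the mass spreads over parity patterns is what the complexity measure quantifies. A minimal in-place transform (the normalisation convention below is an assumption):

```python
def walsh_spectrum(amplitudes):
    """Walsh coefficients of a real n-qubit state: the normalised
    Walsh-Hadamard transform of its computational-basis amplitudes
    (length must be a power of two). A nearly flat magnitude profile
    indicates a broad spread over parity patterns."""
    a = list(amplitudes)
    n = len(a)
    h = 1
    while h < n:  # in-place fast Walsh-Hadamard transform, O(n log n)
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    norm = n ** 0.5
    return [v / norm for v in a]
```

For instance, a single basis state has a perfectly flat Walsh spectrum, while the uniform superposition concentrates all weight on the trivial parity pattern.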